Lucene Code Analysis

1. Lucene Source Code Analysis 1

First, a note on pronunciation: Lucene is pronounced "Loo-seen", as mentioned in Lucene in Action. I should also stress that I am using version 1.9. Do not be surprised that I picked such an old release; I did so in order to understand Lucene's core better. Anyone who has read the latest version knows that it is not simple for a beginner: it involves multithreading and design patterns, which I find quite challenging. Starting with an old version was also inspired by Zhao Jiong, author of the fully annotated Linux kernel book (Linux内核完全注释), who analyzed version 0.11 rather than the latest Linux kernel.

I will explain things by stepping through the code in a debugger. I suspect that, like me, you would get impatient after staring at analyzers for a long time, so I start with an example that does nothing meaningful (examples that do a little more are easy to find online):

    package forfun;

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class Test {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("E:\\a", new SimpleAnalyzer(), true);
        }
    }

IndexWriter is the most central class. Bloggers typically analyze every other package, and when only this core one is left they run out of steam. Let's look at its parameters first. The first is the path where the index is stored. The second is an Analyzer object, which filters and tokenizes the input. If the third is true, everything in the target directory is deleted and the index is rebuilt; if it is false, new content is appended to the existing index. Run the program and you will find a segments file in the directory you specified. While debugging, ignore the SimpleAnalyzer class for now. Here is the IndexWriter constructor:

    public IndexWriter(String path, Analyzer a, boolean create)
            throws IOException {
        this(FSDirectory.getDirectory(path, create), a, create, true);
    }

Here we meet a new class, FSDirectory:

    public static FSDirectory getDirectory(String path, boolean create)
            throws IOException {
        return getDirectory(new File(path), create);
    }

Now the getDirectory function it delegates to:

    public static FSDirectory getDirectory(File file, boolean create)
            throws IOException {
        file = new File(file.getCanonicalPath());
        FSDirectory dir;
        synchronized (DIRECTORIES) {
            dir = (FSDirectory) DIRECTORIES.get(file);
            if (dir == null) {
                try {
                    dir = (FSDirectory) IMPL.newInstance();
                } catch (Exception e) {
                    throw new RuntimeException("cannot load FSDirectory class: " + e.toString());
                }
                dir.init(file, create);
                DIRECTORIES.put(file, dir);
            } else if (create) {
                dir.create();
            }
        }
        synchronized (dir) {
            dir.refCount++;
        }
        return dir;
    }

DIRECTORIES is a Hashtable. Its comment describes it as a cache of directories that guarantees a single Directory per path, so that synchronizing on the Directory synchronizes access between readers and writers. (This cache of directories ensures that there is a unique Directory instance per path, so that synchronization on the Directory can be used to synchronize access between readers and writers.) There is not much more to explain: the function creates the directory and finally increments refCount. Now back to the IndexWriter constructor:

    private IndexWriter(Directory d, Analyzer a, final boolean create, boolean closeDir)
            throws IOException {
        this.closeDir = closeDir;
        directory = d;
        analyzer = a;

        Lock writeLock = directory.makeLock(IndexWriter.WRITE_LOCK_NAME);
        if (!writeLock.obtain(WRITE_LOCK_TIMEOUT)) // obtain write lock
            throw new IOException("Index locked for write: " + writeLock);
        this.writeLock = writeLock; // save it

        synchronized (directory) { // in- & inter-process sync
            new Lock.With(directory.makeLock(IndexWriter.COMMIT_LOCK_NAME),
                    COMMIT_LOCK_TIMEOUT) {
                public Object doBody() throws IOException {
                    if (create)
                        segmentInfos.write(directory);
                    else
                        segmentInfos.read(directory);
                    return null;
                }
            }.run();
        }
    }
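The caching done by getDirectory can be illustrated without Lucene. The following is a minimal sketch, not Lucene's code: DirectoryCacheSketch and Handle are hypothetical names, and a synchronized HashMap stands in for the Hashtable. It only shows the two properties described above: one shared instance per canonical path, and a reference count that grows with every request.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a per-path instance cache; not Lucene source.
class DirectoryCacheSketch {
    /** Stand-in for FSDirectory: remembers its path and how many users hold it. */
    static final class Handle {
        final File path;
        int refCount;

        Handle(File path) {
            this.path = path;
        }
    }

    private static final Map<File, Handle> CACHE = new HashMap<>();

    /** Returns the single Handle registered for the canonical form of the path. */
    static Handle get(String path) {
        File canonical;
        try {
            canonical = new File(new File(path).getCanonicalPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Handle h;
        synchronized (CACHE) {
            h = CACHE.get(canonical);
            if (h == null) {            // first request for this path
                h = new Handle(canonical);
                CACHE.put(canonical, h);
            }
        }
        synchronized (h) {
            h.refCount++;               // one more user of this instance
        }
        return h;
    }

    public static void main(String[] args) {
        Handle a = get("index");
        Handle b = get("./index");      // different spelling, same canonical path
        System.out.println("same instance: " + (a == b));
        System.out.println("refCount: " + a.refCount);
    }
}
```

Keying the map on the canonical path is what makes two spellings of the same directory resolve to one object, which is the precondition for using that object as a lock.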

What interests me here is the call to segmentInfos.write inside doBody. Let's step into that function:

    public final void write(Directory directory) throws IOException {
        IndexOutput output = directory.createOutput("segments.new");
        try {
            output.writeInt(FORMAT);       // write FORMAT
            output.writeLong(++version);   // every write changes the index
            output.writeInt(counter);      // write counter
            output.writeInt(size());       // write infos
            for (int i = 0; i < size(); i++) {
                SegmentInfo si = info(i);
                output.writeString(si.name);
                output.writeInt(si.docCount);
            }
        } finally {
            output.close();
        }

        // install new segment info
        directory.renameFile("segments.new", IndexFileNames.SEGMENTS);
    }

Start with the first call. It creates a file named segments.new (if you are debugging, you can watch the file appear) and returns an IndexOutput object that is used to write to it. We will not worry yet about what the values are for. FORMAT is -1. version is generated with System.currentTimeMillis() so that it yields a unique version number. counter is 0. SegmentInfos extends Vector, so size() is the number of elements it holds; since we have not indexed any documents, it is empty. The last statement renames segments.new to segments. You can open segments with UltraEdit or WinHex to inspect its contents. Here they are:

    FF FF FF FF 00 00 01 22 15 02 07 2A 00 00 00 00 00 00 00 00

writeInt writes four bytes and writeLong writes eight, so the four values are easy to pick out: FF FF FF FF is FORMAT (-1), the next eight bytes are version, and the two groups of four zero bytes are counter and size().
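To check that reading of the dump, the 20 bytes can be decoded in the same order in which write emits them. This is a small sketch using only java.io; SegmentsHeaderSketch is a hypothetical name, and it relies on DataInputStream reading big-endian values, which matches the high-byte-first order visible in the dump.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper that decodes the 20-byte segments dump listed above.
class SegmentsHeaderSketch {
    /** The bytes from the hex listing. */
    static final byte[] DUMP = {
        (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF,   // FORMAT
        0x00, 0x00, 0x01, 0x22, 0x15, 0x02, 0x07, 0x2A,       // version
        0x00, 0x00, 0x00, 0x00,                               // counter
        0x00, 0x00, 0x00, 0x00                                // size()
    };

    /** Returns {format, version, counter, size}, read in write() order. */
    static long[] decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            long format = in.readInt();    // 4 bytes
            long version = in.readLong();  // 8 bytes
            long counter = in.readInt();   // 4 bytes
            long size = in.readInt();      // 4 bytes
            return new long[] { format, version, counter, size };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] h = decode(DUMP);
        System.out.println("format=" + h[0] + " version=" + h[1]
                + " counter=" + h[2] + " size=" + h[3]);
    }
}
```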

2. Lucene Source Code Analysis 2

Last time I mentioned the Analyzer class and said that it filters and tokenizes the input. Now let's look at it in detail. In Lucene, an Analyzer is usually composed of a Tokenizer and TokenFilters. Tokenizer first:

    public abstract class Tokenizer extends TokenStream {
        /** The text source for this Tokenizer. */
        protected Reader input;

        /** Construct a tokenizer with null input. */
        protected Tokenizer() {}

        /** Construct a token stream processing the given input. */
        protected Tokenizer(Reader input) {
            this.input = input;
        }

        /** By default, closes the input Reader. */
        public void close() throws IOException {
            input.close();
        }
    }

It is only an abstract class, and it has no function that deserves our attention. Let's look at its parent class, TokenStream:

    public abstract class TokenStream {
        /** Returns the next token in the stream, or null at EOS. */
        public abstract Token next() throws IOException;

        /** Releases resources associated with this stream. */
        public void close() throws IOException {}
    }

So the function that matters lives in the parent class: next, which returns the next token in the stream. The other class mentioned above, TokenFilter, also extends TokenStream:

    public abstract class TokenFilter extends TokenStream {
        /** The source of tokens for this filter. */
        protected TokenStream input;

        /** Call TokenFilter(TokenStream) instead.
         * @deprecated */
        protected TokenFilter() {}

        /** Construct a token stream filtering the given input. */
        protected TokenFilter(TokenStream input) {
            this.input = input;
        }

        /** Close the input TokenStream. */
        public void close() throws IOException {
            input.close();
        }
    }

Next, another test class that still does nothing meaningful:

    package forfun;

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    import org.apache.lucene.analysis.LetterTokenizer;

    public class TokenTest {
        public static void main(String[] args) throws Exception {
            File f = new File("E:\\source.txt");
            BufferedReader reader = new BufferedReader(new FileReader(f));
            LetterTokenizer lt = new LetterTokenizer(reader);
            System.out.println(lt.next());
        }
    }

In source.txt I wrote hello world!, but you can write anything you like. I tokenize it with LetterTokenizer and print the first token. Let's see how the tokenizing works, that is, what next actually does.

    public class LetterTokenizer extends CharTokenizer {
        /** Construct a new LetterTokenizer. */
        public LetterTokenizer(Reader in) {
            super(in);
        }

        /** Collects only characters which satisfy
         * {@link Character#isLetter(char)}. */
        protected boolean isTokenChar(char c) {
            return Character.isLetter(c);
        }
    }

isTokenChar decides whether c is a letter. LetterTokenizer does not implement next itself, so we look in its parent class, CharTokenizer, and find it there:

    /** Returns the next token in the stream, or null at EOS. */
    public final Token next() throws IOException {
        int length = 0;
        int start = offset;
        while (true) {
            final char c;

            offset++;
            if (bufferIndex >= dataLen) {
                dataLen = input.read(ioBuffer);
                bufferIndex = 0;
            }
            if (dataLen == -1) {
                if (length > 0)
                    break;
                else
                    return null;
            } else
                c = ioBuffer[bufferIndex++];

            if (isTokenChar(c)) {                 // if it's a token char
                if (length == 0)                  // start of token
                    start = offset - 1;

                buffer[length++] = normalize(c);  // buffer it, normalized

                if (length == MAX_WORD_LEN)       // buffer overflow!
                    break;

            } else if (length > 0)                // at non-Letter w/ chars
                break;                            // return 'em
        }

        return new Token(new String(buffer, 0, length), start, start + length);
    }

It looks long but is simple, or at least simple to read. isTokenChar is the function we just saw in LetterTokenizer. The code uses start to record where a token begins and length to record how long it is; when it reaches a character that is not a token character after having collected some, it breaks out of the loop. We also meet a new class, Token, whose constructor here takes a string, a start offset and an end offset.
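The scanning logic of next can be tried out without Lucene. Below is a minimal sketch under stated simplifications: LetterSplitSketch and Tok are hypothetical names, the input is a String rather than a Reader, and the MAX_WORD_LEN cap and the normalize hook are left out. It keeps the same idea of remembering where a run of letters starts and emitting a token when the run ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified re-creation of letter-based tokenizing; not Lucene source.
class LetterSplitSketch {
    /** Minimal stand-in for Token: text plus [start, end) offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + text + "," + start + "," + end + ")";
        }
    }

    /** Splits the input into maximal runs of letters. */
    static List<Tok> tokenize(String input) {
        List<Tok> out = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        int start = 0;
        for (int offset = 0; offset < input.length(); offset++) {
            char c = input.charAt(offset);
            if (Character.isLetter(c)) {       // token char
                if (buf.length() == 0) {
                    start = offset;            // start of token
                }
                buf.append(c);
            } else if (buf.length() > 0) {     // non-letter ends the current token
                out.add(new Tok(buf.toString(), start, start + buf.length()));
                buf.setLength(0);
            }
        }
        if (buf.length() > 0) {                // token that runs to the end of input
            out.add(new Tok(buf.toString(), start, start + buf.length()));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello world!"));
    }
}
```

For the input hello world! this yields two tokens, hello at offsets 0 to 5 and world at offsets 6 to 11; the space and the exclamation mark only act as separators.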

Let's look at the source of Token:

    String termText;                  // the text of the term
    int startOffset;                  // start in source text
    int endOffset;                    // end in source text
    String type = "word";             // lexical type

    private int positionIncrement = 1;

    /** Constructs a Token with the given term text, and start & end offsets.
        The type defaults to "word." */
    public Token(String text, int start, int end) {
        termText = text;
        startOffset = start;
        endOffset = end;
    }

    /** Constructs a Token with the given text, start and end offsets, & type. */
    public Token(String text, int start, int end, String typ) {
        termText = text;
        startOffset = start;
        endOffset = end;
        type = typ;
    }

Matching these against the constructor we just used makes the meaning of the first three member variables clear. For type and positionIncrement I will quote someone else's explanation: type mainly indicates the text encoding and language type, where single means a single ASCII character, double means a non-ASCII character, and word is the default type that makes no such distinction. positionIncrement is the position increment relative to the previous token. It is used for cases such as pinyin, where the pinyin sits directly above the word it annotates: an increment of 0 places a token at the same position as the previous one.

3. Lucene Source Code Analysis 3

For TokenFilter, let's start with the simplest one, LowerCaseFilter. Its next function is:

    public final Token next() throws IOException {
        Token t = input.next();

        if (t == null)
            return null;

        t.termText = t.termText.toLowerCase();

        return t;
    }

Nothing exciting: it replaces the string in the Token object with its lowercase form. If you want something more interesting, look at PorterStemFilter; the method is also discussed in Introduction to Information Retrieval (Cambridge University Press), page 34. Now a TokenFilter that does slightly more, StopFilter:

    public static final Set makeStopSet(String[] stopWords) {
        return makeStopSet(stopWords, false);
    }

    public static final Set makeStopSet(String[] stopWords, boolean ignoreCase) {
        HashSet stopTable = new HashSet(stopWords.length);
        for (int i = 0; i < stopWords.length; i++)
            stopTable.add(ignoreCase ? stopWords[i].toLowerCase() : stopWords[i]);
        return stopTable;
    }

    public final Token next() throws IOException {
        // return the first non-stop word found
        for (Token token = input.next(); token != null; token = input.next()) {
            String termText = ignoreCase ? token.termText.toLowerCase() : token.termText;
            if (!stopWords.contains(termText))
                return token;
        }
        // reached EOS - return null
        return null;
    }

makeStopSet adds every word to be filtered out to stopTable (I am not sure why HashSet is not used here, although the listing above does create one). In next, the filter skips every token whose text is in that set. Now a simple Analyzer: the tokenStream function of StopAnalyzer.

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new StopFilter(new LowerCaseTokenizer(reader), stopWords);
    }

Remember the sentence "In Lucene, an Analyzer is usually composed of a Tokenizer and TokenFilters"? Here is the evidence: the text supplied by reader is tokenized first, and the tokens are then filtered. tokenStream is, of course, the function that gets called whenever text needs to be tokenized.
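The tokenizer-plus-filter composition can be sketched in plain Java. This is an illustration only, assuming simplified types: AnalyzerChainSketch and its Stream interface are hypothetical, tokens are bare strings rather than Token objects, and the source stage imitates a lowercasing letter tokenizer. The wrapping order mirrors the tokenStream function above, with the filter pulling tokens from the stage it wraps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical, simplified tokenizer/filter chain; not Lucene source.
class AnalyzerChainSketch {
    /** Mirrors the TokenStream contract: next() returns null at end of stream. */
    interface Stream {
        String next();
    }

    /** Source stage: emits lowercased runs of letters. */
    static Stream lowerCaseLetters(final String text) {
        return new Stream() {
            private int pos = 0;

            public String next() {
                while (pos < text.length() && !Character.isLetter(text.charAt(pos))) {
                    pos++;                               // skip separators
                }
                if (pos >= text.length()) {
                    return null;                         // end of stream
                }
                int start = pos;
                while (pos < text.length() && Character.isLetter(text.charAt(pos))) {
                    pos++;                               // consume one run of letters
                }
                return text.substring(start, pos).toLowerCase(Locale.ROOT);
            }
        };
    }

    /** Filter stage: returns the first token of its input that is not a stop word. */
    static Stream stopFilter(final Stream input, final Set<String> stopWords) {
        return new Stream() {
            public String next() {
                for (String t = input.next(); t != null; t = input.next()) {
                    if (!stopWords.contains(t)) {
                        return t;
                    }
                }
                return null;                             // reached end of stream
            }
        };
    }

    /** Builds the chain and drains it into a list. */
    static List<String> analyze(String text, Set<String> stopWords) {
        Stream s = stopFilter(lowerCaseLetters(text), stopWords);
        List<String> out = new ArrayList<>();
        for (String t = s.next(); t != null; t = s.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "and"));
        System.out.println(analyze("The quick fox and THE dog", stop));
    }
}
```

Because each stage only exposes next, further filters can be stacked by wrapping the chain again, which is the design the Analyzer classes rely on.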

4. Lucene Source Code Analysis 4

Now an example that is slightly more meaningful: we add "Hello World" to the index.

    package forfun;

    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class FieldTest {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("E:\\a", new SimpleAnalyzer(), true);
            writer.setUseCompoundFile(false);

            Document doc = new Document();
            Field name = new Field("TheField", "hello world",
                    Field.Store.YES, Field.Index.TOKENIZED);
            doc.add(name);
            writer.addDocument(doc);
            writer.close();
        }
    }

I will not say much about Document and Field; the meaning of their parameters is easy to look up. Let's go straight to the most important function, addDocument in IndexWriter:

    public void addDocument(Document doc, Analyzer analyzer) throws IOException {
        DocumentWriter dw = new DocumentWriter(ramDirectory, analyzer, this);
        dw.setInfoStream(infoStream);
        String segmentName = newSegmentName();
        dw.addDocument(segmentName, doc);
        synchronized (this) {
            segmentInfos.addElement(new SegmentInfo(segmentName, 1, ramDirectory));
            maybeMergeSegments();
        }
    }

It adds the document through a DocumentWriter, by calling dw.addDocument. Here segmentName is "_0". Let's look at that addDocument function:

    final void addDocument(String segment, Document doc) throws IOException {
        // write field names
        fieldInfos = new FieldInfos();
        fieldInfos.add(doc);
        fieldInfos.write(directory, segment + ".fnm");

        // write field values
        FieldsWriter fieldsWriter = new FieldsWriter(directory, segment, fieldInfos);
        try {
            fieldsWriter.addDocument(doc);
        } finally {
            fieldsWriter.close();
        }

        // invert doc into postingTable
        postingTable.clear();                          // clear postingTable
        fieldLengths = new int[fieldInfos.size()];     // init fieldLengths
        fieldPositions = new int[fieldInfos.size()];   // init fieldPositions
        fieldOffsets = new int[fieldInfos.size()];     // init fieldOffsets

        fieldBoosts = new float[fieldInfos.size()];    // init fieldBoosts
        Arrays.fill(fieldBoosts, doc.getBoost());

        invertDocument(doc);

        // sort postingTable into an array
        Posting[] postings = sortPostingTable();

        // write postings
        writePostings(postings, segment);

        // write norms of indexed fields
        writeNorms(segment);
    }

There is a new class, FieldInfos, which judging by its name stores information about Fields. Look at its add function:

    public void add(Document doc) {
        Enumeration fields = doc.fields();
        while (fields.hasMoreElements()) {
            Field field = (Field) fields.nextElement();
            add(field.name(), field.isIndexed(), field.isTermVectorStored(),
                    field.isStorePositionWithTermVector(),
                    field.isStoreOffsetWithTermVector(), field.getOmitNorms());
        }
    }

It does indeed record Field information. Digging deeper:

    public void add(String name, boolean isIndexed, boolean storeTermVector,
            boolean storePositionWithTermVector,
            boolean storeOffsetWithTermVector, boolean omitNorms) {
        FieldInfo fi = fieldInfo(name);
        if (fi == null) {
            addInternal(name, isIndexed, storeTermVector,
                    storePositionWithTermVector, storeOffsetWithTermVector,
                    omitNorms);
        } else {
            if (fi.isIndexed != isIndexed) {
                fi.isIndexed = true;           // once indexed, always index
            }
            if (fi.storeTermVector != storeTermVector) {
                fi.storeTermVector = true;     // once vector, always vector
            }
            if (fi.storePositionWithTermVector != storePositionWithTermVector) {
                fi.storePositionWithTermVector = true;
            }
            if (fi.storeOffsetWithTermVector != storeOffsetWithTermVector) {
                fi.storeOffsetWithTermVector = true;
            }
            if (fi.omitNorms != omitNorms) {
                fi.omitNorms = false;          // once norms are stored, always store
            }
        }
    }

Then into addInternal:

    private void addInternal(String name, boolean isIndexed,
            boolean storeTermVector, boolean storePositionWithTermVector,
            boolean storeOffsetWithTermVector, boolean omitNorms) {
        FieldInfo fi = new FieldInfo(name, isIndexed, byNumber.size(),
                storeTermVector, storePositionWithTermVector,
                storeOffsetWithTermVector, omitNorms);
        byNumber.add(fi);
        byName.put(name, fi);
    }

Back to the statement that follows fieldInfos.add(doc), the call to write:

    public void write(Directory d, String name) throws IOException {
        IndexOutput output = d.createOutput(name);
        try {
            write(output);
        } finally {
            output.close();
        }
    }

We have seen createOutput before, but because the Directory here is a RAMDirectory, nothing is produced as a file on disk at this point.
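The merge rule in FieldInfos.add is easy to miss: when a field name is added a second time with different settings, a flag that was ever switched on stays on, and norms that were ever stored stay stored. The sketch below reduces this to two flags to keep it short. FieldFlagsSketch and Info are hypothetical names, and the remaining term vector flags are omitted on purpose.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical two-flag reduction of the FieldInfos merge rule; not Lucene source.
class FieldFlagsSketch {
    static final class Info {
        final String name;
        final int number;       // position in insertion order
        boolean isIndexed;
        boolean omitNorms;

        Info(String name, int number, boolean isIndexed, boolean omitNorms) {
            this.name = name;
            this.number = number;
            this.isIndexed = isIndexed;
            this.omitNorms = omitNorms;
        }
    }

    private final List<Info> byNumber = new ArrayList<>();
    private final Map<String, Info> byName = new HashMap<>();

    /** Registers a field, or merges flags into an existing entry of the same name. */
    void add(String name, boolean isIndexed, boolean omitNorms) {
        Info fi = byName.get(name);
        if (fi == null) {
            fi = new Info(name, byNumber.size(), isIndexed, omitNorms);
            byNumber.add(fi);
            byName.put(name, fi);
        } else {
            if (fi.isIndexed != isIndexed) {
                fi.isIndexed = true;    // once indexed, always indexed
            }
            if (fi.omitNorms != omitNorms) {
                fi.omitNorms = false;   // once norms are stored, always stored
            }
        }
    }

    Info get(String name) {
        return byName.get(name);
    }

    int size() {
        return byNumber.size();
    }

    public static void main(String[] args) {
        FieldFlagsSketch infos = new FieldFlagsSketch();
        infos.add("title", false, true);
        infos.add("body", true, false);
        infos.add("title", true, false);   // same name, different flags
        Info title = infos.get("title");
        System.out.println(infos.size() + " fields; title number=" + title.number
                + " isIndexed=" + title.isIndexed + " omitNorms=" + title.omitNorms);
    }
}
```

The field number is simply the size of the list at insertion time, which is why re-adding an existing name never changes it.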

The write(IndexOutput) overload that it calls is:

    public void write(IndexOutput output) throws IOException {
        output.writeVInt(size());
        for (int i = 0; i < size(); i++) {
            FieldInfo fi = fieldInfo(i);
            byte bits = 0x0;
            if (fi.isIndexed)
                bits |= IS_INDEXED;
            if (fi.storeTermVector)
                bits |= STORE_TERMVECTOR;
            if (fi.storePositionWithTermVector)
                bits |= STORE_POSITIONS_WITH_TERMVECTOR;
            if (fi.storeOffsetWithTermVector)
                bits |= STORE_OFFSET_WITH_TERMVECTOR;
            if (fi.omitNorms)
                bits |= OMIT_NORMS;
            output.writeString(fi.name);
            output.writeByte(bits);
        }
    }

This shows that "_0.fnm" stores the names of the Fields together with their settings. Let's count what is written to _0.fnm: first the number of Fields, then, for each Field, its name and its settings. Here are the contents of "_0.fnm":

    01 08 54 68 65 46 69 65 6C 64 01

This may look odd: size() is an int, so why does it take only one byte? The difference from the writeInt we saw last time is that writeVInt is used here:

    public void writeInt(int i) throws IOException {
        writeByte((byte) (i >> 24));
        writeByte((byte) (i >> 16));
        writeByte((byte) (i >> 8));
        writeByte((byte) i);
    }

    /** Writes an int in a variable-length format. Writes between one and
     * five bytes. Smaller values take fewer bytes. Negative numbers are not
     * supported.
     */
    public void writeVInt(int i) throws IOException ...
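The one-byte count in the dump is what a variable-length encoding produces for small values. The sketch below is not the Lucene method body, which the listing above does not show. It assumes the usual VInt layout (seven data bits per byte, low-order bits first, high bit set when more bytes follow), and it assumes that a string is stored as a VInt length followed by its characters, read here as ASCII. VIntSketch and firstFieldName are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical VInt encoder/decoder under the layout assumptions stated above.
class VIntSketch {
    /** The bytes of "_0.fnm" from the hex listing. */
    static final byte[] FNM = {
        0x01, 0x08, 0x54, 0x68, 0x65, 0x46, 0x69, 0x65, 0x6C, 0x64, 0x01
    };

    /** Encodes a non-negative int using 7 data bits per byte. */
    static byte[] encode(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = value;
        while ((i & ~0x7F) != 0) {           // more than 7 bits left
            out.write((i & 0x7F) | 0x80);    // low 7 bits plus continuation flag
            i >>>= 7;
        }
        out.write(i);                        // final byte, flag clear
        return out.toByteArray();
    }

    /** Decodes one VInt starting at pos[0] and advances pos[0] past it. */
    static int decode(byte[] bytes, int[] pos) {
        byte b = bytes[pos[0]++];
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = bytes[pos[0]++];
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    /** Reads the field count and returns the name of the first field. */
    static String firstFieldName(byte[] fnm) {
        int[] pos = { 0 };
        int count = decode(fnm, pos);        // number of fields
        if (count < 1) {
            return null;
        }
        int length = decode(fnm, pos);       // length of the first name
        return new String(fnm, pos[0], length, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println("bytes for 1: " + encode(1).length);
        System.out.println("bytes for 128: " + encode(128).length);
        System.out.println("first field: " + firstFieldName(FNM));
    }
}
```

Read this way, the dump is a field count of 1, a name length of 8, the eight characters of TheField, and a final settings byte of 01.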
