This story appeared on JavaWorld at /jw-09-2000/jw-0915-lucene.html

The Lucene search engine: Powerful, flexible, and free
Easily add searching to your application with Lucene
By Brian Goetz, JavaWorld, 09/15/00

Don't let the low version number - 0.04 as of August 2000 - fool you. The Lucene search engine is a robust, powerful, and flexible search toolkit, ready to tackle many common search problems. And since it's now available under the more flexible LGPL open source license, the price (free!) is right too.

Doug Cutting, an experienced developer of text-search and retrieval tools, created Lucene. Cutting is the primary author of the V-Twin search engine (part of Apple's Copland operating system effort) and is currently a senior architect at Excite. He designed Lucene to make it easy to add indexing and search capability to a broad range of applications, including:

- Searchable email: An email application could let users search archived messages and add new messages to the index as they arrive.
- Online documentation search: A documentation reader - CD-based, Web-based, or embedded within the application - could let users search online documentation or archived publications.
- Searchable Webpages: A Web browser or proxy server could build a personal search engine to index every Webpage a user has visited, allowing users to easily revisit pages.
- Website search: A CGI program could let users search your Website.
- Content search: An application could let the user search saved documents for specific content; this could be integrated into the Open Document dialog.
- Version control and content management: A document management system could index documents, or document versions, so they can be easily retrieved.
- News and wire service feeds: A news server or relay could index articles as they arrive.

Of course, many search engines could perform most of those functions, but few open source search tools offer Lucene's ease of use, rapid implementation, and flexibility.

I first used Lucene when developing Eyebrowse, an open source Java-based tool for cataloguing and browsing mailing lists. (See Resources for a link.) A core requirement for Eyebrowse was flexible message search and retrieval capability. It demanded an indexing and search component that would efficiently update the index base as new messages arrived, allow multiple users to search and update the index base concurrently, and scale to archives containing millions of messages.

Every other open source search engine I evaluated, including Swish-E, Glimpse, iSearch, and libibex, was poorly suited to Eyebrowse's requirements in some way. This would have made integration problematic and/or time-consuming. With Lucene, I added indexing and searching to Eyebrowse in little more than half a day, from initial download to fully working code! This was less than one-tenth of the development time I had budgeted, and yielded a more tightly integrated and feature-rich result than any other search tool I considered.

How search engines work

Creating and maintaining an inverted index is the central problem when building an efficient keyword search engine. To index a document, you must first scan it to produce a list of postings. Postings describe occurrences of a word in a document; they generally include the word, a document ID, and possibly the location(s) or frequency of the word within the document. If you think of the postings as tuples of the form <document ID, word, position>, a set of documents will yield a list of postings sorted by document ID. But in order to efficiently find documents that contain specific words, you should instead sort the postings by word (or by both word and document, which will make multiword searches faster). In this sense, building a search index is basically a sorting problem. The search index is a list of postings sorted by word.
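To make the sorting view concrete, here is a minimal sketch in plain Java - a toy illustration, not Lucene's implementation - that scans a set of documents into postings and then re-sorts them by word:

import java.util.*;

// A posting records one occurrence of a word in a document.
class Posting {
    final String word; final int docId; final int position;
    Posting(String word, int docId, int position) {
        this.word = word; this.docId = docId; this.position = position;
    }
}

public class PostingsSketch {
    public static void main(String[] args) {
        String[] docs = { "free text search", "powerful free search engine" };
        List<Posting> postings = new ArrayList<Posting>();

        // Scanning each document in turn yields postings sorted by document ID.
        for (int docId = 0; docId < docs.length; docId++) {
            String[] words = docs[docId].split("\\s+");
            for (int pos = 0; pos < words.length; pos++)
                postings.add(new Posting(words[pos], docId, pos));
        }

        // Re-sorting by word (then document ID) produces the search index:
        // all postings for a given word become adjacent, so lookups are fast.
        Collections.sort(postings, new Comparator<Posting>() {
            public int compare(Posting a, Posting b) {
                int c = a.word.compareTo(b.word);
                return (c != 0) ? c : (a.docId - b.docId);
            }
        });

        for (Posting p : postings)
            System.out.println(p.word + " -> doc " + p.docId + ", pos " + p.position);
    }
}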
An innovative implementation

Most search engines use B-trees to maintain the index; they are relatively stable with respect to insertion and have well-behaved I/O characteristics (lookups and insertions are O(log n) operations). Lucene takes a slightly different approach: rather than maintaining a single index, it builds multiple index segments and merges them periodically. For each new document indexed, Lucene creates a new index segment, but it quickly merges small segments with larger ones - this keeps the total number of segments small so searches remain fast. To optimize the index for fast searching, Lucene can merge all the segments into one, which is useful for infrequently updated indexes. To prevent conflicts (or locking overhead) between index readers and writers, Lucene never modifies segments in place; it only creates new ones. When merging segments, Lucene writes a new segment and deletes the old ones - after any active readers have closed them. This approach scales well, offers the developer a high degree of flexibility in trading off indexing speed for searching speed, and has desirable I/O characteristics for both merging and searching.

A Lucene index segment consists of several files:

- A dictionary index containing one entry for each 100 entries in the dictionary
- A dictionary containing one entry for each unique word
- A postings file containing an entry for each posting

Since Lucene never updates segments in place, they can be stored in flat files instead of complicated B-trees. For quick retrieval, the dictionary index contains offsets into the dictionary file, and the dictionary holds offsets into the postings file. Lucene also implements a variety of tricks to compress the dictionary and postings files - thereby reducing disk I/O - without incurring substantial CPU overhead.

Evaluating search engines

Other widely used open source search engines include Swish-E, Glimpse, libibex, freeWAIS, and iSearch. Like any software package, each is optimized for use in particular situations; it is often difficult to deploy these tools outside of their intended domains. Consider the following features when evaluating a search engine:

- Incremental versus batch indexing: Some search engines only support batch indexing; once they create an index for a set of documents, adding new documents becomes difficult without reindexing all the documents. Incremental indexing allows easy adding of documents to an existing index. For some applications, like those that handle live data feeds, incremental indexing is critical. Lucene supports both types of indexing.
- Data sources: Many search engines can only index files or Webpages. This handicaps applications where indexed data comes from a database, or where multiple virtual documents exist in a single file, such as a ZIP archive. Lucene allows developers to deliver the document to the indexer through a String or an InputStream, permitting the data source to be abstracted from the data. However, with this approach, the developer must supply the appropriate readers for the data.
- Indexing control: Some search engines can automatically crawl through a directory tree or a Website to find documents to index. While this is convenient if your data is already stored in this manner, crawler-based indexers often provide limited flexibility for applications that require fine-grained control over the indexed documents. Since Lucene operates primarily in incremental mode, it lets the application find and retrieve documents.
- File formats: Some search engines can only index text or HTML documents; others support a filter mechanism, which offers a simple alternative to indexing word processing documents, SGML documents, and other file formats. Lucene supports such a mechanism.
- Content tagging: Some search engines treat a document as a single stream of tokens; others allow the specification of multiple data fields within a document, such as "subject", "abstract", "author", and "body". This permits semantically richer queries like "author contains Hamilton AND body contains Constitution". Lucene supports content tagging by treating documents as collections of fields, and supports queries that specify which field(s) to search.
- Stop-word processing: Common words, such as "a", "and", and "the", add little value to a search index. But since these words are so common, cataloging them will contribute considerably to the indexing time and index size. Most search engines will not index certain words, called stop words. Some use a list of stop words, while others select stop words statistically. Lucene handles stop words with the more general Analyzer mechanism, to be described later, and provides the StopAnalyzer class, which eliminates stop words from the input stream.
- Stemming: Often, a user desires a query for one word to match other similar words. For example, a query for "jump" should probably also match the words "jumped", "jumper", or "jumps". Reducing a word to its root form is called stemming. Lucene does not yet implement stemming, but you could easily add a stemmer through a more sophisticated Analyzer class (see the sketch after this list).
- Query features: Search engines support a variety of query features. Some support full Boolean queries; others support only "and" queries. Some return a relevance score with each hit. Some can handle adjacency or proximity queries - "search" followed by "engine", or "Knicks" near "Celtics" - while others can only search on single keywords. Some can search multiple indexes at once and merge the results to give a meaningful relevance score. Lucene supports a wide range of query features, including all of those listed above. However, Lucene does not support the valuable Soundex, or "sounds like", query.
- Concurrency: Can multiple users search an index at the same time? Can a user search an index while another updates it? Lucene allows users to search an index transactionally, even if another user is simultaneously updating the index.
- Non-English support: Many search engines implicitly assume that English is the target language; this is evident in areas such as stop-word lists, stemming algorithms, and the use of proximity to match phrase queries. As Lucene preprocesses the input stream through the Analyzer class provided by the developer, it is possible to perform language-specific filtering.
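To illustrate the kind of filtering an Analyzer performs, here is a plain-Java sketch - deliberately not written against Lucene's Analyzer API, which has changed across versions - that lowercases tokens, drops stop words, and applies a crude suffix-stripping stemmer:

import java.util.*;

// A plain-Java sketch of analyzer-style token filtering (not Lucene's API):
// lowercase each token, drop stop words, and strip a few common suffixes.
public class AnalyzerSketch {
    static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("a", "an", "and", "or", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (word.length() == 0 || STOP_WORDS.contains(word))
                continue;               // stop-word elimination
            tokens.add(stem(word));     // term normalization
        }
        return tokens;
    }

    // A crude stemmer: a real analyzer would use something like the
    // Porter algorithm instead of naive suffix stripping.
    static String stem(String word) {
        for (String suffix : new String[] { "ed", "er", "s" })
            if (word.endsWith(suffix) && word.length() > suffix.length() + 2)
                return word.substring(0, word.length() - suffix.length());
        return word;
    }

    public static void main(String[] args) {
        // Prints [jump, jump, jump]: the variants collapse to one root form.
        System.out.println(analyze("The jumper jumped and jumps"));
    }
}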
Though by no means exhaustive, the above list offers a starting point for evaluating a search engine for a particular project. Some search tools are poorly suited to certain tasks - understanding your application's requirements can help you choose the right tool for the job.

Using Lucene

I will illustrate how to use Lucene to create, populate, and search an index.
For clarity, import statements and exception handling have been omitted from the sample programs. In these illustrations, I have stored the search index in the filesystem (you can store indexes anywhere, e.g., in memory or in a database). The files being indexed are simple text files. With Lucene, you can also easily index other document formats and documents not stored in files.

Create an index

The simple program CreateIndex.java creates an empty index by generating an IndexWriter object and instructing it to build an empty index. In this example, the name of the directory that will store the index is specified on the command line.

public class CreateIndex {
  // usage: CreateIndex index-directory
  public static void main(String[] args) throws Exception {
    String indexPath = args[0];
    IndexWriter writer;
    // An index is created by opening an IndexWriter with the
    // create argument set to true.
    writer = new IndexWriter(indexPath, null, true);
    writer.close();
  }
}

Index text documents

IndexFiles.java shows how to add documents - the files named on the command line - to an index. For each file, IndexFiles creates a Document object, then calls IndexWriter.addDocument to add it to the index. From Lucene's point of view, a Document is a collection of fields that are name-value pairs. A Field can obtain its value from a String, for short fields, or an InputStream, for long fields. Using fields allows you to partition a document into separately searchable and indexable sections, and to associate metadata - such as name, author, or modification date - with a document. For example, when storing mail messages, you could put a message's subject, author, date, and body in separate fields, then build semantically richer queries like "subject contains Java AND author contains Gosling". In the code below, we store two fields in each Document: "path", to identify the original file path so it can be retrieved later, and "body", for the file's contents.

public class IndexFiles {
  // usage: IndexFiles index-path file . . .
  public static void main(String[] args) throws Exception {
    String indexPath = args[0];
    IndexWriter writer;
    writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
    for (int i = 1; i < args.length; i++) {
      System.out.println("Indexing file " + args[i]);
      InputStream is = new FileInputStream(args[i]);
      // We create a Document with two Fields, one which contains
      // the file path, and one the file's contents.
      Document doc = new Document();
      doc.add(Field.UnIndexed("path", args[i]));
      doc.add(Field.Text("body", (Reader) new InputStreamReader(is)));
      writer.addDocument(doc);
      is.close();
    }
    writer.close();
  }
}
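As a variation, the sketch below builds a Document for a mail message with the separate subject, author, date, and body fields suggested above. It follows the Field.Text and Field.UnIndexed patterns from IndexFiles.java, but the MailMessage class and its accessors are hypothetical stand-ins for your mail store's API, and imports are omitted as in the other samples:

// A sketch of indexing a mail message with separate fields. MailMessage
// and its accessors are hypothetical; only the Field usage follows the
// patterns shown above.
Document indexMessage(MailMessage msg) {
    Document doc = new Document();
    // Short metadata fields obtain their values from Strings.
    doc.add(Field.Text("subject", msg.getSubject()));
    doc.add(Field.Text("author", msg.getAuthor()));
    doc.add(Field.UnIndexed("date", msg.getDate()));   // stored, but not searched
    // The (potentially long) body is delivered through a Reader.
    doc.add(Field.Text("body", new InputStreamReader(msg.getBodyStream())));
    return doc;
}

With fields stored this way, a query can then be restricted to particular fields, as in the author:gosling example shown below.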
Search

Search.java provides an example of how to search the index. While the com.lucene.Query package contains many classes for building sophisticated queries, here we use the built-in query parser, which handles the most common queries and is less complicated to use. We create a Searcher object, use the QueryParser to create a Query object, and call Searcher.search on the query. The search operation returns a Hits object - a collection of Document objects, one for each document matched by the query - and an associated relevance score for each document, sorted by score.

public class Search {
  public static void main(String[] args) throws Exception {
    String indexPath = args[0], queryString = args[1];
    Searcher searcher = new IndexSearcher(indexPath);
    Query query = QueryParser.parse(queryString, "body", new SimpleAnalyzer());
    Hits hits = searcher.search(query);
    for (int i = 0; i < hits.length(); i++) {
      System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
    }
  }
}

The built-in query parser supports most queries, but if it is insufficient, you can always fall back on the rich set of query-building constructs provided. The query parser can parse queries like these:

- free AND "text search": Search for documents containing "free" and the phrase "text search"
- +text search: Search for documents containing "text" and preferentially containing "search"
- giants -football: Search for "giants" but omit documents containing "football"
- author:gosling java: Search for documents containing "gosling" in the author field and "java" in the body

Beyond basic text documents

Lucene uses three major abstractions to support building text indexes: Document, Analyzer, and Directory. The Document object represents a single document, modeled as a collection of Field objects (name-value pairs). For each document to be indexed, the application creates a Document object and adds it to the index store.

The Analyzer converts the contents of each Field into a sequence of tokens. A Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation - such as stop-word elimination, stemming, filtering, term normalization, or language translation - has been applied. The application filters undesired tokens, like stop words or portions of the input that do not need to be indexed, through the Analyzer class. It also modifies tokens as they are encountered in the input, to perform stemming or other term normalization. Conveniently, Lucene comes with a set of standard Analyzer objects for handling common transformations like word identification and stop-word elimination, so indexing simple text documents requires no additional work. If these aren't enough, the developer can provide more sophisticated analyzers.

The application provides the document data in the form of a String or InputStream, which the Analyzer converts to a stream of tokens. Because of this, Lucene can index data from any data source, not just files. If the documents are stored in files, use FileInputStream to retrieve them, as illustrated in IndexFiles.java. If they are stored in an Oracle database, provide an InputStream class to retrieve them. If a document is not a text file but an HTML or XML file, for example, you can extract content by eliminating markup such as HTML tags, document headers, or formatting instructions. This can be done with a FilterInputStream, which converts a document stream into a stream containing only the document's content text, connected to the InputStream that retrieves the document. So, if we wanted to index a collection of XML documents stored in an Oracle database, the resulting code would be very similar to IndexFiles.java. But it would use an application-provided InputStream class to retrieve the document from the database (instead of FileInputStream), as well as an application-provided FilterInputStream to parse the XML and extract the desired content.
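As a sketch of such a markup-stripping filter, here is a deliberately naive FilterInputStream that passes through only the text outside of tags. It is an illustration, not production code: a real application would use a proper parser, since this version ignores comments, CDATA sections, and attribute values containing '>'.

import java.io.*;

// A naive markup-stripping filter: everything between '<' and '>' is dropped,
// so only the document's content text reaches the indexer.
public class TagStrippingInputStream extends FilterInputStream {
    private boolean insideTag = false;

    public TagStrippingInputStream(InputStream in) {
        super(in);
    }

    public int read() throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<')      insideTag = true;
            else if (c == '>') insideTag = false;
            else if (!insideTag) return c;   // pass content text through
        }
        return -1;                           // end of stream
    }

    // Route bulk reads through read() so the filtering stays consistent.
    public int read(byte[] b, int off, int len) throws IOException {
        int count = 0;
        while (count < len) {
            int c = read();
            if (c == -1) return (count == 0) ? -1 : count;
            b[off + count++] = (byte) c;
        }
        return count;
    }
}

You would connect it in front of whatever InputStream retrieves the document, for example new InputStreamReader(new TagStrippingInputStream(databaseStream)).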
Just as Lucene allows the application to control the handling of raw document data through the Analyzer and InputStream classes, it also defines an abstract class for reading and writing the index store (Directory). Lucene also provides concrete implementations of Directory for storing indexes in RAM (RAMDirectory) or in files (FSDirectory). If, for instance, you want to store the index data in a document con
