版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Lecture 2: The term vocabulary and postings listsRecap of the previous lectureBasic inverted indexes:Structure: Dictionary and PostingsKey step in construction: SortingBoolean query processingIntersection by linear time “merging”Simple optimizationsOverview of course topicsCh. 12Plan for this lectur
2、eElaborate basic indexingPreprocessing to form the term vocabularyDocumentsTokenizationWhat terms do we put in the index?PostingsFaster merges: skip listsPositional postings and phrase queries3Recall the basic indexing pipelineTokenizerToken stream.FriendsRomansCountrymenLinguistic modulesModified t
3、okens.friendromancountrymanIndexerInverted index.friendromancountryman24213161Documents tobe indexed.Friends, Romans, countrymen.4Parsing a documentWhat format is it in?pdf/word/excel/html?What language is it in?What character set is in use?Each of these is a classification problem, which we will st
4、udy later in the course.But these tasks are often done heuristically Sec. 2.15Complications: Format/languageDocuments being indexed can include docs from many different languagesA single index may have to contain terms of several languages.Sometimes a document or its components can contain multiple
5、languages/formatsFrench email with a German pdf attachment.What is a unit document?A file?An email? (Perhaps one of many in an mbox.)An email with 5 attachments?A group of files (PPT or LaTeX as HTML pages)Sec. 2.16Tokens and Terms7TokenizationInput: “Friends, Romans, Countrymen”Output: TokensFriend
6、sRomansCountrymenA token is a sequence of characters in a documentEach such token is now a candidate for an index entry, after further processingDescribed belowBut what are valid tokens to emit?Sec. 2.2.18TokenizationIssues in tokenization:Finlands capital Finland? Finlands? Finlands?Hewlett-Packard
7、 Hewlett and Packard as two tokens?state-of-the-art: break up hyphenated sequence. co-educationlowercase, lower-case, lower case ?It can be effective to get the user to put in possible hyphensSan Francisco: one token or two? How do you decide it is one token?Sec. 2.2.19Numbers3/12/91 Mar. 12, 199112
8、/3/9155 B.C.B-52My PGP key is 324a3df234cb23e(800) 234-2333Often have embedded spacesOlder IR systems may not index numbersBut often very useful: think about things like looking up error codes/stacktraces on the webWill often index “meta-data” separatelyCreation date, format, etc.Sec. 2.2.110Tokeniz
9、ation: language issuesFrenchLensemble one token or two?L ? L ? Le ?Want lensemble to match with un ensembleUntil at least 2003, it didnt on GoogleInternationalization!German noun compounds are not segmentedLebensversicherungsgesellschaftsangestellterlife insurance company employeeGerman retrieval sy
10、stems benefit greatly from a compound splitter moduleCan give a 15% performance boost for German Sec. 2.2.111Tokenization: language issuesChinese and Japanese have no spaces between words:莎拉波娃现在居住在美国东南部的佛罗里达。Not always guaranteed a unique tokenization Further complicated in Japanese, with multiple a
11、lphabets intermingledDates/amounts in multiple formats500社情報不足時間$500K(約6,000万円)KatakanaHiraganaKanjiRomajiEnd-user can express query entirely in hiragana!Sec. 2.2.112Tokenization: language issuesArabic (or Hebrew) is basically written right to left, but with certain items like numbers written left t
12、o rightWords are separated, but letter forms within a word form complex ligatures startAlgeria achieved its independence in 1962 after 132 years of French occupation.Sec. 2.2.113Stop wordsWith a stop list, you exclude from the dictionary entirely the commonest words. Intuition:They have little seman
13、tic content: the, a, and, to, beThere are a lot of them: 30% of postings for top 30 wordsBut the trend is away from doing this:Good compression techniques (lecture 5) means the space for including stopwords in a system is very smallGood query optimization techniques (lecture 7) mean you pay little a
14、t query time for including stop words.You need them for:Phrase queries: “King of Denmark”Various song titles, etc.: “Let it be”, “To be or not to be”“Relational” queries: “flights to London”Sec. 2.2.214Normalization to termsWe need to “normalize” words in indexed text as well as query words into the
15、 same formWe want to match U.S.A. and USAResult is terms: a term is a (normalized) word type, which is an entry in our IR system dictionaryWe most commonly implicitly define equivalence classes of terms by, e.g., deleting periods to form a termU.S.A., USA USAdeleting hyphens to form a termanti-discr
16、iminatory, antidiscriminatory antidiscriminatorySec. 2.2.315Normalization: other languagesAccents: e.g., French rsum vs. resume.Umlauts: e.g., German: Tuebingen vs. TbingenShould be equivalentMost important criterion:How are your users like to write their queries for these words?Even in languages th
17、at standardly have accents, users often may not type themOften best to normalize to a de-accented termTuebingen, Tbingen, Tubingen TubingenSec. 2.2.316Normalization: other languagesNormalization of things like date forms7月30日 vs. 7/30Japanese use of kana vs. Chinese charactersTokenization and normal
18、ization may depend on the language and so is intertwined with language detectionCrucial: Need to “normalize” indexed text as well as query terms into the same formMorgen will ich in MIT Is thisGerman “mit”?Sec. 2.2.317Case foldingReduce all letters to lower caseexception: upper case in mid-sentence?
19、e.g., General MotorsFed vs. fedSAIL vs. sailOften best to lower case everything, since users will use lowercase regardless of correct capitalizationSec. 2.2.318Normalization to termsAn alternative to equivalence classing is to do asymmetric expansionAn example of where this may be usefulEnter: windo
20、wSearch: window, windowsEnter: windowsSearch: Windows, windows, windowEnter: WindowsSearch: WindowsPotentially more powerful, but less efficientSec. 2.2.319Thesauri and soundexDo we handle synonyms and homonyms?E.g., by hand-constructed equivalence classescar = automobile color = colourWe can rewrit
21、e to form equivalence-class termsWhen the document contains automobile, index it under car-automobile (and vice-versa)Or we can expand a queryWhen the query contains automobile, look under car as wellWhat about spelling mistakes?One approach is soundex, which forms equivalence classes of words based
22、 on phonetic heuristicsMore in lectures 3 and 920LemmatizationReduce inflectional/variant forms to base formE.g.,am, are, is becar, cars, cars, cars carthe boys cars are different colors the boy car be different colorLemmatization implies doing “proper” reduction to dictionary headword formSec. 2.2.
23、421StemmingReduce terms to their “roots” before indexing“Stemming” suggest crude affix choppinglanguage dependente.g., automate(s), automatic, automation all reduced to automat.for example compressed and compression are both accepted as equivalent to compress.for exampl compress andcompress ar both
24、acceptas equival to compressSec. 2.2.422Porters algorithmCommonest algorithm for stemming EnglishResults suggest its at least as good as other stemming optionsConventions + 5 phases of reductionsphases applied sequentiallyeach phase consists of a set of commandssample convention: Of the rules in a c
25、ompound command, select the one that applies to the longest suffix.Sec. 2.2.423Typical rules in Portersses ssies iational atetional tionRules sensitive to the measure of words (m1) EMENT replacement replaccement cementSec. 2.2.424Other stemmersOther stemmers exist, e.g., Lovins stemmer Single-pass,
26、longest suffix removal (about 250 rules)Full morphological analysis at most modest benefits for retrievalDo stemming and other normalizations help?English: very mixed results. Helps recall but harms precisionoperative (dentistry) operoperational (research) operoperating (systems) operDefinitely usef
27、ul for Spanish, German, Finnish, 30% performance gains for Finnish!Sec. 2.2.425Language-specificityMany of the above features embody transformations that areLanguage-specific andOften, application-specificThese are “plug-in” addenda to the indexing processBoth open source and commercial plug-ins are
28、 available for handling theseSec. 2.2.426Dictionary entriesensemble.french時間.japaneseMIT.englishmit.germanguaranteed.englishentries.englishsometimes.englishtokenization.englishThese may be grouped by language (or not). More on this in ranking/query processing.Sec. 2.227Faster postings merges:Skip po
29、inters/Skip lists28Recall basic mergeWalk through the two postings simultaneously, in time linear in the total number of postings entries128312484148641238111721BrutusCaesar28If the list lengths are m and n, the merge takes O(m+n)operations.Can we do better?Yes (if index isnt changing too fast).Sec.
30、 2.329Augment postings with skip pointers (at indexing time)Why?To skip postings that will not figure in the search results.How?Where do we place skip pointers?128248414864311238111721311141128Sec. 2.330Query processing with skip pointers128248414864311238111721311141128Suppose weve stepped through
31、the lists until we process 8 on each list. We match it and advance.We then have 41 and 11 on the lower. 11 is smaller.But the skip successor of 11 on the lower list is 31, sowe can skip ahead past the intervening postings.Sec. 2.331Where do we place skips?Tradeoff:More skips shorter skip spans more
32、likely to skip. But lots of comparisons to skip pointers.Fewer skips few pointer comparison, but then long skip spans few successful skips.Sec. 2.332Placing skipsSimple heuristic: for postings of length L, use L evenly-spaced skip pointers.This ignores the distribution of query terms.Easy if the ind
33、ex is relatively static; harder if L keeps changing because of updates.This definitely used to help; with modern hardware it may not unless youre memory-basedThe I/O cost of loading a bigger postings list can outweigh the gains from quicker in memory merging!Sec. 2.333Phrase queries and positional i
34、ndexes34Phrase queriesWant to be able to answer queries such as “stanford university” as a phraseThus the sentence “I went to university at Stanford” is not a match. The concept of phrase queries has proven easily understood by users; one of the few “advanced search” ideas that worksMany more querie
35、s are implicit phrase queriesFor this, it no longer suffices to store only entriesSec. 2.435A first attempt: Biword indexesIndex every consecutive pair of terms in the text as a phraseFor example the text “Friends, Romans, Countrymen” would generate the biwordsfriends romansromans countrymenEach of
36、these biwords is now a dictionary termTwo-word phrase query-processing is now immediate.Sec. 2.4.136Longer phrase queriesLonger phrases are processed as we did with wild-cards:stanford university palo alto can be broken into the Boolean query on biwords:stanford university AND university palo AND pa
37、lo altoWithout the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase.Can have false positives!Sec. 2.4.137Extended biwordsParse the indexed text and perform part-of-speech-tagging (POST).Bucket the terms into (say) Nouns (N) and articles/prepositions (X).Cal
38、l any string of terms of the form NX*N an extended biword.Each such extended biword is now made a term in the dictionary.Example: catcher in the rye N X X NQuery processing: parse it into Ns and XsSegment query into enhanced biwordsLook up in index: catcher ryeSec. 2.4.138Issues for biword indexesFa
39、lse positives, as noted beforeIndex blowup due to bigger dictionaryInfeasible for more than biwords, big even for themBiword indexes are not the standard solution (for all biwords) but can be part of a compound strategySec. 2.4.139Solution 2: Positional indexesIn the postings, store for each term th
40、e position(s) in which tokens of it appear:Sec. 2.4.240Positional index exampleFor phrase queries, we use a merge algorithm recursively at the document levelBut we now need to deal with more than just equalityWhich of docs 1,2,4,5could contain “to beor not to be”?Sec. 2.4.241Processing a phrase quer
41、yExtract inverted index entries for each distinct term: to, be, or, not.Merge their doc:position lists to enumerate all positions with “to be or not to be”.to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; .be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; .Same general method for proximity sear
42、chesSec. 2.4.242Proximity queriesLIMIT! /3 STATUTE /3 FEDERAL /2 TORT Again, here, /k means “within k words of”.Clearly, positional indexes can be used for such queries; biword indexes cannot.Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?This is a little tricky to do correctly
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2024农产品订购合同
- 2024年广西古建施工承揽合同模板
- 2024年人力资源服务保密协议
- 2024年度城市轨道交通安全监控系统合同
- 2024年建筑内架搭建专业承包合同
- 2024年度产品研发与技术服务合同
- 2024不能强迫续订劳动合同
- 2024年度赠与合同
- 2024年废旧物品回收处理协议
- 2024商铺租赁合同适用于各类商业街、购物中心店铺
- 航站楼管理部《机场使用手册》实施细则
- 脑卒中基本知识课件
- 高效沟通与管理技能提升课件
- 消防维保方案 (详细完整版)
- 四年级上册英语课件- M3U1 In the school (Period 3 ) 上海牛津版试用版(共15张PPT)
- 档案馆建设标准
- 高边坡支护专家论证方案(附有大量的图件)
- 苏教版五年级上册数学试题-第一、二单元 测试卷【含答案】
- 人员定位矿用井口唯一性检测系统
- 电力系统数据标记语言E语言格式规范CIME
- 历史纪年与历史年代的计算方法
评论
0/150
提交评论