




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、An Introduction to IR3rd CourseA quick quiz on Boolean retrieval (Chap.1 &2)Outline of Chapter 3: Dictionaries and tolerant retrievalWild-card queriesSpelling correctionSoundexA Boolean Retrieval QuizWhat is an “inverted index”, and what is its data structure?How would you construct an inverted inde
2、x for a collection of text documents?What is a “posting”? How to find the postings of two words that occur within the same sentences in a document?Describe an algorithm that finds the postings of a “phrase” of n ordered words W1Wn, where n = 16.Chapter 3“Tolerant” retrievalWild-card queriesSpelling
3、correctionSoundexWild-card queriesWild-card queries: *mon*: find all docs containing any word beginning “mon”.Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon w moo*mon: find words ending in “mon”: harderMaintain an additional B-tree for terms backwards.Can retrieve all wo
4、rds in range: nom w 0.8, declare a match Matching trigramsConsider the query lord we wish to identify words matching 2 of its 3 bigrams (lo, or, rd)loorrdalonelordslothlordmorbidbordercardborderardentStandard postings “merge” will enumerate Adapt this to using Jaccard (or another) measure.CaveatEven
5、 for isolated-word correction, the notion of an index token is critical whats the unit were trying to correct?In Chinese/Japanese, the notions of spell-correction and wildcards are poorly formulated/understoodContext-sensitive spell correctionText: I flew from Heathrow to Narita.Consider the phrase
6、query “flew form Heathrow”Wed like to respondDid you mean “flew from Heathrow”?because no docs matched the query phrase.Context-sensitive correctionNeed surrounding context to catch this.NLP too heavyweight for this.First idea: retrieve dictionary terms close (in weighted edit distance) to each quer
7、y termNow try all possible resulting phrases with one word “fixed” at a timeflew from heathrow fled form heathrowflea form heathrowetc.Suggest the alternative that has lots of hits?ExerciseSuppose that for “flew form Heathrow” we have 7 alternatives for flew, 19 for form and 3 for heathrow.How many
8、“corrected” phrases will we enumerate in this scheme?Another approachBreak phrase query into a conjunction of biwords (Lecture 2).Look for biwords that need only one term corrected.Enumerate phrase matches and rank them!General issue in spell correctionWill enumerate multiple alternatives for “Did y
9、ou mean”Need to figure out which one (or small number) to present to the userUse heuristicsThe alternative hitting most docsQuery log analysis + tweakingFor especially popular, topical queriesComputational costSpell-correction is computationally expensiveAvoid running routinely on every query?Run on
10、ly on queries that matched few docsThesauriThesaurus: language-specific list of synonyms for terms likely to be queriedcar automobile, etc.Machine learning methods can assist more on this in later lectures.Can be viewed as hand-made alternative to edit-distance, etc.Query expansionUsually do query e
11、xpansion rather than index expansionNo index blowupQuery processing slowed downDocs frequently contain equivalencesMay retrieve more junkpuma jaguar retrieves documents on cars instead of on sneakers.SoundexSoundexClass of heuristics to expand a query into phonetic equivalentsLanguage specific mainl
12、y for namesE.g., chebyshev tchebycheffSoundex typical algorithmTurn every token to be indexed into a 4-character reduced formDo the same with query termsBuild and search an index on the reduced forms(when the query calls for a soundex match)/Doc/Articles/SoundEx1/SoundEx1.htm#TopSoundex typical algo
13、rithmRetain the first letter of the word. Change all occurrences of the following letters to 0 (zero): A, E, I, O, U, H, W, Y. Change letters to digits as follows: B, F, P, V 1C, G, J, K, Q, S, X, Z 2D,T 3L 4M, N 5R 6Soundex continuedRemove all pairs of consecutive digits.Remove all zeros from the r
14、esulting string.Pad the resulting string with trailing zeros and return the first four positions, which will be of the form . E.g., Herman becomes H655.Will hermann generate the same code?ExerciseUsing the algorithm described above, find the soundex code for your nameDo you know someone who spells t
15、heir name differently from you, but their name yields the same soundex code?Language detectionMany of the components described above require language detectionFor docs/paragraphs at indexing timeFor query terms at query time much harderFor docs/paragraphs, generally have enough text to apply machine
16、 learning methodsFor queries, lack sufficient textAugment with other cues, such as client properties/specification from applicationDomain of query origination, etc.What queries can we process?We haveBasic inverted index with skip pointersWild-card indexSpell-correctionSoundexQueries such as(SPELL(mo
17、riset) /3 toron*to) OR SOUNDEX(chaikofski)Aside results cachingIf 25% of your users are searching for britney AND spearsthen you probably do need spelling correction, but you dont need to keep on intersecting those two postings listsWeb query distribution is extremely skewed, and you can usefully ca
18、che results for common queries more later.ExerciseDraw yourself a diagram showing the various indexes in a search engine incorporating all this functionalityIdentify some of the key design choices in the index pipeline:Does stemming happen before the Soundex index?What about n-grams?Given a query, how would you parse and dispatch sub-queries to the various indexes?Exercise on previous slideIs the beginning of “what do we we need in our sear
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 毛皮行业可持续发展战略规划考核试卷
- 毛皮制品的绿色能源利用考核试卷
- 安宁疗护医患沟通考核试卷
- 港口及航运设施工程的船舶污染防治措施考核试卷
- 煤炭市场结构优化与产业转型升级路径探索分析研究考核试卷
- 成人学生个性化教学考核试卷
- 智能照明在无人机照明中的应用考核试卷
- 智能配送系统优化与升级考核试卷
- 涂料店经营案例分析考核试卷
- 劳动合同标准文本续签
- 2025山西地质集团招聘37人笔试参考题库附带答案详解
- 2024年新疆中考数学试卷(含答案解析)
- 建筑地基基础检测规范DBJ-T 15-60-2019
- 07FK02防空地下室通风设备安装图集
- 问诊教学课件
- 参考文献的标注规范
- 武松打虎剧本
- 精品资料(2021-2022年收藏)辽宁省建筑材料检测费标准
- 浙江省交通建设工程质量检测和工程材料试验收费标准表
- 脱硝培训课件
- 分子生态学(课堂PPT)
评论
0/150
提交评论