版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
IntroducingInformationRetrievalandWebSearchInformationRetrievalInformationRetrieval(IR)isfindingmaterial(usuallydocuments)ofanunstructurednature(usuallytext)thatsatisfiesaninformationneedfromwithinlargecollections(usuallystoredoncomputers).Thesedayswefrequentlythinkfirstofwebsearch,buttherearemanyothercases:E-mailsearchSearchingyourlaptopCorporateknowledgebasesLegalinformationretrieval2Unstructured(text)vs.structured(database)datainthemid-nineties3Unstructured(text)vs.structured(database)datatoday4BasicassumptionsofInformationRetrievalCollection:AsetofdocumentsAssumeitisastaticcollectionforthemomentGoal:Retrievedocumentswithinformationthatisrelevanttotheuser’sinformationneed
andhelpstheusercompleteatask5Sec.1.1howtrapmicealiveTheclassicsearchmodelCollectionUsertaskInfoneed
Query
Results
Searchengine
Query
refinementGetridofmiceinapoliticallycorrectwayInfoaboutremovingmicewithoutkillingthemMisconception?Misformulation?SearchHowgoodaretheretrieveddocs?Precision:Fractionofretrieveddocsthatarerelevanttotheuser’sinformationneedRecall
:FractionofrelevantdocsincollectionthatareretrievedMoreprecisedefinitionsandmeasurementstofollowlater7Sec.1.1Term-documentincidencematricesUnstructureddatain1620WhichplaysofShakespearecontainthewordsBrutus
AND
CaesarbutNOT
Calpurnia?OnecouldgrepallofShakespeare’splaysforBrutusandCaesar,thenstripoutlinescontainingCalpurnia?Whyisthatnottheanswer?Slow(forlargecorpora)NOT
Calpurniaisnon-trivialOtheroperations(e.g.,findthewordRomansnear
countrymen)notfeasibleRankedretrieval(bestdocumentstoreturn)Laterlectures9Sec.1.1Term-documentincidencematrices1ifplaycontainsword,0otherwiseBrutus
AND
Caesar
BUT
NOT
CalpurniaSec.1.1IncidencevectorsSowehavea0/1vectorforeachterm.Toanswerquery:takethevectorsforBrutus,CaesarandCalpurnia(complemented)bitwiseAND.110100AND110111AND101111=10010011Sec.1.1AnswerstoqueryAntonyandCleopatra,
ActIII,SceneiiAgrippa[AsidetoDOMITIUSENOBARBUS]:Why,Enobarbus,WhenAntonyfoundJuliusCaesardead,Hecriedalmosttoroaring;andheweptWhenatPhilippihefoundBrutusslain.Hamlet,ActIII,SceneiiLordPolonius:IdidenactJuliusCaesarIwaskilledi’theCapitol;Brutuskilledme.12Sec.1.1BiggercollectionsConsiderN=1milliondocuments,eachwithabout1000words.Avg6bytes/wordincludingspaces/punctuation6GBofdatainthedocuments.SaythereareM=500Kdistincttermsamongthese.13Sec.1.1Can’tbuildthematrix500Kx1Mmatrixhashalf-a-trillion0’sand1’s.Butithasnomorethanonebillion1’s.matrixisextremelysparse.What’sabetterrepresentation?Weonlyrecordthe1positions.14Why?Sec.1.1TheInvertedIndexThekeydatastructureunderlyingmodernIRInvertedindexForeachtermt,wemuststorealistofalldocumentsthatcontaint.IdentifyeachdocbyadocID,adocumentserialnumberCanweusedfixed-sizearraysforthis?16WhathappensifthewordCaesarisaddedtodocument14?Sec.1.2BrutusCalpurniaCaesar12456165713212411314517323117454101InvertedindexWeneedvariable-sizepostingslistsOndisk,acontinuousrunofpostingsisnormalandbestInmemory,canuselinkedlistsorvariablelengtharraysSometradeoffsinsize/easeofinsertion17DictionaryPostingsSortedbydocID(morelateronwhy).PostingSec.1.2BrutusCalpurniaCaesar12456165713212411314517323117454101TokenizerTokenstreamFriendsRomansCountrymenInvertedindexconstructionLinguisticmodulesModifiedtokensfriendromancountrymanIndexerInvertedindexfriendromancountryman24213161DocumentstobeindexedFriends,Romans,countrymen.Sec.1.2InitialstagesoftextprocessingTokenizationCutcharactersequenceintowordtokensDealwith“John’s”,astate-of-the-artsolutionNormalizationMaptextandquerytermtosameformYouwantU.S.A.andUSAtomatchStemmingWemaywishdifferentformsofaroottomatchauthorize,authorizationStopwordsWemayomitverycommonwords(ornot)the,a,to,ofIndexersteps:TokensequenceSequenceof(Modifiedtoken,DocumentID)pairs.IdidenactJuliusCaesarIwaskilledi’theCapitol;Brutuskilledme.Doc1SoletitbewithCaesar.ThenobleBrutushathtoldyouCaesarwasambitiousDoc2Sec.1.2Indexersteps:SortSortbytermsAndthendocIDCoreindexingstepSec.1.2Indexersteps:Dictionary&PostingsMultipletermentriesinasingledocumentaremerged.SplitintoDictionaryandPostingsDoc.frequencyinformationisadded.Whyfrequency?Willdiscusslater.Sec.1.2Wheredowepayinstorage?23PointersTermsandcountsIRsystemimplementationHowdoweindexefficiently?Howmuchstoragedoweneed?Sec.1.2ListsofdocIDsQueryprocessingwithaninvertedindexTheindexwejustbuiltHowdoweprocessaquery?Later-whatkindsofqueriescanweprocess?25OurfocusSec.1.3Queryprocessing:ANDConsiderprocessingthequery:Brutus
AND
CaesarLocateBrutusintheDictionary;Retrieveitspostings.LocateCaesarintheDictionary;Retrieveitspostings.“Merge”thetwopostings(intersectthedocumentsets):2612834248163264123581321BrutusCaesarSec.1.3ThemergeWalkthroughthetwopostingssimultaneously,intimelinearinthetotalnumberofpostingsentries2734128248163264123581321BrutusCaesarIfthelistlengthsarexandy,themergetakesO(x+y)operations.Crucial:postingssortedbydocID.Sec.1.3Intersectingtwopostingslists
(a“merge”algorithm)28TheBooleanRetrievalModel&ExtendedBooleanModelsBooleanqueries:ExactmatchTheBooleanretrievalmodelisbeingabletoaskaquerythatisaBooleanexpression:BooleanQueriesarequeriesusingAND,ORandNOTtojoinquerytermsViewseachdocumentasasetofwordsIsprecise:documentmatchesconditionornot.PerhapsthesimplestmodeltobuildanIRsystemonPrimarycommercialretrievaltoolfor3decades.ManysearchsystemsyoustilluseareBoolean:Email,librarycatalog,MacOSXSpotlight30Sec.1.3Example:WestLaw://westlaw/Largestcommercial(payingsubscribers)legalsearchservice(started1975;rankingadded1992;newfederatedsearchadded2010)Tensofterabytesofdata;~700,000usersMajorityofusersstillusebooleanqueriesExamplequery:Whatisthestatuteoflimitationsincasesinvolvingthefederaltortclaimsact?LIMIT!/3STATUTEACTION/SFEDERAL/2TORT/3CLAIM/3=within3words,/S=insamesentence31Sec.1.4Example:WestLaw://westlaw/Anotherexamplequery:Requirementsfordisabledpeopletobeabletoaccessaworkplacedisabl!/paccess!/swork-sitework-place(employment/3placeNotethatSPACEisdisjunction,notconjunction!Long,precisequeries;proximityoperators;incrementallydeveloped;notlikewebsearchManyprofessionalsearchersstilllikeBooleansearchYouknowexactlywhatyouaregettingButthatdoesn’tmeanitactuallyworksbetter….Sec.1.4Booleanqueries:
MoregeneralmergesExercise:Adaptthemergeforthequeries:
Brutus
ANDNOT
Caesar Brutus
ORNOT
CaesarCanwestillrunthroughthemergeintimeO(x+y)?Whatcanweachieve?33Sec.1.3MergingWhataboutanarbitraryBooleanformula?(Brutus
ORCaesar)ANDNOT(AntonyORCleopatra)Canwealwaysmergein“linear”time?Linearinwhat?Canwedobetter?34Sec.1.3QueryoptimizationWhatisthebestorderforqueryprocessing?ConsideraquerythatisanANDofnterms.Foreachofthenterms,getitspostings,thenANDthemtogether.BrutusCaesarCalpurnia123581621342481632641281316Query:Brutus
AND
Calpurnia
AND
Caesar35Sec.1.3QueryoptimizationexampleProcessinorderofincreasingfreq:startwithsmallestset,thenkeep
cuttingfurther.36Thisiswhywekeptdocumentfreq.indictionaryExecutethequeryas(Calpurnia
AND
Brutus)
ANDCaesar.Sec.1.3BrutusCaesarCalpurnia123581621342481632641281316Moregeneraloptimizatione.g.,(maddingORcrowd)AND(ignobleORstrife)Getdoc.freq.’sforallterms.EstimatethesizeofeachORbythesumofitsdoc.freq.’s(conservative).ProcessinincreasingorderofORsizes.37Sec.1.3ExerciseRecommendaqueryprocessingorderforWhichtwotermsshouldweprocessfirst?38(tangerineORtrees)AND(marmaladeORskies)AND(kaleidoscopeOReyes)QueryprocessingexercisesExercise:Ifthequeryisfriends
ANDromansAND(NOTcountrymen),howcouldweusethefreqofcountrymen?Exercise:ExtendthemergetoanarbitraryBooleanquery.Canwealwaysguaranteeexecutionintimelinearinthetotalpostingssize?Hint:BeginwiththecaseofaBooleanformulaquery:inthis,eachquerytermappearsonlyonceinthequery.39ExerciseTrythesearchfeatureat://rhymezone/shakespeare/Writedownfivesearchfeaturesyouthinkitcoulddobetter40PhrasequeriesandpositionalindexesPhrasequeriesWewanttobeabletoanswerqueriessuchas“stanforduniversity”–asaphraseThusthesentence“IwenttouniversityatStanford”isnotamatch.Theconceptofphrasequerieshasproveneasilyunderstoodbyusers;oneofthefew“advancedsearch”ideasthatworksManymorequeriesareimplicitphrasequeriesForthis,itnolongersufficestostoreonly<term:docs>entriesSec.2.4Afirstattempt:BiwordindexesIndexeveryconsecutivepairoftermsinthetextasaphraseForexamplethetext“Friends,Romans,Countrymen”wouldgeneratethebiwordsfriendsromansromanscountrymenEachofthesebiwordsisnowadictionarytermTwo-wordphrasequery-processingisnowimmediate.Sec.2.4.1LongerphrasequeriesLongerphrasescanbeprocessedbybreakingthemdownstanforduniversitypaloaltocanbebrokenintotheBooleanqueryonbiwords:stanforduniversityANDuniversitypaloANDpaloaltoWithoutthedocs,wecannotverifythatthedocsmatchingtheaboveBooleanquerydocontainthephrase.Canhavefalsepositives!Sec.2.4.1IssuesforbiwordindexesFalsepositives,asnotedbeforeIndexblowupduetobiggerdictionaryInfeasibleformorethanbiwords,bigevenforthemBiwordindexesarenotthestandardsolution(forallbiwords)butcanbepartofacompoundstrategySec.2.4.1Solution2:PositionalindexesInthepostings,store,foreachtermtheposition(s)inwhichtokensofitappear:<term,numberofdocscontainingterm;doc1:position1,position2…;doc2:position1,position2…;etc.>Sec.2.4.2PositionalindexexampleForphrasequeries,weuseamergealgorithmrecursivelyatthedocumentlevelButwenowneedtodealwithmorethanjustequality<be:993427;1:7,18,33,72,86,231;2:3,149;4:17,191,291,430,434;5:363,367,…>Whichofdocs1,2,4,5couldcontain“tobeornottobe”?Sec.2.4.2ProcessingaphrasequeryExtractinvertedindexentriesforeachdistinctterm:to,be,or,not.Mergetheirdoc:positionliststoenumerateallpositionswith“tobeornottobe”.to:2:1,17,74,222,551;
4:8,16,190,429,433;
7:13,23,191;...be:1:17,19;4:17,191,291,430,434;
5:14,19,101;...SamegeneralmethodforproximitysearchesSec.2.4.2ProximityqueriesLIMIT!/3STATUTE/3FEDERAL/2TORTAgain,here,/kmeans“withinkwordsof”.Clearly,positionalindexescanbeusedforsuchqueries;biwordindexescannot.Exercise:Adaptthelinearmergeofpostingstohandleproximityqueries.Canyoumakeitworkforanyvalueofk?ThisisalittletrickytodocorrectlyandefficientlySeeFigure2.12ofIIRSec.2.4.2PositionalindexsizeApositionalindexexpandspostingsstoragesubstantiallyEventhoughindicescanbecompressedNevertheless,apositionalindexisnowstandardlyusedbecauseofthepowerandusefulnessofphraseandproximityqueries…whetherusedexplicitlyorimplicitlyinarankingretrievalsystem.Sec.2.4.2PositionalindexsizeNeedanentryforeachoccurrence,notjustonceperdocumentIndexsizedependsonaveragedocumentsizeAveragewebpagehas<1000termsSECfilings,books,evensomeepicpoems…easily100,000termsConsideratermwithfrequency0.1%Why?1001100,000111000PositionalpostingsPostingsDocumentsizeSec.2.4.2RulesofthumbApositionalindexis2–4aslargeasanon-positionalindexPositionalindexsize35–50%ofvolumeoforiginaltextCaveat:allofthisholdsfor“English-like”languagesSec
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2024届山东省望留镇庄头中学中考一模英语试题含答案
- 模拟电子技术基础 课件 (张菁)第5、6章 电子电路中的负反馈、信号运算与处理电路
- 2024年热熔胶粘剂相关行业营销方案
- 国电系统-北京市-2023年《通信安规》科目 单选题+多选题+判断题+简答题真题冲刺卷3月份B卷
- 2024年改装汽车相关行业营销方案
- 观察 南水北调是刚刚展开的历史考卷
- 高中生物选修1专题一-三测试题3月月考卷
- 2024年医疗康复器材相关公司行业营销方案
- 五年级数学(小数四则混合运算)计算题专项练习及答案汇编
- 四年级数学(四则混合运算)计算题专项练习与答案汇编
- 2024至2030年中国高纯工艺系统行业市场发展调研及投资方向分析报告
- 2024-2030年中国软件外包行业市场发展分析及前景趋势与投资研究报告
- 百融云创风险决策引擎V5产品操作手册
- 《第1课假期有收获》教案教学设计-1.假期有收获-二年级上册道德与法治
- 2024-2030年中国招商引资模式深度分析及前景趋势与投资发展战略研究报告
- 2024年初级养老护理员考前通关必练题库(含答案)
- 《中国心力衰竭诊断和治疗指南2024》解读(总)
- 学校食堂餐饮服务监管协议
- 膨胀节检修施工方案
- 湖北介绍课件
- (高清版)JTG 5210-2018 公路技术状况评定标准
评论
0/150
提交评论