




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
IntroducingInformationRetrievalandWebSearchInformationRetrievalInformationRetrieval(IR)isfindingmaterial(usuallydocuments)ofanunstructurednature(usuallytext)thatsatisfiesaninformationneedfromwithinlargecollections(usuallystoredoncomputers).Thesedayswefrequentlythinkfirstofwebsearch,buttherearemanyothercases:E-mailsearchSearchingyourlaptopCorporateknowledgebasesLegalinformationretrieval2Unstructured(text)vs.structured(database)datainthemid-nineties3Unstructured(text)vs.structured(database)datatoday4BasicassumptionsofInformationRetrievalCollection:AsetofdocumentsAssumeitisastaticcollectionforthemomentGoal:Retrievedocumentswithinformationthatisrelevanttotheuser’sinformationneed
andhelpstheusercompleteatask5Sec.1.1howtrapmicealiveTheclassicsearchmodelCollectionUsertaskInfoneed
Query
Results
Searchengine
Query
refinementGetridofmiceinapoliticallycorrectwayInfoaboutremovingmicewithoutkillingthemMisconception?Misformulation?SearchHowgoodaretheretrieveddocs?Precision:Fractionofretrieveddocsthatarerelevanttotheuser’sinformationneedRecall
:FractionofrelevantdocsincollectionthatareretrievedMoreprecisedefinitionsandmeasurementstofollowlater7Sec.1.1Term-documentincidencematricesUnstructureddatain1620WhichplaysofShakespearecontainthewordsBrutus
AND
CaesarbutNOT
Calpurnia?OnecouldgrepallofShakespeare’splaysforBrutusandCaesar,thenstripoutlinescontainingCalpurnia?Whyisthatnottheanswer?Slow(forlargecorpora)NOT
Calpurniaisnon-trivialOtheroperations(e.g.,findthewordRomansnear
countrymen)notfeasibleRankedretrieval(bestdocumentstoreturn)Laterlectures9Sec.1.1Term-documentincidencematrices1ifplaycontainsword,0otherwiseBrutus
AND
Caesar
BUT
NOT
CalpurniaSec.1.1IncidencevectorsSowehavea0/1vectorforeachterm.Toanswerquery:takethevectorsforBrutus,CaesarandCalpurnia(complemented)bitwiseAND.110100AND110111AND101111=10010011Sec.1.1AnswerstoqueryAntonyandCleopatra,
ActIII,SceneiiAgrippa[AsidetoDOMITIUSENOBARBUS]:Why,Enobarbus,WhenAntonyfoundJuliusCaesardead,Hecriedalmosttoroaring;andheweptWhenatPhilippihefoundBrutusslain.Hamlet,ActIII,SceneiiLordPolonius:IdidenactJuliusCaesarIwaskilledi’theCapitol;Brutuskilledme.12Sec.1.1BiggercollectionsConsiderN=1milliondocuments,eachwithabout1000words.Avg6bytes/wordincludingspaces/punctuation6GBofdatainthedocuments.SaythereareM=500Kdistincttermsamongthese.13Sec.1.1Can’tbuildthematrix500Kx1Mmatrixhashalf-a-trillion0’sand1’s.Butithasnomorethanonebillion1’s.matrixisextremelysparse.What’sabetterrepresentation?Weonlyrecordthe1positions.14Why?Sec.1.1TheInvertedIndexThekeydatastructureunderlyingmodernIRInvertedindexForeachtermt,wemuststorealistofalldocumentsthatcontaint.IdentifyeachdocbyadocID,adocumentserialnumberCanweusedfixed-sizearraysforthis?16WhathappensifthewordCaesarisaddedtodocument14?Sec.1.2BrutusCalpurniaCaesar12456165713212411314517323117454101InvertedindexWeneedvariable-sizepostingslistsOndisk,acontinuousrunofpostingsisnormalandbestInmemory,canuselinkedlistsorvariablelengtharraysSometradeoffsinsize/easeofinsertion17DictionaryPostingsSortedbydocID(morelateronwhy).PostingSec.1.2BrutusCalpurniaCaesar12456165713212411314517323117454101TokenizerTokenstreamFriendsRomansCountrymenInvertedindexconstructionLinguisticmodulesModifiedtokensfriendromancountrymanIndexerInvertedindexfriendromancountryman24213161DocumentstobeindexedFriends,Romans,countrymen.Sec.1.2InitialstagesoftextprocessingTokenizationCutcharactersequenceintowordtokensDealwith“John’s”,astate-of-the-artsolutionNormalizationMaptextandquerytermtosameformYouwantU.S.A.andUSAtomatchStemmingWemaywishdifferentformsofaroottomatchauthorize,authorizationStopwordsWemayomitverycommonwords(ornot)the,a,to,ofIndexersteps:TokensequenceSequenceof(Modifiedtoken,DocumentID)pairs.IdidenactJuliusCaesarIwaskilledi’theCapitol;Brutuskilledme.Doc1SoletitbewithCaesar.ThenobleBrutushathtoldyouCaesarwasambitiousDoc2Sec.1.2Indexersteps:SortSortbytermsAndthendocIDCoreindexingstepSec.1.2Indexersteps:Dictionary&PostingsMultipletermentriesinasingledocumentaremerged.SplitintoDictionaryandPostingsDoc.frequencyinformationisadded.Whyfrequency?Willdiscusslater.Sec.1.2Wheredowepayinstorage?23PointersTermsandcountsIRsystemimplementationHowdoweindexefficiently?Howmuchstoragedoweneed?Sec.1.2ListsofdocIDsQueryprocessingwithaninvertedindexTheindexwejustbuiltHowdoweprocessaquery?Later-whatkindsofqueriescanweprocess?25OurfocusSec.1.3Queryprocessing:ANDConsiderprocessingthequery:Brutus
AND
CaesarLocateBrutusintheDictionary;Retrieveitspostings.LocateCaesarintheDictionary;Retrieveitspostings.“Merge”thetwopostings(intersectthedocumentsets):2612834248163264123581321BrutusCaesarSec.1.3ThemergeWalkthroughthetwopostingssimultaneously,intimelinearinthetotalnumberofpostingsentries2734128248163264123581321BrutusCaesarIfthelistlengthsarexandy,themergetakesO(x+y)operations.Crucial:postingssortedbydocID.Sec.1.3Intersectingtwopostingslists
(a“merge”algorithm)28TheBooleanRetrievalModel&ExtendedBooleanModelsBooleanqueries:ExactmatchTheBooleanretrievalmodelisbeingabletoaskaquerythatisaBooleanexpression:BooleanQueriesarequeriesusingAND,ORandNOTtojoinquerytermsViewseachdocumentasasetofwordsIsprecise:documentmatchesconditionornot.PerhapsthesimplestmodeltobuildanIRsystemonPrimarycommercialretrievaltoolfor3decades.ManysearchsystemsyoustilluseareBoolean:Email,librarycatalog,MacOSXSpotlight30Sec.1.3Example:WestLaw://westlaw/Largestcommercial(payingsubscribers)legalsearchservice(started1975;rankingadded1992;newfederatedsearchadded2010)Tensofterabytesofdata;~700,000usersMajorityofusersstillusebooleanqueriesExamplequery:Whatisthestatuteoflimitationsincasesinvolvingthefederaltortclaimsact?LIMIT!/3STATUTEACTION/SFEDERAL/2TORT/3CLAIM/3=within3words,/S=insamesentence31Sec.1.4Example:WestLaw://westlaw/Anotherexamplequery:Requirementsfordisabledpeopletobeabletoaccessaworkplacedisabl!/paccess!/swork-sitework-place(employment/3placeNotethatSPACEisdisjunction,notconjunction!Long,precisequeries;proximityoperators;incrementallydeveloped;notlikewebsearchManyprofessionalsearchersstilllikeBooleansearchYouknowexactlywhatyouaregettingButthatdoesn’tmeanitactuallyworksbetter….Sec.1.4Booleanqueries:
MoregeneralmergesExercise:Adaptthemergeforthequeries:
Brutus
ANDNOT
Caesar Brutus
ORNOT
CaesarCanwestillrunthroughthemergeintimeO(x+y)?Whatcanweachieve?33Sec.1.3MergingWhataboutanarbitraryBooleanformula?(Brutus
ORCaesar)ANDNOT(AntonyORCleopatra)Canwealwaysmergein“linear”time?Linearinwhat?Canwedobetter?34Sec.1.3QueryoptimizationWhatisthebestorderforqueryprocessing?ConsideraquerythatisanANDofnterms.Foreachofthenterms,getitspostings,thenANDthemtogether.BrutusCaesarCalpurnia123581621342481632641281316Query:Brutus
AND
Calpurnia
AND
Caesar35Sec.1.3QueryoptimizationexampleProcessinorderofincreasingfreq:startwithsmallestset,thenkeep
cuttingfurther.36Thisiswhywekeptdocumentfreq.indictionaryExecutethequeryas(Calpurnia
AND
Brutus)
ANDCaesar.Sec.1.3BrutusCaesarCalpurnia123581621342481632641281316Moregeneraloptimizatione.g.,(maddingORcrowd)AND(ignobleORstrife)Getdoc.freq.’sforallterms.EstimatethesizeofeachORbythesumofitsdoc.freq.’s(conservative).ProcessinincreasingorderofORsizes.37Sec.1.3ExerciseRecommendaqueryprocessingorderforWhichtwotermsshouldweprocessfirst?38(tangerineORtrees)AND(marmaladeORskies)AND(kaleidoscopeOReyes)QueryprocessingexercisesExercise:Ifthequeryisfriends
ANDromansAND(NOTcountrymen),howcouldweusethefreqofcountrymen?Exercise:ExtendthemergetoanarbitraryBooleanquery.Canwealwaysguaranteeexecutionintimelinearinthetotalpostingssize?Hint:BeginwiththecaseofaBooleanformulaquery:inthis,eachquerytermappearsonlyonceinthequery.39ExerciseTrythesearchfeatureat://rhymezone/shakespeare/Writedownfivesearchfeaturesyouthinkitcoulddobetter40PhrasequeriesandpositionalindexesPhrasequeriesWewanttobeabletoanswerqueriessuchas“stanforduniversity”–asaphraseThusthesentence“IwenttouniversityatStanford”isnotamatch.Theconceptofphrasequerieshasproveneasilyunderstoodbyusers;oneofthefew“advancedsearch”ideasthatworksManymorequeriesareimplicitphrasequeriesForthis,itnolongersufficestostoreonly<term:docs>entriesSec.2.4Afirstattempt:BiwordindexesIndexeveryconsecutivepairoftermsinthetextasaphraseForexamplethetext“Friends,Romans,Countrymen”wouldgeneratethebiwordsfriendsromansromanscountrymenEachofthesebiwordsisnowadictionarytermTwo-wordphrasequery-processingisnowimmediate.Sec.2.4.1LongerphrasequeriesLongerphrasescanbeprocessedbybreakingthemdownstanforduniversitypaloaltocanbebrokenintotheBooleanqueryonbiwords:stanforduniversityANDuniversitypaloANDpaloaltoWithoutthedocs,wecannotverifythatthedocsmatchingtheaboveBooleanquerydocontainthephrase.Canhavefalsepositives!Sec.2.4.1IssuesforbiwordindexesFalsepositives,asnotedbeforeIndexblowupduetobiggerdictionaryInfeasibleformorethanbiwords,bigevenforthemBiwordindexesarenotthestandardsolution(forallbiwords)butcanbepartofacompoundstrategySec.2.4.1Solution2:PositionalindexesInthepostings,store,foreachtermtheposition(s)inwhichtokensofitappear:<term,numberofdocscontainingterm;doc1:position1,position2…;doc2:position1,position2…;etc.>Sec.2.4.2PositionalindexexampleForphrasequeries,weuseamergealgorithmrecursivelyatthedocumentlevelButwenowneedtodealwithmorethanjustequality<be:993427;1:7,18,33,72,86,231;2:3,149;4:17,191,291,430,434;5:363,367,…>Whichofdocs1,2,4,5couldcontain“tobeornottobe”?Sec.2.4.2ProcessingaphrasequeryExtractinvertedindexentriesforeachdistinctterm:to,be,or,not.Mergetheirdoc:positionliststoenumerateallpositionswith“tobeornottobe”.to:2:1,17,74,222,551;
4:8,16,190,429,433;
7:13,23,191;...be:1:17,19;4:17,191,291,430,434;
5:14,19,101;...SamegeneralmethodforproximitysearchesSec.2.4.2ProximityqueriesLIMIT!/3STATUTE/3FEDERAL/2TORTAgain,here,/kmeans“withinkwordsof”.Clearly,positionalindexescanbeusedforsuchqueries;biwordindexescannot.Exercise:Adaptthelinearmergeofpostingstohandleproximityqueries.Canyoumakeitworkforanyvalueofk?ThisisalittletrickytodocorrectlyandefficientlySeeFigure2.12ofIIRSec.2.4.2PositionalindexsizeApositionalindexexpandspostingsstoragesubstantiallyEventhoughindicescanbecompressedNevertheless,apositionalindexisnowstandardlyusedbecauseofthepowerandusefulnessofphraseandproximityqueries…whetherusedexplicitlyorimplicitlyinarankingretrievalsystem.Sec.2.4.2PositionalindexsizeNeedanentryforeachoccurrence,notjustonceperdocumentIndexsizedependsonaveragedocumentsizeAveragewebpagehas<1000termsSECfilings,books,evensomeepicpoems…easily100,000termsConsideratermwithfrequency0.1%Why?1001100,000111000PositionalpostingsPostingsDocumentsizeSec.2.4.2RulesofthumbApositionalindexis2–4aslargeasanon-positionalindexPositionalindexsize35–50%ofvolumeoforiginaltextCaveat:allofthisholdsfor“English-like”languagesSec
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 大连东软信息学院《工程材料》2023-2024学年第二学期期末试卷
- 重庆市涪陵区涪陵高中2025届高三下学期阶段性测试(三)物理试题试卷含解析
- 福建省德化一中、安溪一中2025届高三下学期第一次摸底考试历史试题理试卷含解析
- 民办四川天一学院《古代汉语下》2023-2024学年第一学期期末试卷
- 白喉、百日咳、破伤风、乙肝四联制剂项目风险分析和评估报告
- 贵州体育职业学院《专项理论与实践Ⅵ》2023-2024学年第二学期期末试卷
- 铁路货运站服务项目风险分析和评估报告
- 安徽省皖南地区2024-2025学年高三考前最后一次模拟试题语文试题试卷含解析
- 新疆理工学院《TeamProject》2023-2024学年第一学期期末试卷
- 济宁医学院《文献检索与研究综述》2023-2024学年第二学期期末试卷
- 2025年上半年下半年浙江省舟山市港航管理局招聘6人易考易错模拟试题(共500题)试卷后附参考答案
- 2025年中医针灸学主治医师-中医针灸学考试题(附答案)
- 老年人安全用药与护理
- 黑色三分钟生死一瞬间第9、10部
- 适老化住宅改造服务行业深度调研及发展战略咨询报告
- 2025年郑州黄河护理职业学院单招职业技能测试题库及答案1套
- GB/T 45236-2025化工园区危险品运输车辆停车场建设规范
- 新地基基础-基桩静荷载试验考试复习题库(含答案)
- 《致敬英雄》课件
- 房地产开发项目资金监管协议
- 持续集成与自动化部署(CICD)-深度研究
评论
0/150
提交评论