版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
天网搜索平台PARADISE闫宏飞北京大学计算机系网络实验室2009/4/24信息检索前沿概述闫宏飞北京大学计算机系网络实验室2009/4/24345OutlineIssues:searchengineandwebminingGoal:FrameworkPrinciples:Relatedcomponents(照应任务各部分关联)Implementation:Achievements(应用成果)Casestudy(具体到一事)6SearchEngineandWebMiningCrawlingFull-textindexingRetrievingWebarchivingandMining7WebGraph:bowtieteapotModel:bagofwordsInfomall,CDAL:archivehistraceEvaluation:manualautomaticPlatform:proprietaryopen89OutlineMotivationtoBuildPARADISEDesignandImplementPARADISEResearchIssuesofConcern10Corpus(1/3)Shakespeare’scollectedworksAgzippedtar-file[2039k]http://www.cs.su.oz.au/~matty/Shakespeare/shakespeare.tar.gzRCV1,ReutersCorpusVolume1.oneyearofReutersnewswire1996-08-20to1997-08-19containsabout810,000ReutersEnglishLanguageNewsstories.requiresabout2.5GBforstorageoftheuncompressedfiles./data/reuters/reuters.html
11Corpus(2/3)CWT100GB,ChineseWebTestcollection2004年6月搜集,有5,712,710个网页,解压容量为90GB。CWT200GB2005年11月搜集,有37,482,913个网页,压缩容量为197GB/12Corpus(3/3)2008/10/1513Hardwareassumptionssymbol statistic values averageseektime 5ms=5x10−3sb transfertimeperbyte 0.02μs=2x10−8s processor’sclockrate 10−9sp lowleveloperation 0.01μs=10−8s(e.g.,compare&swapaword) sizeofmainmemory severalGB sizeofdiskspace 1TBormore14ReutersRCV1statisticssymbol statistic valueN documents 800,000L avg.#tokensperdoc 200M terms(=wordtypes) 400,000avg.#bytespertoken 6(incl.spaces/punct.)avg.#bytespertoken 4.5(withoutspaces/punct.)avg.#bytesperterm 7.5tokens 100,000,0004.5bytesperwordtokenvs.7.5bytesperwordtype:why?15DocumentsareparsedtoextractwordsandthesearesavedwiththeDocumentID.IdidenactJuliusCaesarIwaskilledi'theCapitol;Brutuskilledme.Doc1SoletitbewithCaesar.ThenobleBrutushathtoldyouCaesarwasambitiousDoc2Indexconstruction16KeystepAfteralldocumentshavebeenparsed,theinvertedfileissortedbyterms.Wefocusonthissortstep.Wehave100Mitemstosort.17ScalingindexconstructionIn-memoryindexconstructiondoesnotscale.Howcanweconstructanindexforverylargecollections?TakingintoaccountthehardwareconstraintsMemory,disk,speedetc.18Sort-basedIndexconstructionAswebuildtheindex,weparsedocsoneatatime.Whilebuildingtheindex,wecannoteasilyexploitcompressiontricks(youcan,butmuchmorecomplex)Thefinalpostingsforanytermareincompleteuntiltheend.At12bytesperpostingsentry,demandsalotofspaceforlargecollections.T=100,000,000inthecaseofRCV1So…wecandothisinmemoryin2008,buttypicalcollectionsaremuchlarger.E.g.NewYorkTimesprovidesindexof>150yearsofnewswireThus:Weneedtostoreintermediateresultsondisk.19Usethesamealgorithmfordisk?Canweusethesameindexconstructionalgorithmforlargercollections,butbyusingdiskinsteadofmemory?No:SortingT=100,000,000recordsondiskistooslow–toomanydiskseeks.Weneedanexternalsortingalgorithm.20BottleneckParseandbuildpostingsentriesonedocatatimeNowsortpostingsentriesbyterm(thenbydocwithineachterm)Doingthiswithrandomdiskseekswouldbetooslow–mustsortT=100MrecordsIfeverycomparisontook2diskseeks,andNitemscouldbesortedwithNlog2Ncomparisons,howlongwouldthistake?2112-byte(4+4+4)records(term,doc,freq).Thesearegeneratedasweparsedocs.Mustnowsort100Msuch12-byterecordsbyterm.DefineaBlock~10MsuchrecordsCaneasilyfitacoupleintomemory.Willhave10suchblockstostartwith.Basicideaofalgorithm:Accumulatepostingsforeachblock,sort,writetodisk.Thenmergetheblocksintoonelongsortedorder.BSBI:Blockedsort-basedIndexing(Sortingwithfewerdiskseeks)22BSBI:Blockedsort-basedIndexing23Remainingproblemwithsort-basedalgorithmOurassumptionwas:wecankeepthedictionaryinmemory.Weneedthedictionary(whichgrowsdynamically)inordertoimplementatermtotermIDmapping.Actually,wecouldworkwithterm,docIDpostingsinsteadoftermID,docIDpostings......butthenintermediatefilesbecomeverylarge.(Wewouldendupwithascalable,butveryslowindexconstructionmethod.)24SPIMI:
Single-passin-memoryindexingKeyidea1:Generateseparatedictionariesforeachblock–noneedtomaintainterm-termIDmappingacrossblocks.Keyidea2:Don’tsort.Accumulatepostingsinpostingslistsastheyoccur.Withthesetwoideaswecangenerateacompleteinvertedindexforeachblock.Theseseparateindexescanthenbemergedintoonebigindex.25
SPIMI-InvertMergingofblocksisanalogoustoBSBI.26DistributedindexingForweb-scaleindexing(don’ttrythisathome!):mustuseadistributedcomputingclusterIndividualmachinesarefault-proneCanunpredictablyslowdownorfailHowdoweexploitsuchapoolofmachines?27GoogledatacentersGoogledatacentersmainlycontaincommoditymachines.Datacentersaredistributedaroundtheworld.Estimate:atotalof1millionservers,3millionprocessors/cores(Gartner2007)Estimate:Googleinstalls100,000serverseachquarter.Basedonexpendituresof200–250milliondollarsperyearThiswouldbe10%ofthecomputingcapacityoftheworld!?!28GoogledatacentersIfinanon-fault-tolerantsystemwith1000nodes,eachnodehas99.9%uptime,whatistheuptimeofthesystem?Answer:37%Calculatethenumberofserversfailingperminuteforaninstallationof1millionservers.29DistributedindexingMaintainamastermachinedirectingtheindexingjob–considered“safe”.Breakupindexingintosetsof(parallel)tasks.Mastermachineassignseachtasktoanidlemachinefromapool.30ParalleltasksWewillusetwosetsofparalleltasksParsersInvertersBreaktheinputdocumentcorpusintosplitsEachsplitisasubsetofdocuments(correspondingtoblocksinBSBI/SPIMI)31ParsersMasterassignsasplittoanidleparsermachineParserreadsadocumentatatimeandemits(term,doc)pairsParserwritespairsintojpartitionsEachpartitionisforarangeofterms’firstletters(e.g.,a-f,g-p,q-z)–herej=3.Nowtocompletetheindexinversion32InvertersAninvertercollectsall(term,doc)pairs(=postings)foroneterm-partition.Sortsandwritestopostingslists33DataflowsplitsParserParserParserMastera-fg-pq-za-fg-pq-za-fg-pq-zInverterInverterInverterPostingsa-fg-pq-zassignassignMapphaseSegmentfilesReducephase34MapReduceTheindexconstructionalgorithmwejustdescribedisaninstanceofMapReduce.MapReduce(DeanandGhemawat2004)isarobustandconceptuallysimpleframeworkfordistributedcomputingwithouthavingtowritecodeforthedistributionpart.TheydescribetheGoogleindexingsystem(ca.2002)asconsistingofanumberofphases,eachimplementedinMapReduce.35MapReduceIndexconstructionwasjustonephase.Anotherphase:transformingaterm-partitionedindexintodocument-partitionedindex.Term-partitioned:onemachinehandlesasubrangeoftermsDocument-partitioned:onemachinehandlesasubrangeofdocumentsmostsearchenginesuseadocument-partitionedindex…betterloadbalancing,etc.36SchemaforindexconstructioninMapReduceSchemaofmapandreducefunctionsmap:input→list(k,v)reduce:(k,list(v))→outputInstantiationoftheschemaforindexconstructionmap:webcollection→list(termID,docID)reduce:(<termID1,list(docID)>,<termID2,list(docID)>,…)→(postingslist1,postingslist2,…)Exampleforindexconstructionmap:d2:Cdied.d1:Ccame,Cc’ed.→(<C,d2>,<died,d2>,<C,d1>,<came,d1>,<C,d1>,<c’ed,d1>reduce:(<C,(d2,d1,d1)>,<died,(d2)>,<came,(d1)>,<c’ed,(d1)>)→(<C,(d1:2,d2:1)>,<died,(d2:1)>,<came,(d1:1)>,<c’ed,(d1:1)>)37有几个问题需要解决很难找到合适的硬件设施,基于这些数据开展工作存储,运行,维护,….缺少处理这些数据的有效的软件工具分析,检索,评测,….如何维护有stateofarts算法的软件工具模块替换,功能测试,性能测试,…38智能中文搜索引擎技术研究及平台构建由一个平台,四个任务组成的.达到平台能够展示四个任务.平台是:一个开放式智能化中文搜索引擎平台任务是:构建适用于中文互联网的文本内容抽取系统及网页文本内容结构化分析基于内容的多媒体信息检索基于自然语言理解的查询分析基于用户属性挖掘的个性化检索/src/paradise/39PARADISEDesign
PlatformforApplying,ResearchingAndDevelopingIntelligentSearchEngineServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersBareMetalMachinesAutomationClusterManagementToolsDatanodesTianwangjobsandevaluationjobs40PARADISE索引设计介绍41当前设计中的基本功能Usingfastersingle-passindexingHighlycompressedindexdiskdatastructuresVariouscompressalgorithm(*)Variousinvertindexfileformat(*)indexposition,freqvalueorimpactvalue(*)Indexfieldinformation,suchastitle,dateCachesupport(*)42下一步设计的功能Indexvariousdocumentformat,suchasword,pdf使用MapReduce等技术的索引多种切词的实现与处理增量索引与文档删除Termvector存储多segment的合并策略43InvertedindexDocument-sortedindexThepointersineachinvertedlistarestoredinincreasingdocumentorderUsingd-gapFrequency-sortedindexInvertedlistarestoredindecreasingtermfrequencyscoreImpact-sortedindexThepointersareorderedaccordingtotheimpactvalueImpactvalueisasmallintegerrepresentingtheoverallcontributionoftermttothescoreofdocumentd,includethefactorusedtonormalizefordocumentlength44ProcessingquerymodeDocument-at-a-timeAllinvertedlistsaresimultaneouslyaccessed|q|-wayprocessingiscarriedouttohandleaqueryqof|q|termsTerm-at-a-timeOnlyoninvertedlistisaccessedatanygiventimeAsequenceof|q|-1binaryoperationsareperformedNotclearbetterthandocument-at-a-timeprocessingScore-at-a-timeAllofthetermlistsareopenforreadingProcessedindecreasingscorecontributionorderratherthaninincreasingdocumentnumberorderDynamicquerypruning45IndexlevelsDocumentnumbersonlyOnlysupportBooleanqueriesDocumentnumbersandScoreScorecanbeimpactscoresorfrequentorbothSupportBooleanandrankedqueriesDocumentnumbers,Score,PositionsSupportBoolean,ranked,phrase(proximityqueries)46TypesofindexandsupportquerymodestypedsposOrganizationProcessingmodesupportedDY--Document-sortTermordoc–at-a-timeDSYY-Document-sortImpactorfreq-sortTermordoc–at-a-timeDSPYYYDocument-sortedTermordoc-at-a-timescore-at-a-time47InterleavingPointerinterleavingEachdocumentnumberisimmediatelyfollowedbythecorrespondingworfvalueorposvalueTerminterleavingKeepvariousrelatedpartsofeachinvertedlistincontiguousblocsNon-interleavingUsingseparateinvertedfileEachstoringaparticulartypeofinformationFordistinctdiskpointersmaintainedineachentryinthedictionary48Block-interleavedindexesEachinvertedlistisbuiltupasasequenceofoneormoregroupsWithineachgrouptherearefourormorek-blocks49Cache是一个工具包,为一些模块提供cache功能Cahce策略与存储分开有多种cahce策略 当前实现MRU策略多种存储策略基于STLmap采用模板实现,保证其通过性50DocumentThelogicalrepresentationofaDocumentforindexingandsearchingADocumentisacollectionofFieldablesAFieldableisalogicalrepresentationofauser'scontentthatneedstobeindexedorstoredFieldableshaveanumberofpropertiesthattelltheindexhowtotreatthecontent(likeindexed,tokenized,stored,etc.)51Compress工具包,提供整形数字的解压缩实现多种解压缩算法采用工厂方法模式52AnalysisAnanalyzerbuildsTokenStreams,whichanalyzetextItthusrepresentsapolicyforextractingindextermsfromtextTypicalimplementationsfirstbuildaTokenizer,whichbreaksthestreamofcharactersintorawTokensOneormoreTokenFiltersmaythenbeappliedtotheoutputoftheTokenizer53PARADISEImplementation
/src/paradise/
applicationPKUsearchpdfliteraturesearch2008Olympicssearchmodulecrawlanalysis&indexsearchutilitycompressBuffer...thirdpartycmake,boost,zlib,tidy,openssl,berkeleydbdataTianwangFormatquerylogIPLocator
tutorialCMakeandBoostreferenceBooksandpaperscourseInformationRetrievalCloudComputingDistributedSy
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 安全类合同模板
- 京城股合同范例
- 水库土地租赁合同模板
- 汽车分期公司合同范例
- 承包水泥工合同模板
- 屋顶瓦脱落维修合同范例
- 服务协议劳务合同范例
- 店铺装饰合同范例
- 承包矿场运输合同范例
- 不定时 劳务合同范例
- 结算周期与付款方式
- 成人氧气吸入疗法-中华护理学会团体标准
- 林木种质资源调查表(新表)
- 高考英语单词3500记忆短文40篇
- 2024年 贵州茅台酒股份有限公司招聘笔试参考题库含答案解析
- 河上建坝纠纷可行性方案
- 施工人材机配置方案3
- 小学低年级自主识字的教学策略
- 第五单元学雷锋在行动(教案)全国通用五年级下册综合实践活动
- 服装店人员不稳定分析报告
- GB 37219-2023充气式游乐设施安全规范
评论
0/150
提交评论