天网搜索平台PARADISE_第1页
天网搜索平台PARADISE_第2页
天网搜索平台PARADISE_第3页
天网搜索平台PARADISE_第4页
已阅读5页,还剩60页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

天网搜索平台PARADISE闫宏飞北京大学计算机系网络实验室2009/4/24信息检索前沿概述闫宏飞北京大学计算机系网络实验室2009/4/24345OutlineIssues:searchengineandwebminingGoal:FrameworkPrinciples:Relatedcomponents(照应任务各部分关联)Implementation:Achievements(应用成果)Casestudy(具体到一事)6SearchEngineandWebMiningCrawlingFull-textindexingRetrievingWebarchivingandMining7WebGraph:bowtieteapotModel:bagofwordsInfomall,CDAL:archivehistraceEvaluation:manualautomaticPlatform:proprietaryopen89OutlineMotivationtoBuildPARADISEDesignandImplementPARADISEResearchIssuesofConcern10Corpus(1/3)Shakespeare’scollectedworksAgzippedtar-file[2039k]http://www.cs.su.oz.au/~matty/Shakespeare/shakespeare.tar.gzRCV1,ReutersCorpusVolume1.oneyearofReutersnewswire1996-08-20to1997-08-19containsabout810,000ReutersEnglishLanguageNewsstories.requiresabout2.5GBforstorageoftheuncompressedfiles./data/reuters/reuters.html

11Corpus(2/3)CWT100GB,ChineseWebTestcollection2004年6月搜集,有5,712,710个网页,解压容量为90GB。CWT200GB2005年11月搜集,有37,482,913个网页,压缩容量为197GB/12Corpus(3/3)2008/10/1513Hardwareassumptionssymbol statistic values averageseektime 5ms=5x10−3sb transfertimeperbyte 0.02μs=2x10−8s processor’sclockrate 10−9sp lowleveloperation 0.01μs=10−8s(e.g.,compare&swapaword) sizeofmainmemory severalGB sizeofdiskspace 1TBormore14ReutersRCV1statisticssymbol statistic valueN documents 800,000L avg.#tokensperdoc 200M terms(=wordtypes) 400,000avg.#bytespertoken 6(incl.spaces/punct.)avg.#bytespertoken 4.5(withoutspaces/punct.)avg.#bytesperterm 7.5tokens 100,000,0004.5bytesperwordtokenvs.7.5bytesperwordtype:why?15DocumentsareparsedtoextractwordsandthesearesavedwiththeDocumentID.IdidenactJuliusCaesarIwaskilledi'theCapitol;Brutuskilledme.Doc1SoletitbewithCaesar.ThenobleBrutushathtoldyouCaesarwasambitiousDoc2Indexconstruction16KeystepAfteralldocumentshavebeenparsed,theinvertedfileissortedbyterms.Wefocusonthissortstep.Wehave100Mitemstosort.17ScalingindexconstructionIn-memoryindexconstructiondoesnotscale.Howcanweconstructanindexforverylargecollections?TakingintoaccountthehardwareconstraintsMemory,disk,speedetc.18Sort-basedIndexconstructionAswebuildtheindex,weparsedocsoneatatime.Whilebuildingtheindex,wecannoteasilyexploitcompressiontricks(youcan,butmuchmorecomplex)Thefinalpostingsforanytermareincompleteuntiltheend.At12bytesperpostingsentry,demandsalotofspaceforlargecollections.T=100,000,000inthecaseofRCV1So…wecandothisinmemoryin2008,buttypicalcollectionsaremuchlarger.E.g.NewYorkTimesprovidesindexof>150yearsofnewswireThus:Weneedtostoreintermediateresultsondisk.19Usethesamealgorithmfordisk?Canweusethesameindexconstructionalgorithmforlargercollections,butbyusingdiskinsteadofmemory?No:SortingT=100,000,000recordsondiskistooslow–toomanydiskseeks.Weneedanexternalsortingalgorithm.20BottleneckParseandbuildpostingsentriesonedocatatimeNowsortpostingsentriesbyterm(thenbydocwithineachterm)Doingthiswithrandomdiskseekswouldbetooslow–mustsortT=100MrecordsIfeverycomparisontook2diskseeks,andNitemscouldbesortedwithNlog2Ncomparisons,howlongwouldthistake?2112-byte(4+4+4)records(term,doc,freq).Thesearegeneratedasweparsedocs.Mustnowsort100Msuch12-byterecordsbyterm.DefineaBlock~10MsuchrecordsCaneasilyfitacoupleintomemory.Willhave10suchblockstostartwith.Basicideaofalgorithm:Accumulatepostingsforeachblock,sort,writetodisk.Thenmergetheblocksintoonelongsortedorder.BSBI:Blockedsort-basedIndexing(Sortingwithfewerdiskseeks)22BSBI:Blockedsort-basedIndexing23Remainingproblemwithsort-basedalgorithmOurassumptionwas:wecankeepthedictionaryinmemory.Weneedthedictionary(whichgrowsdynamically)inordertoimplementatermtotermIDmapping.Actually,wecouldworkwithterm,docIDpostingsinsteadoftermID,docIDpostings......butthenintermediatefilesbecomeverylarge.(Wewouldendupwithascalable,butveryslowindexconstructionmethod.)24SPIMI:

Single-passin-memoryindexingKeyidea1:Generateseparatedictionariesforeachblock–noneedtomaintainterm-termIDmappingacrossblocks.Keyidea2:Don’tsort.Accumulatepostingsinpostingslistsastheyoccur.Withthesetwoideaswecangenerateacompleteinvertedindexforeachblock.Theseseparateindexescanthenbemergedintoonebigindex.25

SPIMI-InvertMergingofblocksisanalogoustoBSBI.26DistributedindexingForweb-scaleindexing(don’ttrythisathome!):mustuseadistributedcomputingclusterIndividualmachinesarefault-proneCanunpredictablyslowdownorfailHowdoweexploitsuchapoolofmachines?27GoogledatacentersGoogledatacentersmainlycontaincommoditymachines.Datacentersaredistributedaroundtheworld.Estimate:atotalof1millionservers,3millionprocessors/cores(Gartner2007)Estimate:Googleinstalls100,000serverseachquarter.Basedonexpendituresof200–250milliondollarsperyearThiswouldbe10%ofthecomputingcapacityoftheworld!?!28GoogledatacentersIfinanon-fault-tolerantsystemwith1000nodes,eachnodehas99.9%uptime,whatistheuptimeofthesystem?Answer:37%Calculatethenumberofserversfailingperminuteforaninstallationof1millionservers.29DistributedindexingMaintainamastermachinedirectingtheindexingjob–considered“safe”.Breakupindexingintosetsof(parallel)tasks.Mastermachineassignseachtasktoanidlemachinefromapool.30ParalleltasksWewillusetwosetsofparalleltasksParsersInvertersBreaktheinputdocumentcorpusintosplitsEachsplitisasubsetofdocuments(correspondingtoblocksinBSBI/SPIMI)31ParsersMasterassignsasplittoanidleparsermachineParserreadsadocumentatatimeandemits(term,doc)pairsParserwritespairsintojpartitionsEachpartitionisforarangeofterms’firstletters(e.g.,a-f,g-p,q-z)–herej=3.Nowtocompletetheindexinversion32InvertersAninvertercollectsall(term,doc)pairs(=postings)foroneterm-partition.Sortsandwritestopostingslists33DataflowsplitsParserParserParserMastera-fg-pq-za-fg-pq-za-fg-pq-zInverterInverterInverterPostingsa-fg-pq-zassignassignMapphaseSegmentfilesReducephase34MapReduceTheindexconstructionalgorithmwejustdescribedisaninstanceofMapReduce.MapReduce(DeanandGhemawat2004)isarobustandconceptuallysimpleframeworkfordistributedcomputingwithouthavingtowritecodeforthedistributionpart.TheydescribetheGoogleindexingsystem(ca.2002)asconsistingofanumberofphases,eachimplementedinMapReduce.35MapReduceIndexconstructionwasjustonephase.Anotherphase:transformingaterm-partitionedindexintodocument-partitionedindex.Term-partitioned:onemachinehandlesasubrangeoftermsDocument-partitioned:onemachinehandlesasubrangeofdocumentsmostsearchenginesuseadocument-partitionedindex…betterloadbalancing,etc.36SchemaforindexconstructioninMapReduceSchemaofmapandreducefunctionsmap:input→list(k,v)reduce:(k,list(v))→outputInstantiationoftheschemaforindexconstructionmap:webcollection→list(termID,docID)reduce:(<termID1,list(docID)>,<termID2,list(docID)>,…)→(postingslist1,postingslist2,…)Exampleforindexconstructionmap:d2:Cdied.d1:Ccame,Cc’ed.→(<C,d2>,<died,d2>,<C,d1>,<came,d1>,<C,d1>,<c’ed,d1>reduce:(<C,(d2,d1,d1)>,<died,(d2)>,<came,(d1)>,<c’ed,(d1)>)→(<C,(d1:2,d2:1)>,<died,(d2:1)>,<came,(d1:1)>,<c’ed,(d1:1)>)37有几个问题需要解决很难找到合适的硬件设施,基于这些数据开展工作存储,运行,维护,….缺少处理这些数据的有效的软件工具分析,检索,评测,….如何维护有stateofarts算法的软件工具模块替换,功能测试,性能测试,…38智能中文搜索引擎技术研究及平台构建由一个平台,四个任务组成的.达到平台能够展示四个任务.平台是:一个开放式智能化中文搜索引擎平台任务是:构建适用于中文互联网的文本内容抽取系统及网页文本内容结构化分析基于内容的多媒体信息检索基于自然语言理解的查询分析基于用户属性挖掘的个性化检索/src/paradise/39PARADISEDesign

PlatformforApplying,ResearchingAndDevelopingIntelligentSearchEngineServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersServersBareMetalMachinesAutomationClusterManagementToolsDatanodesTianwangjobsandevaluationjobs40PARADISE索引设计介绍41当前设计中的基本功能Usingfastersingle-passindexingHighlycompressedindexdiskdatastructuresVariouscompressalgorithm(*)Variousinvertindexfileformat(*)indexposition,freqvalueorimpactvalue(*)Indexfieldinformation,suchastitle,dateCachesupport(*)42下一步设计的功能Indexvariousdocumentformat,suchasword,pdf使用MapReduce等技术的索引多种切词的实现与处理增量索引与文档删除Termvector存储多segment的合并策略43InvertedindexDocument-sortedindexThepointersineachinvertedlistarestoredinincreasingdocumentorderUsingd-gapFrequency-sortedindexInvertedlistarestoredindecreasingtermfrequencyscoreImpact-sortedindexThepointersareorderedaccordingtotheimpactvalueImpactvalueisasmallintegerrepresentingtheoverallcontributionoftermttothescoreofdocumentd,includethefactorusedtonormalizefordocumentlength44ProcessingquerymodeDocument-at-a-timeAllinvertedlistsaresimultaneouslyaccessed|q|-wayprocessingiscarriedouttohandleaqueryqof|q|termsTerm-at-a-timeOnlyoninvertedlistisaccessedatanygiventimeAsequenceof|q|-1binaryoperationsareperformedNotclearbetterthandocument-at-a-timeprocessingScore-at-a-timeAllofthetermlistsareopenforreadingProcessedindecreasingscorecontributionorderratherthaninincreasingdocumentnumberorderDynamicquerypruning45IndexlevelsDocumentnumbersonlyOnlysupportBooleanqueriesDocumentnumbersandScoreScorecanbeimpactscoresorfrequentorbothSupportBooleanandrankedqueriesDocumentnumbers,Score,PositionsSupportBoolean,ranked,phrase(proximityqueries)46TypesofindexandsupportquerymodestypedsposOrganizationProcessingmodesupportedDY--Document-sortTermordoc–at-a-timeDSYY-Document-sortImpactorfreq-sortTermordoc–at-a-timeDSPYYYDocument-sortedTermordoc-at-a-timescore-at-a-time47InterleavingPointerinterleavingEachdocumentnumberisimmediatelyfollowedbythecorrespondingworfvalueorposvalueTerminterleavingKeepvariousrelatedpartsofeachinvertedlistincontiguousblocsNon-interleavingUsingseparateinvertedfileEachstoringaparticulartypeofinformationFordistinctdiskpointersmaintainedineachentryinthedictionary48Block-interleavedindexesEachinvertedlistisbuiltupasasequenceofoneormoregroupsWithineachgrouptherearefourormorek-blocks49Cache是一个工具包,为一些模块提供cache功能Cahce策略与存储分开有多种cahce策略 当前实现MRU策略多种存储策略基于STLmap采用模板实现,保证其通过性50DocumentThelogicalrepresentationofaDocumentforindexingandsearchingADocumentisacollectionofFieldablesAFieldableisalogicalrepresentationofauser'scontentthatneedstobeindexedorstoredFieldableshaveanumberofpropertiesthattelltheindexhowtotreatthecontent(likeindexed,tokenized,stored,etc.)51Compress工具包,提供整形数字的解压缩实现多种解压缩算法采用工厂方法模式52AnalysisAnanalyzerbuildsTokenStreams,whichanalyzetextItthusrepresentsapolicyforextractingindextermsfromtextTypicalimplementationsfirstbuildaTokenizer,whichbreaksthestreamofcharactersintorawTokensOneormoreTokenFiltersmaythenbeappliedtotheoutputoftheTokenizer53PARADISEImplementation

/src/paradise/

applicationPKUsearchpdfliteraturesearch2008Olympicssearchmodulecrawlanalysis&indexsearchutilitycompressBuffer...thirdpartycmake,boost,zlib,tidy,openssl,berkeleydbdataTianwangFormatquerylogIPLocator

tutorialCMakeandBoostreferenceBooksandpaperscourseInformationRetrievalCloudComputingDistributedSy

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论