Search Engine Technology, Yan Hongfei, Network Laboratory, Department of Computer Science, Peking University

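Search Engine Technology
December 24, 2004 @ CERNET 2004

Outline
  How search engines work
  Information retrieval: related research and institutions

Search Engines (Web Search Engines)
  Definition: a system that lets users submit a query, retrieves a list of web pages relevant to the query, and outputs the list in ranked order.
  Ways to create the index
    Manual indexing
    Automatic indexing
  System architecture
    Centralized architecture
    Distributed architecture

(Figure: two service extremes, browsing services and search engine services; two semantic extremes, web pages and bag of words.)

The three-stage workflow of a search engine
  Crawling: batch crawling, incremental crawling; crawl targets, crawl strategies
  Preprocessing: keyword extraction; duplicate page elimination; link analysis; indexing
  Serving: query modes and matching; result ranking; document summaries
  (Figure: crawl, organize, serve.)

Search engine system flow (figure)

Tianwang search engine system flow (figure)

Structure of a distributed Web crawling system
  (Figure: a scheduling module and several nodes, each with a coordinator process and crawl processes.)

Tianwang storage format
  version: 1.0                          // version number
  url:                                  // URL
  origin:                               // original URL
  date: Tue, 15 Apr 2003 08:13:06 GMT   // time of harvest
  ip: 162.105.129.12                    // IP address
  unzip-length: 30233                   // if included, the data must be compressed
  length: 18133                         // data length
                                        // a blank line
  XXXXXXXX                              // the following is the data part
  XXXXXXXX
  ....
  XXXXXXXX                              // data end
                                        // insert a new line

Indexes
  Choices for accessing data during query evaluation
  Scan the entire collection
    Typical in early (batch) retrieval systems
    Computational and I/O costs are O(characters in collection)
    Practical for only "small" text collections
    Large memory systems make scanning feasible
  Use indexes for direct access
    Evaluation time is O(query term occurrences in collection)
    Practical for "large" collections
    Many opportunities for optimization
  Hybrids: use a small index, then scan a subset of the collection

Indexes
  What should the index contain?
  Database systems index primary and secondary keys
    This is the hybrid approach
    The index provides fast access to a subset of database records
    Scan the subset to find the solution set
  IR problem: cannot predict the keys that people will use in queries
    Every word in a document is a potential search term
  IR solution: index by all keys (words), which gives full-text indexes

Index Contents
  The contents depend upon the retrieval model
  Feature presence/absence
    Boolean
    Statistical (tf, df, ctf, doclen, maxtf)
    Often about 10% of the size of the raw data, compressed
  Positional
    Feature location within the document
    Granularities include word, sentence, paragraph, etc.
    Coarse granularities are less precise, but take less space
    Word-level granularity is about 20-30% of the size of the raw data, compressed

Indexes: Implementation
  Common implementations of indexes
    Bitmaps
    Signature files
    Inverted files
  Common index components
    Dictionary (lexicon)
    Postings
      document ids
      word positions

Inverted Files (figures; one panel is labeled "No positional data indexed")

Word-Level Inverted File (figure)

Inverted Search Algorithm
  1. Find the query elements (terms) in the lexicon
  2. Retrieve the postings for each lexicon entry
  3. Manipulate the postings according to the retrieval model

Word-Level Inverted File
  Query: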

  1. porridge & pot (BOOL)
  2. "porridge pot" (BOOL)
  3. porridge pot (VSM)
  (Figure: lexicon, postings, answer.)

Outline
  How search engines work
  Information retrieval: related research and institutions

A Brief History of Modern Information Retrieval
  In 1945, Vannevar Bush published "As We May Think" in the Atlantic Monthly.
  In the 1960s, Gerard Salton and his students built the SMART system, and Cyril Cleverdon carried out the Cranfield evaluations.
  The 1970s and 1980s saw many developments built on the advances of the 1960s.
  In 1992, the Text Retrieval Conference (TREC) was founded.
  From 1996, the algorithms developed in IR were employed for searching the Web.

Clustering of SIGIR papers by topic vs. year (figures, one per topic)
  Question answering
  Clustering
  Inverted files & implementations
  Message understanding & TDT
  Filtering
  Hypertext IR, multiple evidence
  Probabilistic & language models
  Distributed IR
  Evaluation
  Topic distillation & linkage retrieval
  Text categorisation
  Document summarisation
  Cross-lingual

Information retrieval: related research and institutions
  CIIR, University of Massachusetts
  LTI, Carnegie Mellon University
  The Stanford University DB Group
  Microsoft Research Asia
  TREC
  Peking University, Network Laboratory, Tianwang group

Introduction to Lemur

Lemur Toolkit
  Goal: a research system to support language modeling (LM) and IR research
    ad hoc retrieval, distributed retrieval, cross-language IR, summarization, filtering, and classification
  Features:
    Index construction for large-scale document databases
    Implementation of simple language models
    Retrieval systems based on language models and several other retrieval models
  Implementation:
    C and C++
    Unix/Windows
    Current version 3.1

MRA: Towards Next Generation Web Search
  From pages to blocks
    Analyze the Web at finer granularity
  From surface Web to deep Web
    Unleash the huge assets of high-value information
  From unstructured to structured
    Provide well-organized results
  From relevance to intelligence
    Contribute knowledge discovery with search
  From desktop search to mobile search
    Bridge physical-world search to digital-world search

The Stanford Univ. DB Group
  WebBase
    Crawling, storage, indexing, and querying of large collections of Web pages
  Digital Libraries
    Infrastructure and services for creating, disseminating, sharing and managing information

TREC Conference
  Established in 1992 to evaluate large-scale IR
    Retrieving documents from a gigabyte collection
  Has run continuously since then
    The TREC 2004 (13th) meeting is in November
  Run by NIST's Information Access Division
  Probably the best-known IR evaluation setting
    Started with 25 participating organizations in the 1992 evaluation
    In 2003, there were 93 groups from 22 different countries
  Proceedings available on-line ()
    Overview of TREC 2003 at

TREC General Format
  TREC consists of IR research tracks
    Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP, …
  Each track works on roughly the same model
    November: track approved by the TREC community
    Winter: the track's members finalize the format for the track
    Spring: researchers train their systems based on the specification
    Summer: researchers carry out the formal evaluation
      Usually a "blind" evaluation: researchers do not know the answers
    Fall: NIST carries out the evaluation
    November: group meeting (TREC) to find out:
      How well your site did
      How others tackled the problem
  Many tracks are run by volunteers outside of NIST (e.g. Web)
  "Coopetition" model of evaluation
    Successful approaches are generally adopted in the next cycle

TREC Tracks (figure)

Summary of VLC/Web Track evaluation 1996-2003 (figure)

Tianwang Group @ PKU

CWT100g construction timeline (figure)
  "One small step for me, one giant leap for mankind!"

Data sharing from Peking University's Yanqiong (燕穹) as of 2004-12-20
  2.5/8.8 = 28.4%

Participating teams that submitted results (table)
  Note: the pooling also includes the retrieval results of five search engines: Google, Yisou, Baidu, Sogou, and Zhongsou.

Evaluation results (figures: topic distillation; navigational search)
  TIANWANG_RUN is for reference only.

Summary
  How search engines work
  Information retrieval: related research and institutions

Thank you!

Vector Space Model
  Document d and query q are represented in the vector space as two m-dimensional vectors. The weight of each dimension is TF·IDF (using the original tf and idf formulas), and the similarity of d and q is measured by the cosine of the angle between the two vectors:
    sim(d, q) = cos(d, q) = (d · q) / (|d| |q|)

Query and Answer
  1. porridge & pot (BOOL): d2
  2. "porridge pot" (BOOL): null
  3. porridge pot (VSM): d2 > d1 > d5
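The three steps of the inverted search algorithm and the three example queries can be tried out with a small Python sketch. It is only an illustration under stated assumptions: the slides do not reproduce the document collection, so the six-line "Pease porridge" rhyme is assumed here (one line per document) because it is consistent with the answers listed above. The function names are hypothetical, and the weighting tf · log(N/df) is one common reading of "the original tf and idf formulas", not necessarily the exact formula on the slide.

```python
import math
import re
from collections import defaultdict

# ASSUMPTION: the collection is not given in the slide text. This rhyme,
# one line per document, is chosen because it agrees with the listed answers.
DOCS = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Word-level inverted file: term -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Boolean AND: intersect the document ids of all query terms."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []


def phrase(index, terms):
    """Phrase query: the terms must occur at consecutive word positions."""
    hits = []
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.append(doc_id)
    return hits


def vsm_rank(index, n_docs, terms):
    """Rank documents by cosine similarity with weights tf * log(N / df)."""
    idf = {t: math.log(n_docs / len(p)) for t, p in index.items()}
    # Squared length of every document vector.
    doc_norm = defaultdict(float)
    for t, p in index.items():
        for doc_id, positions in p.items():
            doc_norm[doc_id] += (len(positions) * idf[t]) ** 2
    # Query term frequencies, ignoring terms missing from the lexicon.
    q_tf = defaultdict(int)
    for t in terms:
        if t in index:
            q_tf[t] += 1
    q_norm = math.sqrt(sum((tf * idf[t]) ** 2 for t, tf in q_tf.items()))
    # Accumulate dot products by walking the postings of each query term.
    dot = defaultdict(float)
    for t, tf in q_tf.items():
        for doc_id, positions in index[t].items():
            dot[doc_id] += (tf * idf[t]) * (len(positions) * idf[t])
    scores = {d: s / (q_norm * math.sqrt(doc_norm[d]))
              for d, s in dot.items() if s > 0}
    return sorted(scores, key=scores.get, reverse=True)
```

With the assumed collection, `boolean_and(index, ["porridge", "pot"])` returns `[2]`, `phrase(index, ["porridge", "pot"])` returns `[]`, and `vsm_rank(index, len(DOCS), ["porridge", "pot"])` returns `[2, 1, 5]`, where `index = build_index(DOCS)`. The phrase query finds nothing because document 2 reads "porridge in the pot", so the two words are never adjacent; this is the kind of query that needs the word positions stored in a word-level inverted file.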

CIIR - Center for Intelligent Information Retrieval @ UMASS
  One of the leading research groups in IR
    Improved the probabilistic models; gave the first description of a retrieval system based on statistical language models
    Introduced and improved a number of techniques for text and query representation
    Automatically representing databases and combining local searches for distributed IR (DIR)
    First high-capacity probabilistic filtering architecture
    Defined and evaluated the first versions of event detection and tracking software
    Earliest research on ranking and representation techniques for Asian languages
    First approaches to information extraction that emphasized learning
    Novel techniques for indexing images and video

CIIR cont.
  Research
    More than 500 journal and refereed conference papers over the past 12 years (52 submissions in 2003)
    Industrial and government collaboration
      INQUERY: licensed our software to nearly 300 sites
  Education
    20 Ph.D.s, 29 M.S.
    123/145, 34/4 graduate/undergraduate
