




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Google搜索与
Inter网的信息检索
马志明
May16,2008Email:mazm@/member/mazhiming/index.html约有626,000项符合中国科学院数学与系统科学研究院的查询结果,以下是第1-100项。
(搜索用时0.45
秒)Howcangooglemakearankingof626,000pagesin0.45seconds?Amaintaskof
Internet(Web)
InformationRetrieval
=DesignandAnalysisof
SearchEngine(SE)Algorithm
involvingplentyofMathematicsHITS
PageRank1998JonKleinbergCornellUniversity
SergeyBrinandLarryPageStanfordUniversityNevanlinnaPrize(2006)
JonKleinberg
OneofKleinberg‘smostimportantresearchachievementsfocusesontheinternetworkstructureoftheWorldWideWeb.Priorto
Kleinberg‘swork,searchenginesfocusedonlyonthecontentofwebpages,notonthelinkstructure.Kleinbergintroducedtheideaof“authorities”and“hubs”:Anauthorityisawebpagethatcontains
informationonaparticulartopic,andahubisapagethatcontainslinksto
manyauthorities.Zhuzihuthesis.pdfPage
Rank,therankingsystem
usedbytheGooglesearch
engine.
Queryindependentcontentindependent.usingonlythewebgraphstructurePage
Rank,therankingsystemusedbytheGooglesearchengine.
PageRankasaFunctionoftheDampingFactorPaoloBoldiMassimoSantiniSebastianoVignaDSI,UniversitàdegliStudidiMilanoWWW2005paper3.1Choosingthedampingfactor3GeneralBehaviour3.2Gettingcloseto1
canwesomehowcharacterisethepropertiesof?whatmakes
differentfromtheother(infinitelymany,ifPisreducible)limitdistributionsofP?
isthelimitdistributionofPwhenthestartingdistributionisuniform,thatis,Conjecture1
:
Website
provideplentyofinformation:
pagesinthesamewebsitemaysharethesameIP,runonthesamewebserveranddatabaseserver,andbeauthored/maintainedbythesamepersonororganization.
theremightbehighcorrelationsbetweenpagesinthesamewebsite,intermsofcontent,pagelayoutandhyperlinks.
websitescontainhigherdensityofhyperlinksinsidethem(about75%)andlowerdensityofedgesinbetween.HostGraphlosesmuchtransitioninformation
Canasurferjumpfrompage5ofsite1toapageinsite2?From:s06-pc-chairs-email@[mailto:s06-pc-chairs-Sent:2006年4月4日8:36
To:Tie-YanLiu;wangying@;fengg03@;ybao@;mazm@
Subject:[SIGIR2006]YourPaper#191
Title:AggregateRank:BringOrdertoWebSites
Congratulations!!29thAnnual
International
Conferenceon
Research&DevelopmentonInformationRetrieval(SIGIR’06,August6–11,2006,Seattle,Washington,USA).RankingWebsites,
aProbabilisticView
YingBao,GangFeng,Tie-YanLiu,Zhi-MingMa,andYingWang
InternetMathematics,
Volume3(2007),Issue3-WesuggestevaluatingtheimportanceofawebsitewiththemeanfrequencyofvisitingthewebsitefortheMarkovchainontheInternetGraphdescribingrandomsurfing.
WeshowthatthismeanfrequencyisequaltothesumofthePageRanksofallthewebpagesinthatwebsite(henceisreferredasPageRankSum)
Weproposeanovelalgorithm(AggregateRankAlgorithm)basedonthetheoryofstochasticcomplement
tocalculatetherankofawebsite.TheAggregateRankAlgorithmcanapproximatethePageRankSumaccurately,whilethecorrespondingcomputationalcomplexityismuchlowerthanPageRankSum
Byconstructingreturn-timeMarkovchainsrestrictedtoeachwebsite,wedescribealsotheprobabilisticrelationbetweenPageRankandAggregateRank.
ThecomplexityandtheerrorboundofAggregateRankAlgorithmwithexperimentsofrealdadaarediscussedattheendofthepaper.nwebsinNsites,
Thestationarydistribution,knownasthePageRankvector,isgivenbyWemayrewritethestationarydistributionaswithasarowvectoroflength
Wedefinetheone-steptransitionprobabilityfromthewebsite
tothewebsite
bywhereeisandimensionalcolumnvectorofallones
TheN×NmatrixC(α)=(cij(α))isreferredtoasthecouplingmatrix,whoseelementsrepresentthetransitionprobabilitiesbetweenwebsites.ItcanbeprovedthatC(α)isanirreduciblestochasticmatrix,sothatitpossessesauniquestationaryprobabilityvector.Weuseξ(α)todenotethisstationaryprobability,whichcanbegottenfrom
SinceOnecaneasilycheckthatistheuniquesolutionto
WeshallreferastheAggregateRankThatis,theprobabilityofvisitingawebsiteisequaltothesumofPageRanksofallthepagesinthatwebsite.Thisconclusionisconsistenttoourintuition.thetransitionprobabilityfromSitoSjactuallysummarizesallthecasesthattherandomsurferjumpsfromanypageinSitoanypageinSjwithinone-steptransition.Therefore,thetransitioninthisnewHostGraphisinaccordancewiththerealbehavioroftheWebsurfers.Inthisregard,theso-calculatedrankfromthecouplingmatrixC(α)willbemorereasonablethanthosepreviousworks.Let
denotethenumberofvisitingthewebsite
duringthentimes,thatisWehaveAssumeastartingstateinwebsiteA,i.e.Itisclearthatallthevariables
arestoppingtimesforX.WedefineandinductivelyLet
denotethetransitionmatrixofthereturn-timeMarkovchainforsiteSimilarly,wehaveSinceThereforeSupposethatAggregateRank,i.e.thestationarydistributionofisBasedontheabovediscussions,thedirectapproachofcomputingtheAggregateRankξ(α)istoaccumulatePageRankvalues(denotedbyPageRankSum).However,thisapproachisunfeasiblebecausethecomputationofPageRankisnotatrivialtaskwhenthenumberofwebpagesisaslargeasseveralbillions.Therefore,Efficientcomputationbecomesasignificantproblem.1.Dividethen×nmatrix
intoN×NblocksaccordingtotheNsites.AggregateRank
Constructthestochasticmatrixforbychangingthediagonalelementsoftomakeeachrawsumupto1.3.Determinefrom4.Formanapproximation
tothecouplingmatrix
,byevaluating5.Determinethestationarydistributionof
anddenoteit
,i.e.,Experiments
Inourexperiments,thedatacorpusisthebenchmarkdatafortheWebtrackofTREC2003and2004,domainintheyearof2002.Itcontains1,247,753dataset.Thelargestwebsitecontains137,103webpageswhilethesmallestonecontainsonly1page.PerformanceEvaluationofRankingAlgorithmsbasedonKendall'sdistanceSimilaritybetweenPageRankSumandotherthreerankingresults.From:pcchairs@
Sent:Thursday,April03,20089:48AM
DearYutingLiu,BinGao,Tie-YanLiu,YingZhang,ZhimingMa,ShuyuanHe,HangLi
Wearepleasedtoinformyouthatyourpaper
Title:BrowseRank:LettingWebUsersVoteforPageImportance
hasbeenacceptedfororalpresentationasafullpaperandforpublicationasaneightpaperintheproceedingsofthe31stAnnualInternationalACMSIGIR
ConferenceonResearch&DevelopmentonInformationRetrieval.
Congratulations!!BuildingmodelPropertiesofQprocess:Stationarydistribution:
Jumpingprobability:
EmbeddedMarkovchain:isaMarkovchainwiththetransitionprobabilitymatrixMainconclusion1
isthemeanofthestayingtimeonpagei.
Themoreimportantapageis,thelongerstayingtimeonitis.isthemeanofthefirstre-visittimeatpagei.Themoreimportantapageis,thesmallerthere-visittimeis,andthelargerthevisitfrequencyis.Mainconclusion2
isthestationarydistributionofThestationarydistributionofdiscretemodeliseasytocomputePowermethodforLogdataforFurtherquestionsHowaboutinhomogenousprocess?Statisticresultshow:differentperiodoftimepossessesdifferentvisitingfrequency.Poissonprocesseswithdifferentintensity.MarkedpointprocessHyperlinkisnotreliable.Users’realbehaviorshouldbeconsidered.RelevanceRankingManyfeaturesformeasuringrelevanceTermdistribution(anchor,URL,title,body,proximity,….)Recommendation&citation(PageRank,click-throughdata,…)StatisticsorknowledgeextractedfromwebdataQuestionsWhatistheoptimalrankingfunctiontocombinedifferentfeatures(orevidences)?Howtomeasurerelevance?LearningtoRankWhatistheoptimalweightingsforcombiningthevariousfeaturesUsemachinelearningmethodstolearntherankingfunctionHumanrelevancesystem(HRS)Relevanceverificationtests(RVT)Wei-YingMa,MicrosoftResearchAsiaLearningtoRankModelLearningSystemRankingSystemminLoss66Wei-YingMa,MicrosoftResearchAsiaLearningtoRank(Cont)
State-of-the-artalgorithmsforlearningtoranktakethepairwiseapproachRankingSVMRankBoostRankNet(employedatLiveSearch)67BreakdownWei-YingMa,MicrosoftResearchAsialearningtorankThegoaloflearningtorankistoconstructareal-valuedfunctionthatcangeneratearankingonthedocumentsassociatedwiththegivenquery.Thestate-of-the-artmethodstransformsthelearningproblemintothatofclassificationandthenperformsthelearningtask:Foreachquery,itisassumedthattherearetwocategoriesofdocuments:positiveandnegative(representingrelevantandirreverentwithrespecttothequery).Thendocumentpairsareconstructedbetweenpositivedocumentsandnegativedocuments.Inthetrainingprocess,thequeryinformationisactuallyignored.[5]Y.Cao,J.Xu,T.-Y.Liu,H.Li,Y.Huang,andH.-W.Hon.Adaptingrankingsvmtodocumentretrieval.InProc.ofSIGIR’06,pages186–193,2006.[11]T.Qin,T.-Y.Liu,M.-F.Tsai,X.-D.Zhang,andH.Li.Learningtosearchwebpageswithquery-levellossfunctions.TechnicalReportMSR-TR-2006-156,2006.Ascasestudies,weinvestigateRankingSVMandRankBoost.Weshowthatafterintroducing
query-levelnormalization
toitsobjectivefunction,RankingSVMwillhavequery-levelstability.ForRankBoost,thequery-levelstabilitycanbeachievedifweintroduceboth
query-levelnormalizationandregularization
toitsobjectivefunction.Were-representthelearningtorankproblembyintroducingtheconceptof‘query’and‘distributiongivenquery’intoitsmathematicalformulation.Moreprecisely,weassumethatqueriesaredrawnindependentlyfromaqueryspaceQaccordingtoan(unknown)probabilitydistributionItshouldbenotedthatif,thentheboundmakessense.Thisconditioncanbesatisfiedinmanypracticalcases.Ascasestudies,weinvestigateRankingSVMandRankBoost.Weshowthatafterintroducingquery-levelnormalizationtoitsobjectivefunction,RankingSVMwillhavequery-levelstability.ForRankBoost,thequery-levelstabilitycanbeachievedifweintroducebothquery-levelnormalizationandregularizationtoitsobjectivefunction.Theseanalysesagreelargelywithourexperimentsandtheexperimentsin[5]and[11].RankaggregationRankaggregationistocombinerankingresultsofentitiesfrommultiplerankingfunctionsinordertogenerateabetterone.Theindividualrankingfunctionsarereferredtoasbaserankers,orsimplyrankers.Score-basedaggregationRankaggregationcanbeclassifiedintotwocategories[2].Inthefirstcategory,theentitiesinindividualrankinglistsareassignedscoresandtherankaggregationfunctionisassumedtousethescores(denotedasscore-basedaggregation)[11][18][28].order-basedaggregation
Inthesecondcategory,onlytheordersoftheentitiesinindividualrankinglistsa
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025-2030中国芝士夹心饼干行业市场深度分析及发展趋势与投资研究报告
- 2025-2030中国自动化机器人行业市场深度调研及竞争格局与投资策略研究报告
- 2025-2030中国肿瘤生物标志物行业市场发展趋势与前景展望战略研究报告
- 2025-2030中国美司钠注射液(美安)行业市场发展趋势与前景展望战略研究报告
- 2025-2030中国组合前灯行业市场深度调研及前景趋势与投资研究报告
- 2025-2030中国红枣汁行业市场深度调研及发展趋势和投资前景预测研究报告
- 2025-2030中国移动终端设备行业发展分析及发展趋势与投资前景预测研究报告
- 2025-2030中国科技旅游行业市场深度调研及前景趋势与投资研究报告
- 2025-2030中国硅锰行业市场发展趋势与前景展望战略研究报告
- 2025-2030中国矿泥面膜行业市场深度调研及前景趋势与投资研究报告
- 新时代青年与中华传统文化的现代表达:青春、创新与传承
- 科技领域实验室质量控制关键技术与方法
- 国土业务知识培训课件
- 《糖尿病与肥胖》课件
- 高考语文专题复习【高效课堂精研】小说的叙述艺术
- 2024年05月湖南湖南湘江新区农商行社会招考15人笔试历年参考题库附带答案详解
- 服装设计与工艺基础知识单选题100道及答案
- AI人工智能应用开发合同
- 护理MDT多学科联合查房
- 《人工智能发展史》课件
- 易制毒化学品采购员岗位职责
评论
0/150
提交评论