stanford大学-大数据挖掘-PageRank.ppt_第1页
stanford大学-大数据挖掘-PageRank.ppt_第2页
stanford大学-大数据挖掘-PageRank.ppt_第3页
stanford大学-大数据挖掘-PageRank.ppt_第4页
stanford大学-大数据挖掘-PageRank.ppt_第5页
已阅读5页,还剩32页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

CS345DataMining,LinkAnalysisAlgorithmsPageRank,AnandRajaraman,JeffreyD.Ullman,LinkAnalysisAlgorithms,PageRankHubsandAuthoritiesTopic-SpecificPageRankSpamDetectionAlgorithmsOtherinterestingtopicswewontcoverDetectingduplicatesandmirrorsMiningforcommunities,Rankingwebpages,Webpagesarenotequally“important”Ihas23,400inlinkswww.joe-has1inlinkAreallinlinksequal?Recursivequestion!,Simplerecursiveformulation,EachlinksvoteisproportionaltotheimportanceofitssourcepageIfpagePwithimportancexhasnoutlinks,eachlinkgetsx/nvotesPagePsownimportanceisthesumofthevotesonitsinlinks,Simple“flow”model,Thewebin1839,y,a,m,y/2,y/2,a/2,a/2,m,y=y/2+a/2a=y/2+mm=a/2,Solvingtheflowequations,3equations,3unknowns,noconstantsNouniquesolutionAllsolutionsequivalentmoduloscalefactorAdditionalconstraintforcesuniquenessy+a+m=1y=2/5,a=2/5,m=1/5Gaussianeliminationmethodworksforsmallexamples,butweneedabettermethodforlargegraphs,Matrixformulation,MatrixMhasonerowandonecolumnforeachwebpageSupposepagejhasnoutlinksIfj!i,thenMij=1/nElseMij=0MisacolumnstochasticmatrixColumnssumto1SupposerisavectorwithoneentryperwebpageriistheimportancescoreofpageiCallittherankvector|r|=1,Example,Supposepagejlinksto3pages,includingi,r,Eigenvectorformulation,Theflowequationscanbewrittenr=MrSotherankvectorisaneigenvectorofthestochasticwebmatrixInfact,itsfirstorprincipaleigenvector,withcorrespondingeigenvalue1,Example,y1/21/20a1/201m01/20,yam,y=y/2+a/2a=y/2+mm=a/2,PowerIterationmethod,Simpleiterativescheme(akarelaxation)SupposethereareNwebpagesInitialize:r0=1/N,.,1/NTIterate:rk+1=MrkStopwhen|rk+1-rk|1|x|1=1iN|xi|istheL1normCanuseanyothervectornorme.g.,Euclidean,PowerIterationExample,y1/21/20a1/201m01/20,yam,ya=m,1/31/31/3,1/31/21/6,5/121/31/4,3/811/241/6,2/52/51/5,.,RandomWalkInterpretation,ImaginearandomwebsurferAtanytimet,surferisonsomepagePAttimet+1,thesurferfollowsanoutlinkfromPuniformlyatrandomEndsuponsomepageQlinkedfromPProcessrepeatsindefinitelyLetp(t)beavectorwhoseithcomponentistheprobabilitythatthesurferisatpageiattimetp(t)isaprobabilitydistributiononpages,Thestationarydistribution,Whereisthesurferattimet+1?Followsalinkuniformlyatrandomp(t+1)=Mp(t)Supposetherandomwalkreachesastatesuchthatp(t+1)=Mp(t)=p(t)Thenp(t)iscalledastationarydistributionfortherandomwalkOurrankvectorrsatisfiesr=MrSoitisastationarydistributionfortherandomsurfer,ExistenceandUniqueness,Acentralresultfromthetheoryofrandomwalks(akaMarkovprocesses):Forgraphsthatsatisfycertainconditions,thestationarydistributionisuniqueandeventuallywillbereachednomatterwhattheinitialprobabilitydistributionattimet=0.,Spidertraps,AgroupofpagesisaspidertrapiftherearenolinksfromwithinthegrouptooutsidethegroupRandomsurfergetstrappedSpidertrapsviolatetheconditionsneededfortherandomwalktheorem,Microsoftbecomesaspidertrap,Yahoo,Msoft,Amazon,y1/21/20a1/200m01/21,yam,ya=m,111,11/23/2,3/41/27/4,5/83/82,003,.,Randomteleports,TheGooglesolutionforspidertrapsAteachtimestep,therandomsurferhastwooptions:Withprobability,followalinkatrandomWithprobability1-,jumptosomepageuniformlyatrandomCommonvaluesforareintherange0.8to0.9Surferwillteleportoutofspidertrapwithinafewtimesteps,Randomteleports(=0.8),Yahoo,Msoft,Amazon,1/2,1/2,0.8*1/2,0.8*1/2,0.2*1/3,0.2*1/3,0.2*1/3,1/21/201/20001/21,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/1513/15,0.8,+0.2,Randomteleports(=0.8),Yahoo,Msoft,Amazon,1/21/201/20001/21,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/1513/15,0.8,+0.2,ya=m,111,1.000.601.40,0.840.601.56,0.7760.5361.688,7/115/1121/11,.,Matrixformulation,SupposethereareNpagesConsiderapagej,withsetofoutlinksO(j)WehaveMij=1/|O(j)|whenj!iandMij=0otherwiseTherandomteleportisequivalenttoaddingateleportlinkfromjtoeveryotherpagewithprobability(1-)/Nreducingtheprobabilityoffollowingeachoutlinkfrom1/|O(j)|to/|O(j)|Equivalent:taxeachpageafraction(1-)ofitsscoreandredistributeevenly,PageRank,ConstructtheNNmatrixAasfollowsAij=Mij+(1-)/NVerifythatAisastochasticmatrixThepagerankvectorristheprincipaleigenvectorofthismatrixsatisfyingr=ArEquivalently,risthestationarydistributionoftherandomwalkwithteleports,Deadends,Pageswithnooutlinksare“deadends”fortherandomsurferNowheretogoonnextstep,Microsoftbecomesadeadend,Yahoo,Msoft,Amazon,ya=m,111,10.60.6,0.7870.5470.387,0.6480.4300.333,000,.,1/21/201/20001/20,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/151/15,0.8,+0.2,Dealingwithdead-ends,TeleportFollowrandomteleportlinkswithprobability1.0fromdead-endsAdjustmatrixaccordinglyPruneandpropagatePreprocessthegraphtoeliminatedead-endsMightrequiremultiplepassesComputepagerankonreducedgraphApproximatevaluesfordeadendsbypropagatingvaluesfromreducedgraph,Computingpagerank,Keystepismatrix-vectormultiplicationrnew=AroldEasyifwehaveenoughmainmemorytoholdA,rold,rnewSayN=1billionpagesWeneed4bytesforeachentry(say)2billionentriesforvectors,approx8GBMatrixAhasN2entries1018isalargenumber!,Rearrangingtheequation,r=Ar,whereAij=Mij+(1-)/Nri=1jNAijrjri=1jNMij+(1-)/Nrj=1jNMijrj+(1-)/N1jNrj=1jNMijrj+(1-)/N,since|r|=1r=Mr+(1-)/NNwherexNisanN-vectorwithallentriesx,Sparsematrixformulation,Wecanrearrangethepagerankequation:r=Mr+(1-)/NN(1-)/NNisanN-vectorwithallentries(1-)/NMisasparsematrix!10linkspernode,approx10NentriesSoineachiteration,weneedto:Computernew=MroldAddaconstantvalue(1-)/Ntoeachentryinrnew,Sparsematrixencoding,EncodesparsematrixusingonlynonzeroentriesSpaceproportionalroughlytonumberoflinkssay10N,or4*10*1billion=40GBstillwontfitinmemory,butwillfitondisk,sourcenode,degree,destinationnodes,BasicAlgorithm,AssumewehaveenoughRAMtofitrnew,plussomeworkingmemoryStoreroldandmatrixMondiskBasicAlgorithm:Initialize:rold=1/NNIterate:Update:PerformasequentialscanofMandroldtoupdaternewWriteoutrnewtodiskasroldfornextiterationEveryfewiterations,compute|rnew-rold|andstopifitisbelowthresholdNeedtoreadinbothvectorsintomemory,Updatestep,src,degree,destination,0,1,2,3,4,5,6,0,1,2,3,4,5,6,rnew,rold,Initializeallentriesofrnewto(1-)/NForeachpagep(out-degreen):Readintomemory:p,n,dest1,destn,rold(p)forj=1.n:rnew(destj)+=*rold(p)/n,Analysis,Ineachiteration,wehaveto:ReadroldandMWriternewbacktodiskIOCost=2|r|+|M|Whatifwehadenoughmemorytofitbothrnewandrold?Whatifwecouldnotevenfitrnewinmemory

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论