已阅读5页,还剩32页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
CS345DataMining,LinkAnalysisAlgorithmsPageRank,AnandRajaraman,JeffreyD.Ullman,LinkAnalysisAlgorithms,PageRankHubsandAuthoritiesTopic-SpecificPageRankSpamDetectionAlgorithmsOtherinterestingtopicswewontcoverDetectingduplicatesandmirrorsMiningforcommunities,Rankingwebpages,Webpagesarenotequally“important”Ihas23,400inlinkswww.joe-has1inlinkAreallinlinksequal?Recursivequestion!,Simplerecursiveformulation,EachlinksvoteisproportionaltotheimportanceofitssourcepageIfpagePwithimportancexhasnoutlinks,eachlinkgetsx/nvotesPagePsownimportanceisthesumofthevotesonitsinlinks,Simple“flow”model,Thewebin1839,y,a,m,y/2,y/2,a/2,a/2,m,y=y/2+a/2a=y/2+mm=a/2,Solvingtheflowequations,3equations,3unknowns,noconstantsNouniquesolutionAllsolutionsequivalentmoduloscalefactorAdditionalconstraintforcesuniquenessy+a+m=1y=2/5,a=2/5,m=1/5Gaussianeliminationmethodworksforsmallexamples,butweneedabettermethodforlargegraphs,Matrixformulation,MatrixMhasonerowandonecolumnforeachwebpageSupposepagejhasnoutlinksIfj!i,thenMij=1/nElseMij=0MisacolumnstochasticmatrixColumnssumto1SupposerisavectorwithoneentryperwebpageriistheimportancescoreofpageiCallittherankvector|r|=1,Example,Supposepagejlinksto3pages,includingi,r,Eigenvectorformulation,Theflowequationscanbewrittenr=MrSotherankvectorisaneigenvectorofthestochasticwebmatrixInfact,itsfirstorprincipaleigenvector,withcorrespondingeigenvalue1,Example,y1/21/20a1/201m01/20,yam,y=y/2+a/2a=y/2+mm=a/2,PowerIterationmethod,Simpleiterativescheme(akarelaxation)SupposethereareNwebpagesInitialize:r0=1/N,.,1/NTIterate:rk+1=MrkStopwhen|rk+1-rk|1|x|1=1iN|xi|istheL1normCanuseanyothervectornorme.g.,Euclidean,PowerIterationExample,y1/21/20a1/201m01/20,yam,ya=m,1/31/31/3,1/31/21/6,5/121/31/4,3/811/241/6,2/52/51/5,.,RandomWalkInterpretation,ImaginearandomwebsurferAtanytimet,surferisonsomepagePAttimet+1,thesurferfollowsanoutlinkfromPuniformlyatrandomEndsuponsomepageQlinkedfromPProcessrepeatsindefinitelyLetp(t)beavectorwhoseithcomponentistheprobabilitythatthesurferisatpageiattimetp(t)isaprobabilitydistributiononpages,Thestationarydistribution,Whereisthesurferattimet+1?Followsalinkuniformlyatrandomp(t+1)=Mp(t)Supposetherandomwalkreachesastatesuchthatp(t+1)=Mp(t)=p(t)Thenp(t)iscalledastationarydistributionfortherandomwalkOurrankvectorrsatisfiesr=MrSoitisastationarydistributionfortherandomsurfer,ExistenceandUniqueness,Acentralresultfromthetheoryofrandomwalks(akaMarkovprocesses):Forgraphsthatsatisfycertainconditions,thestationarydistributionisuniqueandeventuallywillbereachednomatterwhattheinitialprobabilitydistributionattimet=0.,Spidertraps,AgroupofpagesisaspidertrapiftherearenolinksfromwithinthegrouptooutsidethegroupRandomsurfergetstrappedSpidertrapsviolatetheconditionsneededfortherandomwalktheorem,Microsoftbecomesaspidertrap,Yahoo,Msoft,Amazon,y1/21/20a1/200m01/21,yam,ya=m,111,11/23/2,3/41/27/4,5/83/82,003,.,Randomteleports,TheGooglesolutionforspidertrapsAteachtimestep,therandomsurferhastwooptions:Withprobability,followalinkatrandomWithprobability1-,jumptosomepageuniformlyatrandomCommonvaluesforareintherange0.8to0.9Surferwillteleportoutofspidertrapwithinafewtimesteps,Randomteleports(=0.8),Yahoo,Msoft,Amazon,1/2,1/2,0.8*1/2,0.8*1/2,0.2*1/3,0.2*1/3,0.2*1/3,1/21/201/20001/21,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/1513/15,0.8,+0.2,Randomteleports(=0.8),Yahoo,Msoft,Amazon,1/21/201/20001/21,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/1513/15,0.8,+0.2,ya=m,111,1.000.601.40,0.840.601.56,0.7760.5361.688,7/115/1121/11,.,Matrixformulation,SupposethereareNpagesConsiderapagej,withsetofoutlinksO(j)WehaveMij=1/|O(j)|whenj!iandMij=0otherwiseTherandomteleportisequivalenttoaddingateleportlinkfromjtoeveryotherpagewithprobability(1-)/Nreducingtheprobabilityoffollowingeachoutlinkfrom1/|O(j)|to/|O(j)|Equivalent:taxeachpageafraction(1-)ofitsscoreandredistributeevenly,PageRank,ConstructtheNNmatrixAasfollowsAij=Mij+(1-)/NVerifythatAisastochasticmatrixThepagerankvectorristheprincipaleigenvectorofthismatrixsatisfyingr=ArEquivalently,risthestationarydistributionoftherandomwalkwithteleports,Deadends,Pageswithnooutlinksare“deadends”fortherandomsurferNowheretogoonnextstep,Microsoftbecomesadeadend,Yahoo,Msoft,Amazon,ya=m,111,10.60.6,0.7870.5470.387,0.6480.4300.333,000,.,1/21/201/20001/20,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/151/15,0.8,+0.2,Dealingwithdead-ends,TeleportFollowrandomteleportlinkswithprobability1.0fromdead-endsAdjustmatrixaccordinglyPruneandpropagatePreprocessthegraphtoeliminatedead-endsMightrequiremultiplepassesComputepagerankonreducedgraphApproximatevaluesfordeadendsbypropagatingvaluesfromreducedgraph,Computingpagerank,Keystepismatrix-vectormultiplicationrnew=AroldEasyifwehaveenoughmainmemorytoholdA,rold,rnewSayN=1billionpagesWeneed4bytesforeachentry(say)2billionentriesforvectors,approx8GBMatrixAhasN2entries1018isalargenumber!,Rearrangingtheequation,r=Ar,whereAij=Mij+(1-)/Nri=1jNAijrjri=1jNMij+(1-)/Nrj=1jNMijrj+(1-)/N1jNrj=1jNMijrj+(1-)/N,since|r|=1r=Mr+(1-)/NNwherexNisanN-vectorwithallentriesx,Sparsematrixformulation,Wecanrearrangethepagerankequation:r=Mr+(1-)/NN(1-)/NNisanN-vectorwithallentries(1-)/NMisasparsematrix!10linkspernode,approx10NentriesSoineachiteration,weneedto:Computernew=MroldAddaconstantvalue(1-)/Ntoeachentryinrnew,Sparsematrixencoding,EncodesparsematrixusingonlynonzeroentriesSpaceproportionalroughlytonumberoflinkssay10N,or4*10*1billion=40GBstillwontfitinmemory,butwillfitondisk,sourcenode,degree,destinationnodes,BasicAlgorithm,AssumewehaveenoughRAMtofitrnew,plussomeworkingmemoryStoreroldandmatrixMondiskBasicAlgorithm:Initialize:rold=1/NNIterate:Update:PerformasequentialscanofMandroldtoupdaternewWriteoutrnewtodiskasroldfornextiterationEveryfewiterations,compute|rnew-rold|andstopifitisbelowthresholdNeedtoreadinbothvectorsintomemory,Updatestep,src,degree,destination,0,1,2,3,4,5,6,0,1,2,3,4,5,6,rnew,rold,Initializeallentriesofrnewto(1-)/NForeachpagep(out-degreen):Readintomemory:p,n,dest1,destn,rold(p)forj=1.n:rnew(destj)+=*rold(p)/n,Analysis,Ineachiteration,wehaveto:ReadroldandMWriternewbacktodiskIOCost=2|r|+|M|Whatifwehadenoughmemorytofitbothrnewandrold?Whatifwecouldnotevenfitrnewinmemory
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 餐饮电子版劳务合同
- 驳回民事裁定申请书
- 北京市政府劳动合同续签办法
- 肿瘤放射治疗体位固定技术
- 广东省仲元中学2024-2025学年九年级上学期期中考试化学试题(含答案)
- 调研活动心得体会
- 突发事件应急
- 双头应急灯相关行业投资方案范本
- 石油钻采设备相关项目投资计划书范本
- 电控多瓶采水器相关行业投资规划报告
- GB∕T 17268-2020 工业用非重复充装焊接钢瓶
- 苏教版二年级数学上册《认识线段》课件(市级赛课一等奖)
- 幼儿园:中班美术活动《柿柿如意》
- 输电线路初步设计评审要点课件
- (完整word版)小餐饮经营食品安全管理制度
- 产后尿潴留的护理个案课件
- 装配式混凝土结构部件吊装监理细则
- 交通事故伤残鉴定知识培训及案例课件
- 地铁站装饰施工组织设计(181页)
- 动火作业及动火工作票管理规定
- 变电站综合自动化电子教案
评论
0/150
提交评论