




已阅读5页,还剩32页未读, 继续免费阅读
版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
CS345DataMining,LinkAnalysisAlgorithmsPageRank,AnandRajaraman,JeffreyD.Ullman,LinkAnalysisAlgorithms,PageRankHubsandAuthoritiesTopic-SpecificPageRankSpamDetectionAlgorithmsOtherinterestingtopicswewontcoverDetectingduplicatesandmirrorsMiningforcommunities,Rankingwebpages,Webpagesarenotequally“important”Ihas23,400inlinkswww.joe-has1inlinkAreallinlinksequal?Recursivequestion!,Simplerecursiveformulation,EachlinksvoteisproportionaltotheimportanceofitssourcepageIfpagePwithimportancexhasnoutlinks,eachlinkgetsx/nvotesPagePsownimportanceisthesumofthevotesonitsinlinks,Simple“flow”model,Thewebin1839,y,a,m,y/2,y/2,a/2,a/2,m,y=y/2+a/2a=y/2+mm=a/2,Solvingtheflowequations,3equations,3unknowns,noconstantsNouniquesolutionAllsolutionsequivalentmoduloscalefactorAdditionalconstraintforcesuniquenessy+a+m=1y=2/5,a=2/5,m=1/5Gaussianeliminationmethodworksforsmallexamples,butweneedabettermethodforlargegraphs,Matrixformulation,MatrixMhasonerowandonecolumnforeachwebpageSupposepagejhasnoutlinksIfj!i,thenMij=1/nElseMij=0MisacolumnstochasticmatrixColumnssumto1SupposerisavectorwithoneentryperwebpageriistheimportancescoreofpageiCallittherankvector|r|=1,Example,Supposepagejlinksto3pages,includingi,r,Eigenvectorformulation,Theflowequationscanbewrittenr=MrSotherankvectorisaneigenvectorofthestochasticwebmatrixInfact,itsfirstorprincipaleigenvector,withcorrespondingeigenvalue1,Example,y1/21/20a1/201m01/20,yam,y=y/2+a/2a=y/2+mm=a/2,PowerIterationmethod,Simpleiterativescheme(akarelaxation)SupposethereareNwebpagesInitialize:r0=1/N,.,1/NTIterate:rk+1=MrkStopwhen|rk+1-rk|1|x|1=1iN|xi|istheL1normCanuseanyothervectornorme.g.,Euclidean,PowerIterationExample,y1/21/20a1/201m01/20,yam,ya=m,1/31/31/3,1/31/21/6,5/121/31/4,3/811/241/6,2/52/51/5,.,RandomWalkInterpretation,ImaginearandomwebsurferAtanytimet,surferisonsomepagePAttimet+1,thesurferfollowsanoutlinkfromPuniformlyatrandomEndsuponsomepageQlinkedfromPProcessrepeatsindefinitelyLetp(t)beavectorwhoseithcomponentistheprobabilitythatthesurferisatpageiattimetp(t)isaprobabilitydistributiononpages,Thestationarydistribution,Whereisthesurferattimet+1?Followsalinkuniformlyatrandomp(t+1)=Mp(t)Supposetherandomwalkreachesastatesuchthatp(t+1)=Mp(t)=p(t)Thenp(t)iscalledastationarydistributionfortherandomwalkOurrankvectorrsatisfiesr=MrSoitisastationarydistributionfortherandomsurfer,ExistenceandUniqueness,Acentralresultfromthetheoryofrandomwalks(akaMarkovprocesses):Forgraphsthatsatisfycertainconditions,thestationarydistributionisuniqueandeventuallywillbereachednomatterwhattheinitialprobabilitydistributionattimet=0.,Spidertraps,AgroupofpagesisaspidertrapiftherearenolinksfromwithinthegrouptooutsidethegroupRandomsurfergetstrappedSpidertrapsviolatetheconditionsneededfortherandomwalktheorem,Microsoftbecomesaspidertrap,Yahoo,Msoft,Amazon,y1/21/20a1/200m01/21,yam,ya=m,111,11/23/2,3/41/27/4,5/83/82,003,.,Randomteleports,TheGooglesolutionforspidertrapsAteachtimestep,therandomsurferhastwooptions:Withprobability,followalinkatrandomWithprobability1-,jumptosomepageuniformlyatrandomCommonvaluesforareintherange0.8to0.9Surferwillteleportoutofspidertrapwithinafewtimesteps,Randomteleports(=0.8),Yahoo,Msoft,Amazon,1/2,1/2,0.8*1/2,0.8*1/2,0.2*1/3,0.2*1/3,0.2*1/3,1/21/201/20001/21,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/1513/15,0.8,+0.2,Randomteleports(=0.8),Yahoo,Msoft,Amazon,1/21/201/20001/21,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/1513/15,0.8,+0.2,ya=m,111,1.000.601.40,0.840.601.56,0.7760.5361.688,7/115/1121/11,.,Matrixformulation,SupposethereareNpagesConsiderapagej,withsetofoutlinksO(j)WehaveMij=1/|O(j)|whenj!iandMij=0otherwiseTherandomteleportisequivalenttoaddingateleportlinkfromjtoeveryotherpagewithprobability(1-)/Nreducingtheprobabilityoffollowingeachoutlinkfrom1/|O(j)|to/|O(j)|Equivalent:taxeachpageafraction(1-)ofitsscoreandredistributeevenly,PageRank,ConstructtheNNmatrixAasfollowsAij=Mij+(1-)/NVerifythatAisastochasticmatrixThepagerankvectorristheprincipaleigenvectorofthismatrixsatisfyingr=ArEquivalently,risthestationarydistributionoftherandomwalkwithteleports,Deadends,Pageswithnooutlinksare“deadends”fortherandomsurferNowheretogoonnextstep,Microsoftbecomesadeadend,Yahoo,Msoft,Amazon,ya=m,111,10.60.6,0.7870.5470.387,0.6480.4300.333,000,.,1/21/201/20001/20,1/31/31/31/31/31/31/31/31/3,y7/157/151/15a7/151/151/15m1/157/151/15,0.8,+0.2,Dealingwithdead-ends,TeleportFollowrandomteleportlinkswithprobability1.0fromdead-endsAdjustmatrixaccordinglyPruneandpropagatePreprocessthegraphtoeliminatedead-endsMightrequiremultiplepassesComputepagerankonreducedgraphApproximatevaluesfordeadendsbypropagatingvaluesfromreducedgraph,Computingpagerank,Keystepismatrix-vectormultiplicationrnew=AroldEasyifwehaveenoughmainmemorytoholdA,rold,rnewSayN=1billionpagesWeneed4bytesforeachentry(say)2billionentriesforvectors,approx8GBMatrixAhasN2entries1018isalargenumber!,Rearrangingtheequation,r=Ar,whereAij=Mij+(1-)/Nri=1jNAijrjri=1jNMij+(1-)/Nrj=1jNMijrj+(1-)/N1jNrj=1jNMijrj+(1-)/N,since|r|=1r=Mr+(1-)/NNwherexNisanN-vectorwithallentriesx,Sparsematrixformulation,Wecanrearrangethepagerankequation:r=Mr+(1-)/NN(1-)/NNisanN-vectorwithallentries(1-)/NMisasparsematrix!10linkspernode,approx10NentriesSoineachiteration,weneedto:Computernew=MroldAddaconstantvalue(1-)/Ntoeachentryinrnew,Sparsematrixencoding,EncodesparsematrixusingonlynonzeroentriesSpaceproportionalroughlytonumberoflinkssay10N,or4*10*1billion=40GBstillwontfitinmemory,butwillfitondisk,sourcenode,degree,destinationnodes,BasicAlgorithm,AssumewehaveenoughRAMtofitrnew,plussomeworkingmemoryStoreroldandmatrixMondiskBasicAlgorithm:Initialize:rold=1/NNIterate:Update:PerformasequentialscanofMandroldtoupdaternewWriteoutrnewtodiskasroldfornextiterationEveryfewiterations,compute|rnew-rold|andstopifitisbelowthresholdNeedtoreadinbothvectorsintomemory,Updatestep,src,degree,destination,0,1,2,3,4,5,6,0,1,2,3,4,5,6,rnew,rold,Initializeallentriesofrnewto(1-)/NForeachpagep(out-degreen):Readintomemory:p,n,dest1,destn,rold(p)forj=1.n:rnew(destj)+=*rold(p)/n,Analysis,Ineachiteration,wehaveto:ReadroldandMWriternewbacktodiskIOCost=2|r|+|M|Whatifwehadenoughmemorytofitbothrnewandrold?Whatifwecouldnotevenfitrnewinmemory
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- PCB精密定位材料相关行业投资方案
- 财经市场投资基础知识
- 泡丝剂相关项目投资计划书范本
- 诗歌中的情感表达与解读:高二语文教学案例
- 专业文件管理软件合作推广协议
- 一般类现代文阅读09 综合练习03课件
- 脾肿大切除护理查房
- 尽职调查委托协议
- 买卖意向合作协议书
- 公司独立董事聘任协议
- 护理查对制度-课件
- 设备清单-15年物联网智慧生活实训平台专业版
- 汉字偏旁部首表及例字
- 2021年中国远洋海运集团有限公司招聘笔试试题及答案解析
- 《大学物理学》课程教学大纲
- 励志班会你想成为什么样人
- ISOTS-9002:2022质量管理体系ISO9001:2022-应用指南
- 《带状疱疹治疗学》牛德兴教授专业研究治疗病毒性疱疹50年心血
- 20以内进位加法口算练习打印版
- 戴氏无线电遥控飞机教程
- 巴黎卢浮宫介绍PPT模板课件
评论
0/150
提交评论