FindingNearDuplicates_第1页
FindingNearDuplicates_第2页
FindingNearDuplicates_第3页
FindingNearDuplicates_第4页
FindingNearDuplicates_第5页
已阅读5页,还剩5页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Finding Near Duplicates(Adapted from slides and material from Rajeev Motwani and Jeff Ullman)1Set SimilaritySet Similarity (Jaccard measure)View sets as columns of a matrix; one row for each element in the universe. aij = 1 indicates presence of item i in set j ExampleC1 C2 0 1 1 0 1 1 simJ(C1,C2) =

2、 2/5 = 0.4 0 0 1 1 0 12Identifying Similar Sets?Signature IdeaHash columns Ci to signature sig(Ci)simJ(Ci,Cj) approximated by simH(sig(Ci),sig(Cj)Nave ApproachSample P rows uniformly at randomDefine sig(Ci) as P bits of Ci in sampleProblemsparsity would miss interesting part of columnssample would g

3、et only 0s in columns3Key ObservationFor columns Ci, Cj, four types of rowsCiCjA 1 1B 1 0C 0 1D 0 0Overload notation: A = # of rows of type AClaim4Min HashingRandomly permute rowsHash h(Ci) = index of first row with 1 in column Ci Suprising PropertyWhy?Both are A/(A+B+C)Look down columns Ci, Cj unti

4、l first non-Type-D rowh(Ci) = h(Cj) type A row5Min-Hash SignaturesPick P random row permutations MinHash Signature sig(C) = list of P indexes of first rows with 1 in column CSimilarity of signatures Let simH(sig(Ci),sig(Cj) = fraction of permutations where MinHash values agree Observe EsimH(sig(Ci),

5、sig(Cj) = simJ(Ci,Cj) 6Example C1 C2 C3R1 1 0 1R2 0 1 1R3 1 0 0R4 1 0 1R5 0 1 0 Signatures S1 S2 S3Perm 1 = (12345) 1 2 1Perm 2 = (54321) 4 5 4Perm 3 = (34512) 3 5 4 Similarities 1-2 1-3 2-3Col-Col 0.00 0.50 0.25Sig-Sig 0.00 0.67 0.007Implementation TrickPermuting rows even once is prohibitiveRow Ha

6、shingPick P hash functions hk: 1,n1,O(n)Ordering under hk gives random row permutationOne-pass ImplementationFor each Ci and hk, keep “slot” for min-hash valueInitialize all slot(Ci,hk) to infinityScan rows in arbitrary order looking for 1sSuppose row Rj has 1 in column Ci For each hk, if hk(j) slot

7、(Ci,hk), then slot(Ci,hk) hk(j) 8ExampleC1 C2R11 0R2 0 1R3 1 1R4 1 0R5 0 1h(x) = x mod 5g(x) = 2x+1 mod 5h(1) = 11-g(1) = 33-h(2) = 212g(2) = 030h(3) = 312g(3) = 220h(4) = 412g(4) = 420h(5) = 010g(5) = 120C1 slots C2 slots 9Comparing SignaturesSignature Matrix SRows = Hash FunctionsColumns = ColumnsEntries = SignaturesCompute Pair-wise similarity of signature columnsProblemMinHash fits column signatures in memoryBut comparing signature

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论