
Pattern Recognition: Clustering Analysis
Department of Intelligent Science and Technology, School of Automation, CAO ZHI

Chapter outline: basic concepts of proximity and clustering; clustering criteria; sequential clustering; k-means, k-medoids and fuzzy k-means clustering; hierarchical clustering algorithms; spectral clustering.

Classification and clustering. Hard clustering: each point belongs to a single cluster.

The clustering task. Let X = {x1, x2, ..., xN}. An m-clustering R of X is defined as the partition of X into m sets (clusters) C1, C2, ..., Cm, so that:
- Ci ≠ ∅, i = 1, 2, ..., m
- C1 ∪ C2 ∪ ... ∪ Cm = X
- Ci ∩ Cj = ∅, i ≠ j, i, j = 1, 2, ..., m

Proximity measure: quantifies how similar or dissimilar two feature vectors are.

The clustering procedure involves:
- Proximity measure: quantifies similarity or dissimilarity between vectors.
- Clustering criterion: a cost function or some type of rules.
- Clustering algorithm: the set of steps followed to reveal the structure, based on the similarity measure and the adopted criterion.
- Validation of the results.
- Interpretation of the results.

Why proximity measures and clustering criteria matter: the same animals can be grouped in different ways depending on the criterion. Grouped by how they bear progeny (mammals vs. non-mammals): {blue shark, sheep, cat, dog} vs. {seagull, goldfish, frog, red mullet}. Grouped by living environment: {goldfish, red mullet, blue shark} vs. {sheep, sparrow, dog, cat, seagull, lizard, frog, viper}. Why validation of the results: does the data contain 2 clusters or 4? Why clustering algorithms: the number of possible clusterings is enormous. Question: in how many ways can the N points be assigned into m groups?

The number of possible clusterings of N points into m groups is given by the Stirling numbers of the second kind:

  S(N, m) = (1/m!) Σ_{i=0}^{m} (−1)^i C(m, i) (m − i)^N

For example, S(15, 3) = 2,375,101 and S(100, 5) ≈ 10^68. A way out: consider only a small fraction of the clusterings of X and select a "sensible" clustering among them. Question 1: which fraction of clusterings is considered? Question 2: what does "sensible" mean? The answer depends on the specific clustering algorithm and the specific criteria to be adopted.

PROXIMITY MEASURES BETWEEN VECTORS. A dissimilarity measure on X is a function d: X × X → R with the following properties: there exists d0 such that d0 ≤ d(x, y) for all x, y; d(x, x) = d0 for all x; d(x, y) = d(y, x) for all x, y. If in addition d(x, y) = d0 if and only if x = y, and d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z (triangular inequality), then d is called a metric dissimilarity measure. A similarity measure is a function s: X × X → R with: there exists s0 such that s(x, y) ≤ s0 for all x, y; s(x, x) = s0 for all x; s(x, y) = s(y, x) for all x, y. If in addition s(x, y) = s0 if and only if x = y, and s(x, y)s(y, z) ≤ [s(x, y) + s(y, z)]s(x, z) for all x, y, z, then s is called a metric similarity measure.

Weighted lp metric:

  dp(x, y) = ( Σ_{i=1}^{l} wi |xi − yi|^p )^{1/p}

Interesting instances are obtained for: p = 1 (weighted Manhattan norm); p = 2 (weighted Euclidean norm); p = ∞ (d∞(x, y) = max_{1≤i≤l} wi |xi − yi|). Other measures exist as well.
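The count S(N, m) can be checked numerically via the standard recurrence S(n, m) = m·S(n−1, m) + S(n−1, m−1); a small sketch (function name is mine):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, m):
    """S(n, m): number of ways to partition n points into m non-empty clusters."""
    if n == 0 and m == 0:
        return 1
    if n == 0 or m == 0 or m > n:
        return 0
    # The n-th point either joins one of m existing groups or opens a new one.
    return m * stirling2(n - 1, m) + stirling2(n - 1, m - 1)

print(stirling2(15, 3))              # 2375101
print(len(str(stirling2(100, 5))))   # 68 digits, i.e. S(100, 5) is on the order of 10^68
```

Even for modest N and m the count explodes, which is why exhaustive search over clusterings is hopeless.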

Mahalanobis distance. For X = {x1, x2, ..., xN} and xi, xj ∈ X,

  d(xi, xj) = ( (xi − xj)^T V^{−1} (xi − xj) )^{1/2}

where V is the covariance matrix of the data, V = (1/N) Σ_{i=1}^{N} (xi − x̄)(xi − x̄)^T, with x̄ = (1/N) Σ_{i=1}^{N} xi the sample mean.

Similarity measures between vectors. The inner product: s(x, y) = x^T y. Pearson's correlation coefficient:

  r(x, y) = (x − x̄)^T (y − ȳ) / [ ((x − x̄)^T (x − x̄)) ((y − ȳ)^T (y − ȳ)) ]^{1/2}

The Tanimoto measure:

  sT(x, y) = x^T y / ( ||x||² + ||y||² − x^T y )

which can also be written as sT(x, y) = 1 / (1 + d²(x, y)/x^T y) with d(x, y) = ||x − y||. Proximity measures also exist for discrete-valued and for mixed-valued feature vectors.

PROXIMITY FUNCTIONS BETWEEN A VECTOR AND A SET. Let X = {x1, x2, ..., xN} and C ⊂ X. If all points of C contribute to the definition of ρ(x, C):
- Max proximity: ρ_max(x, C) = max_{y∈C} ρ(x, y)
- Min proximity: ρ_min(x, C) = min_{y∈C} ρ(x, y)
- Average proximity: ρ_avg(x, C) = (1/nC) Σ_{y∈C} ρ(x, y)
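The vector-level measures above can be written out directly; a plain-Python sketch (function names are mine), with the weights wi defaulting to 1:

```python
def weighted_lp(x, y, w=None, p=2):
    """Weighted l_p dissimilarity; p=1 gives Manhattan, p=2 Euclidean.
    (The p = infinity case, max of weighted coordinate gaps, is not handled here.)"""
    w = w or [1.0] * len(x)
    return sum(wi * abs(a - b) ** p for wi, a, b in zip(w, x, y)) ** (1.0 / p)

def tanimoto(x, y):
    """Tanimoto similarity: x.y / (||x||^2 + ||y||^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

print(weighted_lp([0, 0], [3, 4]))       # 5.0 (Euclidean)
print(weighted_lp([0, 0], [3, 4], p=1))  # 7.0 (Manhattan)
print(tanimoto([1, 1], [1, 1]))          # 1.0 (identical vectors)
```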

(Here nC denotes the cardinality of C.) Alternatively, a representative rC of C contributes to the definition: ρ(x, C) = ρ(x, rC). Typical representatives are:
- The mean vector: mp = (1/nC) Σ_{y∈C} y, where nC is the cardinality of C.
- The mean center mC ∈ C (d a dissimilarity measure): Σ_{y∈C} d(mC, y) ≤ Σ_{y∈C} d(z, y) for every z ∈ C.
- The median center m_med ∈ C: med(d(m_med, y) | y ∈ C) ≤ med(d(z, y) | y ∈ C) for every z ∈ C.

NOTE: other representatives (e.g., hyperplanes, hyperspheres) are useful in certain applications (e.g., object identification using clustering techniques).

PROXIMITY FUNCTIONS BETWEEN SETS. Let X = {x1, ..., xN}, Di, Dj ⊂ X, and ni = |Di|, nj = |Dj|. If all points of each set contribute:
- Max proximity function (a measure but not a metric; only when ρ is a similarity measure): ρ_max(Di, Dj) = max_{x∈Di, y∈Dj} ρ(x, y)
- Min proximity function (a measure but not a metric; only when ρ is a dissimilarity measure): ρ_min(Di, Dj) = min_{x∈Di, y∈Dj} ρ(x, y)
- Average proximity function (not a measure, even when ρ is a measure): ρ_avg(Di, Dj) = (1/(ni·nj)) Σ_{x∈Di} Σ_{y∈Dj} ρ(x, y)

If each set Di is represented by its representative vector mi:
- Mean proximity function (a measure, provided ρ is a measure): ρ_mean(Di, Dj) = ρ(mi, mj)
- A weighted variant is also used: ρ'(Di, Dj) = sqrt( ni·nj / (ni + nj) ) · ρ(mi, mj)
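The set-level proximity functions can be sketched in plain Python, with the Euclidean distance as the underlying measure ρ (function names are mine):

```python
import math
from itertools import product

def d_min(Di, Dj):
    """Min proximity between sets: distance of the closest cross pair."""
    return min(math.dist(x, y) for x, y in product(Di, Dj))

def d_max(Di, Dj):
    """Max proximity between sets: distance of the farthest cross pair."""
    return max(math.dist(x, y) for x, y in product(Di, Dj))

def d_avg(Di, Dj):
    """Average proximity: mean distance over all ni*nj cross pairs."""
    return sum(math.dist(x, y) for x, y in product(Di, Dj)) / (len(Di) * len(Dj))

def d_mean(Di, Dj):
    """Mean proximity: distance between the set means (representatives)."""
    mi = [sum(c) / len(Di) for c in zip(*Di)]
    mj = [sum(c) / len(Dj) for c in zip(*Dj)]
    return math.dist(mi, mj)

A = [(0.0, 0.0), (0.0, 2.0)]
B = [(4.0, 0.0), (4.0, 2.0)]
print(d_min(A, B), d_max(A, B), d_avg(A, B), d_mean(A, B))
```

On these two vertical pairs, d_min and d_mean both give 4.0 while d_max gives sqrt(20), illustrating how the choice of set proximity changes the numbers the clustering algorithm sees.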

NOTE: proximity functions between a vector x and a set C may be derived from the above functions by setting Di = {x}. Different choices of the proximity function between sets may lead to totally different clustering results, and so may different proximity measures between vectors used within the same proximity function between sets. The only way to achieve a proper clustering is by trial and error, taking into account the opinion of an expert in the field of application.

Clustering criteria: the intra-class (within-class) distance should be small, and the inter-class (between-class) distance should be large. Suppose the patterns x1, x2, ..., xN have been partitioned into c classes on the basis of some similarity measure. The within-class distance criterion JW is defined as

  JW = Σ_{j=1}^{c} Σ_{i=1}^{nj} || xi^(j) − mj ||²

where xi^(j) denotes the i-th sample of class j, nj the number of samples in class j, and mj the mean vector of class j. Equivalently, JW can be expressed through the average squared pairwise distance within each class,

  d̄j² = (1 / (nj(nj − 1))) Σ_{i} Σ_{k≠i} d²(xi^(j), xk^(j))

The between-class distance criterion JB is defined as

  JB = Σ_{j=1}^{c} (nj / N) (mj − m)^T (mj − m)

where mj = (1/nj) Σ_{i=1}^{nj} xi^(j) is the mean vector of class j and m = (1/N) Σ_{i=1}^{N} xi is the overall mean.

Scatter matrices. Define

  ST = (1/N) Σ_{i=1}^{N} (xi − m)(xi − m)^T          (total scatter)
  SW = (1/N) Σ_{j=1}^{c} Σ_{i=1}^{nj} (xi^(j) − mj)(xi^(j) − mj)^T   (within-class scatter)
  SB = Σ_{j=1}^{c} (nj / N) (mj − m)(mj − m)^T        (between-class scatter)

Then ST = SW + SB. Proof sketch: write xi − m = (xi − mj) + (mj − m) and expand the outer product; the cross terms vanish when summed within each class, since Σ_{i=1}^{nj} (xi^(j) − mj) = 0. A good clustering therefore makes SW "small" and SB "large"; one criterion is to minimize Tr[SW].

Another criterion is to maximize Tr[SB]. Using results from linear algebra, invariant criteria can also be built from the scatter matrices, for example maximizing Tr[SW^{−1} SB] or Tr[SW^{−1} ST] (and their determinant counterparts).

SEQUENTIAL CLUSTERING. The common traits shared by these algorithms:
- One or very few passes on the data are required.
- The number of clusters is not known a priori, except (possibly) an upper bound q.
- The clusters are defined with the aid of:
  - an appropriately defined distance d(x, C) of a point from a cluster, and
  - a threshold Θ associated with the distance.

Basic Sequential Algorithmic Scheme (BSAS):

  m = 1   { number of clusters }
  Cm = {x1}
  For i = 2 to N
    Find Ck: d(xi, Ck) = min_{1≤j≤m} d(xi, Cj)
    If (d(xi, Ck) > Θ) AND (m < q) then
      m = m + 1; Cm = {xi}
    Else
      Ck = Ck ∪ {xi}
      Where necessary, update the representative (*)
    End if
  End for

(*) When the mean vector mC is used as the representative of cluster C with nC elements, the update in the light of a new vector x becomes mC,new = (nC·mC + x) / (nC + 1).
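A minimal BSAS sketch in plain Python, using the running mean as each cluster's representative (function and variable names are mine):

```python
import math

def bsas(data, theta, q):
    """Basic Sequential Algorithmic Scheme.

    data: sequence of points; theta: dissimilarity threshold;
    q: upper bound on the number of clusters. Returns (means, labels)."""
    means = [list(data[0])]   # cluster representatives (running means)
    counts = [1]
    labels = [0]
    for x in data[1:]:
        d, k = min((math.dist(x, m), j) for j, m in enumerate(means))
        if d > theta and len(means) < q:
            means.append(list(x))
            counts.append(1)
            labels.append(len(means) - 1)
        else:
            # running-mean update: m_new = (n*m + x) / (n + 1)
            n = counts[k]
            means[k] = [(n * mi + xi) / (n + 1) for mi, xi in zip(means[k], x)]
            counts[k] += 1
            labels.append(k)
    return means, labels

pts = [(0, 0), (0.5, 0), (10, 10), (10.5, 10)]
means, labels = bsas(pts, theta=3.0, q=5)
print(labels)   # [0, 0, 1, 1]
```

Shuffling `pts` before the call illustrates the order-dependence discussed next.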

[Figure: BSAS run on the same points with different presentation orders and thresholds (e.g., Θ = 1 and Θ = 2), producing different numbers of clusters.]

Remarks:
- The order of presentation of the data plays an important role in the clustering results: a different order may lead to totally different results, in terms of the number of clusters as well as the clusters themselves.
- In BSAS the decision for a vector x is reached prior to the final cluster formation.
- BSAS performs a single pass on the data; its complexity is O(N).
- If clusters are represented by point representatives, compact clusters are favored.

Estimating the number of clusters in the data. Let BSAS(Θ) denote the BSAS algorithm with dissimilarity threshold Θ.
- For Θ = a to b step c:
  - run BSAS(Θ) s times, each time presenting the data in a different order;
  - estimate the number of clusters mΘ as the most frequent number resulting from the s runs of BSAS(Θ).
- Plot mΘ versus Θ and identify the number of clusters m as the one corresponding to the widest flat region in the graph.

[Figure: plot of the estimated number of clusters mΘ against the threshold Θ.]

MBSAS, a modification of BSAS, consists of two phases.

Phase 1, cluster determination (first pass on the data). The same as BSAS, except that no vector is assigned to an already formed cluster; at the end of this phase each cluster consists of a single element:

  m = 1; Cm = {x1}
  For i = 2 to N
    Find Ck: d(xi, Ck) = min_{1≤j≤m} d(xi, Cj)
    If (d(xi, Ck) > Θ) AND (m < q) then m = m + 1; Cm = {xi}
  End for

Phase 2, pattern classification (second pass on the data). Each unassigned vector is assigned to its closest cluster:

  For i = 1 to N
    If xi is not yet classified to any cluster then
      Find Ck: d(xi, Ck) = min_{1≤j≤m} d(xi, Cj)
      Ck = Ck ∪ {xi}
      Where necessary, update the representative
    End if
  End for

3. k-means clustering

Suppose N samples are to be partitioned into K clusters. The k-means algorithm minimizes the criterion

  J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk || xn − mk ||²

where r_nk = 1 when sample xn is assigned to cluster k and r_nk = 0 otherwise, and mk is the center of cluster k. With the assignments r_nk held fixed, taking the derivative of J with respect to mk and setting it to zero gives the minimizing centers

  mk = Σ_n r_nk xn / Σ_n r_nk

that is, each center is the mean of the samples currently assigned to it.

k-means algorithm:
(1) Choose k initial cluster centers Z1(1), Z2(1), ..., Zk(1); t = 1.
(2) Assign each vector of the pattern set {Xi} to the class with the nearest center: Xi goes to class j if d(Xi, Zj(t)) = min_{l=1,...,k} d(Xi, Zl(t)), i = 1, 2, ..., N.
(3) Update each center as the mean of the vectors assigned to it: Zj(t+1) = (1/nj(t)) Σ_{X∈class j} X.
(4) If Zj(t+1) = Zj(t) for all j, stop; otherwise set t = t + 1 and go to (2).

Example. Twenty samples with two features:

  x1=(0,0), x2=(1,0), x3=(0,1), x4=(1,1), x5=(2,1), x6=(1,2), x7=(2,2), x8=(3,2),
  x9=(6,6), x10=(7,6), x11=(8,6), x12=(6,7), x13=(7,7), x14=(8,7), x15=(9,7),
  x16=(7,8), x17=(8,8), x18=(9,8), x19=(8,9), x20=(9,9)

Step 1: let k = 2 and choose the initial cluster centers Z1(1) = x1 = (0, 0)^T, Z2(1) = x2 = (1, 0)^T.

Step 2 (first pass): compute the distance of every sample to the two centers and assign it to the nearer one.
  ||x1 − Z1(1)|| = 0, ||x1 − Z2(1)|| = 1, so x1 goes to the Z1 cluster;
  ||x2 − Z1(1)|| = 1, ||x2 − Z2(1)|| = 0, so x2 goes to the Z2 cluster;
  ||x3 − Z1(1)|| = 1, ||x3 − Z2(1)|| = √2, so x3 goes to the Z1 cluster;
  ||x4 − Z1(1)|| = √2, ||x4 − Z2(1)|| = 1, so x4 goes to the Z2 cluster;
and x5, x6, ..., x20 are all closer to the second center, so they all go to the Z2 cluster. Hence

  G1(1) = {x1, x3}, N1 = 2;  G2(1) = {x2, x4, x5, ..., x20}, N2 = 18.

Step 3: update the cluster centers as the means of the new clusters:

  Z1(2) = (1/N1)(x1 + x3) = (0, 0.5)^T
  Z2(2) = (1/N2)(x2 + x4 + ... + x20) = (5.67, 5.33)^T

Step 4: since Zj(2) ≠ Zj(1), j = 1, 2, return to Step 2.

Step 2 (second pass): recompute the distances of x1, ..., x20 to Z1(2) and Z2(2) and reassign each sample to the nearer center, giving

  G1(2) = {x1, x2, ..., x8}, N1 = 8;  G2(2) = {x9, x10, ..., x20}, N2 = 12.

[Figure: scatter plot of the 20 samples with the current centers Z1 and Z2.]

Step 3: update the cluster centers:

  Z1(3) = (1/N1)(x1 + x2 + ... + x8) = (1.25, 1.13)^T
  Z2(3) = (1/N2)(x9 + x10 + ... + x20) = (7.67, 7.33)^T

Step 4: since Zj(3) ≠ Zj(2), j = 1, 2, return to Step 2 and recompute.

Step 2 (third pass): recompute the distances of x1, x2, ..., x20 to Z1(3) and Z2(3) and assign each sample to the nearest center; the partition does not change:

  G1(4) = {x1, x2, ..., x8}, N1 = 8;  G2(4) = {x9, x10, ..., x20}, N2 = 12.

Step 3: update the cluster centers; they remain (1.25, 1.13)^T and (7.67, 7.33)^T, so the algorithm stops.

Convergence. Each pass of k-means (reassignment followed by recomputation of the means) cannot increase J, so the algorithm converges. [Figures: the initial state, the partition after one iteration, after a further iteration, and the converged state, in which the partition no longer changes.]

Improvements and practical issues:
- Starting points for improvement: partition the patterns into k groups at random, or use the k mutually most distant feature points as the initial centers.
- The performance of the algorithm depends on k and on the initial cluster centers.
- Determining k: (a) fix k from prior knowledge; (b) run the algorithm for increasing values of k and plot the criterion J against k, choosing k where the curve flattens out.

k-medoids clustering. The center used by the k-means algorithm is the mean of the vectors in a cluster, which in general is not itself a sample. The k-medoids algorithm instead takes as the center of each cluster the sample whose summed dissimilarity to the other members of the cluster is smallest. k-medoids criterion: let Θ denote the set of medoids and IX the index set of the points of X that are not medoids. The cost is

  J(Θ) = Σ_{i∈IX} d(xi, θ(xi))

where θ(xi) is the medoid closest to xi; that is, xi is assigned to the medoid xq for which d(xi, xq) = min over all medoids of d(xi, ·).
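The worked example above can be reproduced with a short plain-Python k-means sketch (function names are mine; it assumes no cluster becomes empty, which holds for this data):

```python
def kmeans(data, centers, max_iter=100):
    """Plain k-means: assign to the nearest center, recompute means, repeat."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for x in data:
            d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
            groups[d2.index(min(d2))].append(x)
        # each new center is the mean of its group (assumes no empty group)
        new = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
        if new == centers:
            break
        centers = new
    return centers, groups

X = [(0,0),(1,0),(0,1),(1,1),(2,1),(1,2),(2,2),(3,2),
     (6,6),(7,6),(8,6),(6,7),(7,7),(8,7),(9,7),(7,8),(8,8),(9,8),(8,9),(9,9)]
centers, groups = kmeans(X, [(0, 0), (1, 0)])
print(centers)   # centers converge to (1.25, 1.125) and (7.67, 7.33)
```

The run stops after the third pass, exactly as in the hand computation.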

5. Fuzzy C-means clustering

Fuzzy clustering allows each sample to belong to several clusters with different degrees of membership. The fuzzy C-means algorithm (FCM) partitions the N n-dimensional samples X = {x1, x2, ..., xN} into C fuzzy clusters, described by a membership matrix U = (uij)_{N×C}, where uij expresses the degree to which sample xi belongs to cluster j. The memberships satisfy:

  (a) uij ∈ [0, 1];
  (b) 0 < Σ_{i=1}^{N} uij < N for every cluster j;
  (c) Σ_{j=1}^{C} uij = 1 for every sample i.

FCM minimizes the objective

  J(U, Z) = Σ_{j=1}^{C} Σ_{i=1}^{N} uij^m dij²,  with  dij² = (xi − zj)^T V (xi − zj)

where m > 1 is the fuzziness exponent, V is a positive-definite norm matrix, and Z = {z1, z2, ..., zC} are the cluster centers. Minimizing J under constraint (c) (e.g., with Lagrange multipliers) yields the update formulas used below.

FCM algorithm:
(1) Fix the number of clusters C, the parameter m, the norm matrix V and a suitably small ε > 0; initialize the membership matrix U(0); k = 0.
(2) Compute the centers {zj} from U(k):

  zj(k) = Σ_{i=1}^{N} (uij(k))^m xi / Σ_{i=1}^{N} (uij(k))^m

(3) Update U(k) to U(k+1). For i = 1 to N, let Ii = {j | dij = 0, j = 1, 2, ..., C}. If Ii = ∅,

  uij = 1 / Σ_{q=1}^{C} (dij / diq)^{2/(m−1)}

otherwise set uij = 0 for j ∉ Ii and distribute the memberships over Ii so that Σ_{j∈Ii} uij = 1.
(4) If ||U(k+1) − U(k)|| < ε, stop; otherwise set k = k + 1 and go to (2).
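A compact FCM sketch in plain Python (names are mine), taking V = I so that dij is the Euclidean distance; when some dij = 0 the full membership is given to the coinciding center(s):

```python
import math, random

def fcm(data, C, m=2.0, eps=1e-5, max_iter=200, seed=0):
    """Fuzzy C-means: alternate center and membership updates."""
    rng = random.Random(seed)
    N, dim = len(data), len(data[0])
    # random initial membership matrix with each row summing to 1
    U = []
    for _ in range(N):
        row = [rng.random() for _ in range(C)]
        s = sum(row)
        U.append([u / s for u in row])
    for _ in range(max_iter):
        # centers: z_j = sum_i u_ij^m x_i / sum_i u_ij^m
        Z = []
        for j in range(C):
            w = [U[i][j] ** m for i in range(N)]
            Z.append([sum(w[i] * data[i][k] for i in range(N)) / sum(w)
                      for k in range(dim)])
        # memberships: u_ij = 1 / sum_q (d_ij / d_iq)^(2/(m-1))
        U_new = []
        for i in range(N):
            d = [math.dist(data[i], Z[j]) for j in range(C)]
            if min(d) == 0.0:  # sample coincides with a center
                row = [1.0 if dj == 0.0 else 0.0 for dj in d]
                row = [r / sum(row) for r in row]
            else:
                row = [1.0 / sum((d[j] / d[q]) ** (2 / (m - 1)) for q in range(C))
                       for j in range(C)]
            U_new.append(row)
        diff = max(abs(U_new[i][j] - U[i][j]) for i in range(N) for j in range(C))
        U = U_new
        if diff < eps:
            break
    return Z, U

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
Z, U = fcm(pts, C=2)
```

Each row of U sums to 1, and the two centers land near the two well-separated groups.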

6. Hierarchical clustering algorithms

Application background: (1) an international conference receives over a thousand submissions, which must be distributed by research area to the appropriate reviewers, a laborious job by hand; (2) news reports of different categories need to be grouped together, likewise laborious by hand.

How does a computer "read" text? The information in a text is tied to the meanings of its words, so the first step is word segmentation. For example,

  中国航天官员应邀到美国与太空总署开会

is segmented as

  中国 / 航天 / 官员 / 应邀 / 到 / 美国 / 与 / 太空 / 总署 / 开会

("Chinese space officials were invited to the United States to meet with NASA.")

Segmentation can be ambiguous: 发展-中-国家 ("developing countries") versus 发展-中国 ("develop China"); 大学城-书店 ("university-town bookstore") versus 大学-城-书店 ("university / town / bookstore"). A simple heuristic is longest match.

A better approach is the statistical language model. If a sentence can be segmented as A1, A2, ..., Ak, or as B1, B2, ..., Bm, or as C1, C2, C3, ..., the segmentation A1, ..., Ak is chosen when it has the highest probability:

  P(A1, A2, A3, ..., Ak) > P(B1, B2, B3, ..., Bm) and P(A1, A2, A3, ..., Ak) > P(C1, C2, ...)

Chinese word-segmentation technology has become mature, but segmentation granularity remains a limitation: a classical sentence such as 此地安能居住,其人好… reads entirely differently depending on where it is segmented.
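The maximum-probability segmentation can be sketched with a toy unigram model; the vocabulary and probabilities below are invented purely for illustration:

```python
from functools import lru_cache

# Toy unigram probabilities for illustration only (not real statistics).
P = {"发展": 0.05, "中": 0.02, "国家": 0.04, "中国": 0.06, "家": 0.001}

def best_segmentation(s):
    """Return the segmentation of s maximizing the product of unigram probabilities."""
    @lru_cache(maxsize=None)
    def solve(i):
        # best (probability, segmentation) for the suffix s[i:]
        if i == len(s):
            return (1.0, ())
        best = (0.0, None)
        for j in range(i + 1, len(s) + 1):
            w = s[i:j]
            if w in P:
                p, rest = solve(j)
                if P[w] * p > best[0]:
                    best = (P[w] * p, (w,) + rest)
        return best
    return solve(0)[1]

print(best_segmentation("发展中国家"))   # ('发展', '中', '国家')
```

With these (made-up) probabilities the model prefers 发展/中/国家 over 发展/中国/家, resolving the ambiguity the way a human reader would.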

Measuring word importance. Consider the query 原子能/的/应用 ("applications of atomic energy"): web pages that contain these three words many times are more relevant than pages that contain them only rarely. Define the term frequency TF of a word as its number of occurrences divided by the total number of words on the page; for an N-word query the relevance of a page can be scored as TF1 + TF2 + ... + TFN.

Using raw term frequencies directly has a problem: function words such as 的 contribute mostly noise. The remedy is to weight each term frequency, scoring relevance as TF1·IDF1 + TF2·IDF2 + ... + TFN·IDFN, where IDF is the inverse document frequency of the word.

Digitizing a document: collect the vocabulary, compute the TF-IDF value of every word in the document, and represent the document by its vector of TF-IDF values; the similarity of two documents is then measured on these vectors.

Generalized Agglomerative Scheme (GAS):
- Initialization: choose R0 = {{x1}, ..., {xN}} as the initial clustering; t = 0.
- Repeat: t = t + 1. Among all pairs of clusters (Cr, Cs) in R(t−1), find the pair (Ci, Cj) with the best value of the proximity function g:
    g(Ci, Cj) = min_{r,s} g(Cr, Cs), if g is a dissimilarity function, or
    g(Ci, Cj) = max_{r,s} g(Cr, Cs), if g is a similarity function,
  and merge Ci and Cj into a single cluster.
- Until all the vectors have been merged into a single cluster.

At level t there are N − t clusters, so finding the best pair requires examining (N − t)(N − t − 1)/2 pairs; summing over all levels gives Σ_{t=0}^{N−1} (N − t)(N − t − 1)/2 = (N − 1)N(N + 1)/6.
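Returning to the document-representation idea above, TF-IDF vectors and cosine similarity can be sketched as follows (toy corpus of pre-segmented documents; names are mine):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: tf*idf} dict per document."""
    N = len(docs)
    df = Counter(w for d in docs for w in set(d))      # document frequency
    idf = {w: math.log(N / df[w]) for w in df}
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({w: (tf[w] / len(d)) * idf[w] for w in tf})
    return vecs

def cosine(u, v):
    """Cosine of the angle between two sparse TF-IDF vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["原子能", "的", "应用"], ["原子能", "的", "原理"], ["大学", "书店"]]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]) > cosine(v[0], v[2]))   # True
```

The first two documents share vocabulary and come out more similar than the unrelated third one, which is exactly the proximity a hierarchical algorithm would cluster on.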

That is, the total amount of computation of the merging algorithm is on the order of N³.

Some notation for the generalized agglomerative scheme:
- The samples to be clustered: X = {x1, x2, ..., xN}, where each xi = [xi1, xi2, ..., xil]^T is an l-dimensional feature vector (^T denotes transposition).
- The proximity matrix (similarity or dissimilarity) P(X): an N×N matrix whose (i, j) element is the similarity s(xi, xj) or the dissimilarity d(xi, xj) between xi and xj.

Example. Let X = {x1, x2, x3, x4, x5} with x1 = [1, 1]^T, x2 = [2, 1]^T, x3 = [5, 4]^T, x4 = [6, 7]^T, x5 = [6.5, …]^T. [The slide gives the dissimilarity matrix D(X) obtained with the Euclidean distance and the similarity matrix P'(X) obtained with the Tanimoto measure.] Taking g(Ci, Cj) = d_min(Ci, Cj), the successive merges produce the nested clusterings

  {{x1, x2}, {x3}, {x4}, {x5}}
  {{x1, x2}, {x3}, {x4, x5}}
  {{x1, x2}, {x3, x4, x5}}
  {{x1, x2, x3, x4, x5}}
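The agglomerative scheme on this example can be sketched with a naive single-link merge rule (minimum distance between clusters). Names are mine, and the unreadable second coordinate of x5 is assumed to be 7 here, so the merge order may differ from the slide's dendrogram:

```python
import math

def single_link(points):
    """Naive agglomerative clustering; returns the clustering at every level."""
    clusters = [[i] for i in range(len(points))]
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: distance of the closest pair across the clusters
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters[b]
        del clusters[b]
        history.append([sorted(c) for c in clusters])
    return history

X = [(1, 1), (2, 1), (5, 4), (6, 7), (6.5, 7)]
for level in single_link(X):
    print(level)
```

Whatever the merge order, {x1, x2} and {x3, x4, x5} emerge as the two top-level clusters before the final merge.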

[Figure: two dendrograms for the example, one built from the similarity (Tanimoto) matrix and one from the dissimilarity (Euclidean) matrix, with the vectors x1, ..., x5 on the horizontal axis and the proximity level on the vertical axis.]

Agglomerative algorithms fall mainly into two categories; the discussion here concerns algorithms based on matrix theory. These start from the N×N dissimilarity matrix of the sample set X. When, at level t, two clusters Ci and Cj are merged into Cq, the matrix Pt is obtained from Pt−1 by the following steps:
- Delete the two rows and the two columns corresponding to Ci and Cj.
- Insert a new row and column containing the distances d(Cq, Cs) = f(d(Ci, Cs), d(Cj, Cs), d(Ci, Cj)) to every remaining cluster Cs. A common parametric form is

  d(Cq, Cs) = ai·d(Ci, Cs) + aj·d(Cj, Cs) + b·d(Ci, Cj) + c·|d(Ci, Cs) − d(Cj, Cs)|

Different choices of a, b, c yield different algorithms:
- Single-link algorithm (ai = aj = 1/2, b = 0, c = −1/2): d(Cq, Cs) = min{d(Ci, Cs), d(Cj, Cs)}
- Complete-link algorithm (ai = aj = 1/2, b = 0, c = 1/2): d(Cq, Cs) = max{d(Ci, Cs), d(Cj, Cs)}

Matrix Updating Algorithm Scheme (MUAS):
- Initialization: R0 = {{x1}, ..., {xN}}; P0 = P(X); t = 0.
- Repeat: t = t + 1. Find Ci, Cj with d(Ci, Cj) = min_{r,s} d(Cr, Cs); merge Ci and Cj into Cq; set Rt = (Rt−1 − {Ci, Cj}) ∪ {Cq}; obtain Pt from Pt−1 by the steps above.
- Until the clustering R(N−1) is formed, i.e., all the vectors lie in a single cluster.

Example: running MUAS on the data above with d(Cq, Cs) = min{d(Ci, Cs), d(Cj, Cs)} (single link) and with d(Cq, Cs) = max{d(Ci, Cs), d(Cj, Cs)} (complete link).

7. Spectral clustering

- A clustering approach that currently receives wide attention.
- Supported by a substantial body of related theory.

An intuitive explanation of eigenvalues and eigenvectors: if a matrix A maps an eigenvector v to Av = cv, then A stretches v by a factor c. A new vector x can be decomposed over the eigenvectors, x = a1v1 + a2v2 + ..., and then

  Ax = A(a1v1 + a2v2 + ...) = a1c1v1 + a2c2v2 + ...

so the action of the matrix is fully described by its eigenvalues and eigenvectors.

Basic definitions from graph theory (only the concepts relevant to spectral clustering): the edge connecting vertices vi and vj is denoted eij = (vi, vj).

Basic definitions for spectral clustering: given the samples, create a graph G whose vertices are the samples; G is undirected and connected. Assign each edge a weight W(i, j).

The weight w(i, j) measures the similarity between the vertices vi and vj of G. The set of weights defines an N×N similarity matrix W with elements w(i, j).
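The stretching intuition can be checked with a tiny power-iteration sketch in plain Python: repeatedly applying A makes the eigenvector with the largest |c| dominate (names are mine):

```python
import math

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, v, steps=100):
    """Repeatedly apply A and normalize; converges to the dominant eigenvector."""
    for _ in range(steps):
        v = matvec(A, v)
        n = math.sqrt(sum(x * x for x in v))
        v = [x / n for x in v]
    # Rayleigh quotient estimates the dominant eigenvalue c in Av = c v
    c = sum(x * y for x, y in zip(v, matvec(A, v)))
    return c, v

A = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 3 and 1, eigenvectors (1,1) and (1,-1)
c, v = power_iteration(A, [1.0, 0.0])
print(round(c, 6))             # 3.0
```

Starting from (1, 0) = ½(1, 1) + ½(1, −1), the component along (1, 1) is multiplied by 3 at every step while the other is multiplied by 1, so the iteration converges to the dominant eigenpair; spectral clustering exploits exactly such eigenvectors of graph matrices built from the similarity matrix W.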
