
Chapter 5: Automatic Text Clustering (Yang Jian)

Cluster: a collection of data objects.

Clustering analysis is the process of distinguishing and grouping objects according to certain regularities and requirements. It uses no prior knowledge about class membership and no supervision; similarity between objects is the only criterion for assigning them to groups.

Clustering: given a data sample set X, the process of partitioning X into k subsets {C1, C2, ..., Ck} is called clustering. Each cluster Ci (i = 1, ..., k) is a subset of X with Ci ∩ Cj = ∅ for i ≠ j; similar samples lie in the same cluster, dissimilar samples in different clusters.

Clustering can serve as a preprocessing step for other algorithms. Typical applications include market analysis (for example, identifying groups of car buyers), city planning (grouping districts), and scientific research.

Document Clustering (DC) is partitioning a set of documents into groups or clusters. Clusters should be computed to contain similar documents and to separate different documents as much as possible. For instance, if similarity is defined to capture semantic relatedness, documents in a cluster should deal with the same topics, and the topics of different clusters should differ.

Applications of DC:
- Guiding the organization of a collection (e.g. [Sahami 98]); progressively clustering groups of documents allows building a topic hierarchy.
- Supporting browsing and interactive retrieval, to allow a faster relevant-document selection process (used in some search engines).
- Pre-processing for other tasks, e.g. to detect the main semantic dimensions of the text collection and to support a kind of concept indexing [Karypis & Han 00].

As with other text processing tasks, DC has several steps: representing each document as a feature vector, dimensionality reduction, applying a clustering algorithm, and evaluating the effectiveness of the result. (Figure: document feature vectors [-,-,-,-,...] fed into the clustering algorithm.)

1. The user specifies the data set to be clustered.
2. Feature selection.
3. Clustering.

Effectiveness is evaluated with precision (P), recall (R), and the F-measure; per-class values are combined into an overall score as follows.

For each class i, the F-measure combines precision and recall:

$F_i = \dfrac{2 P_i R_i}{P_i + R_i}$

Macro-averaging computes the metric per class and averages over the m classes:

$P_{Macro} = \frac{1}{m}\sum_{i=1}^{m} P_i, \qquad F_{Macro} = \frac{1}{m}\sum_{i=1}^{m} F_i$

Micro-averaging (Micro-F) instead pools the per-class counts $n_i$ over all classes first and computes the metric once from the pooled totals.

Another criterion is the sum-of-squared-error:

$J_e = \sum_{i=1}^{k} \sum_{X \in C_i} \lVert X - m_i \rVert^2$

where $m_i$ is the mean of cluster $C_i$. $J_e$ is the total squared error of all samples with respect to their cluster means, so a smaller $J_e$ indicates more compact clusters.
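The following minimal Python sketch (not from the slides) illustrates the two averaging schemes; it assumes cluster labels have already been mapped to class labels, so the task reduces to comparing two label sequences.

```python
from collections import Counter

def macro_micro_f(y_true, y_pred, classes):
    # Per-class counts: true positives, false positives, false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted p, but the true class was t
            fn[t] += 1   # missed an instance of class t
    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0
    # Macro: compute F per class, then average with equal class weights.
    per_class = []
    for c in classes:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        per_class.append(f1(p, r))
    macro_f = sum(per_class) / len(classes)
    # Micro: pool the counts over all classes first, then compute once.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_p = TP / (TP + FP) if TP + FP else 0.0
    micro_r = TP / (TP + FN) if TP + FN else 0.0
    return macro_f, f1(micro_p, micro_r)

print(macro_micro_f(list("aabbc"), list("abbbc"), classes="abc"))  # ~(0.822, 0.8)
```

Macro-averaging weights every class equally, so rare classes matter as much as common ones; micro-averaging weights every sample equally.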

Vector Space Model (VSM): each document $d_j$ is represented as a vector of index-term weights; the N training documents form an M×N term-document matrix A. Clustering then relies on a similarity computed between these vectors. Example with index terms T1, T2, T3 (the T3 weights of D1 and D2 are cut off in the source):

D1 = 2T1 + 3T2 + ·T3
D2 = 3T1 + 7T2 + ·T3
Q = 0T1 + 0T2 + 2T3
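As an illustration of the similarity computation, here is a small cosine-similarity sketch. The T3 weights for D1 and D2 are invented for the demo, since the slide's coefficients are cut off.

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|); define it as 0 for all-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Weights on axes (T1, T2, T3). The T3 weights 5 and 1 are assumptions,
# not the slide's values, which are truncated.
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q  = [0, 0, 2]
print(cosine(D1, Q), cosine(D2, Q))  # with these weights, D1 is closer to Q
```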

Distance between clusters: let d(xi, xj) denote the distance between points xi ∈ Gp and xj ∈ Gq. Common choices:

- Single linkage (nearest neighbor): $D_{pq} = \min d(x_i, x_j)$
- Complete linkage (furthest neighbor): $D_{pq} = \max d(x_i, x_j)$
- Group average: $D_{pq}^2 = \frac{1}{n_p n_q} \sum_{x_i \in G_p} \sum_{x_j \in G_q} d^2(x_i, x_j)$
- Centroid method: the distance between the cluster centroids $\bar{x}_p$ and $\bar{x}_q$
- Ward's method (sum of squared deviations): with $D_1 = \sum_{x_i \in G_p} (x_i - \bar{x}_p)'(x_i - \bar{x}_p)$, $D_2 = \sum_{x_j \in G_q} (x_j - \bar{x}_q)'(x_j - \bar{x}_q)$, and $D_{12} = \sum_{x_k \in G_p \cup G_q} (x_k - \bar{x})'(x_k - \bar{x})$ over the merged cluster, define $D_{pq} = D_{12} - D_1 - D_2$.

A sketch of these distance functions follows.
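A compact sketch of the cluster-distance functions above, assuming plain Euclidean point distances and small in-memory point lists; `group_average_sq` returns the squared distance, matching the slide's formula.

```python
import math
from itertools import product

def single_link(Gp, Gq):       # D_pq = min over pairs of d(xi, xj)
    return min(math.dist(x, y) for x, y in product(Gp, Gq))

def complete_link(Gp, Gq):     # D_pq = max over pairs of d(xi, xj)
    return max(math.dist(x, y) for x, y in product(Gp, Gq))

def group_average_sq(Gp, Gq):  # D_pq^2 = mean of squared pairwise distances
    return sum(math.dist(x, y) ** 2
               for x, y in product(Gp, Gq)) / (len(Gp) * len(Gq))

def centroid_link(Gp, Gq):     # distance between the two cluster centroids
    mean = lambda G: tuple(sum(c) / len(G) for c in zip(*G))
    return math.dist(mean(Gp), mean(Gq))

def ward_increase(Gp, Gq):     # D_pq = D12 - D1 - D2 (growth of within-cluster SSE)
    def sse(G):                # sum of squared deviations from the cluster mean
        m = tuple(sum(c) / len(G) for c in zip(*G))
        return sum(math.dist(x, m) ** 2 for x in G)
    return sse(list(Gp) + list(Gq)) - sse(Gp) - sse(Gq)

Gp = [(0, 0), (0, 2)]
Gq = [(5, 0), (5, 2)]
print(single_link(Gp, Gq), complete_link(Gp, Gq), ward_increase(Gp, Gq))
# -> 5.0, ~5.39, 25.0
```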

Clustering algorithms are either partitioning or hierarchical.

(a) Partitioning: given the document set D = {d1, ..., di, ..., dn}, determine the number k of clusters to generate and choose k cluster seeds S = {s1, ..., sj, ..., sk}. For each document di in D, compute its similarity sim(di, sj) to every seed and assign di to the cluster Cj whose seed is sj = argmax sim(di, sj), obtaining a clustering C = {c1, ..., ck} of D. The method is fast, but k must be fixed in advance.

k-means:
1. Select k initial centroids.
2. Assign each remaining sample to the cluster with the nearest centroid under Euclidean distance.
3. Recompute each centroid as the mean vector of the samples assigned to that cluster.
4. Repeat steps 2-3 until the k centroids no longer change.

(Figure: k-means iterations on a small 2-D point set; arbitrarily choose k objects as initial cluster centers, then alternate reassignment and mean updates.)


Worked example: five samples X1 = (0,2), X2 = (0,0), X3 = (1.5,0), X4 = (5,0), X5 = (5,2), with the required number of clusters k = 2.

Step 1: a random initial distribution of the samples gives C1 = {X1, X2, X4} and C2 = {X3, X5}, with centroids
M1 = ((0+0+5)/3, (2+0+0)/3) = (1.66, 0.66)
M2 = ((1.5+5)/2, (0+2)/2) = (3.25, 1.00)

After this initial random distribution, the within-cluster squared errors $e_i^2 = \sum_{X \in C_i} \lVert X - M_i \rVert^2$ are
e1² = [(0-1.66)² + (2-0.66)²] + [(0-1.66)² + (0-0.66)²] + [(5-1.66)² + (0-0.66)²] = 19.36
e2² = [(1.5-3.25)² + (0-1)²] + [(5-3.25)² + (2-1)²] = 8.12
and the total squared error is E² = e1² + e2² = 27.48.

Step 2: reassign each sample to the nearer centroid (M1 or M2), giving the new clusters C1 = {X1, X2, X3} and C2 = {X4, X5}.

Step 3: compute the new centroids: M1 = (0.5, 0.67), M2 = (5.0, 1.0). The corresponding within-cluster errors and total squared error become e1² = 4.17, e2² = 2.00, E² = 6.17.

The first iteration reduces the total error markedly (from 27.48 to 6.17). In this simple example the first iteration is also the last: re-examining the distances between the new centroids and the samples assigns every sample to the same cluster as before, nothing is reassigned, and the algorithm stops.

Analysis of k-means.
- Strength: relatively efficient, O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. For comparison: PAM is O(k(n-k)²) per iteration, CLARA is O(ks² + k(n-k)).
- Comment: it often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weaknesses: the number of clusters k must be specified in advance, and it has difficulty handling clusters of very different sizes or clusters with concave shapes.
- Variants: the k-prototypes algorithm clusters mixed numeric and discrete data; other variants improve the initialization, for example by reusing certain centroids from round n-1 in round n. (Figure: F-measure of the k-means algorithm before and after the improvement.)
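A minimal k-means sketch that reproduces the worked example above, seeded with the slide's initial random partition. It assumes no cluster ever becomes empty, which holds for this data.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain k-means: assign to nearest center, recompute means, repeat."""
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest current center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[j].append(p)
        # Update step: each center becomes the mean of its cluster
        # (assumes no cluster is empty, true for this data set).
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]
        if new_centers == centers:      # no centroid moved: converged
            break
        centers = new_centers
    return clusters, centers

def total_error(clusters, centers):
    # E^2 = sum of squared distances of samples to their cluster centers (J_e).
    return sum(math.dist(p, m) ** 2
               for cl, m in zip(clusters, centers) for p in cl)

X = [(0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)]
# Initial centroids from the slide's random partition C1={X1,X2,X4}, C2={X3,X5}.
M1 = ((0 + 0 + 5) / 3, (2 + 0 + 0) / 3)
M2 = ((1.5 + 5) / 2, (0 + 2) / 2)
clusters, centers = kmeans(X, [M1, M2])
print(centers)                                    # -> ~(0.5, 0.67) and (5.0, 1.0)
print(round(total_error(clusters, centers), 2))   # -> 6.17
```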

PAM (Partitioning Around Medoids), one of the k-medoids algorithms (1987). Basic idea: initially select k medoids at random, then repeatedly try to find better medoids:
1. Randomly choose k objects as the initial medoids.
2. Assign every remaining object to the cluster of its nearest medoid.
3. Repeat:
4. Randomly select a non-medoid object O_random.
5. Compute the total cost s of replacing a medoid O_j with O_random.
6. If s is negative, replace O_j with O_random to form a new set of k medoids.
7. Until no change.

(Figure: a typical k-medoids run; choose k objects as initial medoids, randomly select a non-medoid object O_random, compute the total cost of swapping, and swap if the quality improves.)

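A sketch of the PAM swap loop described above. The "until no change" condition is approximated here by a fixed number of random swap attempts, and the data set is the small sample from the k-means example.

```python
import math
import random

def total_cost(points, medoids):
    # Cost = sum of distances from each point to its nearest medoid.
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, max_tries=200, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # step 1: random initial medoids
    cost = total_cost(points, medoids)
    for _ in range(max_tries):
        o_j = rng.choice(medoids)            # a current medoid
        o_random = rng.choice([p for p in points if p not in medoids])
        candidate = [o_random if m == o_j else m for m in medoids]
        cand_cost = total_cost(points, candidate)
        if cand_cost - cost < 0:             # s negative: the swap improves quality
            medoids, cost = candidate, cand_cost
    return medoids, cost

pts = [(0, 2), (0, 0), (1.5, 0), (5, 0), (5, 2)]
print(pam(pts, k=2))
```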

Analysis of PAM: PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. PAM works efficiently for small data sets but does not scale well for large data sets: O(k(n-k)²) for each iteration, where n is the number of data points and k the number of clusters. This motivates sampling-based methods such as CLARA (Clustering LARge Applications).

CLARA (Kaufmann and Rousseeuw, 1990) draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output. Strength: deals with larger data sets than PAM. Weaknesses: efficiency depends on the sample size, and a good clustering based on samples will not necessarily represent a good clustering of the whole data set if the samples are biased.

CLARANS (A Clustering Algorithm based on RANdomized Search) (Ng and Han '94) draws a sample of neighbors dynamically. The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids. If a local optimum is found, CLARANS starts from a new randomly selected node to search for the next local optimum. It is more efficient and scalable than both PAM and CLARA.

Hierarchical clustering:
- Bottom-up (agglomerative): start from single items and iteratively merge the two closest clusters into one.
- Top-down (divisive): start with all items in one cluster and iteratively split.

(Figure: on the objects a, b, c, d, e, AGNES merges clusters step by step from step 0 to step 4, while DIANA splits them in the inverse order.)

DIANA (DIvisive ANAlysis) (Kaufmann and Rousseeuw, 1990) proceeds in the inverse order of AGNES; eventually each object forms a cluster of its own.

Generic agglomerative algorithm on a document set D = {d1, ..., di, ..., dn}: start with the clustering C = {c1, ..., ci, ...} in which every document is its own cluster; compute the similarity sim(ci, cj) of every pair of clusters (ci, cj); select the pair with maximum similarity, argmax sim(ci, cj), and merge ci and cj into a new cluster ck = ci ∪ cj, giving a new clustering C = {c1, ..., cn-1}; repeat until the desired number of clusters remains.

AGNES (AGglomerative NESting) (Kaufmann and Rousseeuw, 1990): uses the single-link method and the dissimilarity matrix; merges the nodes that have the least dissimilarity; goes on in a non-descending fashion; eventually all nodes belong to the same cluster. Single link (nearest neighbor): Dmin(Ci, Cj) is the smallest distance between a member of Ci and a member of Cj; average linkage is an alternative.

Worked example (single link): first treat each of the five samples as its own cluster. The closest pair is clusters 3 and 4, with D(3,4) = 5.0. Step 1: merge clusters 3 and 4 into (34), giving the cluster set {1, 2, (34), 5}, and update the distances: D(1,(34)) = min(D(1,3), D(1,4)) = min(20.6, 22.4) = 20.6; D(2,(34)) = min(D(2,3), D(2,4)) = 11.2; D(5,(34)) = min(D(3,5), D(4,5)), with D(4,5) = 25.5. With the updated distance matrix (shown in the figure), the closest two of the four clusters 1, 2, (34), 5 are 1 and 5, so they are merged next.
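A single-link AGNES sketch over a precomputed distance matrix. Only the distances quoted above come from the slides; the remaining matrix entries are made up so the example runs end to end, and they are chosen so that, as in the slides, clusters 3 and 4 merge first and clusters 1 and 5 are the next closest pair.

```python
from itertools import combinations

def agnes_single_link(dist, n_clusters):
    """Single-link AGNES over a distance dict dist[(i, j)] with i < j.
    Starts from singleton clusters and repeatedly merges the closest pair."""
    d = lambda i, j: dist[(min(i, j), max(i, j))]
    # D_min between two clusters: smallest distance over all cross pairs.
    cluster_dist = lambda a, b: min(d(i, j) for i in a for j in b)
    clusters = [{i} for i in sorted({i for pair in dist for i in pair})]
    while len(clusters) > n_clusters:
        ca, cb = min(combinations(clusters, 2), key=lambda ab: cluster_dist(*ab))
        clusters.remove(ca)
        clusters.remove(cb)
        clusters.append(ca | cb)        # merge the two nearest clusters
    return clusters

# D(3,4)=5.0, D(1,3)=20.6, D(1,4)=22.4, D(2,4)=11.2, D(4,5)=25.5 are from the
# slides; all other entries are invented to complete the 5x5 matrix.
dist = {(1, 2): 10.0, (1, 3): 20.6, (1, 4): 22.4, (1, 5): 7.0,
        (2, 3): 14.1, (2, 4): 11.2, (2, 5): 12.0,
        (3, 4): 5.0, (3, 5): 25.0, (4, 5): 25.5}
print(agnes_single_link(dist, n_clusters=2))   # -> [{3, 4}, {1, 2, 5}]
```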

Complete link uses the largest pairwise distance, Dmax(Ci, Cj); group average uses the average pairwise distance.

Analysis of AGNES: the major weakness of agglomerative clustering methods is that they do not scale well (time complexity of at least O(n²), where n is the number of total objects) and that they can never undo what was done previously.

BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), by Zhang, Ramakrishnan, Livny (SIGMOD '96): incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering. Phase 1: scan the DB to build an initial in-memory CF tree, a multi-level compression of the data that tries to preserve the inherent clustering structure of the data. Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.

A CF is a compact storage for data on points in a cluster: it holds enough information to calculate the intra-cluster statistics of a given subcluster, and the additivity theorem allows us to merge sub-clusters.

Clustering Feature: CF = (N, LS, SS), where N is the number of data points, LS is the linear sum of the N points ($\sum_i X_i$), and SS is the square sum of the N points ($\sum_i X_i^2$).
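A small sketch of clustering features and the additivity theorem. The five sample points are illustrative, not from the slides (which only show N = 5), and SS is kept as a scalar for brevity.

```python
def cf(points):
    """Clustering Feature of a set of d-dimensional points: CF = (N, LS, SS)."""
    n = len(points)
    ls = tuple(sum(c) for c in zip(*points))       # linear sum per dimension
    ss = sum(x * x for p in points for x in p)     # square sum (scalar form)
    return n, ls, ss

def cf_merge(cf1, cf2):
    # Additivity theorem: CFs of two disjoint sub-clusters simply add.
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return n1 + n2, tuple(a + b for a, b in zip(ls1, ls2)), ss1 + ss2

a = [(3, 4), (2, 6)]                 # hypothetical sub-cluster points
b = [(4, 5), (4, 7), (3, 8)]
assert cf_merge(cf(a), cf(b)) == cf(a + b)   # merging CFs == CF of the union
print(cf(a + b))                              # -> (5, (16, 30), 244)
```

This additivity is what lets BIRCH absorb points and merge sub-clusters incrementally without revisiting the raw data.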

Example: CF = (5, LS, SS) summarizes a sub-cluster of five points. (Figure: a CF tree whose root and non-leaf nodes hold CF entries pointing down to leaf nodes.)

A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering. A nonleaf node in the tree has descendants ("children") and stores the sums of the CFs of its children. A CF tree has two parameters: the branching factor, which specifies the maximum number of children, and the threshold, the maximum diameter of the sub-clusters stored at the leaf nodes.

Analysis of BIRCH: it scales linearly, finding a good clustering with a single scan and improving the quality with a few additional scans. Weaknesses: it handles only numeric data and is sensitive to the order of the data records; it considers only one point (the centroid) as representative of a cluster, so it is good only for convex-shaped clusters of similar size and density, and only if k can be reasonably estimated.

CURE (Clustering Using REpresentatives), proposed by Guha, Rastogi & Shim (1998): shrink the multiple representative points of each cluster towards the gravity center by a fraction α; multiple representatives capture the shape of the cluster. Procedure: draw a random sample s from the source data objects, then partition the sample into a set of p partitions (each of size s/p);

partially cluster each partition into s/(pq) clusters, then cluster the partial clusters, shrinking the multiple representative points towards the gravity center by the fraction α. (Figure: CURE on a 2-D sample with p = 2 partitions.)

Density-based clustering. Main idea: keep growing a cluster as long as the density (the number of samples) in the neighborhood exceeds a threshold; clustering is based on density (a local cluster criterion), such as density-connected points. Major features: discovers clusters of arbitrary shape; handles noise; needs one scan; needs density parameters as a termination condition. Several interesting studies: DBSCAN (Ester et al. '96), OPTICS (Ankerst et al. '99), DENCLUE (Hinneburg & Keim '98), CLIQUE (Agrawal et al. '98).

Definitions:
a. The region within radius ε of a given object is called the ε-neighborhood of the object.
b. If the ε-neighborhood of an object contains at least a minimum number MinPts of objects, the object is called a core object.
c. Given a set of objects D, an object p is directly density-reachable from q if p is within the ε-neighborhood of q and q is a core object. (Figure: MinPts = 5, Eps = 1 cm.)
d. An object p is density-reachable from an object q w.r.t. ε and MinPts if there is a chain of objects p1 = q, ..., pn = p with pi ∈ D (1 <= i <= n) such that pi+1 is directly density-reachable from pi w.r.t. ε and MinPts.
e. Objects p and q are density-connected w.r.t. ε and MinPts if there is an object o from which both p and q are density-reachable w.r.t. ε and MinPts.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (KDD-96): a cluster is a maximal set of density-connected points; points belonging to no cluster are noise. Algorithm (a runnable sketch follows at the end of this subsection):
1) Arbitrarily select an unlabeled point p.
2) Find all points density-reachable from p w.r.t. ε and MinPts.
3) If |N(p)| >= MinPts, p is a core point and a cluster is formed.
4) If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point. Continue until all points have been processed. (Example parameters: Eps = 1 cm, MinPts = 5.)
DBSCAN is sensitive to the user-defined parameters ε and MinPts; its computational complexity is O(n log n) with a spatial index and O(n²) without one.

OPTICS (Ordering Points To Identify the Clustering Structure), Mihael Ankerst et al., ACM SIGMOD '99: an improvement of DBSCAN that computes an ordering of the objects together with two values per object:
- Core distance of p: the smallest ε' that would make p a core object; it is undefined if |N(p)| < MinPts, i.e. if p is not a core object.
- Reachability distance of p w.r.t. an object q: the larger of the core distance of q and the Euclidean distance d(p, q); it is undefined if |N(q)| < MinPts, i.e. if q is not a core object.
OPTICS algorithm: 1. compute the core distance and reachability distance of each data point p; 2. if p is a core object, expand from it, processing its density-reachable points in order of increasing reachability distance. Complexity: essentially the same as DBSCAN. Plotting the objects in cluster order against their reachability distances (undefined for noise) exposes the cluster structure at different density levels.

Grid-based methods maintain rectangular cells at different levels of resolution; statistical summaries of each cell are precomputed and stored for query processing.

CLIQUE (Clustering In QUEst): a unit is dense if the data points it contains exceed a density threshold, and a cluster is a maximal set of connected dense units. CLIQUE finds the dense units in each subspace that meets the density requirement, then finds the connected dense units in those subspaces. Candidate generation is Apriori-like: given a k-dimensional dense unit, check its (k-1)-dimensional projections, since a k-dimensional unit can be dense only if its projections are dense.
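Here is the runnable DBSCAN sketch promised above; neighborhoods are computed by brute force, matching the O(n²) bound for the index-free case, and the points and parameters are invented for the demo.

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point (cluster id, -1 for noise)."""
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # not a core point (may become border later)
            labels[i] = -1
            continue
        labels[i] = cid                # start a new cluster from core point i
        queue = deque(seeds)
        while queue:                   # expand the cluster via density-reachability
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cid        # previously-noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nj = neighbors(j)
            if len(nj) >= min_pts:     # j is itself a core point: keep expanding
                queue.extend(nj)
        cid += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
print(dbscan(pts, eps=1.5, min_pts=3))   # -> [0, 0, 0, 0, 1, 1, 1, -1]
```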


Analysis of CLIQUE: it automatically finds the subspaces of the highest dimensionality that contain high-density clusters.

Outlier detection:
- Statistical approaches assume a distribution or probability model (a working hypothesis) with given distribution parameters; they are hard to apply when the data distribution is unknown.
- Distance-based approaches: index-based, nested-loop, and cell-based algorithms.
- Deviation-based approaches examine the main characteristics of the objects in a group; objects that deviate from this description are considered outliers. Techniques include the sequential exception technique and the OLAP data cube approach.

On-line clustering of Web search results: for a given query, multiple sub-topics are mixed together in the result list, and organizing Web search results into clusters facilitates users' quick browsing. On-line requirements (Zamir and Etzioni): the clustering algorithm should take the snippets instead of the whole documents as input, since downloading the original documents is time-consuming; the clustering algorithm should be fast enough for online calculation; and the generated clusters should have readable descriptions for quick browsing by users.

Zamir and Etzioni presented the Suffix Tree Clustering (STC) algorithm ("Web Document Clustering: A Feasibility Demonstration", '98): first identify the sets of documents that share common phrases, then create clusters according to these phrases.

STC is a linear-time clustering algorithm based on a suffix tree, which efficiently identifies sets of documents that share common phrases. STC satisfies the key requirements above: it treats a document as a string, making use of proximity information between words; it is novel, incremental, and O(n) time; and it succinctly summarizes cluster contents for users. It is quick because it works on the smaller snippet set and is incremental.

Suffix tree: the suffixes of "banana" are banana, anana, nana, ana, na, a. A suffix tree is a rooted, directed tree; each internal node has 2+ children; each edge is labeled with a non-empty sub-string of S; the label of a node is the concatenation of the edge labels on the path from the root to that node; and no two edges out of the same node can have edge labels that begin with the same word (the tree is compact).

Step 1, document "cleaning": parse HTML into words, mark sentence boundaries, remove non-word tokens.
Step 2: identifying base clusters.
Step 3: combining base clusters.

Step 2 (identifying base clusters with a suffix tree): STC treats a document as a set of strings (its sentences) and builds a compact tree containing all of their suffixes, taken word by word rather than character by character (e.g. one word suffix of the string «"Friends" is a lovely show» is «a lovely show»).

Example: the suffix tree of String 1 "cat ate cheese", String 2 "mouse ate cheese too", and String 3 "cat ate mouse too". (Table: each tree node lists its phrase and the numbers of the strings containing it.) The base clusters correspond to the internal nodes of the tree.
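A brute-force sketch of Step 2 using the slide's three example strings. Instead of building a real suffix tree, it enumerates the word n-grams of every suffix, which yields the same phrase-to-document map for small inputs; a real STC implementation gets this in linear time from the tree.

```python
from collections import defaultdict

def base_clusters(docs, max_len=4):
    """Map each shared phrase to the set of documents containing it.
    Brute force for clarity: enumerate word n-grams of every suffix."""
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        words = text.split()
        for start in range(len(words)):           # every suffix ...
            for end in range(start + 1, min(start + 1 + max_len, len(words) + 1)):
                phrase_docs[" ".join(words[start:end])].add(doc_id)
    # Base clusters: phrases shared by at least two documents.
    return {p: ids for p, ids in phrase_docs.items() if len(ids) >= 2}

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
for phrase, ids in sorted(base_clusters(docs).items()):
    print(phrase, "->", sorted(ids))
# e.g. "cat ate" -> [1, 3], "ate cheese" -> [1, 2], "too" -> [2, 3], ...
```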

Score of a base cluster B with phrase P:

s(B) = |B| · f(|P|)

where |B| is the number of documents in base cluster B and |P| is the number of words in P that have a non-zero score. Words have a zero score if they are stopwords, appear in too few (< 3) documents, or appear in too many (> 40% of the) documents.

Step 3 (combining base clusters): merge base clusters with a high overlap in their document sets (documents may share phrases, so base clusters overlap). With the constant threshold 0.5, the similarity of Bm and Bn is

similarity(Bm, Bn) = 1 if |Bm ∩ Bn| / |Bm| > 0.5 and |Bm ∩ Bn| / |Bn| > 0.5, and 0 otherwise;

similar base clusters are merged into one cluster. Discussion: what if "ate" is in the stopword list? (Then the base cluster for the single word "ate" scores zero, while longer phrases such as "ate cheese" can still score through their other words.)
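A sketch of the scoring and merging rules. Here `doc_freq` (document frequency per word) is an assumed precomputed input, and f(|P|) is simplified to a plain count of the non-zero-score words; the original STC uses a slightly more elaborate f.

```python
def phrase_score(num_docs, phrase_words, doc_freq, total_docs, stopwords=()):
    """s(B) = |B| * f(|P|), with f(|P|) simplified to the number of words in P
    that score non-zero: not stopwords, in >= 3 docs, in <= 40% of all docs."""
    effective = [w for w in phrase_words
                 if w not in stopwords
                 and 3 <= doc_freq.get(w, 0) <= 0.4 * total_docs]
    return num_docs * len(effective)

def should_merge(bm, bn, threshold=0.5):
    """Merge base clusters whose document sets overlap heavily:
    |Bm & Bn|/|Bm| > 0.5 and |Bm & Bn|/|Bn| > 0.5."""
    inter = len(bm & bn)
    return inter / len(bm) > threshold and inter / len(bn) > threshold

df = {"cat": 10, "ate": 50, "cheese": 8}      # hypothetical document frequencies
print(phrase_score(4, ["cat", "ate", "cheese"], df,
                   total_docs=100, stopwords={"ate"}))   # -> 4 * 2 = 8
print(should_merge({1, 2, 3}, {2, 3, 4}))     # 2/3 > 0.5 both ways -> True
print(should_merge({1, 2, 3, 4}, {4, 5}))     # 1/4 <= 0.5 -> False
```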

STC is incremental. As each document arrives from the web:
- "clean" it (linear with collection size);
- add it to the suffix tree; each node that is updated or created as a result is tagged;
- update the relevant base clusters and recalculate the similarity of these base clusters to the rest of the k highest-scoring base clusters (linear);
- check for any changes to the final clusters;
- score and sort the final clusters, choose the top 10, ... (linear).

STC allows cluster overlap. Why overlap matters: a document often has more than one topic. STC allows a document to appear in more than one cluster, since documents may share more than one phrase with different groups of documents; clusters that are too similar, however, are merged into one.

Remaining evaluation questions include clustering snippets versus whole documents and execution time. In summary, STC is an incremental, O(n)-time clustering algorithm for Web search engine results. A related paper: "Learning to Cluster Web Search Results", Hua-Jun Zeng et al., 2004.
