Chapter 19 Clustering Analysis

Content

Similarity coefficient
Hierarchical clustering analysis
Dynamic clustering analysis
Ordered sample clustering analysis

Discriminant Analysis: given individuals known with certainty to come from two or more populations, it is a method for deriving a discriminant model that will allocate further individuals to the correct population.

Clustering Analysis: a statistical method for grouping objects of arbitrary kind into categories. It is used when there are no a priori hypotheses and one tries to find the most appropriate grouping by means of mathematical statistics and the collected information. It has become the method of first choice for exploring large volumes of genetic information.

Both are multivariate statistical methods for studying classification.

Clustering analysis is an exploratory statistical method. It can be divided into two major types according to its aim. Let m denote the number of variables (i.e. indexes) and n the number of cases (i.e. samples); the two types are:

(1) R-type clustering: also called index clustering. It groups the m indexes, aiming to reduce the dimension of the indexes and to choose typical ones.

(2) Q-type clustering: also called sample clustering. It groups the n samples to find what they have in common.

The most important thing for both R-type clustering and Q-type clustering is the definition of similarity, that is, how to quantify similarity. The first step of clustering is to define a measure of similarity between two indexes or two samples: the similarity coefficient.

§1 Similarity coefficient

1. Similarity coefficient of R-type clustering

Suppose there are m variables: X1, X2, …, Xm. R-type clustering usually uses the absolute value of the simple correlation coefficient to define the similarity coefficient between variables:

$r_{ij} = \dfrac{\left| \sum (X_i - \bar{X}_i)(X_j - \bar{X}_j) \right|}{\sqrt{\sum (X_i - \bar{X}_i)^2 \sum (X_j - \bar{X}_j)^2}}$

Two variables are more similar the larger this absolute value is. Similarly, the Spearman rank correlation coefficient can be used to define the similarity coefficient of non-normal variables. When the variables are all qualitative, it is best to use the contingency coefficient.
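As a minimal sketch of the R-type similarity coefficient described above, the following computes the absolute value of the simple (Pearson) correlation between two variables. The function name `abs_correlation` and the three short data vectors are hypothetical and serve only as an illustration.

```python
import math

def abs_correlation(x, y):
    """Absolute value of the simple (Pearson) correlation of two variables."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)

# Hypothetical measurements of three variables on five cases.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]  # rises exactly with x1
x3 = [5.0, 4.0, 3.0, 2.0, 1.0]   # falls exactly as x1 rises

sim_12 = abs_correlation(x1, x2)
sim_13 = abs_correlation(x1, x3)
print(sim_12, sim_13)
```

Because the absolute value is taken, a perfectly decreasing relation (x1 with x3) counts as just as similar as a perfectly increasing one (x1 with x2).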

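2. Similarity coefficients commonly used in Q-type clustering

Suppose the n cases are regarded as n points in an m-dimensional space. The distance between two points can be used to define the similarity coefficient; two samples are more similar the smaller the distance between them. Write $X_{ik}$ for the value of variable k on sample i.

(1) Euclidean distance

$d_{ij} = \sqrt{\sum_{k=1}^{m} (X_{ik} - X_{jk})^2}$

(2) Manhattan distance (absolute distance)

$d_{ij} = \sum_{k=1}^{m} |X_{ik} - X_{jk}|$

(3) Minkowski distance:

$d_{ij} = \left( \sum_{k=1}^{m} |X_{ik} - X_{jk}|^{q} \right)^{1/q}$    (19-5)

The Minkowski distance reduces to the absolute distance when q=1 and to the Euclidean distance when q=2. The Euclidean distance is intuitive and simple to compute, but it ignores the correlations among variables. That is why the Mahalanobis distance was introduced.

(4) Mahalanobis distance: let S denote the sample covariance matrix of the m variables and $X_i = (X_{i1}, \dots, X_{im})'$ the vector of values of sample i. The distance is worked out as follows:

$d_{ij} = (X_i - X_j)' S^{-1} (X_i - X_j)$

When S is the identity matrix, the Mahalanobis distance defined in this way equals the square of the Euclidean distance.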
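The relations among these distances can be checked numerically. The sketch below is illustrative only: the function names `minkowski` and `mahalanobis_sq`, the two sample points and the identity covariance matrix are assumptions made for the example, and the Mahalanobis distance is taken in the quadratic form $(X_i - X_j)' S^{-1} (X_i - X_j)$ so that it agrees with the statement about the unit matrix.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two samples."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def mahalanobis_sq(x, y, s_inv):
    """Quadratic form (x - y)' S^{-1} (x - y); s_inv is S^{-1} as a list of rows."""
    d = [a - b for a, b in zip(x, y)]
    m = len(d)
    return sum(d[i] * s_inv[i][j] * d[j] for i in range(m) for j in range(m))

# Two hypothetical samples measured on m = 2 variables.
a = [0.0, 0.0]
b = [3.0, 4.0]
identity = [[1.0, 0.0], [0.0, 1.0]]  # inverse of a unit covariance matrix

manhattan = minkowski(a, b, 1)   # q = 1: absolute distance
euclidean = minkowski(a, b, 2)   # q = 2: Euclidean distance
maha = mahalanobis_sq(a, b, identity)

print(manhattan, euclidean, maha)
```

With a unit covariance matrix the quadratic form comes out as the square of the Euclidean distance, as stated above.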

All four distances apply to quantitative variables; qualitative and ordinal variables must be quantified before use.

§2 Hierarchical Clustering Analysis

Hierarchical clustering analysis is one of the most commonly used methods for grouping similar samples or variables. The process is as follows:

1) At the beginning, each sample (or variable) is regarded as a cluster of its own, that is, each cluster contains only one sample (or variable). Then work out the similarity coefficient matrix among clusters. The matrix is made up of the similarity coefficients between samples (or variables) and is symmetric.

2) The two clusters with the maximum similarity coefficient (minimum distance or maximum correlation coefficient) are merged into a new cluster. Compute the similarity coefficients between the new cluster and the other clusters. Repeat step 2) until all of the samples (or variables) are merged into one cluster.

The calculation of similarity coefficient between clusters

Each step of hierarchical clustering has to calculate the similarity coefficients among clusters. When each of the two clusters contains only one sample or variable, the similarity coefficient between them equals that of the two samples or the two variables, computed as in §1.

When a cluster contains more than one sample or variable, many methods can be used to compute the similarity coefficient. Five of them are listed below. $G_p$ and $G_q$ refer to the two clusters, which contain $n_p$ and $n_q$ samples or variables respectively.

1. The maximum similarity coefficient method

If there are $n_p$ and $n_q$ samples (or variables) in clusters $G_p$ and $G_q$ respectively, there are altogether $n_p n_q$ similarity coefficients between the two clusters, but only the maximum is taken as the similarity coefficient of the two clusters.

Note: the minimum distance also means the maximum similarity coefficient.

2. The minimum similarity coefficient method

The similarity coefficient between clusters is calculated as follows:

$D_{pq} = \max_{i \in G_p,\, j \in G_q} d_{ij}$ (sample clustering), $r_{pq} = \min_{i \in G_p,\, j \in G_q} r_{ij}$ (index clustering)    (19-8)

3. The center of gravity method (only used in sample clustering)

The centers of gravity $\bar{X}_p$ and $\bar{X}_q$ are the vectors of index means within clusters $G_p$ and $G_q$. The distance between the clusters is the distance between the two centers:

$D_{pq} = d_{\bar{X}_p \bar{X}_q}$

4. Cluster equilibration method (only used in sample clustering)

Work out the average squared distance over all pairs of samples, one from each cluster:

$D_{pq}^2 = \dfrac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}^2$

Cluster equilibration is one of the better methods in hierarchical clustering, because it fully reflects the information of the individuals within each cluster.
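The four between-cluster rules described so far can be compared on a small example. This is a sketch under stated assumptions: Euclidean distance is used between samples, the two clusters `gp` and `gq` are hypothetical, and the dictionary keys are illustrative names rather than terms from the text.

```python
import math

def euclid(x, y):
    """Euclidean distance between two samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    """Vector of index means of a cluster (its center of gravity)."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def between_cluster_distances(gp, gq):
    """Distance between two sample clusters under four linkage rules."""
    pairs = [euclid(x, y) for x in gp for y in gq]
    return {
        # maximum similarity coefficient = smallest pairwise distance
        "max_similarity": min(pairs),
        # minimum similarity coefficient = largest pairwise distance
        "min_similarity": max(pairs),
        # distance between the two centers of gravity
        "center_of_gravity": euclid(centroid(gp), centroid(gq)),
        # cluster equilibration: mean of squared pairwise distances
        "cluster_average_sq": sum(d * d for d in pairs) / len(pairs),
    }

# Two hypothetical clusters of two samples each.
gp = [[0.0, 0.0], [0.0, 2.0]]
gq = [[3.0, 0.0], [3.0, 2.0]]
result = between_cluster_distances(gp, gq)
for name, value in result.items():
    print(name, round(value, 3))
```

Note that the cluster equilibration entry is a squared distance, so it is not on the same scale as the other three values.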

5. Sum of squares of deviations method

Also called the Ward method; only for sample clustering. It borrows the basic idea of analysis of variance: a rational classification makes the sum of squares of deviations within clusters small and that among clusters large. Suppose the samples have been classified into g clusters, including $G_p$ and $G_q$. The sum of squares of deviations of the $n_p$ samples in cluster $G_p$ is $L_p = \sum_{i \in G_p} \sum_{k=1}^{m} (X_{ik} - \bar{X}_{pk})^2$, where $\bar{X}_{pk}$ is the mean of variable k within $G_p$. The pooled sum of squares of deviations of all g clusters is $L^{g} = \sum_{p} L_p$. If $G_p$ and $G_q$ are merged, there will be g-1 clusters.

The increment of the pooled sum of squares of deviations, $L^{g-1} - L^{g}$, is defined as the squared distance between the two clusters. Obviously, when each of the n samples forms a cluster of its own, the pooled sum of squares of deviations is 0.

Sample 19-1 Four variables were measured on 3454 adult women: height (X1), length of legs (X2), waistline (X3) and chest circumference (X4). The correlation matrix has been worked out as follows:

Try to use hierarchical clustering to cluster the 4 indexes.

This is a case of R-type (index) clustering. We choose the simple correlation coefficient as the similarity coefficient, and use the maximum similarity coefficient method to calculate the similarity coefficient among clusters.

The clustering procedure is as follows:

(1) Each index is regarded as a cluster of its own: G1={X1}, G2={X2}, G3={X3}, G4={X4}. There are altogether 4 clusters.

(2) Merge the two clusters with the maximum similarity coefficient into a new cluster. In this case, we merge G1 and G2 (similarity coefficient 0.852) into G5={X1,X2}. Calculate the similarity coefficients among G5, G3 and G4.

The similarity matrix among G3, G4 and G5:

(3) Merge G3 and G4 into G6={G3,G4}, because this time the similarity coefficient between G3 and G4 is the largest (0.732). Compute the similarity coefficient between G6 and G5.

(4) Lastly, G5 and G6 are merged into one cluster G7={G5,G6}, which in fact includes all the original indexes.

Draw the hierarchical dendrogram (Picture 19-1) according to the process of clustering. As the picture indicates, the indexes are best classified into two clusters: {X1,X2} and {X3,X4}. That is, the length indexes form one cluster and the circumference indexes the other.

Picture 19-1 Hierarchical dendrogram of the 4 indexes (leaves: height, length of legs, waistline, chest circumference)

Sample 19-2 Table 19-1 lists the mean energy expenditure and sugar expenditure of four athletic items, measured on six athletes. In order to provide corresponding dietary standards to improve performance, cluster the athletic items using hierarchical clustering.

Table 19-1 Measured values of 4 athletic items

Athletic item                  Energy expenditure X1    Sugar expenditure X2    Standardized X1    Standardized X2
                               (joule/(minute·m²))      (%)
Weight-loading crouching G1    27.892                   61.42                    1.315              0.688
Pull-up G2                     23.475                   56.83                    0.174              0.088
Push-ups G3                    18.924                   45.13                   -1.001             -1.441
Sit-up G4                      20.913                   61.25                   -0.488              0.665
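The last two numbers given for each item in Table 19-1 are standardized values of X1 and X2. As a minimal sketch, they can be reproduced from the raw measurements by subtracting the sample mean and dividing by the sample standard deviation (with divisor n-1, which is an assumption checked against the table rather than stated in the text). The function name `standardize` is hypothetical.

```python
import math

# Raw values from Table 19-1, in the order G1, G2, G3, G4.
energy = [27.892, 23.475, 18.924, 20.913]  # X1
sugar = [61.42, 56.83, 45.13, 61.25]       # X2

def standardize(values):
    """Subtract the sample mean and divide by the sample SD (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

z_energy = [round(z, 3) for z in standardize(energy)]
z_sugar = [round(z, 3) for z in standardize(sugar)]

print(z_energy)
print(z_sugar)
```

Rounded to three decimals, the results agree with the standardized values listed in the table.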

We choose the Minkowski distance in this sample, and use the minimum similarity coefficient method to calculate distances among clusters. To reduce the effect of the units of the variables, the variables should be standardized before analysis: $X_i' = (X_i - \bar{X}_i)/S_i$, where $\bar{X}_i$ and $S_i$ refer respectively to the sample mean and standard deviation of Xi. The transformed data are listed in Table 19-1.

The clustering process:

(1) Compute the similarity coefficient matrix (i.e. distance matrix) of the 4 samples. The distance between weight-loading crouching and pull-ups can be worked out using formula (19-3).

Likewise, the distance between weight-loading crouching and push-ups can be computed as follows:

Lastly, work out the distance matrix:

(2) The distance between G2 and G4 is the minimum, so G2 and G4 are merged into a new cluster G5={G2,G4}. Compute the distances between G5 and the other clusters using the minimum similarity coefficient method according to formula (19-8).

The distance matrix of G1, G3 and G5:

(3) Merge G1 and G5 into a new cluster G6={G1,G5}. Compute the distance between G6 and G3:

(4) Lastly, merge G3 and G6 into G7={G3,G6}. All the samples have now been merged into one large cluster.

According to the process of clustering, draw the hierarchical dendrogram (Chart 19-2). As the dendrogram shows, and in light of subject-matter knowledge, the athletic items should be sorted into two clusters: {G1,G2,G4} and {G3}. Physical energy expenditure in weight-loading crouching, pull-ups and sit-ups is much higher, so improved dietary standards may be required for those items during training.
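The merging sequence of Sample 19-2 can be retraced with a short sketch. It starts from the standardized values in Table 19-1 and applies the minimum similarity coefficient method, i.e. the distance between two clusters is the largest pairwise distance. Two points are assumptions: the distance between samples is taken to be Euclidean (Minkowski with q=2), since the distance matrices themselves are not reproduced here, and the helper names `cluster_distance` and `agglomerate` are hypothetical.

```python
import math

# Standardized (X1, X2) values from Table 19-1.
items = {
    "G1": (1.315, 0.688),    # weight-loading crouching
    "G2": (0.174, 0.088),    # pull-up
    "G3": (-1.001, -1.441),  # push-ups
    "G4": (-0.488, 0.665),   # sit-up
}

def euclid(x, y):
    # Assumption: Euclidean distance between two samples.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_distance(p, q):
    # Minimum similarity coefficient method: largest pairwise distance.
    return max(euclid(items[a], items[b]) for a in p for b in q)

def agglomerate(names):
    """Merge the two closest clusters repeatedly; return the merge history."""
    clusters = [(name,) for name in names]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = tuple(sorted(clusters[i] + clusters[j]))
        history.append((merged, round(d, 3)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

merges = agglomerate(["G1", "G2", "G3", "G4"])
for members, height in merges:
    print(members, height)
```

Under these assumptions the items merge in the order described in steps (2) to (4): first G2 with G4, then G1, and finally G3.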

Analysisofclusteringexamples

Different definitions of the similarity coefficient between samples, and of the similarity coefficient among clusters, lead to different clustering results. Subject-matter expertise, as well as the clustering method, is important for interpreting a clustering analysis.

Sample 19-3 Twenty-seven petroleum pitch workers and pyro-furnace men were surveyed about their age, length of service and smoking. In addition, sero-P21, sero-P53, peripheral blood lymphocyte SCE, the number of chromosomal aberrations and the number of cells with chromosomal aberrations were measured in these workers (Table 19-3). (P21 multiple = P21 detection value / mean P21 of the control group.) Sort the 27 workers using a suitable hierarchical clustering method.

Table 19-3 Results of bio-marker detection and clustering analysis of petroleum pitch workers and pyro-furnace men

Columns, in order: sample number; age; length of service; smoking (cigarettes/day); sero-P21; P21 multiple; P53; SCE; number of chromosomal aberrations; number of cells with chromosomal aberrations; result of clustering.

The column breaks of the data rows are not preserved. Each row is given as it appears, one worker per line; the first row is incomplete.

680.358.1144
235122035102.761.436.84331
352252027842.190.544.11331
43272024511.930.4711.45961
53822032472.560.8011.68551
651313037102.920.3711.60221
74091031942.510.4011.40551
834172046583.670.4611.35331
95029050193.950.4713.451081
1042202074825.890.1213.110021
1157301538002.990.1910.76221
1236152024781.950.2510.00001
133712038273.010.8210.50441
145232029842.350.1611.15331
1552321037492.950.7211.4511101
1642273049413.890.7313.80761
1744272039483.110.3313.6516141
184021533602.640.3711.40001
193821529362.310.6911.40111
2044272068515.390.9912.28762
214327039263.090.4711.95001
222610343813.450.5211.80751
2337182071425.620.8511.81552
242892026122.060.3711.65111
252593026382.080.7812.25111
2634142043223.400.4115.00551
2750322028622.250.698.80221

This example applies three methods to cluster the data, all based on the Euclidean distance: the minimum similarity coefficient method, the cluster equilibration method and the sum of squares of deviations method. The results are shown in Chart 19-3, Chart 19-4 and Chart 19-5. All the variables were standardized before analysis.

Chart 19-3 Hierarchical dendrogram of the 27 petroleum pitch workers and pyro-furnace men using the minimum similarity coefficient method

Chart 19-4 Hierarchical dendrogram of the 27 petroleum pitch workers and pyro-furnace men using the cluster equilibration method

Chart 19-5 Hierarchical dendrogram of the 27 petroleum pitch workers and pyro-furnace men using the sum of squares of deviations method

The outcomes of the three kinds of clustering are not the same, which shows that different methods have different efficiency. The differences are more distinct when there are more variables. So it is better to select informative variables before clustering analysis, such as P21 and P53 in this example. More information can be obtained by reading the clustering charts.

According to expertise, the outcome of cluster equilibration is the most reasonable. The classification result is filled in the last column of Table 19-3. Workers numbered {10,20,23} are classified as one class; the others form another. Researchers find that workers {10,20,23} are at high risk of cancer. Workers {10,20,23,8,16,26} are clustered together in the dendrogram of the sum of squares of deviations method, suggesting that workers 8, 16 and 26 may be at high risk too.

Dynamic clustering

If there are too many samples to classify, hierarchical clustering analysis demands much space to store the similarity coefficient matrix and is quite inefficient. What is more, samples cannot be reassigned once they are classified. Because of these shortcomings, statisticians put forward dynamic clustering, which overcomes the inefficiency and adjusts the classification as clustering proceeds.

The principle of dynamic clustering analysis is: firstly, select several representative samples, called cohesion points, as the core of each class; secondly, classify the other samples and adjust the core of each class until the classification is reasonable. The most common method of dynamic clustering analysis is k-means, which is quit
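The principle of dynamic clustering (choose cohesion points, assign every sample to the nearest core, then adjust the cores and repeat) can be sketched as a bare-bones k-means. Everything specific here is an assumption for illustration: the Euclidean distance, the six hypothetical two-variable samples, the choice of two of them as starting cohesion points, and the function name `k_means`.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(points, seeds, max_iter=100):
    """Assign samples to the nearest core, then move each core to its class mean."""
    centers = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Classify every sample by its nearest core.
        new_labels = [
            min(range(len(centers)), key=lambda k: euclid(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # no sample changed class: the classification is stable
        labels = new_labels
        # Adjust the core of each class to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical samples forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = k_means(points, seeds=[points[0], points[3]])
print(labels)
print(centers)
```

Unlike hierarchical clustering, no full similarity coefficient matrix is stored, and a sample may change class from one pass to the next.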
