




Chapter 19 Clustering Analysis

Content
Similarity coefficient
Hierarchical clustering analysis
Dynamic clustering analysis
Ordered sample clustering analysis

Discriminant analysis: given samples known with certainty to come from two or more populations, it is a method for deriving a discriminant model that allocates further individuals to the correct population.
Clustering analysis: a statistical method for grouping arbitrary objects into categories. It is used when there are no a priori hypotheses and the aim is to find the most appropriate grouping by means of mathematical statistics and the information collected. It has become the method of first choice for exploring large volumes of genetic information.
Both are multivariate statistical methods for studying classification.
Clustering analysis is an exploratory statistical method. It can be divided into two major types according to its aim. If m denotes the number of variables (i.e. indexes) and n the number of cases (i.e. samples), the two types are:
(1) R-type clustering: also called index clustering. It sorts the m indexes, aiming to reduce the dimension of the indexes and to choose typical ones.
(2) Q-type clustering: also called sample clustering. It sorts the n samples to find what they have in common.
The most important thing for both R-type and Q-type clustering is the definition of similarity, that is, how to quantify similarity. The first step of clustering is to define a measure of similarity between two indexes or two samples: the similarity coefficient.

§1 Similarity coefficient
1. Similarity coefficient of R-type clustering
Suppose there are m variables: X1, X2, …, Xm. R-type clustering usually uses the absolute value of the simple correlation coefficient to define the similarity coefficient between variables:
$$ r_{ij} = \frac{\left| \sum (X_i - \bar{X}_i)(X_j - \bar{X}_j) \right|}{\sqrt{\sum (X_i - \bar{X}_i)^2 \, \sum (X_j - \bar{X}_j)^2}} $$
Two variables are more similar the larger this absolute value is. Similarly, the Spearman rank correlation coefficient can be used to define the similarity coefficient of non-normal variables. When the variables are all qualitative, it is best to use the contingency coefficient.
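As an illustration of the R-type similarity coefficient, the following minimal sketch computes the matrix of absolute simple (Pearson) correlation coefficients in plain Python. The function names `pearson_r` and `r_type_similarity` and the toy data are assumptions made for this example, not part of the original text.

```python
import math

def pearson_r(x, y):
    """Simple (Pearson) correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_type_similarity(variables):
    """Matrix of |r_ij| for a list of variables (each a list of n observations)."""
    m = len(variables)
    return [[abs(pearson_r(variables[i], variables[j])) for j in range(m)]
            for i in range(m)]

# Hypothetical toy data: three variables observed on five cases.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]   # rises exactly with x1
x3 = [5.0, 4.0, 3.0, 2.0, 1.0]    # falls exactly as x1 rises
sim = r_type_similarity([x1, x2, x3])
for row in sim:
    print([round(v, 3) for v in row])
```

Because the absolute value is taken, a strongly negative correlation (x1 and x3 here) counts as just as similar as a strongly positive one.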
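2. Similarity coefficients commonly used in Q-type clustering
Suppose the n cases are regarded as n points in an m-dimensional space. The distance between two points can then be used to define the similarity coefficient: two samples are more similar the smaller the distance between them. Let $X_{ik}$ denote the value of variable k for sample i.
(1) Euclidean distance:
$$ d_{ij} = \sqrt{\sum_{k=1}^{m} (X_{ik} - X_{jk})^2} \qquad (19\text{-}3) $$
(2) Manhattan distance (absolute distance):
$$ d_{ij} = \sum_{k=1}^{m} \left| X_{ik} - X_{jk} \right| $$
(3) Minkowski distance:
$$ d_{ij} = \left( \sum_{k=1}^{m} \left| X_{ik} - X_{jk} \right|^{q} \right)^{1/q} \qquad (19\text{-}5) $$
The absolute distance is the Minkowski distance with q = 1, and the Euclidean distance is the Minkowski distance with q = 2. The Euclidean distance is intuitive and simple to compute, but it does not take the correlations among variables into account. That is why the Mahalanobis distance was introduced.
(4) Mahalanobis distance: let S denote the sample covariance matrix of the m variables. The distance is worked out as follows:
$$ d_{ij} = \mathbf{X}' \mathbf{S}^{-1} \mathbf{X}, \qquad \mathbf{X} = (X_{i1} - X_{j1}, \; X_{i2} - X_{j2}, \; \dots, \; X_{im} - X_{jm})' $$
When S is the unit matrix, the Mahalanobis distance equals the square of the Euclidean distance.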
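The four distances can be sketched in a few lines of plain Python. This is only an illustration: the function names are hypothetical, the Mahalanobis helper is restricted to m = 2 variables so that the covariance matrix can be inverted by hand, and it returns the quadratic form X'S⁻¹X without a square root, which is the convention under which the unit-matrix case equals the squared Euclidean distance.

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def euclidean(x, y):
    return minkowski(x, y, 2)   # Minkowski with q = 2

def manhattan(x, y):
    return minkowski(x, y, 1)   # Minkowski with q = 1 (absolute distance)

def mahalanobis_2d(x, y, s):
    """Quadratic form X' S^{-1} X for two variables; s is a 2x2 covariance matrix."""
    dx, dy = x[0] - y[0], x[1] - y[1]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    # Inverse of a 2x2 matrix written out explicitly.
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

# Hypothetical points and covariance matrices, for illustration only.
a, b = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[4.0, 0.0], [0.0, 1.0]]   # first variable has variance 4

print(euclidean(a, b), manhattan(a, b))
print(mahalanobis_2d(a, b, identity), mahalanobis_2d(a, b, scaled))
```

With the identity matrix the Mahalanobis value is the squared Euclidean distance. With `scaled`, the difference on the high-variance variable is down-weighted.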
All four distances apply to quantitative variables. Qualitative and ordinal variables must be quantified before these distances are used.

§2 Hierarchical Clustering Analysis
Hierarchical clustering analysis is one of the most commonly used methods for grouping similar samples or variables. The process is as follows:
1) At the beginning, each sample (or variable) is regarded as a cluster of its own, that is, each cluster contains only one sample (or variable). Then work out the similarity coefficient matrix among clusters. The matrix is made up of the similarity coefficients between samples (or variables) and is symmetric.
2) The two clusters with the maximum similarity coefficient (minimum distance or maximum correlation coefficient) are merged into a new cluster. Compute the similarity coefficient between the new cluster and the other clusters. Repeat step 2 until all of the samples (or variables) are merged into one cluster.

The calculation of similarity coefficient between clusters
Each step of hierarchical clustering requires the similarity coefficients among clusters. When each of the two clusters contains only one sample or variable, the similarity coefficient between the clusters equals that between the two samples or the two variables, computed as in §1.
When a cluster contains more than one sample or variable, there are many ways to compute the similarity coefficient. Five methods are listed below. $G_p$ and $G_q$ denote the two clusters, which contain $n_p$ and $n_q$ samples or variables respectively.
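1. The maximum similarity coefficient method
If clusters $G_p$ and $G_q$ contain $n_p$ and $n_q$ samples (or variables) respectively, there are altogether $n_p n_q$ similarity coefficients between the two clusters, but only the maximum is taken as the similarity coefficient of the two clusters:
$$ D_{pq} = \min_{i \in G_p,\, j \in G_q} d_{ij}, \qquad r_{pq} = \max_{i \in G_p,\, j \in G_q} r_{ij} $$
Note: the minimum distance also means the maximum similarity coefficient.
2. The minimum similarity coefficient method
The similarity coefficient between clusters is calculated as follows:
$$ D_{pq} = \max_{i \in G_p,\, j \in G_q} d_{ij}, \qquad r_{pq} = \min_{i \in G_p,\, j \in G_q} r_{ij} \qquad (19\text{-}8) $$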
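3. The center of gravity method (only used in sample clustering)
The center of gravity of a cluster is the vector of its index means, written $\bar{X}_p$ and $\bar{X}_q$ for clusters $G_p$ and $G_q$. The distance between the two clusters is the distance between their centers of gravity:
$$ D_{pq} = d_{\bar{X}_p \bar{X}_q} $$
4. Cluster equilibration method (only used in sample clustering)
Work out the average squared distance over all pairs of samples, one from each cluster:
$$ D_{pq}^2 = \frac{1}{n_p n_q} \sum_{i \in G_p} \sum_{j \in G_q} d_{ij}^2 $$
where $n_p$ and $n_q$ are the numbers of samples in the two clusters.
Cluster equilibration is one of the better methods in hierarchical clustering, because it makes full use of the information on the individuals within each cluster.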
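Methods 1 to 4 differ only in how the pairwise distances between two clusters are summarized. The sketch below shows this for sample clustering, assuming Euclidean distance between samples. All function names and the two toy clusters are hypothetical.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def min_distance(gp, gq):
    """Maximum similarity coefficient method: smallest pairwise distance."""
    return min(euclid(x, y) for x in gp for y in gq)

def max_distance(gp, gq):
    """Minimum similarity coefficient method: largest pairwise distance."""
    return max(euclid(x, y) for x in gp for y in gq)

def centroid(g):
    """Center of gravity: the mean of each variable within the cluster."""
    return [sum(col) / len(g) for col in zip(*g)]

def centroid_distance(gp, gq):
    """Center of gravity method: distance between the two cluster means."""
    return euclid(centroid(gp), centroid(gq))

def mean_squared_distance(gp, gq):
    """Cluster equilibration method: average squared pairwise distance (D^2)."""
    total = sum(euclid(x, y) ** 2 for x in gp for y in gq)
    return total / (len(gp) * len(gq))

# Two hypothetical clusters of two samples each, measured on two variables.
gp = [(0.0, 0.0), (0.0, 2.0)]
gq = [(3.0, 0.0), (3.0, 2.0)]

print(min_distance(gp, gq), max_distance(gp, gq))
print(centroid_distance(gp, gq), mean_squared_distance(gp, gq))
```

Note that `mean_squared_distance` returns the squared quantity, following the "average square distance" wording of method 4. Some software packages average the unsquared distances instead.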
5. Sum of squares of deviations method
Also called the Ward method; only used in sample clustering. It borrows the basic idea of analysis of variance: a reasonable classification makes the sum of squares of deviations within clusters small and that among clusters large. Suppose that the samples have been classified into g clusters, including $G_p$ and $G_q$. The sum of squares of deviations of the $n_p$ samples in cluster $G_p$ is
$$ L_p = \sum_{i=1}^{n_p} \sum_{j=1}^{m} (X_{ij} - \bar{X}_j)^2 $$
where $\bar{X}_j$ is the mean of $X_j$ within $G_p$. The merged sum of squares of deviations of all g clusters is $L^{g} = \sum_{p} L_p$. If $G_p$ and $G_q$ are merged, there will be g - 1 clusters.
The increment of the merged sum of squares of deviations is $D_{pq}^2 = L^{g-1} - L^{g}$, which is defined as the squared distance between the two clusters. Obviously, when each of the n samples forms a cluster of its own, the merged sum of squares of deviations is 0.

Sample 19-1 Four variables were measured on 3454 female adults: height (X1), length of legs (X2), waistline (X3) and chest circumference (X4). The correlation matrix has been worked out as follows:
Use hierarchical clustering to cluster the 4 indexes.
This is a case of R-type (index) clustering. We take the simple correlation coefficient as the similarity coefficient, and use the maximum similarity coefficient method to calculate the similarity coefficient among clusters.
The clustering procedure is as follows:
(1) Each index is regarded as a single cluster: G1 = {X1}, G2 = {X2}, G3 = {X3}, G4 = {X4}. There are altogether 4 clusters.
(2) Merge the two clusters with the maximum similarity coefficient into a new cluster. In this case, we merge G1 and G2 (similarity coefficient 0.852) as G5 = {X1, X2}. Calculate the similarity coefficients among G5, G3 and G4.
The similarity matrix among G3, G4 and G5:
(3) Merge G3 and G4 as G6 = {G3, G4}, because this time the similarity coefficient between G3 and G4 is the largest (0.732). Compute the similarity coefficient between G6 and G5.
(4) Lastly, G5 and G6 are merged into one cluster G7 = {G5, G6}, which in fact includes all the original indexes.
Draw the hierarchical dendrogram (picture 19-1) according to the process of clustering. As the picture indicates, it is best to classify the indexes into two clusters: {X1, X2} and {X3, X4}. That is, the length indexes form one cluster and the circumference indexes the other.
Picture 19-1 Hierarchical dendrogram with 4 indexes (leaves: height, length of legs, waistline, chest circumference)

Sample 19-2 Table 19-1 lists the means of energy expenditure and sugar expenditure of four athletic items from six athletes. In order to provide a corresponding dietary standard to improve performance, please cluster the athletic items using hierarchical clustering.

Table 19-1 Measured values of 4 athletic items
Athletic item                  Energy expenditure X1    Sugar expenditure X2   Standardized X1   Standardized X2
                               (joule/(minute·m²))      (%)
Weight-loading crouching G1    27.892                   61.42                   1.315             0.688
Pull-up G2                     23.475                   56.83                   0.174             0.088
Push-ups G3                    18.924                   45.13                  -1.001            -1.441
Sit-up G4                      20.913                   61.25                  -0.488             0.665
We choose the Minkowski distance in this sample, and use the minimum similarity coefficient method to calculate distances among clusters. To reduce the effect of the variables' units, the variables should be standardized before analysis:
$$ X_i' = \frac{X_i - \bar{X}_i}{S_i} $$
where $\bar{X}_i$ and $S_i$ are the sample mean and standard deviation of $X_i$. The transformed data are listed in table 19-1.

The clustering process:
(1) Compute the similarity coefficient matrix (i.e. distance matrix) of the 4 samples. The distance between weight-loading crouching and pull-ups can be worked out using formula (19-3).
Likewise, the distance between weight-loading crouching and push-ups can be computed in the same way. Lastly, work out the distance matrix.
(2) The distance between G2 and G4 is the minimum, so G2 and G4 are merged into a new cluster G5 = {G2, G4}. Compute the distances between G5 and the other clusters using the minimum similarity coefficient method according to formula (19-8).
The distance matrix of G1, G3 and G5:
(3) Merge G1 and G5 into a new cluster G6 = {G1, G5}. Compute the distance between G6 and G3.
(4) Lastly, merge G3 and G6 into G7 = {G3, G6}. All the athletic items have now been merged into one large cluster.
According to the process of clustering, draw the hierarchical dendrogram (chart 19-2). Judging from the dendrogram and subject-matter knowledge, the athletic items should be sorted into two clusters: {G1, G2, G4} and {G3}. Physical energy expenditure in weight-loading crouching, pull-ups and sit-ups is much higher, so an improved dietary standard might be required for those items during training.
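The whole of Sample 19-2 can be retraced with a short script. The sketch below makes three assumptions that the text does not spell out: the distance between samples is Euclidean (the Minkowski distance with q = 2), the standard deviation uses the n - 1 denominator, and the minimum similarity coefficient method is read as "largest pairwise distance between two clusters". The function names are hypothetical.

```python
import math

# Raw means from Table 19-1: (energy expenditure X1, sugar expenditure X2).
raw = {
    "G1": (27.892, 61.42),   # weight-loading crouching
    "G2": (23.475, 56.83),   # pull-up
    "G3": (18.924, 45.13),   # push-ups
    "G4": (20.913, 61.25),   # sit-up
}

def standardize(data):
    """Subtract each variable's mean and divide by its sample SD (n - 1)."""
    cols = list(zip(*data.values()))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return {k: tuple((v - m) / s for v, m, s in zip(vals, means, sds))
            for k, vals in data.items()}

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cluster_max_distance(points):
    """Merge the two closest clusters repeatedly; cluster distance = largest pairwise distance."""
    clusters = {name: [name] for name in points}
    merges = []
    next_id = len(points) + 1
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                a, b = names[i], names[j]
                d = max(euclid(points[p], points[q])
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        new = f"G{next_id}"
        next_id += 1
        clusters[new] = clusters.pop(a) + clusters.pop(b)
        merges.append((new, a, b, d))
    return merges

z = standardize(raw)
merges = cluster_max_distance(z)
for new, a, b, d in merges:
    print(f"{new} = {{{a}, {b}}} merged at distance {d:.3f}")
```

Under these assumptions the standardized values agree with the last two columns of Table 19-1 to three decimals, and the merge order is G5 = {G2, G4}, then G6 = {G1, G5}, with the final merge joining the remaining item G3 to G6.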
Analysis of clustering examples
Different definitions of the similarity coefficient between samples and between clusters lead to different clustering results. Subject-matter expertise, as well as the clustering method, is important for interpreting a clustering analysis.
Sample 19-3 Twenty-seven petroleum pitch workers and pyro-furnace men were surveyed about their age, length of service and smoking. In addition, serum P21, serum P53, peripheral blood lymphocyte SCE, the number of chromosomal aberrations and the number of cells with chromosomal aberrations were measured in these workers (table 19-3). (P21 multiple = P21 detection value / mean P21 of the control group.) Please classify the 27 workers using a suitable hierarchical clustering method.

Table 19-3 Results of bio-marker detection and clustering analysis of petroleum pitch workers and pyro-furnace men
No.  Age  Length of  Smoking          Serum  P21       P53   SCE    No. of chromosome  No. of cells with       Clustering
          service    (cigarettes/d)   P21    multiple               aberrations        chromosome aberrations  result
1    …    …          …                …      …68       0.35  8.11   4                  4                       …
2    35   12         20               3510   2.76      1.43  6.84   3                  3                       1
3    52   25         20               2784   2.19      0.54  4.11   3                  3                       1
4    32   7          20               2451   1.93      0.47  11.45  9                  6                       1
5    38   2          20               3247   2.56      0.80  11.68  5                  5                       1
6    51   31         30               3710   2.92      0.37  11.60  2                  2                       1
7    40   9          10               3194   2.51      0.40  11.40  5                  5                       1
8    34   17         20               4658   3.67      0.46  11.35  3                  3                       1
9    50   29         0                5019   3.95      0.47  13.45  10                 8                       1
10   42   20         20               7482   5.89      0.12  13.11  0                  0                       2
11   57   30         15               3800   2.99      0.19  10.76  2                  2                       1
12   36   15         20               2478   1.95      0.25  10.00  0                  0                       1
13   37   1          20               3827   3.01      0.82  10.50  4                  4                       1
14   52   32         0                2984   2.35      0.16  11.15  3                  3                       1
15   52   32         10               3749   2.95      0.72  11.45  11                 10                      1
16   42   27         30               4941   3.89      0.73  13.80  7                  6                       1
17   44   27         20               3948   3.11      0.33  13.65  16                 14                      1
18   40   21         5                3360   2.64      0.37  11.40  0                  0                       1
19   38   21         5                2936   2.31      0.69  11.40  1                  1                       1
20   44   27         20               6851   5.39      0.99  12.28  7                  6                       2
21   43   27         0                3926   3.09      0.47  11.95  0                  0                       1
22   26   10         3                4381   3.45      0.52  11.80  7                  5                       1
23   37   18         20               7142   5.62      0.85  11.81  5                  5                       2
24   28   9          20               2612   2.06      0.37  11.65  1                  1                       1
25   25   9          30               2638   2.08      0.78  12.25  1                  1                       1
26   34   14         20               4322   3.40      0.41  15.00  5                  5                       1
27   50   32         20               2862   2.25      0.69  8.80   2                  2                       1

This example applies the minimum similarity coefficient method based on Euclidean distance, the cluster equilibration method and the sum of squares of deviations method to cluster the data. The results are shown in chart 19-3, chart 19-4 and chart 19-5. All the variables were standardized before analysis.
Chart 19-3 Hierarchical dendrogram of the 27 petroleum pitch workers and pyro-furnace men using the minimum similarity coefficient method
Chart 19-4 Hierarchical dendrogram of the 27 petroleum pitch workers and pyro-furnace men using the cluster equilibration method
Chart 19-5 Hierarchical dendrogram of the 27 petroleum pitch workers and pyro-furnace men using the sum of squares of deviations method
The outcomes of the three clustering methods are not the same, which shows that different methods differ in efficiency. The differences are more distinct when there are more variables, so it is better to select effective variables, such as P21 and P53 in this example, before clustering analysis. More information can be obtained by reading the dendrograms.
According to subject-matter expertise, the outcome of the cluster equilibration method is the more reasonable. Its classification is entered in the last column of table 19-3. Workers {10, 20, 23} form one class; the others form another. Researchers found that workers {10, 20, 23} are at high risk of cancer. In the dendrogram of the sum of squares of deviations method, workers {10, 20, 23, 8, 16, 26} are clustered together, suggesting that workers 8, 16 and 26 may be at high risk too.

Dynamic clustering
If there are too many samples to classify, hierarchical clustering analysis requires a lot of space to store the similarity coefficient matrix and is quite inefficient. What is more, a sample cannot be moved once it has been classified. Because of these shortcomings, statisticians put forward dynamic clustering, which overcomes the inefficiency and adjusts the classification as clustering proceeds.
The principle of dynamic clustering analysis is: first, select several representative samples, called cohesion points, as the core of each class; second, classify the other samples and adjust the core of each class until the classification is reasonable. The most common method of dynamic clustering analysis is k-means.
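The principle just described (pick cohesion points, assign every sample to the nearest one, then adjust the cores) can be sketched as a basic k-means loop. This is an illustrative version only: the function name `kmeans`, the choice of Euclidean distance, the stopping rule (cores no longer move) and the toy data are assumptions, not the procedure of any particular software package.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, centers, max_iter=100):
    """Basic k-means: assign points to the nearest core, then move each core to its class mean."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Classification step: each sample joins the class of its nearest core.
        labels = [min(range(len(centers)), key=lambda k: euclid(p, centers[k]))
                  for p in points]
        # Adjustment step: each core becomes the mean of the samples in its class.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[k])   # an empty class keeps its old core
        if new_centers == centers:               # cores stopped moving: finished
            break
        centers = new_centers
    return labels, centers

# Hypothetical data: two well-separated groups of three samples each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
initial_cores = [points[0], points[3]]           # cohesion points chosen by hand
labels, centers = kmeans(points, initial_cores)
print(labels)
print(centers)
```

Unlike hierarchical clustering, only the k cores and the class labels are stored, not the full n × n similarity matrix, and a sample may change class from one pass to the next.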