




Jiafeng Guo
Unsupervised Learning: Clustering
Outline
- Introduction
- Applications of Clustering
- Distance Functions
- Evaluation Metrics
- Clustering Algorithms
  - K-means
  - Gaussian Mixture Models and EM Algorithm
  - K-medoids
  - Hierarchical Clustering
  - Density-based Clustering
Supervised vs. Unsupervised Learning
Why do Unsupervised Learning?
- Raw data is cheap. Labeled data is expensive.
- Save memory/computation.
- Reduce noise in high-dimensional data.
- Useful in exploratory data analysis.
- Often a pre-processing step for supervised learning.
Cluster Analysis
- Discover groups such that samples within a group are more similar to each other than samples across groups.
Unobserved Variables
- A variable can be unobserved (latent) because:
  - It is an imaginary quantity meant to provide a simplified and abstract view of the data generation process. E.g., speech recognition models, mixture models.
  - It is a real-world object and/or phenomenon, but difficult or impossible to measure. E.g., the temperature of a star, causes of a disease, evolutionary ancestors.
  - It is a real-world object and/or phenomenon, but sometimes was not measured because of faulty sensors, or was measured through a noisy channel. E.g., traffic radio, aircraft signal on a radar screen.
- Discrete latent variables can be used to partition/cluster data into sub-groups.
- Continuous latent variables can be used for dimensionality reduction.
Applications of Clustering
- Image Segmentation (/pff/segment)
- Human Population (Eran Elhaik et al., Nature)
- Clustering Graphs (Newman, 2008)
- Vector quantization to compress images (Bishop, PRML)
Ingredients of cluster analysis
- A dissimilarity/distance function between samples.
- A loss function to evaluate clusters.
- An algorithm that optimizes this loss function.
Dissimilarity/Distance Function
- Choice of dissimilarity/distance function is application dependent.
- Need to consider the type of features: categorical, ordinal or quantitative.
- Possible to learn dissimilarity from data.
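To make the dependence on feature type concrete, here is a minimal sketch of one possible dissimilarity per feature type. The choices are assumptions made for illustration (Minkowski distance for quantitative features, simple mismatch rate for categorical ones, rank rescaling for ordinal ones), and the function names are hypothetical.

```python
def minkowski(x, y, p=2):
    """Minkowski distance between two quantitative vectors.

    p=2 gives the Euclidean distance, p=1 the Manhattan distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def simple_matching(x, y):
    """Fraction of categorical attributes on which two samples disagree."""
    if len(x) != len(y):
        raise ValueError("samples must have the same number of attributes")
    return sum(a != b for a, b in zip(x, y)) / len(x)


def ordinal_to_unit(rank, n_levels):
    """Assumption: map an ordinal rank in 1..n_levels onto [0, 1] so it can
    be treated like a quantitative feature afterwards."""
    return (rank - 1) / (n_levels - 1)


print(minkowski((0, 0), (3, 4)))                       # Euclidean
print(minkowski((0, 0), (3, 4), p=1))                  # Manhattan
print(simple_matching(("red", "S"), ("red", "L")))     # one of two attributes differs
```

Which of these is appropriate depends on the application. The sketch only shows that each feature type needs its own notion of "difference".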
Distance Function
Standardization
- Figure panels: without standardization, with standardization.
Standardization not always helpful
- Figure panels: without standardization, with standardization.
Evaluation of Clustering
- Performance evaluation of clustering: validity index.
- Evaluation metrics:
  - Reference model (external index): compare the clustering with a reference.
  - Non-reference model (internal index): measure intra-cluster and inter-cluster distances.
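Reference Model
Each of the m(m-1)/2 sample pairs falls into exactly one of four cells, so a + b + c + d = m(m-1)/2:

                        reference: same    reference: not same
clustering: same               a                   b
clustering: not same           c                   d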
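The pair counts a, b, c, d are the raw material for external indices. The sketch below counts them and derives three standard pair-counting indices. It assumes that the Jaccard coefficient, the Fowlkes-Mallows index and the Rand index are the indices of interest. The function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pair_counts(clustering, reference):
    """Count sample pairs by whether they share a cluster / a reference class."""
    a = b = c = d = 0
    for i, j in combinations(range(len(clustering)), 2):
        same_cluster = clustering[i] == clustering[j]
        same_reference = reference[i] == reference[j]
        if same_cluster and same_reference:
            a += 1
        elif same_cluster:
            b += 1
        elif same_reference:
            c += 1
        else:
            d += 1
    return a, b, c, d


def external_indices(clustering, reference):
    """Jaccard, Fowlkes-Mallows and Rand indices; all lie in [0, 1], larger is better."""
    a, b, c, d = pair_counts(clustering, reference)

    def ratio(num, den):
        # Assumption: degenerate cases with an empty denominator score 0.
        return num / den if den else 0.0

    return {
        "jaccard": ratio(a, a + b + c),
        "fowlkes_mallows": sqrt(ratio(a, a + b) * ratio(a, a + c)),
        "rand": ratio(a + d, a + b + c + d),
    }


print(external_indices([0, 0, 1, 1, 1], [0, 0, 0, 1, 1]))
```

In this example the five samples give 10 pairs with a = b = c = 2 and d = 4, so the Rand index is 6/10.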
External Index
Non-reference model
- Only having the result of clustering, how can we evaluate it?
- Intra-cluster similarity: larger is better.
- Inter-cluster similarity: smaller is better.
Internal Index
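An internal index turns the two criteria (high intra-cluster similarity, low inter-cluster similarity) into a single number. As a minimal sketch, assuming Euclidean distance and taking the Dunn index as one common example of such an index: it divides the smallest distance between two clusters by the largest cluster diameter, so larger is better.

```python
from itertools import combinations
from math import dist


def diameter(cluster):
    """Largest distance between two points of the same cluster."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def min_between(c1, c2):
    """Smallest distance between a point of c1 and a point of c2."""
    return min(dist(p, q) for p in c1 for q in c2)


def dunn_index(clusters):
    """Dunn index. Assumes at least two clusters and a non-zero largest diameter."""
    largest_diameter = max(diameter(c) for c in clusters)
    smallest_gap = min(min_between(c1, c2) for c1, c2 in combinations(clusters, 2))
    return smallest_gap / largest_diameter


tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]   # compact, well separated
loose = [[(0, 0), (10, 0)], [(0, 1), (10, 1)]]   # stretched, interleaved
print(dunn_index(tight), dunn_index(loose))
```

The same four points score very differently depending on how they are grouped, which is exactly what an internal index is meant to detect without any reference labels.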
K-means: Idea
K-means: minimizing the loss function
- How do we minimize J w.r.t. (r_ik, μ_k)?
- Chicken and egg problem:
  - If prototypes are known, we can assign responsibilities.
  - If responsibilities are known, we can compute prototypes.
- We use an iterative procedure.
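The iterative procedure can be sketched as two alternating steps. This is a minimal sketch which assumes that J is the usual k-means objective, the sum of squared Euclidean distances from each point to its assigned prototype. The function names and the `init_prototypes` parameter are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(points, init_prototypes, n_iter=100):
    """Alternate between assigning responsibilities and recomputing prototypes.

    Returns the prototypes, the hard assignment of each point, and the value
    of J after every iteration.
    """
    prototypes = [tuple(p) for p in init_prototypes]
    k = len(prototypes)
    assign, history = None, []
    for _ in range(n_iter):
        # Prototypes fixed: give each point to its nearest prototype.
        new_assign = [
            min(range(k), key=lambda j: sq_dist(x, prototypes[j])) for x in points
        ]
        # Responsibilities fixed: move each prototype to the mean of its points.
        for j in range(k):
            members = [x for x, a in zip(points, new_assign) if a == j]
            if members:  # assumption: an empty cluster keeps its old prototype
                prototypes[j] = tuple(sum(col) / len(members) for col in zip(*members))
        history.append(sum(sq_dist(x, prototypes[a]) for x, a in zip(points, new_assign)))
        converged = new_assign == assign
        assign = new_assign
        if converged:
            break
    return prototypes, assign, history


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
prototypes, assign, history = kmeans(points, init_prototypes=[(0, 0), (0, 1)])
print(prototypes, assign, history)
```

Each step can only lower J or leave it unchanged, so the recorded values of J never increase. The result still depends on the initial prototypes.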
K-means Algorithms
How do we initialize K-means?
- Some heuristics:
  - Randomly pick K data points as prototypes.
  - Pick prototype i+1 to be the farthest from prototypes {1, 2, …, i}.
Evolution of k-Means
- (a) original dataset; (b) random initialization; (c-f) illustration of running two iterations of k-means. (Images from Michael Jordan)
Loss function J after each iteration
Convergence of K-means
- k-means is exactly coordinate descent on the reconstruction error E (the loss function J).
- E decreases monotonically (it never increases) and its value converges, and so do the clustering results.
- It is possible for k-means to oscillate between a few different clusterings, but this almost never happens in practice.
- E is non-convex, so coordinate descent on E is not guaranteed to converge to the global minimum.
- One common remedy is to run k-means many times and pick the best result.
How to choose K?
- Like choosing K in kNN.
- The loss function J generally decreases with K.
- Gap statistic.
- Cross-validation: Partition data into two sets. Estimate prototypes on one and use these to compute the loss function on the other.
- Stability of clusters: Measure the change in the clusters obtained by resampling or splitting the data.
- Non-parametric approach: Place a prior on K. More details in the Bayesian non-parametric lecture.
Limitations of K-means
- Hard assignments of data points to clusters can cause a small perturbation to a data point to flip it to another cluster. Solution: GMM.
- Assumes spherical clusters and equal probabilities for each cluster. Solution: GMM.
- Clusters change arbitrarily for different K. Solution: Hierarchical clustering.
- Sensitive to outliers. Solution: Use a robust loss function.
- Works poorly on non-convex clusters. Solution: Spectral clustering.
Multivariate Normal Distribution
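For reference, the density of the multivariate normal distribution in its standard form. The symbols are assumptions following common notation: D is the dimension of x, μ the mean vector and Σ the covariance matrix.

```latex
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2}\,\lvert \boldsymbol{\Sigma} \rvert^{1/2}}
    \exp\!\left( -\tfrac{1}{2}\,
      (\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})
    \right)
```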
Gaussian Mixture Model
The Learning is Hard
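A sketch of why learning is hard, written in the standard mixture notation. The symbols are assumptions, following Bishop, PRML: π_k are the mixing weights, μ_k and Σ_k the mean and covariance of component k, 𝒩 the multivariate normal density, and x_1, …, x_N the data.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \,
  \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \,
      \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k) \right\}
```

The sum over components sits inside the logarithm, so setting the derivatives of the log likelihood to zero does not give a closed-form solution for the parameters.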
How to Solve it?
The Expectation-Maximization (EM) Algorithm
The EM Algorithm in General
- A very general treatment of the EM algorithm.
- In the process, provide a proof that the EM algorithm derived heuristically before for Gaussian mixtures does indeed maximize the likelihood function.
- This discussion will also form the basis for the derivation of the variational inference framework.
EM: Variational Viewpoint
- Maximizing over q(Z) would give the true posterior.
- E Step
- M Step
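The variational viewpoint can be summarized by the standard decomposition of the log likelihood. The notation is an assumption, following Bishop, PRML: X is the observed data, Z the latent variables, θ the parameters and q(Z) any distribution over Z.

```latex
\ln p(\mathbf{X} \mid \boldsymbol{\theta})
  = \mathcal{L}(q, \boldsymbol{\theta}) + \mathrm{KL}(q \,\|\, p)

\mathcal{L}(q, \boldsymbol{\theta})
  = \sum_{\mathbf{Z}} q(\mathbf{Z})
    \ln \frac{p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta})}{q(\mathbf{Z})},
\qquad
\mathrm{KL}(q \,\|\, p)
  = - \sum_{\mathbf{Z}} q(\mathbf{Z})
    \ln \frac{p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta})}{q(\mathbf{Z})}

\text{E step: } q(\mathbf{Z}) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}^{\text{old}})
\qquad
\text{M step: } \boldsymbol{\theta}^{\text{new}}
  = \arg\max_{\boldsymbol{\theta}} \mathcal{L}(q, \boldsymbol{\theta})
```

Since KL(q‖p) ≥ 0, the term 𝓛(q, θ) is a lower bound on the log likelihood. The E step makes the bound tight by setting the KL term to zero, and the M step then raises the bound, so the log likelihood cannot decrease.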
The EM Algorithm
- Initial Configuration
- E-Step
- M-Step
Convergence
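The E-step and M-step can be sketched for the simplest case, a one-dimensional Gaussian mixture. This is a minimal sketch rather than a general implementation: the function names, the fixed number of iterations and the small variance floor are assumptions made for illustration.

```python
from math import exp, log, pi, sqrt


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return exp(-((x - mean) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def em_gmm_1d(xs, means, variances, weights, n_iter=50):
    """EM for a 1-D Gaussian mixture, started from the given initial configuration.

    Returns the fitted parameters and the log likelihood recorded at the start
    of every iteration.
    """
    means, variances, weights = list(means), list(variances), list(weights)
    k, n = len(means), len(xs)
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities (soft assignments) under current parameters.
        resp, log_lik = [], 0.0
        for x in xs:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            log_lik += log(total)
            resp.append([p / total for p in joint])
        log_liks.append(log_lik)
        # M-step: re-estimate parameters from the responsibility-weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # assumption: floor to avoid collapse
            weights[j] = nj / n
    return means, variances, weights, log_liks


xs = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
means, variances, weights, log_liks = em_gmm_1d(xs, [2.0, 7.0], [4.0, 4.0], [0.5, 0.5])
print(means, weights, log_liks[-1])
```

The recorded log likelihood gives a simple convergence check. It should never decrease, and iteration can stop once it no longer changes noticeably.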
GMM: Relation to K-means
Illustration
K-means vs GMM
- K-means:
  - Loss function: minimize sum of squared distance.
  - Hard assignment of points to clusters.
  - Assumes spherical clusters with equal probability of a cluster.
- GMM:
  - Minimize negative log likelihood.
  - Soft assignment of points to clusters.
  - Can be used for non-spherical clusters with different probabilities.
K-medoids
- The squared Euclidean distance loss function of K-means is not robust.
- Only the dissimilarity matrix may be given.
- Attributes may not be quantitative.
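Because K-medoids restricts prototypes to actual data points, it only needs a dissimilarity matrix. The sketch below assumes the simple alternating variant: assign each point to its nearest medoid, then pick as new medoid the cluster member with the smallest total dissimilarity to the other members. Other variants such as PAM swap medoids differently. The names `k_medoids` and `D` are hypothetical.

```python
def k_medoids(D, init_medoids, n_iter=100):
    """Alternating K-medoids on a dissimilarity matrix D (list of lists).

    Medoids are indices into D; no coordinates or means are ever needed.
    """
    n = len(D)
    medoids = list(init_medoids)

    def assign_to(current):
        # Each point goes to the cluster of its least dissimilar medoid.
        return [min(range(len(current)), key=lambda j: D[i][current[j]]) for i in range(n)]

    for _ in range(n_iter):
        assign = assign_to(medoids)
        new_medoids = []
        for j in range(len(medoids)):
            members = [i for i in range(n) if assign[i] == j]
            if not members:  # assumption: an empty cluster keeps its medoid
                new_medoids.append(medoids[j])
                continue
            # New medoid: member with the smallest total dissimilarity to the rest.
            new_medoids.append(min(members, key=lambda c: sum(D[c][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign_to(medoids)


values = [0, 1, 2, 10, 11, 12]
D = [[abs(a - b) for b in values] for a in values]  # any dissimilarity works here
print(k_medoids(D, init_medoids=[0, 1]))
```

Using absolute rather than squared differences in `D` is one way to obtain a loss that is less sensitive to outliers than the K-means loss.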
Hierarchical Clustering
- Organize the clusters in a hierarchical way.
- Produces a rooted binary tree (dendrogram).
- Bottom-up (agglomerative): Recursively merge the two groups with the smallest between-cluster dissimilarity.
- Top-down (divisive): Recursively split the least-coherent (e.g. largest diameter) cluster.
- Users can then choose a cut through the hierarchy to represent the most natural division into clusters (e.g. where intergroup dissimilarity exceeds some threshold).
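The bottom-up strategy can be sketched as follows. This minimal sketch assumes Euclidean distance and single linkage (the dissimilarity between two clusters is the distance between their closest members). Other linkages are equally valid, and the function names are hypothetical.

```python
from math import dist


def single_linkage(c1, c2, points):
    """Between-cluster dissimilarity: distance of the closest pair of members."""
    return min(dist(points[i], points[j]) for i in c1 for j in c2)


def agglomerative(points, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters (lists of point indices) and the merge history,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        pairs = [
            (single_linkage(clusters[a], clusters[b], points), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        d, a, b = min(pairs)
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


points = [(0, 0), (0, 1), (5, 0), (5, 1), (20, 0)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
for left, right, d in merges:
    print(left, right, d)
```

Stopping at `n_clusters` is one way to cut the hierarchy. Stopping once the merge distance `d` exceeds a threshold would be the alternative mentioned in the text.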
DBSCAN [1]
[1] Ester et al. A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD). 1996.
- Two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o.
- A cluster satisfies two properties:
  - All points within the cluster are mutually density-connected.
  - If a point is density-reachable from any point of the cluster, it is part of the cluster as well.
Analysis of DBSCAN
- Advantages:
  - No need to specify the number of clusters.
  - Finds clusters of arbitrary shape.
  - Robust to outliers.
- Disadvantages:
  - Difficult parameter selection.
  - Not suitable for data sets with large differences in densities.
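A minimal sketch of DBSCAN, assuming Euclidean distance and the two usual parameters: a neighbourhood radius `eps` and a minimum neighbourhood size `min_pts`, where a point counts as its own neighbour. These parameter names and the brute-force neighbour search are assumptions made for illustration.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    # Brute-force eps-neighbourhoods; each point is its own neighbour.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps] for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # i is a core point: grow a new cluster from everything reachable from it.
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # only core points extend the cluster
                frontier.extend(neighbors[j])
        cluster += 1
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

The number of clusters is never passed in. It follows from `eps` and `min_pts`, which is also why choosing these two parameters well is the difficult part.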
Mean-Shift Clustering [2]
[2] Fukunaga, Keinosuke; Larry D. Hostetler. The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition. IEEE Transactions on Information Theory 21(1): 32–40. Jan. 1975. Cheng, Yizong. Me