版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
聚类与树Clustering2024/9/91主要内容Microarrays(微阵列)HierarchicalClustering(层次聚类或系统聚类)K-MeansClustering(K-均值聚类)2024/9/92ApplicationsofClusteringViewingandanalyzingvastamountsofbiologicaldataasawholesetcanbeperplexingItiseasiertointerpretthedataiftheyarepartitionedintoclusterscombiningsimilardatapoints.2024/9/93InferringGeneFunctionalityResearcherswanttoknowthefunctionsofnewlysequencedgenesSimplycomparingthenewgenesequencestoknownDNAsequencesoftendoesnotgiveawaythefunctionofgeneFor40%ofsequencedgenes,functionalitycannotbeascertainedbyonlycomparingtosequencesofotherknowngenesMicroarraysallowbiologiststoinfergenefunctionevenwhensequencesimilarityaloneisinsufficienttoinferfunction.2024/9/94MicroarraysandExpressionAnalysisMicroarraysmeasuretheactivity(expressionlevel)ofthegenesundervaryingconditions/timepointsExpressionlevelisestimatedbymeasuringtheamountofmRNAforthatparticulargeneAgeneisactiveifitisbeingtranscribedMoremRNAusuallyindicatesmoregeneactivity2024/9/95MicroarrayExperimentsProducecDNAfrommRNA(DNAismorestable)AttachphosphortocDNAtoseewhenaparticulargeneisexpressedDifferentcolorphosphorsareavailabletocomparemanysamplesatonceHybridizecDNAoverthemicroarrayScanthemicroarraywithaphosphor-illuminatinglaserIlluminationrevealstranscribedgenesScanmicroarraymultipletimesforthedifferentcolorphosphor’s2024/9/96MicroarrayExperiments(con’t)PhosphorscanbeaddedhereinsteadTheninsteadofstaining,laserilluminationcanbeused2024/9/97UsingMicroarraysEachboxrepresentsonegene’sexpressionovertime
TrackthesampleoveraperiodoftimetoseegeneexpressionovertimeTracktwodifferentsamplesunderthesameconditionstoseethedifferenceingeneexpressions2024/9/98UsingMicroarrays(cont’d)Green:expressedonlyfromcontrolRed:expressedonlyfromexperimentalcellYellow:equallyexpressedinbothsamplesBlack:NOTexpressedineithercontrolorexperimentalcells2024/9/99MicroarrayDataMicroarraydataareusuallytransformedintoanintensitymatrix(below)Theintensitymatrixallowsbiologiststomakecorrelationsbetweendiferentgenes(eveniftheyaredissimilar)andtounderstandhowgenesfunctionsmightberelatedTime:TimeXTimeYTimeZGene110810Gene21009Gene348.63Gene4783Gene5123Intensity(expressionlevel)ofgeneatmeasuredtime2024/9/910MicroarrayData-REVISION-showinthematrixwhichgenesaresimilarandwhicharenot.Microarraydataareusuallytransformedintoanintensitymatrix(below)Theintensitymatrixallowsbiologiststomakecorrelationsbetweendiferentgenes(eveniftheyaredissimilar)andtounderstandhowgenesfunctionsmightberelatedClusteringcomesintoplayTime:TimeXTimeYTimeZGene110810Gene21009Gene348.63Gene4783Gene5123Intensity(expressionlevel)ofgeneatmeasuredtime2024/9/911ClusteringofMicroarrayDataPloteachdatumasapointinN-dimensionalspaceMakeadistancematrixforthedistancebetweeneverytwogenepointsintheN-dimensionalspaceGeneswithasmalldistancesharethesameexpressioncharacteristicsandmightbefunctionallyrelatedorsimilar.Clusteringrevealgroupsoffunctionallyrelatedgenes2024/9/912ClusteringofMicroarrayData(cont’d)Clusters2024/9/913HomogeneityandSeparationPrinciplesHomogeneity:ElementswithinaclusterareclosetoeachotherSeparation:Elementsindifferentclustersarefurtherapartfromeachother…clusteringisnotaneasytask!Giventhesepointsaclusteringalgorithmmightmaketwodistinctclustersasfollows2024/9/914BadClusteringThisclusteringviolatesbothHomogeneityandSeparationprinciplesClosedistancesfrompointsinseparateclustersFardistancesfrompointsinthesamecluster2024/9/915GoodClusteringThisclusteringsatisfiesboth
HomogeneityandSeparationprinciples2024/9/916ClusteringTechniquesAgglomerative:Startwitheveryelementinitsowncluster,anditerativelyjoinclusterstogetherDivisive:StartwithoneclusteranditerativelydivideitintosmallerclustersHierarchical:Organizeelementsintoatree,leavesrepresentgenesandthelengthofthepathesbetweenleavesrepresentsthedistancesbetweengenes.Similargenesliewithinthesamesubtrees2024/9/917HierarchicalClustering2024/9/918HierarchicalClustering:Example2024/9/919HierarchicalClustering:Example2024/9/920HierarchicalClustering:Example2024/9/921HierarchicalClustering:Example2024/9/922HierarchicalClustering:Example2024/9/923HierarchicalClustering(cont’d)HierarchicalClusteringisoftenusedtorevealevolutionaryhistory2024/9/924HierarchicalClusteringAlgorithmHierarchicalClustering(d
,n)FormnclusterseachwithoneelementConstructagraphTbyassigningonevertextoeachclusterwhilethereismorethanoneclusterFindthetwoclosestclustersC1andC2
MergeC1andC2intonewclusterCwith|C1|+|C2|elementsComputedistancefromCtoallotherclustersAddanewvertexCtoTandconnecttoverticesC1andC2RemoverowsandcolumnsofdcorrespondingtoC1andC2Addarowandcolumntod
corrspondingtothenewclusterC
returnTThealgorithmtakesanxndistancematrixdofpairwisedistancesbetweenpointsasaninput.2024/9/925HierarchicalClusteringAlgorithmHierarchicalClustering(d
,n)FormnclusterseachwithoneelementConstructagraphTbyassigningonevertextoeachclusterwhilethereismorethanoneclusterFindthetwoclosestclustersC1andC2
MergeC1andC2intonewclusterCwith|C1|+|C2|elements
ComputedistancefromCtoallotherclustersAddanewvertexCtoTandconnecttoverticesC1andC2RemoverowsandcolumnsofdcorrespondingtoC1andC2Addarowandcolumntod
corrspondingtothenewclusterC
returnTDifferentwaystodefinedistancesbetweenclustersmayleadtodifferentclusterings2024/9/926HierarchicalClustering:RecomputingDistances
dmin(C,C*)=mind(x,y)
forallelementsxinCandyinC*Distancebetweentwoclustersisthesmallestdistancebetweenanypairoftheirelements
davg(C,C*)=(1/|C*||C|)∑d(x,y)
forallelementsxinCandyinC*Distancebetweentwoclustersistheaveragedistancebetweenallpairsoftheirelements2024/9/927系统聚类例:微阵列数据2024/9/928评估表达模式的相似性两行数据之间的相似性或者距离如何量化。欧几里德距离。采用pearson相关系数r(-1,1)。如果两个基因之间r为1,说明两个数据表达模式吻合得很好如果两个基因之间r为-1,也说明两个数据表达模式吻合得很好(一上升,一下降)r=0,则说明表达模式之间没什么相关性2024/9/929数据标准化计算第2个和第10个基因的平均值和标准方差减去平均值,然后除以标准方差,得到每行的标准化数据2024/9/930求pearson相关系数求经过标准化以后的两向量的内积,再除以元素个数2024/9/931分析基因2,11与基因6,10之间表达比值正好各自相反,因此相关系数r(2,11),r(6,10)应该是-1。2024/9/932数据标准化以后基因两两之间的相关系数2024/9/933根据相关系数进行聚类(层次聚类法)1,计算所有元素两两之间的距离(相关系数),创建一个距离矩阵。每个元素就是一个类,仅仅包含它自己。2,寻找距离最小的两个类(相关系数最大)。3,将这两个类合并为一个新的类。新的类替换这两个类,重新计算所有的距离,修改相似性矩阵。4,重复2,3步骤直到所有的类聚集为一个类。2024/9/934迭代过程首先会发现r(5,10)=1,然后把基因5和10归为一类,然后需要重新计算距离矩阵。2024/9/935聚类图2024/9/936主要内容Microarrays(微阵列)HierarchicalClustering(层次聚类或系统聚类)K-MeansClustering(K-均值聚类)2024/9/937SquaredErrorDistortion(平方误差失真)Givenadatapoint
vandasetofpointsX,definethedistancefromvtoX
d(v,X)asthe(Eucledian)distancefromvtotheclosestpointfromX.Givenasetofndatapoints
V={v1…vn}andasetofkpointsX,definetheSquaredErrorDistortion
d(V,X)=∑d(vi,X)2/n1<
i
<
n
2024/9/938K-MeansClusteringProblem:FormulationInput:Aset,V,consistingofnpointsandaparameterkOutput:AsetXconsistingofkpoints(clustercenters)thatminimizesthesquarederrordistortiond(V,X)overallpossiblechoicesofX。2024/9/9391-MeansClusteringProblem:anEasyCaseInput:Aset,V,consistingofnpointsOutput:Asinglepointsx(clustercenter)thatminimizesthesquarederrordistortiond(V,x)overallpossiblechoicesofx。
2024/9/9401-MeansClusteringProblem:anEasyCaseInput:Aset,V,consistingofnpointsOutput:Asinglepointsx(clustercenter)thatminimizesthesquarederrordistortiond(V,x)overallpossiblechoicesofx
1-MeansClusteringproblemiseasy.However,itbecomesverydifficult(NP-complete)formorethanonecenter.AnefficientheuristicmethodforK-MeansclusteringistheLloydalgorithm
2024/9/941K-MeansClustering:LloydAlgorithmLloydAlgorithmArbitrarilyassignthekclustercenterswhiletheclustercenterskeepchangingAssigneachdatapointtotheclusterCi correspondingtotheclosestcluster representative(center)(1≤i≤k)Aftertheassignmentofalldatapoints, computenewclusterrepresentatives
accordingtothecenterofgravityofeach cluster,thatis,thenewclusterrepresentativeisforallvinCforeveryclusterC
*Thismayleadtomerelyalocallyoptimalclustering.2024/9/942x1x2x32024/9/943x1x2x32024/9/944x1x2x32024/9/945x1x2x32024/9/946ConservativeK-MeansAlgorithmLloydalgorithmisfastbutineachiterationitmovesmanydatapoints,notnecessarilycausingbetterconvergence.AmoreconservativemethodwouldbetomoveonepointatatimeonlyifitimprovestheoverallclusteringcostThesmallertheclusteringcostof
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025版学校节日庆典活动承包运营合同3篇
- 2025年度个人商标权抵押担保许可协议书4篇
- 二零二五年度高速公路边坡草皮修复合同模板3篇
- 网络素养在学生职业发展中的重要性
- 二零二五年度车辆牌照租赁数据共享协议4篇
- 当代企业网络安全风险评估与防范措施汇报
- 教学资料数字化的应用场景及案例分析
- 二零二五年度承台基坑开挖施工劳务分包合同施工人员资质要求4篇
- 数海导航小学数学知识框架构建
- 安全知识教育在儿童成长中的角色
- 体育概论(第二版)课件第三章体育目的
- 无人驾驶航空器安全操作理论复习测试附答案
- 建筑工地春节留守人员安全技术交底
- 默纳克-NICE1000技术交流-V1.0
- 蝴蝶兰的简介
- 老年人心理健康量表(含评分)
- 《小儿静脉输液速度》课件
- 营销人员薪酬标准及绩效考核办法
- 医院每日消防巡查记录表
- 运输企业重大危险源辨识及排查制度
- 运动技能学习与控制课件第五章运动中的中枢控制
评论
0/150
提交评论