Chapter 10: Unsupervised Learning
(Data Mining and Statistical Computing, 数据挖掘与统计计算)

This chapter will instead focus on unsupervised learning, a set of statistical tools intended for the setting in which we have only a set of features X1, X2, ..., Xp measured on n observations. We are not interested in prediction, because we do not have an associated response variable Y. Rather, the goal is to discover interesting things about the measurements on X1, X2, ..., Xp. This chapter covers two such tools: principal components analysis and clustering.

10.1 The Challenge of Unsupervised Learning

If we fit a predictive model using a supervised learning technique, then it is possible to check our work by seeing how well the model predicts the response Y on observations not used in fitting the model. However, in unsupervised learning there is no way to check our work, because we do not know the true answer: the problem is unsupervised.
10.2 Principal Components Analysis

When faced with a large set of correlated variables, principal components allow us to summarize this set with a smaller number of representative variables that collectively explain most of the variability in the original set. PCA is an unsupervised approach, since it involves only a set of features X1, X2, ..., Xp, and no associated response Y. Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization (visualization of the observations or visualization of the variables).
10.2.1 What Are Principal Components?

Suppose that we wish to visualize n observations with measurements on a set of p features, X1, X2, ..., Xp. We could do this by examining two-dimensional scatterplots of the data, each of which contains the n observations' measurements on two of the features. However, there are p(p−1)/2 such scatterplots; with p = 10, for example, that is already 45 plots. Clearly, a better method is required to visualize the n observations when p is large. In particular, we would like to find a low-dimensional representation of the data that captures as much of the information as possible. For instance, if we can obtain a two-dimensional representation of the data that captures most of the information, then we can plot the observations in this low-dimensional space.

PCA provides a tool to do just this. It finds a low-dimensional representation of a data set that contains as much as possible of the variation. The idea is that each of the n observations lives in p-dimensional space, but not all of these dimensions are equally interesting. PCA seeks a small number of dimensions that are as interesting as possible, where the concept of interesting is measured by the amount that the observations vary along each dimension.
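The slide text omits the defining formula, so it is restated here for completeness, in the standard ISLR notation. The first principal component of the features X1, X2, ..., Xp is the normalized linear combination

    Z1 = φ11 X1 + φ21 X2 + ... + φp1 Xp,   subject to φ11² + φ21² + ... + φp1² = 1,

that has the largest variance. The elements φ11, ..., φp1 are the loadings of the first principal component, and together they make up the loading vector φ1 referred to below.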
Geometric Interpretation for PCA

There is a nice geometric interpretation of the first principal component. The loading vector φ1, with elements φ11, φ21, ..., φp1, defines a direction in feature space along which the data vary the most. If we project the n data points x1, ..., xn onto this direction, the projected values are the principal component scores z11, ..., zn1 themselves.

Low-Dimensional Views of the Data

Once we have computed the principal components, we can plot them against each other in order to produce low-dimensional views of the data. For instance, we can plot the score vector Z1 against Z2, Z1 against Z3, Z2 against Z3, and so forth. Geometrically, this amounts to projecting the original data down onto the subspace spanned by φ1, φ2, and φ3, and plotting the projected points.
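As a minimal sketch of such low-dimensional views, the following R code plots the first two score vectors against each other for the USArrests data (the same data set used in the lab code at the end of this chapter); biplot() additionally overlays the loading vectors on the same display.

Code: plotting principal component scores

pr.out <- prcomp(USArrests, scale = TRUE)   # center and standardize, then compute PCs
plot(pr.out$x[, 1], pr.out$x[, 2],
     xlab = "First principal component (Z1)",
     ylab = "Second principal component (Z2)")
biplot(pr.out, scale = 0)                   # scores and loadings in one display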
10.2.2 Another Interpretation of Principal Components

Principal components provide low-dimensional linear surfaces that are closest to the observations. In particular, the first principal component loading vector has a very special property: it is the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness). The appeal of this interpretation is clear: we seek a single dimension of the data that lies as close as possible to all of the data points, since such a line will likely provide a good summary of the data.
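This closeness property can be checked numerically. The sketch below (an illustration, not part of the original slides) compares the average squared Euclidean distance from the standardized USArrests observations to the PC1 line and to a randomly chosen direction; the PC1 line gives the smallest value over all directions.

Code: checking the closest-line property

X <- scale(USArrests)                # center and standardize the observations
pr <- prcomp(USArrests, scale = TRUE)
dist_to_line <- function(X, v) {
  v <- v / sqrt(sum(v^2))            # unit direction vector
  proj <- X %*% v %*% t(v)           # projection of each row onto the line
  mean(rowSums((X - proj)^2))        # average squared residual distance
}
set.seed(1)
dist_to_line(X, pr$rotation[, 1])    # distance to the PC1 line (minimal)
dist_to_line(X, rnorm(ncol(X)))      # distance to a random direction (larger)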
10.2.3 More on PCA

Scaling the Variables. The results obtained when we perform PCA will also depend on whether the variables have been individually scaled (each multiplied by a different constant).

Uniqueness of the Principal Components. Each principal component loading vector is unique, up to a sign flip. This means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ.
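To see the effect of scaling, the short sketch below runs PCA on USArrests with and without standardization. Assault has a much larger variance than the other variables, so without scaling the first loading vector is dominated by that single variable.

Code: scaled versus unscaled PCA

apply(USArrests, 2, var)                      # the four variances differ widely
pr.scaled   <- prcomp(USArrests, scale = TRUE)
pr.unscaled <- prcomp(USArrests, scale = FALSE)
pr.scaled$rotation[, 1]    # loadings spread across all four variables
pr.unscaled$rotation[, 1]  # loadings concentrated almost entirely on Assault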
The Proportion of Variance Explained. We can now ask a natural question: how much of the information in a given data set is lost by projecting the observations onto the first few principal components? That is, how much of the variance in the data is not contained in the first few principal components?

Deciding How Many Principal Components to Use. In fact, we would like to use the smallest number of principal components required to get a good understanding of the data. How many principal components are needed? Unfortunately, there is no single (or simple!) answer to this question.
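The following sketch computes the proportion of variance explained (PVE) by each component of the scaled USArrests PCA: the variance explained by a component is the square of its standard deviation (pr.out$sdev), and dividing by the total variance gives the PVE. Plotting the PVE against the component number gives a scree plot, one common informal aid for deciding how many components to use.

Code: proportion of variance explained

pr.out <- prcomp(USArrests, scale = TRUE)
pr.var <- pr.out$sdev^2           # variance explained by each component
pve <- pr.var / sum(pr.var)       # proportion of variance explained
pve                               # for USArrests, PC1 explains roughly 62%
cumsum(pve)                       # cumulative PVE
plot(pve, type = "b", xlab = "Principal component",
     ylab = "Proportion of variance explained", ylim = c(0, 1))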
10.3 Clustering Methods

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other. The two best-known approaches are K-means clustering and hierarchical clustering.

10.3.1 K-Means Clustering

Let C1, ..., CK denote sets containing the indices of the observations in each of the K clusters. These sets satisfy two properties:

1. C1 ∪ C2 ∪ ... ∪ CK = {1, ..., n}. In other words, each observation belongs to at least one of the K clusters.
2. Ck ∩ Ck′ = ∅ for all k ≠ k′. In other words, the clusters are non-overlapping: no observation belongs to more than one cluster.
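As an illustration (not part of the original slides), the sketch below runs K-means with K = 2 on simulated two-dimensional data, in the style of the ISLR lab. kmeans() returns a partition satisfying the two properties above: every observation is assigned to exactly one cluster. nstart = 20 reruns the algorithm from 20 random starting assignments and keeps the best solution.

Code: K-means clustering

set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)   # 50 observations, 2 features
x[1:25, 1] <- x[1:25, 1] + 3           # shift the first 25 points so that
x[1:25, 2] <- x[1:25, 2] - 4           # two true groups exist
km.out <- kmeans(x, centers = 2, nstart = 20)
km.out$cluster                          # cluster label (1 or 2) per observation
plot(x, col = km.out$cluster + 1, pch = 20,
     main = "K-means clustering with K = 2")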
Code: Principal Components Analysis

library(ISLR)
states = row.names(USArrests)              # the 50 US states are the row names
states
names(USArrests)                           # Murder, Assault, UrbanPop, Rape
apply(USArrests, 2, mean)                  # the variable means differ...
apply(USArrests, 2, var)                   # ...and the variances differ even more
pr.out = prcomp(USArrests, scale = TRUE)   # PCA on the standardized variables
names(pr.out)                              # components of the prcomp output