Chapter 3: Linear Regression
8/27/2022 Data Mining and Statistical Computing

Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later chapters of this book, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches: as we will see in later chapters, many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression.

Recall
the Advertising data
1. Is there a relationship between advertising budget and sales?
2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we estimate the effect of each medium on sales?
5. How accurately can we predict future sales?
6. Is the relationship linear?
7. Is there synergy among the advertising media?

3.1 Simple Linear Regression
Simple linear regression lives up to its name: it is a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X.

Y ≈ β0 + β1X

You might read “≈” as “is approximately modeled as”.

3.1.1 Estimating the Coefficients
We define the residual sum of squares (RSS) as

RSS = e1² + e2² + ... + en², where ei = yi − ŷi is the i-th residual.

3.1.2 Assessing the Accuracy of the Coefficient Estimates
The population regression line is the best linear approximation to the true relationship between X and Y.

The least squares regression coefficient estimates characterize the least squares line.

The true relationship is generally not known for real data, but the least squares line can always be computed using the coefficient estimates. In other words, in real applications, we have access to a set of observations from which we can compute the least squares line; however, the population regression line is unobserved.

Unbiasedness
The property of unbiasedness holds for the least squares coefficient estimates: if we estimate β0 and β1 on the basis of a particular data set, then our estimates won't be exactly equal to β0 and β1. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on!

Variance
We can wonder how close the estimates β̂0 and β̂1 are to the true values β0 and β1.

For linear regression, the 95% confidence interval for β1 approximately takes the form

β̂1 ± 2 · SE(β̂1)

Similarly, a confidence interval for β0 approximately takes the form

β̂0 ± 2 · SE(β̂0)

3.1.3 Assessing the Accuracy of the Model
Residual Standard Error
R² Statistic

3.2 Multiple Linear Regression
Instead of fitting a separate simple linear regression model for each predictor, a better approach is to extend the simple linear regression model so that it can directly accommodate multiple predictors. We can do this by giving each predictor a separate slope coefficient in a single model.

3.2.1 Estimating the Regression Coefficients

3.2.2 Some Important Questions
1. Is at least one of the predictors X1, X2, ..., Xp useful in predicting the response?
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

One: Is There a Relationship Between the Response and Predictors?

Two: Deciding on Important Variables
Forward selection.
Backward selection.
Mixed selection.

We can then select the best model out of all of the models that we have considered. How do we determine which model is best? Various statistics can be used to judge the quality of a model. These include Mallow's Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted R².

Three: Model Fit
Thus, models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.

Four: Predictions
There are three sorts of uncertainty associated with this prediction:
The inaccuracy in the coefficient estimates is related to the reducible error.
There is an additional source of potentially reducible error, which we call model bias.
We referred to this
as the irreducible error. How much will Y vary from Ŷ? We use prediction intervals to answer this question.

3.3 Other Considerations in the Regression Model
3.3.1 Qualitative Predictors
Predictors with Only Two Levels
Qualitative Predictors with More than Two Levels

3.3.2 Extensions of the Linear Model
Removing the Additive Assumption
Non-linear Relationships

3.3.3 Potential Problems
1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity.

1. Non-linearity of the Data
If the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as log X, √X, and X², in the regression model.

2. Correlation of Error Terms
If the errors are uncorrelated, then the fact that ε_i is positive provides little or no information about the sign of ε_{i+1}. The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms. If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors.

3. Non-constant Variance of Error Terms
One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot.

4. Outliers
An outlier is a point for which yi is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.
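The residual-based checks described above (a funnel shape for non-constant variance, points far from the fitted value for outliers) can be sketched in R. This is a minimal illustration, assuming the Boston data from the MASS package and a regression of medv on lstat; the names `fit`, `stud_res`, and `outlier_idx` are made up for this example, and the cutoff of 3 is a common rule of thumb, not a value taken from the slides.

```r
library(MASS)  # provides the Boston data set

# Simple linear regression of median home value on lstat
fit <- lm(medv ~ lstat, data = Boston)

# Studentized residuals: each residual divided by its estimated standard error
stud_res <- rstudent(fit)

# Flag observations whose studentized residual exceeds 3 in absolute value
# (assumed rule-of-thumb threshold)
outlier_idx <- which(abs(stud_res) > 3)

# Residual plot against fitted values: a funnel shape would suggest
# non-constant variance, and points beyond the dashed lines are candidates
# for outliers
plot(predict(fit), stud_res,
     xlab = "Fitted values", ylab = "Studentized residuals")
abline(h = c(-3, 3), lty = 2)
```

Flagged observations are only candidates: whether to drop them depends on why they are unusual, for example a recording error versus a genuine deficiency of the model.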
5. High Leverage Points
We just saw that outliers are observations for which the response yi is unusual given the predictor xi. In contrast, observations with high leverage have an unusual value for xi.

6. Collinearity

Code: Simple Linear Regression
library(MASS)   # provides the Boston data set
library(ISLR)
fix(Boston)     # opens the data in an interactive editor
names(Boston)
lm.fit = lm(medv ~ lstat, data = Boston)
lm.fit
confint(lm.fit)
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")

attach(Boston)  # makes lstat and medv visible to plot()
plot(lstat, medv)
abline(lm.fit)
abline(lm.fit, lwd = 3)
abline(lm.fit, lwd = 3, col = "red")
plot(lstat, medv, col = "red")
plot(lstat, medv, pch = 20)
plot(lstat, medv, pch = "+")
plot(1:20, 1:20, pch = 1:20)

par(mfrow = c(2, 2))
plot(lm.fit)
plot(predict(
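As a cross-check on the lm() call above, the quantities from Sections 3.1.1 to 3.1.3 can be computed by hand. The sketch below uses the standard closed-form least squares estimates for simple linear regression, then RSS, RSE, R², and the approximate 95% interval of the form estimate ± 2 · SE. The variable names (`b0_hat`, `b1_hat`, `rss`, `rse`, `r2`, `se_b1`, `ci_b1`) are illustrative choices, not names from the original code.

```r
library(MASS)  # provides the Boston data set

x <- Boston$lstat
y <- Boston$medv
n <- length(y)

# Closed-form least squares estimates for slope and intercept
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0_hat <- mean(y) - b1_hat * mean(x)

# Residual sum of squares: sum of squared differences between y and fitted values
rss <- sum((y - b0_hat - b1_hat * x)^2)

# Residual standard error, with n - 2 degrees of freedom
rse <- sqrt(rss / (n - 2))

# R^2: proportion of the total sum of squares explained by the fit
tss <- sum((y - mean(y))^2)
r2 <- 1 - rss / tss

# Standard error of the slope and the approximate 95% interval (estimate +/- 2 SE)
se_b1 <- rse / sqrt(sum((x - mean(x))^2))
ci_b1 <- b1_hat + c(-2, 2) * se_b1

# Reference fit for comparison
fit <- lm(medv ~ lstat, data = Boston)
```

The hand-computed `b0_hat`, `b1_hat`, `rse`, and `r2` should agree with `coef(fit)`, `summary(fit)$sigma`, and `summary(fit)$r.squared`. The interval `ci_b1` is only approximately equal to the slope row of `confint(fit)`, which uses the exact t quantile rather than 2.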