Data Mining Courseware 03: Linear Regression
Chapter 3: Linear Regression
Data Mining and Statistical Computing, 8/27/2022

Linear regression has been around for a long time and is the topic of innumerable textbooks. Though it may seem somewhat dull compared to some of the more modern statistical learning approaches described in later chapters of this book, linear regression is still a useful and widely used statistical learning method. Moreover, it serves as a good jumping-off point for newer approaches: as we will see in later chapters, many fancy statistical learning approaches can be seen as generalizations or extensions of linear regression.

Recall the Advertising data:

1. Is there a relationship between advertising budget and sales?
2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we estimate the effect of each medium on sales?
5. How accurately can we predict future sales?
6. Is the relationship linear?
7. Is there synergy among the advertising media?

3.1 Simple Linear Regression

Simple linear regression lives up to its name: it is a very straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X. The model is written Y ≈ β₀ + β₁X, where you might read "≈" as "is approximately modeled as".

3.1.1 Estimating the Coefficients

We define the residual sum of squares (RSS) as RSS = e₁² + e₂² + ... + eₙ², where eᵢ = yᵢ − ŷᵢ is the i-th residual.

3.1.2 Assessing the Accuracy of the Coefficient Estimates

The population regression line is the best linear approximation to the true relationship between X and Y. The least squares regression coefficient estimates characterize the least squares line.

The true relationship is generally not known for real data, but the least squares line can always be computed using the coefficient estimates. In other words, in real applications, we have access to a set of observations from which we can compute the least squares line; however, the population regression line is unobserved.

Unbiasedness

The property of unbiasedness holds for the least squares coefficient estimates: if we estimate β₀ and β₁ on the basis of a particular data set, then our estimates won't be exactly equal to β₀ and β₁. But if we could average the estimates obtained over a huge number of data sets, then the average of these estimates would be spot on!

Variance

We can wonder how close β̂₀ and β̂₁ are to the true values β₀ and β₁.

For linear regression, the 95% confidence interval for β₁ approximately takes the form β̂₁ ± 2·SE(β̂₁). Similarly, a confidence interval for β₀ approximately takes the form β̂₀ ± 2·SE(β̂₀).

3.1.3 Assessing the Accuracy of the Model

Residual Standard Error

R² Statistic

3.2 Multiple Linear Regression

Instead of fitting a separate simple linear regression model for each predictor, a better approach is to extend the simple linear regression model so that it can directly accommodate multiple predictors. We can do this by giving each predictor a separate slope coefficient in a single model.

3.2.1 Estimating the Regression Coefficients

3.2.2 Some Important Questions

1. Is at least one of the predictors X₁, X₂, ..., Xₚ useful in predicting the response?
2. Do all the predictors help to explain Y, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
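The four questions for multiple regression map onto standard output of a fitted model. The following is a minimal sketch in R, assuming the Boston data from the MASS package (the data set used in the lab code of these slides). The choice of lstat and age as predictors and the values in new_obs are illustrative assumptions, not taken from the slides.

```r
library(MASS)  # provides the Boston housing data

# Multiple linear regression: one slope coefficient per predictor.
# Using lstat and age as predictors is an illustrative assumption.
fit <- lm(medv ~ lstat + age, data = Boston)
s <- summary(fit)

# Question 1: overall F-test that at least one predictor is useful.
f <- s$fstatistic  # named vector with entries value, numdf, dendf
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
cat("F statistic:", f["value"], " p-value:", p_value, "\n")

# Question 2: per-predictor t-tests (estimate, std. error, t value, p-value).
print(coef(s))

# Question 3: model fit, summarised by R^2 and the residual standard error.
cat("R^2:", s$r.squared, " RSE:", s$sigma, "\n")

# Question 4: predicted response at hypothetical predictor values,
# with a 95% prediction interval around it.
new_obs <- data.frame(lstat = 10, age = 50)
pred <- predict(fit, new_obs, interval = "prediction")
print(pred)
```

The per-predictor t-tests only hint at the answer to question 2; choosing a subset of predictors is a separate model selection task.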

One: Is There a Relationship Between the Response and Predictors?

Two: Deciding on Important Variables

Three approaches: forward selection, backward selection, and mixed selection.

We can then select the best model out of all of the models that we have considered. How do we determine which model is best? Various statistics can be used to judge the quality of a model. These include Mallow's Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC), and adjusted R².

Three: Model Fit

Since RSE = sqrt(RSS / (n − p − 1)), models with more variables can have higher RSE if the decrease in RSS is small relative to the increase in p.

Four: Predictions

There are three sorts of uncertainty associated with this prediction. The inaccuracy in the coefficient estimates is related to the reducible error. There is an additional source of potentially reducible error, which we call model bias. Finally, the random error in the model cannot be removed; we referred to this as the irreducible error. How much will Y vary from Ŷ? We use prediction intervals to answer this question.

3.3 Other Considerations in the Regression Model

3.3.1 Qualitative Predictors

Predictors with Only Two Levels

Qualitative Predictors with More than Two Levels
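Qualitative predictors enter a linear model through dummy (indicator) variables. The sketch below shows how R builds them, using a small made-up data frame; the column names balance, student and region and all values are hypothetical and serve only to show the coding.

```r
# Hypothetical data: one two-level and one three-level qualitative predictor.
df <- data.frame(
  balance = c(520, 480, 610, 450, 530, 500),
  student = factor(c("No", "Yes", "No", "Yes", "No", "Yes")),
  region  = factor(c("East", "West", "South", "East", "West", "South"))
)

# model.matrix() exposes the design matrix that lm() builds internally.
# A factor with k levels yields k - 1 dummy columns; the first level
# (alphabetical by default) is the baseline absorbed by the intercept.
X <- model.matrix(balance ~ student + region, data = df)
print(X)

# Each dummy coefficient is the average difference from the baseline level,
# holding the other predictor fixed.
fit <- lm(balance ~ student + region, data = df)
print(coef(fit))
```

The baseline level can be changed with relevel(); this changes the interpretation of the coefficients but not the fitted values.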

3.3.2 Extensions of the Linear Model

Removing the Additive Assumption

Non-linear Relationships

3.3.3 Potential Problems

1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity.

1. Non-linearity of the Data

If the residual plot indicates that there are non-linear associations in the data, then a simple approach is to use non-linear transformations of the predictors, such as log X, √X, and X², in the regression model.
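Both the transformations named above and an interaction term (one way of removing the additive assumption) can be written directly in an R model formula. This is a minimal sketch, assuming the Boston data from the MASS package; the choice of lstat and age and the model names are illustrative.

```r
library(MASS)  # provides the Boston housing data

fit_lin  <- lm(medv ~ lstat, data = Boston)
fit_quad <- lm(medv ~ lstat + I(lstat^2), data = Boston)  # I() protects ^ inside a formula
fit_log  <- lm(medv ~ log(lstat), data = Boston)
fit_sqrt <- lm(medv ~ sqrt(lstat), data = Boston)

# Compare the fits by R^2.
models <- list(linear = fit_lin, quadratic = fit_quad, log = fit_log, sqrt = fit_sqrt)
r2 <- sapply(models, function(m) summary(m)$r.squared)
print(round(r2, 3))

# Nested comparison: does the quadratic term improve on the linear fit?
print(anova(fit_lin, fit_quad))

# Interaction: lstat * age expands to lstat + age + lstat:age.
fit_int <- lm(medv ~ lstat * age, data = Boston)
print(coef(fit_int))
```

After refitting, the residual plot (for example plot(fitted(fit_quad), residuals(fit_quad))) should be checked again for remaining pattern.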

2. Correlation of Error Terms

If the errors are uncorrelated, then the fact that εᵢ is positive provides little or no information about the sign of εᵢ₊₁. The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms. If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors.

3. Non-constant Variance of Error Terms

One can identify non-constant variances in the errors, or heteroscedasticity, from the presence of a funnel shape in the residual plot.

4. Outliers

An outlier is a point for which yᵢ is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.
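One common way to flag outliers is to look at studentized residuals. The sketch below uses simulated data with one deliberately corrupted response; the data, the size of the corruption, and the cutoff of 3 are assumptions for illustration (the cutoff is a rule of thumb, not something stated in these slides).

```r
set.seed(1)

# Simulated data from a known linear model with unit-variance noise.
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)

# Corrupt one response to mimic a recording error during data collection.
y[10] <- y[10] + 15

fit <- lm(y ~ x)

# rstudent() divides each residual by an estimate of its standard error
# computed with that observation left out, so residuals are comparable.
r_stud <- rstudent(fit)

# Flag observations whose studentized residual is large in absolute value.
flagged <- which(abs(r_stud) > 3)
print(flagged)
print(round(r_stud[flagged], 2))
```

Whether a flagged point should be removed depends on why it is unusual; it may also indicate a deficiency in the model, such as a missing predictor.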

5. High Leverage Points

We just saw that outliers are observations for which the response yᵢ is unusual given the predictor xᵢ. In contrast, observations with high leverage have an unusual value for xᵢ.

6. Collinearity

Code: Simple Linear Regression

    library(MASS)   # provides the Boston data set
    library(ISLR)
    fix(Boston)     # opens a data viewer (interactive sessions only)
    names(Boston)

    # Regress median house value (medv) on lstat
    lm.fit = lm(medv ~ lstat, data = Boston)
    lm.fit
    confint(lm.fit)  # confidence intervals for the coefficients

    # Confidence and prediction intervals at lstat = 5, 10, 15
    predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")
    predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")

    attach(Boston)   # added so that lstat and medv resolve in the plot() calls below
    plot(lstat, medv)
    abline(lm.fit)
    abline(lm.fit, lwd = 3)
    abline(lm.fit, lwd = 3, col = "red")
    plot(lstat, medv, col = "red")
    plot(lstat, medv, pch = 20)
    plot(lstat, medv, pch = "+")
    plot(1:20, 1:20, pch = 1:20)

    # Diagnostic plots in a 2 x 2 grid
    par(mfrow = c(2, 2))
    plot(lm.fit)
    plot(predict(
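The numbers reported by lm(), summary() and confint() in the lab can be reproduced from the closed-form least squares expressions. The sketch below assumes the standard textbook formulas for simple linear regression (they are not spelled out in the extracted slide text); names such as b0, b1 and sxx are illustrative.

```r
library(MASS)  # provides the Boston housing data

x <- Boston$lstat
y <- Boston$medv
n <- length(y)

# Least squares estimates: the slope and intercept that minimise RSS.
sxx <- sum((x - mean(x))^2)
b1 <- sum((x - mean(x)) * (y - mean(y))) / sxx
b0 <- mean(y) - b1 * mean(x)

# Accuracy of the model: RSS, RSE and R^2.
rss <- sum((y - b0 - b1 * x)^2)
tss <- sum((y - mean(y))^2)
rse <- sqrt(rss / (n - 2))
r2 <- 1 - rss / tss

# Standard errors of the coefficient estimates, with sigma estimated by RSE.
se_b1 <- rse / sqrt(sxx)
se_b0 <- rse * sqrt(1 / n + mean(x)^2 / sxx)

# Approximate 95% confidence interval for the slope: estimate +/- 2 * SE.
ci_b1 <- b1 + c(-2, 2) * se_b1

# Compare with what lm() reports.
fit <- lm(medv ~ lstat, data = Boston)
print(all.equal(unname(coef(fit)), c(b0, b1)))
print(all.equal(summary(fit)$sigma, rse))
print(all.equal(summary(fit)$r.squared, r2))
print(rbind(approximate = ci_b1, exact = confint(fit)["lstat", ]))
```

The interval from confint() uses the exact t quantile rather than the factor 2, so the two rows of the last table are close but not identical.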

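For high-leverage points and collinearity, two standard diagnostics are the leverage statistic and the variance inflation factor (VIF). The sketch below assumes the Boston data from the MASS package; the helper vif_manual, the predictor set, and the cutoff of three times the average leverage are illustrative assumptions, not part of the slides.

```r
library(MASS)  # provides the Boston housing data

# Leverage: hatvalues() returns the diagonal of the hat matrix.
fit <- lm(medv ~ lstat, data = Boston)
h <- hatvalues(fit)
n <- nrow(Boston)
p <- 1  # number of predictors in this model

# The average leverage is always (p + 1) / n; values far above it stand out.
# The factor 3 is an arbitrary illustrative cutoff.
high <- which(h > 3 * (p + 1) / n)
cat("Observations with high leverage:", length(high), "\n")

# For simple regression the leverage has a closed form in x alone.
x <- Boston$lstat
h_manual <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# VIF of predictor j: 1 / (1 - R^2) from regressing X_j on the other predictors.
# vif_manual is a hypothetical helper written for this sketch.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}
vif <- vif_manual(Boston, c("lstat", "age", "tax", "rad"))
print(round(vif, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; larger values indicate that its coefficient is estimated less precisely because of collinearity.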