Statistics for Business, Ch. 14: Simple Linear Regression (lecture notes)

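Chapter 14 SIMPLE LINEAR REGRESSION

Prepared by Professor Xia Nanxin, Ph.D., Lingnan College, Sun Yat-sen University

The statistical methods used in studying the relationship between two variables were first employed by Sir Francis Galton, F.R.S. (1822-1911). Galton was interested in studying the relationship between a father's height and the son's height. Galton's disciple, Karl Pearson (1857-1936), analyzed the relationship between the father's height and the son's height for 1078 pairs of subjects.

In the history of statistics, the method of regression analysis, and the name "regression" itself, are generally credited to the British biologist and statistician F. Galton (1822-1911).

Galton was an English gentleman scientist. He came from the English upper class and studied medicine at Cambridge. Before beginning his research on heredity, he explored the African continent. Charles Darwin, who published the monumental On the Origin of Species in 1859, was his cousin.

Galton was rich in ideas. He posed the following question: if heights in each generation are normally distributed and height is hereditary, how are the heights of one generation related to the heights of the next? He later found a linear relationship between the heights of parents and the heights of their children: if the parents are very tall, their children are generally taller than average but shorter than their parents. Galton called this finding the "law of regression".

Galton carried this work on heredity further. To describe the pattern, he set up a linear regression model (with x and y denoting the heights of the parents and of their children, respectively). This idea led to the principles of regression analysis.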

14.1 Simple Linear Regression Model

The equation that describes how $y$ is related to $x$ and an error term is called the regression model.

Simple Linear Regression Model

$$y = \beta_0 + \beta_1 x + \varepsilon \tag{14.1}$$

where
$x$ = independent variable (also called the explanatory variable)
$y$ = dependent variable (also called the response variable or the explained variable)

$\beta_0$ and $\beta_1$ are referred to as the parameters of the model, and $\varepsilon$ (the Greek letter epsilon) is a random variable referred to as the error term. The error term accounts for the variability in $y$ that cannot be explained by the linear relationship between $x$ and $y$.

Simple Linear Regression Equation

$$E(y) = \beta_0 + \beta_1 x \tag{14.2}$$

The graph of the simple linear regression equation is a straight line; $\beta_0$ is the $y$-intercept of the regression line, $\beta_1$ is the slope, and $E(y)$ is the mean or expected value of $y$ for a given value of $x$.

FIGURE 14.1 POSSIBLE REGRESSION LINES IN SIMPLE LINEAR REGRESSION
[Figure: three panels plotting $E(y)$ against $x$, each showing the regression line and its intercept $\beta_0$. Panel A: positive linear relationship (slope $\beta_1$ is positive). Panel B: negative linear relationship (slope $\beta_1$ is negative). Panel C: no relationship (slope $\beta_1$ is zero).]

Examples of possible regression lines are shown in Figure 14.1. The regression line in Panel A shows that the mean value of $y$ is related positively to $x$, with larger values of $E(y)$ associated with larger values of $x$.

Estimated Regression Equation

Substituting the values of the sample statistics $b_0$ and $b_1$ for $\beta_0$ and $\beta_1$ in the regression equation, we obtain the estimated regression equation.

Estimated Simple Linear Regression Equation

$$\hat{y} = b_0 + b_1 x \tag{14.3}$$

NOTES AND COMMENTS
1. Regression analysis cannot be interpreted as a procedure for establishing a cause-and-effect relationship between variables. It can only indicate how or to what extent variables are associated with each other. Any conclusions about cause and effect must be based upon the judgment of those individuals most knowledgeable about the application.
2. The regression equation in simple linear regression is $E(y) = \beta_0 + \beta_1 x$. More advanced texts in regression analysis often write the regression equation as $E(y \mid x) = \beta_0 + \beta_1 x$ to emphasize that the regression equation provides the mean value of $y$ for a given value of $x$.

14.2 Least Squares Method

Carl Friedrich Gauss (1777-1855) proposed the least squares method.

Least Squares Criterion

$$\min \sum (y_i - \hat{y}_i)^2 \tag{14.5}$$

where
$y_i$ = observed value of the dependent variable for the $i$th observation
$\hat{y}_i$ = estimated value of the dependent variable for the $i$th observation

Appendix 14.1 Calculus-Based Derivation of Least Squares Formulas

As mentioned in the chapter, the least squares method is a procedure for determining the values of $b_0$ and $b_1$ that minimize the sum of squared residuals. The sum of squared residuals is given by

$$\sum (y_i - \hat{y}_i)^2$$

Substituting $\hat{y}_i = b_0 + b_1 x_i$, we get

$$\sum (y_i - b_0 - b_1 x_i)^2 \tag{14.34}$$

as the expression that must be minimized.

To minimize expression (14.34), we must take the partial derivatives with respect to $b_0$ and $b_1$, set them equal to zero, and solve. Doing so, we get

$$\frac{\partial \sum (y_i - b_0 - b_1 x_i)^2}{\partial b_0} = -2 \sum (y_i - b_0 - b_1 x_i) = 0 \tag{14.35}$$

$$\frac{\partial \sum (y_i - b_0 - b_1 x_i)^2}{\partial b_1} = -2 \sum x_i (y_i - b_0 - b_1 x_i) = 0 \tag{14.36}$$

Dividing equation (14.35) by two and summing each term individually yields

$$-\sum y_i + \sum b_0 + \sum b_1 x_i = 0$$

Bringing $\sum y_i$ to the other side of the equal sign and noting that $\sum b_0 = n b_0$, we obtain

$$n b_0 + \left(\sum x_i\right) b_1 = \sum y_i \tag{14.37}$$

Similar algebraic simplification applied to equation (14.36) yields

$$\left(\sum x_i\right) b_0 + \left(\sum x_i^2\right) b_1 = \sum x_i y_i \tag{14.38}$$

Equations (14.37) and (14.38) are known as the normal equations. Solving equation (14.37) for $b_0$ yields

$$b_0 = \frac{\sum y_i}{n} - b_1 \frac{\sum x_i}{n} \tag{14.39}$$

Using equation (14.39) to substitute for $b_0$ in equation (14.38) provides

$$\frac{\sum x_i \sum y_i}{n} - \frac{\left(\sum x_i\right)^2}{n} b_1 + \left(\sum x_i^2\right) b_1 = \sum x_i y_i \tag{14.40}$$

By rearranging the terms in equation (14.40), we obtain

$$b_1 = \frac{\sum x_i y_i - \left(\sum x_i \sum y_i\right)/n}{\sum x_i^2 - \left(\sum x_i\right)^2/n} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \tag{14.41}$$

Because $\bar{y} = \sum y_i / n$ and $\bar{x} = \sum x_i / n$, we can rewrite equation (14.39) as

$$b_0 = \bar{y} - b_1 \bar{x} \tag{14.42}$$

Equations (14.41) and (14.42) are the formulas (14.6) and (14.7) we used in the chapter to compute the coefficients in the estimated regression equation.
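As a numerical check on the derivation, the following Python sketch evaluates formulas (14.41) and (14.42) and then confirms that the result satisfies the normal equations (14.37) and (14.38). The function name `least_squares` is hypothetical (it does not come from the slides), and the data are the ten (student population, quarterly sales) pairs of the Armand's Pizza Parlors example used in this chapter.

```python
def least_squares(x, y):
    """Return (b0, b1) using equations (14.41) and (14.42)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of (14.41) in deviation form.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar  # equation (14.42)
    return b0, b1


# Armand's Pizza Parlors: student population (1000s), quarterly sales ($1000s).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = least_squares(x, y)
n = len(x)

# Left- and right-hand sides of the normal equations (14.37) and (14.38).
lhs_37 = n * b0 + sum(x) * b1
rhs_37 = sum(y)
lhs_38 = sum(x) * b0 + sum(xi ** 2 for xi in x) * b1
rhs_38 = sum(xi * yi for xi, yi in zip(x, y))

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"(14.37): {lhs_37:.2f} vs {rhs_37}")
print(f"(14.38): {lhs_38:.2f} vs {rhs_38}")
```

With these data the sketch should reproduce the chapter's estimated regression equation $\hat{y} = 60 + 5x$, and both normal equations should balance.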

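TABLE 14.2 CALCULATIONS FOR THE LEAST SQUARES ESTIMATED REGRESSION EQUATION FOR ARMAND'S PIZZA PARLORS

Restaurant $i$ | $x_i$ | $y_i$ | $x_i - \bar{x}$ | $y_i - \bar{y}$ | $(x_i - \bar{x})(y_i - \bar{y})$ | $(x_i - \bar{x})^2$
1 | 2 | 58 | -12 | -72 | 864 | 144
2 | 6 | 105 | -8 | -25 | 200 | 64
3 | 8 | 88 | -6 | -42 | 252 | 36
4 | 8 | 118 | -6 | -12 | 72 | 36
5 | 12 | 117 | -2 | -13 | 26 | 4
6 | 16 | 137 | 2 | 7 | 14 | 4
7 | 20 | 157 | 6 | 27 | 162 | 36
8 | 20 | 169 | 6 | 39 | 234 | 36
9 | 22 | 149 | 8 | 19 | 152 | 64
10 | 26 | 202 | 12 | 72 | 864 | 144
Totals | 140 | 1300 | | | 2840 | 568

(The entries for restaurants 7 to 9 are missing from the extracted slide; they are filled in here from the $x_i$ and $y_i$ values listed in Table 14.3.)

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{2840}{568} = 5$$

$$b_0 = \bar{y} - b_1 \bar{x} = 130 - 5(14) = 60$$

Thus, the estimated regression equation is

$$\hat{y} = 60 + 5x$$

FIGURE 14.5 DEVIATIONS ABOUT THE ESTIMATED REGRESSION LINE AND THE LINE $y = \bar{y}$
[Figure: an observation $y_i$ shown against the estimated regression line and the horizontal line $y = \bar{y}$, marking the deviations $y_i - \bar{y}$, $\hat{y}_i - \bar{y}$, and $y_i - \hat{y}_i$. The $i$th residual (the $i$th error) is $e_i = y_i - \hat{y}_i$.]

14.3 Coefficient of Determination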

$$SST = SSR + SSE \tag{14.11}$$

where
SST = $\sum (y_i - \bar{y})^2$ = total sum of squares
SSR = $\sum (\hat{y}_i - \bar{y})^2$ = sum of squares due to regression
SSE = $\sum (y_i - \hat{y}_i)^2$ = sum of squares due to error

SSR can be thought of as the explained portion of SST, and SSE can be thought of as the unexplained portion of SST.

TABLE 14.3 CALCULATIONS OF SSE AND SST FOR ARMAND'S PIZZA PARLORS

Restaurant $i$ | $x_i$ | $y_i$ | Predicted sales $\hat{y}_i = 60 + 5x_i$ | $y_i - \bar{y}$ | Error $y_i - \hat{y}_i$ | $(y_i - \bar{y})^2$ | $(y_i - \hat{y}_i)^2$
1 | 2 | 58 | 70 | -72 | -12 | 5184 | 144
2 | 6 | 105 | 90 | -25 | 15 | 625 | 225
3 | 8 | 88 | 100 | -42 | -12 | 1764 | 144
4 | 8 | 118 | 100 | -12 | 18 | 144 | 324
5 | 12 | 117 | 120 | -13 | -3 | 169 | 9
6 | 16 | 137 | 140 | 7 | -3 | 49 | 9
7 | 20 | 157 | 160 | 27 | -3 | 729 | 9
8 | 20 | 169 | 160 | 39 | 9 | 1521 | 81
9 | 22 | 149 | 170 | 19 | -21 | 361 | 441
10 | 26 | 202 | 190 | 72 | 12 | 5184 | 144
Totals | | | | | | SST = 15730 | SSE = 1530

(Row 10 is only partly legible in the extracted slide; the missing entries follow from $\hat{y}_{10} = 60 + 5(26) = 190$ and $\bar{y} = 130$.)

SSR = SST - SSE = 15730 - 1530 = 14200

COEFFICIENT OF DETERMINATION

$$r^2 = \frac{SSR}{SST}$$

For the example above,

$$r^2 = \frac{SSR}{SST} = \frac{14200}{15730} = .9027$$

$r^2$ can be interpreted as the percentage of the total sum of squares that can be explained by using the estimated regression equation. For Armand's Pizza Parlors, 90.27% of the variability in sales can be explained by the linear relationship between the size of the student population and sales. We should be pleased to find such a good fit for the estimated regression equation.
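The sums of squares above can be recomputed directly from the data. The sketch below is only an illustration: it takes the estimates $b_0 = 60$ and $b_1 = 5$ from the chapter as given, and the variable names are hypothetical. It computes SSR from its own definition rather than by subtraction, so the identity SST = SSR + SSE is checked rather than assumed.

```python
# Armand's Pizza Parlors: student population (1000s), quarterly sales ($1000s).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = 60.0, 5.0  # estimated regression equation from the chapter

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]  # predicted sales for each restaurant

sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # due to error
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # due to regression

r_squared = ssr / sst

print(f"SST = {sst:.0f}, SSR = {ssr:.0f}, SSE = {sse:.0f}")
print(f"SSR + SSE = {ssr + sse:.0f}")
print(f"r^2 = {r_squared:.4f}")
```

Computing all three sums independently makes it easy to spot an arithmetic slip in a hand-built table such as Table 14.3.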

SAMPLE CORRELATION COEFFICIENT

$$r_{xy} = (\text{sign of } b_1)\sqrt{\text{Coefficient of Determination}} = (\text{sign of } b_1)\sqrt{r^2} \tag{14.13}$$

14.4 Model Assumptions

The assumed regression model is

$$y = \beta_0 + \beta_1 x + \varepsilon$$

ASSUMPTIONS ABOUT THE ERROR TERM $\varepsilon$ IN THE REGRESSION MODEL

1. The error term $\varepsilon$ is a random variable with a mean or expected value of zero; that is, $E(\varepsilon) = 0$.
2. The variance of $\varepsilon$, denoted by $\sigma^2$, is the same for all values of $x$.
   Implication: The variance of $y$ about the regression line equals $\sigma^2$ and is the same for all values of $x$; e.g., in Figure 14.6, the distributions of $y$ at $x = 10$, $x = 20$, and $x = 30$ all have the same variance.
3. The values of $\varepsilon$ are independent; e.g., in Figure 14.6, $E(\varepsilon_1 \varepsilon_2) = E(\varepsilon_1) E(\varepsilon_2)$, where $\varepsilon_1$ takes values in $(e_{11}, e_{12}, e_{13})$ and $\varepsilon_2$ takes values in $(e_{21}, e_{22}, e_{23})$.
   Implication: The value of $\varepsilon$ for a particular value of $x$ is not related to the value of $\varepsilon$ for any other value of $x$; thus, the value of $y$ for a particular value of $x$ is not related to the value of $y$ for any other value of $x$.
4. The error term $\varepsilon$ is a normally distributed random variable.
   Implication: Because $y$ is a linear function of $\varepsilon$, $y$ is also a normally distributed random variable.

FIGURE 14.5A DEVIATIONS ABOUT THE ESTIMATED REGRESSION LINE AND THE LINE $y = \bar{y}$
[Figure: observations at $x_1, x_2, x_3, x_4, \ldots, x_i$ scattered about the estimated line $\hat{y} = b_0 + b_1 x$ and the horizontal line $y = \bar{y}$, with residuals $e_i = y_i - \hat{y}_i$. The residuals $e_{11}, e_{12}, e_{13}$ at $x_1$ and $e_{21}, e_{22}, e_{23}$ at $x_2$ are marked.]

FIGURE 14.5B THE DISTRIBUTION OF THE ERROR TERM $\varepsilon_1$
[Figure: the density $f(e)$ of the error term, with the values $e_{11}, e_{12}, e_{13}$ and their densities $f(e_{11}), f(e_{12}), f(e_{13})$ marked, where $\varepsilon_1 \in (e_{11}, e_{12}, e_{13})$. The densities play the role that probabilities play in a discrete distribution.]

Figure 14.6 illustrates the model assumptions and their implications; note that in this graphical interpretation, the value of $E(y)$ changes according to the specific value of $x$ considered. However, regardless of the $x$ value, the probability distribution of $\varepsilon$ and hence the probability distributions of $y$ are normally distributed, each with the same variance. The specific value of the error $\varepsilon$ at any particular point depends on whether the actual value of $y$ is greater than or less than $E(y)$.

FIGURE 14.6 ASSUMPTIONS FOR THE REGRESSION MODEL
[Figure: the line $E(y) = \beta_0 + \beta_1 x$ with the normal distribution of $y$ drawn at $x = 0$, $x = 10$, $x = 20$, and $x = 30$, each centered on $E(y)$ for that value of $x$.]
Note: the $y$ distributions have the same shape at each $x$ value.

14.5 Testing for Significance

Estimate of $\sigma^2$

With $\hat{y}_i = b_0 + b_1 x_i$, SSE can be written as

$$SSE = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - b_0 - b_1 x_i)^2$$

SSE has $n - 2$ degrees of freedom because two parameters ($\beta_0$ and $\beta_1$) must be estimated to compute SSE. Thus, the mean square error (MSE) is computed by dividing SSE by $n - 2$. MSE provides an unbiased estimator of $\sigma^2$.

MEAN SQUARE ERROR (ESTIMATE OF $\sigma^2$)

$$s^2 = MSE = \frac{SSE}{n - 2}$$

In Section 14.3 we showed that for the Armand's Pizza Parlors example, SSE = 1530; hence,

$$s^2 = MSE = \frac{1530}{10 - 2} = 191.25$$

provides an unbiased estimate of $\sigma^2$.

t Test

SAMPLING DISTRIBUTION OF $b_1$
Expected value: $E(b_1) = \beta_1$
Standard deviation: $\sigma_{b_1} = \dfrac{\sigma}{\sqrt{\sum (x_i - \bar{x})^2}}$ (14.17)
Distribution form: Normal

t TEST FOR SIGNIFICANCE IN SIMPLE LINEAR REGRESSION

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$

TEST STATISTIC

$$t = \frac{b_1}{s_{b_1}}$$

where $s_{b_1} = s / \sqrt{\sum (x_i - \bar{x})^2}$ is the estimated standard deviation of $b_1$, obtained by replacing $\sigma$ with $s$ in equation (14.17).

REJECTION RULE
$p$-value approach: Reject $H_0$ if $p$-value $\le \alpha$
Critical value approach: Reject $H_0$ if $t \le -t_{\alpha/2}$ or if $t \ge t_{\alpha/2}$
where $t_{\alpha/2}$ is based on a $t$ distribution with $n - 2$ degrees of freedom.

Confidence Interval for $\beta_1$

The form of a confidence interval for $\beta_1$ is as follows:

$$b_1 \pm t_{\alpha/2} s_{b_1}$$

F Test

With only one independent variable, the $F$ test will provide the same conclusion as the $t$ test. But with more than one independent variable, only the $F$ test can be used to test for an overall significant relationship.

The logic behind the use of the $F$ test for determining whether the regression relationship is statistically significant is based on the development of two independent estimates of $\sigma^2$. We explained how MSE provides an estimate of $\sigma^2$. If the null hypothesis $H_0$: $\beta_1 = 0$ is true, the sum of squares due to regression, SSR, divided by its degrees of freedom provides another independent estimate of $\sigma^2$. This estimate is called the mean square due to regression, or simply the mean square regression, and is denoted MSR. In general,

$$MSR = \frac{SSR}{\text{Regression degrees of freedom}}$$

For the models we consider in this text, the regression degrees of freedom is always equal to the number of independent variables in the model:

$$MSR = \frac{SSR}{\text{Number of independent variables}} \tag{14.20}$$

If the null hypothesis ($H_0$: $\beta_1 = 0$) is true, MSR and MSE are two independent estimates of $\sigma^2$ and the sampling distribution of MSR/MSE follows an $F$ distribution with numerator degrees of freedom equal to one and denominator degrees of freedom equal to $n - 2$. Therefore, when $\beta_1 = 0$, the value of MSR/MSE should be close to one. However, if the null hypothesis is false ($\beta_1 \neq 0$), MSR will overestimate $\sigma^2$ and the value of MSR/MSE will be inflated; thus, large values of MSR/MSE lead to the rejection of $H_0$ and the conclusion that the relationship between $x$ and $y$ is statistically significant.
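The quantities in this section can be computed in a few lines for the Armand's Pizza Parlors data. The sketch below is illustrative only: it takes $b_0 = 60$ and $b_1 = 5$ from the chapter as given, and the variable names are hypothetical. It also shows numerically why the $t$ and $F$ tests agree when there is a single independent variable: the square of the $t$ statistic equals MSR/MSE.

```python
import math

# Armand's Pizza Parlors: student population (1000s), quarterly sales ($1000s).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = 60.0, 5.0  # estimated regression equation from the chapter
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
s_xx = sum((xi - x_bar) ** 2 for xi in x)

mse = sse / (n - 2)            # s^2, the estimate of sigma^2
s = math.sqrt(mse)             # standard error of the estimate
s_b1 = s / math.sqrt(s_xx)     # estimated standard deviation of b1
t_stat = b1 / s_b1             # t test statistic

msr = ssr / 1                  # one independent variable
f_stat = msr / mse             # MSR/MSE

print(f"MSE = {mse:.2f}, s = {s:.3f}, s_b1 = {s_b1:.4f}")
print(f"t = {t_stat:.2f}, t^2 = {t_stat ** 2:.2f}, MSR/MSE = {f_stat:.2f}")
```

To complete either test, the statistic is compared with a critical value (or converted to a $p$-value) from the appropriate distribution with $n - 2 = 8$ degrees of freedom; that lookup is left out here so that the sketch needs only the standard library.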

F TEST FOR SIGNIFICANCE IN SIMPLE LINEAR REGRESSION

$$H_0: \beta_1 = 0 \qquad H_a: \beta_1 \neq 0$$

TEST STATISTIC

$$F = \frac{MSR}{MSE}$$

REJECTION RULE
$p$-value approach: Reject $H_0$ if $p$-value $\le \alpha$
Critical value approach: Reject $H_0$ if $F \ge F_\alpha$
where $F_\alpha$ is based on an $F$ distribution with 1 degree of freedom in the numerator and $n - 2$ degrees of freedom in the denominator.

TABLE 14.5 GENERAL FORM OF THE ANOVA TABLE FOR SIMPLE LINEAR REGRESSION

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | $F$
Regression | SSR | 1 | MSR = SSR/1 | $F$ = MSR/MSE
Error | SSE | $n - 2$ | MSE = SSE/($n - 2$) |
Total | SST | $n - 1$ | |

TABLE 14.6 ANOVA TABLE FOR THE ARMAND'S PIZZA PARLORS PROBLEM

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | $F$
Regression | SSR = 14200 | 1 | 14200/1 = 14200 | 14200/191.25 = 74.25
Error | SSE = 1530 | 10 - 2 = 8 | 1530/8 = 191.25 |
Total | SST = 15730 | 10 - 1 = 9 | |

The $F$ distribution table shows that with one degree of freedom in the numerator and $n - 2 = 10 - 2 = 8$ degrees of freedom in the denominator, $F = 11.26$ provides an area of .01 in the upper tail. Thus, the area in the upper tail of the $F$ distribution corresponding to the test statistic $F = 74.25$ must be less than .01, and we conclude that the $p$-value must be less than .01. Excel shows the $p$-value = .000. Because the $p$-value is less than $\alpha = .01$, we reject $H_0$ and conclude that a significant relationship exists between the size of the student population and quarterly sales.

Conducting a test for significance using the correlation coefficient, that is, $H_0$: $\rho_{xy} = 0$, is therefore not necessary if a $t$ or $F$ test has already been conducted.

14.6 Using the Estimated Regression Equation for Estimation and Prediction

Confidence Interval for the Mean Value of $y$

We use the following notation.
$x_p$ = the particular or given value of the independent variable $x$
$y_p$ = the value of the dependent variable $y$ corresponding to the given $x_p$
$E(y_p)$ = the mean or expected value of the dependent variable $y$ corresponding to the given $x_p$
$\hat{y}_p = b_0 + b_1 x_p$ = the point estimate of $E(y_p)$ when $x = x_p$

CONFIDENCE INTERVAL FOR $E(y_p)$

$$\hat{y}_p \pm t_{\alpha/2} s_{\hat{y}_p} \tag{14.24}$$

where

$$s_{\hat{y}_p}^2 = s^2 \left[ \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right]$$

The confidence coefficient is $1 - \alpha$ and $t_{\alpha/2}$ is based on a $t$ distribution with $n - 2$ degrees of freedom.

FIGURE 14.8 CONFIDENCE INTERVALS FOR THE MEAN SALES $y$ AT GIVEN VALUES OF STUDENT POPULATION $x$
[Figure: quarterly sales (thousands of dollars) plotted against student population (1000s), showing the estimated line $\hat{y} = 60 + 5x$ with its upper and lower confidence limits. The confidence interval limits depend on $x_p$; the interval width is smallest at $x_p = \bar{x} = 14$.]

Prediction Interval for an Individual Value of $y$

To develop a prediction interval, we must first determine the variance associated with using $\hat{y}_p$ as an estimate of an individual value of $y$ when $x = x_p$. This variance is made up of the sum of the following two components.
1. The variance of individual $y$ values about the mean $E(y_p)$, an estimate of which is given by $s^2$.
2. The variance associated with using $\hat{y}_p$ to estimate $E(y_p)$, an estimate of which is given by $s_{\hat{y}_p}^2$.

The formula for estimating the variance of an individual value of $y_p$, denoted by $s_{\text{ind}}^2$, is

$$s_{\text{ind}}^2 = s^2 + s_{\hat{y}_p}^2 = s^2 + s^2 \left[ \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right] = s^2 \left[ 1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right] \tag{14.25}$$

FIGURE 14.9 CONFIDENCE AND PREDICTION INTERVALS FOR SALES $y$ AT GIVEN VALUES OF STUDENT POPULATION $x$
[Figure: the same axes and estimated line $\hat{y} = 60 + 5x$ as Figure 14.8, showing both the confidence interval limits and the prediction interval limits. Prediction intervals are wider; both are narrowest at $x_p = \bar{x} = 14$.]

PREDICTION INTERVAL FOR $y_p$

$$\hat{y}_p \pm t_{\alpha/2} s_{\text{ind}} \tag{14.27}$$

where the confidence coefficient is $1 - \alpha$ and $t_{\alpha/2}$ is based on a $t$ distribution with $n - 2$ degrees of freedom.

14.7 Residual Analysis: Validating Model Assumptions

RESIDUAL FOR OBSERVATION $i$

$$y_i - \hat{y}_i \tag{14.28}$$

STANDARD DEVIATION OF THE $i$th RESIDUAL

$$s_{y_i - \hat{y}_i} = s \sqrt{1 - h_i} \tag{14.30}$$

where
$s_{y_i - \hat{y}_i}$ = the standard deviation of residual $i$
$s$ = the standard error of the estimate

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2} \tag{14.31}$$

Residual Plot Against $x$

In Figure 14.11, if the assumption that the variance of $\varepsilon$ is the same for all values of $x$ is valid, and the assumed regression model is an adequate representation of the relationship between the variables, the residual plot should give an overall impression of a horizontal band of points such as the one in Panel A of Figure 14.11. However, if the variance of $\varepsilon$ is not the same for all values of $x$ (for example, if variability about the regression line is greater for larger values of $x$), a pattern such as the one in Panel B of Figure 14.11 could be observed. In this case, the assumption of a constant variance of $\varepsilon$ is violated. Another possible residual plot is shown in Panel C. In this case, we would conclude that the assumed regression model is not an adequate representation of the relationship between the variables. A curvilinear regression model or multiple regression model should be considered.

FIGURE 14.11 RESIDUAL PLOTS FROM THREE REGRESSION STUDIES
[Figure: three plots of the residual $y - \hat{y}$ against $x$. Panel A: good pattern. Panel B: nonconstant variance. Panel C: model form not adequate.]

STANDARDIZED RESIDUAL FOR OBSERVATION $i$

$$\frac{y_i - \hat{y}_i}{s_{y_i - \hat{y}_i}} \tag{14.32}$$

14.8 Residual Analysis: Outliers and Influentia
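The interval formulas (14.24) and (14.27) and the standardized residuals of (14.30) to (14.32) can be checked numerically on the Armand's Pizza Parlors data. The sketch below rests on several assumptions that are not in the slides: the given value `x_p = 10` is a hypothetical choice, the 95% confidence level is chosen for illustration, and `T_CRIT = 2.306` is the $t_{.025}$ value for 8 degrees of freedom taken from a standard $t$ table. The estimates $b_0 = 60$ and $b_1 = 5$ come from the chapter.

```python
import math

# Armand's Pizza Parlors: student population (1000s), quarterly sales ($1000s).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = 60.0, 5.0   # estimated regression equation from the chapter
T_CRIT = 2.306       # assumption: t_.025 with n - 2 = 8 df, from a t table
x_p = 10             # hypothetical given value of x

n = len(x)
x_bar = sum(x) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s2 = sse / (n - 2)   # MSE
s = math.sqrt(s2)    # standard error of the estimate

# Point estimate and the two variance estimates at x_p.
y_hat_p = b0 + b1 * x_p
s2_y_hat_p = s2 * (1 / n + (x_p - x_bar) ** 2 / s_xx)
s2_ind = s2 + s2_y_hat_p  # equation (14.25)

# Confidence interval (14.24) and prediction interval (14.27).
ci_margin = T_CRIT * math.sqrt(s2_y_hat_p)
pi_margin = T_CRIT * math.sqrt(s2_ind)
conf_int = (y_hat_p - ci_margin, y_hat_p + ci_margin)
pred_int = (y_hat_p - pi_margin, y_hat_p + pi_margin)

# Standardized residuals, equations (14.30) to (14.32).
std_resid = []
for xi, yi in zip(x, y):
    h_i = 1 / n + (xi - x_bar) ** 2 / s_xx
    resid = yi - (b0 + b1 * xi)
    std_resid.append(resid / (s * math.sqrt(1 - h_i)))

print(f"y_hat_p = {y_hat_p:.1f}")
print(f"95% CI for E(y_p): {conf_int[0]:.2f} to {conf_int[1]:.2f}")
print(f"95% PI for y_p:    {pred_int[0]:.2f} to {pred_int[1]:.2f}")
print("standardized residuals:", [round(z, 4) for z in std_resid])
```

The prediction interval comes out wider than the confidence interval at the same $x_p$, as Figure 14.9 indicates, because $s_{\text{ind}}^2$ adds $s^2$ to $s_{\hat{y}_p}^2$.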
