Peking University Summer Course: Regression Analysis

Class 5: ANOVA (Analysis of Variance) and F-tests

I. What is ANOVA?

ANOVA is the short name for the Analysis of Variance. The essence of ANOVA is to decompose the total variance of the dependent variable into two additive components, one for the structural part and the other for the stochastic part of a regression. Today we are going to examine the easiest case.

II. ANOVA: An Introduction

Let the model be $y = X\beta + \varepsilon$. Assuming $x_i$ is a column vector (of length p) of independent variable values for the ith observation, $y_i = x_i'\beta + \varepsilon_i$. Then $x_i'b$ is the predicted value.

Sum of squares total:

$$SST = \sum (y_i - \bar{y})^2 = \sum [(y_i - x_i'b) + (x_i'b - \bar{y})]^2$$
$$= \sum (y_i - x_i'b)^2 + \sum (x_i'b - \bar{y})^2 + 2\sum (y_i - x_i'b)(x_i'b - \bar{y})$$
$$= \sum e_i^2 + \sum (x_i'b - \bar{y})^2$$

because $\sum (y_i - x_i'b)(x_i'b - \bar{y}) = \sum e_i (x_i'b - \bar{y}) = 0$. This is always true by OLS.

$$= SSE + SSR$$

Important: the total variance of the dependent variable is decomposed into two additive parts: SSE, which is due to errors, and SSR, which is due to regression.

Geometric interpretation: blackboard.

Decomposition of Variance

If we treat X as a random variable, we can decompose total variance into the between-group portion and the within-group portion in any population:

$$V(y_i) = V(x_i'\beta) + V(\varepsilon_i) \qquad (1)$$

Proof:

$$V(y_i) = V(x_i'\beta + \varepsilon_i) = V(x_i'\beta) + V(\varepsilon_i) + 2\,Cov(x_i'\beta, \varepsilon_i) = V(x_i'\beta) + V(\varepsilon_i)$$

(by the assumption that $Cov(x_k, \varepsilon) = 0$ for all possible k).

The ANOVA table estimates the three quantities of equation (1) from the sample. As the sample size gets larger and larger, the ANOVA table approaches the equation more and more closely. In a sample, the decomposition of estimated variance is not strictly true. We thus need to decompose sums of squares and degrees of freedom separately. Is ANOVA a misnomer?

III. ANOVA in Matrix

I will try to give a simplified representation of ANOVA as follows:

$$SST = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2 \quad (\text{because } \sum y_i = n\bar{y})$$
$$= y'y - n\bar{y}^2 = y'y - \tfrac{1}{n} y'Jy \quad (\text{in your textbook, monster look})$$

where J is the n-by-n matrix of ones.

$$SSE = e'e$$

$$SSR = \sum (x_i'b - \bar{y})^2 = b'X'Xb - n\bar{y}^2 \quad (\text{because } \sum x_i'b = n\bar{y}, \text{ as always})$$
$$= b'X'y - n\bar{y}^2 = b'X'y - \tfrac{1}{n} y'Jy \quad (\text{in your textbook, monster look})$$

IV. ANOVA Table

SOURCE       SS     DF      MS     F          with DF
Regression   SSR    DF(R)   MSR    MSR/MSE    DF(R), DF(E)
Error        SSE    DF(E)   MSE
Total        SST    DF(T)

Let us use a real example. Assume that we have a regression estimated to be y = -1.70 + 0.840 x.

ANOVA Table

SOURCE       SS     DF   MS     F                   with DF
Regression   6.44   1    6.44   6.44/0.19 = 33.89   1, 18
Error        3.40   18   0.19
Total        9.84   19

We know the sample quantities needed below, including $\sum x_i = 100$, $\sum x_i^2 = 509.12$, and $n\bar{y}^2 = 125.0$. If we know that DF for SST = 19, what is n? n = 20.

SSR = b'X'Xb - n ȳ² = 20 × 1.7 × 1.7 + 0.84 × 0.84 × 509.12 - 2 × 1.7 × 0.84 × 100 - 125.0 = 6.44

SSE = SST - SSR = 9.84 - 6.44 = 3.40

DF (degrees of freedom): demonstration. Note: the intercept is discounted when calculating the DF for SST.

MS = SS/DF

p = 0.000 (ask students): what does the p-value say?

V. F-Tests

F-tests are more general than t-tests; t-tests can be seen as a special case of F-tests. If you have difficulty with F-tests, please ask your GSIs to review F-tests in the lab.

An F-test takes the form of a ratio of two MS's.

An F statistic has two degrees of freedom associated with it: the degrees of freedom in the numerator and the degrees of freedom in the denominator. An F statistic is usually larger than 1. The F statistic addresses whether the variance explained under the alternative hypothesis could be due to chance. In other words, the null hypothesis is that the explained variance is due to chance, or that all the coefficients are zero. The larger the F statistic, the stronger the evidence against the null hypothesis. There is a table in the back of your book from which you can find exact probability values.

In our example, the F is 34, which is highly significant.

VI. $R^2$

$R^2 = SSR / SST$

The proportion of variance explained by the model. In our example, $R^2$ = 65.4%.

VII. What happens if we add more independent variables?

1. SST stays the same.
2. SSR always increases.
3. SSE always decreases.
4. $R^2$ always increases.
5. MSR usually increases.
6. MSE usually decreases.
7. The F statistic usually increases.

("Always" in items 2 to 4 is meant in the weak sense: the quantity cannot move in the opposite direction.)

Exceptions to 5 and 7: irrelevant variables may not explain the variance but take up degrees of freedom. We really need to look at the results.

VIII. Important: General Ways of Hypothesis Testing with F-Statistics

All tests in linear regression can be performed with F-test statistics. The trick is to run "nested models."

Two models are nested if the independent variables in one model are a subset, or linear combinations of a subset, of the independent variables in the other model.

That is to say, if model A has independent variables (1, $x_1$, $x_2$), and model B has independent variables (1, $x_1$, $x_2$, $x_3$), A and B are nested. A is called the restricted model; B is called the less restricted or unrestricted model. We call A restricted because A implies that $\beta_3 = 0$. This is a restriction.
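To make the idea of a restriction concrete, here is a minimal sketch in plain Python. It assumes model A uses (1, x1, x2) and model B adds x3; the eight observations are made up purely for illustration, and `ols_sse` is a hypothetical helper written for this example, not part of the course materials. It fits both models by solving the normal equations and shows that SST is shared while the restricted model can never have the smaller SSE.

```python
def ols_sse(columns, y):
    """Fit y on the given columns by OLS and return the error sum of squares."""
    n, p = len(y), len(columns)
    # Augmented normal equations [X'X | X'y].
    aug = [
        [sum(ci[t] * cj[t] for t in range(n)) for cj in columns]
        + [sum(ci[t] * y[t] for t in range(n))]
        for ci in columns
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(aug[r][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        for r in range(p):
            if r != k:
                factor = aug[r][k] / aug[k][k]
                aug[r] = [u - factor * v for u, v in zip(aug[r], aug[k])]
    b = [aug[k][p] / aug[k][k] for k in range(p)]
    fitted = [sum(b[j] * columns[j][t] for j in range(p)) for t in range(n)]
    return sum((y[t] - fitted[t]) ** 2 for t in range(n))


# Made-up illustrative data (not from the lecture).
ones = [1.0] * 8
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
x3 = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0]
y = [2.1, 2.9, 4.2, 5.8, 7.1, 6.9, 9.4, 8.8]

ybar = sum(y) / len(y)
sst = sum((v - ybar) ** 2 for v in y)      # identical for both models

sse_a = ols_sse([ones, x1, x2], y)         # restricted: coefficient on x3 forced to 0
sse_b = ols_sse([ones, x1, x2, x3], y)     # unrestricted

print("SST:", round(sst, 4))
print("SSE restricted (A):", round(sse_a, 4))
print("SSE unrestricted (B):", round(sse_b, 4))
```

Because B can always reproduce A's fit by setting the coefficient on x3 to zero, the gap between the two SSE values is what the restriction costs in explained variance.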

Another example: C has independent variables (1, $x_1$, $x_2 + x_3$), D has (1, $x_2 + x_3$).

C and A are not nested.
C and B are nested. One restriction in C: $\beta_2 = \beta_3$.
C and D are nested. One restriction in D: $\beta_1 = 0$.
D and A are not nested.
D and B are nested. Two restrictions in D: $\beta_2 = \beta_3$; $\beta_1 = 0$.

We can always test hypotheses implied in the restricted models. Steps: run two regressions for each hypothesis, one for the restricted model and one for the unrestricted model. The SST should be the same across the two models. What is different is SSE and SSR. That is, what is different is $R^2$. Let the subscripts r and u denote the restricted and unrestricted models. Use the following formulas:

$$F = \frac{(SSE_r - SSE_u) / (df(SSE_r) - df(SSE_u))}{SSE_u / df(SSE_u)}$$

or

$$F = \frac{(SSR_u - SSR_r) / (df(SSR_u) - df(SSR_r))}{SSE_u / df(SSE_u)}$$

(proof: use SST = SSE + SSR)

Note: $df(SSE_r) - df(SSE_u) = df(SSR_u) - df(SSR_r) = \Delta df$ is the number of constraints (not the number of parameters) implied by the restricted model,

or

$$F = \frac{(R_u^2 - R_r^2) / \Delta df}{(1 - R_u^2) / df(SSE_u)}$$

Note that $F_{1, df} = t_{df}^2$.

That is, for 1 df tests, you can either do an F-test or a t-test. They yield the same result. Another way to look at it is that the t-test is a special case of the F-test, with the numerator DF being 1.

IX. Assumptions of F-tests

What assumptions do we need to make an ANOVA table work?

Not much of an assumption. All we need is the assumption that (X'X) is not singular, so that the least squares estimate b exists.

The assumption of $Cov(x_k, \varepsilon) = 0$ is needed if you want the ANOVA table to be an unbiased estimate of the true ANOVA (equation 1) in the population. Reason: we want b to be an unbiased estimator of $\beta$, and the covariance between b and $\varepsilon$ to disappear.

For reasons I discussed earlier, the assumptions of homoscedasticity and non-serial correlation are necessary for the estimation of the error variance $V(\varepsilon_i)$.

The normality assumption, that $\varepsilon_i$ follows a normal distribution, is needed for small samples.

X. The Concept of Increment

Every time you put one more independent variable into your model, you get an increase in $R^2$. We sometimes call the increase "incremental $R^2$." What it means is that more variance is explained: SSR is increased and SSE is reduced. What you should understand is that the incremental $R^2$ attributed to a variable is typically smaller than the $R^2$ when other variables are absent.

XI. Consequences of Omitting Relevant Independent Variables

Say the true model is the following:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \varepsilon_i$$

But for some reason we only collect or consider data on y, $x_1$, and $x_2$. Therefore, we omit $x_3$ in the regression. That is, we omit $\beta_3 x_3$ in our model. We briefly discussed this problem before. The short story is that we are likely to have a bias due to the omission of a relevant variable in the model. This is so even though our primary interest is to estimate the effect of $x_1$ or $x_2$ on y. Why? We will have a formal presentation of this problem.

XII. Measures of Goodness-of-Fit

There are different ways to assess the goodness-of-fit of a model.

A. $R^2$

$R^2$ is a heuristic measure for the overall goodness-of-fit. It does not have an associated test statistic. $R^2$ measures the proportion of the variance in the dependent variable that is "explained" by the model:

$R^2 = SSR / SST$

B. Model F-test

The model F-test tests the joint hypothesis that all the model coefficients except for the constant term are zero. Degrees of freedom associated with the model F-test:

Numerator: p - 1
Denominator: n - p

C. t-tests for individual parameters

A t-test for an individual parameter tests the hypothesis that a particular coefficient is equal to a particular number (commonly zero).

$t_k = (b_k - \beta_{k0}) / SE_k$, where $SE_k$ is the square root of the (k, k) element of $MSE\,(X'X)^{-1}$, with degrees of freedom = n - p.

D. Incremental $R^2$

Relative to a restricted model, the gain in $R^2$ for the unrestricted model:

$\Delta R^2 = R_u^2 - R_r^2$

E. F-tests for Nested Models

This is the most general form of F-tests and t-tests. It is equal to a t-test if the unrestricted and restricted models differ only by one single parameter. It is equal to the model F-test if we set the restricted model to the constant-only model.

Ask students: what are SST, SSE, and SSR, and their associated degrees of freedom, for the constant-only model?

Numerical Example

A sociological study is interested in understanding the social determinants of mathematical achievement among high school students. You are now asked to answer a series of questions. The data are real but have been tailored for educational purposes. The total number of observations is 400. The variables are defined as:

y: math score
x1: father's education
x2: mother's education
x3: family's socioeconomic status
x4: number of siblings
x5: class rank
x6: parents' total education (note: x6 = x1 + x2)

For the following regression models, we know:

Table 1

Model                          SST      SSR     SSE     DF    R2
(1) y on (1 x1 x2 x3 x4)       34863    4201
(2) y on (1 x6 x3 x4)          34863                    396   .1065
(3) y on (1 x6 x3 x4 x5)       34863    10426   24437   395   .2991
(4) x5 on (1 x6 x3 x4)         269753                   396   .0210

1. Please fill the missing cells in Table 1.
2. Test the hypothesis that the effects of father's edu
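As a hedged illustration of how Table 1 and the nested-model formulas fit together, the sketch below fills in the cells for models (1) and (2) using SST = SSE + SSR and R2 = SSR/SST, then compares the two models. It assumes the given cells read as SST = 34863 and SSR = 4201 for model (1), and SST = 34863, DF = 396, R2 = .1065 for model (2). Since x6 = x1 + x2, model (2) is model (1) with the single restriction that x1 and x2 share one coefficient. The function names are hypothetical, chosen for this example only.

```python
def nested_f_from_sse(sse_r, sse_u, df_r, df_u):
    """Nested-model F from error sums of squares; df_* are error DF (n - p)."""
    n_constraints = df_r - df_u
    return ((sse_r - sse_u) / n_constraints) / (sse_u / df_u)


def nested_f_from_r2(r2_r, r2_u, n_constraints, df_u):
    """Same F statistic written in terms of R-squared."""
    return ((r2_u - r2_r) / n_constraints) / ((1.0 - r2_u) / df_u)


n = 400
sst = 34863.0

# Model (1): y on (1 x1 x2 x3 x4); SST and SSR are given.
ssr_1 = 4201.0
sse_1 = sst - ssr_1            # SSE = SST - SSR
df_1 = n - 5                   # five parameters, constant included
r2_1 = ssr_1 / sst

# Model (2): y on (1 x6 x3 x4); SST, DF and R2 are given.
r2_2 = 0.1065
df_2 = 396
ssr_2 = r2_2 * sst             # SSR = R2 * SST
sse_2 = sst - ssr_2

# Model (2) is the restricted model, model (1) the unrestricted one.
f_sse = nested_f_from_sse(sse_2, sse_1, df_2, df_1)
f_r2 = nested_f_from_r2(r2_2, r2_1, df_2 - df_1, df_1)

print("Model (1): SSE =", sse_1, "DF =", df_1, "R2 =", round(r2_1, 4))
print("Model (2): SSR =", round(ssr_2, 1), "SSE =", round(sse_2, 1))
print("F with (%d, %d) DF =" % (df_2 - df_1, df_1), round(f_sse, 2))
```

The two functions agree up to rounding, which is the "proof: use SST = SSE + SSR" step in numeric form. The resulting F would then be compared against the F table with 1 and 395 degrees of freedom, as the notes describe.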
