STATA Practical Study Notes

Uploaded by: 1***; IP location: Guangxi; upload date: 2023-10-31; format: DOC; pages: 48; size: 854.50 KB


Document summary

Excerpts from study notes on applying STATA, University of Science and Technology Beijing

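Chapter 1 Basic STATA operations

1. Set the amount of memory
    set memory 500m, permanently
(Stata 12 and later manage memory automatically, so this command matters only in older versions.)

2. Display a value or a string
    display 1
    display "clive"

3. Show the structure of the dataset
    describe
    d
(d is the abbreviation of describe.)

4. Open the data editor
    edit

5. Rename a variable
    rename var1 var2

6. Show the contents of the dataset
    list
    browse
    list in 1
    list in 2/10

7. Import data from a text file (.csv): insheet
    insheet using "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.csv", clear
A dataset can be imported only when memory is empty; otherwise STATA reports "you must start with an empty dataset". Either clear all variables first,
    drop _all
or add the clear option to the import command, as above.

8. Save a file
    save "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.dta"
    save "C:\Documents and Settings\Administrator\桌面\ST9007\dataset\Fees1.dta", replace
The replace option overwrites an existing file of the same name.

9. Open and close a saved file: use
(1) Open: use followed by the file path and file name, then the clear option.
(2) Close: drop _all, or exit to leave STATA.

10. Record commands and output (log)
Start a log file:
    log using "J:\phd\output.log", replace
Suspend logging:
    log off
Resume logging:
    log on
Close the log file:
    log close

11. Create and save program files (doedit, do)
Open the do-file editor:
    doedit
Type the commands and save the file with the extension .do. Run it with do followed by the file path and file name.

12. Combine several datasets with the same variables and structure, stacking observations: append
    insheet using "J:\phd\Fees1.csv", clear
    save "J:\phd\Fees1.dta", replace
    insheet using "J:\phd\Fees2.csv", clear
    append using "J:\phd\Fees1.dta"
    save "J:\phd\Fees1.dta", replace

13. Add variables from another dataset to the current one: merge
(1) Both files must be sorted by the key variables before merging.
    insheet using "J:\phd\Fees1.csv", clear
    sort companyid yearend
    save "J:\phd\Fees1.dta", replace
    describe
    insheet using "J:\phd\Fees6.csv", clear
    sort companyid yearend
    merge companyid yearend using "J:\phd\Fees1.dta"
    save "J:\phd\Fees1.dta", replace
    describe
This is the merge syntax of Stata 10 and earlier; from Stata 11 on the documented form is merge 1:1 companyid yearend using "J:\phd\Fees1.dta".
(2) Meaning of the _merge variable:
    _merge==1  observation from the master data only
    _merge==2  observation from the using data only
    _merge==3  observation from both master and using data

14. Help files: help
    help describe

15. Descriptive statistics
(1) summarize
    summarize incorporationyear              (a single variable)
    summarize incorporationyear-big6         (a consecutive range of variables)
    summarize _all                           (all variables; plain summarize does the same)
(2) More detailed statistics
    summarize incorporationyear, detail
(3) centile
    centile auditfees, centile(0(10)100)
    centile auditfees, centile(0(5)100)
(4) tabulate: frequencies and proportions for categorical variables
    tabulate companytype
    tabulate companytype big6, column        (column percentages)
    tabulate companytype big6, row           (row percentages)
    tab companytype big6 if companytype<=3, row col      (row and column percentages under a condition)
(5) Count the observations that satisfy a condition
    count if big6==1
    count if big6==0 | big6==1
(6) Descriptive statistics of a continuous variable by the groups of a discrete variable. Either form works:
    by companytype, sort: summarize auditfees, detail
or
    sort companytype
    by companytype: summarize auditfees

16. Transform a variable
Set listed to 1 for companies with publicly issued shares (company types 2, 3 and 5) and to 0 otherwise, keeping missing values missing:
    gen listed=0
    replace listed=1 if companytype==2
    replace listed=1 if companytype==3
    replace listed=1 if companytype==5
    replace listed=. if companytype==.

17. Create a new variable: gen
    generate newvar = expression

18. Data types
(1) Numeric types
    Storage type   Bytes   Min                       Max
    byte           1       -127                      +100
    int            2       -32,767                   +32,740
    long           4       -2,147,483,647            2,147,483,620
    float          4       -1.70141173319*10^38      1.70141173319*10^36
    double         8       -8.9884656743*10^307      8.9884656743*10^308
(2) String types
    Storage type   Bytes   Max length (characters)
    str1           1       1
    str2           2       2
    ...
    str80          80      80
(3) Declare the storage type when creating a variable. A str3 variable holds at most three characters, so check the listing.
    gen str3 gender="male"
    list gender in 1/10
(4) When a variable occupies more bytes than it needs, compress shrinks it to the smallest type that holds its values.
    drop gender
    gen str30 gender="male"
    browse
    describe gender
    compress gender
(5) Date type: %d dates, which are a count of the number of days elapsed since January 1, 1960.
(a) Convert a string date with date(). The second argument ("MDY" here) must match the order of month, day and year in the string. The result is the number of days since January 1, 1960.
    gen fye=date(yearend,"MDY")
    list yearend fye in 1/10
(b) Apply the %d format so that fye is displayed as a date; the stored value does not change.
    format fye %d
    list yearend fye in 1/10
    sum fye
(c) Extract the year, month and day from a date.
    gen year=year(fye)
    gen month=month(fye)
    gen day=day(fye)
    list yearend fye year month day in 1/10
(d) Combine separate year, month and day variables into one date variable.
    drop fye
    gen fye=mdy(month,day,year)
    format fye %d
    list yearend fye in 1/10
(e) Convert a numeric date such as 20080131 into a STATA date.
    gen year=int(date/10000)
    gen month=int((date-year*10000)/100)
    gen day=date-year*10000-month*100
    list date year month day in 1/10
    gen edate=mdy(month,day,year)
    format edate %d
    list edate date in 1/10

19. Stored results: r()
    sum auditfees
    gen meanadjaf=auditfees-r(mean)
    list meanadjaf in 1/10
Common r() results after summarize:
    r(N)       Number of cases
    r(sum_w)   Sum of weights
    r(mean)    Arithmetic mean
    r(var)     Variance
    r(sd)      Standard deviation
    r(min)     Minimum
    r(max)     Maximum
    r(sum)     Sum of variable
To display them:
    sum auditfees, detail
    return list

20. The recode command (PPT slide 61)
(1) Turn a variable with many values into a dummy variable.
    recode year (min/1999=0) (2000/max=1), gen(yeardum)
min/1999 assigns 0 to all values less than or equal to 1999; 2000/max assigns 1 to all values greater than or equal to 2000.
(2) Group a continuous variable using unequal cut points: the recode() function. Each listed value is the upper limit of its group and belongs to that group.
    gen assets_categ=recode(totalassets,100,500,1000,5000,20000,100000,1000000)
    sort assets_categ
    by assets_categ: sum totalassets assets_categ
(3) Group a continuous variable into intervals of equal width: autocode(variable name, number of intervals, min value, max value). For example:
    gen assets_categ=autocode(totalassets,10,0,10000)
(4) Group a continuous variable so that each group has the same number of observations: xtile. The group sizes need not be exactly equal.
    xtile assets_categ=totalassets, nquantiles(10)

21. Compute a statistic for every group in one step: egen
Sort by company type, compute the mean audit fee within each type, and store it in a new variable:
    by companytype, sort: egen meanaf2=mean(auditfees)
Other egen functions include count(), mean(), median() and sum().

22. _n and _N
(1) Show the sequence number of each observation and the total number of observations.
    sort companyid fye
    capture drop x
    gen x=_n
    capture drop y
    gen y=_N
    list companyid fye x y in 1/30
(2) Show the sequence number within each group and the number of observations in each group.
    capture drop x y
    sort companyid fye
    by companyid: gen x=_n
    by companyid: gen y=_N
    list companyid fye x y in 1/30
(3) Create variables equal to the first or the last value within each group.
    sort companyid fye
    by companyid: gen auditfees_first=auditfees[1]
    by companyid: gen auditfees_last=auditfees[_N]
    list companyid fye auditfees auditfees_first auditfees_last in 1/30
(4) Create variables equal to the value lagged by one or two periods.
    sort companyid fye
    by companyid: gen auditfees_lag1=auditfees[_n-1]
    by companyid: gen auditfees_lag2=auditfees[_n-2]
    list companyid fye auditfees auditfees_lag1 auditfees_lag2 in 1/30

23. Change the structure of a dataset: reshape
Databases deliver data in different layouts. In long form, the different years of one company are in different rows. In wide form, the different years of one company are in the same row. reshape converts between the two. The conversion requires that each company has at most one observation per year; otherwise reshape stops with an error.
(1) Long to wide:
    reshape wide yearend incorporationyear companytype sales auditfees nonauditfees currentassets currentliabilities totalassets big6 fye, i(companyid) j(year)
(2) Wide to long:
    reshape long yearend incorporationyear companytype sales auditfees nonauditfees currentassets currentliabilities totalassets big6 fye, i(companyid) j(year)
(3) After the first conversion the commands can be shortened:
    reshape wide
    reshape long

24. Example: computing CAR
Given daily stock returns, market returns and event dates, compute the CAR over a three-day window.
(1) Define the three-day window (the event day is 0).
    sort ticker edate
    gen window=0 if eventdate<.
    replace window=-1 if window[_n+1]==0 & ticker==ticker[_n+1]
    replace window=1 if window[_n-1]==0 & ticker==ticker[_n-1]
(2) Compute AR and CAR.
    gen ar=ret-vwretd
    gen car=ar+ar[_n-1]+ar[_n+1] if window==0 & ticker==ticker[_n+1] & ticker==ticker[_n-1]
(3) Check the result.
    list ticker edate ret vwretd ar car window if window<.

25. t-tests of means
(1) Test whether audit fees differ significantly between Big 6 and other auditors in the whole sample.
    use "J:\phd\Fees.dta", clear
    gen lnaf=ln(auditfees)
    by big6, sort: sum lnaf
    ttest lnaf, by(big6)
(2) Repeat the comparison year by year by adding the by year: prefix.
    gen fye=date(yearend,"MDY")
    format fye %d
    gen year=year(fye)
    sort year
    by year: ttest lnaf, by(big6)
(3) Test whether the mean equals a specific value.
    sum lnaf
    ttest lnaf=2.1

26. Significance tests for medians
(1) Obtain the medians.
    by big6, sort: sum lnaf, detail
    by big6, sort: centile lnaf
(2) Test the difference in medians.
    median lnaf, by(big6)
    ranksum lnaf, by(big6)

27. Contingency tables and correlations
(1) Create a contingency table. The first variable defines the rows (the leftmost column of labels) and the second variable defines the columns (the top row of labels).
    tabulate companytype big6, row
(2) Test the association between the two variables: chi2.
    tabulate companytype big6, chi2 row
(3) Correlation matrix.
    pwcorr lnaf big6 year listed
(4) Correlation matrix with significance levels (p-values).
    pwcorr lnaf big6 year listed, sig
(5) Also show the number of observations behind each correlation.
    pwcorr lnaf big6 listed if year==2000, sig obs

28. Build a sample without missing values
(1) Flag complete observations. samp equals 1 when none of the variables is missing; in every other row samp is missing (not 0).
    gen samp=1 if lnaf<. & big6<. & year<. & listed<.
(2) Count the missing values in each row. (rowmiss() is the current name of the rmiss() egen function.)
    egen miss=rmiss(lnaf big6 year listed)
    sum miss, detail

29. Graphs
(1) Histograms
    histogram incorporationyear, width(1)
    histogram incorporationyear, bin(147)
width sets the width of each bar and bin sets the number of bars. Changing the width can make the graph easier to read.
Choose the range and the bar width:
    hist lnaf if lnaf>=0 & lnaf<=5, width(0.25)
Choose the axis tick labels:
    hist lnaf if lnaf>=0 & lnaf<=5, width(0.25) xlabel(0(0.5)5)
Compare with a normal distribution:
    hist lnaf if lnaf>=0 & lnaf<=5, width(0.25) normal
(2) Scatter plots (scatter). The first variable is on the vertical axis and the second on the horizontal axis.
    scatter lnaf lnta
Add the best-fitting straight line to the scatter plot:
    twoway (scatter lnaf lnta, msize(tiny)) (lfit lnaf lnta)

30. Winsorizing: winsor
winsor is a user-written command and must be installed before use.
    winsor rev, gen(wrev) p(0.01)
p(0.01) is the fraction of observations winsorized in each tail. The extreme values are replaced by the cut-off value, not dropped.
    winsor rev, gen(wrev) h(5)
h(5) is the number of observations winsorized in each tail.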
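The by-group indexing with _n and _N described in Chapter 1 can be tried on a small made-up panel. This is only a sketch: the six observations and their values are invented for illustration, and only the variable names follow the notes.

```stata
* Toy panel: 2 firms x 3 years (all values are hypothetical)
clear
input companyid year auditfees
1 2000 10
1 2001 12
1 2002 15
2 2000 20
2 2001 18
2 2002 25
end

sort companyid year
* Position of the row within its firm, and number of rows per firm
by companyid: gen x = _n
by companyid: gen y = _N
* First value within the firm, and value of the previous year
by companyid: gen auditfees_first = auditfees[1]
by companyid: gen auditfees_lag1  = auditfees[_n-1]
* Firm-level mean, repeated on every row of the firm
by companyid: egen meanaf = mean(auditfees)

list, sepby(companyid)
```

The lag is missing in the first row of each firm because auditfees[0] does not exist; this is why the notes sort by firm and date before creating lags.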

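The conversion of a numeric date such as 20080131 into a STATA date can be checked on two hand-typed values. A minimal sketch, assuming the date is stored as a long integer in the form yyyymmdd; the two dates are made up, and %td is the current spelling of the %d format used in the notes.

```stata
* Two hypothetical dates stored as numbers in yyyymmdd form
clear
input long date
20080131
19991231
end

* Split the number into its year, month and day parts
gen year  = int(date/10000)
gen month = int((date - year*10000)/100)
gen day   = date - year*10000 - month*100

* Rebuild a STATA date (days since 01jan1960) and display it as a date
gen edate = mdy(month, day, year)
format edate %td
list
```

Because edate is a day count, differences between two such variables are numbers of days.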
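Chapter 2 Linear regression

Contents:
2.1 The basic idea underlying linear regression
2.2 Single variable OLS
2.3 Correctly interpreting the coefficients
2.4 Examining the residuals
2.5 Multiple regression
2.6 Heteroskedasticity
2.7 Correlated errors
2.8 Multicollinearity
2.9 Outlying observations
2.10 Median regression
2.11 "Looping"

2.1 The basic idea underlying linear regression
1. Residuals: $Y$ is the actual value, $\hat{Y}$ is the predicted value, and $\varepsilon = Y - \hat{Y}$ is the residual. OLS chooses the coefficients that minimise the sum of squared residuals.
2. Basic single-variable regression
    regress y x
3. Stored regression results: the coefficient of each regressor is stored in _b[varname] and the constant in _b[_cons].
4. Predicted values and residuals
    predict yhat
    predict yres, resid
yres is the difference between the actual and the predicted value.
5. Scatter plot of the residuals against X
    twoway (scatter yres x) (lfit yres x)
6. Precision of the estimated coefficients: the standard error. The t-value is the coefficient divided by its standard error. The p-value is computed from the distribution of the t-value and gives the probability of a t-value at least this extreme if the true coefficient were zero. Both rely on the following assumptions about the residuals:
    Homoscedasticity (i.e., the residuals have a constant variance)
    Non-correlation (i.e., the residuals are not correlated with each other)
    Normality (i.e., the residuals are normally distributed)
7. What the regression output reports
Degrees of freedom of each sum of squares:
    For the ESS, df = k - 1 where k = number of regression coefficients (df = 2 - 1)
    For the RSS, df = n - k where n = number of observations (= 11 - 2)
    For the TSS, df = n - 1 (= 11 - 1)
MS: the last column (MS) reports the ESS, RSS and TSS divided by their respective degrees of freedom.
R-squared = ESS/TSS.
Adj R-squared = 1 - (1 - R2)(n - 1)/(n - k). It corrects for the fact that R-squared rises even when weakly related explanatory variables are added.
Root MSE = square root of RSS/(n - k): the typical size of a residual.
F-statistic = (ESS/(k - 1))/(RSS/(n - k)): the overall explanatory power of the model.

2.3 Correctly interpreting the coefficients
1. To test whether the Big 6 effect on audit fees differs between listed and unlisted companies, use an interaction variable: big6*listed.
2. Interpreting regression coefficients
(1) Continuous variables. The economic meaning of a coefficient is the effect of X on Y, which can be measured in several ways: the change in Y when X moves from its 25th to its 75th percentile, or the change in Y when X moves by one standard deviation.
    reg auditfees totalassets
    sum totalassets if auditfees<., detail
    gen fees_low=_b[_cons]+_b[totalassets]*r(p25)
    gen fees_high=_b[_cons]+_b[totalassets]*r(p75)
    sum fees_low fees_high
(2) Discrete variables. Compare the values 0 and 1 rather than percentiles.
    reg lnaf big6
    gen fees_nb6=exp(_b[_cons])
    gen fees_b6=exp(_b[_cons]+_b[big6])
    sum fees_nb6 fees_b6

2.4 Examining the residuals
1. When reporting results, do not rely on R-squared alone; also report other diagnostics:
    is there significant heteroscedasticity?
    is there any pattern to the residuals?
    are there any problems of outliers?
2. Use of R2. Gu (2007) points out that:
    econometricians consider R2 values to be relatively unimportant (accounting researchers put far too much emphasis on the magnitude of the R2)
    regression R2s should not be compared across different samples
    in contrast there is a large accounting literature that uses R2s to determine whether the value relevance of accounting information has changed over time
The R2 tells us nothing about whether our hypothesis about the determinants of Y is correct.
3. Use the residuals to judge the quality of the model.

2.5 Multiple regression
1. Deciding whether the model omits relevant explanatory variables: rely on theory and on prior empirical studies.
2. Check whether the residuals are independent of the predicted values:
    gen listed=0
    replace listed=1 if companytype==2 | companytype==3 | companytype==5
    reg lnaf lnta big6 listed
    predict lnaf_hat                         (predicted values of the dependent variable)
    predict lnaf_res, resid                  (residuals stored in lnaf_res)
    twoway (scatter lnaf_res lnaf_hat) (lfit lnaf_res lnaf_hat)      (are residuals and predicted values related?)
3. A single command produces the same plot:
    reg lnaf lnta big6 listed
    rvfplot

2.6 Heteroscedasticity (hettest)
1. Test for constant variance by running hettest after the regression (estat hettest in current versions):
    reg auditfees nonauditfees totalassets big6 listed
    hettest
Heteroscedasticity does not bias the coefficients, but it does bias their standard errors. A possible cause is that the data are bounded, which produces high skewness. Some heteroscedasticity can be removed by taking logarithms. When heteroscedasticity is found, adjust the standard errors with the Huber/White/sandwich estimator, which STATA applies when robust is added to the regression:
    reg auditfees nonauditfees totalassets big6 listed, robust
With robust the coefficients and R2 are unchanged but the standard errors differ. In this example the t-values and the F-statistic fall and the p-values rise.

2.7 Correlated errors
1. The residuals of a given firm are correlated across years ("time series dependence"). In panel data there are likely to be unobserved company-specific characteristics that are relatively constant over time. They affect every year of the same company, so the observations are not independent.
2. The standard errors are biased downwards. This problem can be avoided by adjusting the standard errors for the clustering of yearly observations across a given company.
3. To correct for correlated errors, add robust cluster() to the regression:
    reg lnaf lnta big6 listed, robust cluster(companyid)
4. How to verify that the residuals of the same company are correlated across years:
    reg lnaf lnta
    predict res, resid
    keep companyid year res
    sort companyid year
    drop if companyid==companyid[_n-1] & year==year[_n-1]
    reshape wide res, i(companyid) j(year)
    browse
    pwcorr res1998-res2002
5. With panel data, note that if robust is used to control for heteroscedasticity but cluster() is not used to control for time-series dependence, the t-statistics are still biased upwards. If heteroscedasticity is not controlled either, the upward bias is more severe. Therefore add the robust cluster() option when using panel data; otherwise your "significant" results from pooled regressions may be spurious.

2.8 Multicollinearity
1. When does multicollinearity arise?
We have seen that when there is perfect collinearity between independent variables, STATA will have to exclude one of them. For example, year_1 + year_2 + year_3 + year_4 + year_5 = 1, so in a regression with a constant the five year dummies are perfectly collinear with the constant, and STATA automatically throws away one of the year dummies so that the model can be estimated. Dropping the constant instead keeps all five dummies:
    reg lnaf year_1 year_2 year_3 year_4 year_5, nocons
Even if the independent variables are not perfectly collinear, there can still be a problem if they are highly correlated.
2. Consequences:
    the standard errors of the coefficients tend to be large (i.e., the coefficients are not estimated precisely)
    the coefficient estimates can be highly unstable
3. Measurement: variance-inflation factors (VIF) indicate whether multicollinearity is present.
    reg lnaf lnta big6 lnta1
    vif
    reg lnaf lnta big6
    vif
Severity: a VIF of 10 can be judged high and a VIF of 20 very high.

2.9 Outlying observations
1. Measuring outliers with Cook's D
We can calculate the influence of each observation on the estimated coefficients using Cook's D. Values of Cook's D that are higher than 4/N are considered large, where N is the number of observations used in the regression.
2. Computing Cook's D
    reg lnaf lnta big6
    predict cook, cooksd                     (stores Cook's D in cook)
    sum cook, detail
    gen max=4/e(N)                           (e(N) is the number of observations stored by the regression)
    count if cook>max & cook<.
3. Re-run the regression without the outliers
    reg lnaf lnta big6 if cook<=max
4. Handling outliers by winsorizing. A disadvantage with "winsorizing" is that the researcher is assuming that outliers lie only at the extremes of the variable's distribution.
    winsor lnaf, gen(wlnaf) p(0.01)
    winsor lnta, gen(wlnta) p(0.01)
    sum lnaf wlnaf lnta wlnta, detail
    reg wlnaf wlnta big6

2.10 Median regression
1. Median regression is used when outliers are a problem.
2. Principle: OLS minimises the sum of squared residuals, whereas median regression minimises the sum of the absolute residuals.
3. Estimation: STATA treats median regression as a special case of quantile regression.
    qreg lnaf lnta big6

2.11 "Looping"
1. When the same set of commands is needed repeatedly, define a program: it starts with program, contains the commands (here a forvalues loop) and finishes with end. Typing the program name, here ten, then runs the whole set of commands. Example:
    program ten
        forvalues i=1(1)10 {
            display `i'
        }
    end
2. To change a program, first drop the version held in memory and then define it again:
    capture program drop ten
Example: computing abnormal (discretionary) accruals with the Jones model, estimated separately for each one-digit SIC group.
    use "J:\phd\accruals.dta", clear
    gen one_sic=int(sic/1000)
    gen ncca=current_assets-cash
    gen ndcl=current_liabilities-debt_in_current_liabilities
    sort cik year
    gen ch_ncca=ncca-ncca[_n-1] if cik==cik[_n-1]
    gen ch_ndcl=ndcl-ndcl[_n-1] if cik==cik[_n-1]
    gen accruals=(ch_ncca-ch_ndcl)/assets[_n-1] if cik==cik[_n-1]
    gen lag_assets=assets[_n-1] if cik==cik[_n-1]
    gen ppe_scaled=ppe/assets[_n-1] if cik==cik[_n-1]
    gen chsales_scaled=(sales-sales[_n-1])/assets[_n-1] if cik==cik[_n-1]
    gen ab_acc=.
    capture program drop ab_acc
    program ab_acc
        forvalues i=0(1)9 {
            capture reg accruals lag_assets ppe_scaled chsales_scaled if one_sic==`i'
            capture predict ab_acc`i' if one_sic==`i', resid
            replace ab_acc=ab_acc`i' if one_sic==`i'
            capture drop ab_acc`i'
        }
    end
    ab_acc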
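The formulas for R-squared, adjusted R-squared, Root MSE and the F-statistic listed in section 2.1 can be reproduced from the results that regress stores. A minimal sketch, assuming the auto dataset that ships with STATA stands in for the fee data; the scalar names are chosen here for illustration.

```stata
sysuse auto, clear
regress price weight

* Building blocks stored by regress
scalar s_ess = e(mss)
scalar s_rss = e(rss)
scalar s_tss = s_ess + s_rss
scalar s_n   = e(N)
* e(df_m) is k - 1, so k counts the constant as well
scalar s_k   = e(df_m) + 1

* The formulas from the notes
scalar s_r2    = s_ess/s_tss
scalar s_adjr2 = 1 - (1 - s_r2)*(s_n - 1)/(s_n - s_k)
scalar s_rmse  = sqrt(s_rss/(s_n - s_k))
scalar s_f     = (s_ess/(s_k - 1))/(s_rss/(s_n - s_k))

display "R2 = " s_r2 "   Adj R2 = " s_adjr2
display "Root MSE = " s_rmse "   F = " s_f
```

The displayed values should agree with the header of the regression table, which gives a quick way to check how each statistic is built.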

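The claim in section 2.6 that robust changes the standard errors but not the coefficients or the R2 can be checked directly. This is a sketch under the assumption that the auto dataset shipped with STATA is an acceptable stand-in; the regression and the scalar names are illustrative and not part of the fee analysis.

```stata
sysuse auto, clear

* Ordinary standard errors
regress price weight mpg
scalar b_ols  = _b[weight]
scalar se_ols = _se[weight]
scalar r2_ols = e(r2)

* Huber/White standard errors
regress price weight mpg, robust
scalar b_rob  = _b[weight]
scalar se_rob = _se[weight]
scalar r2_rob = e(r2)

display "coefficient: " b_ols " vs " b_rob
display "std. error:  " se_ols " vs " se_rob
display "R-squared:   " r2_ols " vs " r2_rob
```

Whether the robust standard error is larger or smaller than the ordinary one depends on the data, so the direction of the change should be read from the output, not assumed.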
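Chapter 3 Regression when the dependent variable is not continuous

Contents:
3.1 Why not OLS?
3.2 The basic idea underlying logit models
3.3 Estimating logit models
3.4 Multinomial models
3.5 Ordinal dependent variables
3.6 Count data models
3.7 Tobit models and interval regression
3.8 Duration models

3.1 Why not OLS?
There are two statistical problems if we use OLS when the dependent variable is categorical:
    The predicted values can be negative or greater than one
    The standard errors are biased because the residuals are heteroscedastic
Instead of OLS, we can use a logit model.

3.2 The basic idea underlying logit models
1. We need to create a variable that turns the discrete dependent variable into a form suitable for a linear model. It should:
    have an infinite range
    reflect the likelihood of choosing a Big 6 auditor versus a non-Big 6 auditor
2. The "odds ratio" satisfies both requirements once logged: log(odds ratio).
3. Worked example (table not preserved in this copy): the first column gives the probability of choosing a Big 6 auditor, the second and third columns the odds ratio, and the fourth column its natural logarithm.
4. Relation between L and P: $L = \ln\left(\frac{P}{1-P}\right)$, equivalently $P = \frac{e^{L}}{1+e^{L}}$.
5. Likelihood function: the model is estimated by "maximum likelihood" estimation.
6. Estimation commands: logit and logistic
    logit reports the values of the estimated coefficients
    logistic reports the odds ratios
Coefficient estimates are usually reported, so logit is used.
7. Measures of explanatory power: pseudo-R2 and Chi2
    pseudo-R2 = (ln(L0) - ln(LN))/ln(L0) = (-175224 + 146215)/-175224
    Chi2 = -2(ln(L0) - ln(LN)) = -2*(-175224 + 146215) = 58018
ln(L0) is the log likelihood at the first iteration and ln(LN) is the log likelihood at the last iteration.

3.3 Estimating logit models
1. The regression
    logit big6 lnta age, robust cluster(companyid)
robust corrects for heteroscedasticity and cluster() corrects for correlated errors.
2. Predicted probabilities
    logit big6 lnta age, robust cluster(companyid)
    capture drop big6hat
    predict big6hat
    sum big6hat, detail
predict returns the probability $P = \frac{e^{Xb}}{1+e^{Xb}}$, where $Xb$ is the fitted linear index.
3. Predicted values of the linear index
    gen big6hat1=_b[_cons]+_b[lnta]*lnta+_b[age]*age
    sum big6hat1, detail
The gen line can be replaced by predict big6hat1, xb. The probability can then be computed from the index, which gives the same values as big6hat:
    gen big6hat2=exp(big6hat1)/(1+exp(big6hat1))
    sum big6hat big6hat1 big6hat2
4. Effect of a change in an independent variable on the predicted probability (here age 10 versus age 20):
    logit big6 lnta age, robust cluster(companyid)
    gen big10=exp(_b[_cons]+_b[lnta]*lnta+_b[age]*10)/(1+(exp(_b[_cons]+_b[lnta]*lnta+_b[age]*10)))
    gen big20=exp(_b[_cons]+_b[lnta]*lnta+_b[age]*20)/(1+(exp(_b[_cons]+_b[lnta]*lnta+_b[age]*20)))
    sum big10 big20
5. Checking whether the relation between the dependent and an independent variable is monotonic:
    xtile lnta_categ=lnta, nquantiles(10)
    tabulate lnta_categ, gen(lnta_)
    logit big6 lnta_2-lnta_10 age, robust cluster(companyid)
6. An alternative estimator: probit
Logit maps P(Y=1) into the interval from 0 to 1 using the logistic distribution. Probit maps P(Y=1) into the interval from 0 to 1 using the normal distribution (the likelihood function is not preserved in this copy). The coefficients tend to be larger in probit models but the levels of statistical significance are often similar. Example:
    capture drop big6hat big6hat1
    logit big6 lnta age, robust cluster(companyid)
    predict big6hat
    probit big6 lnta age, robust cluster(companyid)
    predict big6hat1
    pwcorr big6hat big6hat1

3.4 Multinomial models
1. When to use them: the dependent variable has three or more categories and the categories are not ordered. Each category could be coded as a separate 0/1 variable, but if a separate logit model is estimated for each of them, the predicted probabilities do not add up to 1.
Divide the company types into three classes:
    gen cotype1=0 if companytype==1 | companytype==6
    replace cotype1=1 if companytype==4
    replace cotype1=2 if companytype==2 | companytype==3 | companytype==5
Create a 0/1 variable for each class:
    gen private=0
    replace private=1 if cotype1==0
    gen public_nontraded=0
    replace public_nontraded=1 if cotype1==1
    gen public_traded=0
    replace public_traded=1 if cotype1==2
Estimate a separate logit model for each variable:
    logit private lnta, robust cluster(companyid)
    predict private_hat
    logit public_nontraded lnta, robust cluster(companyid)
    predict public_nontraded_hat
    logit public_traded lnta, robust cluster(companyid)
    predict public_traded_hat
The probabilities do not sum to 1:
    gen sum_prob=private_hat+public_nontraded_hat+public_traded_hat
    sum sum_prob, detail
2. Regression with more than two categories: mprobit or mlogit. mprobit takes longer to run; mlogit is faster.
    mprobit cotype1 lnta, robust cluster(companyid)
    mlogit cotype1 lnta, robust cluster(companyid)
After the regression, test directly whether coefficients are equal across equations:
    test [1=2]: lnta
    test [1=2]: _cons
In the regressions above STATA chooses its default category as the comparison group. The comparison group can also be set by hand:
    mlogit cotype1 lnta, baseoutcome(1) robust cluster(companyid)

3.5 Ordinal dependent variables
1. When to use an ordered model: more generally, the ordered dependent variable may take N possible values (Y = 1, 2, ..., N), in which case there are N-1 cut-off points:
    $L = a_0 + a_1 X_1 + a_2 X_2 + e$
    $Y = N$ if $k_{N-1} < L < +\infty$
    $Y = N-1$ if $k_{N-2} < L \le k_{N-1}$
    ...
    $Y = 2$ if $k_1 < L \le k_2$
    $Y = 1$ if $-\infty < L \le k_1$
2. Ordered regression: ologit
    ologit opinion reviewed_firm_also_reviewer litigation_dummy, robust
    ologit opinion1 reviewed_firm_also_reviewer litigation_dummy, robust
The two regressions give the same results: the values of the dependent variable differ, but their ordering is the same.
3. Output: the regression reports cut values. These are the cut-off values $k_{N-1}, k_{N-2}, \dots, k_2, k_1$ in the conditions listed under item 1. Another difference is that there is no intercept term in the ordered logit and ordered probit models.
4. The alternative estimator for ordered data: oprobit
    oprobit opinion reviewed_firm_also_reviewer litigation_dummy, robust
Notice that the ologit and oprobit results are quite close to each other; usually it doesn't make much difference whether you use ordered logit or ordered probit.

3.6 Count data models
1. When to use them: count models apply when the dependent variable takes non-negative integer values that have a real meaning as counts. For example, consider the number of financial analysts that follow a given company:
    if the company is not followed by any analysts, Y = 0
    if the company is followed by one analyst, Y = 1
    if the company is followed by two analysts, Y = 2
    if the company is followed by three analysts, Y = 3
OLS is not suitable for such data: OLS assumes a continuous dependent variable ranging from minus infinity to plus infinity, whereas a count is discrete and cannot be negative.
2. Suitable models: two distributions that fulfill the criteria of having non-negative discrete integer values are the "Poisson" and the "negative binomial".
    the negative binomial (nbreg)
    the Poisson (poisson)
3. Examples of count data in practice:
    The number of R&D patents awarded
    The number of airline accidents
    The number of murders
    The number of times that mainland Chinese people have visited Singapore
    The number of weaknesses found by peer reviewers at audit firms
4. Choosing the model
(1) The Poisson model. The Poisson distribution is most often used to determine the probability of x occurrences per unit of time, e.g., the number of murders per year. The basic assumptions of the Poisson distribution are as follows:
    The time interval can be divided into small subintervals such that the probability of an occurrence in each subinterval is very small
    The probability of an occurrence in each subinterval remains constant over time
    The probability of two or more occurrences in each subinterval must be small enough to be ignored
    An occurrence or nonoccurrence in one subinterval must not affect the occurrence or nonoccurrence in any other subinterval (this is the independence assumption)
An example that satisfies the assumptions:
    The probability of a murder occurring during any given minute is small
    The probability of a murder occurring during any given minute remains constant during the year
    The probability of more than one person being murdered during any given minute is very small
    The number of murders in any given time period is independent of the number of murders in any other time period
Parameter: the only parameter needed to characterize the Poisson distribution is the mean rate at which events occur, the "incidence rate" $\lambda$. For example, $\lambda$ can be the average number of murders per month or the average number of analysts per company.
Probability function of the Poisson distribution: $P(X = x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$. Exercise: if the average number of crimes per month is 2, find the probability of 3 crimes in a month.
Feature of the model: it has the single parameter $\lambda$, and the incidence rate is estimated from the regressors (the expression is not preserved in this copy).
Commands: control for heteroscedasticity using the robust option.
    poisson weaknesses reviewed_firm_also_reviewer litigation_dummy, robust
If this were a panel dataset (it isn't), you would also need to control for time-series dependence using the cluster() option.
Weakness: unobserved heterogeneity in the data (e.g., omitted variables) will often cause the variance to exceed the mean (a phenomenon known as "overdispersion").
Test after the regression: run poisgof immediately after the regression (estat gof in current versions). If the test is significant, the Poisson model is not appropriate and the negative binomial should be used instead; it does not assume that the mean and variance of the distribution are the same.
(2) The negative binomial model. Add cluster() for panel data.
    nbreg weaknesses reviewed_firm_also_reviewer litigation_dummy, robust
A significant alpha in the output indicates that the Poisson model is not appropriate.

3.7 Tobit and interval regression models
1. Suitable data: censoring (or truncation) of the dependent variable. For example, when the number of people who want to attend exceeds the number of seats, the true demand is not observed.
2. Choice of model: the censoring problem can be solved by estimating a "tobit" model. The tobit model is somewhat similar to the probit model:
    $Y^* = a_0 + a_1 X + e$
    $Y = 0$ if $-\infty < Y^* \le 0$
    $Y = Y^*$ if $0 < Y^* < +\infty$
The Y* and Y variables are both observed when they are greater than zero (Y* is unobserved when Y = 0). Both the probit and tobit models assume that the errors (e) are normally distributed.
3. Example: recall that in our fee dataset, the nonauditfees variable is left-censored at zero because many companies choose not to purchase any non-audit services. This phenomenon is like some individuals choosing not to purchase any cigarettes when the price exceeds P0.
    gen lnta=ln(totalassets)
    egen miss=rmiss(lnnaf lnta)              (miss counts the missing values among lnnaf and lnta)
    tobit lnnaf lnta if miss==0, ll(0)       (ll(#) sets the left-censoring limit, ul(#) the right-censoring limit)
    tobit lnnaf lnta if miss==0, ll          (same as the previous command)
After the regression, count how many observations are censored:
    count if miss==0 & lnnaf==0
    count if miss==0 & lnnaf>0
4. Tobit can also be used when the data are censored on both sides.
    gen lnnaf1=lnnaf
    replace lnnaf1=5 if lnnaf>5 & lnnaf!=.
    tobit lnnaf1 lnta if miss==0, ll(0) ul(5)
    tobit lnnaf1 lnta if miss==0, ll ul      (when the limits are the sample minimum and maximum they need not be typed; STATA picks them up)
    tobit lnnaf lnta if miss==0, ll ul(5) robust cluster(companyid)      (controls for heteroscedasticity and time-series dependence)

3.8 Duration models (survival models)
1. Suitable data: the dependent variable measures how long something lasts. For example:
    Duration of life (medical, engineering): how long do people live for? how long do machines last?
    Duration of unemployment (economics): how long do people remain unemployed? For example, we may be interested in how retraining schemes affect the duration of unemployment
    Duration of CEO tenure (management): how long does the CEO stay at the same company?
    Duration of auditor-company tenure (accounting): how long do the company and audit firm stay together?
2. The measure: the "hazard rate", h(t), is the probability that the event will occur in period t, given that it has not occurred up to time t.
3. Declaring survival data: stset timevar
    use "J:\phd\kva.dta", clear
    list
    stset failtime
stset creates four internal variables. To display them:
    list failtime _st _d _t _t0
    The _st variable is a dummy equal to one for observations whose data has been stset (e.g., there would have been some zero values if we had excluded some observations using the if qualifier)
    The _d variable indicates whether the state changed (failure)
    The _t variable is the survival time
    The _t0 variable is the starting point of the survival time, 0 by default
4. Estimating the Cox proportional hazards model: stcox
    stcox load bearings
load and bearings are the two variables that affect survival. The reported hazard ratios are the exponentials of the coefficients:
    The hazard ratio for load = 1.52647 = exp(a1), where a1 is the coefficient on load
    a1 = ln(1.52647) = 0.4229578
    The coefficient on bearings = ln(0.0636433) = -2.754461
The load coefficient is significantly positive, implying that the machines fail more quickly (higher hazard rate) when they are under greater stress. The bearings coefficient is significantly negative, implying that the machines fail less quickly (lower hazard rate) when they use the new type of bearing.
To report coefficients instead of hazard ratios:
    stcox load bearings, nohr
5. Ties: a tie occurs when two or more observations have the same survival time.
    stcox load bearings, breslow
    stcox load bearings, efron
First method, breslow: the Breslow method is very fast and is the default method that STATA uses for resolving ties. Second method, efron: it is more accurate than the first but takes longer. With two identical failure times, each of the two is given a probability of 0.5.
6. Censoring: when not every observation in the sample has failed, add an option to stset.
    stset failtime, failure(failed)
    The failtime variable gives the time of failure or censoring
    The failed variable indicates whether failure or censoring occurred
    STATA assumes censoring if failed equals zero or is set to missing
7. Several rows per subject: everything above handles one row per subject. When a characteristic of the subject changes over time, several rows are needed, and the subject identifier must be passed to stset.
    stset t, id(patid) failure(died)
8. When left-censoring occurs, add the start-time variable to stset.
    stset end, id(id) failure(died) enter(begin)
9. When part of the period is missing in the middle, state the end time, the subject identifier, the failure indicator and the start time.
    stset end, id(id) failure(died) enter(begin)
10. To correct for heteroscedasticity and time-series dependence, add robust and cluster() at the end of the regression command.
    stcox x1, robust cluster(id)

Summary: choose the regression model according to the type of the dependent variable.

    Dependent variable (Y) | Examples | Estimation method(s) | STATA command
    Continuous (-∞ < Y < +∞) | Log of audit fees; Stock returns; Cost of capital | OLS; Quantile regression | regress; qreg
    Binary (Y = 0, 1) | Listed / Not listed; Big6 / Non-Big6 auditor | Probit; Logit | probit; logit
    Discrete and unordered (Y = 0, 1, 2, ..) | Method of transport (train, bus, car, bicycle); Type of company (private, public unquoted, quoted) | Multinomial logit; Multinomial probit | mlogit; mprobit
    Discrete and ordered (Y = 0, 1, 2, ..) | Type of peer review report (adverse, modified, unmodified) | Ordered probit; Ordered logit | oprobit; ologit
    Discrete count data (Y = 0, 1, 2, ...) | Number of weaknesses disclosed in peer review report | Poisson; Negative binomial | poisson; nbreg
    Continuous but censored (kL ≤ Y < kH) | Non-audit fees; Football attendance | Tobit | tobit
    Duration data (often censored) (kL ≤ Y < kH) | Duration of unemployment; CEO tenure; Company survival | Cox proportional hazards | stcox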
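The link between the linear index and the predicted probability in section 3.3 can be verified on any binary outcome. A minimal sketch, assuming the auto dataset shipped with STATA and its foreign indicator stand in for the auditor-choice data; the variable names p_hat, xb_hat, xb_manual and p_manual are chosen here for illustration.

```stata
sysuse auto, clear
logit foreign weight mpg

* Probability and linear index as returned by predict
predict p_hat
predict xb_hat, xb

* The same two quantities built by hand from the stored coefficients
gen xb_manual = _b[_cons] + _b[weight]*weight + _b[mpg]*mpg
gen p_manual  = exp(xb_manual)/(1 + exp(xb_manual))

summarize p_hat p_manual xb_hat xb_manual
```

The two probability columns should agree up to rounding, which confirms that predict after logit applies P = exp(Xb)/(1 + exp(Xb)) to the fitted index.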

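The Poisson exercise in section 3.6 (a mean of 2 crimes per month, probability of exactly 3 in a month) can be worked out with the standard Poisson probability function. This sketch evaluates the formula by hand and compares it with STATA's built-in poissonp() function; the scalar names are chosen here for illustration.

```stata
clear
* Mean number of events per month and the count of interest
scalar s_lambda = 2
scalar s_x      = 3

* P(X = x) = lambda^x * exp(-lambda) / x!
scalar p_formula = exp(-s_lambda) * s_lambda^s_x / exp(lnfactorial(s_x))

* Built-in Poisson probability for comparison
scalar p_builtin = poissonp(s_lambda, s_x)

display "P(X = 3 | lambda = 2) = " %6.4f p_formula
display "poissonp(2, 3)        = " %6.4f p_builtin
```

Both lines display 0.1804, so with an average of 2 crimes per month about 18% of months are expected to have exactly 3.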
Chapter 4 Panel data

Contents:
4.1 The basic idea
4.2 Linear regression
4.3 Logit and probit models
4.4 Other models

4.1 The basic idea
1. …
