![Zhejiang University, Xiao Zhonghua, Corpus Linguistics courseware, page 1](http://file4.renrendoc.com/view/48f72d4029611f5248afee1f6ba3924a/48f72d4029611f5248afee1f6ba3924a1.gif)
Document summary
Outline of the session
- Corpus design issues
  - Corpus representativeness
  - Corpus balance
  - Sampling
  - Corpus size
- Types of corpora
- Introducing some well-known English corpora of different types

Representativeness
- A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety
- A corpus is different from a random collection of texts or an archive
- Representativeness is a defining feature of a corpus
- As language is infinite but a corpus has to be finite in size, we sample and proportionally include a wide range of text types to ensure maximum balance and representativeness

Some definitions…
- “generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative
  of some language or text type” (Leech 1992: 116)
- “… selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (Sinclair 2019)
- “A well-organized collection of data” (McEnery 2019)
- “gathered according to explicit design criteria” (Tognini-Bonelli 2019: 2)
- “built according to explicit design criteria for a specific purpose” (Atkins et al 1992)
- texts selected and put together “in a principled way” (Johansson 2019: 3)

What is representativeness?
- “A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety” (Leech 1991)
- Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993)

What is representativeness?
- Representativeness is a fluid concept closely related to your research questions
- If you want a corpus which is representative of general English, a corpus representative of newspapers will not do
- If you want a corpus representative of newspapers, a corpus representative of The Times will not do

Two types of representativeness
- The representativeness of general corpora and of (domain- or genre-specific) specialized corpora is measured in different ways
- General corpora
  - Balance: the range of genres included in a corpus and their proportions
  - Sampling: how the text chunks for each genre are selected
- Specialized corpora
  - Degree of closure/saturation: closure/saturation for a particular linguistic feature (e.g. size of lexicon) of a variety of language (e.g. computer manuals) means that the feature appears to be finite or is subject to very limited variation beyond a certain point, i.e. the curve of lexical growth is flattening out

Why should we care about representativeness?
- Reader of corpus-based studies (assessment)
  - To interpret the results of corpus research with caution, considering whether the corpus data and the method used in the study were appropriate
- Corpus user (assessment)
  - Important to “know your corpus”
  - To decide whether a given corpus is appropriate for their specific research question
  - To make appropriate claims on the basis of such a corpus
- Corpus creator (assessment?)
  - To make their corpus as representative as possible of the language (variety) it is claimed to represent
  - To document design criteria explicitly and make the documentation available to corpus users

Criteria for text selection
- The criteria used to select texts for a corpus are principally external
- The external vs. internal criteria correspond to Biber’s (1993: 243) situational vs. linguistic perspectives
  - External criteria are defined situationally, irrespective of the distribution of linguistic features
  - Internal criteria are defined linguistically, taking into account the distribution of such features
- It is circular to use internal criteria like the distribution of words or grammatical features as the primary parameters for the selection of corpus data
- If the distribution of linguistic features is pre-determined when the corpus is designed, there is no point in analyzing such a corpus to discover naturally occurring linguistic feature distributions

Criteria for text selection
- Time?
  - If a corpus is not regularly updated, it rapidly becomes unrepresentative (Hunston 2019)
  - The relevance of permanence in corpus design actually depends on how we view a corpus: as a static or a dynamic language model
  - Static model: sample corpora (nearly all existing corpora, BNC, LOB/FLOB)
  - Dynamic model: Bank of English

Criteria for text selection
- Tips
  - “Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination.” (Sinclair 2019)

Corpus balance
- A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration
- The proportions of different kinds of text it contains should correspond with informed and intuitive judgements
- There is no scientific measure for balance, only a best guess
- The acceptable balance is determined by the intended use, i.e. your research questions

The BNC model
- Generally accepted as being a balanced corpus
- Has been followed in the construction of a number of corpora
- 4,124 texts (including transcripts of recordings)
- ca. 100 million words: 90% Written + 10% Spoken
- Three criteria for Written
  - Domain: the content type (i.e. subject field)
  - Time: the period of text production
  - Medium: the type of text publication (book, periodicals etc)
- Two criteria for Spoken
  - Demographic: informal conversations by speakers selected by age group, sex, social class and geographical region
  - Context-governed: formal encounters such as meetings, lectures and radio broadcasts recorded in 4 broad context categories

Written BNC

Spoken BNC

BNC vs. balance
- The design criteria of the BNC illustrate the notion of corpus balance
  very well
- “In selecting texts for inclusion in the corpus, account was taken of both production, by sampling a wide variety of distinct types of material, and reception, by selecting instances of those types which have a wide distribution. Thus, having chosen to sample such things as popular novels, or technical writing, best-seller lists and library circulation statistics were consulted to select particular examples of them.” (Aston and Burnard 2019: 28)

Pragmatics in corpus design
- “Most general corpora of today are badly balanced because they do not have nearly enough spoken language in them; estimates of the optimal proportion of spoken language range from 50% - the neutral option - to 90%, following a guess that most people experience many times as much speech as writing” (Sinclair 2019)
- The written BNC is nine times as large as the spoken BNC
- Is speech less frequent or important than writing?

Pragmatics in corpus design
- Absolutely not!
  - … but writing typically has a larger audience than speech
  - … also collection of spoken data costs 10 times as much as for written data
  - … it takes 10 hours to transcribe one hour of recording
- Pragmatic considerations also mean that balance is a more important issue for a static sample corpus than for a dynamic monitor corpus
- As a monitor corpus is frequently updated, it is usually “impossible to maintain a corpus that also includes text of many different types, as some of them are just too expensive or time consuming to collect on a regular basis.” (Hunston 2019: 30-31)

Corpus balance: Some tips
- “The corpus builder should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components.” (Sinclair 2019)
- “It would be short-sighted indeed to wait until one can scientifically balance a corpus before starting to use one, and hasty to dismiss the results of corpus analysis as ‘unreliable’ or ‘irrelevant’ because the corpus used cannot be proved to be ‘balanced’.” (Atkins et al 1992: 6)

Sampling in corpus creation
- Language is infinite, but a corpus is finite in size, so sampling is inescapable in corpus building
- “Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from within texts, and the length of text samples. Each of these involves a sampling decision, either conscious or not.” (Biber 1993)

Sample vs. population
- The aim of sampling “is to secure a sample which, subject to limitations of size, will reproduce the characteristics of the population, especially those of immediate interest, as closely as possible” (Yates 1965: 9)
- A sample is a scaled-down version of a larger population
- A sample is representative if what we find for the sample also holds for the general population
- Corpus representativeness and balance rely heavily on sampling
- A corpus is a sample of a given population (language or language variety)

Sampling in corpus creation
- Sampling unit
  - For written text, it could be a book, periodical or newspaper
- Sampling frame
  - A list of sampling units
- Population
  - Languages, language, or language variety under consideration
  - The assembly of all sampling units, which can be defined in terms of
    - Language production (demographic: speakers and writers)
    - Language reception (demographic: audience and readers)
    - Language as a product (registers and genres)

Examples of Brown and LOB
- Brown
  - Population: written English text published in the United States in 1961
  - Sampling frame: a list of the collection of books and periodicals in the Brown University Library and the Providence Athenaeum
  - Sampling unit: each book/periodical within the sampling frame
- LOB
  - Population: written English text published in the UK around 1961
  - Sampling frame: The British National Bibliography Cumulated Subject Index 1960–1964 (for books) and Willing’s Press Guide 1961 (for periodicals)
  - Sampling unit: each book/periodical within the sampling frame

Sampling techniques
- Simple random sampling
  - All sampling units within the sampling frame are numbered and the sample is chosen by use of a table of random numbers
  - The chance of a feature being sampled correlates positively with its frequency in the population, so rare features may not be included
- Stratified random sampling
  - The population is divided into relatively homogeneous groups (i.e. strata), and then the latter are sampled at random
  - Never less representative than simple random sampling

Stratified random sampling
- The whole population in the Brown/LOB corpus is divided into 15 text categories and then samples are drawn from each category at random
- In demographic sampling for collecting spoken data, individuals (sampling units) in the population are first divided into different groups on the basis of speaker/writer age, sex and social class, and then samples are taken at random from each group

Size of samples
- Full texts or text chunks?
  - “Samples of language for a corpus should wherever possible consist of entire documents or transcriptions of complete speech events” (Sinclair 2019)
  - Good for studying textual organization
- A full-text corpus may be inappropriate or problematic
  - Peculiarity of an individual style or topic may occasionally show through
  - There are copyright issues in including full texts
  - Frequent linguistic features are quite stable in their distributions and hence short text chunks (e.g. 2,000 running words) are usually sufficient
- Text initial, middle or end chunks?
  - Text initial, middle, and end samples must be taken in a balanced way

Proportion of samples
- In stratified random sampling, how many samples should be taken for each category?
- The numbers of samples across text categories should be proportional to their frequencies and/or weights in the target population in order for the resulting corpus to be considered as representative
- This is difficult to determine objectively; it rests on a well-informed, intuitive guess

Proportion of genres in Brown
- Constant sample size: ca. 2,000 words

“Relatively speaking…”
- Any claim of corpus representativeness and balance must be interpreted in relative terms
- There is no objective way to balance a corpus or to measure its representativeness
- Corpus balance and representativeness are fluid concepts
- The research question that one has in mind when building a corpus determines what an acceptable
  balance is for the corpus one should use and whether it is suitably representative
- Corpus balance is also influenced by practical considerations
  - How easily can data of different types be collected?

Corpus size
- How large should a corpus be?
- There is no easy answer to this question.
  - Krishnamurthy (2019): “Size matters.”
  - Leech (1991): “Size is not all-important.”
- The size of the corpus needed depends upon the purpose for which it is intended as well as a number of practical considerations
  - The kind of query that is anticipated from users
    - Are you studying common or rare linguistic features?
  - The methodology they use to study the data
    - How much work can be done by the machine and how much has to be done by hand?
  - For corpus creators, also the source of data
    - Are the data in electronic form readily available at a reasonable cost?

Corpus size
- Corpus size increases with the development of technology
  - 1960s-70s: Brown and LOB, one million words
  - 1980s: the Birmingham/Cobuild corpora, 20M words
  - 1990s: the British National Corpus, 100M words
  - Early 21st century: the Bank of English, 524M words

Corpus size
- Is a large corpus really what you want?
- The size of the corpus needed to explore a research question depends on the frequency and distribution of the linguistic features under consideration in that corpus, i.e. on your research question
  - Corpora for lexical studies are much larger than those for grammatical studies
  - Specialized corpora serve a very different yet important purpose from large multi-million-word corpora
  - Corpora that need extensive manual annotation or analysis are necessarily small
  - Many corpus tools set a ceiling on the number of concordances that can be extracted
- The optimum size of a corpus is determined by the research question the corpus is intended to address as well as practical considerations

Exploring existing English corpora
- To learn how corpora can be classified
- To learn about design decisions in creating different kinds of corpora
- To become familiar with a range of well-known and influential corpora

Types of corpora, different uses
- General vs. specialized corpora
- Written vs. spoken corpora
- Synchronic vs. diachronic corpora
- Monolingual vs. multilingual corpora
- Comparable vs. parallel corpora
- Native vs. learner corpora
- Developmental vs. learner corpora
- Raw vs. annotated corpora
- Sample vs. monitor corpora
- …

Monitor corpora
- Constantly updated and growing in size
- Much larger corpus size
- Often contain full text
- Always up-to-date
- Often only admit new material which has new features not already in the corpus
- Used to track changes across different periods of time
- Monitor corpora could be a series of static corpora
- Disadvantages
  - No attempt to balance the corpus
  - Text availability can become an issue (e.g., copyrights)
  - Confusing to indicate a specific corpus version
  - Cannot easily compare results run on corpora of different sizes

Some well-known English corpora
- The British National Corpus (BNC)
- The Bank of English (BoE)
- BYU American English corpus
- Corpora of the Brown family (Brown, LOB, FLOB, Frown)
- ICE corpora (GB, EA, HK, Singapore, Philippines, New Zealand etc)
- London-Lund corpus of spoken English
- SBCSAE
- The Helsinki Diachronic Corpus of English Texts (8th-18th Century, ca. 5 million words)
- The International Corpus of Learner English (ICLE)
- MICASE

The BNC
- First and best-known national corpus (sample corpus)
- 100M word balanced corpus of written (90%) and spoken (10%) British English in current use
- 1960 to the early 1990s
- Rich metadata encoded for language variation studies
- POS tagged

Accessing the BNC
- BYU-BNC: /bnc/
- BNC Online: natcorp.ox.ac.uk/getting/index.xml.ID=order_online
- Lancaster BNCWeb CQP edition: bncweb.lancs.ac.uk/bncwebSignup/user/login.php
- BNC Baby: natcorp.ox.ac.uk/corpus/baby/index.html
- Sketch Engine: sketchengine.co.uk/
- BNCPIE: /
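
The corpus sizes quoted in this session range from one million words (Brown, LOB) to 100 million (the BNC), so raw counts drawn from different corpora are not directly comparable. A common remedy is to normalize counts to a shared base, such as occurrences per million words. The following is a minimal Python sketch of that arithmetic; the function name `per_million` and the counts are hypothetical, chosen only for illustration, and are not taken from any real corpus query.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalize a raw frequency to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    # Multiply before dividing to keep the arithmetic exact for round figures.
    return raw_count * 1_000_000 / corpus_size


# Hypothetical counts for illustration only; not real corpus figures.
small = per_million(raw_count=150, corpus_size=1_000_000)        # a 1M-word corpus
large = per_million(raw_count=12_000, corpus_size=100_000_000)   # a 100M-word corpus

print(f"{small:.1f} vs {large:.1f} per million words")
# prints: 150.0 vs 120.0 per million words
```

On these made-up figures, the item with the far larger raw count turns out to be the relatively less frequent one, which is why normalized rather than raw counts are compared across corpora.
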

The BoE
- Best known monitor corpus
- 524M words (counting and growing) of present-day English language
- 75% written and 25% spoken
- 70% BrE, 20% AmE and 10% other English varieties
- Particularly useful for lexical and lexicographic studies, e.g. tracking new words, new uses or meanings of old words, and words falling out of use

Access to the BoE
- A 56M word sampler: collins.co.uk/books.aspx?group=153

Corpus of Contemporary American English (COCA)
- 385+M words of American English
- 20M per year for 1990-2019
- Equally divided among spoken, fiction, popular magazines, newspapers, and academic texts
- Updated every 6-9 months
- Useful for studying variation across genres and over time
- Free online access: /

Corpora of the Brown family
- Brown: written AmE in 1961
- LOB: written BrE in 1961
- FLOB: written BrE in 1991
- Frown: written AmE in 1991
- Common corpus design
  - One M words each
  - 500 samples (ca. 2000 words each)
  - Same proportions from the same 15 text categories
- Useful for synchronic and diachronic comparison of BrE and AmE
- Further information
  - ICAME CD: khnt.hit.uib.no/icame/manuals/

The ICE corpora
- 20 one M word balanced corpora
  - E.g. Britain, Ireland, US, Canada, Hong Kong, Singapore, India, the Philippines, East Africa
- Common corpus design
  - 500 samples (ca. 2000 words each)
  - 60% spoken + 40% written
  - 12 genres
  - 1990-1994
- Designed for the synchronic study of world Englishes
- More information: ucl.ac.uk/english-usage/ice/

The London-Lund Corpus
- First electronic corpus of spontaneous language
- A corpus of spoken British English recorded from 1953-1987
- 100 texts, each of 5,000 words, totaling half a million words
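
The common design of the Brown family (a fixed number of roughly equal-sized samples drawn from a fixed set of text categories) is an instance of stratified random sampling with a chosen allocation per stratum. The Python sketch below shows one way this could be done under stated assumptions: the three category names, their weights, and the unit identifiers are invented for illustration (the corpora described above use 15 categories and 500 samples), and the largest-remainder rounding in `allocate` is an illustrative choice, not something the slides specify.

```python
import random


def allocate(weights: dict[str, float], total: int) -> dict[str, int]:
    """Split `total` samples across strata in proportion to `weights`.

    Uses largest-remainder rounding so the counts always sum to `total`.
    """
    weight_sum = sum(weights.values())
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    counts = {name: int(share) for name, share in exact.items()}  # round down
    leftover = total - sum(counts.values())
    # Give the remaining samples to the strata with the largest fractional parts.
    by_remainder = sorted(weights, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


def stratified_sample(frame, counts, seed=0):
    """Draw counts[name] sampling units at random from each stratum of the frame.

    Raises ValueError if a stratum holds fewer units than requested.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return {name: rng.sample(frame[name], counts[name]) for name in counts}


# Hypothetical sampling frame: category -> list of sampling units (e.g. book IDs).
frame = {
    "press": [f"press_{i:03d}" for i in range(40)],
    "fiction": [f"fiction_{i:03d}" for i in range(60)],
    "learned": [f"learned_{i:03d}" for i in range(100)],
}
weights = {"press": 2, "fiction": 3, "learned": 5}  # assumed proportions

counts = allocate(weights, total=20)
sample = stratified_sample(frame, counts, seed=42)
print(counts)  # {'press': 4, 'fiction': 6, 'learned': 10}
```

Each selected unit would then contribute one text chunk of about 2,000 words. How the weights are set remains the informed, intuitive judgement that the slides describe; the code only makes the allocation and the random draw explicit.
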