Zhejiang University, Xiao Zhonghua (肖忠华): Corpus Linguistics courseware

Outline of the session
  Corpus design issues
    Corpus representativeness
    Corpus balance
    Sampling
    Corpus size
  Types of corpora
  Introducing some well-known English corpora of different types

Representativeness
  A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety
  A corpus is different from a random collection of texts or an archive
  Representativeness is a defining feature of a corpus
  As language is infinite but a corpus has to be finite in size, we sample and proportionally include a wide range of text types to ensure maximum balance and representativeness

Some definitions…
  “generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type” (Leech 1992: 116)
  “…selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language” (Sinclair 2019)
  “A well-organized collection of data” (McEnery 2019)
  “gathered according to explicit design criteria” (Tognini-Bonelli 2019: 2)
  “built according to explicit design criteria for a specific purpose” (Atkins et al 1992)
  texts selected and put together “in a principled way” (Johansson 2019: 3)

What is representativeness?
  “A corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety” (Leech 1991)
  Representativeness refers to the extent to which a sample includes the full range of variability in a population (Biber 1993)

What is representativeness?
  Representativeness is a fluid concept closely related to your research questions
  If you want a corpus which is representative of general English, a corpus representative of newspapers will not do
  If you want a corpus representative of newspapers, a corpus representative of The Times will not do

Two types of representativeness
  The representativeness of general corpora and of (domain- or genre-specific) specialized corpora is measured in different ways
  General corpora
    Balance: the range of genres included in a corpus and their proportions
    Sampling: how the text chunks for each genre are selected
  Specialized corpora
    Degree of closure/saturation: closure/saturation for a particular linguistic feature (e.g. size of lexicon) of a variety of language (e.g. computer manuals) means that the feature appears to be finite or is subject to very limited variation beyond a certain point, i.e. the curve of lexical growth is flattening out

Why should we care about representativeness?
  Reader of corpus-based studies (assessment)
    To interpret the results of corpus research with caution, considering whether the corpus data and the method used in the study were appropriate
  Corpus user (assessment)
    Important to “know your corpus”
    To decide whether a given corpus is appropriate for their specific research question
    To make appropriate claims on the basis of such a corpus
  Corpus creator (assessment?)
    To make their corpus as representative as possible of the language (variety) it is claimed to represent
    To document design criteria explicitly and make the documentation available to corpus users

Criteria for text selection
  The criteria used to select texts for a corpus are principally external
  The external vs. internal distinction corresponds to Biber’s (1993: 243) situational vs. linguistic perspectives
    External criteria are defined situationally, irrespective of the distribution of linguistic features
    Internal criteria are defined linguistically, taking into account the distribution of such features
  It is circular to use internal criteria like the distribution of words or grammatical features as the primary parameters for the selection of corpus data
  If the distribution of linguistic features is pre-determined when the corpus is designed, there is no point in analyzing such a corpus to discover naturally occurring linguistic feature distributions

Criteria for text selection
  Time?
    If a corpus is not regularly updated, it rapidly becomes unrepresentative (Hunston 2019)
  The relevance of permanence in corpus design actually depends on how we view a corpus: as a static or a dynamic language model
    Static model: sample corpora (nearly all existing corpora, BNC, LOB/FLOB)
    Dynamic model: Bank of English

Criteria for text selection: Tips
  “Criteria for determining the structure of a corpus should be small in number, clearly separate from each other, and efficient as a group in delineating a corpus that is representative of the language or variety under examination.” (Sinclair 2019)

Corpus balance
  A balanced corpus covers a wide range of text categories which are supposed to be representative of the language (variety) under consideration
  The proportions of different kinds of text it contains should correspond with informed and intuitive judgements
  There is no scientific measure of balance, only a best guess
  The acceptable balance is determined by the intended use, i.e. your research questions

The BNC model
  Generally accepted as being a balanced corpus
  Has been followed in the construction of a number of corpora
  4,124 texts (including transcripts of recordings)
  ca. 100 million words: 90% written + 10% spoken
  Three criteria for Written
    Domain: the content type (i.e. subject field)
    Time: the period of text production
    Medium: the type of text publication (books, periodicals etc)
  Two criteria for Spoken
    Demographic: informal conversations by speakers selected by age group, sex, social class and geographical region
    Context-governed: formal encounters such as meetings, lectures and radio broadcasts recorded in 4 broad context categories

Written BNC

Spoken BNC

BNC vs. balance
  The design criteria of the BNC illustrate the notion of corpus balance very well
  “In selecting texts for inclusion in the corpus, account was taken of both production, by sampling a wide variety of distinct types of material, and reception, by selecting instances of those types which have a wide distribution. Thus, having chosen to sample such things as popular novels, or technical writing, best-seller lists and library circulation statistics were consulted to select particular examples of them.” (Aston and Burnard 2019: 28)

Pragmatics in corpus design
  “Most general corpora of today are badly balanced because they do not have nearly enough spoken language in them; estimates of the optimal proportion of spoken language range from 50% - the neutral option - to 90%, following a guess that most people experience many times as much speech as writing” (Sinclair 2019)
  The written BNC is nine times as large as the spoken BNC
  Is speech less frequent or less important than writing?

Pragmatics in corpus design
  Absolutely not!
    …but writing typically has a larger audience than speech
    …also, collecting spoken data costs 10 times as much as collecting written data
    …it takes 10 hours to transcribe one hour of recording
  Pragmatic considerations also mean that balance is a more important issue for a static sample corpus than for a dynamic monitor corpus
  As a monitor corpus is frequently updated, it is usually “impossible to maintain a corpus that also includes text of many different types, as some of them are just too expensive or time consuming to collect on a regular basis.” (Hunston 2019: 30-31)

Corpus balance: Some tips
  “The corpus builder should retain, as target notions, representativeness and balance. While these are not precisely definable and attainable goals, they must be used to guide the design of a corpus and the selection of its components.” (Sinclair 2019)
  “It would be short-sighted indeed to wait until one can scientifically balance a corpus before starting to use one, and hasty to dismiss the results of corpus analysis as ‘unreliable’ or ‘irrelevant’ because the corpus used cannot be proved to be ‘balanced’.” (Atkins et al 1992: 6)

Sampling in corpus creation
  Language is infinite, but a corpus is finite in size, so sampling is inescapable in corpus building
  “Some of the first considerations in constructing a corpus concern the overall design: for example, the kinds of texts included, the number of texts, the selection of particular texts, the selection of text samples from within texts, and the length of text samples. Each of these involves a sampling decision, either conscious or not.” (Biber 1993)

Sample vs. population
  The aim of sampling “is to secure a sample which, subject to limitations of size, will reproduce the characteristics of the population, especially those of immediate interest, as closely as possible” (Yates 1965: 9)
  A sample is a scaled-down version of a larger population
  A sample is representative if what we find for the sample also holds for the general population
  Corpus representativeness and balance rely heavily on sampling
  A corpus is a sample of a given population (language or language variety)

Sampling in corpus creation
  Sampling unit
    For written text, it could be a book, periodical or newspaper
  Sampling frame
    A list of sampling units
  Population
    Languages, a language, or a language variety under consideration
    The assembly of all sampling units, which can be defined in terms of
      Language production (demographic: speakers and writers)
      Language reception (demographic: audience and readers)
      Language as a product (registers and genres)

Examples of Brown and LOB
  Brown
    Population: written English text published in the United States in 1961
    Sampling frame: a list of the collection of books and periodicals in the Brown University Library and the Providence Athenaeum
    Sampling unit: each book/periodical within the sampling frame
  LOB
    Population: written English text published in the UK around 1961
    Sampling frame: The British National Bibliography Cumulated Subject Index 1960–1964 (for books) and Willing’s Press Guide 1961 (for periodicals)
    Sampling unit: each book/periodical within the sampling frame

Sampling techniques
  Simple random sampling
    All sampling units within the sampling frame are numbered and the sample is chosen by use of a table of random numbers
    The chance of a feature being selected correlates positively with its frequency in the population, so rare features may not be included
  Stratified random sampling
    The population is divided into relatively homogeneous groups (i.e. strata), and the strata are then sampled at random
    Never less representative than simple random sampling

Stratified random sampling
  The whole population in the Brown/LOB corpus is divided into 15 text categories, and samples were then drawn from each category at random
  In demographic sampling for collecting spoken data, individuals (sampling units) in the population are first divided into different groups on the basis of speaker/writer age, sex and social class, and then samples are taken at random from each group

Size of samples
  Full texts or text chunks?
    “Samples of language for a corpus should wherever possible consist of entire documents or transcriptions of complete speech events” (Sinclair 2019)
      Good for studying textual organization
    A full-text corpus may be inappropriate or problematic
      Peculiarity of an individual style or topic may occasionally show through
      There are copyright issues in including full texts
    Frequent linguistic features are quite stable in their distributions and hence short text chunks (e.g. 2,000 running words) are usually sufficient
  Text initial, middle or end chunks?
    Text initial, middle, and end samples must be taken in a balanced way

Proportion of samples
  In stratified random sampling, how many samples should be taken for each category?
  The numbers of samples across text categories should be proportional to their frequencies and/or weights in the target population in order for the resulting corpus to be considered as representative
  Difficult to determine objectively; in practice it rests on a well-informed, intuitive guess

Proportion of genres in Brown
  Constant sample size: ca. 2,000 words

“Relatively speaking…”
  Any claim of corpus representativeness and balance must be interpreted in relative terms
  There is no objective way to balance a corpus or to measure its representativeness
  Corpus balance and representativeness are fluid concepts
    The research question that one has in mind when building a corpus determines what an acceptable balance is for the corpus one should use and whether it is suitably representative
  Corpus balance is also influenced by practical considerations
    How easily can data of different types be collected?

Corpus size
  How large should a corpus be?
  There is no easy answer to this question.
    Krishnamurthy (2019): “Size matters.”
    Leech (1991): “Size is not all-important.”
  The size of the corpus needed depends upon the purpose for which it is intended as well as a number of practical considerations
    The kind of query that is anticipated from users
      Are you studying common or rare linguistic features?
    The methodology they use to study the data
      How much work can be done by the machine and how much has to be done by hand?
    For corpus creators, also the source of data
      Are the data in electronic form readily available at a reasonable cost?

Corpus size
  Corpus size increases with the development of technology
    1960s-70s: Brown and LOB, one million words
    1980s: the Birmingham/Cobuild corpora, 20M words
    1990s: the British National Corpus, 100M words
    Early 21st century: the Bank of English, 524M words

Corpus size
  Is a large corpus really what you want?
  The size of the corpus needed to explore a research question depends on the frequency and distribution of the linguistic features under consideration in that corpus, i.e. on your research question
    Corpora for lexical studies are much larger than those for grammatical studies
    Specialized corpora serve a very different, yet important, purpose from that of large multi-million-word corpora
  Corpora that need extensive manual annotation or analysis are necessarily small
  Many corpus tools set a ceiling on the number of concordances that can be extracted
  The optimum size of a corpus is determined by the research question the corpus is intended to address as well as practical considerations

Exploring existing English corpora
  To learn how corpora can be classified
  To learn about design decisions in creating different kinds of corpora
  To become familiar with a range of well-known and influential corpora

Types of corpora, different uses
  General vs. specialized corpora
  Written vs. spoken corpora
  Synchronic vs. diachronic corpora
  Monolingual vs. multilingual corpora
  Comparable vs. parallel corpora
  Native vs. learner corpora
  Developmental vs. learner corpora
  Raw vs. annotated corpora
  Sample vs. monitor corpora
  …

Monitor corpora
  Constantly updated and growing in size
  Much larger corpus size
  Often contain full texts
  Always up-to-date
  Often only admit new material which has new features not already in the corpus
  Used to track changes across different periods of time
  Monitor corpora could be a series of static corpora
  Disadvantages
    No attempt to balance the corpus
    Text availability can become an issue (e.g. copyright)
    Hard to indicate which specific version of the corpus was used
    Cannot easily compare results run on corpora of different sizes

Some well-known English corpora
  The British National Corpus (BNC)
  The Bank of English (BoE)
  BYU American English corpus
  Corpora of the Brown family (Brown, LOB, FLOB, Frown)
  ICE corpora (GB, EA, HK, Singapore, Philippines, New Zealand etc)
  London-Lund corpus of spoken English
  SBCSAE
  The Helsinki Diachronic Corpus of English Texts (8th-18th century, ca. 5 million words)
  The International Corpus of Learner English (ICLE)
  MICASE

The BNC
  First and best-known national corpus (sample corpus)
  100M-word balanced corpus of written (90%) and spoken (10%) British English in current use
  1960 to the early 1990s
  Rich metadata encoded for language variation studies
  POS tagged

Accessing the BNC
  BYU-BNC: /bnc/
  BNC Online: natcorp.ox.ac.uk/getting/index.xml.ID=order_online
  Lancaster BNCweb CQP edition: bncweb.lancs.ac.uk/bncwebSignup/user/login.php
  BNC Baby: natcorp.ox.ac.uk/corpus/baby/index.html
  Sketch Engine: sketchengine.co.uk/
  BNC PIE: /
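
The slides note that results from corpora of different sizes cannot easily be compared, and the written BNC is nine times as large as the spoken BNC. One common workaround is to normalise raw counts to a shared base, such as occurrences per million words. The sketch below is only an illustration: the function name `per_million` and the hit counts are hypothetical, while the two part sizes follow the 90%/10% split of ca. 100 million words given above.

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Approximate part sizes taken from the slides (ca. 100M words, 90% / 10%).
WRITTEN_WORDS = 90_000_000
SPOKEN_WORDS = 10_000_000

# Hypothetical raw hit counts for some search term (not real BNC figures).
written_hits = 4_500
spoken_hits = 1_500

written_rate = per_million(written_hits, WRITTEN_WORDS)
spoken_rate = per_million(spoken_hits, SPOKEN_WORDS)

print(f"written: {written_rate:.1f} per million words")
print(f"spoken:  {spoken_rate:.1f} per million words")
```

With these made-up counts the written part returns three times as many raw hits, yet the normalised rate is three times higher in the spoken part, which is why raw counts from parts of unequal size should not be compared directly.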

The BoE
  Best-known monitor corpus
  524M words (and growing) of present-day English
  75% written and 25% spoken
  70% BrE, 20% AmE and 10% other English varieties
  Particularly useful for lexical and lexicographic studies, e.g. tracking new words, new uses or meanings of old words, and words falling out of use
  Access to the BoE
    A 56M-word sampler: collins.co.uk/books.aspx?group=153

Corpus of Contemporary American English (COCA)
  385+M words of American English
  20M per year for 1990-2019
  Equally divided among spoken, fiction, popular magazines, newspapers, and academic texts
  Updated every 6-9 months
  Useful for studying variation across genres and over time
  Free online access: /

Corpora of the Brown family
  Brown: written AmE in 1961
  LOB: written BrE in 1961
  FLOB: written BrE in 1991
  Frown: written AmE in 1991
  Common corpus design
    One million words each
    500 samples (ca. 2,000 words each)
    Same proportions from the same 15 text categories
  Useful for synchronic and diachronic comparison of BrE and AmE
  Further information
    ICAME CD: khnt.hit.uib.no/icame/manuals/

The ICE corpora
  20 one-million-word balanced corpora
  E.g. Britain, Ireland, US, Canada, Hong Kong, Singapore, India, the Philippines, East Africa
  Common corpus design
    500 samples (ca. 2,000 words each)
    60% spoken + 40% written
    12 genres
    1990-1994
  Designed for the synchronic study of world Englishes
  More information: ucl.ac.uk/english-usage/ice/

The London-Lund Corpus
  First electronic corpus of spontaneous language
  A corpus of spoken British English recorded between 1953 and 1987
  100 texts, each of 5,000 words, totalling half a million words
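
The stratified random sampling described in the sampling slides, and reflected in the Brown family design (samples drawn at random within each text category, in fixed proportions), can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for any of these corpora: the function names, the three-category sampling frame and the weights are all hypothetical, and a seeded `random.Random` stands in for the table of random numbers.

```python
import random


def allocate(weights, total):
    """Split `total` samples across strata in proportion to their weights.

    Largest-remainder rounding keeps the quotas summing to `total`.
    """
    weight_sum = sum(weights.values())
    exact = {name: total * w / weight_sum for name, w in weights.items()}
    quotas = {name: int(share) for name, share in exact.items()}
    leftover = total - sum(quotas.values())
    # Give the remaining samples to the strata with the largest fractions.
    by_fraction = sorted(exact, key=lambda n: exact[n] - quotas[n], reverse=True)
    for name in by_fraction[:leftover]:
        quotas[name] += 1
    return quotas


def stratified_sample(frame, weights, total, seed=0):
    """Draw a simple random sample (without replacement) inside each stratum."""
    rng = random.Random(seed)
    quotas = allocate(weights, total)
    return {name: rng.sample(frame[name], quotas[name]) for name in frame}


# Hypothetical sampling frame: text identifiers grouped by text category.
frame = {
    "press": [f"press_{i}" for i in range(40)],
    "fiction": [f"fiction_{i}" for i in range(30)],
    "academic": [f"academic_{i}" for i in range(30)],
}
# Hypothetical weights of each category in the target population.
weights = {"press": 0.5, "fiction": 0.3, "academic": 0.2}

sample = stratified_sample(frame, weights, total=10)
for name, texts in sample.items():
    print(name, len(texts), texts)
```

Choosing the weights remains the hard part: as the slides point out, the proportions cannot be determined objectively, so the sketch only automates the random draw once those judgements have been made.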
