
arXiv:2108.02497v3 [cs.LG] 9 Feb 2023

How to avoid machine learning pitfalls: a guide for academic researchers

Michael A. Lones*

Abstract

This document is a concise outline of some of the common mistakes that occur when using machine learning, and what can be done to avoid them. Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it was originally written for research students, and focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results.

1 Introduction

It's easy to make mistakes when applying machine learning (ML), and these mistakes can result in ML models that fail to work as expected when applied to data not seen during training and testing [Liao et al., 2021]. This is a problem for practitioners, since it leads to the failure of ML projects. However, it is also a problem for society, since it erodes trust in the findings and products of ML [Gibney, 2022]. This guide aims to help newcomers avoid some of these mistakes. It's written by an academic, and focuses on lessons learnt whilst doing ML research in academia. Whilst primarily aimed at students and scientific researchers, it should be accessible to anyone getting started in ML, and only assumes a basic knowledge of ML techniques. However, unlike similar guides aimed at a more general audience, it includes topics that are of a particular concern to academia, such as the need to rigorously evaluate and compare models in order to get work published. To make it more readable, the guidance is written informally, in a Dos and Don'ts style. It's not intended to be exhaustive, and references (with publicly-accessible URLs where available) are provided for further reading. Since it doesn't cover issues specific to particular academic subjects, it's recommended you also consult subject-specific guidance where available (e.g. Stevens et al. [2020] for medicine). Feedback is welcome, and it is expected that this document will evolve over time. For this reason, if you cite it, please include the arXiv version number (currently v3).

*School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, Scotland, UK, Email: m.lones@hw.ac.uk, Web: http://www.macs.hw.ac.uk/~ml355.

Contents

1 Introduction
2 Before you start to build models
  2.1 Do take the time to understand your data
  2.2 Don't look at all your data
  2.3 Do make sure you have enough data
  2.4 Do talk to domain experts
  2.5 Do survey the literature
  2.6 Do think about how your model will be deployed
3 How to reliably build models
  3.1 Don't allow test data to leak into the training process
  3.2 Do try out a range of different models
  3.3 Don't use inappropriate models
  3.4 Do keep up with recent developments in deep learning
  3.5 Don't assume deep learning will be the best approach
  3.6 Do optimise your model's hyperparameters
  3.7 Do be careful where you optimise hyperparameters and select features
  3.8 Do avoid learning spurious correlations
4 How to robustly evaluate models
  4.1 Do use an appropriate test set
  4.2 Don't do data augmentation before splitting your data
  4.3 Do use a validation set
  4.4 Do evaluate a model multiple times
  4.5 Do save some data to evaluate your final model instance
  4.6 Don't use accuracy with imbalanced datasets
  4.7 Don't ignore temporal dependencies in time series data
5 How to compare models fairly
  5.1 Don't assume a bigger number means a better model
  5.2 Do use statistical tests when comparing models
  5.3 Do correct for multiple comparisons
  5.4 Don't always believe results from community benchmarks
  5.5 Do consider combinations of models
6 How to report your results
  6.1 Do be transparent
  6.2 Do report performance in multiple ways
  6.3 Don't generalise beyond the data
  6.4 Do be careful when reporting statistical significance
  6.5 Do look at your models
7 Final thoughts
8 Acknowledgements
9 Changes

2 Before you start to build models

It's normal to want to rush into training and evaluating models, but it's important to take the time to think about the goals of a project, to fully understand the data that will be used to support these goals, to consider any limitations of the data that need to be addressed, and to understand what's already been done in your field. If you don't do these things, then you may end up with results that are hard to publish, or models that are not appropriate for their intended purpose.

2.1 Do take the time to understand your data

Eventually you will want to publish your work. This is a lot easier to do if your data is from a reliable source, has been collected using a reliable methodology, and is of good quality. For instance, if you are using data collected from an internet resource, make sure you know where it came from. Is it described in a paper? If so, take a look at the paper; make sure it was published somewhere reputable, and check whether the authors mention any limitations of the data. Do not assume that, because a dataset has been used by a number of papers, it is of good quality — sometimes data is used just because it is easy to get hold of, and some widely used datasets are known to have significant limitations (see Paullada et al. [2020] for a discussion of this). If you train your model using bad data, then you will most likely generate a bad model: a process known as garbage in garbage out. So, always begin by making sure your data makes sense. Do some exploratory data analysis (see Cox [2017] for suggestions). Look for missing or inconsistent records. It is much easier to do this now, before you train a model, rather than later, when you're trying to explain to reviewers why you used bad data.
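A minimal sketch of the kind of sanity checks described above, using pandas; the column names and values are made up for illustration, and in practice you would load your own data (e.g. with pd.read_csv).

```python
import pandas as pd

# Hypothetical dataset; stands in for your real data
df = pd.DataFrame({
    "age": [34, 51, None, 29, 200],   # None = missing, 200 = implausible
    "income": [42000, 58000, 31000, 31000, 77000],
    "label": ["yes", "no", "no", "no", "yes"],
})

# Missing values per column
missing = df.isna().sum()
print(missing)

# A simple range check to flag inconsistent records
implausible = df[(df["age"] < 0) | (df["age"] > 120)]
print(f"{len(implausible)} record(s) with implausible ages")

# Duplicated rows can also indicate collection problems
print(f"{df.duplicated().sum()} duplicated row(s)")
```

Checks like these take minutes to write, and catching a malformed record now is far cheaper than explaining it to a reviewer later.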

2.2Don’tlookatallyourdata

Asyoulookatdata,itisquitelikelythatyouwillspotpatternsandmakeinsightsthatguideyourmodelling.Thisisanothergoodreasontolookatdata.However,itisimportantthatyoudonotmakeuntestableassumptionsthatwilllaterfeedintoyourmodel.The“untestable”bitisimportanthere;it’sfinetomakeassumptions,buttheseshouldonlyfeedintothetrainingofthemodel,notthetesting.So,toensurethisisthecase,youshouldavoidlookingcloselyatanytestdataintheinitialexploratoryanalysisstage.Otherwiseyoumight,consciouslyorunconsciously,makeassumptionsthatlimitthegeneralityofyourmodelinanuntestableway.ThisisathemeIwillreturntoseveraltimes,sincetheleakageofinformationfromthetestsetintothetrainingprocessisacommonreasonwhyMLmodelsfailtogeneralise.See

Don’tallowtestdatatoleakinto

thetrainingprocess

formoreonthis.
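One way to enforce this in practice is to split off the test set before any exploration happens; a sketch using scikit-learn, on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # toy feature matrix; stands in for your real data
y = rng.integers(0, 2, size=100)     # toy binary labels

# Split off the test set immediately, before any exploratory analysis,
# so insights gained from exploration cannot leak into the evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# From here on, explore only the training portion
print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible, so the same records stay hidden across reruns of your analysis.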

2.3 Do make sure you have enough data

If you don't have enough data, then it may not be possible to train a model that generalises. Working out whether this is the case can be challenging, and may not be evident until you start building models: it all depends on the signal to noise ratio in the dataset. If the signal is strong, then you can get away with less data; if it's weak, then you need more data. If you can't get more data — and this is a common issue in many research fields — then you can make better use of existing data by using cross-validation (see Do evaluate a model multiple times). You can also use data augmentation techniques (e.g. see Wong et al. [2016] and Shorten and Khoshgoftaar [2019]; for time series data, see Iwana and Uchida [2021]), and these can be quite effective for boosting small datasets, though Don't do data augmentation before splitting your data. Data augmentation is also useful in situations where you have limited data in certain parts of your dataset, e.g. in classification problems where you have less samples in some classes than others, a situation known as class imbalance. See Haixiang et al. [2017] for a review of methods for dealing with this; also see Don't use accuracy with imbalanced datasets. Another option for dealing with small datasets is to use transfer learning (see Do keep up with recent developments in deep learning). However, if you have limited data, then it's likely that you will also have to limit the complexity of the ML models you use, since models with many parameters, like deep neural networks, can easily overfit small datasets (see Don't assume deep learning will be the best approach). Either way, it's important to identify this issue early on, and come up with a suitable strategy to mitigate it.
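The cross-validation option mentioned above can be sketched in a few lines of scikit-learn; the dataset and model choice here are illustrative, not a recommendation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# With a small dataset, k-fold cross-validation uses every sample for both
# training and validation (in different folds), giving a more stable estimate
# of generality than a single train/test split would
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds, not just the mean, also gives an early indication of how noisy your estimate is.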

2.4 Do talk to domain experts

Domain experts can be very valuable. They can help you to understand which problems are useful to solve, they can help you choose the most appropriate feature set and ML model to use, and they can help you publish to the most appropriate audience. Failing to consider the opinion of domain experts can lead to projects which don't solve useful problems, or which solve useful problems in inappropriate ways. An example of the latter is using an opaque ML model to solve a problem where there is a strong need to understand how the model reaches an outcome, e.g. in making medical or financial decisions (see Rudin [2019]). At the beginning of a project, domain experts can help you to understand the data, and point you towards features that are likely to be predictive. At the end of a project, they can help you to publish in domain-specific journals, and hence reach an audience that is most likely to benefit from your research.

2.5 Do survey the literature

You're probably not the first person to throw ML at a particular problem domain, so it's important to understand what has and hasn't been done previously. Other people having worked on the same problem isn't a bad thing; academic progress is typically an iterative process, with each study providing information that can guide the next. It may be discouraging to find that someone has already explored your great idea, but they most likely left plenty of avenues of investigation still open, and their previous work can be used as justification for your work. To ignore previous studies is to potentially miss out on valuable information. For example, someone may have tried your proposed approach before and found fundamental reasons why it won't work (and therefore saved you a few years of frustration), or they may have partially solved the problem in a way that you can build on. So, it's important to do a literature review before you start work; leaving it too late may mean that you are left scrambling to explain why you are covering the same ground or not building on existing knowledge when you come to write a paper.

2.6 Do think about how your model will be deployed

Why do you want to build an ML model? This is an important question, and the answer should influence the process you use to develop your model. Many academic studies are just that — studies — and not really intended to produce models that will be used in the real world. This is fair enough, since the process of building and analysing models can itself give very useful insights into a problem. However, for many academic studies, the eventual goal is to produce an ML model that can be deployed in a real world situation. If this is the case, then it's worth thinking early on about how it is going to be deployed. For instance, if it's going to be deployed in a resource-limited environment, such as a sensor or a robot, this may place limitations on the complexity of the model. If there are time constraints, e.g. a classification of a signal is required within milliseconds, then this also needs to be taken into account when selecting a model. Another consideration is how the model is going to be tied into the broader software system within which it is deployed; this procedure is often far from simple (see Sculley et al. [2015]). However, emerging approaches such as MLOps aim to address some of the difficulties; see Tamburri [2020] for a review, and Shankar et al. [2022] for a discussion of common challenges when operationalising ML models.

3 How to reliably build models

Building models is one of the more enjoyable parts of ML. With modern ML frameworks, it's easy to throw all manner of approaches at your data and see what sticks. However, this can lead to a disorganised mess of experiments that's hard to justify and hard to write up. So, it's important to approach model building in an organised manner, making sure you use data correctly, and putting adequate consideration into the choice of models.

3.1Don’tallowtestdatatoleakintothetrainingprocess

It’sessentialtohavedatathatyoucanusetomeasurehowwellyourmodelgeneralises.Acommonproblemisallowinginformationaboutthisdatatoleakintotheconfiguration,trainingorselectionofmodels(seeFigure

1

).Whenthishappens,thedatanolongerprovidesareliablemeasureofgenerality,andthisisacommonreasonwhypublishedMLmodelsoftenfailtogeneralisetorealworlddata.Thereareanumberofwaysthatinformationcanleakfromatestset.Someoftheseseemquiteinnocuous.Forinstance,duringdatapreparation,usinginformationaboutthemeansandrangesofvariableswithinthewholedatasettocarryoutvariablescaling—inordertopreventinformationleakage,thiskindofthingshouldonlybedonewiththetrainingdata.Othercommonexamplesofinformationleakagearecarryingoutfeatureselectionbeforepartitioningthedata(see

Dobecarefulwhereyouoptimisehyperparametersandselect

5

Figure1:See

Don’tallowtestdatatoleakintothetrainingprocess

.[left]Howthingsshouldbe,withthetrainingsetusedtotrainthemodel,andthetestsetusedtomeasureitsgenerality.[right]Whenthere’sadataleak,thetestsetcanimplicitlybecomepartofthetrainingprocess,meaningthatitnolongerprovidesarealiablemeasureofgenerality.

features

),usingthesametestdatatoevaluatethegeneralityofmultiplemodels(see

Douseavalidationset

and

Don’talwaysbelieveresultsfromcommunitybenchmarks

),andapplyingdataaugmentationbeforesplittingoffthetestdata(see

Don’tdodata

augmentationbeforesplittingyourdata

).Thebestthingyoucandotopreventtheseissuesistopartitionoffasubsetofyourdatarightatthestartofyourproject,andonlyusethisindependenttestsetoncetomeasurethegeneralityofasinglemodelattheendoftheproject(see

Dosavesomedatatoevaluateyourfinalmodelinstance

).Beparticularlycarefulifyou’reworkingwithtimeseriesdata,sincerandomsplitsofthedatacaneasilycauseleakageandoverfitting—see

Don’tignoretemporaldependencies

intimeseriesdata

formoreonthis.Forabroaderdiscussionofdataleakage,see

Kapoor

andNarayanan

[

2022

].
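The variable-scaling example above is worth making concrete; a sketch using scikit-learn's StandardScaler on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # toy data
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Leaky version: fitting the scaler on all the data would let test-set
# statistics (means, variances) influence the training pipeline
# leaky_scaler = StandardScaler().fit(X)

# Correct version: fit the scaler on the training data only,
# then apply the same transformation to both partitions
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Wrapping the scaler and model together in a scikit-learn Pipeline achieves the same thing automatically, and is harder to get wrong when cross-validation is involved.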

3.2 Do try out a range of different models

Generally speaking, there's no such thing as a single best ML model. In fact, there's a proof of this, in the form of the No Free Lunch theorem, which shows that no ML approach is any better than any other when considered over every possible problem [Wolpert, 2002]. So, your job is to find the ML model that works well for your particular problem. There is some guidance on this. For example, you can consider the inductive biases of ML models; that is, the kind of relationships they are capable of modelling. For instance, linear models, such as linear regression and logistic regression, are a good choice if you know there are no important non-linear relationships between the features in your data, but a bad choice otherwise. Good quality research on closely related problems may also be able to point you towards models that work particularly well. However, a lot of the time you're still left with quite a few choices, and the only way to work out which model is best is to try them all. Fortunately, modern ML libraries in Python (e.g. scikit-learn [Varoquaux et al., 2015]), R (e.g. caret [Kuhn, 2015]), Julia (e.g. MLJ [Blaom et al., 2020]) etc. allow you to try out multiple models with only small changes to your code, so there's no reason not to try them all out and find out for yourself which one works best. However, Don't use inappropriate models, and Do use a validation set, rather than the test set, to evaluate them. When comparing models, Do optimise your model's hyperparameters and Do evaluate a model multiple times to make sure you're giving them all a fair chance, and Do correct for multiple comparisons when you publish your results.

Figure 2: See Do keep up with recent developments in deep learning. A rough history of neural networks and deep learning, showing what I consider to be the milestones in their development. For a far more thorough and accurate account of the field's historical development, take a look at Schmidhuber [2015].
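The "small changes to your code" point can be seen in a sketch that loops several scikit-learn models over the same data; the dataset and the particular candidates are illustrative only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Because scikit-learn models share a common fit/predict interface,
# swapping models is just a change of dictionary entry. Scaling is
# bundled into a pipeline where the model benefits from it.
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC()),
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "random forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```

Note that cross-validation is used here rather than a single test set, in keeping with the advice to keep the test set out of model selection.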

3.3Don’tuseinappropriatemodels

Byloweringthebarriertoimplementation,modernMLlibrariesalsomakeiteasytoapplyinappropriatemodelstoyourdata.This,inturn,couldlookbadwhenyoutrytopublishyourresults.Asimpleexampleofthisisapplyingmodelsthatexpectcategoricalfeaturestoadatasetcontainingnumericalfeatures,orviceversa.SomeMLlibrariesallowyoutodothis,butitmayresultinapoormodelduetolossofinformation.Ifyoureallywanttousesuchamodel,thenyoushouldtransformthefeaturesfirst;therearevariouswaysofdoingthis,rangingfromsimpleone-hotencodingstocomplexlearnedembeddings.Otherexamplesofinappropriatemodelchoiceincludeusingaclassificationmodelwherearegressionmodelwouldmakemoresense(orviceversa),attemptingtoapplyamodelthatassumesnodependenciesbetweenvariablestotimeseriesdata,orusingamodelthatisunnecessarilycomplex(see

Don’tassumedeeplearningwillbethe

bestapproach

).Also,ifyou’replanningtouseyourmodelinpractice,

Dothinkabout

howyourmodelwillbedeployed

,anddon’tusemodelsthataren’tappropriateforyourusecase.

3.4 Do keep up with recent developments in deep learning

Machine learning is a fast-moving field, and it's easy to fall behind the curve and use approaches that other people consider to be outmoded. Nowhere is this more the case than in deep learning. So, whilst deep learning may not always be the best solution (see Don't assume deep learning will be the best approach), if you are going to use deep learning, then it's advisable to try and keep up with recent developments. To give some insight into this, Figure 2 summarises some of the important developments over the years. Multilayer perceptrons (MLP) and recurrent neural networks (particularly LSTM) have been popular for some time, but are increasingly being replaced by newer models such as convolutional neural networks (CNN) and transformers. CNNs (see Li et al. [2021] for a review) are now the go-to model for many tasks, and can be applied to both image data and non-image data. Beyond the use of convolutional layers, some of the main milestones which led to the success of CNNs include the use of rectified linear units (ReLU), the adoption of modern optimisers (notably Adam and its variants) and the widespread use of regularisation, especially dropout layers and batch normalisation — so give serious consideration to including these in your models. Another important group of contemporary models are transformers (see Lin et al. [2022] for a review). These are gradually replacing recurrent neural networks as the go-to model for processing sequential data, and are increasingly being applied to other data types too, such as images [Khan et al., 2022]. A prominent downside of both transformers and deep CNNs is that they have many parameters and therefore require a lot of data to train them. However, an option for small datasets is to use transfer learning, where a model is pre-trained on a large generic dataset and then fine-tuned on the dataset of interest [Han et al., 2021]. For an extensive, yet accessible, guide to deep learning, see Zhang et al. [2021].

3.5Don’tassumedeeplearningwillbethebestapproach

Anincreasinglycommonpitfallistoassumethatdeepneuralnetworkswillprovidethebestsolutiontoanyproblem,andconsequentlyfailtotryoutother,possiblymoreappropriate,models.Whilstdeeplearningisgreatforcertaintasks,itisnotgoodateverything;thereareplentyofexamplesofitbeingout-performedby“oldfashioned”machinelearningmodelssuchasrandomforestsandSVMs.See,forinstance,

Grinsztajn

etal.

[

2022

],whoshowthattree-basedmodelsoftenoutperformdeeplearnersontabulardata.Certainkindsofdeepneuralnetworkarchitecturemayalsobeill-suitedtocertainkindsofdata:see,forexample,

Zengetal.

[

2022

],whoarguethattransformersarenotwell-suitedtotimeseriesforecasting.Therearealsotheoreticalreasonswhyanyonekindofmodelwon’talwaysbethebestchoice(see

Dotryoutarangeofdifferentmodels

).Inparticular,adeepneuralnetworkisunlikelytobeagoodchoiceifyouhavelimiteddata,ifdomainknowledgesuggeststhattheunderlyingpatternisquitesimple,orifthemodelneedstobeinterpretable.Thislastpointisparticularlyworthconsider
