arXiv:2108.02497v3 [cs.LG] 9 Feb 2023

How to avoid machine learning pitfalls: a guide for academic researchers

Michael A. Lones*
Abstract

This document is a concise outline of some of the common mistakes that occur when using machine learning, and what can be done to avoid them. Whilst it should be accessible to anyone with a basic understanding of machine learning techniques, it was originally written for research students, and focuses on issues that are of particular concern within academic research, such as the need to do rigorous comparisons and reach valid conclusions. It covers five stages of the machine learning process: what to do before model building, how to reliably build models, how to robustly evaluate models, how to compare models fairly, and how to report results.
1Introduction
It’seasytomakemistakeswhenapplyingmachinelearning(ML),andthesemistakescanresultinMLmodelsthatfailtoworkasexpectedwhenappliedtodatanotseenduringtrainingandtesting[
Liaoetal.
,
2021
].Thisisaproblemforpractitioners,sinceitleadstothefailureofMLprojects.However,itisalsoaproblemforsociety,sinceiterodestrustinthefindingsandproductsofML[
Gibney
,
2022
].Thisguideaimstohelpnewcomersavoidsomeofthesemistakes.It’swrittenbyanacademic,andfocusesonlessonslearntwhilstdoingMLresearchinacademia.Whilstprimarilyaimedatstudentsandscientificresearchers,itshouldbeaccessibletoanyonegettingstartedinML,andonlyassumesabasicknowledgeofMLtechniques.However,unlikesimilarguidesaimedatamoregeneralaudience,itincludestopicsthatareofaparticularconcerntoacademia,suchastheneedtorigorouslyevaluateandcomparemodelsinordertogetworkpublished.Tomakeitmorereadable,theguidanceiswritteninformally,inaDosandDon’tsstyle.It’snotintendedtobeexhaustive,andreferences(withpublicly-accessibleURLswhereavailable)areprovidedforfurtherreading.Sinceitdoesn’tcoverissuesspecifictoparticularacademicsubjects,it’srecommendedyoualsoconsultsubject-specificguidancewhereavailable(e.g.
Stevensetal.
[
2020]
formedicine).Feedbackiswelcome,anditisexpectedthatthisdocumentwillevolveovertime.Forthisreason,ifyouciteit,pleaseincludethearXivversionnumber(currentlyv3).
*School of Mathematical and Computer Sciences, Heriot-Watt University, Edinburgh, Scotland, UK. Email: m.lones@hw.ac.uk, Web: http://www.macs.hw.ac.uk/~ml355.
Contents

1 Introduction
2 Before you start to build models
2.1 Do take the time to understand your data
2.2 Don't look at all your data
2.3 Do make sure you have enough data
2.4 Do talk to domain experts
2.5 Do survey the literature
2.6 Do think about how your model will be deployed
3 How to reliably build models
3.1 Don't allow test data to leak into the training process
3.2 Do try out a range of different models
3.3 Don't use inappropriate models
3.4 Do keep up with recent developments in deep learning
3.5 Don't assume deep learning will be the best approach
3.6 Do optimise your model's hyperparameters
3.7 Do be careful where you optimise hyperparameters and select features
3.8 Do avoid learning spurious correlations
4 How to robustly evaluate models
4.1 Do use an appropriate test set
4.2 Don't do data augmentation before splitting your data
4.3 Do use a validation set
4.4 Do evaluate a model multiple times
4.5 Do save some data to evaluate your final model instance
4.6 Don't use accuracy with imbalanced datasets
4.7 Don't ignore temporal dependencies in time series data
5 How to compare models fairly
5.1 Don't assume a bigger number means a better model
5.2 Do use statistical tests when comparing models
5.3 Do correct for multiple comparisons
5.4 Don't always believe results from community benchmarks
5.5 Do consider combinations of models
6 How to report your results
6.1 Do be transparent
6.2 Do report performance in multiple ways
6.3 Don't generalise beyond the data
6.4 Do be careful when reporting statistical significance
6.5 Do look at your models
7 Final thoughts
8 Acknowledgements
9 Changes
2 Before you start to build models

It's normal to want to rush into training and evaluating models, but it's important to take the time to think about the goals of a project, to fully understand the data that will be used to support these goals, to consider any limitations of the data that need to be addressed, and to understand what's already been done in your field. If you don't do these things, then you may end up with results that are hard to publish, or models that are not appropriate for their intended purpose.
2.1 Do take the time to understand your data

Eventually you will want to publish your work. This is a lot easier to do if your data is from a reliable source, has been collected using a reliable methodology, and is of good quality. For instance, if you are using data collected from an internet resource, make sure you know where it came from. Is it described in a paper? If so, take a look at the paper; make sure it was published somewhere reputable, and check whether the authors mention any limitations of the data. Do not assume that, because a dataset has been used by a number of papers, it is of good quality; sometimes data is used just because it is easy to get hold of, and some widely used datasets are known to have significant limitations (see Paullada et al. [2020] for a discussion of this). If you train your model using bad data, then you will most likely generate a bad model: a process known as garbage in garbage out. So, always begin by making sure your data makes sense. Do some exploratory data analysis (see Cox [2017] for suggestions). Look for missing or inconsistent records. It is much easier to do this now, before you train a model, rather than later, when you're trying to explain to reviewers why you used bad data.
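The kind of basic sanity checks described above can easily be scripted. Below is a minimal sketch, assuming pandas is available; the toy dataset and the age range rule are hypothetical, chosen purely to illustrate the idea:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset containing a missing value and an implausible record
df = pd.DataFrame({
    "age": [25.0, 31.0, np.nan, 140.0],   # 140 is outside any plausible range
    "income": [30000, 45000, 52000, 41000],
})

# Count missing values in each column
missing_counts = df.isna().sum()

# Flag records that violate a simple (assumed) domain rule: 0 <= age <= 120
implausible = df[(df["age"] < 0) | (df["age"] > 120)]

print(missing_counts["age"], len(implausible))
```

Even a handful of checks like these can surface problems, such as the missing and out-of-range ages above, before they silently degrade a trained model.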
2.2Don’tlookatallyourdata
Asyoulookatdata,itisquitelikelythatyouwillspotpatternsandmakeinsightsthatguideyourmodelling.Thisisanothergoodreasontolookatdata.However,itisimportantthatyoudonotmakeuntestableassumptionsthatwilllaterfeedintoyourmodel.The“untestable”bitisimportanthere;it’sfinetomakeassumptions,buttheseshouldonlyfeedintothetrainingofthemodel,notthetesting.So,toensurethisisthecase,youshouldavoidlookingcloselyatanytestdataintheinitialexploratoryanalysisstage.Otherwiseyoumight,consciouslyorunconsciously,makeassumptionsthatlimitthegeneralityofyourmodelinanuntestableway.ThisisathemeIwillreturntoseveraltimes,sincetheleakageofinformationfromthetestsetintothetrainingprocessisacommonreasonwhyMLmodelsfailtogeneralise.See
Don’tallowtestdatatoleakinto
thetrainingprocess
formoreonthis.
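One practical way to follow this advice is to split off the test data before doing any exploratory analysis. A minimal sketch using scikit-learn; the random data here is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((100, 5))          # hypothetical feature matrix
y = rng.integers(0, 2, size=100)  # hypothetical binary labels

# Partition the data first; the test set is then set aside untouched
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Exploratory analysis should use only X_train and y_train from this point on
```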
2.3 Do make sure you have enough data

If you don't have enough data, then it may not be possible to train a model that generalises. Working out whether this is the case can be challenging, and may not be evident until you start building models: it all depends on the signal to noise ratio in the dataset. If the signal is strong, then you can get away with less data; if it's weak, then you need more data. If you can't get more data (and this is a common issue in many research fields) then you can make better use of existing data by using cross-validation (see Do evaluate a model multiple times). You can also use data augmentation techniques (e.g. see Wong et al. [2016] and Shorten and Khoshgoftaar [2019]; for time series data, see Iwana and Uchida [2021]), and these can be quite effective for boosting small datasets, though Don't do data augmentation before splitting your data. Data augmentation is also useful in situations where you have limited data in certain parts of your dataset, e.g. in classification problems where you have fewer samples in some classes than others, a situation known as class imbalance. See Haixiang et al. [2017] for a review of methods for dealing with this; also see Don't use accuracy with imbalanced datasets. Another option for dealing with small datasets is to use transfer learning (see Do keep up with recent developments in deep learning). However, if you have limited data, then it's likely that you will also have to limit the complexity of the ML models you use, since models with many parameters, like deep neural networks, can easily overfit small datasets (see Don't assume deep learning will be the best approach). Either way, it's important to identify this issue early on, and come up with a suitable strategy to mitigate it.
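As a concrete illustration of using cross-validation to make better use of limited data, here is a minimal sketch with scikit-learn; the iris dataset and logistic regression model are arbitrary choices for illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each sample is used for testing exactly once,
# and for training in the other four folds, so no data is wasted
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```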
2.4 Do talk to domain experts

Domain experts can be very valuable. They can help you to understand which problems are useful to solve, they can help you choose the most appropriate feature set and ML model to use, and they can help you publish to the most appropriate audience. Failing to consider the opinion of domain experts can lead to projects which don't solve useful problems, or which solve useful problems in inappropriate ways. An example of the latter is using an opaque ML model to solve a problem where there is a strong need to understand how the model reaches an outcome, e.g. in making medical or financial decisions (see Rudin [2019]). At the beginning of a project, domain experts can help you to understand the data, and point you towards features that are likely to be predictive. At the end of a project, they can help you to publish in domain-specific journals, and hence reach an audience that is most likely to benefit from your research.
2.5Dosurveytheliterature
You’reprobablynotthefirstpersontothrowMLataparticularproblemdomain,soit’simportanttounderstandwhathasandhasn’tbeendonepreviously.Otherpeoplehavingworkedonthesameproblemisn’tabadthing;academicprogressistypicallyaniterativeprocess,witheachstudyprovidinginformationthatcanguidethenext.Itmaybediscouragingtofindthatsomeonehasalreadyexploredyourgreatidea,buttheymostlikelyleftplentyofavenuesofinvestigationstillopen,andtheirpreviousworkcanbeusedasjustificationforyourwork.Toignorepreviousstudiesistopotentiallymissoutonvaluableinformation.Forexample,someonemayhavetriedyourproposedapproachbeforeandfoundfundamentalreasonswhyitwon’twork(andthereforesavedyouafewyearsoffrustration),ortheymayhavepartiallysolvedtheprobleminawaythatyou
canbuildon.So,it’simportanttodoaliteraturereviewbeforeyoustartwork;leavingittoolatemaymeanthatyouareleftscramblingtoexplainwhyyouarecoveringthesamegroundornotbuildingonexistingknowledgewhenyoucometowriteapaper.
2.6 Do think about how your model will be deployed

Why do you want to build an ML model? This is an important question, and the answer should influence the process you use to develop your model. Many academic studies are just that, studies, and not really intended to produce models that will be used in the real world. This is fair enough, since the process of building and analysing models can itself give very useful insights into a problem. However, for many academic studies, the eventual goal is to produce an ML model that can be deployed in a real world situation. If this is the case, then it's worth thinking early on about how it is going to be deployed. For instance, if it's going to be deployed in a resource-limited environment, such as a sensor or a robot, this may place limitations on the complexity of the model. If there are time constraints, e.g. a classification of a signal is required within milliseconds, then this also needs to be taken into account when selecting a model. Another consideration is how the model is going to be tied into the broader software system within which it is deployed; this procedure is often far from simple (see Sculley et al. [2015]). However, emerging approaches such as MLOps aim to address some of the difficulties; see Tamburri [2020] for a review, and Shankar et al. [2022] for a discussion of common challenges when operationalising ML models.
3 How to reliably build models

Building models is one of the more enjoyable parts of ML. With modern ML frameworks, it's easy to throw all manner of approaches at your data and see what sticks. However, this can lead to a disorganised mess of experiments that's hard to justify and hard to write up. So, it's important to approach model building in an organised manner, making sure you use data correctly, and putting adequate consideration into the choice of models.
3.1 Don't allow test data to leak into the training process

It's essential to have data that you can use to measure how well your model generalises. A common problem is allowing information about this data to leak into the configuration, training or selection of models (see Figure 1). When this happens, the data no longer provides a reliable measure of generality, and this is a common reason why published ML models often fail to generalise to real world data. There are a number of ways that information can leak from a test set. Some of these seem quite innocuous. For instance, during data preparation, using information about the means and ranges of variables within the whole dataset to carry out variable scaling; in order to prevent information leakage, this kind of thing should only be done with the training data. Other common examples of information leakage are carrying out feature selection before partitioning the data (see Do be careful where you optimise hyperparameters and select features), using the same test data to evaluate the generality of multiple models (see Do use a validation set and Don't always believe results from community benchmarks), and applying data augmentation before splitting off the test data (see Don't do data augmentation before splitting your data). The best thing you can do to prevent these issues is to partition off a subset of your data right at the start of your project, and only use this independent test set once to measure the generality of a single model at the end of the project (see Do save some data to evaluate your final model instance). Be particularly careful if you're working with time series data, since random splits of the data can easily cause leakage and overfitting; see Don't ignore temporal dependencies in time series data for more on this. For a broader discussion of data leakage, see Kapoor and Narayanan [2022].

Figure 1: See Don't allow test data to leak into the training process. [left] How things should be, with the training set used to train the model, and the test set used to measure its generality. [right] When there's a data leak, the test set can implicitly become part of the training process, meaning that it no longer provides a reliable measure of generality.
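To make the variable scaling example concrete, the sketch below uses scikit-learn to fit the scaler on the training partition only, then applies that same fitted transformation to the test partition. The random data is purely illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.random((100, 3))          # hypothetical features
y = rng.integers(0, 2, size=100)  # hypothetical labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the scaler using statistics from the training data only...
scaler = StandardScaler().fit(X_train)

# ...then apply that same transformation to both partitions.
# Fitting on the full dataset here would leak test set statistics
# (means and variances) into the training process.
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```

A scikit-learn Pipeline achieves the same effect automatically, which is particularly useful when combining scaling with cross-validation.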
3.2 Do try out a range of different models

Generally speaking, there's no such thing as a single best ML model. In fact, there's a proof of this, in the form of the No Free Lunch theorem, which shows that no ML approach is any better than any other when considered over every possible problem [Wolpert, 2002]. So, your job is to find the ML model that works well for your particular problem. There is some guidance on this. For example, you can consider the inductive biases of ML models; that is, the kind of relationships they are capable of modelling. For instance, linear models, such as linear regression and logistic regression, are a good choice if you know there are no important non-linear relationships between the features in your data, but a bad choice otherwise. Good quality research on closely related problems may also be able to point you towards models that work particularly well. However, a lot of the time you're still left with quite a few choices, and the only way to work out which model is best is to try them all. Fortunately, modern ML libraries in Python (e.g. scikit-learn [Varoquaux et al., 2015]), R (e.g. caret [Kuhn, 2015]), Julia (e.g. MLJ [Blaom et al., 2020]) etc. allow you to try out multiple models with only small changes to your code, so there's no reason not to try them all out and find out for yourself which one works best. However, Don't use inappropriate models, and Do use a validation set, rather than the test set, to evaluate them. When comparing models, Do optimise your model's hyperparameters and Do evaluate a model multiple times to make sure you're giving them all a fair chance, and Do correct for multiple comparisons when you publish your results.

Figure 2: See Do keep up with recent developments in deep learning. A rough history of neural networks and deep learning, showing what I consider to be the milestones in their development. For a far more thorough and accurate account of the field's historical development, take a look at Schmidhuber [2015].
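The "try them all" advice takes only a few lines in scikit-learn. The candidate models and dataset below are arbitrary examples; in a real project you would evaluate on a validation set rather than the final test set:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# A few candidate models with different inductive biases
models = {
    "logistic_regression": LogisticRegression(max_iter=10000),
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
}

# Evaluate every candidate under the same cross-validation protocol,
# so the comparison between them is fair
results = {name: cross_val_score(m, X, y, cv=5).mean()
           for name, m in models.items()}
```

Because each model is evaluated under an identical protocol, differences in the scores reflect the models rather than the evaluation procedure.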
3.3 Don't use inappropriate models

By lowering the barrier to implementation, modern ML libraries also make it easy to apply inappropriate models to your data. This, in turn, could look bad when you try to publish your results. A simple example of this is applying models that expect categorical features to a dataset containing numerical features, or vice versa. Some ML libraries allow you to do this, but it may result in a poor model due to loss of information. If you really want to use such a model, then you should transform the features first; there are various ways of doing this, ranging from simple one-hot encodings to complex learned embeddings. Other examples of inappropriate model choice include using a classification model where a regression model would make more sense (or vice versa), attempting to apply a model that assumes no dependencies between variables to time series data, or using a model that is unnecessarily complex (see Don't assume deep learning will be the best approach). Also, if you're planning to use your model in practice, Do think about how your model will be deployed, and don't use models that aren't appropriate for your use case.
3.4 Do keep up with recent developments in deep learning

Machine learning is a fast-moving field, and it's easy to fall behind the curve and use approaches that other people consider to be outmoded. Nowhere is this more the case than in deep learning. So, whilst deep learning may not always be the best solution (see Don't assume deep learning will be the best approach), if you are going to use deep learning, then it's advisable to try and keep up with recent developments. To give some insight into this, Figure 2 summarises some of the important developments over the years. Multilayer perceptrons (MLP) and recurrent neural networks (particularly LSTM) have been popular for some time, but are increasingly being replaced by newer models such as convolutional neural networks (CNN) and transformers. CNNs (see Li et al. [2021] for a review) are now the go-to model for many tasks, and can be applied to both image data and non-image data. Beyond the use of convolutional layers, some of the main milestones which led to the success of CNNs include the use of rectified linear units (ReLU), the adoption of modern optimisers (notably Adam and its variants) and the widespread use of regularisation, especially dropout layers and batch normalisation; so give serious consideration to including these in your models. Another important group of contemporary models are transformers (see Lin et al. [2022] for a review). These are gradually replacing recurrent neural networks as the go-to model for processing sequential data, and are increasingly being applied to other data types too, such as images [Khan et al., 2022]. A prominent downside of both transformers and deep CNNs is that they have many parameters and therefore require a lot of data to train them. However, an option for small datasets is to use transfer learning, where a model is pre-trained on a large generic dataset and then fine-tuned on the dataset of interest [Han et al., 2021]. For an extensive, yet accessible, guide to deep learning, see Zhang et al. [2021].
3.5Don’tassumedeeplearningwillbethebestapproach
Anincreasinglycommonpitfallistoassumethatdeepneuralnetworkswillprovidethebestsolutiontoanyproblem,andconsequentlyfailtotryoutother,possiblymoreappropriate,models.Whilstdeeplearningisgreatforcertaintasks,itisnotgoodateverything;thereareplentyofexamplesofitbeingout-performedby“oldfashioned”machinelearningmodelssuchasrandomforestsandSVMs.See,forinstance,
Grinsztajn
etal.
[
2022
],whoshowthattree-basedmodelsoftenoutperformdeeplearnersontabulardata.Certainkindsofdeepneuralnetworkarchitecturemayalsobeill-suitedtocertainkindsofdata:see,forexample,
Zengetal.
[
2022
],whoarguethattransformersarenotwell-suitedtotimeseriesforecasting.Therearealsotheoreticalreasonswhyanyonekindofmodelwon’talwaysbethebestchoice(see
Dotryoutarangeofdifferentmodels
).Inparticular,adeepneuralnetworkisunlikelytobeagoodchoiceifyouhavelimiteddata,ifdomainknowledgesuggeststhattheunderlyingpatternisquitesimple,orifthemodelneedstobeinterpretable.Thislastpointisparticularlyworthconsider