
Deep Learning with H2O

Arno Candel and Erin LeDell
Edited by: Angela Bartz

http://h2o.ai/resources/

October 2021: Sixth Edition

Deep Learning with H2O
by Arno Candel & Erin LeDell
with assistance from Viraj Parmar & Anisha Arora
Edited by: Angela Bartz

Published by H2O.ai, Inc.
2307 Leghorn St.
Mountain View, CA 94043

© 2016-2021 H2O.ai, Inc. All Rights Reserved.
October 2021: Sixth Edition

Photos by © H2O.ai, Inc.

All copyrights belong to their respective owners. While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

Printed in the United States of America.

Contents

Introduction
What is H2O?
Installation
Installation in R
Installation in Python
Pointing to a Different H2O Cluster
Example Code
Citation
Deep Learning Overview
H2O's Deep Learning Architecture
Summary of Features
Training Protocol
Initialization
Activation and Loss Functions
Parallel Distributed Network Training
Specifying the Number of Training Samples
Regularization
Advanced Optimization
Momentum Training
Rate Annealing
Adaptive Learning
Loading Data
Data Standardization/Normalization
Convergence-based Early Stopping
Time-based Early Stopping
Additional Parameters
Use Case: MNIST Digit Classification
MNIST Overview
Performing a Trial Run
N-fold Cross-Validation
Extracting and Handling the Results
Web Interface
Variable Importances
Java Model
Grid Search for Model Comparison
Cartesian Grid Search
Random Grid Search
Checkpoint Models
Achieving World-Record Performance
Computational Performance
Deep Autoencoders
Nonlinear Dimensionality Reduction
Use Case: Anomaly Detection
Stacked Autoencoder
Unsupervised Pretraining with Supervised Fine-Tuning
Parameters
Common R Commands
Common Python Commands
Acknowledgments
References
Authors

Introduction

This document introduces the reader to Deep Learning with H2O. Examples are written in R and Python. Topics include:

installation of H2O

basic Deep Learning concepts

building deep neural nets in H2O

how to interpret model output

how to make predictions

as well as various implementation details.

What is H2O?

H2O.ai is focused on bringing AI to businesses through software. Its flagship product is H2O, the leading open source platform that makes it easy for financial services, insurance companies, and healthcare companies to deploy AI and deep learning to solve complex problems. More than 9,000 organizations and 80,000+ data scientists depend on H2O for critical applications like predictive maintenance and operational intelligence. The company, which was recently named to the CB Insights AI 100, is used by 169 Fortune 500 enterprises, including 8 of the world's 10 largest banks, 7 of the 10 largest insurance companies, and 4 of the top 10 healthcare companies. Notable customers include Capital One, Progressive Insurance, Transamerica, Comcast, Nielsen Catalina Solutions, Macy's, Walgreens, and Kaiser Permanente.

Using in-memory compression, H2O handles billions of data rows in-memory, even with a small cluster. To make it easier for non-engineers to create complete analytic workflows, H2O's platform includes interfaces for R, Python, Scala, Java, JSON, and CoffeeScript/JavaScript, as well as a built-in web interface, Flow. H2O is designed to run in standalone mode, on Hadoop, or within a Spark Cluster, and typically deploys within minutes.

H2O includes many common machine learning algorithms, such as generalized linear modeling (linear regression, logistic regression, etc.), Naïve Bayes, principal components analysis, k-means clustering, and word2vec. H2O implements best-in-class algorithms at scale, such as distributed random forest, gradient boosting, and deep learning. H2O also includes a Stacked Ensembles method, which finds the optimal combination of a collection of prediction algorithms using a process known as "stacking." With H2O, customers can build thousands of models and compare the results to get the best predictions.

H2O is nurturing a grassroots movement of physicists, mathematicians, and computer scientists to herald the new wave of discovery with data science by collaborating closely with academic researchers and industrial data scientists. Stanford University giants Stephen Boyd, Trevor Hastie, and Rob Tibshirani advise the H2O team on building scalable machine learning algorithms. And with hundreds of meetups over the past several years, H2O continues to remain a word-of-mouth phenomenon.

Try it out

Download H2O directly at http://h2o.ai/download.

Install H2O's R package from CRAN at https://cran.r-project.org/web/packages/h2o/.

Install the Python package from PyPI at https://pypi.org/project/h2o/.

Join the community

To learn about our training sessions, hackathons, and product updates, visit http://h2o.ai.

To learn about our meetups, visit https://www.meetup.com/topics/h2o/all/.

Have questions? Post them on Stack Overflow using the h2o tag at https://stackoverflow.com/questions/tagged/h2o.

Have a Google account (such as Gmail or Google+)? Join the open source community forum at https://groups.google.com/d/forum/h2ostream.

Join the chat at https://gitter.im/h2oai/h2o-3.

Installation

H2O requires Java; if you do not already have Java installed, install it from https://www.java.com/en/download/ before installing H2O.

The easiest way to directly install H2O is via an R or Python package.

Installation in R

To load a recent H2O package from CRAN, run:

install.packages("h2o")

Note: The version of H2O in CRAN may be one release behind the current version.

For the latest recommended version, download the latest stable H2O-3 build from the H2O download page:

Go to http://h2o.ai/download.

Choose the latest stable H2O-3 build.

Click the "Install in R" tab.

Copy and paste the commands into your R session.

After H2O is installed on your system, verify the installation:

library(h2o)

# Start H2O on your local machine using all available cores.
# By default, CRAN policies limit use to only 2 cores.
h2o.init(nthreads = -1)

# Get help
?h2o.glm
?h2o.gbm
?h2o.deeplearning

# Show a demo
demo(h2o.glm)
demo(h2o.gbm)
demo(h2o.deeplearning)


Installation in Python

To load a recent H2O package from PyPI, run:

pip install h2o

To download the latest stable H2O-3 build from the H2O download page:

Go to http://h2o.ai/download.

Choose the latest stable H2O-3 build.

Click the "Install in Python" tab.

Copy and paste the commands into your Python session.

After H2O is installed, verify the installation:

import h2o

# Start H2O on your local machine
h2o.init()

# Get help
help(h2o.estimators.glm.H2OGeneralizedLinearEstimator)
help(h2o.estimators.gbm.H2OGradientBoostingEstimator)
help(h2o.estimators.deeplearning.H2ODeepLearningEstimator)

# Show a demo
h2o.demo("glm")
h2o.demo("gbm")
h2o.demo("deeplearning")


Pointing to a Different H2O Cluster

The instructions in the previous sections create a one-node H2O cluster on your local machine.

To connect to an established H2O cluster (in a multi-node Hadoop environment, for example), specify the IP address and port number for the established cluster using the ip and port parameters in the h2o.init() command. The syntax for this function is identical for R and Python:

h2o.init(ip = "123.45.67.89", port = 54321)

Example Code

R and Python code for the examples in this document can be found here:

https://github.com/h2oai/h2o-3/tree/master/h2o-docs/src/booklets/v2_2015/source/DeepLearning_Vignette_code_examples

The document source itself can be found here:

https://github.com/h2oai/h2o-3/blob/master/h2o-docs/src/booklets/v2_2015/source/DeepLearning_Vignette.tex

Citation

To cite this booklet, use the following:

Candel, A., Parmar, V., LeDell, E., and Arora, A. (Oct 2021). Deep Learning with H2O. http://h2o.ai/resources.

Deep Learning Overview

Unlike the neural networks of the past, modern Deep Learning provides training stability, generalization, and scalability with big data. Since it performs quite well on a number of diverse problems, Deep Learning is quickly becoming the algorithm of choice for the highest predictive accuracy.

The first section is a brief overview of deep neural networks for supervised learning tasks. There are several theoretical frameworks for Deep Learning, but this document focuses primarily on the feedforward architecture used by H2O.

The basic unit in the model is the neuron, a biologically inspired model of the human neuron. In humans, the varying strengths of the neurons' output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron's activation.

In the model, the weighted combination α = Σ_{i=1}^{n} w_i x_i + b of input signals is aggregated, and an output signal f(α) is then transmitted by the connected neuron. The function f represents the nonlinear activation function used throughout the network, and the bias b represents the neuron's activation threshold.

Multi-layer, feedforward neural networks consist of many layers of interconnected neuron units, starting with an input layer to match the feature space, followed by multiple layers of nonlinearity, and ending with a linear regression or classification layer to match the output space. The inputs and outputs of the model's units follow the basic logic of the single neuron described above.

Bias units are included in each non-output layer of the network. The weights linking neurons and biases with other neurons fully determine the output of the entire network. Learning occurs when these weights are adapted to minimize the error on the labeled training data. More specifically, for each training example j, the objective is to minimize a loss function,

L(W, B | j).

Here, W is the collection {W_i}_{1:N-1}, where W_i denotes the weight matrix connecting layers i and i+1 for a network of N layers. Similarly, B is the collection {b_i}_{1:N-1}, where b_i denotes the column vector of biases for layer i+1.

This basic framework of multi-layer neural networks can be used to accomplish Deep Learning tasks. Deep Learning architectures are models of hierarchical feature extraction, typically involving multiple levels of nonlinearity. Deep Learning models are able to learn useful representations of raw data and have exhibited high performance on complex data such as images, speech, and text (Bengio, 2009).

H2O's Deep Learning Architecture

H2O follows the model of multi-layer, feedforward neural networks for predictive modeling. This section provides a more detailed description of H2O's Deep Learning features, parameter configurations, and computational implementation.

Summary of Features

H2O's Deep Learning functionalities include:

supervised training protocol for regression and classification tasks

fast and memory-efficient Java implementations based on columnar compression and fine-grain MapReduce

multi-threaded and distributed parallel computation that can be run on a single or a multi-node cluster

automatic, per-neuron, adaptive learning rate for fast convergence

optional specification of learning rate, annealing, and momentum options

regularization options such as L1, L2, dropout, Hogwild!, and model averaging to prevent model overfitting

elegant and intuitive web interface (Flow)

fully scriptable R API from H2O's CRAN package

fully scriptable Python API

grid search for hyperparameter optimization and model selection

automatic early stopping based on convergence of user-specified metrics to user-specified tolerance

model checkpointing for reduced run times and model tuning

automatic pre- and post-processing for categorical and numerical data

automatic imputation of missing values (optional)

automatic tuning of communication vs computation for best performance

model export in plain Java code for deployment in production environments

additional expert parameters for model tuning

deep autoencoders for unsupervised feature learning and anomaly detection

Training Protocol

The training protocol described below follows many of the ideas and advances discussed in recent Deep Learning literature.

Initialization

Various Deep Learning architectures employ a combination of unsupervised pre-training followed by supervised training, but H2O uses a purely supervised training protocol. The default initialization scheme is the uniform adaptive option, which is an optimized initialization based on the size of the network. Deep Learning can also be started using a random initialization drawn from either a uniform or normal distribution, optionally specifying a scaling parameter.
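Below is a minimal R sketch of selecting a different initialization scheme; it uses the built-in iris data and arbitrary layer sizes purely for illustration, with initial_weight_distribution and initial_weight_scale controlling the scheme described above.

library(h2o)
h2o.init(nthreads = -1)

# Small, self-contained example frame
iris.hex <- as.h2o(iris)

# Default is "UniformAdaptive"; request a random normal initialization instead,
# with an explicit scaling parameter.
dl <- h2o.deeplearning(x = 1:4, y = "Species",
                       training_frame = iris.hex,
                       initial_weight_distribution = "Normal",
                       initial_weight_scale = 0.01,
                       hidden = c(32, 32), epochs = 10)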

Activation and Loss Functions

The choices for the nonlinear activation function f described in the introduction are summarized in Table 1 below. x_i and w_i represent the firing neuron's input values and their weights, respectively; α denotes the weighted combination α = Σ_i w_i x_i + b.

Table 1: Activation Functions

Function          Formula                                   Range
Tanh              f(α) = (e^α − e^{−α}) / (e^α + e^{−α})    f(·) ∈ [−1, 1]
Rectified Linear  f(α) = max(0, α)                          f(·) ∈ R_+
Maxout            f(α_1, α_2) = max(α_1, α_2)               f(·) ∈ R

The tanh function is a rescaled and shifted logistic function; its symmetry around 0 allows the training algorithm to converge faster. The rectified linear activation function has demonstrated high performance on image recognition tasks and is a more biologically accurate model of neuron activations (LeCun et al, 1998).

Maxout is a generalization of the Rectified Linear activation, where each neuron picks the largest output of k separate channels, where each channel has its own weights and bias values. The current implementation supports only k = 2. Maxout activation works particularly well with dropout (Goodfellow et al, 2013). For more information, refer to Regularization.

The Rectifier is the special case of Maxout where the output of one channel is always 0. It is difficult to determine a "best" activation function to use; each may outperform the others in separate scenarios, but grid search models can help to compare activation functions and other parameters. For more information, refer to Grid Search for Model Comparison. The default activation function is the Rectifier. Each of these activation functions can be operated with dropout regularization. For more information, refer to Regularization.
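As a hedged illustration of the grid-search comparison mentioned above, the following R sketch compares three activation functions on the built-in iris data; the grid id, layer sizes, and epoch count are arbitrary choices for illustration only.

library(h2o)
h2o.init(nthreads = -1)
iris.hex <- as.h2o(iris)

# Cartesian grid over the activation function only
grid <- h2o.grid("deeplearning",
                 grid_id = "activation_grid",
                 x = 1:4, y = "Species",
                 training_frame = iris.hex,
                 hidden = c(32, 32), epochs = 10,
                 hyper_params = list(activation = c("Rectifier", "Tanh", "Maxout")))

# Rank the resulting models by log loss
h2o.getGrid("activation_grid", sort_by = "logloss", decreasing = FALSE)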

Specify one of the following distribution functions for the response variable using the distribution argument:

AUTO

Bernoulli

Multinomial

Poisson

Gamma

Tweedie

Laplace

Quantile

Huber

Gaussian

Each distribution has a primary association with a particular loss function, but some distributions allow users to specify a non-default loss function from the group of loss functions specified in Table 2. Bernoulli and Multinomial are primarily associated with cross-entropy (also known as log-loss), Gaussian with Mean Squared Error, Laplace with Absolute loss (a special case of Quantile with quantile_alpha = 0.5), and Huber with Huber loss. For Poisson, Gamma, and Tweedie distributions, the loss function cannot be changed, so loss must be set to AUTO.

The system default enforces the table's typical use rule based on whether regression or classification is being performed. Note here that t^(j) and o^(j) are the predicted (also known as target) output and actual output, respectively, for training example j; further, let y represent the output units and O the output layer.

Table 2: Loss Functions

Function            Formula                                                                         Typical use
Mean Squared Error  L(W,B|j) = (1/2) ‖t^(j) − o^(j)‖_2^2                                            Regression
Absolute            L(W,B|j) = ‖t^(j) − o^(j)‖_1                                                    Regression
Huber               L(W,B|j) = (1/2) ‖t^(j) − o^(j)‖_2^2  for ‖t^(j) − o^(j)‖_1 ≤ 1,                Regression
                                ‖t^(j) − o^(j)‖_1 − 1/2   otherwise
Cross Entropy       L(W,B|j) = −Σ_{y∈O} [ ln(o_y^(j)) · t_y^(j) + ln(1 − o_y^(j)) · (1 − t_y^(j)) ]  Classification

To predict the 80th percentile of the petal length of the Iris dataset in R, use the following:

Example in R

library(h2o)
h2o.init(nthreads = -1)
train.hex <- h2o.importFile("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
splits <- h2o.splitFrame(train.hex, 0.75, seed = 1234)
dl <- h2o.deeplearning(x = 1:3, y = "petal_len",
                       training_frame = splits[[1]],
                       distribution = "quantile",
                       quantile_alpha = 0.8)
h2o.predict(dl, splits[[2]])


To predict the 80th percentile of the petal length of the Iris dataset in Python, use the following:

Example in Python

import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()
train = h2o.import_file("https://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
splits = train.split_frame(ratios=[0.75], seed=1234)
dl = H2ODeepLearningEstimator(distribution="quantile", quantile_alpha=0.8)
dl.train(x=list(range(0, 2)), y="petal_len", training_frame=splits[0])
print(dl.predict(splits[1]))


Parallel Distributed Network Training

The process of minimizing the loss function L(W,B|j) is a parallelized version of stochastic gradient descent (SGD). A summary of standard SGD is provided below, with the gradient ∇L(W,B|j) computed via backpropagation (LeCun et al, 1998). The constant α is the learning rate, which controls the step sizes during gradient descent.

Standard stochastic gradient descent

Initialize W, B
Iterate until convergence criterion reached:
  Get training example i
  Update all weights w_jk ∈ W, biases b_jk ∈ B:
    w_jk := w_jk − α ∂L(W,B|j)/∂w_jk
    b_jk := b_jk − α ∂L(W,B|j)/∂b_jk

Stochastic gradient descent is fast and memory-efficient but not easily parallelizable without becoming slow. We utilize Hogwild!, the recently developed lock-free parallelization scheme from Niu et al, 2011, to address this issue.

Hogwild! follows a shared memory model where multiple cores (where each core handles separate subsets or all of the training data) are able to make independent contributions to the gradient updates ∇L(W,B|j) asynchronously.

In a multi-node system, this parallelization scheme works on top of H2O's distributed setup that distributes the training data across the cluster. Each node operates in parallel on its local data until the final parameters W, B are obtained by averaging.

Parallel distributed and multi-threaded training with SGD in H2O Deep Learning

Initialize global model parameters W, B
Distribute training data T across nodes (can be disjoint or replicated)
Iterate until convergence criterion reached:
  For nodes n with training subset T_n, do in parallel:
    Obtain copy of the global model parameters W_n, B_n
    Select active subset T_na ⊂ T_n (user-given number of samples per iteration)
    Partition T_na into T_nac by cores n_c
    For cores n_c on node n, do in parallel:
      Get training example i ∈ T_nac
      Update all weights w_jk ∈ W_n, biases b_jk ∈ B_n:
        w_jk := w_jk − α ∂L(W,B|j)/∂w_jk
        b_jk := b_jk − α ∂L(W,B|j)/∂b_jk
  Set W, B := Avg_n W_n, Avg_n B_n
  Optionally score the model on (potentially sampled) train/validation scoring sets

Here, the weight and bias updates follow the asynchronous Hogwild! procedure to incrementally adjust each node's parameters W_n, B_n after seeing example i. The Avg_n notation represents the final averaging of these local parameters across all nodes to obtain the global model parameters and complete training.

Specifying the Number of Training Samples

H2O Deep Learning is scalable and can take advantage of large clusters of compute nodes. There are three operating modes. The default behavior allows every node to train on the entire (replicated) dataset while automatically shuffling (and/or using a subset of) the training examples for each iteration locally.

For datasets that don't fit into each node's memory (depending on the amount of heap memory specified by the -Xmx Java option), it might not be possible to replicate the data, so each compute node can be specified to train only with local data. An experimental single node mode is available for cases where final convergence is slow due to the presence of too many nodes, but this has not been necessary in our testing.

To specify the global number of training examples shared with the distributed SGD worker nodes between model averaging, use the train_samples_per_iteration parameter. If the specified value is -1, all nodes process all their local training data on each iteration.

If replicate_training_data is enabled, which is the default setting, this will result in training N epochs (passes over the data) per iteration on N nodes; otherwise, one epoch will be trained per iteration. Specifying 0 always results in one epoch per iteration regardless of the number of compute nodes. In general, this parameter supports any positive number. For large datasets, we recommend specifying a fraction of the dataset.

A value of -2, which is the default value, enables auto-tuning for this parameter based on the computational performance of the processors and the network of the system and attempts to find a good balance between computation and communication. This parameter can affect the convergence rate during training.

For example, if the training data contains 10 million rows, and the number of training samples per iteration is specified as 100,000 when running on four nodes, then each node will process 25,000 examples per iteration, and it will take 40 distributed iterations to process one epoch.

If the value is too high, it might take too long between synchronization and model convergence may be slow. If the value is too low, network communication overhead will dominate the runtime and computational performance will suffer.
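The following R snippet is a minimal sketch of setting this parameter; the value of 100,000 mirrors the worked example above and is otherwise arbitrary, and the built-in iris data merely keeps the example self-contained.

library(h2o)
h2o.init(nthreads = -1)
iris.hex <- as.h2o(iris)

# -2 (default): auto-tune; -1: each node uses all of its local data per iteration;
# 0: exactly one epoch per iteration; a positive value: that many training samples
# (summed across the cluster) between model averaging steps.
dl <- h2o.deeplearning(x = 1:4, y = "Species",
                       training_frame = iris.hex,
                       train_samples_per_iteration = 100000,
                       epochs = 10)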

Regularization

H2O's Deep Learning framework supports regularization techniques to prevent overfitting. ℓ1 (L1: Lasso) and ℓ2 (L2: Ridge) regularization enforce the same penalties as they do with other models: modifying the loss function so as to minimize loss:

L′(W,B|j) = L(W,B|j) + λ_1 R_1(W,B|j) + λ_2 R_2(W,B|j)

For ℓ1 regularization, R_1(W,B|j) is the sum of all ℓ1 norms for the weights and biases in the network; ℓ2 regularization via R_2(W,B|j) represents the sum of squares of all the weights and biases in the network. The constants λ_1 and λ_2 are generally specified as very small (for example, 10^-5).

The second type of regularization available for Deep Learning is a modern innovation called dropout (Hinton et al., 2012). Dropout constrains the online optimization so that during forward propagation for a given training example, each neuron in the network suppresses its activation with probability P, which is usually less than 0.2 for input neurons and up to 0.5 for hidden neurons.

There are two effects: as with ℓ2 regularization, the network weight values are scaled toward 0. Although they share the same global parameters, each training example trains a different model. As a result, dropout allows an exponentially large number of models to be averaged as an ensemble to help prevent overfitting and improve generalization.

If the feature space is large and noisy, specifying an input dropout using the input_dropout_ratio parameter can be especially useful. Note that input dropout can be specified independently of the dropout specification in the hidden layers (which requires activation to be TanhWithDropout, MaxoutWithDropout, or RectifierWithDropout). Specify the amount of hidden dropout per hidden layer using the hidden_dropout_ratios parameter, which is set to 0.5 by default.
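As a minimal sketch of combining these options in R (the penalty strengths, dropout ratios, and layer sizes below are arbitrary illustrative values, not recommendations):

library(h2o)
h2o.init(nthreads = -1)
iris.hex <- as.h2o(iris)

# L1/L2 penalties plus input and hidden dropout. Hidden dropout requires a
# *WithDropout activation, and hidden_dropout_ratios needs one value per hidden layer.
dl <- h2o.deeplearning(x = 1:4, y = "Species",
                       training_frame = iris.hex,
                       activation = "RectifierWithDropout",
                       hidden = c(64, 64),
                       l1 = 1e-5, l2 = 1e-5,
                       input_dropout_ratio = 0.1,
                       hidden_dropout_ratios = c(0.5, 0.5),
                       epochs = 10)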

Advanced Optimization

H2O features manual and automatic advanced optimization modes. The manual mode features include momentum training and learning rate annealing, while the automatic mode features an adaptive learning rate.

Momentum Training

Momentum modifies back-propagation by allowing prior iterations to influence the current update. In particular, a velocity vector, v, is defined to modify the updates as follows:

θ represents the parameters W, B

µ represents the momentum coefficient

α represents the learning rate

v_{t+1} = µ v_t − α ∇L(θ_t)
θ_{t+1} = θ_t + v_{t+1}

Using the momentum parameter can aid in avoiding local minima and any associated instability (Sutskever et al, 2014). Too much momentum can lead to instability, so we recommend incrementing the momentum slowly. The parameters that control momentum are momentum_start, momentum_ramp, and momentum_stable.

When using momentum updates, we recommend using the Nesterov accelerated gradient method, which uses the nesterov_accelerated_gradient parameter. This method modifies the updates as follows:

v_{t+1} = µ v_t − α ∇L(θ_t + µ v_t)
W_{t+1} = W_t + v_{t+1}
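A minimal R sketch of manual momentum training follows; the ramp length and momentum values are arbitrary illustrative settings, and the adaptive learning rate is disabled so that the momentum parameters take effect.

library(h2o)
h2o.init(nthreads = -1)
iris.hex <- as.h2o(iris)

# Manual-mode SGD: ramp the momentum coefficient from 0.5 to 0.99 over the
# first 1e6 training samples, using Nesterov accelerated gradient updates.
dl <- h2o.deeplearning(x = 1:4, y = "Species",
                       training_frame = iris.hex,
                       adaptive_rate = FALSE,
                       momentum_start = 0.5,
                       momentum_ramp = 1e6,
                       momentum_stable = 0.99,
                       nesterov_accelerated_gradient = TRUE,
                       epochs = 10)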

Rate Annealing

During training, the chance of oscillation or "optimum skipping" creates the need for a slower learning rate as the model approaches a minimum. As opposed to specifying a constant learning rate α, learning rate annealing gradually reduces the learning rate α_t to "freeze" into local minima in the optimization landscape (Zeiler, 2012).

For H2O, the annealing rate (rate_annealing) is the inverse of the number of training samples required to divide the learning rate in half (e.g., 10^-6 means that it takes 10^6 training samples to halve the learning rate).
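A short R sketch of rate annealing under the manual mode follows; the rate and annealing values simply restate the 10^-6 example above and are otherwise arbitrary.

library(h2o)
h2o.init(nthreads = -1)
iris.hex <- as.h2o(iris)

# Manual learning rate with annealing: rate_annealing = 1e-6 means the initial
# rate is halved after roughly 10^6 training samples.
dl <- h2o.deeplearning(x = 1:4, y = "Species",
                       training_frame = iris.hex,
                       adaptive_rate = FALSE,
                       rate = 0.01,
                       rate_annealing = 1e-6,
                       epochs = 10)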

Adaptive Learning

The implemented adaptive learning rate algorithm ADADELTA (Zeiler, 2012) automatically combines the benefits of learning rate annealing and momentum training to avoid slow convergence. To simplify hyperparameter search, specify only ρ and ε.

In some cases, a manually controlled (non-adaptive) learning rate and momentum specifications can lead to better results but require a hyperparameter search of up to seven parameters. If the model is built on a topology with many local minima or long plateaus, a constant learning rate may produce sub-optimal results. However, the adaptive learning rate generally produces the best results during our testing, so this option is the default.

The first of two hyperparameters for adaptive learning is ρ (rho). It is similar to momentum and is related to the memory of prior weight updates. Typical values are between 0.9 and 0.999. The second hyperparameter, ε (epsilon), is similar to learning rate annealing during initial training and allows further progress during momentum at later stages. Typical values are between 10^-10 and 10^-4.
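The R sketch below sets the two ADADELTA hyperparameters explicitly; the values shown sit inside the typical ranges quoted above and are otherwise arbitrary.

library(h2o)
h2o.init(nthreads = -1)
iris.hex <- as.h2o(iris)

# Adaptive learning rate (the default mode); only rho and epsilon need tuning.
dl <- h2o.deeplearning(x = 1:4, y = "Species",
                       training_frame = iris.hex,
                       adaptive_rate = TRUE,
                       rho = 0.99,
                       epsilon = 1e-8,
                       epochs = 10)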

Loading Data

Loading a dataset in R or Python for use with H2O is slightly different from the usual methodology. Instead of using data.frame or data.table in R, or pandas.DataFrame in Python, the data must be loaded into an H2OFrame, which lives in the memory of the H2O cluster.
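A minimal R sketch of the two usual routes into an H2OFrame follows; the file path in the commented line is a placeholder, not a real dataset.

library(h2o)
h2o.init(nthreads = -1)

# Convert an in-memory R data.frame into an H2OFrame
iris.hex <- as.h2o(iris)

# Or parse a file or URL directly into the cluster's memory
# train.hex <- h2o.importFile("path/to/your/data.csv")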
