Deep Learning Tutorial
李宏毅 Hung-yi Lee

Deep learning attracts lots of attention.
• I believe you have seen lots of exciting results before.
This talk focuses on the basic techniques.
(Figure: deep learning trends at Google. Source: SIGMOD / Jeff Dean)

Outline
Lecture I: Introduction of Deep Learning
Lecture II: Tips for Training Deep Neural Network
Lecture III: Variants of Neural Network
Lecture IV: Next Wave

Lecture I: Introduction of Deep Learning

Outline of Lecture I
• Introduction of Deep Learning
• Why Deep?
• "Hello World" for Deep Learning

Let's start with general machine learning.
Machine Learning ≈ Looking for a Function
• Speech Recognition: f(what the user said) = "How are you"
• Image Recognition: f(image) = "cat"
• Playing Go: f(board) = "5-5" (next move)
• Dialogue System: f("Hi") = "Hello" (user input → system response)

Framework
Model: a set of functions f1, f2, …
  Image Recognition: f(image) = "cat"
  e.g. f1(cat image) = "cat", f1(dog image) = "dog"
       f2(cat image) = "money", f2(dog image) = "snake"
Training Data (Supervised Learning)
  function input: images; function output: labels such as "monkey", "cat", "dog"
  Goodness of function f: the better function fits the training data better.
Training and Testing
  Step 1: a set of functions (the Model)
  Step 2: goodness of function f
  Step 3: pick the "best" function f*
  Testing (using f*): f*(new image) = "cat"
Three Steps for Deep Learning
  Step 1: define a set of functions (here, a Neural Network)
  Step 2: goodness of function
  Step 3: pick the best function
Deep Learning is so simple ……
Human Brains

Neural Network: Neuron
  z = a1 w1 + ⋯ + ak wk + ⋯ + aK wK + b
  a = σ(z)
  w1 … wK are the weights, b is the bias, and σ is the activation function;
  a neuron is just a simple function from its inputs a1 … aK to one output.
  Sigmoid activation: σ(z) = 1 / (1 + e^(−z))
  Example (from the figure): a neuron with weights 1 and −2 plus a bias; for the
  inputs shown, z = 4 and the output is σ(4) ≈ 0.98.

Neural Network
  Different connections lead to different network structures.
  Each neuron can have different values of weights and biases;
  the weights and biases are the network parameters θ.
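A single neuron is easy to write down directly. Below is a minimal NumPy sketch (not from the slides); the concrete input and bias values are assumptions chosen so that z = 4, matching the σ(4) ≈ 0.98 in the figure.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(a, w, b):
        # z = a1*w1 + ... + aK*wK + b, then apply the activation function
        z = np.dot(w, a) + b
        return sigmoid(z)

    a = np.array([2.0, -1.0])   # example inputs (assumed for illustration)
    w = np.array([1.0, -2.0])   # weights taken from the slide's figure
    b = 0.0                     # bias (assumed so that z = 4)
    print(neuron(a, w, b))      # ~0.982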
Fully Connect Feedforward Network
  Example with sigmoid activations σ(z) = 1 / (1 + e^(−z)):
  input (1, −1) → first hidden layer (0.98, 0.12)
    (weights 1, −2 with bias 1 give σ(4) ≈ 0.98; weights −1, 1 with bias 0 give σ(−2) ≈ 0.12)
  → second hidden layer (0.86, 0.11) → output (0.62, 0.83)
  The same network maps other inputs to other outputs:
    f(1, −1) = (0.62, 0.83)      f(0, 0) = (0.51, 0.85)
  This is a function: input vector in, output vector out.
  Given parameters θ, the network defines a function;
  given only the network structure, it defines a function set.
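As a sanity check, here is a small NumPy sketch (not from the slides) of the first hidden layer above; it reproduces the 0.98 and 0.12 values. The remaining layers follow the same rule, but their weights are not fully recoverable from the slide, so they are omitted.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # First hidden layer: neuron 1 has weights (1, -2) and bias 1,
    # neuron 2 has weights (-1, 1) and bias 0.
    W1 = np.array([[1.0, -2.0],
                   [-1.0, 1.0]])
    b1 = np.array([1.0, 0.0])

    x = np.array([1.0, -1.0])
    h1 = sigmoid(W1 @ x + b1)
    print(h1)            # ~[0.98, 0.12]

    # Every further layer repeats the same rule: h_next = sigmoid(W h + b).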
Fully Connect Feedforward Network
  Input layer: x1, x2, …, xN
  Hidden layers: Layer 1, Layer 2, …, Layer L (each a layer of neurons)
  Output layer: y1, y2, …, yM
  "Deep" means many hidden layers.
Output Layer (Option)
• Softmax layer as the output layer

Ordinary layer: y_i = σ(z_i)
  In general the output of the network can be any value,
  which may not be easy to interpret.

Softmax layer: y_i = e^(z_i) / Σ_{j=1}^{3} e^(z_j)
  Example: z1 = 3, z2 = 1, z3 = −3
    e^(z1) ≈ 20, e^(z2) ≈ 2.7, e^(z3) ≈ 0.05
    y1 ≈ 0.88, y2 ≈ 0.12, y3 ≈ 0
  The outputs behave like a probability: 1 > y_i > 0 and Σ_i y_i = 1.
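A minimal sketch of the softmax computation (not from the slides), reproducing the z = (3, 1, −3) example:

    import numpy as np

    def softmax(z):
        e = np.exp(z)        # exponentiate each score
        return e / e.sum()   # normalize so the outputs sum to 1

    z = np.array([3.0, 1.0, -3.0])
    print(np.exp(z))         # ~[20, 2.7, 0.05]
    print(softmax(z))        # ~[0.88, 0.12, 0.00]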
Example Application
  Input: a 16 × 16 image = 256 pixels x1, x2, …, x256 (ink → 1, no ink → 0)
  Output: y1, y2, …, y10; each dimension represents the confidence of a digit,
    e.g. "is 1": 0.1, "is 2": 0.7, "is 0": 0.2, … → the image is "2".
Example Application
• Handwriting Digit Recognition
  What is needed is a function with a 256-dim vector as input (x1 … x256) and a
  10-dim vector as output (y1 … y10, "is 1" … "is 0", e.g. "2"): a Neural Network.
Example Application
  Input layer (x1 … xN) → hidden layers (Layer 1 … Layer L) → output layer (y1 … y10).
  The network structure defines a function set containing the candidates for
  handwriting digit recognition. You need to decide the network structure so that
  a good function is in your function set.

FAQ
• Q: How many layers? How many neurons for each layer?
  → trial and error, plus intuition
• Q: Can the structure be automatically determined?
Three Steps for Deep Learning: next is Step 2, goodness of function.
Deep Learning is so simple ……
Training Data
• Preparing training data: images and their labels ("1", "3", "4", "1", "0", "2", "5", "9", …)
  The learning target is defined on the training data.
Learning Target
  A 16 × 16 = 256-pixel image (ink → 1, no ink → 0) is fed to the network,
  whose softmax output is y1 … y10 ("is 1" … "is 0").
  The learning target is:
    input "1" → y1 has the maximum value
    input "2" → y2 has the maximum value
Loss
  Given a set of parameters, the loss l can be the distance between the network
  output (y1 … y10) and the target; e.g. for "1" the target is (1, 0, …, 0), and the
  output should be as close to it as possible.
  A good function should make the loss of all examples as small as possible.
Total Loss
  For all R training examples x1 … xR with targets ŷ1 … ŷR and losses l1, l2, l3, …, lR:
    L = Σ_{r=1}^{R} l_r
  Make it as small as possible: find the network parameters θ* that minimize the
  total loss L, i.e. find a function in the function set that minimizes L.
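In code the total loss is just a sum over examples. A minimal sketch (not from the slides), using the squared distance as one possible per-example loss:

    import numpy as np

    def example_loss(y, y_hat):
        # one possible "distance" between the network output y and the target y_hat
        return np.sum((y - y_hat) ** 2)

    def total_loss(outputs, targets):
        # L = l_1 + l_2 + ... + l_R over all R training examples
        return sum(example_loss(y, y_hat) for y, y_hat in zip(outputs, targets))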
Three Steps for Deep Learning: next is Step 3, pick the best function.
Deep Learning is so simple ……
How to pick the best function
  Find the network parameters θ* that minimize the total loss L.
  Enumerate all possible values? The network parameters
  θ = {w1, w2, w3, …, b1, b2, b3, …} number in the millions.
  E.g. for speech recognition, a network with 8 layers and 1000 neurons per layer
  has 1000 × 1000 = 10^6 weights between each pair of adjacent layers.
Gradient Descent
  Goal: find the network parameters θ* = {w1, w2, …, b1, b2, …} that minimize the
  total loss L. Consider a single parameter w:
  • Pick an initial value for w (randomly, or with RBM pre-training;
    random initialization is usually good enough).
  • Compute ∂L/∂w at the current w:
    negative slope → increase w; positive slope → decrease w.
  • Update w ← w − η ∂L/∂w, i.e. move by −η ∂L/∂w.
    η is called the "learning rate".
  • Repeat.
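The whole procedure is a few lines of code. A minimal sketch (not from the slides), with a made-up one-parameter loss so it can actually run:

    def gradient_descent(w, grad_fn, eta=0.1, steps=100):
        # repeat w <- w - eta * dL/dw; grad_fn returns dL/dw at the current w
        for _ in range(steps):
            w = w - eta * grad_fn(w)
        return w

    # toy loss L(w) = (w - 3)^2 with dL/dw = 2 (w - 3); the minimum is at w = 3
    print(gradient_descent(w=0.0, grad_fn=lambda w: 2 * (w - 3)))   # ~3.0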
  • Repeat until ∂L/∂w is approximately zero, i.e. the updates become very small.
Gradient Descent with many parameters
  For θ = (w1, w2, …, b1, …), compute every partial derivative and collect them into
  the gradient
    ∇L = (∂L/∂w1, ∂L/∂w2, …, ∂L/∂b1, …)ᵀ
  then update every parameter at once:
    w1 ← w1 − η ∂L/∂w1,  w2 ← w2 − η ∂L/∂w2,  b1 ← b1 − η ∂L/∂b1, …
  Example from the slide: (w1, w2, b1) starts at (0.2, −0.1, 0.3), becomes
  (0.15, 0.05, 0.2) after one update and (0.09, 0.15, 0.10) after the next.
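Vectorized, one update step is a single line. In the sketch below (not from the slides) the gradient values are assumptions chosen so that, with η = 0.1, the step reproduces the slide's move from (0.2, −0.1, 0.3) to (0.15, 0.05, 0.2):

    import numpy as np

    def gd_step(theta, grad_L, eta):
        # theta = [w1, w2, ..., b1, ...]; grad_L is the gradient vector at theta
        return theta - eta * grad_L

    theta = np.array([0.2, -0.1, 0.3])     # initial (w1, w2, b1) from the slide
    grad  = np.array([0.5, -1.5, 1.0])     # assumed gradient values, for illustration
    print(gd_step(theta, grad, eta=0.1))   # [0.15, 0.05, 0.2]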
Gradient Descent (two-parameter picture)
  Color: value of the total loss L over (w1, w2).
  Randomly pick a starting point, compute (∂L/∂w1, ∂L/∂w2), and move by
  (−η ∂L/∂w1, −η ∂L/∂w2). Hopefully, we eventually reach a minimum.
Gradient Descent: Difficulty
• Gradient descent never guarantees reaching the global minimum.
  Different initial points can reach different minima and therefore give different results.
  There are some tips to help you avoid poor local minima, but no guarantee.
  (Analogy: you are playing Age of Empires and you cannot see the whole map.)
Gradient Descent
  This is the "learning" of machines in deep learning ……
  Even AlphaGo uses this approach.
  I hope you are not too disappointed :p
  What people imagine …… and what it actually is ……
Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w
• Ref: .tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
  Don't worry about ∂L/∂w: the toolkits will handle it.
  (Developed by NTU student 周伯威.)
Concluding Remarks
  Step 1: define a set of functions
  Step 2: goodness of function
  Step 3: pick the best function
Deep Learning is so simple ……
Outline of Lecture I
• Introduction of Deep Learning
• Why Deep?
• "Hello World" for Deep Learning

Deeper is Better?
  Layer × Size   Word Error Rate (%)      Layer × Size   Word Error Rate (%)
  1 × 2k         24.2
  2 × 2k         20.4
  3 × 2k         18.4
  4 × 2k         17.8
  5 × 2k         17.2                     1 × 3772       22.5
  7 × 2k         17.1                     1 × 4634       22.6
                                          1 × 16k        22.1
  (On this task, deeper is better.)
  Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using
  Context-Dependent Deep Neural Networks." Interspeech. 2011.
Universality Theorem
  Any continuous function f : R^N → R^M can be realized by a network with one
  hidden layer (given enough hidden neurons).
  Reference for the reason: http://neuralnetworksandde/chap4.html
  So why a "deep" neural network (thin, many layers) rather than a "fat" one
  (shallow, one wide layer)?
Fat + Short v.s. Thin + Tall
  Compare networks with the same number of parameters:
  Layer × Size   Word Error Rate (%)      Layer × Size   Word Error Rate (%)
  1 × 2k         24.2
  2 × 2k         20.4
  3 × 2k         18.4
  4 × 2k         17.8
  5 × 2k         17.2                     1 × 3772       22.5
  7 × 2k         17.1                     1 × 4634       22.6
                                          1 × 16k        22.1
  Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using
  Context-Dependent Deep Neural Networks." Interspeech. 2011.
Analogy (this page is for those with an EE background)
• Logic circuits consist of gates. A two-layer circuit of logic gates can represent any
  Boolean function, but using multiple layers of logic gates to build some functions
  is much simpler: fewer gates are needed.
• A neural network consists of neurons. A network with one hidden layer can represent
  any continuous function, but using multiple layers of neurons to represent some
  functions is much simpler: fewer parameters, and perhaps less data.
Modularization
• Deep → Modularization
  Task: classify images of girls with long hair, boys with long hair, girls with short
  hair, and boys with short hair (Classifiers 1-4). "Boys with long hair" has only a
  few training examples, so that classifier is weak.
  Instead, first train basic classifiers for the attributes:
    basic classifier 1: boy or girl?
    basic classifier 2: long or short hair?
  Each basic classifier can have sufficient training examples.
Modularization
• Deep → Modularization
  The four classifiers above can then be built from the basic classifiers, which are
  shared as modules; because the hard work is done by the modules, each of the four
  classifiers can be trained with little data.
Modularization
• Deep → Modularization → less training data?
  The neurons of the first layer (on x1, x2, …, xN) are the most basic classifiers.
  The second layer uses the first layer as modules to build classifiers, the third
  layer uses the second layer as modules, and so on ……
  The modularization is learned automatically from data.
Modularization
• Deep → Modularization (x1, x2, …, xN; use the 1st layer as modules to build
  classifiers, use the 2nd layer as modules, ……)
  Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding
  convolutional networks. In Computer Vision – ECCV 2014 (pp. 818-833).
Outline of Lecture I
• Introduction of Deep Learning
• Why Deep?
• "Hello World" for Deep Learning

Keras
  Keras is an interface of TensorFlow or Theano.
  TensorFlow or Theano: very flexible, but they need some effort to learn.
  Keras: easy to learn and use, and it still has some flexibility;
  you can modify it if you can write TensorFlow or Theano.
  If you want to learn Theano:
    .tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
    .tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
Keras
• François Chollet is the author of Keras.
• He currently works for Google as a deep learning engineer and researcher.
• Keras means horn in Greek.
• Documentation: http://keras.io/
• Example: /fchollet/keras/tree/master/examples
(Notes on using Keras; thanks to 沈昇勳 for providing the figures.)
Example Application
• Handwriting Digit Recognition: the machine reads a 28 × 28 image and outputs "1".
  MNIST data: /exdb/mnist/ (the "hello world" of deep learning)
  Keras provides a data set loading function: http://keras.io/datasets/
Keras
  Network: 28 × 28 input → 500 neurons → 500 neurons → softmax → y1 … y10
  Step 3.1: configuration, i.e. gradient descent w ← w − η ∂L/∂w with learning rate 0.1
  Step 3.2: find the optimal network parameters from the training data (images)
  and their labels (digits). (More on the training options in the next lecture.)
  See also /versions/r0.8/tutorials/mnist/beginners/index.html
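A minimal sketch of this "hello world" network, written against the current Keras API rather than the older Keras 1 syntax used at the time of the slides; treat the exact calls as an assumption, not the author's code.

    from keras import Input
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import SGD

    model = Sequential()
    model.add(Input(shape=(28 * 28,)))             # 784-dim input vector
    model.add(Dense(500, activation='sigmoid'))    # hidden layer 1
    model.add(Dense(500, activation='sigmoid'))    # hidden layer 2
    model.add(Dense(10, activation='softmax'))     # output layer

    # Step 3.1: configuration (cross-entropy loss, gradient descent with eta = 0.1)
    model.compile(loss='categorical_crossentropy',
                  optimizer=SGD(learning_rate=0.1),
                  metrics=['accuracy'])

    # Step 3.2: find the optimal network parameters from images and labels
    # model.fit(x_train, y_train, batch_size=100, epochs=20)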
  Data format: the images are a numpy array of shape
  (number of training examples) × 784 (each image is 28 × 28 = 784 values), and the
  labels are a numpy array of shape (number of training examples) × 10.
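A sketch of preparing arrays with those shapes from the Keras MNIST loader (the scaling to a 0-1 "ink" range and the exact utility names are assumptions based on the current Keras API):

    from keras.datasets import mnist
    from keras.utils import to_categorical

    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    x_train = x_train.reshape(-1, 28 * 28).astype('float32') / 255   # (N, 784), ink -> ~1
    x_test  = x_test.reshape(-1, 28 * 28).astype('float32') / 255
    y_train = to_categorical(y_train, 10)                            # (N, 10) one-hot labels
    y_test  = to_categorical(y_test, 10)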
Keras
  How to use the trained neural network (testing):
    case 1: you have the correct labels, so evaluate the accuracy
    case 2: you only have inputs, so predict the outputs
  Save and load models:
  http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
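Continuing the sketches above, the two testing cases and save/load look roughly like this (standard Keras calls, shown as an illustration rather than the slide's exact code):

    from keras.models import load_model

    model.save('mnist_model.h5')           # save architecture + weights
    model = load_model('mnist_model.h5')   # load them back later

    # case 1: correct labels available -> report loss and accuracy
    score = model.evaluate(x_test, y_test)
    print('loss, accuracy =', score)

    # case 2: only inputs available -> predict the 10-dim output for each image
    probabilities = model.predict(x_test)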
Keras
• Using a GPU to speed up training
  • Way 1: THEANO_FLAGS=device=gpu0 python YourCode.py
  • Way 2 (in your code):
      import os
      os.environ["THEANO_FLAGS"] = "device=gpu0"

Live Demo
Lecture II: Tips for Training DNN

Recipe of Deep Learning
  Step 1: define a set of functions; Step 2: goodness of function; Step 3: pick the
  best function → a trained Neural Network.
  Good results on training data? NO → go back and improve training. YES ↓
  Good results on testing data?  NO → overfitting! YES → done.
Recipe of Deep Learning
  Do not always blame overfitting.
  (Figure: results on training data and on testing data; poor performance can come
  from a network that is not well trained rather than from overfitting.)
  Different approaches are needed for different problems,
  e.g. dropout is for good results on testing data, not training data.
Recipe of Deep Learning
  Good results on training data? If not, try:
  • Choosing proper loss
  • Mini-batch
  • New activation function
  • Adaptive learning rate
  • Momentum
Choosing Proper Loss
  For target "1" (ŷ1 = 1, ŷ2 = 0, …, ŷ10 = 0) and softmax outputs y1 … y10:
    Square error:   Σ_{i=1}^{10} (y_i − ŷ_i)²
    Cross entropy:  −Σ_{i=1}^{10} ŷ_i ln y_i
  Which one is better? Let's try it (accuracy): square error 0.11, cross entropy 0.84.
  (Figure: total loss over w1 and w2 during training for cross entropy vs. square error.)
  When using a softmax output layer, choose cross entropy.
  Reference: /proceedings/papers/v9/glorot10a/glorot10a.pdf
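For a single example the two losses are easy to compare directly. A minimal sketch (the softmax output values below are made up for illustration):

    import numpy as np

    y_hat = np.array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0.])   # target for "1"
    y = np.array([0.7, 0.1, 0.05, 0.02, 0.02,                    # assumed softmax output
                  0.02, 0.02, 0.02, 0.02, 0.05])

    square_error  = np.sum((y - y_hat) ** 2)
    cross_entropy = -np.sum(y_hat * np.log(y))   # only the target class contributes
    print(square_error, cross_entropy)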
Recipe of Deep Learning: the next tip for good results on training data is mini-batch.
Mini-batch
  • Randomly initialize the network parameters.
  • Pick the 1st batch (e.g. examples x1, x31, …): L′ = l1 + l31 + ⋯;
    update the parameters once.
  • Pick the 2nd batch (e.g. x2, x16, …): L″ = l2 + l16 + ⋯;
    update the parameters once.
  • Continue until all mini-batches have been picked: that is one epoch.
  • Repeat the above process.
  We do not really minimize the total loss!
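A minimal sketch of one epoch of mini-batch training (not from the slides); grad_fn is a hypothetical helper that returns the gradient of the batch loss at the current parameters.

    import numpy as np

    def train_one_epoch(x, y, theta, grad_fn, eta=0.1, batch_size=100):
        # shuffle the training examples, then update the parameters once per mini-batch
        idx = np.random.permutation(len(x))
        for start in range(0, len(x), batch_size):
            batch = idx[start:start + batch_size]
            theta = theta - eta * grad_fn(theta, x[batch], y[batch])
        return theta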
Mini-batch
  With 100 examples in a mini-batch, each epoch picks the 1st batch (L′ = l1 + l31 + ⋯,
  update once), then the 2nd batch (L″ = l2 + l16 + ⋯, update once), and so on until all
  mini-batches have been picked; the whole process is repeated 20 times.
Mini-batch
  The objective L is different each time we update the parameters: each batch defines
  its own L′, L″, … We do not really minimize the total loss!
Mini-batch
  Original gradient descent: see all examples, then update once; the path follows the
  total loss (shown by the colors in the figure).
  With mini-batch: each update sees only one batch, so the path is unstable, but with
  20 batches you update 20 times in one epoch.
Mini-batch is Faster
  Original gradient descent updates only after seeing all examples. With mini-batch,
  one pass over the data costs about the same (for a not-super-large data set, and not
  always with parallel computing) but yields 20 updates per epoch instead of one.
  Mini-batch also has better performance!
  Accuracy: mini-batch 0.84, no batch 0.12.
Mini-batch
  Shuffle the training examples for each epoch, so the batches group different
  examples in epoch 1 and epoch 2.
  Don't worry: this is the default behaviour of Keras.
Recipe of Deep Learning: the next tip is a new activation function.
get
the
power
of
Deep
…Deeper
usually
does
not
imply
better.Results
on
Training
DataAccuracy3layers0.849layers0.11Let’s
try
itTesting:
Training3
layers
9
layers…………………………Vanishing
Vanishing Gradient Problem
  In a deep sigmoid network (x1 … xN → … → y1 … yM), the earlier layers get smaller
  gradients, learn very slowly, and remain almost random, while the later layers get
  larger gradients, learn very fast, and quickly converge, based on nearly random
  features from below!?
  Intuitive way to compute the derivatives: perturb a weight by Δw and see how the
  loss changes, ∂l/∂w ≈ Δl/Δw.
  Each sigmoid turns a large change at its input into a small change at its output,
  so a Δw in an early layer barely affects the loss.
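The Δl/Δw intuition can be checked numerically. The sketch below (a toy composition of sigmoids, not an actual network) estimates the derivative by finite differences and shows how it shrinks as more sigmoids are stacked:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def numerical_derivative(f, w, delta=1e-4):
        # dl/dw ~ (l(w + delta) - l(w)) / delta
        return (f(w + delta) - f(w)) / delta

    one_layer    = lambda w: sigmoid(w)
    three_layers = lambda w: sigmoid(sigmoid(sigmoid(w)))
    print(numerical_derivative(one_layer, 1.0))     # ~0.20
    print(numerical_derivative(three_layers, 1.0))  # ~0.01, much smaller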
Hard to get the power of Deep …
  In 2006, people used RBM pre-training. In 2015, people use ReLU.

ReLU
• Rectified Linear Unit (ReLU): a = z for z > 0, a = 0 otherwise.
  Reasons: 1. fast to compute; 2. biological reason; 3. it behaves like an infinite
  number of sigmoids with different biases; 4. it addresses the vanishing gradient
  problem. [Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]
• Neurons whose output is zero can be removed from the computation, leaving a thinner
  linear network whose gradients do not shrink from layer to layer.
  Let's try it (9 layers, accuracy): sigmoid 0.11, ReLU 0.96.
ReLU: variants
  Leaky ReLU:       a = z for z > 0,  a = 0.01 z otherwise
  Parametric ReLU:  a = z for z > 0,  a = α z otherwise, where α is also learned by
  gradient descent.
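A minimal sketch of the three activation functions just described (not from the slides):

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)              # a = z for z > 0, a = 0 otherwise

    def leaky_relu(z):
        return np.where(z > 0, z, 0.01 * z)    # small slope 0.01 for z <= 0

    def parametric_relu(z, alpha):
        return np.where(z > 0, z, alpha * z)   # alpha is learned by gradient descent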
Maxout
• Learnable activation function [Ian J. Goodfellow, ICML'13]
  The linear outputs are grouped and each maxout neuron outputs the max of its group;
  in the slide's example the values 5, 7, −1, 1 give 7 and 1, and the values 1, 2, 4, 3
  give 2 and 4.
  ReLU is a special case of Maxout, and you can have more than 2 elements in a group.
• The activation function of a maxout network can be any piecewise linear convex
  function; how many pieces it has depends on how many elements are in a group
  (2 elements in a group, 3 elements in a group, …).
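A minimal sketch of a maxout layer (not from the slides): compute all linear pieces, then take the maximum inside each group.

    import numpy as np

    def maxout(x, W, b, group_size=2):
        z = W @ x + b                   # one linear output per element
        z = z.reshape(-1, group_size)   # group the elements
        return z.max(axis=1)            # each maxout neuron outputs its group's max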
Recipe of Deep Learning: the next tip is an adaptive learning rate.
Learning Rates
  Set the learning rate η carefully (figure: the loss over w1 and w2 for different η).
  If the learning rate is too large, the total loss may not decrease after each update.
  If the learning rate is too small, training is too slow.
Learning Rates
• Popular & simple idea: reduce the learning rate by some factor every few epochs.
  • At the beginning, we are far from the destination, so we use a larger learning rate.
  • After several epochs, we are close to the destination, so we reduce the learning rate.
  • E.g. 1/t decay: η_t = η / √(t + 1).
• The learning rate cannot be one-size-fits-all: give different parameters different
  learning rates.

Adagrad
  Give each parameter w its own learning rate
    η_w = η / √( Σ_{i=0}^{t} (g_i)² )
  where g_i is the ∂L/∂w obtained at the i-th update; the denominator is the summation
  of the squares of the previous derivatives.
Adagrad
  Original gradient descent:  w ← w − η ∂L/∂w
  Adagrad:                    w ← w − η_w ∂L/∂w   (parameter-dependent learning rate)
  Example: a parameter with past derivatives 0.1 and 0.2 gets learning rate
  η / √(0.1² + 0.2²), while one with past derivatives 20.0 and 10.0 gets η / √(20² + 10²).
  Observation:
  1. The learning rate becomes smaller and smaller for all parameters.
  2. Parameters with smaller derivatives get larger learning rates, and vice versa:
     larger derivatives → smaller learning rate; smaller derivatives → larger learning rate.
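A minimal sketch of the Adagrad update (not from the slides; the small eps term is a common numerical safeguard that the slide does not mention):

    import numpy as np

    def adagrad_update(w, g, sum_g2, eta=0.1, eps=1e-8):
        # w <- w - eta / sqrt(sum of squared past derivatives) * g
        sum_g2 = sum_g2 + g ** 2
        w = w - eta / (np.sqrt(sum_g2) + eps) * g
        return w, sum_g2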
Not the whole story ……
• Adagrad [John Duchi, JMLR'11]
• RMSprop: /watch?v=O3sxAc4hxZU
• Adadelta [Matthew D. Zeiler, arXiv'12]
• "No more pesky learning rates" [Tom Schaul, arXiv'12]
• AdaSecant [Caglar Gulcehre, arXiv'14]
• Adam [Diederik P. Kingma, ICLR'15]
• Nadam: /proj2015/054_report.pdf
Recipe of Deep Learning: the next tip is momentum.
Hard to find the optimal network parameters
  (Figure: total loss vs. the value of a network parameter w.)
  • Very slow at a plateau (∂L/∂w ≈ 0)
  • Stuck at a saddle point (∂L/∂w = 0)
  • Stuck at a local minimum (∂L/∂w = 0)
In the physical world ……
• Momentum: a ball rolling over the cost surface does not stop the moment the slope
  vanishes. How about putting this phenomenon into gradient descent?

Momentum
  Movement = negative of ∂L/∂w + momentum (the previous movement), so the real
  movement can be non-zero even where ∂L/∂w = 0.
  Still no guarantee of reaching the global minimum, but it gives some hope ……
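A minimal sketch of the momentum update (not from the slides; the decay factor 0.9 on the previous movement is an assumed, commonly used value):

    def momentum_update(w, grad, velocity, eta=0.1, lam=0.9):
        # movement = lam * previous movement - eta * dL/dw
        velocity = lam * velocity - eta * grad
        return w + velocity, velocity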
Adam = RMSProp (advanced Adagrad) + Momentum
  Let's try it (ReLU, 3 layers, accuracy): original 0.96, Adam 0.97.
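In the Keras sketch shown earlier, switching to Adam is a one-line change to the compile step (illustrative, not the slide's code):

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])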
Recipe of Deep Learning
  Good results on testing data? If not (while training results are good), try:
  • Early stopping
  • Regularization
  • Dropout
  • Network structure

Why Overfitting?
• Training data and testing data can be different.
  The learning target is defined by the training data, so the parameters achieving
  that target do not necessarily give good results on the testing data.