
Deep Learning Tutorial
李宏毅 Hung-yi Lee

Deep learning attracts lots of attention.
• I believe you have seen lots of exciting results before.
This talk focuses on the basic techniques.
(Deep learning trends at Google. Source: SIGMOD / Jeff Dean)

Outline
Lecture I: Introduction of Deep Learning
Lecture II: Tips for Training Deep Neural Network
Lecture III: Variants of Neural Network
Lecture IV: Next Wave

Lecture I: Introduction of Deep Learning

Outline of Lecture I
• Introduction of Deep Learning
• Why Deep?
• “Hello World” for Deep Learning

Let’s start with general machine learning.

Machine Learning ≈ Looking for a Function
• Speech Recognition: f(speech) = “How are you”
• Image Recognition: f(image) = “Cat”
• Playing Go: f(board) = “5-5” (next move)
• Dialogue System: f(“Hi”) = “Hello” (what the user said → system response)

Framework
Image Recognition: f(image) = “cat”
Model: a set of functions f1, f2, …
  f1(image) = “cat”,  f1(image) = “dog”
  f2(image) = “money”, f2(image) = “snake”

Framework (Training)
Goodness of a function f is measured on the Training Data
(function input: an image; function output: a label such as “monkey”, “cat”, “dog”).
A function that fits the labeled training data better is “Better!” → Supervised Learning.

Framework (Training and Testing)
Step 1: a set of functions (Model)
Step 2: goodness of function f
Step 3: pick the “Best” function f*
Training uses steps 1-3 on the training data; Testing uses f*, e.g. f*(image) = “cat”.

Three Steps for Deep Learning
Step 1: define a set of functions → Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……

Human Brains (the inspiration)

Neural Network: Neuron
A neuron is a simple function. Given inputs a_1, …, a_k, …, a_K with weights w_1, …, w_k, …, w_K and bias b:
    z = a_1 w_1 + ⋯ + a_k w_k + ⋯ + a_K w_K + b,    a = σ(z)
where σ is the activation function, e.g. the Sigmoid Function σ(z) = 1 / (1 + e^(−z)).
Example: inputs (1, −1), weights (1, −2), bias 1:
    z = 1·1 + (−1)·(−2) + 1 = 4,    σ(4) ≈ 0.98
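A minimal sketch of this neuron in Python (NumPy assumed available; the weights and bias are the toy numbers from the example above):

    import numpy as np

    def sigmoid(z):
        # sigmoid activation: squashes any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(a, w, b):
        # weighted sum of the inputs plus the bias, then the activation
        z = np.dot(a, w) + b
        return sigmoid(z)

    # toy numbers from the slide: inputs (1, -1), weights (1, -2), bias 1
    print(neuron(np.array([1.0, -1.0]), np.array([1.0, -2.0]), 1.0))  # ≈ 0.98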

Neural Network
Different connections lead to different network structures.
Each neuron can have different values of weights and biases.
The weights and biases are the network parameters θ.

Fully Connect Feedforward Network
With the Sigmoid Function σ(z) = 1/(1 + e^(−z)) as activation:
given input (1, −1), the first-layer neurons (weights 1, −2 and −1, 1; biases 1, 0) output
    σ(4) = 0.98 and σ(−2) = 0.12.
These feed the following layers, giving (0.86, 0.11) and then (0.62, 0.83), so
    f([1, −1]) = [0.62, 0.83].
The same network on input (0, 0) gives (0.73, 0.5), (0.72, 0.12) and
    f([0, 0]) = [0.51, 0.85].
This is a function: input vector in, output vector out.
Given parameters θ, the network defines a function; given only the network structure, it defines a function set.
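A sketch of one fully connected layer as a matrix-vector product (NumPy assumed); the numbers reproduce the first layer of the example above, and deeper layers would be stacked the same way:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer(x, W, b):
        # one fully connected layer: z = W x + b, a = sigmoid(z)
        return sigmoid(W @ x + b)

    x = np.array([1.0, -1.0])                   # network input
    W1 = np.array([[1.0, -2.0], [-1.0, 1.0]])   # first-layer weights (one row per neuron)
    b1 = np.array([1.0, 0.0])                   # first-layer biases
    print(layer(x, W1, b1))                     # ≈ [0.98, 0.12]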

Fully Connect Feedforward Network
Input Layer: x_1, x_2, …, x_N
Hidden Layers: Layer 1, Layer 2, …, Layer L (each node is a neuron)
Output Layer: y_1, y_2, …, y_M
“Deep” means many hidden layers.

Output Layer (Option)
• Ordinary layer as the output layer: y_i = σ(z_i).
  In general the output of the network can be any value, which may not be easy to interpret.
• Softmax layer as the output layer:
      y_i = e^(z_i) / Σ_{j=1}^{3} e^(z_j)
  Example: z_1 = 3, z_2 = 1, z_3 = −3 give e^(z_1) ≈ 20, e^(z_2) ≈ 2.7, e^(z_3) ≈ 0.05,
  so y_1 ≈ 0.88, y_2 ≈ 0.12, y_3 ≈ 0.
  The outputs behave like probabilities: 1 > y_i > 0 and Σ_i y_i = 1.
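A minimal softmax sketch in NumPy, checked against the example values above (subtracting the maximum first is a numerical-stability detail not shown on the slide):

    import numpy as np

    def softmax(z):
        # exponentiate and normalize so the outputs sum to 1
        e = np.exp(z - np.max(z))   # subtracting max(z) avoids overflow
        return e / e.sum()

    print(softmax(np.array([3.0, 1.0, -3.0])))  # ≈ [0.88, 0.12, 0.00]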

Example Application
Input: a 16 × 16 image = 256 pixels x_1, x_2, …, x_256 (ink → 1, no ink → 0).
Output: y_1, y_2, …, y_10, where each dimension represents the confidence of a digit
(y_1: is “1”, y_2: is “2”, …, y_10: is “0”).
E.g. (y_1, y_2, …) = (0.1, 0.7, 0.2, …) means the image is “2”.

Example Application
• Handwriting Digit Recognition
What is needed is a function:
  input: 256-dim vector → Neural Network → output: 10-dim vector → “2”

Example Application
Input Layer (x_1, …, x_N) → Layer 1 → Layer 2 → … → Layer L → Output Layer (y_1 “is 1”, …, y_10 “is 0”)
The network structure gives a function set containing the candidates for Handwriting Digit Recognition.
You need to decide the network structure so that a good function is in your function set.

FAQ
• Q: How many layers? How many neurons for each layer?
  A: Trial and Error + Intuition
• Q: Can the structure be automatically determined?

Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……

Training Data
• Preparing training data: images and their labels, e.g. “1”, “3”, “4”, “1”, “0”, “2”, “5”, “9”.
The learning target is defined on the training data.

Learning Target
Input: 16 × 16 = 256 pixels x_1, …, x_256 (ink → 1, no ink → 0); output layer: Softmax over y_1, …, y_10.
The learning target is:
  input “1” → y_1 (“is 1”) has the maximum value,
  input “2” → y_2 (“is 2”) has the maximum value.

Loss
Given a set of parameters, the loss 𝑙 can be the distance between the network output
(y_1, y_2, …, y_10) and the target. For an image of “1” the target is (1, 0, …, 0), and the
output should be as close to it as possible.
A good function should make the loss of all examples as small as possible.

Total Loss
For all training data x^1, x^2, x^3, …, x^R with per-example losses 𝑙^1, 𝑙^2, 𝑙^3, …, 𝑙^R, the total loss is
    L = Σ_{r=1}^{R} 𝑙^r
and it should be as small as possible.
Find the network parameters θ* that minimize the total loss L,
i.e. find the function in the function set that minimizes L.
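As a concrete sketch (squared error is chosen here as the distance; the lecture compares it with cross entropy later), the total loss is just the sum of the per-example losses:

    import numpy as np

    def example_loss(y, y_hat):
        # squared distance between network output y and target y_hat
        return np.sum((y - y_hat) ** 2)

    def total_loss(outputs, targets):
        # L = sum over all R training examples of the per-example loss
        return sum(example_loss(y, y_hat) for y, y_hat in zip(outputs, targets))

    outputs = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]   # toy network outputs
    targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # toy targets
    print(total_loss(outputs, targets))  # 0.02 + 0.08 = 0.1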

Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……

How to pick the best function
Find the network parameters θ* that minimize the total loss L.
Enumerate all possible values? The network parameters
θ = {w_1, w_2, w_3, ⋯, b_1, b_2, b_3, ⋯} contain millions of parameters.
E.g. a speech recognition network with 8 layers and 1000 neurons in each layer already has
10^6 weights between layer l (1000 neurons) and layer l+1 (1000 neurons), so enumeration is hopeless.

Gradient Descent
Find the network parameters θ* that minimize the total loss L. Consider a single parameter w
(the same procedure applies to every element of θ = {w_1, w_2, ⋯, b_1, b_2, ⋯}).
• Pick an initial value for w (random initialization, or RBM pre-training; random is usually good enough).
• Compute ∂L/∂w: if it is negative, increase w; if it is positive, decrease w.
• Update  w ← w − η ∂L/∂w,  where η is called the “learning rate”.
• Repeat until ∂L/∂w is approximately zero (i.e. the update becomes negligible).
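A minimal gradient-descent sketch on a toy one-parameter loss (the quadratic L(w) = (w − 3)² is an invented stand-in for the real total loss, just to show the update rule):

    # toy loss L(w) = (w - 3)^2 and its derivative dL/dw = 2 (w - 3)
    def dL_dw(w):
        return 2.0 * (w - 3.0)

    w = 0.0        # pick an initial value for w
    eta = 0.1      # learning rate
    while abs(dL_dw(w)) > 1e-6:        # repeat until the gradient is approximately zero
        w = w - eta * dL_dw(w)         # w <- w - eta * dL/dw
    print(w)       # ≈ 3.0, the minimizer of the toy loss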

Gradient Descent (all parameters)
Every parameter is updated in the same way. For example, starting from
θ = (w_1, w_2, b_1, …) = (0.2, −0.1, 0.3, …):
    w_1 ← w_1 − η ∂L/∂w_1 = 0.15,   w_2 ← w_2 − η ∂L/∂w_2 = 0.05,   b_1 ← b_1 − η ∂L/∂b_1 = 0.2, …
and again in the next iteration: 0.15 → 0.09, 0.05 → 0.15, 0.2 → 0.10, ….
The vector of all partial derivatives
    ∇L = (∂L/∂w_1, ∂L/∂w_2, ⋯, ∂L/∂b_1, ⋯)
is called the gradient.

Gradient Descent (two-parameter picture)
Color: value of the total loss L over (w_1, w_2).
Randomly pick a starting point, compute (∂L/∂w_1, ∂L/∂w_2), and move by
(−η ∂L/∂w_1, −η ∂L/∂w_2). Hopefully, we would reach a minimum …

Gradient Descent - Difficulty
• Gradient descent never guarantees the global minimum.
Different initial points reach different minima, so they give different results.
There are some tips to help you avoid local minima, but no guarantee.

It is like playing Age of Empires: you cannot see the whole map, you only look at the local slope,
compute (∂L/∂w_1, ∂L/∂w_2), and move by (−η ∂L/∂w_1, −η ∂L/∂w_2).
This is the “learning” of machines in deep learning ……
Even AlphaGo uses this approach.
I hope you are not too disappointed :p
What people imagine …… versus what actually happens …..

Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w.
• Ref: .tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
Don’t worry about ∂L/∂w, the toolkits will handle it.
(台大周伯威同學開發: developed by NTU student Po-Wei Chou.)

Concluding Remarks
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……

Outline of Lecture I
• Introduction of Deep Learning
• Why Deep?
• “Hello World” for Deep Learning

Deeper is Better?
Layer × Size   Word Error Rate (%)   |   Layer × Size   Word Error Rate (%)
1 × 2k         24.2                  |
2 × 2k         20.4                  |
3 × 2k         18.4                  |
4 × 2k         17.8                  |
5 × 2k         17.2                  |   1 × 3772       22.5
7 × 2k         17.1                  |   1 × 4634       22.6
                                     |   1 × 16k        22.1
(The error rate keeps getting better as the network gets deeper.)
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

Universality Theorem
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer
(given enough hidden neurons).
Reference for the reason: http://neuralnetworksandde/chap4.html
So why a “Deep” neural network and not a “Fat” neural network?

Fat + Short v.s. Thin + Tall
With the same number of parameters, which is better: a shallow, wide network over x_1, x_2, …, x_N,
or a deep, thin one?
The word-error-rate table above answers this: one wide layer (1 × 3772 → 22.5 %, 1 × 4634 → 22.6 %,
1 × 16k → 22.1 %) is clearly worse than a deep stack of thin layers (5 × 2k → 17.2 %).
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

Analogy (this page is for readers with an EE background)
Logic circuits
• Logic circuits consist of gates.
• Two layers of logic gates can represent any Boolean function.
• Using multiple layers of logic gates to build some functions is much simpler → fewer gates needed.
Neural network
• A neural network consists of neurons.
• A network with one hidden layer can represent any continuous function.
• Using multiple layers of neurons to represent some functions is much simpler → fewer parameters → less data needed?

Modularization
• Deep → Modularization
Suppose we train four image classifiers directly:
  Classifier 1: girls with long hair (長髮女)
  Classifier 2: boys with long hair (長髮男), which has only a few examples, so it is weak
  Classifier 3: girls with short hair (短髮女)
  Classifier 4: boys with short hair (短髮男)

Modularization
• Deep → Modularization
Instead, first train basic classifiers for the attributes:
  basic classifier: boy or girl? (長髮女/短髮女 v.s. 長髮男/短髮男)
  basic classifier: long or short hair? (長髮女/長髮男 v.s. 短髮女/短髮男)
Each basic classifier can have sufficient training examples.

Modularization
• Deep → Modularization
The four fine-grained classifiers (girls/boys with long/short hair) then use the basic classifiers
as modules. Because the modules are shared, each of them can be trained with little data.

Modularization
• Deep → Modularization → less training data?
In a deep network (x_1, x_2, …, x_N → layer 1 → layer 2 → …) the first layer learns the most basic
classifiers, the second layer uses the first layer as modules to build more complex classifiers,
the next layer uses the second layer as modules, and so on ……
The modularization is automatically learned from data.
Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 (pp. 818-833).

Outline of Lecture I
• Introduction of Deep Learning
• Why Deep?
• “Hello World” for Deep Learning

Keras
If you want to learn Theano:
  .tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
  .tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
TensorFlow or Theano: very flexible, but they need some effort to learn.
Keras: an interface of TensorFlow or Theano; easy to learn and use (and it still has some flexibility).
You can modify it if you can write TensorFlow or Theano.

Keras
• François Chollet is the author of Keras.
• He currently works for Google as a deep learning engineer and researcher.
• Keras means horn in Greek.
• Documentation: http://keras.io/
• Example: /fchollet/keras/tree/master/examples
(Notes on using Keras; thanks to 沈昇勳 (Sheng-Hsun Shen) for providing the figures.)

Example Application
• Handwriting Digit Recognition: the machine reads a 28 × 28 image and outputs “1”.
MNIST Data: /exdb/mnist/
This is the “Hello world” for deep learning.
Keras provides a data set loading function: http://keras.io/datasets/

The network used here, written in Keras:
  input 28 × 28 = 784 → fully connected layer of 500 neurons → 500 neurons → Softmax → y_1, …, y_10
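A sketch of this 784-500-500-10 network in Keras. The lecture uses the original Keras 1.x API on a Theano backend; the sketch below assumes the current tf.keras API instead, so the calls differ slightly from the slides:

    import tensorflow as tf

    # define the function set (the model): 784 -> 500 -> 500 -> softmax over 10 digits
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(500, activation='sigmoid', input_shape=(784,)),
        tf.keras.layers.Dense(500, activation='sigmoid'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.summary()   # prints the layer sizes and parameter counts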

Keras
Step 3.1: Configuration, i.e. choose the loss, the optimizer and the learning rate for the update
    w ← w − η ∂L/∂w,  e.g. η = 0.1.
Step 3.2: Find the optimal network parameters by training on the training data (images) and their labels (digits).
(How the training actually works is the topic of the next lecture.)

Keras
Step 3.2: Find the optimal network parameters.
(See /versions/r0.8/tutorials/mnist/beginners/index.html)
The training images are stored as a numpy array of shape (number of training examples) × (28 × 28 = 784),
and the labels as a numpy array of shape (number of training examples) × 10.
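A hedged sketch of Steps 3.1 and 3.2 with tf.keras (again assuming the modern API rather than the Keras 1.x calls in the lecture, and reusing the `model` defined in the previous sketch); mnist.load_data and to_categorical produce exactly the two numpy arrays described above:

    import tensorflow as tf

    # load MNIST and reshape to (num_examples, 784) / (num_examples, 10)
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
    x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Step 3.1: configuration (loss, optimizer, learning rate)
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                  metrics=['accuracy'])

    # Step 3.2: find the optimal network parameters
    model.fit(x_train, y_train, batch_size=100, epochs=20)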

Keras
Save and load models: http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
How to use the neural network (testing):
  case 1: the test data have labels, so report the accuracy;
  case 2: there are no labels, so just compute the network outputs.
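The two testing cases and the save/load step as a tf.keras sketch, continuing from the previous snippets (the file name model.h5 is just an illustrative choice):

    # case 1: there are labels for the test data -> report loss and accuracy
    score = model.evaluate(x_test, y_test)
    print('test loss:', score[0], 'test accuracy:', score[1])

    # case 2: no labels -> just compute the network outputs
    probabilities = model.predict(x_test)

    # save and load the trained model
    model.save('model.h5')
    reloaded = tf.keras.models.load_model('model.h5')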

Keras
• Using GPU to speed up training
• Way 1:
    THEANO_FLAGS=device=gpu0 python YourCode.py
• Way 2 (in your code):
    import os
    os.environ["THEANO_FLAGS"] = "device=gpu0"

Live Demo

Lecture II: Tips for Training DNN

Recipe of Deep Learning
Step 1: define a set of functions → Step 2: goodness of function → Step 3: pick the best function
→ Good results on the Training Data?
    NO → go back and improve the three steps.
    YES → Good results on the Testing Data?
        NO → Overfitting! → go back.
        YES → done: a usable Neural Network.

Do not always blame Overfitting
If the testing performance is bad, check the training performance first: the network may simply be
not well trained, rather than overfitting.

Recipe of Deep Learning
Different approaches for different problems, e.g. dropout is for good results on testing data.
For good results on training data:
• Choosing proper loss
• Mini-batch
• New activation function
• Adaptive Learning Rate
• Momentum

Choosing Proper Loss
For the digit-recognition network (x_1, …, x_256 → … → Softmax → y_1, …, y_10) with target “1”
(𝑦̂_1 = 1, 𝑦̂_2 = 0, …, 𝑦̂_10 = 0), which loss is better?
    Square Error:  Σ_{i=1}^{10} (y_i − 𝑦̂_i)²
    Cross Entropy: −Σ_{i=1}^{10} 𝑦̂_i ln y_i

Let’s try it
Testing accuracy: Square Error 0.11, Cross Entropy 0.84.
On the total-loss surface over (w_1, w_2), square error is nearly flat far from the target, which
stalls training, while cross entropy keeps a usable slope.
When using a softmax output layer, choose cross entropy.
(/proceedings/papers/v9/glorot10a/glorot10a.pdf)
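A small sketch contrasting the two losses on the target above (NumPy assumed); note how cross entropy penalizes a confidently wrong softmax output much more sharply than square error does:

    import numpy as np

    y_hat = np.zeros(10); y_hat[0] = 1.0           # target: the digit is "1"

    def square_error(y, y_hat):
        return np.sum((y - y_hat) ** 2)

    def cross_entropy(y, y_hat):
        return -np.sum(y_hat * np.log(y + 1e-12))  # small epsilon avoids log(0)

    y = np.full(10, 0.01); y[1] = 0.91             # a confidently wrong softmax output
    print(square_error(y, y_hat), cross_entropy(y, y_hat))  # ≈ 1.8 vs ≈ 4.6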

Recipe of Deep Learning, good results on training data:
• Choosing proper loss  • Mini-batch  • New activation function  • Adaptive Learning Rate  • Momentum

Mini-batch
• Randomly initialize the network parameters.
• Pick the 1st mini-batch (e.g. x^1, x^31, …), compute its loss L′ = 𝑙^1 + 𝑙^31 + ⋯, and update the parameters once.
• Pick the 2nd mini-batch (e.g. x^2, x^16, …), compute L″ = 𝑙^2 + 𝑙^16 + ⋯, and update the parameters once.
• Continue until all mini-batches have been picked: that is one epoch.
• Repeat the whole process for many epochs (e.g. 100 examples in a mini-batch, repeated for 20 epochs).
We do not really minimize the total loss: L is different each time we update the parameters,
because each update only looks at the loss of the current mini-batch.
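A sketch of one epoch of mini-batch updates (NumPy arrays assumed for x and y; grad_loss is a hypothetical function standing in for whatever backpropagation would return for a batch):

    import numpy as np

    def run_epoch(theta, x, y, grad_loss, batch_size=100, eta=0.1):
        # shuffle the training examples for this epoch
        order = np.random.permutation(len(x))
        for start in range(0, len(x), batch_size):
            batch = order[start:start + batch_size]
            # one update per mini-batch: theta <- theta - eta * gradient of the batch loss
            theta = theta - eta * grad_loss(theta, x[batch], y[batch])
        return theta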

Mini-batch
Original gradient descent sees all examples before each update, so its path follows the total loss;
with mini-batch the path is unstable, because each step only sees one batch (the colors represent the total loss).

Mini-batch is faster
Original gradient descent: one update after seeing all examples.
With mini-batch: if there are 20 batches, there are 20 updates in one epoch.
This is not always faster per example with parallel computing, but one epoch of mini-batch can have
the same speed as one epoch of full gradient descent (for data sets that are not super large),
so the extra updates come almost for free.
Mini-batch also has better performance! Testing accuracy: mini-batch 0.84, no batch 0.12.

Shuffle the training examples for each epoch, so the mini-batches differ from epoch to epoch
(e.g. epoch 1 groups 𝑙^1 with 𝑙^31 and 𝑙^2 with 𝑙^16; epoch 2 groups 𝑙^1 with 𝑙^17 and 𝑙^2 with 𝑙^26).
Don’t worry, this is the default of Keras.

Recipe of Deep Learning, good results on training data:
• Choosing proper loss  • Mini-batch  • New activation function  • Adaptive Learning Rate  • Momentum

Hard to get the power of Deep …
Deeper usually does not imply better: on the training data the 9-layer sigmoid network learns far
worse than the 3-layer one. Let’s try it; testing accuracy: 3 layers 0.84, 9 layers 0.11.

Vanishing Gradient Problem
In a deep sigmoid network the layers close to the input have smaller gradients, learn very slowly
and stay almost random, while the layers close to the output have larger gradients, learn very fast
and have already converged, so the result is based on nearly random low-level features (!?).
Intuitive way to compute the derivatives: ∂𝑙/∂w ≈ Δ𝑙/Δw. Perturb a weight by Δw; each sigmoid maps
a large change of its input to a small change of its output, so the effect shrinks layer by layer
and Δ𝑙 is tiny for weights far from the output, i.e. smaller gradients.

Hard to get the power of Deep …
In 2006, people used RBM pre-training. In 2015, people use ReLU.

ReLU
• Rectified Linear Unit (ReLU):
      a = z  if z > 0,    a = 0  if z < 0
Reasons: 1. fast to compute; 2. biological reason; 3. equivalent to infinitely many sigmoids with
different biases; 4. it addresses the vanishing gradient problem.
[Xavier Glorot, AISTATS’11] [Andrew L. Maas, ICML’13] [Kaiming He, arXiv’15]

ReLU
Neurons with a = 0 can be removed from the network, so what remains is a thinner linear network
(a = z on the active units), and the gradients are not made smaller as they pass through it.

Let’s try it (9 layers)
Testing accuracy: Sigmoid 0.11, ReLU 0.96; with ReLU the 9-layer network now trains properly.

ReLU - variants
    Leaky ReLU:       a = z if z > 0,   a = 0.01 z otherwise
    Parametric ReLU:  a = z if z > 0,   a = α z otherwise, where α is also learned by gradient descent
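The three activations as short NumPy sketches:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)             # a = z for z > 0, else 0

    def leaky_relu(z, slope=0.01):
        return np.where(z > 0, z, slope * z)  # a = z for z > 0, else 0.01 z

    def parametric_relu(z, alpha):
        # alpha is a learned parameter rather than a fixed constant
        return np.where(z > 0, z, alpha * z)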

Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
The linear units in a layer are grouped, and each group outputs the maximum of its members’
pre-activations. E.g. with inputs x_1, x_2: first-layer groups (5, 7) → 7 and (−1, 1) → 1,
next-layer groups (1, 2) → 2 and (4, 3) → 4.
• ReLU is a special case of Maxout.
• You can have more than 2 elements in a group.

Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The activation function of a maxout network can be any piecewise linear convex function.
• How many pieces it has depends on how many elements are in a group
  (2 elements in a group → 2 pieces, 3 elements in a group → 3 pieces).
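A sketch of one maxout group (NumPy assumed): several linear units share the same input and the group outputs the maximum of their pre-activations. The weights here are invented toy numbers.

    import numpy as np

    def maxout_group(x, W, b):
        # W has one row per linear unit in the group, b one bias per unit;
        # the group output is the maximum of the units' pre-activations
        return np.max(W @ x + b)

    x = np.array([1.0, 2.0])
    W = np.array([[1.0, 2.0], [-3.0, 0.5]])   # two linear units in the group
    b = np.array([0.0, 1.0])
    print(maxout_group(x, W, b))              # max(5.0, -1.0) = 5.0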

Recipe of Deep Learning, good results on training data:
• Choosing proper loss  • Mini-batch  • New activation function  • Adaptive Learning Rate  • Momentum

Learning Rates
Set the learning rate η carefully (picture over parameters w_1, w_2):
• If the learning rate is too large, the total loss may not decrease after each update.
• If the learning rate is too small, training would be too slow.

Learning Rates
• Popular & simple idea: reduce the learning rate by some factor every few epochs.
  At the beginning we are far from the destination, so we use a larger learning rate;
  after several epochs we are close to the destination, so we reduce the learning rate.
  E.g. 1/t decay:  η^t = η / √(t + 1).
• The learning rate cannot be one-size-fits-all: give different parameters different learning rates.

Adagrad
Original gradient descent:  w ← w − η ∂L/∂w
Adagrad:                    w ← w − η_w ∂L/∂w,  a parameter-dependent learning rate with
    η_w = η / √( Σ_{i=0}^{t} (g^i)² )
where η is a constant, g^i is the value of ∂L/∂w obtained at the i-th update, and the denominator is
the summation of the squares of the previous derivatives.
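A sketch of the Adagrad update for a single parameter; the gradient g is assumed to come from whatever computes ∂L/∂w at each step:

    import numpy as np

    class AdagradScalar:
        """Adagrad for one parameter: w <- w - eta / sqrt(sum of squared past gradients) * g."""
        def __init__(self, eta=0.1):
            self.eta = eta
            self.sum_sq = 0.0          # running sum of (g^i)^2 over all past updates

        def update(self, w, g):
            self.sum_sq += g ** 2
            return w - self.eta / np.sqrt(self.sum_sq) * g

    opt = AdagradScalar(eta=0.1)
    w = 0.0
    w = opt.update(w, g=0.1)   # first step uses eta / sqrt(0.1^2)
    w = opt.update(w, g=0.2)   # second step uses eta / sqrt(0.1^2 + 0.2^2)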

Adagrad
Example: if the gradients of w_1 are 0.1, 0.2, … its learning rates are η/√(0.1²), η/√(0.1² + 0.2²), …;
if the gradients of w_2 are 20, 10, … its learning rates are η/√(20²), η/√(20² + 10²), ….
Observations:
1. The learning rate is smaller and smaller for all parameters, since the sum in the denominator only grows.
2. Smaller derivatives give a larger learning rate, and vice versa:
   larger derivatives → smaller learning rate; smaller derivatives → larger learning rate.
Why? This balances the step sizes, so flat directions still make progress while steep directions do not overshoot.

Not the whole story ……
• Adagrad [John Duchi, JMLR’11]
• RMSprop: /watch?v=O3sxAc4hxZU
• Adadelta [Matthew D. Zeiler, arXiv’12]
• “No more pesky learning rates” [Tom Schaul, arXiv’12]
• AdaSecant [Caglar Gulcehre, arXiv’14]
• Adam [Diederik P. Kingma, ICLR’15]
• Nadam: /proj2015/054_report.pdf

Recipe of Deep Learning, good results on training data:
• Choosing proper loss  • Mini-batch  • New activation function  • Adaptive Learning Rate  • Momentum

Hard to find optimal network parameters
Plotting the total loss against the value of a network parameter w, gradient descent can be:
• very slow at a plateau (∂L/∂w ≈ 0),
• stuck at a saddle point (∂L/∂w = 0),
• stuck at a local minimum (∂L/∂w = 0).

Momentum
In the physical world a ball rolling down the cost surface does not stop the moment the slope is
zero; it keeps some momentum. How about putting this phenomenon into gradient descent?
Movement = negative of ∂L/∂w + momentum, so the real movement keeps carrying the previous direction
even where ∂L/∂w = 0.
This still does not guarantee reaching the global minimum, but it gives some hope ……
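A sketch of the momentum update for one parameter (the 0.9 momentum coefficient is a common default, not something fixed by the slide):

    def momentum_step(w, v, g, eta=0.1, mu=0.9):
        # v accumulates the previous movement; the new movement adds the negative gradient
        v = mu * v - eta * g
        return w + v, v

    w, v = 0.0, 0.0
    for g in [1.0, 0.5, 0.0, 0.0]:      # even with zero gradient the parameter keeps moving
        w, v = momentum_step(w, v, g)
    print(w, v)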

Adam
Adam ≈ RMSProp (an advanced Adagrad) + Momentum.
Let’s try it (ReLU, 3 layers); testing accuracy: original 0.96, Adam 0.97, and training converges faster.

Recipe of Deep Learning
For good results on testing data:
• Early Stopping
• Regularization
• Dropout
• Network Structure

Why Overfitting?
• Training data and testing data can be different.
  Training Data: …  Testing Data: …
The learning target is defined by the training data. The parameters achie……
