
Deep Learning Tutorial
李宏毅 Hung-yi Lee

Deep learning attracts lots of attention.
• I believe you have seen lots of exciting results before.
This talk focuses on the basic techniques.
(Deep learning trends at Google. Source: SIGMOD / Jeff Dean)

Outline
Lecture I: Introduction of Deep Learning
Lecture II: Tips for Training Deep Neural Network
Lecture III: Variants of Neural Network
Lecture IV: Next Wave

Lecture I: Introduction of Deep Learning

Outline of Lecture I
• Introduction of Deep Learning
• Why Deep?
• “Hello World” for Deep Learning

Let’s start with general machine learning.

Machine Learning ≈ Looking for a Function
• Speech Recognition: f(speech) = “How are you”
• Image Recognition: f(image) = “Cat”
• Playing Go: f(board) = “5-5” (next move)
• Dialogue System: f(“Hi”) = “Hello” (what the user said → system response)

Framework
Image Recognition: f(image) = “cat”
Model: a set of functions f1, f2, …
  f1(image) = “cat”,  f1(image) = “dog”
  f2(image) = “money”, f2(image) = “snake”

Framework (Training)
Goodness of a function f is measured on the Training Data
(function input: an image; function output: a label such as “monkey”, “cat”, “dog”).
A function that fits the labeled training data better is “Better!” → Supervised Learning.

Framework (Training and Testing)
Step 1: a set of functions (Model)
Step 2: goodness of function f
Step 3: pick the “Best” function f*
Training uses steps 1-3 on the training data; Testing uses f*, e.g. f*(image) = “cat”.

Three Steps for Deep Learning
Step 1: define a set of functions → Neural Network
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……

Human Brains (the inspiration)

Neural Network: Neuron
A neuron is a simple function. Given inputs a_1, …, a_k, …, a_K with weights w_1, …, w_k, …, w_K and bias b:
    z = a_1 w_1 + ⋯ + a_k w_k + ⋯ + a_K w_K + b,    a = σ(z)
where σ is the activation function, e.g. the Sigmoid Function σ(z) = 1 / (1 + e^(−z)).
Example: inputs (1, −1), weights (1, −2), bias 1:
    z = 1·1 + (−1)·(−2) + 1 = 4,    σ(4) ≈ 0.98
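A minimal sketch of this neuron in Python (NumPy assumed available; the weights and bias are the toy numbers from the example above):

    import numpy as np

    def sigmoid(z):
        # sigmoid activation: squashes any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(a, w, b):
        # weighted sum of the inputs plus the bias, then the activation
        z = np.dot(a, w) + b
        return sigmoid(z)

    # toy numbers from the slide: inputs (1, -1), weights (1, -2), bias 1
    print(neuron(np.array([1.0, -1.0]), np.array([1.0, -2.0]), 1.0))  # ≈ 0.98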

Neural Network
Different connections lead to different network structures.
Each neuron can have different values of weights and biases.
The weights and biases are the network parameters θ.

Fully Connect Feedforward Network
With the Sigmoid Function σ(z) = 1/(1 + e^(−z)) as activation:
given input (1, −1), the first-layer neurons (weights 1, −2 and −1, 1; biases 1, 0) output
    σ(4) = 0.98 and σ(−2) = 0.12.
These feed the following layers, giving (0.86, 0.11) and then (0.62, 0.83), so
    f([1, −1]) = [0.62, 0.83].
The same network on input (0, 0) gives (0.73, 0.5), (0.72, 0.12) and
    f([0, 0]) = [0.51, 0.85].
This is a function: input vector in, output vector out.
Given parameters θ, the network defines a function; given only the network structure, it defines a function set.
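A sketch of one fully connected layer as a matrix-vector product (NumPy assumed); the numbers reproduce the first layer of the example above, and deeper layers would be stacked the same way:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def layer(x, W, b):
        # one fully connected layer: z = W x + b, a = sigmoid(z)
        return sigmoid(W @ x + b)

    x = np.array([1.0, -1.0])                   # network input
    W1 = np.array([[1.0, -2.0], [-1.0, 1.0]])   # first-layer weights (one row per neuron)
    b1 = np.array([1.0, 0.0])                   # first-layer biases
    print(layer(x, W1, b1))                     # ≈ [0.98, 0.12]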

Fully Connect Feedforward Network
Input Layer: x_1, x_2, …, x_N
Hidden Layers: Layer 1, Layer 2, …, Layer L (each node is a neuron)
Output Layer: y_1, y_2, …, y_M
“Deep” means many hidden layers.

Output Layer (Option)
• Ordinary layer as the output layer: y_i = σ(z_i).
  In general the output of the network can be any value, which may not be easy to interpret.
• Softmax layer as the output layer:
      y_i = e^(z_i) / Σ_{j=1}^{3} e^(z_j)
  Example: z_1 = 3, z_2 = 1, z_3 = −3 give e^(z_1) ≈ 20, e^(z_2) ≈ 2.7, e^(z_3) ≈ 0.05,
  so y_1 ≈ 0.88, y_2 ≈ 0.12, y_3 ≈ 0.
  The outputs behave like probabilities: 1 > y_i > 0 and Σ_i y_i = 1.
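A minimal softmax sketch in NumPy, checked against the example values above (subtracting the maximum first is a numerical-stability detail not shown on the slide):

    import numpy as np

    def softmax(z):
        # exponentiate and normalize so the outputs sum to 1
        e = np.exp(z - np.max(z))   # subtracting max(z) avoids overflow
        return e / e.sum()

    print(softmax(np.array([3.0, 1.0, -3.0])))  # ≈ [0.88, 0.12, 0.00]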

Example Application
Input: a 16 × 16 image = 256 pixels x_1, x_2, …, x_256 (ink → 1, no ink → 0).
Output: y_1, y_2, …, y_10, where each dimension represents the confidence of a digit
(y_1: is “1”, y_2: is “2”, …, y_10: is “0”).
E.g. (y_1, y_2, …) = (0.1, 0.7, 0.2, …) means the image is “2”.

Example Application
• Handwriting Digit Recognition
What is needed is a function:
  input: 256-dim vector → Neural Network → output: 10-dim vector → “2”

Example Application
Input Layer (x_1, …, x_N) → Layer 1 → Layer 2 → … → Layer L → Output Layer (y_1 “is 1”, …, y_10 “is 0”)
The network structure gives a function set containing the candidates for Handwriting Digit Recognition.
You need to decide the network structure so that a good function is in your function set.

FAQ
• Q: How many layers? How many neurons for each layer?
  A: Trial and Error + Intuition
• Q: Can the structure be automatically determined?

Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……

Training Data
• Preparing training data: images and their labels, e.g. “1”, “3”, “4”, “1”, “0”, “2”, “5”, “9”.
The learning target is defined on the training data.

Learning Target
Input: 16 × 16 = 256 pixels x_1, …, x_256 (ink → 1, no ink → 0); output layer: Softmax over y_1, …, y_10.
The learning target is:
  input “1” → y_1 (“is 1”) has the maximum value,
  input “2” → y_2 (“is 2”) has the maximum value.

Loss
Given a set of parameters, the loss 𝑙 can be the distance between the network output
(y_1, y_2, …, y_10) and the target. For an image of “1” the target is (1, 0, …, 0), and the
output should be as close to it as possible.
A good function should make the loss of all examples as small as possible.

Total Loss
For all training data x^1, x^2, x^3, …, x^R with per-example losses 𝑙^1, 𝑙^2, 𝑙^3, …, 𝑙^R, the total loss is
    L = Σ_{r=1}^{R} 𝑙^r
and it should be as small as possible.
Find the network parameters θ* that minimize the total loss L,
i.e. find the function in the function set that minimizes L.
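As a concrete sketch (squared error is chosen here as the distance; the lecture compares it with cross entropy later), the total loss is just the sum of the per-example losses:

    import numpy as np

    def example_loss(y, y_hat):
        # squared distance between network output y and target y_hat
        return np.sum((y - y_hat) ** 2)

    def total_loss(outputs, targets):
        # L = sum over all R training examples of the per-example loss
        return sum(example_loss(y, y_hat) for y, y_hat in zip(outputs, targets))

    outputs = [np.array([0.9, 0.1]), np.array([0.2, 0.8])]   # toy network outputs
    targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # toy targets
    print(total_loss(outputs, targets))  # 0.02 + 0.08 = 0.1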

Three Steps for Deep Learning
Step 1: define a set of functions (Neural Network)
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……

How to pick the best function
Find the network parameters θ* that minimize the total loss L.
Enumerate all possible values? The network parameters
θ = {w_1, w_2, w_3, ⋯, b_1, b_2, b_3, ⋯} contain millions of parameters.
E.g. a speech recognition network with 8 layers and 1000 neurons in each layer already has
10^6 weights between layer l (1000 neurons) and layer l+1 (1000 neurons), so enumeration is hopeless.

Gradient Descent
Find the network parameters θ* that minimize the total loss L. Consider a single parameter w
(the same procedure applies to every element of θ = {w_1, w_2, ⋯, b_1, b_2, ⋯}).
• Pick an initial value for w (random initialization, or RBM pre-training; random is usually good enough).
• Compute ∂L/∂w: if it is negative, increase w; if it is positive, decrease w.
• Update  w ← w − η ∂L/∂w,  where η is called the “learning rate”.
• Repeat until ∂L/∂w is approximately zero (i.e. the update becomes negligible).
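A minimal gradient-descent sketch on a toy one-parameter loss (the quadratic L(w) = (w − 3)² is an invented stand-in for the real total loss, just to show the update rule):

    # toy loss L(w) = (w - 3)^2 and its derivative dL/dw = 2 (w - 3)
    def dL_dw(w):
        return 2.0 * (w - 3.0)

    w = 0.0        # pick an initial value for w
    eta = 0.1      # learning rate
    while abs(dL_dw(w)) > 1e-6:        # repeat until the gradient is approximately zero
        w = w - eta * dL_dw(w)         # w <- w - eta * dL/dw
    print(w)       # ≈ 3.0, the minimizer of the toy loss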

Gradient Descent (all parameters)
Every parameter is updated in the same way. For example, starting from
θ = (w_1, w_2, b_1, …) = (0.2, −0.1, 0.3, …):
    w_1 ← w_1 − η ∂L/∂w_1 = 0.15,   w_2 ← w_2 − η ∂L/∂w_2 = 0.05,   b_1 ← b_1 − η ∂L/∂b_1 = 0.2, …
and again in the next iteration: 0.15 → 0.09, 0.05 → 0.15, 0.2 → 0.10, ….
The vector of all partial derivatives
    ∇L = (∂L/∂w_1, ∂L/∂w_2, ⋯, ∂L/∂b_1, ⋯)
is called the gradient.

Gradient Descent (two-parameter picture)
Color: value of the total loss L over (w_1, w_2).
Randomly pick a starting point, compute (∂L/∂w_1, ∂L/∂w_2), and move by
(−η ∂L/∂w_1, −η ∂L/∂w_2). Hopefully, we would reach a minimum …

Gradient Descent - Difficulty
• Gradient descent never guarantees the global minimum.
Different initial points reach different minima, so they give different results.
There are some tips to help you avoid local minima, but no guarantee.

It is like playing Age of Empires: you cannot see the whole map, you only look at the local slope,
compute (∂L/∂w_1, ∂L/∂w_2), and move by (−η ∂L/∂w_1, −η ∂L/∂w_2).
This is the “learning” of machines in deep learning ……
Even AlphaGo uses this approach.
I hope you are not too disappointed :p
What people imagine …… versus what actually happens …..

Backpropagation
• Backpropagation: an efficient way to compute ∂L/∂w.
• Ref: .tw/~tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html
Don’t worry about ∂L/∂w, the toolkits will handle it.
(台大周伯威同學開發: developed by NTU student Po-Wei Chou.)

Concluding Remarks
Step 1: define a set of functions
Step 2: goodness of function
Step 3: pick the best function
Deep Learning is so simple ……

Outline of Lecture I
• Introduction of Deep Learning
• Why Deep?
• “Hello World” for Deep Learning

Deeper is Better?
Layer × Size   Word Error Rate (%)   |   Layer × Size   Word Error Rate (%)
1 × 2k         24.2                  |
2 × 2k         20.4                  |
3 × 2k         18.4                  |
4 × 2k         17.8                  |
5 × 2k         17.2                  |   1 × 3772       22.5
7 × 2k         17.1                  |   1 × 4634       22.6
                                     |   1 × 16k        22.1
(The error rate keeps getting better as the network gets deeper.)
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

Universality Theorem
Any continuous function f : R^N → R^M can be realized by a network with one hidden layer
(given enough hidden neurons).
Reference for the reason: http://neuralnetworksandde/chap4.html
So why a “Deep” neural network and not a “Fat” neural network?

Fat + Short v.s. Thin + Tall
With the same number of parameters, which is better: a shallow, wide network over x_1, x_2, …, x_N,
or a deep, thin one?
The word-error-rate table above answers this: one wide layer (1 × 3772 → 22.5 %, 1 × 4634 → 22.6 %,
1 × 16k → 22.1 %) is clearly worse than a deep stack of thin layers (5 × 2k → 17.2 %).
Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

Analogy (this page is for readers with an EE background)
Logic circuits
• Logic circuits consist of gates.
• Two layers of logic gates can represent any Boolean function.
• Using multiple layers of logic gates to build some functions is much simpler → fewer gates needed.
Neural network
• A neural network consists of neurons.
• A network with one hidden layer can represent any continuous function.
• Using multiple layers of neurons to represent some functions is much simpler → fewer parameters → less data needed?

Modularization
• Deep → Modularization
Suppose we train four image classifiers directly:
  Classifier 1: girls with long hair (長髮女)
  Classifier 2: boys with long hair (長髮男), which has only a few examples, so it is weak
  Classifier 3: girls with short hair (短髮女)
  Classifier 4: boys with short hair (短髮男)

Modularization
• Deep → Modularization
Instead, first train basic classifiers for the attributes:
  basic classifier: boy or girl? (長髮女/短髮女 v.s. 長髮男/短髮男)
  basic classifier: long or short hair? (長髮女/長髮男 v.s. 短髮女/短髮男)
Each basic classifier can have sufficient training examples.

Modularization
• Deep → Modularization
The four fine-grained classifiers (girls/boys with long/short hair) then use the basic classifiers
as modules. Because the modules are shared, each of them can be trained with little data.

Modularization
• Deep → Modularization → less training data?
In a deep network (x_1, x_2, …, x_N → layer 1 → layer 2 → …) the first layer learns the most basic
classifiers, the second layer uses the first layer as modules to build more complex classifiers,
the next layer uses the second layer as modules, and so on ……
The modularization is automatically learned from data.
Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 (pp. 818-833).

Outline of Lecture I
• Introduction of Deep Learning
• Why Deep?
• “Hello World” for Deep Learning

Keras
If you want to learn Theano:
  .tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
  .tw/~tlkagk/courses/MLDS_2015_2/Lecture/RNN%20training%20(v6).ecm.mp4/index.html
TensorFlow or Theano: very flexible, but they need some effort to learn.
Keras: an interface of TensorFlow or Theano; easy to learn and use (and it still has some flexibility).
You can modify it if you can write TensorFlow or Theano.

Keras
• François Chollet is the author of Keras.
• He currently works for Google as a deep learning engineer and researcher.
• Keras means horn in Greek.
• Documentation: http://keras.io/
• Example: /fchollet/keras/tree/master/examples
(Notes on using Keras; thanks to 沈昇勳 (Sheng-Hsun Shen) for providing the figures.)

Example Application
• Handwriting Digit Recognition: the machine reads a 28 × 28 image and outputs “1”.
MNIST Data: /exdb/mnist/
This is the “Hello world” for deep learning.
Keras provides a data set loading function: http://keras.io/datasets/

The network used here, written in Keras:
  input 28 × 28 = 784 → fully connected layer of 500 neurons → 500 neurons → Softmax → y_1, …, y_10
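A sketch of this 784-500-500-10 network in Keras. The lecture uses the original Keras 1.x API on a Theano backend; the sketch below assumes the current tf.keras API instead, so the calls differ slightly from the slides:

    import tensorflow as tf

    # define the function set (the model): 784 -> 500 -> 500 -> softmax over 10 digits
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(500, activation='sigmoid', input_shape=(784,)),
        tf.keras.layers.Dense(500, activation='sigmoid'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.summary()   # prints the layer sizes and parameter counts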

Keras
Step 3.1: Configuration, i.e. choose the loss, the optimizer and the learning rate for the update
    w ← w − η ∂L/∂w,  e.g. η = 0.1.
Step 3.2: Find the optimal network parameters by training on the training data (images) and their labels (digits).
(How the training actually works is the topic of the next lecture.)

Keras
Step 3.2: Find the optimal network parameters.
(See /versions/r0.8/tutorials/mnist/beginners/index.html)
The training images are stored as a numpy array of shape (number of training examples) × (28 × 28 = 784),
and the labels as a numpy array of shape (number of training examples) × 10.
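A hedged sketch of Steps 3.1 and 3.2 with tf.keras (again assuming the modern API rather than the Keras 1.x calls in the lecture, and reusing the `model` defined in the previous sketch); mnist.load_data and to_categorical produce exactly the two numpy arrays described above:

    import tensorflow as tf

    # load MNIST and reshape to (num_examples, 784) / (num_examples, 10)
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
    x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
    y_train = tf.keras.utils.to_categorical(y_train, 10)
    y_test = tf.keras.utils.to_categorical(y_test, 10)

    # Step 3.1: configuration (loss, optimizer, learning rate)
    model.compile(loss='categorical_crossentropy',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
                  metrics=['accuracy'])

    # Step 3.2: find the optimal network parameters
    model.fit(x_train, y_train, batch_size=100, epochs=20)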

Keras
Save and load models: http://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
How to use the neural network (testing):
  case 1: the test data have labels, so report the accuracy;
  case 2: there are no labels, so just compute the network outputs.
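The two testing cases and the save/load step as a tf.keras sketch, continuing from the previous snippets (the file name model.h5 is just an illustrative choice):

    # case 1: there are labels for the test data -> report loss and accuracy
    score = model.evaluate(x_test, y_test)
    print('test loss:', score[0], 'test accuracy:', score[1])

    # case 2: no labels -> just compute the network outputs
    probabilities = model.predict(x_test)

    # save and load the trained model
    model.save('model.h5')
    reloaded = tf.keras.models.load_model('model.h5')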

Keras
• Using GPU to speed up training
• Way 1:
    THEANO_FLAGS=device=gpu0 python YourCode.py
• Way 2 (in your code):
    import os
    os.environ["THEANO_FLAGS"] = "device=gpu0"

Live Demo

Lecture II: Tips for Training DNN

Recipe of Deep Learning
Step 1: define a set of functions → Step 2: goodness of function → Step 3: pick the best function
→ Good results on the Training Data?
    NO → go back and improve the three steps.
    YES → Good results on the Testing Data?
        NO → Overfitting! → go back.
        YES → done: a usable Neural Network.

Do not always blame Overfitting
If the testing performance is bad, check the training performance first: the network may simply be
not well trained, rather than overfitting.

Recipe of Deep Learning
Different approaches for different problems, e.g. dropout is for good results on testing data.
For good results on training data:
• Choosing proper loss
• Mini-batch
• New activation function
• Adaptive Learning Rate
• Momentum

Choosing Proper Loss
For the digit-recognition network (x_1, …, x_256 → … → Softmax → y_1, …, y_10) with target “1”
(𝑦̂_1 = 1, 𝑦̂_2 = 0, …, 𝑦̂_10 = 0), which loss is better?
    Square Error:  Σ_{i=1}^{10} (y_i − 𝑦̂_i)²
    Cross Entropy: −Σ_{i=1}^{10} 𝑦̂_i ln y_i

Let’s try it
Testing accuracy: Square Error 0.11, Cross Entropy 0.84.
On the total-loss surface over (w_1, w_2), square error is nearly flat far from the target, which
stalls training, while cross entropy keeps a usable slope.
When using a softmax output layer, choose cross entropy.
(/proceedings/papers/v9/glorot10a/glorot10a.pdf)
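A small sketch contrasting the two losses on the target above (NumPy assumed); note how cross entropy penalizes a confidently wrong softmax output much more sharply than square error does:

    import numpy as np

    y_hat = np.zeros(10); y_hat[0] = 1.0           # target: the digit is "1"

    def square_error(y, y_hat):
        return np.sum((y - y_hat) ** 2)

    def cross_entropy(y, y_hat):
        return -np.sum(y_hat * np.log(y + 1e-12))  # small epsilon avoids log(0)

    y = np.full(10, 0.01); y[1] = 0.91             # a confidently wrong softmax output
    print(square_error(y, y_hat), cross_entropy(y, y_hat))  # ≈ 1.8 vs ≈ 4.6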

Recipe of Deep Learning, good results on training data:
• Choosing proper loss  • Mini-batch  • New activation function  • Adaptive Learning Rate  • Momentum

Mini-batch
• Randomly initialize the network parameters.
• Pick the 1st mini-batch (e.g. x^1, x^31, …), compute its loss L′ = 𝑙^1 + 𝑙^31 + ⋯, and update the parameters once.
• Pick the 2nd mini-batch (e.g. x^2, x^16, …), compute L″ = 𝑙^2 + 𝑙^16 + ⋯, and update the parameters once.
• Continue until all mini-batches have been picked: that is one epoch.
• Repeat the whole process for many epochs (e.g. 100 examples in a mini-batch, repeated for 20 epochs).
We do not really minimize the total loss: L is different each time we update the parameters,
because each update only looks at the loss of the current mini-batch.
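A sketch of one epoch of mini-batch updates (NumPy arrays assumed for x and y; grad_loss is a hypothetical function standing in for whatever backpropagation would return for a batch):

    import numpy as np

    def run_epoch(theta, x, y, grad_loss, batch_size=100, eta=0.1):
        # shuffle the training examples for this epoch
        order = np.random.permutation(len(x))
        for start in range(0, len(x), batch_size):
            batch = order[start:start + batch_size]
            # one update per mini-batch: theta <- theta - eta * gradient of the batch loss
            theta = theta - eta * grad_loss(theta, x[batch], y[batch])
        return theta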

Mini-batch
Original gradient descent sees all examples before each update, so its path follows the total loss;
with mini-batch the path is unstable, because each step only sees one batch (the colors represent the total loss).

Mini-batch is faster
Original gradient descent: one update after seeing all examples.
With mini-batch: if there are 20 batches, there are 20 updates in one epoch.
This is not always faster per example with parallel computing, but one epoch of mini-batch can have
the same speed as one epoch of full gradient descent (for data sets that are not super large),
so the extra updates come almost for free.
Mini-batch also has better performance! Testing accuracy: mini-batch 0.84, no batch 0.12.

Shuffle the training examples for each epoch, so the mini-batches differ from epoch to epoch
(e.g. epoch 1 groups 𝑙^1 with 𝑙^31 and 𝑙^2 with 𝑙^16; epoch 2 groups 𝑙^1 with 𝑙^17 and 𝑙^2 with 𝑙^26).
Don’t worry, this is the default of Keras.

Recipe of Deep Learning, good results on training data:
• Choosing proper loss  • Mini-batch  • New activation function  • Adaptive Learning Rate  • Momentum

Hard to get the power of Deep …
Deeper usually does not imply better: on the training data the 9-layer sigmoid network learns far
worse than the 3-layer one. Let’s try it; testing accuracy: 3 layers 0.84, 9 layers 0.11.

Vanishing Gradient Problem
In a deep sigmoid network the layers close to the input have smaller gradients, learn very slowly
and stay almost random, while the layers close to the output have larger gradients, learn very fast
and have already converged, so the result is based on nearly random low-level features (!?).
Intuitive way to compute the derivatives: ∂𝑙/∂w ≈ Δ𝑙/Δw. Perturb a weight by Δw; each sigmoid maps
a large change of its input to a small change of its output, so the effect shrinks layer by layer
and Δ𝑙 is tiny for weights far from the output, i.e. smaller gradients.

Hard to get the power of Deep …
In 2006, people used RBM pre-training. In 2015, people use ReLU.

ReLU
• Rectified Linear Unit (ReLU):
      a = z  if z > 0,    a = 0  if z < 0
Reasons: 1. fast to compute; 2. biological reason; 3. equivalent to infinitely many sigmoids with
different biases; 4. it addresses the vanishing gradient problem.
[Xavier Glorot, AISTATS’11] [Andrew L. Maas, ICML’13] [Kaiming He, arXiv’15]

ReLU
Neurons with a = 0 can be removed from the network, so what remains is a thinner linear network
(a = z on the active units), and the gradients are not made smaller as they pass through it.

Let’s try it (9 layers)
Testing accuracy: Sigmoid 0.11, ReLU 0.96; with ReLU the 9-layer network now trains properly.

ReLU - variants
    Leaky ReLU:       a = z if z > 0,   a = 0.01 z otherwise
    Parametric ReLU:  a = z if z > 0,   a = α z otherwise, where α is also learned by gradient descent
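The three activations as short NumPy sketches:

    import numpy as np

    def relu(z):
        return np.maximum(0.0, z)             # a = z for z > 0, else 0

    def leaky_relu(z, slope=0.01):
        return np.where(z > 0, z, slope * z)  # a = z for z > 0, else 0.01 z

    def parametric_relu(z, alpha):
        # alpha is a learned parameter rather than a fixed constant
        return np.where(z > 0, z, alpha * z)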

Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
The linear units in a layer are grouped, and each group outputs the maximum of its members’
pre-activations. E.g. with inputs x_1, x_2: first-layer groups (5, 7) → 7 and (−1, 1) → 1,
next-layer groups (1, 2) → 2 and (4, 3) → 4.
• ReLU is a special case of Maxout.
• You can have more than 2 elements in a group.

Maxout
• Learnable activation function [Ian J. Goodfellow, ICML’13]
• The activation function of a maxout network can be any piecewise linear convex function.
• How many pieces it has depends on how many elements are in a group
  (2 elements in a group → 2 pieces, 3 elements in a group → 3 pieces).
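A sketch of one maxout group (NumPy assumed): several linear units share the same input and the group outputs the maximum of their pre-activations. The weights here are invented toy numbers.

    import numpy as np

    def maxout_group(x, W, b):
        # W has one row per linear unit in the group, b one bias per unit;
        # the group output is the maximum of the units' pre-activations
        return np.max(W @ x + b)

    x = np.array([1.0, 2.0])
    W = np.array([[1.0, 2.0], [-3.0, 0.5]])   # two linear units in the group
    b = np.array([0.0, 1.0])
    print(maxout_group(x, W, b))              # max(5.0, -1.0) = 5.0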

Recipe of Deep Learning, good results on training data:
• Choosing proper loss  • Mini-batch  • New activation function  • Adaptive Learning Rate  • Momentum

Learning Rates
Set the learning rate η carefully (picture over parameters w_1, w_2):
• If the learning rate is too large, the total loss may not decrease after each update.
• If the learning rate is too small, training would be too slow.

Learning Rates
• Popular & simple idea: reduce the learning rate by some factor every few epochs.
  At the beginning we are far from the destination, so we use a larger learning rate;
  after several epochs we are close to the destination, so we reduce the learning rate.
  E.g. 1/t decay:  η^t = η / √(t + 1).
• The learning rate cannot be one-size-fits-all: give different parameters different learning rates.

Adagrad
Original gradient descent:  w ← w − η ∂L/∂w
Adagrad:                    w ← w − η_w ∂L/∂w,  a parameter-dependent learning rate with
    η_w = η / √( Σ_{i=0}^{t} (g^i)² )
where η is a constant, g^i is the value of ∂L/∂w obtained at the i-th update, and the denominator is
the summation of the squares of the previous derivatives.
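A sketch of the Adagrad update for a single parameter; the gradient g is assumed to come from whatever computes ∂L/∂w at each step:

    import numpy as np

    class AdagradScalar:
        """Adagrad for one parameter: w <- w - eta / sqrt(sum of squared past gradients) * g."""
        def __init__(self, eta=0.1):
            self.eta = eta
            self.sum_sq = 0.0          # running sum of (g^i)^2 over all past updates

        def update(self, w, g):
            self.sum_sq += g ** 2
            return w - self.eta / np.sqrt(self.sum_sq) * g

    opt = AdagradScalar(eta=0.1)
    w = 0.0
    w = opt.update(w, g=0.1)   # first step uses eta / sqrt(0.1^2)
    w = opt.update(w, g=0.2)   # second step uses eta / sqrt(0.1^2 + 0.2^2)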

Adagrad
Example: if the gradients of w_1 are 0.1, 0.2, … its learning rates are η/√(0.1²), η/√(0.1² + 0.2²), …;
if the gradients of w_2 are 20, 10, … its learning rates are η/√(20²), η/√(20² + 10²), ….
Observations:
1. The learning rate is smaller and smaller for all parameters, since the sum in the denominator only grows.
2. Smaller derivatives give a larger learning rate, and vice versa:
   larger derivatives → smaller learning rate; smaller derivatives → larger learning rate.
Why? This balances the step sizes, so flat directions still make progress while steep directions do not overshoot.

Not the whole story ……
• Adagrad [John Duchi, JMLR’11]
• RMSprop: /watch?v=O3sxAc4hxZU
• Adadelta [Matthew D. Zeiler, arXiv’12]
• “No more pesky learning rates” [Tom Schaul, arXiv’12]
• AdaSecant [Caglar Gulcehre, arXiv’14]
• Adam [Diederik P. Kingma, ICLR’15]
• Nadam: /proj2015/054_report.pdf

Recipe of Deep Learning, good results on training data:
• Choosing proper loss  • Mini-batch  • New activation function  • Adaptive Learning Rate  • Momentum

Hard to find optimal network parameters
Plotting the total loss against the value of a network parameter w, gradient descent can be:
• very slow at a plateau (∂L/∂w ≈ 0),
• stuck at a saddle point (∂L/∂w = 0),
• stuck at a local minimum (∂L/∂w = 0).

Momentum
In the physical world a ball rolling down the cost surface does not stop the moment the slope is
zero; it keeps some momentum. How about putting this phenomenon into gradient descent?
Movement = negative of ∂L/∂w + momentum, so the real movement keeps carrying the previous direction
even where ∂L/∂w = 0.
This still does not guarantee reaching the global minimum, but it gives some hope ……
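A sketch of the momentum update for one parameter (the 0.9 momentum coefficient is a common default, not something fixed by the slide):

    def momentum_step(w, v, g, eta=0.1, mu=0.9):
        # v accumulates the previous movement; the new movement adds the negative gradient
        v = mu * v - eta * g
        return w + v, v

    w, v = 0.0, 0.0
    for g in [1.0, 0.5, 0.0, 0.0]:      # even with zero gradient the parameter keeps moving
        w, v = momentum_step(w, v, g)
    print(w, v)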

Adam
Adam ≈ RMSProp (an advanced Adagrad) + Momentum.
Let’s try it (ReLU, 3 layers); testing accuracy: original 0.96, Adam 0.97, and training converges faster.

Recipe of Deep Learning
For good results on testing data:
• Early Stopping
• Regularization
• Dropout
• Network Structure

Why Overfitting?
• Training data and testing data can be different.
  Training Data: …  Testing Data: …
The learning target is defined by the training data. The parameters achie……
