Stanford Machine Learning: Collected Problem Sets and Solutions

CS229, Public Course

Problem Set #1: Supervised Learning

Newton's method for computing least squares

In this problem, we will prove that if we use Newton's method to solve the least squares optimization problem, then we only need one iteration to converge to θ*.

Find the Hessian of the cost function J(θ) = (1/2) Σ_{i=1}^m (θ^T x^(i) − y^(i))^2.

Show that the first iteration of Newton's method gives us θ* = (X^T X)^{-1} X^T y⃗, the solution to our least squares problem.

Locally-weighted logistic regression

In this problem you will implement a locally-weighted version of logistic regression, where we weight different training examples differently according to the query point. The locally-weighted logistic regression problem is to maximize

ℓ(θ) = −(λ/2) θ^T θ + Σ_{i=1}^m w^(i) [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ].

The −(λ/2) θ^T θ here is what is known as a regularization term, which will be discussed in a future lecture, but which we include here because it is needed for Newton's method to perform well on this task. For the entirety of this problem you can use the value λ = 0.0001.

Using this definition, the gradient of ℓ(θ) is given by

∇_θ ℓ(θ) = X^T z − λθ

where z ∈ R^m is defined by

z_i = w^(i) (y^(i) − h_θ(x^(i)))

and the Hessian is given by

H = X^T D X − λI

where D ∈ R^{m×m} is a diagonal matrix with

D_ii = −w^(i) h_θ(x^(i)) (1 − h_θ(x^(i)))

For the sake of this problem you can just use the above formulas, but you should try to derive these results for yourself as well.

Given a query point x, we choose to compute the weights

w^(i) = exp( −||x − x^(i)||^2 / (2τ^2) ).

Much like the locally weighted linear regression that was discussed in class, this weighting scheme gives more weight to the "nearby" points when predicting the class of a new example.

Implement the Newton-Raphson algorithm for optimizing ℓ(θ) for a new query point x, and use this to predict the class of x.

The q2/ directory contains data and code for this problem. You should implement the y = lwlr(Xtrain, ytrain, x, tau) function in the lwlr.m file. This function takes as input the training set (the Xtrain and ytrain matrices, in the form described in the class notes), a new query point x and the weight bandwidth tau. Given this input the function should 1) compute weights w^(i) for each training example, using the formula above, 2) maximize ℓ(θ) using Newton's method, and finally 3) output y = 1{h_θ(x) > 0.5} as the prediction.

We provide two additional functions that might help. The [Xtrain, ytrain] = loaddata; function will load the matrices from files in the data/ folder. The function plotlwlr(Xtrain, ytrain, tau, resolution) will plot the resulting classifier (assuming you have properly implemented lwlr.m). This function evaluates the locally weighted logistic regression classifier over a large grid of points and plots the resulting prediction as blue (predicting y = 0) or red (predicting y = 1). Depending on how fast your lwlr function is, creating the plot might take some time, so we recommend debugging your code with resolution = 50; and later increase it to at least 200 to get a better idea of the decision boundary.

Evaluate the system with a variety of different bandwidth parameters τ. In particular, try τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0. How does the classification boundary change when varying this parameter? Can you predict what the decision boundary of ordinary (unweighted) logistic regression would look like?

Multivariate least squares

So far in class, we have only considered cases where our target variable y is a scalar value. Suppose that instead of trying to predict a single output, we have a training set with multiple outputs for each example:

{(x^(i), y^(i)), i = 1, ..., m},  x^(i) ∈ R^n, y^(i) ∈ R^p.

Thus for each training example, y^(i) is vector-valued, with p entries. We wish to use a linear model to predict the outputs, as in least squares, by specifying the parameter matrix Θ in

y = Θ^T x,

where Θ ∈ R^{n×p}.

The cost function for this case is

J(Θ) = (1/2) Σ_{i=1}^m Σ_{j=1}^p ( (Θ^T x^(i))_j − y_j^(i) )^2.

Write J(Θ) in matrix-vector notation (i.e., without using any summations). [Hint: Start with the m×n design matrix

X = [ (x^(1))^T
      (x^(2))^T
         ⋮
      (x^(m))^T ]

and the m×p target matrix

Y = [ (y^(1))^T
      (y^(2))^T
         ⋮
      (y^(m))^T ]

and then work out how to express J(Θ) in terms of these matrices.]

Find the closed form solution for Θ which minimizes J(Θ). This is the equivalent of the normal equations for the multivariate case.

Suppose instead of considering the multivariate vectors y^(i) all at once, we instead compute each variable y_j^(i) separately for each j = 1, ..., p. In this case, we have p individual linear models, of the form

y_j^(i) = θ_j^T x^(i),  j = 1, ..., p.

(So here, each θ_j ∈ R^n.) How do the parameters from these p independent least squares problems compare to the multivariate solution?

Naive Bayes

In this problem, we look at maximum likelihood parameter estimation using the naive Bayes assumption. Here, the input features x_j, j = 1, ..., n to our model are discrete, binary-valued variables, so x_j ∈ {0, 1}. We call x = [x_1 x_2 ··· x_n]^T the input vector. For each training example, our output target is a single binary value y ∈ {0, 1}. Our model is then parameterized by φ_{j|y=0} = p(x_j = 1|y = 0), φ_{j|y=1} = p(x_j = 1|y = 1), and φ_y = p(y = 1). We model the joint distribution of (x, y) according to

p(y) = (φ_y)^y (1 − φ_y)^{1−y}

p(x|y = 0) = Π_{j=1}^n p(x_j|y = 0) = Π_{j=1}^n (φ_{j|y=0})^{x_j} (1 − φ_{j|y=0})^{1−x_j}

p(x|y = 1) = Π_{j=1}^n p(x_j|y = 1) = Π_{j=1}^n (φ_{j|y=1})^{x_j} (1 − φ_{j|y=1})^{1−x_j}

Find the joint likelihood function ℓ(ϕ) = log Π_{i=1}^m p(x^(i), y^(i); ϕ) in terms of the model parameters given above. Here, ϕ represents the entire set of parameters {φ_y, φ_{j|y=0}, φ_{j|y=1}, j = 1, ..., n}.

Show that the parameters which maximize the likelihood function are the same as those given in the lecture notes; i.e., that

φ_{j|y=0} = Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 0} / Σ_{i=1}^m 1{y^(i) = 0}

φ_{j|y=1} = Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 1} / Σ_{i=1}^m 1{y^(i) = 1}

φ_y = Σ_{i=1}^m 1{y^(i) = 1} / m.

Consider making a prediction on some new data point x using the most likely class estimate generated by the naive Bayes algorithm. Show that the hypothesis returned by naive Bayes is a linear classifier; i.e., if p(y = 0|x) and p(y = 1|x) are the class probabilities returned by naive Bayes, show that there exists some θ ∈ R^{n+1} such that

p(y = 1|x) ≥ p(y = 0|x) if and only if θ^T [1; x] ≥ 0.

(Assume θ_0 is an intercept term.)

Exponential family and the geometric distribution

Consider the geometric distribution parameterized by φ:

p(y; φ) = (1 − φ)^{y−1} φ,  y = 1, 2, 3, ....

Show that the geometric distribution is in the exponential family, and give b(y), η, T(y), and a(η).

Consider performing regression using a GLM model with a geometric response variable. What is the canonical response function for the family? You may use the fact that the mean of a geometric distribution is given by 1/φ.

For a training set {(x^(i), y^(i)); i = 1, ..., m}, let the log-likelihood of an example be log p(y^(i)|x^(i); θ). By taking the derivative of the log-likelihood with respect to θ_j, derive the stochastic gradient ascent rule for learning using a GLM model with geometric responses y and the canonical response function.

CS229, Public Course

Problem Set #1 Solutions: Supervised Learning

Newton's method for computing least squares

In this problem, we will prove that if we use Newton's method to solve the least squares optimization problem, then we only need one iteration to converge to θ*.

Find the Hessian of the cost function J(θ) = (1/2) Σ_{i=1}^m (θ^T x^(i) − y^(i))^2.

Answer: As shown in the class notes,

∂J(θ)/∂θ_j = Σ_{i=1}^m (θ^T x^(i) − y^(i)) x_j^(i).

So

∂^2 J(θ)/∂θ_j ∂θ_k = Σ_{i=1}^m ∂/∂θ_k [ (θ^T x^(i) − y^(i)) x_j^(i) ] = Σ_{i=1}^m x_j^(i) x_k^(i) = (X^T X)_{jk}.

Therefore, the Hessian of J(θ) is H = X^T X. This can also be derived by simply applying rules from the lecture notes on Linear Algebra.

Show that the first iteration of Newton's method gives us θ* = (X^T X)^{-1} X^T y⃗, the solution to our least squares problem.

Answer: Given any θ^(0), Newton's method finds θ^(1) according to

θ^(1) = θ^(0) − H^{-1} ∇_θ J(θ^(0))
      = θ^(0) − (X^T X)^{-1} (X^T X θ^(0) − X^T y⃗)
      = θ^(0) − θ^(0) + (X^T X)^{-1} X^T y⃗
      = (X^T X)^{-1} X^T y⃗.

Therefore, no matter what θ^(0) we pick, Newton's method always finds θ* after one iteration.
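As an informal numerical check of this result (not part of the original solution), the following Octave/MATLAB sketch takes a single Newton step on J(θ) from a random starting point and compares it to the normal-equations solution; all data and dimensions below are arbitrary illustrative choices.

% One Newton step on J(theta) = 1/2 * sum_i (theta' * x(i) - y(i))^2
% should land exactly on the least squares solution.
m = 50; n = 3;
X = randn(m, n);                     % random design matrix
y = randn(m, 1);                     % random targets
theta0 = randn(n, 1);                % arbitrary starting point

grad = X' * (X * theta0 - y);        % gradient of J at theta0
H = X' * X;                          % Hessian of J (independent of theta)
theta1 = theta0 - H \ grad;          % single Newton update

theta_star = (X' * X) \ (X' * y);    % normal equations solution
disp(norm(theta1 - theta_star))      % should be numerically zero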

Locally-weighted logistic regression

In this problem you will implement a locally-weighted version of logistic regression, where we weight different training examples differently according to the query point. The locally-weighted logistic regression problem is to maximize

ℓ(θ) = −(λ/2) θ^T θ + Σ_{i=1}^m w^(i) [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ].

The −(λ/2) θ^T θ here is what is known as a regularization term, which will be discussed in a future lecture, but which we include here because it is needed for Newton's method to perform well on this task. For the entirety of this problem you can use the value λ = 0.0001.

Using this definition, the gradient of ℓ(θ) is given by

∇_θ ℓ(θ) = X^T z − λθ

where z ∈ R^m is defined by

z_i = w^(i) (y^(i) − h_θ(x^(i)))

and the Hessian is given by

H = X^T D X − λI

where D ∈ R^{m×m} is a diagonal matrix with

D_ii = −w^(i) h_θ(x^(i)) (1 − h_θ(x^(i)))

For the sake of this problem you can just use the above formulas, but you should try to derive these results for yourself as well.

Given a query point x, we choose to compute the weights

w^(i) = exp( −||x − x^(i)||^2 / (2τ^2) ).

Much like the locally weighted linear regression that was discussed in class, this weighting scheme gives more weight to the "nearby" points when predicting the class of a new example.

Implement the Newton-Raphson algorithm for optimizing ℓ(θ) for a new query point x, and use this to predict the class of x.

The q2/ directory contains data and code for this problem. You should implement the y = lwlr(Xtrain, ytrain, x, tau) function in the lwlr.m file. This function takes as input the training set (the Xtrain and ytrain matrices, in the form described in the class notes), a new query point x and the weight bandwidth tau. Given this input the function should 1) compute weights w^(i) for each training example, using the formula above, 2) maximize ℓ(θ) using Newton's method, and finally 3) output y = 1{h_θ(x) > 0.5} as the prediction.

We provide two additional functions that might help. The [Xtrain, ytrain] = loaddata; function will load the matrices from files in the data/ folder. The function plotlwlr(Xtrain, ytrain, tau, resolution) will plot the resulting classifier (assuming you have properly implemented lwlr.m). This function evaluates the locally weighted logistic regression classifier over a large grid of points and plots the resulting prediction as blue (predicting y = 0) or red (predicting y = 1). Depending on how fast your lwlr function is, creating the plot might take some time, so we recommend debugging your code with resolution = 50; and later increase it to at least 200 to get a better idea of the decision boundary.

Answer: Our implementation of lwlr.m:

function y = lwlr(X_train, y_train, x, tau)

m = size(X_train, 1);
n = size(X_train, 2);

theta = zeros(n, 1);

% compute weights, w(i) = exp(-||x - x(i)||^2 / (2 tau^2))
w = exp(-sum((X_train - repmat(x', m, 1)).^2, 2) / (2 * tau^2));

% perform Newton's method
g = ones(n, 1);
while (norm(g) > 1e-6)
  h = 1 ./ (1 + exp(-X_train * theta));                               % h_theta(x(i)) for all i
  g = X_train' * (w .* (y_train - h)) - 1e-4 * theta;                 % gradient of l(theta), with lambda = 1e-4
  H = -X_train' * diag(w .* h .* (1 - h)) * X_train - 1e-4 * eye(n);  % Hessian of l(theta)
  theta = theta - H \ g;                                              % Newton update
end

% return predicted y
y = double(x' * theta > 0);
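For reference, a minimal usage sketch of the function above (not part of the original solution). The query point and bandwidth are arbitrary illustrative values, loaddata is the helper named in the problem statement, and x_query must have as many entries as Xtrain has columns.

% Hypothetical call to lwlr for a single query point.
[Xtrain, ytrain] = loaddata;         % load the q2/data matrices
x_query = [0.5; 0.5];                % an arbitrary query point, for illustration only
tau = 0.5;                           % bandwidth
y_pred = lwlr(Xtrain, ytrain, x_query, tau)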

Evaluate the system with a variety of different bandwidth parameters τ. In particular, try τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0. How does the classification boundary change when varying this parameter? Can you predict what the decision boundary of ordinary (unweighted) logistic regression would look like?

Answer: These are the resulting decision boundaries, for the different values of τ.

[Figure: decision boundary plots for τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0.]

For smaller τ, the classifier appears to overfit the data set, obtaining zero training error, but outputting a sporadic-looking decision boundary. As τ grows, the resulting decision boundary becomes smoother, eventually converging (in the limit as τ → ∞) to the unweighted logistic regression solution.

Multivariate least squares

So far in class, we have only considered cases where our target variable y is a scalar value. Suppose that instead of trying to predict a single output, we have a training set with multiple outputs for each example:

{(x^(i), y^(i)), i = 1, ..., m},  x^(i) ∈ R^n, y^(i) ∈ R^p.

Thus for each training example, y^(i) is vector-valued, with p entries. We wish to use a linear model to predict the outputs, as in least squares, by specifying the parameter matrix Θ in

y = Θ^T x,

where Θ ∈ R^{n×p}.

The cost function for this case is

J(Θ) = (1/2) Σ_{i=1}^m Σ_{j=1}^p ( (Θ^T x^(i))_j − y_j^(i) )^2.

Write J(Θ) in matrix-vector notation (i.e., without using any summations). [Hint: Start with the m×n design matrix

X = [ (x^(1))^T
      (x^(2))^T
         ⋮
      (x^(m))^T ]

and the m×p target matrix

Y = [ (y^(1))^T
      (y^(2))^T
         ⋮
      (y^(m))^T ]

and then work out how to express J(Θ) in terms of these matrices.]

Answer: The objective function can be expressed as

J(Θ) = (1/2) tr( (XΘ − Y)^T (XΘ − Y) ).

To see this, note that

J(Θ) = (1/2) tr( (XΘ − Y)^T (XΘ − Y) )
     = (1/2) Σ_i ( (XΘ − Y)^T (XΘ − Y) )_{ii}
     = (1/2) Σ_i Σ_j (XΘ − Y)_{ij}^2
     = (1/2) Σ_{i=1}^m Σ_{j=1}^p ( (Θ^T x^(i))_j − y_j^(i) )^2.

Find the closed form solution for Θ which minimizes J(Θ). This is the equivalent of the normal equations for the multivariate case.

Answer: First we take the gradient of J(Θ) with respect to Θ:

∇_Θ J(Θ) = ∇_Θ [ (1/2) tr( (XΘ − Y)^T (XΘ − Y) ) ]
         = (1/2) ∇_Θ [ tr( Θ^T X^T X Θ − Θ^T X^T Y − Y^T X Θ + Y^T Y ) ]
         = (1/2) ∇_Θ [ tr(Θ^T X^T X Θ) − tr(Θ^T X^T Y) − tr(Y^T X Θ) + tr(Y^T Y) ]
         = (1/2) ∇_Θ [ tr(Θ^T X^T X Θ) − 2 tr(Y^T X Θ) + tr(Y^T Y) ]
         = (1/2) [ X^T X Θ + X^T X Θ − 2 X^T Y ]
         = X^T X Θ − X^T Y.

Setting this expression to zero we obtain

Θ = (X^T X)^{-1} X^T Y.

This looks very similar to the closed form solution in the univariate case, except now Y is an m×p matrix, so Θ is also a matrix, of size n×p.

Suppose instead of considering the multivariate vectors y^(i) all at once, we instead compute each variable y_j^(i) separately for each j = 1, ..., p. In this case, we have p individual linear models, of the form

y_j^(i) = θ_j^T x^(i),  j = 1, ..., p.

(So here, each θ_j ∈ R^n.) How do the parameters from these p independent least squares problems compare to the multivariate solution?

Answer: This time, we construct a set of vectors

y⃗_j = [ y_j^(1)
        y_j^(2)
           ⋮
        y_j^(m) ],   j = 1, ..., p.

Then our j-th linear model can be solved by the least squares solution

θ_j = (X^T X)^{-1} X^T y⃗_j.

If we line up our θ_j, we see that we have the following equation:

[θ_1 θ_2 ··· θ_p] = [ (X^T X)^{-1} X^T y⃗_1  (X^T X)^{-1} X^T y⃗_2 ··· (X^T X)^{-1} X^T y⃗_p ]
                  = (X^T X)^{-1} X^T [y⃗_1 y⃗_2 ··· y⃗_p]
                  = (X^T X)^{-1} X^T Y
                  = Θ.

Thus, our p individual least squares problems give the exact same solution as the multivariate least squares.
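A short Octave/MATLAB sketch of this equivalence on random synthetic data (dimensions and names are illustrative only): the multivariate closed form and the p column-by-column least squares fits produce the same parameter matrix.

% Multivariate least squares vs. p separate least squares problems.
m = 40; n = 4; p = 3;
X = randn(m, n);                     % design matrix
Y = randn(m, p);                     % target matrix

Theta = (X' * X) \ (X' * Y);         % multivariate closed form, n x p

Theta_cols = zeros(n, p);
for j = 1:p
  Theta_cols(:, j) = (X' * X) \ (X' * Y(:, j));   % j-th individual model
end

disp(norm(Theta - Theta_cols))       % should be numerically zero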

Naive Bayes

In this problem, we look at maximum likelihood parameter estimation using the naive Bayes assumption. Here, the input features x_j, j = 1, ..., n to our model are discrete, binary-valued variables, so x_j ∈ {0, 1}. We call x = [x_1 x_2 ··· x_n]^T the input vector. For each training example, our output target is a single binary value y ∈ {0, 1}. Our model is then parameterized by φ_{j|y=0} = p(x_j = 1|y = 0), φ_{j|y=1} = p(x_j = 1|y = 1), and φ_y = p(y = 1). We model the joint distribution of (x, y) according to

p(y) = (φ_y)^y (1 − φ_y)^{1−y}

p(x|y = 0) = Π_{j=1}^n p(x_j|y = 0) = Π_{j=1}^n (φ_{j|y=0})^{x_j} (1 − φ_{j|y=0})^{1−x_j}

p(x|y = 1) = Π_{j=1}^n p(x_j|y = 1) = Π_{j=1}^n (φ_{j|y=1})^{x_j} (1 − φ_{j|y=1})^{1−x_j}

Find the joint likelihood function ℓ(ϕ) = log Π_{i=1}^m p(x^(i), y^(i); ϕ) in terms of the model parameters given above. Here, ϕ represents the entire set of parameters {φ_y, φ_{j|y=0}, φ_{j|y=1}, j = 1, ..., n}.

Answer:

ℓ(ϕ) = log Π_{i=1}^m p(x^(i), y^(i); ϕ)
     = log Π_{i=1}^m p(x^(i)|y^(i); ϕ) p(y^(i); ϕ)
     = log Π_{i=1}^m [ Π_{j=1}^n p(x_j^(i)|y^(i); ϕ) ] p(y^(i); ϕ)
     = Σ_{i=1}^m [ log p(y^(i); ϕ) + Σ_{j=1}^n log p(x_j^(i)|y^(i); ϕ) ]
     = Σ_{i=1}^m [ y^(i) log φ_y + (1 − y^(i)) log(1 − φ_y)
                   + Σ_{j=1}^n ( x_j^(i) log φ_{j|y^(i)} + (1 − x_j^(i)) log(1 − φ_{j|y^(i)}) ) ]
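As a sanity check (not part of the original solution), the following Octave/MATLAB sketch evaluates the joint log-likelihood both directly from the model definition and via the summation form derived above, on random binary data with arbitrary illustrative parameter values; the two numbers should agree.

% Compare the direct definition of l(phi) with the derived summation form.
m = 20; n = 5;
X = double(rand(m, n) > 0.5);        % binary features, m x n
y = double(rand(m, 1) > 0.5);        % binary labels, m x 1
phi_y  = 0.4;                        % illustrative parameter values
phi_j0 = 0.3 * ones(n, 1);           % phi_{j|y=0}
phi_j1 = 0.7 * ones(n, 1);           % phi_{j|y=1}

ll_direct = 0;                       % log prod_i p(x(i), y(i); phi)
ll_sum = 0;                          % derived summation form
for i = 1:m
  if y(i) == 1, phi_j = phi_j1; else, phi_j = phi_j0; end
  px = prod(phi_j'.^X(i,:) .* (1 - phi_j').^(1 - X(i,:)));   % p(x(i)|y(i))
  py = phi_y^y(i) * (1 - phi_y)^(1 - y(i));                  % p(y(i))
  ll_direct = ll_direct + log(px * py);
  ll_sum = ll_sum + y(i) * log(phi_y) + (1 - y(i)) * log(1 - phi_y) ...
           + sum(X(i,:)' .* log(phi_j) + (1 - X(i,:)') .* log(1 - phi_j));
end
disp(abs(ll_direct - ll_sum))        % should be numerically zero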

Show that the parameters which maximize the likelihood function are the same as those given in the lecture notes; i.e., that

φ_{j|y=0} = Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 0} / Σ_{i=1}^m 1{y^(i) = 0}

φ_{j|y=1} = Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 1} / Σ_{i=1}^m 1{y^(i) = 1}

φ_y = Σ_{i=1}^m 1{y^(i) = 1} / m.

Answer: The only terms in ℓ(ϕ) which have non-zero gradient with respect to φ_{j|y=0} are those which include φ_{j|y^(i)}. Therefore,

∇_{φ_{j|y=0}} ℓ(ϕ) = ∇_{φ_{j|y=0}} Σ_{i=1}^m [ x_j^(i) log φ_{j|y^(i)} + (1 − x_j^(i)) log(1 − φ_{j|y^(i)}) ]
  = ∇_{φ_{j|y=0}} Σ_{i=1}^m [ x_j^(i) log(φ_{j|y=0}) 1{y^(i) = 0} + (1 − x_j^(i)) log(1 − φ_{j|y=0}) 1{y^(i) = 0} ]
  = Σ_{i=1}^m [ x_j^(i) (1/φ_{j|y=0}) 1{y^(i) = 0} − (1 − x_j^(i)) (1/(1 − φ_{j|y=0})) 1{y^(i) = 0} ].

Setting ∇_{φ_{j|y=0}} ℓ(ϕ) = 0 gives

0 = Σ_{i=1}^m [ x_j^(i) (1/φ_{j|y=0}) 1{y^(i) = 0} − (1 − x_j^(i)) (1/(1 − φ_{j|y=0})) 1{y^(i) = 0} ]
  = Σ_{i=1}^m [ x_j^(i) (1 − φ_{j|y=0}) 1{y^(i) = 0} − (1 − x_j^(i)) φ_{j|y=0} 1{y^(i) = 0} ]
  = Σ_{i=1}^m (x_j^(i) − φ_{j|y=0}) 1{y^(i) = 0}
  = Σ_{i=1}^m x_j^(i) · 1{y^(i) = 0} − φ_{j|y=0} Σ_{i=1}^m 1{y^(i) = 0}
  = Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 0} − φ_{j|y=0} Σ_{i=1}^m 1{y^(i) = 0}.

We then arrive at our desired result

φ_{j|y=0} = Σ_{i=1}^m 1{x_j^(i) = 1 ∧ y^(i) = 0} / Σ_{i=1}^m 1{y^(i) = 0}.

The solution for φ_{j|y=1} proceeds in an identical manner.

To solve for φ_y,

∇_{φ_y} ℓ(ϕ) = ∇_{φ_y} Σ_{i=1}^m [ y^(i) log φ_y + (1 − y^(i)) log(1 − φ_y) ]
             = Σ_{i=1}^m [ y^(i) (1/φ_y) − (1 − y^(i)) (1/(1 − φ_y)) ].

Then setting ∇_{φ_y} ℓ(ϕ) = 0 gives us

0 = Σ_{i=1}^m [ y^(i) (1/φ_y) − (1 − y^(i)) (1/(1 − φ_y)) ]
  = Σ_{i=1}^m [ y^(i) (1 − φ_y) − (1 − y^(i)) φ_y ]
  = Σ_{i=1}^m y^(i) − Σ_{i=1}^m φ_y.

Therefore,

φ_y = Σ_{i=1}^m 1{y^(i) = 1} / m.
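In code, these maximum likelihood estimates are just indicator counts. A small Octave/MATLAB sketch on random binary data (all names and values are illustrative only):

% Naive Bayes maximum likelihood estimates via counting.
m = 200; n = 6;
X = double(rand(m, n) > 0.5);        % binary feature matrix, m x n
y = double(rand(m, 1) > 0.5);        % binary labels, m x 1

phi_y  = sum(y == 1) / m;                          % p(y = 1)
phi_j1 = sum(X(y == 1, :) == 1, 1) / sum(y == 1);  % phi_{j|y=1}, 1 x n
phi_j0 = sum(X(y == 0, :) == 1, 1) / sum(y == 0);  % phi_{j|y=0}, 1 x n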

Consider making a prediction on some new data point x using the most likely class estimate generated by the naive Bayes algorithm. Show that the hypothesis returned by naive Bayes is a linear classifier; i.e., if p(y = 0|x) and p(y = 1|x) are the class probabilities returned by naive Bayes, show that there exists some θ ∈ R^{n+1} such that

p(y = 1|x) ≥ p(y = 0|x) if and only if θ^T [1; x] ≥ 0.

(Assume θ_0 is an intercept term.)

Answer:

p(y = 1|x) ≥ p(y = 0|x)

⟺ p(y = 1|x) / p(y = 0|x) ≥ 1

⟺ [ Π_{j=1}^n p(x_j|y = 1) ] p(y = 1) / ( [ Π_{j=1}^n p(x_j|y = 0) ] p(y = 0) ) ≥ 1

⟺ [ Π_{j=1}^n (φ_{j|y=1})^{x_j} (1 − φ_{j|y=1})^{1−x_j} ] φ_y / ( [ Π_{j=1}^n (φ_{j|y=0})^{x_j} (1 − φ_{j|y=0})^{1−x_j} ] (1 − φ_y) ) ≥ 1

⟺ Σ_{j=1}^n [ x_j log( φ_{j|y=1} / φ_{j|y=0} ) + (1 − x_j) log( (1 − φ_{j|y=1}) / (1 − φ_{j|y=0}) ) ] + log( φ_y / (1 − φ_y) ) ≥ 0

⟺ Σ_{j=1}^n x_j log( (φ_{j|y=1})(1 − φ_{j|y=0}) / ( (φ_{j|y=0})(1 − φ_{j|y=1}) ) ) + Σ_{j=1}^n log( (1 − φ_{j|y=1}) / (1 − φ_{j|y=0}) ) + log( φ_y / (1 − φ_y) ) ≥ 0

⟺ θ^T [1; x] ≥ 0,

where

θ_0 = Σ_{j=1}^n log( (1 − φ_{j|y=1}) / (1 − φ_{j|y=0}) ) + log( φ_y / (1 − φ_y) )

θ_j = log( (φ_{j|y=1})(1 − φ_{j|y=0}) / ( (φ_{j|y=0})(1 − φ_{j|y=1}) ) ),   j = 1, ..., n.
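The following Octave/MATLAB sketch (again only an illustrative check, with arbitrary parameter values) builds θ from the naive Bayes parameters exactly as above and verifies that the linear rule θ^T [1; x] ≥ 0 agrees with directly comparing p(x|y = 1)p(y = 1) against p(x|y = 0)p(y = 0).

% Naive Bayes as a linear classifier: theta from the phi parameters.
n = 6;
phi_y  = 0.4;                        % illustrative parameter values
phi_j1 = 0.2 + 0.6 * rand(1, n);     % phi_{j|y=1}
phi_j0 = 0.2 + 0.6 * rand(1, n);     % phi_{j|y=0}

theta0 = sum(log((1 - phi_j1) ./ (1 - phi_j0))) + log(phi_y / (1 - phi_y));
theta  = log((phi_j1 .* (1 - phi_j0)) ./ (phi_j0 .* (1 - phi_j1)));   % 1 x n

x = double(rand(1, n) > 0.5);        % a test input

linear_rule = (theta0 + theta * x') >= 0;                             % theta^T [1; x] >= 0
p1 = prod(phi_j1.^x .* (1 - phi_j1).^(1 - x)) * phi_y;                % p(x|y=1) p(y=1)
p0 = prod(phi_j0.^x .* (1 - phi_j0).^(1 - x)) * (1 - phi_y);          % p(x|y=0) p(y=0)
direct_rule = (p1 >= p0);

disp([linear_rule, direct_rule])     % the two indicators should agree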

Exponential family and the geometric distribution

Consider the geometric distribution parameterized by φ:

p(y; φ) = (1 − φ)^{y−1} φ,  y = 1, 2, 3, ....

Show that the geometric distribution is in the exponential family, and give b(y), η, T(y), and a(η).

Answer:

p(y; φ) = (1 − φ)^{y−1} φ
        = exp[ log (1 − φ)^{y−1} + log φ ]
        = exp[ (y − 1) log(1 − φ) + log φ ]
        = exp[ y log(1 − φ) − log( (1 − φ)/φ ) ].

Then

b(y) = 1
η = log(1 − φ)
T(y) = y
a(η) = log( (1 − φ)/φ ) = log( e^η / (1 − e^η) ),

where the last line follows because η = log(1 − φ) ⟹ e^η = 1 − φ ⟹ φ = 1 − e^η.
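A quick numerical check of this exponential-family form, with arbitrary illustrative values of φ and y (Octave/MATLAB):

% Verify p(y; phi) = b(y) * exp(eta * T(y) - a(eta)) for the geometric distribution.
phi = 0.3;                           % illustrative parameter
y = 4;                               % illustrative observation

p_direct = (1 - phi)^(y - 1) * phi;

eta   = log(1 - phi);
b_y   = 1;
T_y   = y;
a_eta = log(exp(eta) / (1 - exp(eta)));   % equals log((1 - phi) / phi)

p_expfam = b_y * exp(eta * T_y - a_eta);
disp(abs(p_direct - p_expfam))       % should be numerically zero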

Consider performing regression using a GLM model with a geometric response variable. What is the canonical response function for the family? You may use the fact that the mean of a geometric distribution is given by 1/φ.

Answer:

g(η) = E[y; φ] = 1/φ = 1/(1 − e^η).

For a training set {(x^(i), y^(i)); i = 1, ..., m}, let the log-likelihood of an example be log p(y^(i)|x^(i); θ). By taking the derivative of the log-likelihood with respect to θ_j, derive the stochastic gradient ascent rule for learning using a GLM model with geometric responses y and the canonical response function.

Answer: The log-likelihood of an example (x^(i), y^(i)) is defined as ℓ_i(θ) = log p(y^(i)|x^(i); θ). To derive the stochastic gradient ascent rule, use the results from previous parts and the standard GLM assumption that η = θ^T x.

ℓ_i(θ) = log[ exp( θ^T x^(i) · y^(i) − log( e^{θ^T x^(i)} / (1 − e^{θ^T x^(i)}) ) ) ]
        = log[ exp( θ^T x^(i) · y^(i) − log( 1 / (e^{−θ^T x^(i)} − 1) ) ) ]
        = θ^T x^(i) · y^(i) + log( e^{−θ^T x^(i)} − 1 ),

so

∂ℓ_i(θ)/∂θ_j = x_j^(i) y^(i) + ( e^{−θ^T x^(i)} / (e^{−θ^T x^(i)} − 1) ) (−x_j^(i))
             = x_j^(i) y^(i) − ( 1 / (1 − e^{θ^T x^(i)}) ) x_j^(i)
             = ( y^(i) − 1 / (1 − e^{θ^T x^(i)}) ) x_j^(i).

Thus the stochastic gradient ascent update rule should be

θ_j := θ_j + α ∂ℓ_i(θ)/∂θ_j,

which is

θ_j := θ_j + α ( y^(i) − 1 / (1 − e^{θ^T x^(i)}) ) x_j^(i).
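To make the update concrete, here is a minimal Octave/MATLAB sketch of a single stochastic gradient ascent step under this rule; the training example, current parameters, and learning rate are arbitrary illustrative values (chosen so that η = θ^T x < 0, as required for a valid geometric distribution).

% One stochastic gradient ascent step for the geometric-response GLM.
x_i = [1; 0.5; -1.2];                % one training input (first entry is the intercept term)
y_i = 3;                             % its geometric response, a positive integer
theta = [-1; 0.1; 0.2];              % current parameters (theta' * x_i < 0 here)
alpha = 0.01;                        % learning rate

h = 1 / (1 - exp(theta' * x_i));     % canonical response, E[y] = 1 / (1 - e^eta)
theta = theta + alpha * (y_i - h) * x_i;   % update all components of theta at once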

CS229, Public Course

Problem Set #2: Kernels, SVMs, and Theory

Kernel ridge regression

In contrast to ordinary least squares which has a cost function

J(θ) = (1/2) Σ_{i=1}^m (θ^T x^(i) − y^(i))^2,

we can also add a term that penalizes large weights in θ. In ridge regression, our least squares cost is regularized by adding a term λ‖θ‖^2, where λ > 0 is a fixed (known) constant (regularization will be discussed at greater length in an upcoming course lecture). The ridge regression cost function is then

J(θ) = (1/2) Σ_{i=1}^m (θ^T x^(i) − y^(i))^2 + (λ/2) ‖θ‖^2.

Use the vector notation described in class to find a closed-form expression for the value of θ which minimizes the ridge regression cost function.

Suppose that we want to use kernels to implicitly represent our feature vectors in a high-dimensional (possibly infinite dimensional) space. Using a feature mapping φ, the ridge regression cost function becomes

J(θ) = (1/2) Σ_{i=1}^m (θ^T φ(x^(i)) − y^(i))^2 + (λ/2) ‖θ‖^2.

Making a prediction on a new input x_new would now be done by computing θ^T φ(x_new). Show how we can use the "kernel trick" to obtain a closed form for the prediction on the new input without ever explicitly computing φ(x_new). You may assume that the parameter vector θ can be expressed as a linear combination of the input feature vectors; i.e., θ = Σ_{i=1}^m α_i φ(x^(i)) for some set of parameters α_i. [Hint: You may find the following identity useful:

(λI + BA)^{-1} B = B (λI + AB)^{-1}.

If you want, you can try to prove this as well, though this is not required for the problem.]
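The identity in the hint is easy to check numerically before proving it. A quick Octave/MATLAB sketch with random matrices (the dimensions are arbitrary; note the two identity matrices have different sizes when A and B are not square):

% Numerical check of (lambda*I + B*A)^(-1) * B = B * (lambda*I + A*B)^(-1).
lambda = 0.1;
p = 4; q = 6;                        % arbitrary dimensions
B = randn(p, q);
A = randn(q, p);

lhs = (lambda * eye(p) + B * A) \ B;      % p x q
rhs = B / (lambda * eye(q) + A * B);      % p x q
disp(norm(lhs - rhs))                % should be numerically zero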

ℓ2 norm soft margin SVMs

In class, we saw that if our data is not linearly separable, then we need to modify our support vector machine algorithm by introducing an error margin that must be minimized. Specifically, the formulation we have looked at is known as the ℓ1 norm soft margin SVM. In this problem we will consider an alternative method, known as the ℓ2 norm soft margin SVM. This new algorithm is given by the following optimization problem (notice that the slack penalties are now squared):

min_{w,b,ξ}  (1/2)‖w‖^2 + (C/2) Σ_{i=1}^m ξ_i^2
s.t.  y^(i)(w^T x^(i) + b) ≥ 1 − ξ_i,   i = 1, ..., m.

Notice that we have dropped the ξ_i ≥ 0 constraint in the ℓ2 problem. Show that these non-negativity constraints can be removed. That is, show that the optimal value of the objective will be the same whether or not these constraints are present.

What is the Lagrangian of the ℓ2 soft margin SVM optimization problem?

Minimize the Lagrangian with respect to w, b, and ξ by taking the following gradients: ∇_w L, ∂L/∂b, and ∇_ξ L, and then setting them equal to 0. Here, ξ = [ξ_1, ξ_2, ..., ξ_m]^T.

What is the dual of the ℓ2 soft margin SVM optimization problem?

SVM with Gaussian kernel

Consider the task of training a support vector machine using the Gaussian kernel K(x, z) = exp(−‖x − z‖^2/τ^2). We will show that as long as there are no two identical points in the training set, we can always find a value for the bandwidth parameter τ such that the SVM achieves zero training error.

Recall from class that the decision function learned by the support vector machine can be written as

f(x) = Σ_{i=1}^m α_i y^(i) K(x^(i), x) + b.

Assume that the training data {(x^(1), y^(1)), ..., (x^(m), y^(m))} consists of points which are separated by at least a distance of ε; that is, ‖x^(j) − x^(i)‖ ≥ ε for any i ≠ j. Find values for the set of parameters {α_1, ..., α_m, b} and Gaussian kernel width τ such that x^(i) is correctly classified, for all i = 1, ..., m. [Hint: Let α_i = 1 for all i and b = 0. Now notice that for y ∈ {−1, +1} the prediction on x^(i) will be correct if |f(x^(i)) − y^(i)| < 1, so find a value of τ that satisfies this inequality for all i.]
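To make the decision function concrete, the sketch below evaluates f(x) = Σ_i α_i y^(i) K(x^(i), x) + b with the Gaussian kernel, using the hint's choice α_i = 1 and b = 0; the training data here is random and purely illustrative (Octave/MATLAB).

% Evaluate the SVM decision function with a Gaussian kernel.
m = 10; n = 2;
Xtrain = randn(m, n);                     % illustrative training inputs
ytrain = 2 * (rand(m, 1) > 0.5) - 1;      % labels in {-1, +1}
alpha = ones(m, 1);                       % hint: alpha_i = 1 for all i
b = 0;                                    % hint: b = 0
tau = 0.1;                                % some kernel bandwidth

x = randn(1, n);                          % a point to classify
K = exp(-sum((Xtrain - repmat(x, m, 1)).^2, 2) / tau^2);   % K(x(i), x) for all i
f = sum(alpha .* ytrain .* K) + b         % decision value; predict sign(f)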

Suppose we run an SVM with slack variables using the parameter τ you found in part (a). Will the resulting classifier necessarily obtain zero training error? Why or why not? A short explanation (without proof) will suffice.

Suppose we run the SMO algorithm to train an SVM with slack variables, under the conditions stated above, using the value of τ you picked in the previous part, and using some arbitrary value of C (which you do not know beforehand). Will this necessarily result in a classifier that achieves zero training error? Why or why not? Again, a short explanation is sufficient.

Naive Bayes and SVMs for Spam Classification

In this question you'll look into the Naive Bayes and Support Vector Machine algorithms for a spam classification problem. However, instead of implementing the algorithms yourself, you'll use a freely available machine learning library. There are many such libraries available, with different strengths and weaknesses, but for this problem you'll use the WEKA machine learning package, available at

http://www.cs.waikato.ac.nz/ml/weka/.

WEKA implements many standard machine learning algorithms, is written in Java, and has both a GUI and a command line interface. It is not the best library for very large-scale data sets, but it is very nice for playing around with many different algorithms on medium size problems.

You can download and install WEKA by following the instructions given on the website above. To use it from the command line, you first need to install a Java runtime environment, then add the weka.jar file to your CLASSPATH environment variable. Finally, you can call WEKA using the command:

java <classifier> -t <training file> -T <test file>

For example, you can run the Naive Bayes classifier (using the multinomial event model) on our provided spam data set by running the command:

java weka.classifiers.bayes.NaiveBayesMultinomial -t spamtrain1000.arff -T spamtest.arff

The spam classification dataset in the q4/ directory was provided courtesy of Christian Shelton (cshelton@). Each example corresponds to a particular email, and each feature corresponds to a particular word. For privacy reasons we have removed the actual words themselves from the dataset, and instead label the features generically as f1, f2, etc. However, the dataset is from a real spam classification task, so the results demonstrate the performance of these algorithms on a real-world problem. The q4/ directory actually contains several different training files, named spamtrain50.arff, spamtrain100.arff, etc (the ".arff" format is the default format used by WEKA), each containing the corresponding number of training examples. The
