




CS229, Public Course
Problem Set #1: Supervised Learning
Newton's method for computing least squares
In this problem, we will prove that if we use Newton's method to solve the least squares optimization problem, then we only need one iteration to converge to θ∗.
Find the Hessian of the cost function J(θ) = (1/2) Σ_{i=1}^m (θ^T x^{(i)} − y^{(i)})^2.
Show that the first iteration of Newton's method gives us θ⋆ = (X^T X)^{−1} X^T ~y, the solution to our least squares problem.
Locally-weighted logistic regression
In this problem you will implement a locally-weighted version of logistic regression, where we weight different training examples differently according to the query point. The locally-weighted logistic regression problem is to maximize
ℓ(θ) = −(λ/2) θ^T θ + Σ_{i=1}^m w^{(i)} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ].
The −(λ/2) θ^T θ here is what is known as a regularization parameter, which will be discussed in a future lecture, but which we include here because it is needed for Newton's method to perform well on this task. For the entirety of this problem you can use the value λ = 0.0001.
Using this definition, the gradient of ℓ(θ) is given by
∇_θ ℓ(θ) = X^T z − λθ
where z ∈ R^m is defined by
z_i = w^{(i)} (y^{(i)} − h_θ(x^{(i)}))
and the Hessian is given by
H = X^T D X − λI
where D ∈ R^{m×m} is a diagonal matrix with
D_{ii} = −w^{(i)} h_θ(x^{(i)}) (1 − h_θ(x^{(i)})).
For the sake of this problem you can just use the above formulas, but you should try to derive these results for yourself as well.
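For reference, these two quantities drive the standard Newton-Raphson update for maximizing ℓ(θ) (stated here only as a reminder, since the problem asks you to implement it below):
θ := θ − H^{−1} ∇_θ ℓ(θ),
repeated until the gradient is sufficiently close to zero.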
Given a query point x, we choose to compute the weights
w^{(i)} = exp( −‖x − x^{(i)}‖² / (2τ²) ).
Much like the locally weighted linear regression that was discussed in class, this weighting scheme gives more weight to the "nearby" points when predicting the class of a new example.
Implement the Newton-Raphson algorithm for optimizing ℓ(θ) for a new query point x, and use this to predict the class of x.
The q2/ directory contains data and code for this problem. You should implement the y = lwlr(Xtrain, ytrain, x, tau) function in the lwlr.m file. This function takes as input the training set (the Xtrain and ytrain matrices, in the form described in the class notes), a new query point x and the weight bandwidth tau. Given this input the function should 1) compute weights w^{(i)} for each training example, using the formula above, 2) maximize ℓ(θ) using Newton's method, and finally 3) output y = 1{h_θ(x) > 0.5} as the prediction.
We provide two additional functions that might help. The [Xtrain, ytrain] = loaddata; function will load the matrices from files in the data/ folder. The function plotlwlr(Xtrain, ytrain, tau, resolution) will plot the resulting classifier (assuming you have properly implemented lwlr.m). This function evaluates the locally weighted logistic regression classifier over a large grid of points and plots the resulting prediction as blue (predicting y = 0) or red (predicting y = 1). Depending on how fast your lwlr function is, creating the plot might take some time, so we recommend debugging your code with resolution = 50; and later increasing it to at least 200 to get a better idea of the decision boundary.
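For concreteness, a minimal Octave/MATLAB sketch of how these pieces might be wired together (the helper names and the lwlr signature are taken from the problem statement above; the query point and bandwidth below are made-up values):

% hypothetical driver script for this problem; assumes lwlr.m has been implemented
[Xtrain, ytrain] = loaddata;            % load the training matrices from data/
x = [0.5; 0.5];                          % an arbitrary query point
tau = 0.5;                               % one of the suggested bandwidths
y = lwlr(Xtrain, ytrain, x, tau);        % 0/1 prediction at the query point
plotlwlr(Xtrain, ytrain, tau, 50);       % coarse plot while debugging; raise to 200 later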
Evaluate the system with a variety of different bandwidth parameters τ. In particular, try τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0. How does the classification boundary change when varying this parameter? Can you predict what the decision boundary of ordinary (unweighted) logistic regression would look like?
Multivariate least squares
So far in class, we have only considered cases where our target variable y is a scalar value. Suppose that instead of trying to predict a single output, we have a training set with multiple outputs for each example:
{(x^{(i)}, y^{(i)}), i = 1, ..., m},  x^{(i)} ∈ R^n, y^{(i)} ∈ R^p.
Thus for each training example, y^{(i)} is vector-valued, with p entries. We wish to use a linear model to predict the outputs, as in least squares, by specifying the parameter matrix Θ in
y = Θ^T x,
where Θ ∈ R^{n×p}.
The cost function for this case is
J(Θ) = (1/2) Σ_{i=1}^m Σ_{j=1}^p ( (Θ^T x^{(i)})_j − y_j^{(i)} )^2.
Write J(Θ) in matrix-vector notation (i.e., without using any summations). [Hint: Start with the m×n design matrix
X = [ (x^{(1)})^T ; (x^{(2)})^T ; ... ; (x^{(m)})^T ]   (one training input per row)
and the m×p target matrix
Y = [ (y^{(1)})^T ; (y^{(2)})^T ; ... ; (y^{(m)})^T ]   (one target vector per row)
and then work out how to express J(Θ) in terms of these matrices.]
Find the closed form solution for Θ which minimizes J(Θ). This is the equivalent of the normal equations for the multivariate case.
Suppose that instead of considering the multivariate vectors y^{(i)} all at once, we instead compute each variable y_j^{(i)} separately for each j = 1, ..., p. In this case, we have p individual linear models, of the form
y_j^{(i)} = θ_j^T x^{(i)},  j = 1, ..., p.
(So here, each θ_j ∈ R^n.) How do the parameters from these p independent least squares problems compare to the multivariate solution?
Naive Bayes
In this problem, we look at maximum likelihood parameter estimation using the naive Bayes assumption. Here, the input features x_j, j = 1, ..., n to our model are discrete, binary-valued variables, so x_j ∈ {0, 1}. We let x = [x_1 x_2 ··· x_n]^T be the input vector. For each training example, our output target is a single binary value y ∈ {0, 1}. Our model is then parameterized by φ_{j|y=0} = p(x_j = 1|y = 0), φ_{j|y=1} = p(x_j = 1|y = 1), and φ_y = p(y = 1). We model the joint distribution of (x, y) according to
p(y) = (φ_y)^y (1 − φ_y)^{1−y}
p(x|y = 0) = Π_{j=1}^n p(x_j|y = 0) = Π_{j=1}^n (φ_{j|y=0})^{x_j} (1 − φ_{j|y=0})^{1−x_j}
p(x|y = 1) = Π_{j=1}^n p(x_j|y = 1) = Π_{j=1}^n (φ_{j|y=1})^{x_j} (1 − φ_{j|y=1})^{1−x_j}
Find the joint likelihood function ℓ(ϕ) = log Π_{i=1}^m p(x^{(i)}, y^{(i)}; ϕ) in terms of the model parameters given above. Here, ϕ represents the entire set of parameters {φ_y, φ_{j|y=0}, φ_{j|y=1}, j = 1, ..., n}.
Show that the parameters which maximize the likelihood function are the same as those given in the lecture notes; i.e., that
φ_{j|y=0} = Σ_{i=1}^m 1{x_j^{(i)} = 1 ∧ y^{(i)} = 0} / Σ_{i=1}^m 1{y^{(i)} = 0}
φ_{j|y=1} = Σ_{i=1}^m 1{x_j^{(i)} = 1 ∧ y^{(i)} = 1} / Σ_{i=1}^m 1{y^{(i)} = 1}
φ_y = Σ_{i=1}^m 1{y^{(i)} = 1} / m.
Consider making a prediction on some new data point x using the most likely class estimate generated by the naive Bayes algorithm. Show that the hypothesis returned by naive Bayes is a linear classifier; i.e., if p(y = 0|x) and p(y = 1|x) are the class probabilities returned by naive Bayes, show that there exists some θ ∈ R^{n+1} such that
p(y = 1|x) ≥ p(y = 0|x)   if and only if   θ^T [1; x] ≥ 0.
(Assume θ_0 is an intercept term.)
Exponential family and the geometric distribution
Consider the geometric distribution parameterized by φ:
p(y; φ) = (1 − φ)^{y−1} φ,  y = 1, 2, 3, ....
Show that the geometric distribution is in the exponential family, and give b(y), η, T(y), and a(η).
Consider performing regression using a GLM model with a geometric response variable. What is the canonical response function for the family? You may use the fact that the mean of a geometric distribution is given by 1/φ.
For a training set {(x^{(i)}, y^{(i)}); i = 1, ..., m}, let the log-likelihood of an example be log p(y^{(i)}|x^{(i)}; θ). By taking the derivative of the log-likelihood with respect to θ_j, derive the stochastic gradient ascent rule for learning using a GLM model with geometric responses y and the canonical response function.
CS229, Public Course
Problem Set #1 Solutions: Supervised Learning
Newton's method for computing least squares
In this problem, we will prove that if we use Newton's method to solve the least squares optimization problem, then we only need one iteration to converge to θ∗.
Find the Hessian of the cost function J(θ) = (1/2) Σ_{i=1}^m (θ^T x^{(i)} − y^{(i)})^2.
Answer: As shown in the class notes,
∂J(θ)/∂θ_j = Σ_{i=1}^m (θ^T x^{(i)} − y^{(i)}) x_j^{(i)}.
So
∂²J(θ)/∂θ_j ∂θ_k = Σ_{i=1}^m ∂/∂θ_k [ (θ^T x^{(i)} − y^{(i)}) x_j^{(i)} ]
                 = Σ_{i=1}^m x_j^{(i)} x_k^{(i)} = (X^T X)_{jk}.
Therefore, the Hessian of J(θ) is H = X^T X. This can also be derived by simply applying rules from the lecture notes on Linear Algebra.
Show that the first iteration of Newton's method gives us θ⋆ = (X^T X)^{−1} X^T ~y, the solution to our least squares problem.
Answer: Given any θ^{(0)}, Newton's method finds θ^{(1)} according to
θ^{(1)} = θ^{(0)} − H^{−1} ∇_θ J(θ^{(0)})
        = θ^{(0)} − (X^T X)^{−1} (X^T X θ^{(0)} − X^T ~y)
        = θ^{(0)} − θ^{(0)} + (X^T X)^{−1} X^T ~y
        = (X^T X)^{−1} X^T ~y.
Therefore, no matter what θ^{(0)} we pick, Newton's method always finds θ⋆ after one iteration.
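As a quick numerical illustration of this fact, the following Octave/MATLAB snippet (random data, arbitrary sizes and starting point, purely a sketch) takes one Newton step and compares it with the normal-equations solution:

% one Newton step on J(theta) = (1/2)*sum((X*theta - y).^2) reaches the minimizer
m = 50; n = 3;
X = randn(m, n);                     % random design matrix
y = randn(m, 1);                     % random targets
theta0 = randn(n, 1);                % arbitrary starting point
g = X' * (X * theta0 - y);           % gradient of J at theta0
H = X' * X;                          % Hessian of J
theta1 = theta0 - H \ g;             % single Newton update
theta_star = (X' * X) \ (X' * y);    % normal-equations solution
disp(norm(theta1 - theta_star));     % ~0 up to numerical precision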
Locally-weighted logistic regression
In this problem you will implement a locally-weighted version of logistic regression, where we weight different training examples differently according to the query point. The locally-weighted logistic regression problem is to maximize
ℓ(θ) = −(λ/2) θ^T θ + Σ_{i=1}^m w^{(i)} [ y^{(i)} log h_θ(x^{(i)}) + (1 − y^{(i)}) log(1 − h_θ(x^{(i)})) ].
The −(λ/2) θ^T θ here is what is known as a regularization parameter, which will be discussed in a future lecture, but which we include here because it is needed for Newton's method to perform well on this task. For the entirety of this problem you can use the value λ = 0.0001.
Using this definition, the gradient of ℓ(θ) is given by
∇_θ ℓ(θ) = X^T z − λθ
where z ∈ R^m is defined by
z_i = w^{(i)} (y^{(i)} − h_θ(x^{(i)}))
and the Hessian is given by
H = X^T D X − λI
where D ∈ R^{m×m} is a diagonal matrix with
D_{ii} = −w^{(i)} h_θ(x^{(i)}) (1 − h_θ(x^{(i)})).
For the sake of this problem you can just use the above formulas, but you should try to derive these results for yourself as well.
Given a query point x, we choose to compute the weights
w^{(i)} = exp( −‖x − x^{(i)}‖² / (2τ²) ).
Much like the locally weighted linear regression that was discussed in class, this weighting scheme gives more weight to the "nearby" points when predicting the class of a new example.
Implement the Newton-Raphson algorithm for optimizing ℓ(θ) for a new query point x, and use this to predict the class of x.
The q2/ directory contains data and code for this problem. You should implement the y = lwlr(Xtrain, ytrain, x, tau) function in the lwlr.m file. This function takes as input the training set (the Xtrain and ytrain matrices, in the form described in the class notes), a new query point x and the weight bandwidth tau. Given this input the function should 1) compute weights w^{(i)} for each training example, using the formula above, 2) maximize ℓ(θ) using Newton's method, and finally 3) output y = 1{h_θ(x) > 0.5} as the prediction.
We provide two additional functions that might help. The [Xtrain, ytrain] = loaddata; function will load the matrices from files in the data/ folder. The function plotlwlr(Xtrain, ytrain, tau, resolution) will plot the resulting classifier (assuming you have properly implemented lwlr.m). This function evaluates the locally weighted logistic regression classifier over a large grid of points and plots the resulting prediction as blue (predicting y = 0) or red (predicting y = 1). Depending on how fast your lwlr function is, creating the plot might take some time, so we recommend debugging your code with resolution = 50; and later increasing it to at least 200 to get a better idea of the decision boundary.
Answer: Our implementation of lwlr.m:

function y = lwlr(X_train, y_train, x, tau)

m = size(X_train, 1);
n = size(X_train, 2);
theta = zeros(n, 1);

% compute weights
w = exp(-sum((X_train - repmat(x', m, 1)).^2, 2) / (2 * tau^2));

% perform Newton's method
g = ones(n, 1);
while (norm(g) > 1e-6)
  h = 1 ./ (1 + exp(-X_train * theta));
  g = X_train' * (w .* (y_train - h)) - 1e-4 * theta;
  H = -X_train' * diag(w .* h .* (1 - h)) * X_train - 1e-4 * eye(n);
  theta = theta - H \ g;
end

% return predicted y
y = double(x' * theta > 0);
Evaluate the system with a variety of different bandwidth parameters τ. In particular, try τ = 0.01, 0.05, 0.1, 0.5, 1.0, 5.0. How does the classification boundary change when varying this parameter? Can you predict what the decision boundary of ordinary (unweighted) logistic regression would look like?
Answer: These are the resulting decision boundaries, for the different values of τ.
[Figure: six decision-boundary plots, one for each of the bandwidths τ listed above.]
For smaller τ, the classifier appears to overfit the data set, obtaining zero training error, but outputting a sporadic looking decision boundary. As τ grows, the resulting decision boundary becomes smoother, eventually converging (in the limit as τ → ∞) to the unweighted logistic regression solution.
Multivariate least squares
So far in class, we have only considered cases where our target variable y is a scalar value. Suppose that instead of trying to predict a single output, we have a training set with multiple outputs for each example:
{(x^{(i)}, y^{(i)}), i = 1, ..., m},  x^{(i)} ∈ R^n, y^{(i)} ∈ R^p.
Thus for each training example, y^{(i)} is vector-valued, with p entries. We wish to use a linear model to predict the outputs, as in least squares, by specifying the parameter matrix Θ in
y = Θ^T x,
where Θ ∈ R^{n×p}.
The cost function for this case is
J(Θ) = (1/2) Σ_{i=1}^m Σ_{j=1}^p ( (Θ^T x^{(i)})_j − y_j^{(i)} )^2.
Write J(Θ) in matrix-vector notation (i.e., without using any summations). [Hint: Start with the m×n design matrix
X = [ (x^{(1)})^T ; (x^{(2)})^T ; ... ; (x^{(m)})^T ]   (one training input per row)
and the m×p target matrix
Y = [ (y^{(1)})^T ; (y^{(2)})^T ; ... ; (y^{(m)})^T ]   (one target vector per row)
and then work out how to express J(Θ) in terms of these matrices.]
Answer: The objective function can be expressed as
J(Θ) = (1/2) tr( (XΘ − Y)^T (XΘ − Y) ).
To see this, note that
J(Θ) = (1/2) tr( (XΘ − Y)^T (XΘ − Y) )
     = (1/2) Σ_i ( (XΘ − Y)^T (XΘ − Y) )_{ii}
     = (1/2) Σ_i Σ_j (XΘ − Y)_{ij}²
     = (1/2) Σ_{i=1}^m Σ_{j=1}^p ( (Θ^T x^{(i)})_j − y_j^{(i)} )².
Find the closed form solution for Θ which minimizes J(Θ). This is the equivalent of the normal equations for the multivariate case.
Answer: First we take the gradient of J(Θ) with respect to Θ:
∇_Θ J(Θ) = ∇_Θ (1/2) tr( (XΘ − Y)^T (XΘ − Y) )
         = (1/2) ∇_Θ tr( Θ^T X^T X Θ − Θ^T X^T Y − Y^T X Θ + Y^T Y )
         = (1/2) ∇_Θ [ tr(Θ^T X^T X Θ) − tr(Θ^T X^T Y) − tr(Y^T X Θ) + tr(Y^T Y) ]
         = (1/2) ∇_Θ [ tr(Θ^T X^T X Θ) − 2 tr(Y^T X Θ) + tr(Y^T Y) ]
         = (1/2) [ X^T X Θ + X^T X Θ − 2 X^T Y ]
         = X^T X Θ − X^T Y.
Setting this expression to zero we obtain
Θ = (X^T X)^{−1} X^T Y.
This looks very similar to the closed form solution in the univariate case, except now Y is an m×p matrix, so then Θ is also a matrix, of size n×p.
Suppose that instead of considering the multivariate vectors y^{(i)} all at once, we instead compute each variable y_j^{(i)} separately for each j = 1, ..., p. In this case, we have p individual linear models, of the form
y_j^{(i)} = θ_j^T x^{(i)},  j = 1, ..., p.
(So here, each θ_j ∈ R^n.) How do the parameters from these p independent least squares problems compare to the multivariate solution?
Answer: This time, we construct a set of vectors
~y_j = [ y_j^{(1)} ; y_j^{(2)} ; ... ; y_j^{(m)} ],  j = 1, ..., p.
Then our j-th linear model can be solved by the least squares solution
θ_j = (X^T X)^{−1} X^T ~y_j.
If we line up our θ_j, we see that we have the following equation:
[θ_1 θ_2 ··· θ_p] = [ (X^T X)^{−1} X^T ~y_1   (X^T X)^{−1} X^T ~y_2   ···   (X^T X)^{−1} X^T ~y_p ]
                  = (X^T X)^{−1} X^T [~y_1 ~y_2 ··· ~y_p]
                  = (X^T X)^{−1} X^T Y
                  = Θ.
Thus, our p individual least squares problems give the exact same solution as the multivariate least squares.
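A quick Octave/MATLAB sanity check of this equivalence on random data (sizes chosen arbitrarily; purely illustrative):

% compare the multivariate closed form with p separate least-squares fits
m = 40; n = 4; p = 3;
X = randn(m, n);
Y = randn(m, p);
Theta = (X' * X) \ (X' * Y);                      % multivariate solution, n x p
Theta_cols = zeros(n, p);
for j = 1:p
  Theta_cols(:, j) = (X' * X) \ (X' * Y(:, j));   % j-th individual model
end
disp(norm(Theta - Theta_cols));                   % ~0 up to numerical precision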
Naive Bayes
In this problem, we look at maximum likelihood parameter estimation using the naive Bayes assumption. Here, the input features x_j, j = 1, ..., n to our model are discrete, binary-valued variables, so x_j ∈ {0, 1}. We let x = [x_1 x_2 ··· x_n]^T be the input vector. For each training example, our output target is a single binary value y ∈ {0, 1}. Our model is then parameterized by φ_{j|y=0} = p(x_j = 1|y = 0), φ_{j|y=1} = p(x_j = 1|y = 1), and φ_y = p(y = 1). We model the joint distribution of (x, y) according to
p(y) = (φ_y)^y (1 − φ_y)^{1−y}
p(x|y = 0) = Π_{j=1}^n p(x_j|y = 0) = Π_{j=1}^n (φ_{j|y=0})^{x_j} (1 − φ_{j|y=0})^{1−x_j}
p(x|y = 1) = Π_{j=1}^n p(x_j|y = 1) = Π_{j=1}^n (φ_{j|y=1})^{x_j} (1 − φ_{j|y=1})^{1−x_j}
Find the joint likelihood function ℓ(ϕ) = log Π_{i=1}^m p(x^{(i)}, y^{(i)}; ϕ) in terms of the model parameters given above. Here, ϕ represents the entire set of parameters {φ_y, φ_{j|y=0}, φ_{j|y=1}, j = 1, ..., n}.
Answer:
ℓ(ϕ) = log Π_{i=1}^m p(x^{(i)}, y^{(i)}; ϕ)
     = log Π_{i=1}^m p(x^{(i)}|y^{(i)}; ϕ) p(y^{(i)}; ϕ)
     = log Π_{i=1}^m [ Π_{j=1}^n p(x_j^{(i)}|y^{(i)}; ϕ) ] p(y^{(i)}; ϕ)
     = Σ_{i=1}^m [ log p(y^{(i)}; ϕ) + Σ_{j=1}^n log p(x_j^{(i)}|y^{(i)}; ϕ) ]
     = Σ_{i=1}^m [ y^{(i)} log φ_y + (1 − y^{(i)}) log(1 − φ_y)
                   + Σ_{j=1}^n ( x_j^{(i)} log φ_{j|y^{(i)}} + (1 − x_j^{(i)}) log(1 − φ_{j|y^{(i)}}) ) ]
Show that the parameters which maximize the likelihood function are the same as those given in the lecture notes; i.e., that
φ_{j|y=0} = Σ_{i=1}^m 1{x_j^{(i)} = 1 ∧ y^{(i)} = 0} / Σ_{i=1}^m 1{y^{(i)} = 0}
φ_{j|y=1} = Σ_{i=1}^m 1{x_j^{(i)} = 1 ∧ y^{(i)} = 1} / Σ_{i=1}^m 1{y^{(i)} = 1}
φ_y = Σ_{i=1}^m 1{y^{(i)} = 1} / m.
Answer: The only terms in ℓ(ϕ) which have non-zero gradient with respect to φ_{j|y=0} are those which include φ_{j|y^{(i)}}. Therefore,
∇_{φ_{j|y=0}} ℓ(ϕ)
  = ∇_{φ_{j|y=0}} Σ_{i=1}^m [ x_j^{(i)} log φ_{j|y^{(i)}} + (1 − x_j^{(i)}) log(1 − φ_{j|y^{(i)}}) ]
  = ∇_{φ_{j|y=0}} Σ_{i=1}^m [ x_j^{(i)} log(φ_{j|y=0}) 1{y^{(i)} = 0} + (1 − x_j^{(i)}) log(1 − φ_{j|y=0}) 1{y^{(i)} = 0} ]
  = Σ_{i=1}^m [ x_j^{(i)} (1/φ_{j|y=0}) 1{y^{(i)} = 0} − (1 − x_j^{(i)}) (1/(1 − φ_{j|y=0})) 1{y^{(i)} = 0} ].
Setting ∇_{φ_{j|y=0}} ℓ(ϕ) = 0 gives
0 = Σ_{i=1}^m [ x_j^{(i)} (1/φ_{j|y=0}) 1{y^{(i)} = 0} − (1 − x_j^{(i)}) (1/(1 − φ_{j|y=0})) 1{y^{(i)} = 0} ]
  = Σ_{i=1}^m [ x_j^{(i)} (1 − φ_{j|y=0}) 1{y^{(i)} = 0} − (1 − x_j^{(i)}) φ_{j|y=0} 1{y^{(i)} = 0} ]
  = Σ_{i=1}^m ( x_j^{(i)} − φ_{j|y=0} ) 1{y^{(i)} = 0}
  = Σ_{i=1}^m x_j^{(i)} · 1{y^{(i)} = 0} − φ_{j|y=0} Σ_{i=1}^m 1{y^{(i)} = 0}
  = Σ_{i=1}^m 1{x_j^{(i)} = 1 ∧ y^{(i)} = 0} − φ_{j|y=0} Σ_{i=1}^m 1{y^{(i)} = 0}.
We then arrive at our desired result
φ_{j|y=0} = Σ_{i=1}^m 1{x_j^{(i)} = 1 ∧ y^{(i)} = 0} / Σ_{i=1}^m 1{y^{(i)} = 0}.
The solution for φ_{j|y=1} proceeds in the identical manner.
To solve for φ_y,
∇_{φ_y} ℓ(ϕ) = ∇_{φ_y} Σ_{i=1}^m [ y^{(i)} log φ_y + (1 − y^{(i)}) log(1 − φ_y) ]
             = Σ_{i=1}^m [ y^{(i)} (1/φ_y) − (1 − y^{(i)}) (1/(1 − φ_y)) ].
Then setting ∇_{φ_y} ℓ(ϕ) = 0 gives us
0 = Σ_{i=1}^m [ y^{(i)} (1/φ_y) − (1 − y^{(i)}) (1/(1 − φ_y)) ]
  = Σ_{i=1}^m [ y^{(i)} (1 − φ_y) − (1 − y^{(i)}) φ_y ]
  = Σ_{i=1}^m y^{(i)} − m φ_y.
Therefore,
φ_y = Σ_{i=1}^m 1{y^{(i)} = 1} / m.
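Since these estimators are just empirical counts, they are one line each in Octave/MATLAB; the sketch below computes them on made-up binary data (the data itself is random and only illustrative):

% maximum-likelihood naive Bayes parameters from binary data
m = 100; n = 5;
X = double(rand(m, n) > 0.5);                   % m x n binary feature matrix
y = double(rand(m, 1) > 0.5);                   % m x 1 binary labels
phi_y    = sum(y == 1) / m;                     % estimate of p(y = 1)
phi_j_y0 = sum(X(y == 0, :), 1) / sum(y == 0);  % 1 x n, estimates of p(x_j = 1 | y = 0)
phi_j_y1 = sum(X(y == 1, :), 1) / sum(y == 1);  % 1 x n, estimates of p(x_j = 1 | y = 1)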
Consider making a prediction on some new data point x using the most likely class estimate generated by the naive Bayes algorithm. Show that the hypothesis returned by naive Bayes is a linear classifier; i.e., if p(y = 0|x) and p(y = 1|x) are the class probabilities returned by naive Bayes, show that there exists some θ ∈ R^{n+1} such that
p(y = 1|x) ≥ p(y = 0|x)   if and only if   θ^T [1; x] ≥ 0.
(Assume θ_0 is an intercept term.)
Answer:
p(y = 1|x) ≥ p(y = 0|x)
⇔ p(y = 1|x) / p(y = 0|x) ≥ 1
⇔ [ Π_{j=1}^n p(x_j|y = 1) · p(y = 1) ] / [ Π_{j=1}^n p(x_j|y = 0) · p(y = 0) ] ≥ 1
⇔ [ Π_{j=1}^n (φ_{j|y=1})^{x_j} (1 − φ_{j|y=1})^{1−x_j} · φ_y ] / [ Π_{j=1}^n (φ_{j|y=0})^{x_j} (1 − φ_{j|y=0})^{1−x_j} · (1 − φ_y) ] ≥ 1
⇔ Σ_{j=1}^n [ x_j log( φ_{j|y=1} / φ_{j|y=0} ) + (1 − x_j) log( (1 − φ_{j|y=1}) / (1 − φ_{j|y=0}) ) ] + log( φ_y / (1 − φ_y) ) ≥ 0
⇔ Σ_{j=1}^n x_j log( (φ_{j|y=1})(1 − φ_{j|y=0}) / ((φ_{j|y=0})(1 − φ_{j|y=1})) ) + Σ_{j=1}^n log( (1 − φ_{j|y=1}) / (1 − φ_{j|y=0}) ) + log( φ_y / (1 − φ_y) ) ≥ 0
⇔ θ^T [1; x] ≥ 0,
where
θ_0 = Σ_{j=1}^n log( (1 − φ_{j|y=1}) / (1 − φ_{j|y=0}) ) + log( φ_y / (1 − φ_y) )
θ_j = log( (φ_{j|y=1})(1 − φ_{j|y=0}) / ((φ_{j|y=0})(1 − φ_{j|y=1})) ),  j = 1, ..., n.
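As a sanity check, the following Octave/MATLAB sketch builds θ from a made-up set of parameters and verifies that the sign of θ^T [1; x] reproduces the naive Bayes comparison of the two (unnormalized) class posteriors; all numbers below are arbitrary:

% verify the linear-classifier form of naive Bayes on arbitrary parameters
n = 4;
phi_y    = 0.3;
phi_j_y0 = [0.2 0.7 0.4 0.9];          % p(x_j = 1 | y = 0)
phi_j_y1 = [0.6 0.3 0.8 0.5];          % p(x_j = 1 | y = 1)
x = [1; 0; 1; 1];                      % an arbitrary binary input

theta0 = sum(log((1 - phi_j_y1) ./ (1 - phi_j_y0))) + log(phi_y / (1 - phi_y));
theta  = log((phi_j_y1 .* (1 - phi_j_y0)) ./ (phi_j_y0 .* (1 - phi_j_y1)))';

p1 = prod(phi_j_y1'.^x .* (1 - phi_j_y1').^(1 - x)) * phi_y;        % p(x, y = 1)
p0 = prod(phi_j_y0'.^x .* (1 - phi_j_y0').^(1 - x)) * (1 - phi_y);  % p(x, y = 0)

disp([p1 >= p0, theta0 + theta' * x >= 0]);   % the two decisions agree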
Exponential family and the geometric distribution
Consider the geometric distribution parameterized by φ:
p(y; φ) = (1 − φ)^{y−1} φ,  y = 1, 2, 3, ....
Show that the geometric distribution is in the exponential family, and give b(y), η, T(y), and a(η).
Answer:
p(y; φ) = (1 − φ)^{y−1} φ
        = exp[ log(1 − φ)^{y−1} + log φ ]
        = exp[ (y − 1) log(1 − φ) + log φ ]
        = exp[ y log(1 − φ) − log( (1 − φ)/φ ) ].
Then
b(y) = 1
η = log(1 − φ)
T(y) = y
a(η) = log( (1 − φ)/φ ) = log( e^η / (1 − e^η) ),
where the last line follows because η = log(1 − φ) ⇒ e^η = 1 − φ ⇒ φ = 1 − e^η.
Consider performing regression using a GLM model with a geometric response variable. What is the canonical response function for the family? You may use the fact that the mean of a geometric distribution is given by 1/φ.
Answer:
g(η) = E[y; φ] = 1/φ = 1 / (1 − e^η).
For a training set {(x^{(i)}, y^{(i)}); i = 1, ..., m}, let the log-likelihood of an example be log p(y^{(i)}|x^{(i)}; θ). By taking the derivative of the log-likelihood with respect to θ_j, derive the stochastic gradient ascent rule for learning using a GLM model with geometric responses y and the canonical response function.
Answer: The log-likelihood of an example (x^{(i)}, y^{(i)}) is defined as ℓ_i(θ) = log p(y^{(i)}|x^{(i)}; θ). To derive the stochastic gradient ascent rule, use the results from the previous parts and the standard GLM assumption that η = θ^T x:
ℓ_i(θ) = log[ exp( θ^T x^{(i)} · y^{(i)} − log( e^{θ^T x^{(i)}} / (1 − e^{θ^T x^{(i)}}) ) ) ]
       = log[ exp( θ^T x^{(i)} · y^{(i)} − log( 1 / (e^{−θ^T x^{(i)}} − 1) ) ) ]
       = θ^T x^{(i)} · y^{(i)} + log( e^{−θ^T x^{(i)}} − 1 ).
Then
∂ℓ_i(θ)/∂θ_j = x_j^{(i)} y^{(i)} + [ e^{−θ^T x^{(i)}} / (e^{−θ^T x^{(i)}} − 1) ] (−x_j^{(i)})
             = x_j^{(i)} y^{(i)} − [ 1 / (1 − e^{θ^T x^{(i)}}) ] x_j^{(i)}
             = ( y^{(i)} − 1 / (1 − e^{θ^T x^{(i)}}) ) x_j^{(i)}.
Thus the stochastic gradient ascent update rule should be
θ_j := θ_j + α ∂ℓ_i(θ)/∂θ_j,
which is
θ_j := θ_j + α ( y^{(i)} − 1 / (1 − e^{θ^T x^{(i)}}) ) x_j^{(i)}.
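A single-step Octave/MATLAB illustration of this rule (the parameter values, input, response, and learning rate below are arbitrary; θ^T x must be negative for the geometric mean to be well defined):

% one stochastic gradient ascent step for the geometric GLM
theta = [-0.5; -0.2];                  % current parameters
x     = [1; 0.5];                      % one training input (theta' * x < 0 here)
y     = 3;                             % its observed geometric response
alpha = 0.1;                           % learning rate
h = 1 / (1 - exp(theta' * x));         % canonical response: predicted mean E[y | x]
theta = theta + alpha * (y - h) * x;   % the update rule derived above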
CS229, Public Course
Problem Set #2: Kernels, SVMs, and Theory
Kernel ridge regression
In contrast to ordinary least squares which has a cost function
J(θ) = (1/2) Σ_{i=1}^m (θ^T x^{(i)} − y^{(i)})²,
we can also add a term that penalizes large weights in θ. In ridge regression, our least squares cost is regularized by adding a term λ‖θ‖², where λ > 0 is a fixed (known) constant (regularization will be discussed at greater length in an upcoming course lecture). The ridge regression cost function is then
J(θ) = (1/2) Σ_{i=1}^m (θ^T x^{(i)} − y^{(i)})² + (λ/2) ‖θ‖².
Use the vector notation described in class to find a closed-form expression for the value of θ which minimizes the ridge regression cost function.
Suppose that we want to use kernels to implicitly represent our feature vectors in a high-dimensional (possibly infinite dimensional) space. Using a feature mapping φ, the ridge regression cost function becomes
J(θ) = (1/2) Σ_{i=1}^m (θ^T φ(x^{(i)}) − y^{(i)})² + (λ/2) ‖θ‖².
Making a prediction on a new input x_new would now be done by computing θ^T φ(x_new). Show how we can use the "kernel trick" to obtain a closed form for the prediction on the new input without ever explicitly computing φ(x_new). You may assume that the parameter vector θ can be expressed as a linear combination of the input feature vectors; i.e., θ = Σ_{i=1}^m α_i φ(x^{(i)}) for some set of parameters α_i. [Hint: You may find the following identity useful:
(λI + BA)^{−1} B = B (λI + AB)^{−1}.
If you want, you can try to prove this as well, though this is not required for the problem.]
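The hinted identity is easy to check numerically; a throwaway Octave/MATLAB verification on random matrices (sizes and λ chosen arbitrarily) might look like this:

% check (lambda*I + B*A)^(-1) * B == B * (lambda*I + A*B)^(-1)
lambda = 0.5;
A = randn(3, 5);
B = randn(5, 3);
lhs = (lambda * eye(5) + B * A) \ B;    % left-hand side, 5 x 3
rhs = B / (lambda * eye(3) + A * B);    % right-hand side, 5 x 3
disp(norm(lhs - rhs));                  % ~0 up to numerical precision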
ℓ2 norm soft margin SVMs
In class, we saw that if our data is not linearly separable, then we need to modify our support vector machine algorithm by introducing an error margin that must be minimized. Specifically, the formulation we have looked at is known as the ℓ1 norm soft margin SVM. In this problem we will consider an alternative method, known as the ℓ2 norm soft margin SVM. This new algorithm is given by the following optimization problem (notice that the slack penalties are now squared):
min_{w,b,ξ}  (1/2)‖w‖² + (C/2) Σ_{i=1}^m ξ_i²
s.t.  y^{(i)} (w^T x^{(i)} + b) ≥ 1 − ξ_i,  i = 1, ..., m.
Notice that we have dropped the ξ_i ≥ 0 constraint in the ℓ2 problem. Show that these non-negativity constraints can be removed. That is, show that the optimal value of the objective will be the same whether or not these constraints are present.
What is the Lagrangian of the ℓ2 soft margin SVM optimization problem?
Minimize the Lagrangian with respect to w, b, and ξ by taking the following gradients: ∇_w L, ∂L/∂b, and ∇_ξ L, and then setting them equal to 0. Here, ξ = [ξ_1, ξ_2, ..., ξ_m]^T.
What is the dual of the ℓ2 soft margin SVM optimization problem?
SVM with Gaussian kernel
Consider the task of training a support vector machine using the Gaussian kernel K(x, z) = exp(−‖x − z‖²/τ²). We will show that as long as there are no two identical points in the training set, we can always find a value for the bandwidth parameter τ such that the SVM achieves zero training error.
Recall from class that the decision function learned by the support vector machine can be written as
f(x) = Σ_{i=1}^m α_i y^{(i)} K(x^{(i)}, x) + b.
Assume that the training data {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})} consists of points which are separated by at least a distance of ε; that is, ‖x^{(j)} − x^{(i)}‖ ≥ ε for any i ≠ j. Find values for the set of parameters {α_1, ..., α_m, b} and Gaussian kernel width τ such that x^{(i)} is correctly classified, for all i = 1, ..., m. [Hint: Let α_i = 1 for all i and b = 0. Now notice that for y ∈ {−1, +1} the prediction on x^{(i)} will be correct if |f(x^{(i)}) − y^{(i)}| < 1, so find a value of τ that satisfies this inequality for all i.]
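To get a feel for the hint, here is a small Octave/MATLAB experiment that evaluates f at every training point with α_i = 1 and b = 0, using a toy data set and an arbitrarily small τ (all values below are made up):

% decision values f(x^(i)) = sum_j y^(j) * K(x^(j), x^(i)) with alpha_j = 1, b = 0
X = [0 0; 1 0; 0 1; 2 2];              % toy training inputs, all distinct
y = [1; -1; 1; -1];                    % toy labels in {-1, +1}
tau = 0.1;                             % small bandwidth
m = size(X, 1);
f = zeros(m, 1);
for i = 1:m
  d2 = sum((X - repmat(X(i, :), m, 1)).^2, 2);   % squared distances to x^(i)
  f(i) = sum(y .* exp(-d2 / tau^2));
end
disp(max(abs(f - y)));                 % well below 1, so every point is classified correctly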
Suppose we run an SVM with slack variables using the parameter τ you found in part (a). Will the resulting classifier necessarily obtain zero training error? Why or why not? A short explanation (without proof) will suffice.
Suppose we run the SMO algorithm to train an SVM with slack variables, under the conditions stated above, using the value of τ you picked in the previous part, and using some arbitrary value of C (which you do not know beforehand). Will this necessarily result in a classifier that achieves zero training error? Why or why not? Again, a short explanation is sufficient.
Naive Bayes and SVMs for Spam Classification
In this question you'll look into the Naive Bayes and Support Vector Machine algorithms for a spam classification problem. However, instead of implementing the algorithms yourself, you'll use a freely available machine learning library. There are many such libraries available, with different strengths and weaknesses, but for this problem you'll use the WEKA machine learning package, available at
http://www.cs.waikato.ac.nz/ml/weka/.
WEKA implements many standard machine learning algorithms, is written in Java, and has both a GUI and a command line interface. It is not the best library for very large-scale data sets, but it is very nice for playing around with many different algorithms on medium size problems.
You can download and install WEKA by following the instructions given on the website above. To use it from the command line, you first need to install a Java runtime environment, then add the weka.jar file to your CLASSPATH environment variable. Finally, you can call WEKA using the command:
java <classifier> -t <training file> -T <test file>
For example, you can run the Naive Bayes classifier (using the multinomial event model) on our provided spam data set with the command:
java weka.classifiers.bayes.NaiveBayesMultinomial -t spamtrain1000.arff -T spamtest.arff
The spam classification dataset in the q4/ directory was provided courtesy of Christian Shelton (cshelton@). Each example corresponds to a particular email, and each feature corresponds to a particular word. For privacy reasons we have removed the actual words themselves from the dataset, and instead label the features generically as f1, f2, etc. However, the dataset is from a real spam classification task, so the results demonstrate the performance of these algorithms on a real-world problem. The q4/ directory actually contains several different training files, named spamtrain50.arff, spamtrain100.arff, etc (the ".arff" format is the default format used by WEKA), each containing the corresponding number of training examples. The