
Robust Fine-tuning of Zero-shot Models via Variance Reduction

Beier Zhu    Jiequan Cui    Hanwang Zhang

Nanyang Technological University

arXiv:2411.06966v1 [cs.CV] 11 Nov 2024

beier002@e.ntu.edu.sg, hanwangzhang@.sg

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Abstract

When fine-tuning zero-shot models like CLIP, our desideratum is for the fine-tuned model to excel in both in-distribution (ID) and out-of-distribution (OOD) data. Recently, ensemble-based models (ESM) have been shown to offer significant robustness improvement, while preserving high ID accuracy. However, our study finds that ESMs do not solve the ID-OOD trade-offs: they achieve peak performance for ID and OOD accuracy at different mixing coefficients. When optimized for OOD accuracy, the ensemble model exhibits a noticeable decline in ID accuracy, and vice versa. In contrast, we propose a sample-wise ensembling technique that can simultaneously attain the best ID and OOD accuracy without the trade-offs. Specifically, we construct a Zero-Shot Failure (ZSF) set containing training samples incorrectly predicted by the zero-shot model. For each test sample, we calculate its distance to the ZSF set and assign a higher weight to the fine-tuned model in the ensemble if the distance is small. We term our method Variance Reduction Fine-tuning (VRF), as it effectively reduces the variance in ensemble predictions, thereby decreasing residual error. On ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5-2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. VRF achieves similarly large robustness gains (0.9-3.1 pp) on other distribution shift benchmarks. Codes are available at /BeierZhu/VRF.

1 Introduction

To ensure the reliability of machine learning systems, it is essential to develop models that can generalize to unseen, out-of-distribution environments. Large pre-trained models such as CLIP [20] and ALIGN [10] have recently shown remarkable robustness against challenging distribution shifts. However, it is widely acknowledged that these improvements in robustness are most pronounced in the zero-shot setting, while conventional fine-tuning of these models often compromises robustness when compared to zero-shot performance [28, 15, 14]. This phenomenon is known as the ID-OOD trade-offs, i.e., improving performance on in-distribution (ID) data can sometimes lead to decreased performance on out-of-distribution (OOD) data [12, 25].

In recent years, ensemble-based models (ESMs) have demonstrated significant success in addressing the ID-OOD dilemma [17, 28, 14, 31]. Specifically, denote the input as x, the zero-shot model as p(y|x; θ_zs), and the fine-tuned model as p(y|x; θ_ft). Existing ESMs typically employ the output-space ensemble (OSE) [14, 31], which outputs

p(y|x; θ_ose) = α·p(y|x; θ_ft) + (1−α)·p(y|x; θ_zs),

and the weight-space ensemble (WSE) [28, 17], which outputs

p(y|x; θ_wse) = p(y|x; αθ_ft + (1−α)θ_zs),

where α ∈ [0, 1]. Compared to fine-tuned models, ESMs offer significant accuracy enhancements under distribution shift, while maintaining high ID accuracy.

However, ESMs cannot fully address the ID-OOD trade-offs. In Figure 1(a), by varying the mixing coefficient α, we plot the ID-OOD frontier curves (pink line) for the CLIP ViT-B/16 model on ImageNet [3] (ID) and five derived distribution-shifted datasets (OOD): ImageNet-V2 [21], ImageNet-R [7], ImageNet-A [9], ImageNet-Sketch [27] and ObjectNet [1]. We find that the ensemble model achieves its optimal ID and OOD performance at different α values: the best ID accuracy is achieved at α = 0.5 and the best OOD accuracy is obtained at α = 0.3. When the ensemble model reaches its optimal value for OOD, the performance on ID decreases by 3.6% relative to its peak. Similarly, when the ensemble model is optimized for ID, the performance on OOD decreases by 1.6% relative to its best value; the ID-OOD trade-offs still persist for ESMs. This raises a natural question:

Figure 1: (a) ID-OOD frontier curves for the CLIP ViT-B/16 model on the ID (ImageNet) and OOD (IN-{V2, R, A, Sketch} and ObjectNet) datasets, obtained by varying the mixing coefficient α. The ensemble model achieves its best ID and OOD performance at different α values. Our method VRF simultaneously attains the best ID and OOD accuracy, outperforming the ensemble by 3.6% on ID and 1.6% on OOD at its optimal performance points. (b) Relationship between the accuracy ratio Acc_ft/Acc_zs and the distance d(x) to the ZSF set: the ratio decreases as d(x) increases.

Can ensemble-based models simultaneously attain the best ID and OOD accuracy?

In this paper, we affirmatively answer this question by proposing a sample-wise ensembling technique, dubbed variance reduction fine-tuning (VRF). This method is motivated by an empirical finding illustrated in Fig. 1(b). For each sample in the training dataset, if the fine-tuned model correctly predicts the label while the zero-shot model fails, we collect its feature representation in the fine-tuned model into the zero-shot failure (ZSF) set. We then measure the distance d(x) of each test sample x to the ZSF set. Based on this distance, test samples are grouped into bins, and we compute the accuracy ratio Acc_ft/Acc_zs within each bin (see Section C.7); the ratio decreases as d(x) increases. Intuitively, the closer a sample is to the ZSF set, the more likely it is that the zero-shot model makes incorrect predictions, whereas the fine-tuned model is more likely to be accurate, leading to a higher weight for the fine-tuned model, and vice versa.

As depicted by the orange diamond in Fig. 1(a), by leveraging the sample-wise weights, our VRF simultaneously attains the best ID and OOD accuracy. In Section 5, we show that on a variety of different models and tasks, our VRF approach consistently outperforms the existing fine-tuning and ensembling methods, including linear probing, end-to-end fine-tuning, LP-FT [15], OSE and WSE [28]. Specifically, on ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5-2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. Furthermore, in Section 4, we justify our approach by demonstrating that it effectively minimizes the variance of the ensemble models, resulting in reduced residual error.


2 Related Work

Mitigating ID-OOD trade-offs. Improving performance on in-distribution data can sometimes lead to a decrease in performance on out-of-distribution data, and vice versa. This phenomenon is known as the ID-OOD trade-offs. Xie et al. [29] leverage auxiliary information as outputs of auxiliary tasks to pre-train a model to reduce OOD error. Khani and Liang [12] show that self-training on large amounts of unlabeled data can mitigate such trade-offs by removing spurious features. Tripuraneni et al. [25] tackle this problem by learning representations that are robust across diverse tasks. However, these methods usually necessitate additional unlabeled data or auxiliary information. In contrast, our VRF is a straightforward variation of fine-tuning that does not require any extra data.

Robust fine-tuning of zero-shot models. Vision-language models like CLIP [20] have demonstrated outstanding improvements in robustness. It is commonly acknowledged that conventional fine-tuning methods often compromise robustness when compared to zero-shot performance. Therefore, enhancing downstream robustness has been the focus of subsequent works [15, 28, 5, 19, 6, 30]. Kumar et al. [15] show that a two-stage process of linear probing followed by full fine-tuning can alleviate feature distortion, leading to stronger OOD performance without sacrificing ID accuracy. Wortsman et al. [28] propose a method of weight interpolation between the zero-shot and the fine-tuned models to improve both ID and OOD accuracy. Goyal et al. [5] demonstrate that mimicking the contrastive pre-training objective to fine-tune the zero-shot models outperforms tuning via the traditional supervised cross-entropy loss. However, the ID-OOD trade-offs are still observed with these methods. In contrast, our method VRF can simultaneously achieve the best ID and OOD accuracy.

3 Methods

3.1 Set Up

Task: Consider a classification setting where the goal is to map an instance x ∈ X to a label y ∈ Y = [K]. We are provided with a zero-shot model f(·; θ_zs), a downstream dataset D = {(x_i, y_i)}, and a fine-tuned model f(·; θ_ft) which is trained on D. Below, we outline the implementation of the zero-shot and fine-tuned models:

• Zero-shot models (ZS): We investigate CLIP models [20] as our zero-shot models. CLIP models are pre-trained using image-text pairs {(x_1, t_1), ..., (x_B, t_B)} from the Internet. The objective of the CLIP models is to train a visual encoder Φ_v and a text encoder Φ_t such that the cosine similarity ⟨Φ_v(x_i), Φ_t(t_i)⟩ is maximized relative to unmatched pairs. CLIP models perform zero-shot inference for K classes by matching x with potential class names {c_1, ..., c_K}. Concretely, by extending the class name c_k to a prompt t_k = "a photo of a {c_k}", the zero-shot model outputs the score (logit) for class k as f(x; θ_zs)_k = ⟨Φ_v(x), Φ_t(t_k)⟩. The predicted probabilities can be calculated using the softmax function, i.e., p(y|x; θ_zs) = softmax(f(x; θ_zs))_y. The model outputs the label as pred(f(x; θ_zs)) = argmax_i f(x; θ_zs)_i.

• Linear classifiers (LC): We learn a linear classifier on top of the visual embedding Φ_v(x) while freezing the visual encoder Φ_v. The parameters of the linear classifier are optimized to minimize the cross-entropy loss on D.

• End-to-end fine-tuning (E2E-FT): We update both the linear classifier and the visual encoder by minimizing the cross-entropy loss on D.

• Linear probing then full fine-tuning (LP-FT) [15]: We employ a two-phase fine-tuning approach: initially training a linear classifier, followed by full fine-tuning starting from the solution derived from training the linear classifier.

• Output-space ensemble (OSE): We perform linear interpolation of the outputs between a zero-shot model and a fine-tuned model (e.g., E2E-FT or LP-FT):

p(y|x; θ_ose) = α·p(y|x; θ_ft) + (1−α)·p(y|x; θ_zs), where α ∈ [0, 1].  (1)

• Weight-space ensemble (WSE) [28]: We combine the weights through linear interpolation between a zero-shot model and a fine-tuned model (a minimal code sketch of both ensembling schemes follows this list):

p(y|x; θ_wse) = p(y|x; αθ_ft + (1−α)θ_zs), where α ∈ [0, 1].  (2)
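To make the two ensembling schemes concrete, here is a minimal PyTorch sketch, assuming both models share identical state_dict keys (as weight interpolation requires); the function names are ours for illustration, not the authors' released code.

```python
import copy
import torch
import torch.nn.functional as F

def zero_shot_logits(img_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Zero-shot scores f(x; theta_zs)_k = <Phi_v(x), Phi_t(t_k)> on l2-normalized features."""
    return F.normalize(img_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T

def output_space_ensemble(logits_ft: torch.Tensor, logits_zs: torch.Tensor, alpha: float) -> torch.Tensor:
    """OSE, Eq. (1): interpolate the predicted probabilities."""
    return alpha * F.softmax(logits_ft, dim=-1) + (1 - alpha) * F.softmax(logits_zs, dim=-1)

def weight_space_ensemble(model_zs: torch.nn.Module, model_ft: torch.nn.Module, alpha: float) -> torch.nn.Module:
    """WSE, Eq. (2): interpolate the parameters, then run a single forward pass."""
    sd_zs, sd_ft = model_zs.state_dict(), model_ft.state_dict()
    merged = {k: alpha * sd_ft[k] + (1 - alpha) * sd_zs[k] for k in sd_ft}
    model_wse = copy.deepcopy(model_ft)
    model_wse.load_state_dict(merged)
    return model_wse
```

Note the structural difference: OSE requires two forward passes per test sample, whereas WSE merges the weights once and then costs a single forward pass.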


Algorithm 1 Variance Reduction Fine-tuning

1: Given: Training dataset D, a zero-shot model f_zs and a fine-tuned model f_ft.
2: Build zero-shot failure set V using Eq. (3). ▷ Step 1: Identification
3: Inference Stage:
4: Given a test sample x, compute its feature representation v, zero-shot prediction p_zs(y|x) and fine-tuned model prediction p_ft(y|x).
5: Compute the k-NN distance to V as d(x) using Eq. (4). ▷ Step 2: Distance Calculation
6: Compute the weight ω(x) using Eq. (6).
7: Return p_vrf(y|x) using Eq. (5). ▷ Step 3: Sample-Wise Ensembling

3.2 Variance Reduction Fine-tuning

We now present our proposed method, VRF, which consists of three steps. First, before the inference stage, we collect the Zero-Shot Failure (ZSF) set. Second, for a given test sample, we calculate its distance to the ZSF set. Third, we assign weights to combine predictions from the zero-shot and fine-tuned models based on this distance.

Step 1 (Identification). For each x_i in the training dataset D, if the fine-tuned model correctly predicts the label while the zero-shot model fails, we collect its feature representation v_i = Φ_v(x_i; θ_ft) from the fine-tuned model to form the zero-shot failure set V. Specifically, V is defined as:

V = {v_i s.t. y_i = pred(f_ft(x_i)) and y_i ≠ pred(f_zs(x_i))}.  (3)

Here, f_zs(·) and f_ft(·) are used to denote f(·; θ_zs) and f(·; θ_ft), respectively, for simplicity.
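A minimal sketch of this identification step, assuming `loader` yields (image, label) batches, both models return logits, and `feat_ft` returns the fine-tuned visual features; these helper names are illustrative, not from the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def build_zsf_set(loader, model_zs, model_ft, feat_ft) -> torch.Tensor:
    """Collect fine-tuned features v_i of training samples where the
    fine-tuned model is correct but the zero-shot model fails (Eq. 3)."""
    feats = []
    for x, y in loader:
        pred_zs = model_zs(x).argmax(dim=-1)
        pred_ft = model_ft(x).argmax(dim=-1)
        mask = (pred_ft == y) & (pred_zs != y)        # FT right, ZS wrong
        if mask.any():
            feats.append(F.normalize(feat_ft(x[mask]), dim=-1))
    return torch.cat(feats, dim=0)                    # the ZSF set V, shape (|V|, d)
```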

Step 2 (Distance Calculation). The key empirical observation underpinning VRF is that in the vicinity of the ZSF set, a test sample typically exhibits lower zero-shot accuracy (Acc_zs) and higher fine-tuned accuracy (Acc_ft); the ratio Acc_ft/Acc_zs decreases as the distance from the sample to the ZSF set increases. In this paper, we adopt non-parametric density estimation using nearest neighbors [24] to measure the distance of a test sample to the dataset V. Specifically, during inference, we derive the feature representation v of a test sample x, and compute the ℓ2 distances ∥v − v_i∥_2 w.r.t. v_i ∈ V. We reorder V according to increasing ℓ2 distance and denote the ordered sequence as V′ = (v_(1), v_(2), ..., v_(|V|)). The distance of x to V is defined as the ℓ2 distance to the k-th nearest neighbor (k-NN), i.e.,

d(x; V, k) = ∥v − v_(k)∥_2.  (4)

If there is no ambiguity, we use d(x) to denote d(x; V, k) for readability. Since the features in CLIP models are ℓ2-normalized, d(x) is bounded within [0, 2].
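A direct tensor implementation of Eq. (4), assuming `V` stacks the ZSF features row-wise (a Faiss-based version for large |V|, as used in the experiments, is sketched in Section 5.1):

```python
import torch

def knn_distance(v: torch.Tensor, V: torch.Tensor, k: int) -> torch.Tensor:
    """d(x; V, k): l2 distance from each test feature to its k-th
    nearest neighbor in the ZSF set V (Eq. 4)."""
    # v: (B, d) normalized test features; V: (|V|, d) ZSF features
    dists = torch.cdist(v, V)                    # (B, |V|) pairwise l2 distances
    return dists.kthvalue(k, dim=-1).values      # distance to the k-th NN
```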

Step 3 (Sample-Wise Ensembling). We implement sample-wise output-space ensembling in the form:

p_vrf(y|x) = ω(x)·p_ft(y|x) + (1 − ω(x))·p_zs(y|x),  (5)

where ω(x) ∈ (0, 1). We use the distance to the ZSF set d(x) to determine the weight ω. As shown by the blue line in Fig. 2, a smaller value of d(x) corresponds to a larger Acc_ft/Acc_zs ratio, and vice versa. Therefore, we set the weight ω to be inversely proportional to d(x). Given that ω is bounded between 0 and 1, we employ a sigmoid function σ(·):

ω(x) = σ(−(d(x) − a)/b),  (6)

where a, b > 0 are two hyper-parameters swept using the accuracy on the ID validation set.

Figure 2: Relationship between the distance to the ZSF set d(x), the accuracy ratio Acc_ft/Acc_zs (blue), and the weight ω(x) with a = 1.5, b = 0.6 (green).

We visualize the weight curve in green in Fig. 2, setting a = 1.5 and b = 0.6. We summarize the whole process in Algorithm 1.
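Putting Eqs. (5) and (6) together, the inference-time combination reduces to a few lines; the defaults a = 1.5 and b = 0.6 follow the values shown in Fig. 2, but in general both are swept on the ID validation set.

```python
import torch

def vrf_weight(d: torch.Tensor, a: float = 1.5, b: float = 0.6) -> torch.Tensor:
    """omega(x) = sigmoid(-(d(x) - a)/b), Eq. (6): small d(x) -> large FT weight."""
    return torch.sigmoid(-(d - a) / b)

def vrf_predict(p_ft: torch.Tensor, p_zs: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Sample-wise output-space ensemble, Eq. (5)."""
    w = vrf_weight(d).unsqueeze(-1)   # (B, 1), broadcast over the K classes
    return w * p_ft + (1 - w) * p_zs
```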

4 Justification

We now prove that our VRF can effectively reduce the variance of the combined model, resulting in lower errors compared to ensembling using a constant mixing coefficient.


4.1 Background

The outputs of a well-trained classifier are expected to approximate the a posteriori class distribution. Apart from the irreducible error (Bayes error), the residual error of a classifier can be broken down into bias and variance components. Specifically, for a test sample x, the probability output of a classifier parameterized by θ can be expressed as:

p(y|x; θ) = P(y|x) + β_y + η_y(x),  (7)

where P(y|x) denotes the true a posteriori distribution, β_y is the label bias of p(y|x; θ), which is independent of the input x, and η_y(x) is the residual error related to the given input x. In this study, we primarily attribute the residual error to the variance term (i.e., β_y = 0), as the label bias problem in foundation models has been effectively addressed by Zhu et al. [31]. Tumer et al. [26] have proven that the expected residual error E is given by:

E = V[η_y(x)] / s,  (8)

where s is a constant factor related to the derivative of the true a posteriori distribution and is independent of the trained model, and V[η_y(x)] is the variance.

4.2 Variance Reduction Fine-tuning Leads to Lower Residual Error

Let us shift our focus to the effects of combining the zero-shot and fine-tuned models. Let g_zs(·) and g_ft(·) be two functions that produce weights for ensembling the models. Subject to the constraint that g_zs(x) + g_ft(x) = 1, the output of the combined classifier is obtained by:

p_vrf(y|x) = g_zs(x)·p_zs(y|x) + g_ft(x)·p_ft(y|x) = P(y|x) + g_zs(x)·η_zs(x) + g_ft(x)·η_ft(x),  (9)

where the last two terms constitute the residual error η_vrf(x), and we omit the subscript y of η for readability. The variance of η_vrf(x) can be expressed as:

V[η_vrf(x)] = g_zs(x)²·V[η_zs(x)] + g_ft(x)²·V[η_ft(x)].  (10)

Here, we assume the residual errors are independent, following the assumption of previous studies of CLIP fine-tuning [14, 31]. We further explore the case of correlated residual errors in Section B.

According to Eq. (8), the reduction in variance can be readily translated into a reduction in error rates. To obtain the smallest variance V[η_vrf(x)], we minimize Eq. (10) using a Lagrange multiplier to enforce the constraint that g_zs(x) + g_ft(x) = 1, and obtain the optimal weight function g_ft as:

g_ft(x) = V[η_zs(x)] / (V[η_zs(x)] + V[η_ft(x)]) = (1 + V[η_ft(x)]/V[η_zs(x)])⁻¹.  (11)

Since the optimal weight behaves as g_ft(x) ∝ d(x)⁻¹ (a smaller distance d(x) corresponds to a larger relative zero-shot variance, as shown in Fig. 2), we design the weighting function g_ft(x) = ω(x) to be inversely related to d(x), as in Eq. (6).
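For completeness, a short derivation of Eq. (11): it is the standard constrained minimization of Eq. (10), with λ the Lagrange multiplier and the variances abbreviated as V_zs and V_ft.

```latex
\mathcal{L} = g_{zs}^2 V_{zs} + g_{ft}^2 V_{ft} + \lambda\,(1 - g_{zs} - g_{ft}), \qquad
\frac{\partial \mathcal{L}}{\partial g_{zs}} = 2 g_{zs} V_{zs} - \lambda = 0, \quad
\frac{\partial \mathcal{L}}{\partial g_{ft}} = 2 g_{ft} V_{ft} - \lambda = 0
\;\Rightarrow\; g_{zs} V_{zs} = g_{ft} V_{ft}
\;\Rightarrow\; g_{ft}(x) = \frac{V_{zs}}{V_{zs} + V_{ft}}
= \left(1 + \frac{V_{ft}}{V_{zs}}\right)^{-1}.
```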

5 Experiments

5.1 Experimental Setup

Datasets with distribution shifts. We provide the results for ImageNet [3] and its five derived distribution shifts: (1) ImageNet-V2 (IN-V2) [21]: test images sampled a decade after the original ImageNet. (2) ImageNet-R (IN-R) [7]: contains renditions (e.g., art, cartoons, graffiti). (3) ImageNet-Sketch (IN-Sketch) [27]: consists of sketches rather than natural photos. (4) ImageNet-A (IN-A) [9]: collects real-world images that are misclassified by ResNet models. (5) ObjectNet [1]: a test set featuring objects with diverse backgrounds, rotations, and imaging viewpoints. We extend our analysis to include a standard distribution shift benchmark [15, 14, 4]: CIFAR-10 → STL-10, where the ID is CIFAR-10 [13] and the OOD is STL-10 [2]. We removed the "monkey" class from STL-10, as it does not exist in CIFAR-10. In addition, we also consider subpopulation shifts, where the ID data contains a few sub-categories and the OOD data comprises different sub-categories within the same parent category. Following [15, 14], we adopt the Entity-30 dataset [23], which aims to categorize images into one of 30 entity categories, such as "vehicle" and "insect".


Table 1: Accuracy of various methods on ImageNet and derived distribution shifts for CLIP ViT-B/32. Δ denotes the improvement of VRF over the output-space ensemble.

Method | IN | IN-V2 | IN-Sketch | IN-A | IN-R | ObjectNet | Avg shifts
Zero-shot [20] | 63.3 | 55.9 | 42.3 | 31.5 | 69.3 | 43.5 | 48.5
Linear classifier [20] | 75.4 | 63.4 | 38.8 | 26.1 | 58.7 | 41.5 | 45.7
E2E-FT [28] | 76.2 | 64.2 | 38.7 | 21.0 | 57.1 | 40.1 | 44.2
+ Weight-space ensemble [28] | 77.9 | 67.2 | 45.1 | 28.8 | 66.4 | 45.1 | 50.5
+ Output-space ensemble | 77.3 | 66.0 | 44.2 | 27.1 | 68.4 | 44.4 | 50.0
+ VRF (ours) | 77.6 | 66.7 | 47.0 | 29.2 | 70.9 | 46.3 | 52.0
Δ | +0.3 | +0.7 | +2.8 | +2.1 | +2.5 | +1.9 | +2.0
LP-FT [15] | 76.9 | 64.8 | 39.9 | 25.7 | 69.9 | 42.6 | 48.6
+ Weight-space ensemble [28] | 78.0 | 67.0 | 44.8 | 31.2 | 65.8 | 46.1 | 51.0
+ Output-space ensemble | 77.8 | 66.3 | 44.0 | 29.5 | 66.2 | 45.5 | 50.3
+ VRF (ours) | 77.8 | 66.7 | 46.1 | 31.0 | 70.0 | 46.3 | 51.8
Δ | +0.0 | +0.4 | +2.1 | +1.5 | +3.8 | +0.8 | +1.5

Table 2: Accuracy of various methods on ImageNet and derived distribution shifts for CLIP ViT-B/16. Δ denotes the improvement of VRF over the output-space ensemble.

Method | IN | IN-V2 | IN-Sketch | IN-A | IN-R | ObjectNet | Avg shifts
Zero-shot [20] | 68.3 | 61.9 | 48.3 | 50.1 | 77.6 | 54.2 | 58.4
Linear classifier [20] | 79.3 | 69.1 | 44.8 | 44.3 | 66.7 | 51.1 | 55.2
E2E-FT [28] | 81.3 | 70.6 | 45.1 | 36.6 | 65.6 | 50.5 | 53.7
+ Weight-space ensemble [28] | 82.5 | 73.1 | 51.6 | 47.6 | 75.1 | 55.7 | 60.6
+ Output-space ensemble | 82.2 | 72.0 | 50.6 | 46.8 | 76.7 | 54.9 | 60.2
+ VRF (ours) | 82.3 | 72.1 | 52.9 | 48.4 | 78.7 | 56.4 | 61.8
Δ | +0.1 | +0.1 | +2.3 | +1.6 | +2.0 | +1.5 | +1.6
LP-FT [15] | 81.5 | 70.7 | 46.7 | 41.4 | 66.4 | 52.4 | 55.5
+ Weight-space ensemble [28] | 82.4 | 73.0 | 51.5 | 50.6 | 74.2 | 56.6 | 61.2
+ Output-space ensemble | 82.1 | 72.3 | 50.9 | 50.9 | 74.9 | 55.7 | 60.9
+ VRF (ours) | 82.1 | 72.3 | 52.9 | 51.2 | 78.8 | 57.2 | 62.4
Δ | +0.0 | +0.0 | +2.0 | +0.3 | +3.9 | +1.5 | +1.5


Baselines. We adopt two models: CLIP ViT-B/32 and a larger ViT-B/16 from OpenAI [20]. The default model used in ablation studies is the CLIP ViT-B/16. In addition to the zero-shot models, we compare our approach against five standard methods for adapting pre-trained models: (1) linear classifier [20], (2) E2E-FT, (3) LP-FT [15], (4) OSE, and (5) WSE [28]. The descriptions of these methods have been included in Section 3.1.

Implementation details. When fine-tuning E2E-FT models, we adhere to Wortsman et al. [28], employing the default PyTorch AdamW optimizer for 10 epochs with weight decay of 0.1 and a cosine-annealing learning rate schedule with 500 warm-up steps. Unless specified, we use a learning rate of 3×10^−5 and gradient clipping at norm 1. When fine-tuning LP-FT, we first adopt the settings of Wortsman et al. [28] to train the linear classifier, then fully fine-tune the models at a learning rate of 1×10^−5. To efficiently perform k-NN search, we use the Faiss library [11]. Denoting the size of the ZSF set as |V|, we scale k according to a percentage p% of the sample set, i.e., k = ⌊p% · |V|⌋. In this paper, p is set to 0.1 (i.e., 0.1% of |V|), a value consistent with the default setting proposed by Sun et al. [24]. Note that all the hyperparameters, e.g., α, a, b, are searched using the accuracy on the in-distribution (ID) validation set. Derived distribution shift datasets are only for evaluation and not for hyperparameter sweeps. See Appendix C.1 for further experimental details.
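A minimal sketch of this k-NN search with Faiss, assuming `V` holds the ℓ2-normalized ZSF features as a float32 numpy array; the helper names are ours. Note that IndexFlatL2 returns squared ℓ2 distances, hence the square root at the end.

```python
import math
import numpy as np
import faiss

def build_zsf_index(V: np.ndarray) -> faiss.IndexFlatL2:
    """Exact l2 index over the ZSF set (V must be float32, shape (|V|, d))."""
    index = faiss.IndexFlatL2(V.shape[1])
    index.add(V)
    return index

def zsf_distance(index: faiss.IndexFlatL2, feats: np.ndarray, p: float = 0.1) -> np.ndarray:
    """d(x) for a batch of test features, with k = floor(p% * |V|)."""
    k = max(1, math.floor(p / 100.0 * index.ntotal))
    sq_dists, _ = index.search(feats, k)     # squared l2 distances, sorted ascending
    return np.sqrt(sq_dists[:, -1])          # l2 distance to the k-th NN, Eq. (4)
```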


Table 3: Accuracy of various methods on CIFAR-10 → STL-10 and Entity-30. Δ denotes the improvement of VRF over the output-space ensemble.

(a) CLIP ViT-B/32

Method | CIFAR→STL ID | CIFAR→STL OOD | Entity-30 ID | Entity-30 OOD
Zero-shot [20] | 88.3 | 97.1 | 65.2 | 66.5
Linear classifier | 95.0 | 96.6 | 93.3 | 68.1
E2E-FT [28] | 97.9 | 93.5 | 94.4 | 65.1
+ WSE [28] | 98.2 | 95.7 | 94.6 | 68.8
+ OSE | 97.9 | 95.9 | 94.4 | 66.4
+ VRF (ours) | 97.8 | 97.3 | 94.5 | 69.5
Δ | -0.1 | +1.4 | +0.1 | +3.1
LP-FT [15] | 97.9 | 95.0 | 94.6 | 67.7
+ WSE [28] | 98.1 | 96.4 | 94.8 | 68.8
+ OSE | 98.1 | 96.4 | 94.7 | 68.5
+ VRF (ours) | 98.1 | 97.5 | 94.8 | 70.1
Δ | +0.0 | +1.1 | +0.1 | +1.6

(b) CLIP ViT-B/16

Method | CIFAR→STL ID | CIFAR→STL OOD | Entity-30 ID | Entity-30 OOD
Zero-shot [20] | 90.1 | 98.4 | 68.3 | 68.2
Linear classifier | 95.8 | 97.7 | 95.3 | 69.6
E2E-FT [28] | 98.6 | 96.1 | 96.9 | 68.2
+ WSE [28] | 98.7 | 97.8 | 97.2 | 71.9
+ OSE | 98.6 | 96.6 | 97.0 | 71.5
+ VRF (ours) | 98.6 | 98.4 | 97.0 | 72.7
Δ | +0.0 | +1.8 | +0.0 | +1.2
LP-FT [15] | 98.5 | 96.3 | 96.9 | 68.8
+ WSE [28] | 98.7 | 97.9 | 97.3 | 72.1
+ OSE | 98.6 | 97.7 | 97.2 | 71.8
+ VRF (ours) | 98.6 | 98.6 | 97.4 | 72.9
Δ | +0.0 | +0.9 | +0.2 | +1.1

Figure 3: ID-OOD frontier curves obtained by varying the mixing coefficient α for the CLIP ViT-B/16. (a) CIFAR-10 (ID) and STL-10 (OOD) results; (b) Entity-30 results. Panels (a.1, b.1) show the frontier curves; panels (a.2, b.2) show the accuracy ratio versus d(x).

5.2 Results

ImageNet and its five shifted distribution results. In Tables 1 and 2, we report the ID-OOD accuracies of fine-tuning baselines for CLIP ViT-B/32 and CLIP ViT-B/16 models, respectively. For OSE and WSE, we choose the mixing coefficient α with the highest ID validation accuracy. To enhance clarity in the results, we denote the improvement over OSE as Δ in Tables 1 and 2. We observe that our VRF boosts the accuracy of fine-tuned models, including ensembling baseline models, across five ImageNet distribution-shifted datasets, while maintaining or improving the ImageNet in-distribution performance. For instance, in Table 1, when ensembling with the E2E-FT model, our VRF outperforms the OSE model by 2.0% on distribution shifts while increasing the ID accuracy by 0.3%. Compared to WSE models, our VRF achieves a delta of 1.2% on distribution shifts, while maintaining ID performance within 0.2%, as shown in the E2E-FT part of Table 2.

CIFAR-10 → STL-10 and Entity-30 results. We report the accuracy of various methods in Table 3 (a, b). We note that fine-tuning baselines can enhance the accuracy on CIFAR-10 compared to the zero-shot models. However, this improvement comes at the expense of reduced accuracy on STL-10. For instance, E2E-FT leads to a decrease of approximately 3.6% in STL-10 accuracy, as shown in Table 3(a). Previous ensemble methods can mitigate the degradation to some extent, but the STL-10 performance still lags behind the zero-shot performance; e.g., in Table 3(b), the accuracy of E2E-FT + WSE is 97.8% whereas the zero-shot performance is 98.4%. In contrast, our VRF simultaneously improves accuracy on both CIFAR-10 and STL-10. Similarly, for Entity-30, our VRF can further improve the OOD performance when compared to the WSE and OSE methods.

In addition, we plot the ID-OOD frontier curves in Figure 3 (a.1 & b.1). Similar to the results on ImageNet (Figure 1(a)), the ensemble model achieves its best ID and OOD performance at different α values. For instance, on the CIFAR-10 benchmark, when the ensemble model attains its optimal ID value at α = 0.7, the OOD performance decreases by 2.0% relative to its peak.


Table 4: Results of VRF for linear-probed models using CLIP ViT-B/16 models.

Method | ImageNet ID | ImageNet OOD | CIFAR-10 ID | CIFAR-10 OOD | Entity-30 ID | Entity-30 OOD
Zero-shot classifier [20] | 68.3 | 58.4 | 90.1 | 98.4 | 68.3 | 68.2
Linear classifier | 79.3 | 55.2 | 95.8 | 97.7 | 95.3 | 69.6
WSE/OSE | 79.9 | 57.8 | 95.8 | 97.7 | 95.5 | 70.5
VRF (ours) | 79.8 | 58.5 | 95.8 | 98.4 | 95.4 | 71.4

Conversely, when the optimal OOD value is reached at α = 0.3, the performance on ID diminishes by 2.7% from its best. In contrast, our VRF simultaneously attains the best ID and OOD performance.

We also analyze the relationship between the ratio Acc_ft/Acc_zs and d(x) in Figure 3 (a.2 & b.2). Consistent with the findings from ImageNet (Figure 1(b)), we observe that the ratio decreases as d(x) increases, which further supports our design of assigning a higher weight to fine-tuned models when d(x) is small.
