Robust Fine-tuning of Zero-shot Models via Variance Reduction

Beier Zhu    Jiequan Cui    Hanwang Zhang
Nanyang Technological University
arXiv:2411.06966v1 [cs.CV] 11 Nov 2024
beier002@e.ntu.edu.sg, hanwangzhang@ntu.edu.sg
Abstract

When fine-tuning zero-shot models like CLIP, our desideratum is for the fine-tuned model to excel in both in-distribution (ID) and out-of-distribution (OOD) data. Recently, ensemble-based models (ESM) have been shown to offer significant robustness improvements, while preserving high ID accuracy. However, our study finds that ESMs do not solve the ID-OOD trade-offs: they achieve peak performance for ID and OOD accuracy at different mixing coefficients. When optimized for OOD accuracy, the ensemble model exhibits a noticeable decline in ID accuracy, and vice versa. In contrast, we propose a sample-wise ensembling technique that can simultaneously attain the best ID and OOD accuracy without the trade-offs. Specifically, we construct a Zero-Shot Failure (ZSF) set containing training samples incorrectly predicted by the zero-shot model. For each test sample, we calculate its distance to the ZSF set and assign a higher weight to the fine-tuned model in the ensemble if the distance is small. We term our method Variance Reduction Fine-tuning (VRF), as it effectively reduces the variance in ensemble predictions, thereby decreasing residual error. On ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5-2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. VRF achieves similarly large robustness gains (0.9-3.1 pp) on other distribution shift benchmarks. Code is available at https://github.com/BeierZhu/VRF.
1 Introduction

To ensure the reliability of machine learning systems, it is essential to develop models that can generalize to unseen, out-of-distribution environments. Large pre-trained models such as CLIP [20] and ALIGN [10] have recently shown remarkable robustness against challenging distribution shifts. However, it is widely acknowledged that these improvements in robustness are most pronounced in the zero-shot setting, while conventional fine-tuning of these models often compromises robustness when compared to zero-shot performance [28, 15, 14]. This phenomenon is known as the ID-OOD trade-off, i.e., improving performance on in-distribution (ID) data can sometimes lead to decreased performance on out-of-distribution (OOD) data [12, 25].
In recent years, ensemble-based models (ESMs) have demonstrated significant success in addressing the ID-OOD dilemma [17, 28, 14, 31]. Specifically, denoting the input as x, the zero-shot model as p(y | x; θ_zs) and the fine-tuned model as p(y | x; θ_ft), existing ESMs typically employ the output-space ensemble (OSE) [14, 31], which outputs p(y | x; θ_ose) = α p(y | x; θ_ft) + (1 − α) p(y | x; θ_zs), and the weight-space ensemble (WSE) [28, 17], which outputs p(y | x; θ_wse) = p(y | x; α θ_ft + (1 − α) θ_zs), where α ∈ [0, 1]. Compared to fine-tuned models, ESMs offer significant accuracy enhancements under distribution shift, while maintaining high ID accuracy.
However, ESMs cannot fully address the ID-OOD trade-offs. In Figure 1(a), by varying the mixing coefficient α, we plot the ID-OOD frontier curves (pink line) for the CLIP ViT-B/16 model on ImageNet [3] (ID) and five derived distribution-shifted datasets (OOD): ImageNet-V2 [21], ImageNet-R [7], ImageNet-A [9], ImageNet-Sketch [27] and ObjectNet [1].
Figure 1: (a) ID-OOD frontier curves for the CLIP ViT-B/16 model on the ID (ImageNet) and OOD (IN-{V2, R, A, Sketch} and ObjectNet) datasets by varying the mixing coefficient α. The ensemble model achieves its best ID and OOD performance at different α values. Our method VRF simultaneously attains the best ID and OOD accuracy, outperforming the ensemble by 3.6% on OOD and 1.6% on ID at its optimal performance points. (b) Relationship between the ratio of fine-tuned to zero-shot accuracy (Acc_ft/Acc_zs) and the distance d(x) to the ZSF set: the ratio decreases as d(x) increases.
We find that the ensemble model achieves its optimal ID and OOD performance at different α values: the best ID accuracy is achieved at α = 0.5 and the best OOD accuracy is obtained at α = 0.3. When the ensemble model reaches its optimal value for OOD, the performance on ID decreases by 3.6% relative to its peak. Similarly, when the ensemble model is optimized for ID, the performance on OOD decreases by 1.6% relative to its best value; the ID-OOD trade-offs still persist for ESMs. This raises a natural question:
Can ensemble-based models simultaneously attain the best ID and OOD accuracy?
In this paper, we affirmatively answer this question by proposing a sample-wise ensembling technique, dubbed variance reduction fine-tuning (VRF). This method is motivated by an empirical finding illustrated in Fig. 1(b). For each sample in the training dataset, if the fine-tuned model correctly predicts the label while the zero-shot model fails, we collect its feature representation in the fine-tuned model to form the zero-shot failure (ZSF) set. We then measure the distance d(x) of each test sample x to the ZSF set. Based on this distance, test samples are grouped into bins, and we compute the ratio of fine-tuned to zero-shot accuracy (Acc_ft/Acc_zs) within each bin (see Section C.7); the ratio decreases as d(x) increases. Intuitively, the closer a sample is to the ZSF set, the more likely it is that the zero-shot model makes incorrect predictions, whereas the fine-tuned model is more likely to be accurate, leading to a higher weight for the fine-tuned model, and vice versa.

As depicted by the orange diamond in Fig. 1(a), by leveraging the sample-wise weights, our VRF simultaneously attains the best ID and OOD accuracy. In Section 5, we show that on a variety of different models and tasks, our VRF approach consistently outperforms the existing fine-tuning and ensembling methods, including linear probing, end-to-end fine-tuning, LP-FT [15], OSE and WSE [28]. Specifically, on ImageNet and five derived distribution shifts, our VRF further improves the OOD accuracy by 1.5-2.0 pp over the ensemble baselines while maintaining or increasing ID accuracy. Furthermore, in Section 4, we justify our approach by demonstrating that it effectively minimizes the variance of the ensemble models, resulting in reduced residual error.
2 Related Work

Mitigating ID-OOD trade-offs. Improving performance on in-distribution data can sometimes lead to a decrease in performance on out-of-distribution data, and vice versa. This phenomenon is known as the ID-OOD trade-off. Xie et al. [29] leverage auxiliary information as outputs of auxiliary tasks to pre-train a model to reduce OOD error. Khani and Liang [12] show that self-training on large amounts of unlabeled data can mitigate such trade-offs by removing spurious features. Tripuraneni et al. [25] tackle this problem by learning representations that are robust across diverse tasks. However, these methods usually necessitate additional unlabeled data or auxiliary information. In contrast, our VRF is a straightforward variation of fine-tuning that does not require any extra data.

Robust fine-tuning of zero-shot models. Vision-language models like CLIP [20] have demonstrated outstanding improvements in robustness. It is commonly acknowledged that conventional fine-tuning methods often compromise robustness when compared to zero-shot performance. Therefore, enhancing downstream robustness has been the focus of subsequent works [15, 28, 5, 19, 6, 30]. Kumar et al. [15] show that a two-stage process of linear probing followed by full fine-tuning can alleviate feature distortion, leading to stronger OOD performance without sacrificing ID accuracy. Wortsman et al. [28] propose a method of weight interpolation between the zero-shot and the fine-tuned models to improve both ID and OOD accuracy. Goyal et al. [5] demonstrate that mimicking the contrastive pre-training objective to fine-tune the zero-shot models outperforms tuning via the traditional supervised cross-entropy loss. However, the ID-OOD trade-offs are still observed with these methods. In contrast, our method VRF can simultaneously achieve the best ID and OOD accuracy.
3 Methods

3.1 Set Up

Task: Consider a classification setting where the goal is to map an instance x ∈ X to a label y ∈ Y = [K]. We are provided with a zero-shot model f(·; θ_zs), a downstream dataset D = {(x_i, y_i)}, and a fine-tuned model f(·; θ_ft) which is trained on D. Below, we outline the implementation of the zero-shot and fine-tuned models:
• Zero-shot models (ZS): We investigate CLIP models [20] as our zero-shot models. CLIP models are pre-trained using image-text pairs {(x_1, t_1), ..., (x_B, t_B)} from the Internet. The objective of the CLIP models is to train a visual encoder Φ_v and a text encoder Φ_t such that the cosine similarity ⟨Φ_v(x_i), Φ_t(t_i)⟩ is maximized relative to unmatched pairs. CLIP models perform zero-shot inference for K classes by matching x with potential class names {c_1, ..., c_K}. Concretely, by extending the class name c_k to a prompt t_k = "a photo of a {c_k}", the zero-shot model outputs the score (logit) for class k as f(x; θ_zs)_k = ⟨Φ_v(x), Φ_t(t_k)⟩. The predicted probabilities can be calculated using the softmax function, i.e., p(y | x; θ_zs) = softmax(f(x; θ_zs))_y. The model outputs the label as pred(f(x; θ_zs)) = argmax_i f(x; θ_zs)_i.
• Linear classifiers (LC): We learn a linear classifier on top of the visual embedding Φ_v(x) while freezing the visual encoder Φ_v. The parameters of the linear classifier are optimized to minimize the cross-entropy loss on D.
• End-to-end fine-tuning (E2E-FT): We update both the linear classifier and the visual encoder by minimizing the cross-entropy loss on D.
• Linear probing then full fine-tuning (LP-FT) [15]: We employ a two-phase fine-tuning approach: initially training a linear classifier, followed by full fine-tuning starting from the solution derived from training the linear classifier.
• Output-space ensemble (OSE): We perform linear interpolation of the outputs between a zero-shot model and a fine-tuned model (e.g., E2E-FT or LP-FT):

p(y | x; θ_ose) = α p(y | x; θ_ft) + (1 − α) p(y | x; θ_zs), where α ∈ [0, 1].   (1)
• Weight-space ensemble (WSE) [28]: We combine the weights through linear interpolation between a zero-shot model and a fine-tuned model:

p(y | x; θ_wse) = p(y | x; α θ_ft + (1 − α) θ_zs), where α ∈ [0, 1].   (2)

A minimal code sketch of both ensembles follows this list.
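The two baseline ensembles are compact enough to write out in code. Below is a minimal PyTorch sketch, not the authors' released implementation: it assumes the two models share an identical architecture, and the function names are illustrative.

```python
# Minimal sketch of the OSE (Eq. 1) and WSE (Eq. 2) baselines.
import copy
import torch
import torch.nn.functional as F

def output_space_ensemble(logits_ft: torch.Tensor, logits_zs: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """Eq. (1): interpolate predicted probabilities in output space."""
    return alpha * F.softmax(logits_ft, dim=-1) + \
           (1 - alpha) * F.softmax(logits_zs, dim=-1)

def weight_space_ensemble(model_zs: torch.nn.Module, model_ft: torch.nn.Module,
                          alpha: float = 0.5) -> torch.nn.Module:
    """Eq. (2): interpolate parameters, then use a single forward pass."""
    sd_zs, sd_ft = model_zs.state_dict(), model_ft.state_dict()
    merged = {k: alpha * sd_ft[k] + (1 - alpha) * sd_zs[k] for k in sd_zs}
    model_wse = copy.deepcopy(model_zs)  # same architecture, merged weights
    model_wse.load_state_dict(merged)
    return model_wse
```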
Algorithm 1 Variance Reduction Fine-tuning
1: Given: Training dataset D, a zero-shot model f_zs and a fine-tuned model f_ft.
2: Build the zero-shot failure set V using Eq. (3).   ▷ Step 1: Identification
3: Inference Stage:
4: Given a test sample x, compute its feature representation v, zero-shot prediction p_zs(y | x) and fine-tuned model prediction p_ft(y | x).
5: Compute the k-NN distance to V as d(x) using Eq. (4).   ▷ Step 2: Distance Calculation
6: Compute the weight ω(x) using Eq. (6).
7: Return p_vrf(y | x) using Eq. (5).   ▷ Step 3: Sample-Wise Ensembling
3.2 Variance Reduction Fine-tuning

We now present our proposed method, VRF, which consists of three steps. First, before the inference stage, we collect the Zero-Shot Failure (ZSF) set. Second, for a given test sample, we calculate its distance to the ZSF set. Third, we assign weights to combine predictions from the zero-shot and fine-tuned models based on this distance.
Step 1 (Identification). For each x_i in the training dataset D, if the fine-tuned model correctly predicts the label while the zero-shot model fails, we collect its feature representation v_i = Φ_v(x_i; θ_ft) from the fine-tuned model to form the zero-shot failure set V. Specifically, V is defined as:

V = {v_i  s.t.  y_i = pred(f_ft(x_i)) and y_i ≠ pred(f_zs(x_i))}.   (3)

Here, f_zs(·) and f_ft(·) are used to denote f(·; θ_zs) and f(·; θ_ft), respectively, for simplicity.
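One possible implementation of Step 1 is sketched below. It assumes the caller supplies three callables with illustrative names that are not from the paper's code: `encode_ft` returning Φ_v(x; θ_ft), and `logits_ft`/`logits_zs` returning the class scores of the two models.

```python
import torch

@torch.no_grad()
def build_zsf_set(loader, encode_ft, logits_ft, logits_zs, device="cuda"):
    """Eq. (3): collect fine-tuned features v_i of training samples that the
    fine-tuned model predicts correctly while the zero-shot model fails."""
    feats = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        v = encode_ft(x)                         # Phi_v(x; theta_ft)
        v = v / v.norm(dim=-1, keepdim=True)     # CLIP features are l2-normalized
        keep = (logits_ft(x).argmax(-1) == y) & (logits_zs(x).argmax(-1) != y)
        feats.append(v[keep].cpu())
    return torch.cat(feats)                      # the ZSF set V, shape (|V|, d)
```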
Step 2 (Distance Calculation). The key empirical observation underpinning VRF is that in the vicinity of the ZSF set, a test sample typically exhibits lower zero-shot accuracy (Acc_zs) and higher fine-tuned accuracy (Acc_ft); this advantage of the fine-tuned model shrinks as the distance to the ZSF set increases. In this paper, we adopt non-parametric density estimation using nearest neighbors [24] to measure the distance of a test sample to the set V. Specifically, during inference, we derive the feature representation v of a test sample x, and compute the ℓ2 distances ∥v − v_i∥_2 w.r.t. v_i ∈ V. We reorder V according to increasing ℓ2 distance and denote the ordered sequence as V′ = (v_(1), v_(2), ..., v_(|V|)). The distance of x to V is defined as the ℓ2 distance to the k-th nearest neighbor (k-NN), i.e.,

d(x; V, k) = ∥v − v_(k)∥_2.   (4)

If there is no ambiguity, we use d(x) to denote d(x; V, k) for readability. Since the features in CLIP models are ℓ2-normalized, d(x) is bounded between [0, 2].
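For clarity, here is a brute-force tensor version of Eq. (4); the paper itself uses Faiss for efficiency (see Section 5.1). This is a sketch, assuming `zsf_set` is the stacked ZSF feature matrix from Step 1.

```python
import torch

def knn_distance(v: torch.Tensor, zsf_set: torch.Tensor, k: int) -> torch.Tensor:
    """Eq. (4): l2 distance from test features v (n, d) to the k-th nearest
    neighbor in the ZSF set V (|V|, d); both are assumed l2-normalized."""
    dists = torch.cdist(v, zsf_set)          # pairwise l2 distances, (n, |V|)
    return dists.kthvalue(k, dim=-1).values  # d(x; V, k) for each test sample
```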
Step 3 (Sample-Wise Ensembling). We implement sample-wise output-space ensembling in the form:

p_vrf(y | x) = ω(x) · p_ft(y | x) + (1 − ω(x)) · p_zs(y | x),   (5)

where ω(x) ∈ (0, 1). We use the distance to the ZSF set, d(x), to determine the weight ω. As shown by the blue line in Fig. 2, a smaller value of d(x) corresponds to a larger Acc_ft/Acc_zs ratio, and vice versa. Therefore, we set the weight ω to be inversely proportional to d(x). Given that ω is bounded between 0 and 1, we employ a sigmoid function σ(·) as:

ω(x) = σ(−(d(x) − a)/b),   (6)

where a, b > 0 are two hyper-parameters swept using accuracy on the ID validation set. We visualize the weight curve in green in Fig. 2, setting a = 1.5 and b = 0.6. We summarize the whole process in Algorithm 1.

Figure 2: Relationship between the distance to the ZSF set d(x), the accuracy ratio Acc_ft/Acc_zs (blue), and the weight ω(x) (green, with a = 1.5, b = 0.6).
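Steps 2-3 amount to only a few lines at inference time. Below is a hedged sketch of Eqs. (5) and (6), with the paper's reported values a = 1.5 and b = 0.6 as defaults.

```python
import torch

def vrf_weight(d: torch.Tensor, a: float = 1.5, b: float = 0.6) -> torch.Tensor:
    """Eq. (6): omega(x) = sigmoid(-(d(x) - a) / b); samples close to the
    ZSF set (small d) receive a fine-tuned weight near 1."""
    return torch.sigmoid(-(d - a) / b)

def vrf_predict(p_ft: torch.Tensor, p_zs: torch.Tensor, d: torch.Tensor,
                a: float = 1.5, b: float = 0.6) -> torch.Tensor:
    """Eq. (5): sample-wise output-space ensemble of the two predictions."""
    omega = vrf_weight(d, a, b).unsqueeze(-1)   # (batch, 1)
    return omega * p_ft + (1 - omega) * p_zs
```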
4 Justification

We now prove that our VRF can effectively reduce the variance of the combined model, resulting in lower errors compared to ensembling using a constant mixing coefficient.
4.1 Background

The outputs of a well-trained classifier are expected to approximate the a posteriori class distribution. Apart from the irreducible error (Bayes error), the residual error of a classifier can be broken down into bias and variance components. Specifically, for a test sample x, the probability output of a classifier parameterized by θ can be expressed as:

p(y | x; θ) = P(y | x) + β_y + η_y(x),   (7)

where P(y | x) denotes the true a posteriori distribution, β_y is the label bias of p(y | x; θ), which is independent of the input x, and η_y(x) is the residual error for the given input x. In this study, we primarily attribute the residual error to the variance term (i.e., β_y = 0), as the label bias problem in foundation models has been effectively addressed by Zhu et al. [31]. Tumer et al. [26] have proven that the expected residual error E is given by:

E = V[η_y(x)] / s,   (8)

where s is a constant factor related to the derivative of the true a posteriori distribution and is independent of the trained model, and V[η_y(x)] is the variance.
4.2 Variance Reduction Fine-tuning Leads to Lower Residual Error

Let us shift our focus to the effects of combining the zero-shot and fine-tuned models. Let g_zs(·) and g_ft(·) be two functions that produce weights for ensembling the models. Subject to the constraint that g_zs(x) + g_ft(x) = 1, the output of the combined classifier is:

p_vrf(y | x) = g_zs(x) p_zs(y | x) + g_ft(x) p_ft(y | x) = P(y | x) + g_zs(x) · η_zs(x) + g_ft(x) · η_ft(x),   (9)

where the last two terms constitute the combined residual error η_vrf(x), and we omit the subscript y of η for readability. The variance of η_vrf(x) can be expressed as:

V[η_vrf(x)] = g_zs(x)² · V[η_zs(x)] + g_ft(x)² · V[η_ft(x)].   (10)

Here, we assume the residual errors are independent, following the assumption of previous studies of CLIP fine-tuning [14, 31]. We further explore the case of correlated residual errors in Section B. According to Eq. (8), the reduction in variance can be readily translated into a reduction in error rates. To obtain the smallest variance V[η_vrf(x)], we minimize Eq. (10) using a Lagrange multiplier to enforce the constraint that g_zs(x) + g_ft(x) = 1, and obtain the optimal weight function g_ft as:

g_ft(x) = V[η_zs(x)] / (V[η_zs(x)] + V[η_ft(x)]) = (1 + V[η_ft(x)] / V[η_zs(x)])^{-1}.   (11)

Since the optimal weight for the fine-tuned model grows where the fine-tuned model is more reliable than the zero-shot model, and a smaller distance d(x) indicates exactly this regime (as shown in Fig. 2), we design the weighting function g_ft(x) = ω(x) ∝ d(x)^{-1} as in Eq. (6).
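For completeness, here is a sketch of the Lagrange-multiplier step behind Eq. (11), writing V_zs = V[η_zs(x)] and V_ft = V[η_ft(x)]:

```latex
% Minimize Eq. (10) subject to g_zs + g_ft = 1:
\mathcal{L} = g_{zs}^2 V_{zs} + g_{ft}^2 V_{ft} + \lambda\,(g_{zs} + g_{ft} - 1)
% Stationarity gives 2 g_{zs} V_{zs} = 2 g_{ft} V_{ft} = -\lambda, and with
% the constraint g_{zs} + g_{ft} = 1 this yields:
g_{ft}(x) = \frac{V_{zs}}{V_{zs} + V_{ft}}, \qquad
g_{zs}(x) = \frac{V_{ft}}{V_{zs} + V_{ft}}
```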
5 Experiments

5.1 Experimental Setup

Datasets with distribution shifts. We provide the results for ImageNet [3] and its five derived distribution shifts: (1) ImageNet-V2 (IN-V2) [21]: test images sampled a decade after the original ImageNet. (2) ImageNet-R (IN-R) [7]: contains renditions (e.g., art, cartoons, graffiti). (3) ImageNet-Sketch (IN-Sketch) [27]: consists of sketches rather than natural photos. (4) ImageNet-A (IN-A) [9]: collects real-world images that are misclassified by ResNet models. (5) ObjectNet [1]: a test set featuring objects with diverse backgrounds, rotations, and imaging viewpoints. We extend our analysis to include a standard distribution shift benchmark [15, 14, 4]: CIFAR-10 → STL-10, where the ID is CIFAR-10 [13] and the OOD is STL-10 [2]. We removed the "monkey" class from STL-10, as it does not exist in CIFAR-10. In addition, we also consider subpopulation shifts, where the ID data contains a few sub-categories, and the OOD data comprises different sub-categories within the same parent category.
Table 1: Accuracy of various methods on ImageNet and derived distribution shifts for CLIP ViT-B/32.

Method                        IN    IN-V2  IN-Sketch  IN-A   IN-R   ObjectNet  Avg shifts
Zero-shot [20]                63.3  55.9   42.3       31.5   69.3   43.5       48.5
Linear classifier [20]        75.4  63.4   38.8       26.1   58.7   41.5       45.7
E2E-FT [28]                   76.2  64.2   38.7       21.0   57.1   40.1       44.2
+ Weight-space ensemble [28]  77.9  67.2   45.1       28.8   66.4   45.1       50.5
+ Output-space ensemble       77.3  66.0   44.2       27.1   68.4   44.4       50.0
+ VRF (ours)                  77.6  66.7   47.0       29.2   70.9   46.3       52.0
Δ                             +0.3  +0.7   +2.8       +2.1   +2.5   +1.9       +2.0
LP-FT [15]                    76.9  64.8   39.9       25.7   69.9   42.6       48.6
+ Weight-space ensemble [28]  78.0  67.0   44.8       31.2   65.8   46.1       51.0
+ Output-space ensemble       77.8  66.3   44.0       29.5   66.2   45.5       50.3
+ VRF (ours)                  77.8  66.7   46.1       31.0   70.0   46.3       51.8
Δ                             +0.0  +0.4   +2.1       +1.5   +3.8   +0.8       +1.5
Table 2: Accuracy of various methods on ImageNet and derived distribution shifts for CLIP ViT-B/16.

Method                        IN    IN-V2  IN-Sketch  IN-A   IN-R   ObjectNet  Avg shifts
Zero-shot [20]                68.3  61.9   48.3       50.1   77.6   54.2       58.4
Linear classifier [20]        79.3  69.1   44.8       44.3   66.7   51.1       55.2
E2E-FT [28]                   81.3  70.6   45.1       36.6   65.6   50.5       53.7
+ Weight-space ensemble [28]  82.5  73.1   51.6       47.6   75.1   55.7       60.6
+ Output-space ensemble       82.2  72.0   50.6       46.8   76.7   54.9       60.2
+ VRF (ours)                  82.3  72.1   52.9       48.4   78.7   56.4       61.8
Δ                             +0.1  +0.1   +2.3       +1.6   +2.0   +1.5       +1.6
LP-FT [15]                    81.5  70.7   46.7       41.4   66.4   52.4       55.5
+ Weight-space ensemble [28]  82.4  73.0   51.5       50.6   74.2   56.6       61.2
+ Output-space ensemble       82.1  72.3   50.9       50.9   74.9   55.7       60.9
+ VRF (ours)                  82.1  72.3   52.9       51.2   78.8   57.2       62.4
Δ                             +0.0  +0.0   +2.0       +0.3   +3.9   +1.5       +1.5
Following [15, 14], we adopt the Entity-30 dataset [23], which aims to categorize images into one of 30 entity categories, such as "vehicle" and "insect".
Baselines. We adopt two models: CLIP ViT-B/32 and a larger ViT-B/16 from OpenAI [20]. The default model used in ablation studies is the CLIP ViT-B/16. In addition to the zero-shot models, we compare our approach against five standard methods for adapting pre-trained models: (1) linear classifier [20], (2) E2E-FT, (3) LP-FT [15], (4) OSE, and (5) WSE [28]. The descriptions of these methods are included in Section 3.1.

Implementation details. When fine-tuning E2E-FT models, we adhere to Wortsman et al. [28], employing the default PyTorch AdamW optimizer for 10 epochs with weight decay of 0.1 and a cosine-annealing learning rate schedule with 500 warm-up steps. Unless specified, we use a learning rate of 3×10^{-5} and gradient clipping at norm 1. When fine-tuning LP-FT, we first adopt the settings of Wortsman et al. [28] to train the linear classifier, then fully fine-tune the models at a learning rate of 1×10^{-5}. To perform the k-NN search efficiently, we use the Faiss library [11]. Denoting the size of the ZSF set as |V|, we scale k according to a percentage p% of the sample set, where k = floor(p% · |V|). In this paper, p is set to 0.1, a value consistent with the default setting proposed by Sun et al. [24]. Note that all hyperparameters, e.g., α, a, b, are searched using the accuracy on the in-distribution (ID) validation set. Derived distribution shift datasets are used only for evaluation and not for hyperparameter sweeps. See Appendix C.1 for further experimental details.
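A minimal sketch of the Faiss-based k-NN search described above is given below; the function names are ours, not the paper's. Note that Faiss's `IndexFlatL2` returns squared distances, so a square root recovers d(x).

```python
import faiss
import numpy as np

def build_knn_index(zsf_feats: np.ndarray, p: float = 0.1):
    """Exact L2 index over the ZSF set; k = floor(p% * |V|) as in the paper."""
    index = faiss.IndexFlatL2(zsf_feats.shape[1])
    index.add(zsf_feats.astype(np.float32))
    k = max(1, int(np.floor(p / 100.0 * zsf_feats.shape[0])))
    return index, k

def zsf_distance(index, k: int, v: np.ndarray) -> np.ndarray:
    """Eq. (4) via Faiss: take the k-th smallest distance for each query."""
    sq_dists, _ = index.search(v.astype(np.float32), k)  # (n, k), squared l2
    return np.sqrt(sq_dists[:, -1])                      # d(x) per test sample
```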
Table 3: Accuracy of various methods on CIFAR-10 → STL-10 and Entity-30.

(a) CLIP ViT-B/32
Method             CIFAR→STL       Entity-30
                   ID     OOD      ID     OOD
Zero-shot [20]     88.3   97.1     65.2   66.5
Linear classifier  95.0   96.6     93.3   68.1
E2E-FT [28]        97.9   93.5     94.4   65.1
+ WSE [28]         98.2   95.7     94.6   68.8
+ OSE              97.9   95.9     94.4   66.4
+ VRF (ours)       97.8   97.3     94.5   69.5
Δ                  -0.1   +1.4     +0.1   +3.1
LP-FT [15]         97.9   95.0     94.6   67.7
+ WSE [28]         98.1   96.4     94.8   68.8
+ OSE              98.1   96.4     94.7   68.5
+ VRF (ours)       98.1   97.5     94.8   70.1
Δ                  +0.0   +1.1     +0.1   +1.6

(b) CLIP ViT-B/16
Method             CIFAR→STL       Entity-30
                   ID     OOD      ID     OOD
Zero-shot [20]     90.1   98.4     68.3   68.2
Linear classifier  95.8   97.7     95.3   69.6
E2E-FT [28]        98.6   96.1     96.9   68.2
+ WSE [28]         98.7   97.8     97.2   71.9
+ OSE              98.6   96.6     97.0   71.5
+ VRF (ours)       98.6   98.4     97.0   72.7
Δ                  +0.0   +1.8     +0.0   +1.2
LP-FT [15]         98.5   96.3     96.9   68.8
+ WSE [28]         98.7   97.9     97.3   72.1
+ OSE              98.6   97.7     97.2   71.8
+ VRF (ours)       98.6   98.6     97.4   72.9
Δ                  +0.0   +0.9     +0.2   +1.1
Figure 3: ID-OOD frontier curves obtained by varying the mixing coefficient α for the CLIP ViT-B/16. (a) CIFAR-10 (ID) and STL-10 (OOD) results. (b) Entity-30 results. Panels (a.1) and (b.1) show the frontier curves; panels (a.2) and (b.2) show the Acc_ft/Acc_zs ratio versus d(x).
5.2 Results

ImageNet and its five shifted distribution results. In Tables 1 and 2, we report the ID-OOD accuracies of fine-tuning baselines for the CLIP ViT-B/32 and CLIP ViT-B/16 models, respectively. For OSE and WSE, we choose the mixing coefficient α with the highest ID validation accuracy. To enhance clarity in the results, we denote the improvement over OSE as Δ in Tables 1 and 2. We observe that our VRF boosts the accuracy of fine-tuned models, including ensembling baseline models, across the five ImageNet distribution-shifted datasets, while maintaining or improving the ImageNet in-distribution performance. For instance, in Table 1, when ensembling with the E2E-FT model, our VRF outperforms the OSE model by 2.0% on distribution shifts while increasing the ID accuracy by 0.3%. Compared to WSE models, our VRF achieves a delta of 1.2% on distribution shifts, while maintaining ID performance within 0.2%, as shown in the E2E-FT part of Table 2.

CIFAR-10 → STL-10 and Entity-30 results. We report the accuracy of various methods in Table 3 (a, b). We note that fine-tuning baselines can enhance the accuracy on CIFAR-10 compared to the zero-shot models. However, this improvement comes at the expense of reduced accuracy on STL-10. For instance, E2E-FT leads to a decrease of approximately 3.6% in STL-10 accuracy, as shown in Table 3(a). Previous ensemble methods can mitigate the degradation to some extent, but the STL-10 performance still lags behind the zero-shot performance: e.g., in Table 3(b), the accuracy of E2E-FT + WSE is 97.8% whereas the zero-shot performance is 98.4%. In contrast, our VRF simultaneously improves accuracy on both CIFAR-10 and STL-10. Similarly, for Entity-30, our VRF can further improve the OOD performance when compared to WSE and OSE methods.

In addition, we plot the ID-OOD frontier curves in Figure 3 (a.1 & b.1). Similar to the results on ImageNet (Figure 1(a)), the ensemble model achieves its best ID and OOD performances at different α values. For instance, on the CIFAR-10 benchmark, when the ensemble model attains its optimal ID value at α = 0.7, the OOD performance decreases by 2.0% relative to its peak.
Table 4: Results of VRF for linear-probed models using CLIP ViT-B/16 models.

Method                     ImageNet       CIFAR-10       Entity-30
                           ID     OOD     ID     OOD     ID     OOD
Zero-shot classifier [20]  68.3   58.4    90.1   98.4    68.3   68.2
Linear classifier          79.3   55.2    95.8   97.7    95.3   69.6
WSE/OSE                    79.9   57.8    95.8   97.7    95.5   70.5
VRF (ours)                 79.8   58.5    95.8   98.4    95.4   71.4
Conversely, when the optimal OOD value is reached at α = 0.3, the performance on ID diminishes by 2.7% from its best. In contrast, our VRF simultaneously attains the best ID and OOD performance.

We also analyze the relation between the ratio Acc_ft/Acc_zs and d(x) in Figure 3 (a.2 & b.2). Consistent with the findings from ImageNet (Figure 1(b)), we observe that the ratio decreases as d(x) increases, which further supports our design of assigning a higher weight to fine-tuned models when d(x) is small.