




Agent Planning with World Knowledge Model

Shuofei Qiao, Runnan Fang, Ningyu Zhang, Yuqi Zhu, Xiang Chen, Shumin Deng, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Zhejiang University, National University of Singapore, Alibaba Group
{shuofei,zhangningyu}@zju.edu.cn

arXiv:2405.14205v2 [cs.CL] 15 Oct 2024
Abstract
Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the "real" physical world. Imitating humans' mental world knowledge model, which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper we introduce a parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop the WKM, providing prior task knowledge to guide the global planning and dynamic state knowledge to assist the local planning. Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our method can achieve superior performance compared to various strong baselines. Besides, we analyze to illustrate that our WKM can effectively alleviate the blind trial-and-error and hallucinatory action issues, providing strong support for the agent's understanding of the world. Other interesting findings include: 1) our instance-level task knowledge can generalize better to unseen tasks, 2) weak WKM can guide strong agent model planning, and 3) unified WKM training has promising potential for further development.3
Figure 1: Traditional agent planning vs. agent planning with world knowledge model. (a) The plain agent model is prone to blind trial-and-error and hallucinatory actions before finding the correct path; (b) with a world knowledge model, prior task knowledge and dynamic state knowledge steer the agent toward the correct path, and the knowledge-based and agent-based action probabilities are combined.
1 Introduction

The remarkable advances in Large Language Models (LLMs) have witnessed a rapid development of various natural language processing tasks [25, 16, 28, 47, 60, 33]. Recently, multiple attempts that
* Equal Contribution.
† Corresponding Author.
3 The code is available at /zjunlp/WKM.
38th Conference on Neural Information Processing Systems (NeurIPS 2024).
Figure 2: Overview of our WKM. We train a world knowledge model on the knowledge synthesized by the agent model itself from both expert and explored trajectories, providing prior task knowledge to guide global planning and dynamic state knowledge to assist local planning. The figure comprises four panels: (a) task knowledge synthesis, (b) state knowledge summarization, (c) model training, and (d) planning with WKM; it also notes that state knowledge never appears in the context of the agent model during training and inference.
the action a_{t+1} based on h_t at each time step t+1:

a_{t+1} ∼ π_θ(·|h_t).  (2)

Specifically, a_0 ∼ π_θ(·|u) is generated according to the task instruction u. The whole trajectory τ concludes when the task is completed or exceeds the maximum time steps. Then the production of the entire trajectory with time length n can be modeled as:

π_θ(τ|u) = ∏_{t=0}^{n} π_θ(a_{t+1}|h_t) · π_θ(a_0|u).  (3)

Ultimately, the final reward r(u, τ) ∈ [0, 1] representing the task completion rate is calculated. Note that we follow a REACT-style [54] trajectory that includes rationales before each action. We use a_t to represent the action with rationales for convenience.
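To make the rollout formulation above concrete, the following is a minimal Python sketch of a ReAct-style interaction loop; the `agent_model.generate_action` and `env.step`/`env.reward` interfaces are hypothetical placeholders used only for illustration, not the paper's actual implementation.

```python
# Minimal sketch of a ReAct-style rollout (Eqs. 2-3); all interfaces are illustrative only.
def rollout(agent_model, env, task_instruction, max_steps=40):
    history = [task_instruction]          # h_t starts from the task instruction u
    trajectory = []                       # collects (action, observation) pairs
    for t in range(max_steps):
        # a_{t+1} ~ pi_theta(. | h_t); the action text also carries the rationale
        action = agent_model.generate_action(history)
        observation, done = env.step(action)
        trajectory.append((action, observation))
        history.extend([action, observation])
        if done:
            break
    reward = env.reward()                 # r(u, tau) in [0, 1], the task completion rate
    return trajectory, reward
```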
World Knowledge Model. World knowledge model serves as humans' mental cognition of the physical environment, more intricate than the word knowledge model which LLM-powered agent models are trained to be [61, 10, 52, 13]. Our "world" here refers to the simulated environment of the task. Based on the static environment of the task and the dynamic changes during interaction with the agent, we define world knowledge as a combination of prior global knowledge and dynamic local knowledge, corresponding to the blind trial-and-error problem in global planning and the hallucinatory action issue in local planning in traditional agent models, respectively. To attain precise and efficient agent planning, we develop a parametric WKM to simulate the mental WKM of humans.
3 Method

As shown in Figure 2, we steer the agent model to self-synthesize the task knowledge from the comparison of expert and sampled trajectories (§3.1). Then we prompt the agent model to self-summarize the state knowledge based on historical behavior and construct a state knowledge base (§3.2). The generated knowledge will be integrated into the expert trajectories for training the WKM. After the training process (§3.3), we augment the agent model with the world knowledge model to achieve effective and accurate planning (§3.4).
3.1 Task Knowledge Synthesis

The task knowledge serves as the prior knowledge to guide the agent model's global planning and prevent it from dropping into blind trial-and-error.
Experienced Agent Exploration. We primarily acquire task knowledge through the comparison of preference trajectories (chosen vs. rejected). In order to improve the quality of rejected trajectories and obtain more targeted task knowledge, we employ an experienced agent for exploration. Firstly, we train a vanilla language model with expert trajectories4 from the training set to obtain an experienced agent. Subsequently, the experienced agent explores the training set tasks again to generate rejected trajectories. Our purpose is to extract superior task knowledge that cannot be acquired solely through supervised fine-tuning on chosen trajectories, thus further effectively boosting the agent's capabilities.
Self Knowledge Synthesis. With the expert trajectories as the chosen ones and the trajectories sampled from the experienced agent as the rejected ones, we prompt the agent model itself to synthesize the task knowledge. Supposing K is the task knowledge space:

κ ∼ π_θ(·|ρ^{TaskKnow}, u, τ_w, τ_l),  (4)

where κ ∈ K is the task knowledge, ρ^{TaskKnow} stands for the prompt to instruct the task knowledge extraction, and τ_w, τ_l are the chosen and rejected trajectories respectively. Note that given the same task u, τ_w and τ_l always satisfy r(u, τ_w) = 1 ≥ r(u, τ_l). Even when r(u, τ_w) = r(u, τ_l), we still consider trajectories sampled from the experienced agent as rejected ones. This is because expert trajectories often have shorter step lengths, enabling the agent to learn more knowledge of efficient planning. For detailed prompts of task knowledge synthesis, please refer to Appendix I.1.
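As a rough illustration of Eq. (4), the snippet below prompts the agent model itself with a chosen/rejected trajectory pair to produce task knowledge. The prompt wording and the `generate` call are assumptions made for this sketch and are not the exact prompt given in Appendix I.1.

```python
# Sketch: self-synthesizing task knowledge from a preference pair (Eq. 4).
# The prompt text below is illustrative, not the paper's actual prompt.
TASK_KNOW_PROMPT = (
    "You are given a successful expert trajectory and a failed sampled trajectory "
    "for the same task. Summarize transferable task knowledge and action workflows "
    "that explain why the expert succeeded.\n"
    "Task: {task}\nChosen trajectory:\n{chosen}\nRejected trajectory:\n{rejected}\n"
    "Task knowledge:"
)

def synthesize_task_knowledge(agent_model, task, chosen_traj, rejected_traj):
    prompt = TASK_KNOW_PROMPT.format(task=task, chosen=chosen_traj, rejected=rejected_traj)
    # kappa ~ pi_theta(. | rho_TaskKnow, u, tau_w, tau_l)
    return agent_model.generate(prompt)
```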
3.2 State Knowledge Summarization

The state knowledge serves as the dynamic knowledge to constrain the agent model's local planning and prevent it from generating hallucinatory actions. We prompt the agent model to self-summarize state knowledge at each planning step based on the expert trajectories to guarantee quality. For detailed prompts of state knowledge summarization, please refer to Appendix I.2.
Supposing the prompt used to summarize state knowledge is ρ^{StateKnow} and the state knowledge s ∈ S is a part of the state space S, the generation of state knowledge at time t can be represented as:

s_t ∼ π_θ(·|ρ^{StateKnow}, h_t).  (5)
State Knowledge Base Construction. To avoid confusion caused by excessive additional information, instead of explicitly concatenating the state knowledge to the context, we construct a state knowledge base for retrieval (we analyze in §4.3 how explicit state knowledge may affect the performance of the agent model). We combine the state knowledge s_t with the previous action a_t and next action a_{t+1} from the expert trajectory to form an action-state-action triplet (a_t, s_t, a_{t+1}). After iterating through all expert trajectories, we obtain a State Knowledge Base B = {(s, a_pre, a_next)^{(i)}}_{i=1}^{|B|}, where a_pre = a_t, a_next = a_{t+1}, and |B| is the size of the state knowledge base.
3.3 Model Training

We integrate the generated world knowledge into expert trajectories and train a world knowledge model. The agent model needs to be re-trained to adapt to the incorporation of task knowledge. Note that our agent model and knowledge model are both trained with LoRA sharing the same backbone. We list the examples of training data for both the agent model and WKM in Appendix E.
Agent Model Training. Given the expert trajectories dataset D = {(u, κ, τ_w)^{(i)}}_{i=1}^{|D|} with task knowledge κ generated in §3.1, we train the agent model to follow the task knowledge to generate actions. Under an auto-regressive manner, the loss of the agent model can be formulated as:

L_agent(π_θ) = −E_{τ_w∼D}[π_θ(τ_w|u, κ)].  (6)

Suppose X = (x_1, x_2, ..., x_{|X|}) is the token sequence of the trajectory τ_w; we have:

π_θ(τ_w|u, κ) = −Σ_{j=1}^{|X|} (1(x_j ∈ A) × log π_θ(x_j|u, κ, x_{<j})).  (7)

Here 1(x_j ∈ A) is the indicator function to mask tokens unrelated to actions. Please note that τ_w here does not include the state knowledge mentioned in §3.2.
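The action-token masking in Eqs. (6)-(7) amounts to a standard masked negative log-likelihood in which only action tokens contribute to the loss. Below is a minimal PyTorch-style sketch; the tensor names, shapes, and the assumption that logits are already shifted to align with labels are illustrative, and the same pattern applies to the state-knowledge mask 1(x′_j ∈ S) used for the WKM loss later in this section.

```python
import torch
import torch.nn.functional as F

# Sketch: token-level NLL where non-action tokens are masked out (Eqs. 6-7).
# logits:      (batch, seq_len, vocab) from the agent model, assumed aligned with labels
# labels:      (batch, seq_len) token ids of the expert trajectory tau_w
# action_mask: (batch, seq_len) 1 for action tokens, 0 otherwise (the indicator 1(x_j in A))
def masked_agent_loss(logits, labels, action_mask):
    log_probs = F.log_softmax(logits, dim=-1)
    token_nll = -log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)
    masked_nll = token_nll * action_mask
    # Average over action tokens only
    return masked_nll.sum() / action_mask.sum().clamp(min=1)
```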
4 For details on how to collect expert trajectories, please refer to Appendix A.
World Knowledge Model Training. The main difference in the training data between the agent and knowledge model is the added state knowledge. Given the expert trajectories dataset with both task and state knowledge D′ = {(u, κ, τ)^{(i)}}_{i=1}^{|D′|}, where τ = (a_0, o_0, s_0, ..., a_n, o_n, s_n), the loss of the knowledge model π_ϕ can be formulated as:

L_know(π_ϕ) = −E_{κ,τ∼D′}[π_ϕ(κ|u) π_ϕ(τ|u, κ)].  (8)

Suppose X′ = (x′_1, x′_2, ..., x′_{|X′|}) is the token sequence of the expert trajectory with state knowledge τ and Y = (y_1, y_2, ..., y_{|Y|}) represents the token sequence of the task knowledge κ; we have:

π_ϕ(κ|u) = −Σ_{i=1}^{|Y|} log π_ϕ(y_i|u, y_{<i}),  (9)

π_ϕ(τ|u, κ) = −Σ_{j=1}^{|X′|} (1(x′_j ∈ S) × log π_ϕ(x′_j|u, κ, x′_{<j})),  (10)

where 1(x′_j ∈ S) is the indicator function to mask tokens unrelated to state knowledge.
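As a rough illustration of how the two training sequences differ, the sketch below assembles the WKM text by interleaving the summarized state knowledge after each observation, while the agent model's text omits it (consistent with the note in Figure 2 that state knowledge never appears in the agent model's context). The textual layout and field names are assumptions for this sketch.

```python
# Sketch: assemble agent vs. WKM training text from an annotated expert trajectory.
def build_training_texts(task, task_knowledge, steps):
    # steps: list of (action, observation, state_knowledge) tuples (assumed layout)
    agent_text = [f"Task: {task}", f"Task Knowledge: {task_knowledge}"]
    wkm_text = list(agent_text)
    for action, obs, state in steps:
        agent_text += [f"Agent: {action}", f"Obs: {obs}"]
        wkm_text += [f"Agent: {action}", f"Obs: {obs}", f"State Knowledge: {state}"]
    return "\n".join(agent_text), "\n".join(wkm_text)
```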
3.4 Agent Planning with World Knowledge Model

At inference time, the agent model plans on the evaluation tasks with the aid of the world knowledge model. We redefine the historical trajectory h_t = (u, κ, a_0, o_0, a_1, o_1, ..., a_t, o_t). Given a specific task instruction u, the knowledge model first generates the task knowledge κ ∼ π_ϕ(·|u), then the agent model starts planning. Assuming the available action set A_u ⊆ A for the task u is (α_u^{(1)}, α_u^{(2)}, ..., α_u^{(|A_u|)}), at any time t ≥ 0, instead of directly generating a next action a_{t+1} ∈ A_u based on h_t, we first employ the world knowledge model to generate the current state knowledge s_t ∼ π_ϕ(·|h_t) and leverage s_t to query the state knowledge base B = {(s, a_pre, a_next)^{(i)}}_{i=1}^{|B|}. With the state knowledge as the key, we retrieve the N nearest triplets where a_pre = a_t based on semantic similarity and collect the corresponding next actions a_next. We count the probability of each action p_know(α_u^{(i)}) = N_i / N, where N_i is the occurrence number of action α_u^{(i)} in all the collected a_next. Therefore, we get the probability acquired from the state knowledge base:

P_know(A_u) = (p_know(α_u^{(1)}), p_know(α_u^{(2)}), ..., p_know(α_u^{(|A_u|)})),  Σ_i p_know(α_u^{(i)}) = 1.  (11)

Afterward, we sample the probability distribution of the first token for each action α_u^{(i)}, 1 ≤ i ≤ |A_u|, from the last layer of the agent model and apply a softmax function to normalize the probability distribution. We define the probability acquired from the agent model as:

P_agent(A_u) = (p_agent(α_u^{(1)}), p_agent(α_u^{(2)}), ..., p_agent(α_u^{(|A_u|)})),  Σ_i p_agent(α_u^{(i)}) = 1.  (12)

Finally, we determine the next action by combining the above two probabilities:

a_{t+1} = argmax_{α_u^{(i)} ∈ A_u, 1 ≤ i ≤ |A_u|} (γ · p_agent(α_u^{(i)}) + (1−γ) · p_know(α_u^{(i)})),  (13)

where γ is the hyperparameter that controls the proportion of P_agent(A_u). Based on the above, we enhance the agent planning by global guidance from task knowledge and local constraints from state knowledge generated by our WKM. Due to the WKM and retrieval, the inference stage incurs additional time overhead compared to the pure agent model. The approximate ratio is around 2.5:1.
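Putting Eqs. (11)-(13) together, the decision rule at each step can be sketched as below. The retrieval encoder, the way first-token logits are read from the agent model, and all helper names are assumptions made for illustration rather than the paper's actual implementation.

```python
import numpy as np

# Sketch of the WKM-constrained action choice (Eqs. 11-13); interfaces are illustrative.
def choose_next_action(state_knowledge, prev_action, candidate_actions,
                       knowledge_base, encoder, agent_first_token_logits,
                       gamma=0.5, top_n=3000):
    # --- p_know: retrieve the N nearest triplets whose a_pre matches the previous action ---
    pool = [e for e in knowledge_base if e["a_pre"] == prev_action]
    query = encoder(state_knowledge)
    sims = [np.dot(query, encoder(e["state"])) for e in pool]   # cosine if vectors are normalized
    nearest = [pool[i] for i in np.argsort(sims)[::-1][:top_n]]
    counts = {a: 0 for a in candidate_actions}
    for e in nearest:
        if e["a_next"] in counts:
            counts[e["a_next"]] += 1
    total = max(sum(counts.values()), 1)
    p_know = np.array([counts[a] / total for a in candidate_actions])

    # --- p_agent: softmax over the agent model's first-token logits for each candidate action ---
    logits = np.array([agent_first_token_logits[a] for a in candidate_actions])
    p_agent = np.exp(logits - logits.max())
    p_agent /= p_agent.sum()

    # --- Eq. 13: combine the two distributions and take the argmax ---
    scores = gamma * p_agent + (1.0 - gamma) * p_know
    return candidate_actions[int(np.argmax(scores))]
```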
4 Experiments

4.1 Experimental Settings

Datasets and Metrics. We evaluate our method on three real-world simulated planning datasets: ALFWorld [41], WebShop [53], and ScienceWorld [50].
Table 1: Main Results. All the prompt-based baselines (REACT, Reflexion) are evaluated under one-shot prompting and all the fine-tuning-based baselines (NAT, ETO, KNOWAGENT, WKM) are trained through LoRA. The value after "+" represents the change of WKM relative to the best result among the baselines. WKM and the agent model are different LoRAs sharing the same backbone.

Backbone | Method | ALFWorld Seen | ALFWorld Unseen | WebShop | ScienceWorld Seen | ScienceWorld Unseen
---------|--------|---------------|-----------------|---------|-------------------|--------------------
GPT-3.5-Turbo | REACT | 8.57 | 5.97 | 44.37 | 15.41 | 13.99
GPT-4 | REACT | 44.29 | 38.05 | 62.76 | 67.32 | 65.09
Mistral-7B | REACT | 7.86 | 5.22 | 14.63 | 20.72 | 17.65
Mistral-7B | Reflexion | 11.56 | 6.00 | 16.64 | 21.07 | 18.11
Mistral-7B | NAT | 64.43 | 68.96 | 61.01 | 57.12 | 50.79
Mistral-7B | ETO | 66.84 | 71.43 | 64.09 | 58.17 | 51.85
Mistral-7B | KNOWAGENT | 70.44 | 70.72 | 61.28 | 59.32 | 47.24
Mistral-7B | WKM | 73.57 (+3.13) | 76.87 (+5.44) | 65.48 (+1.39) | 62.12 (+2.80) | 53.62 (+1.77)
Gemma-7B | REACT | 6.43 | 2.24 | 5.93 | 3.58 | 3.51
Gemma-7B | Reflexion | 7.14 | 2.99 | 7.71 | 4.94 | 3.93
Gemma-7B | NAT | 67.86 | 65.88 | 55.82 | 47.63 | 44.98
Gemma-7B | ETO | 66.43 | 68.66 | 62.67 | 50.44 | 47.84
Gemma-7B | KNOWAGENT | 69.29 | 67.60 | 58.80 | 48.55 | 45.28
Gemma-7B | WKM | 70.71 (+1.42) | 70.40 (+1.74) | 63.75 (+1.08) | 53.68 (+3.24) | 49.24 (+1.40)
Llama-3-8B | REACT | 2.86 | 3.73 | 19.32 | 24.76 | 22.66
Llama-3-8B | Reflexion | 4.29 | 4.48 | 22.73 | 27.23 | 25.41
Llama-3-8B | NAT | 60.71 | 59.70 | 61.60 | 55.24 | 48.76
Llama-3-8B | ETO | 64.29 | 64.18 | 64.57 | 57.90 | 52.33
Llama-3-8B | KNOWAGENT | 66.71 | 62.69 | 64.40 | 58.67 | 49.18
Llama-3-8B | WKM | 68.57 (+1.86) | 65.93 (+1.75) | 66.64 (+2.07) | 60.12 (+1.55) | 54.75 (+2.42)
ALFWorld and ScienceWorld include unseen tasks to evaluate the agent's generalization ability. The reward of ALFWorld is binary 0 or 1, indicating whether the agent has completed the task or not. WebShop and ScienceWorld provide dense rewards from 0 to 1 to measure the completion level of the task. For all the datasets, we apply average reward as the final metric. Please refer to Appendix B for detailed dataset information.
Models and Baselines. We evaluate on three state-of-the-art open-source models: 1) Mistral-7B [16], the Mistral-7B-Instruct-v0.2 version; 2) Gemma-7B [24], the Gemma-1.1-7B-it version; 3) Llama-3-8B [25], the Meta-Llama-3-8B-Instruct version. We compare our method with two prompt-based baselines: REACT [54] and Reflexion [40]. Besides, we adopt two strong baselines that introduce rejected trajectories into the training process to learn from experience: NAT [49], which learns from rejected trajectories through SFT, and ETO [44], which learns from rejected trajectories through DPO [36]. Moreover, we compare with a knowledge-augmented planning method, KNOWAGENT. We also include ChatGPT (gpt-3.5-turbo-0125) [27] and GPT-4 (gpt-4-32K-0613) [28] for comparison. All the prompt-based baselines are tested under one-shot and all the fine-tuning-based baselines are trained with LoRA [12]. Please refer to Appendix C for baselines and reproducing details.
Training and Inference Setups. We fine-tune the proposed approach with LoRA [12] using the LlamaFactory [62] framework. During training, the model is tuned on the entire trajectory rather than on each step of action. The learning rate is 1e-4 and the sequence length is 2048 for all the models. The training epoch is 3 and the batch size is 32. We adopt the AdamW optimizer [22] with a cosine learning rate scheduler. During inference, we apply the embedding layer of WKM as the encoder and use the cosine similarity between sentences for retrieval. The number of retrieved action-state-action triplets N is set to 3000 and the P_agent(A_u) weight γ is set to {0.4, 0.5, 0.7}. All the training and inference experiments are conducted on 8 NVIDIA V100 32G GPUs within 12 hours. Please refer to Appendix D for detailed hyperparameters used in our paper.
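The paper states that the WKM's embedding layer serves as the sentence encoder and that cosine similarity drives retrieval; a minimal sketch of that step is shown below, assuming mean pooling over the embedding vectors (the pooling choice and function names are assumptions, not confirmed by the paper).

```python
import torch
import torch.nn.functional as F

# Sketch: encode state-knowledge sentences with the WKM embedding layer (mean pooling)
# and rank knowledge-base entries by cosine similarity.
def encode(texts, tokenizer, embedding_layer):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    vectors = embedding_layer(batch["input_ids"])             # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (vectors * mask).sum(dim=1) / mask.sum(dim=1)    # mean over non-padding tokens
    return F.normalize(pooled, dim=-1)

def rank_by_similarity(query_text, candidate_texts, tokenizer, embedding_layer):
    query = encode([query_text], tokenizer, embedding_layer)          # (1, hidden)
    candidates = encode(candidate_texts, tokenizer, embedding_layer)  # (num_candidates, hidden)
    sims = (candidates @ query.T).squeeze(-1)                         # cosine similarity
    return torch.argsort(sims, descending=True)
```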
4.2 Results

Main Results. As shown in Table 1, for prompt-based baselines on open-source models, both REACT and Reflexion exhibit poor performance, far behind our method and fine-tuning-based baselines on various datasets. GPT-3.5-Turbo performs ordinarily on the two datasets other than WebShop, and it even falls behind Mistral-7B and Llama-3-8B's REACT performance on ScienceWorld. However, GPT-4 exhibits strong performance across various datasets.
Figure 3: Ablation Study on Mistral-7B. w/o all means the vanilla experienced agent model trained with pure expert trajectories. w/ state is testing the agent model with only state knowledge base constraints. w/ task stands for guiding the agent model with only task knowledge. w/ task&state is our WKM with both task knowledge guidance and state knowledge constraints. w/o rejected means synthesizing task knowledge solely through expert trajectories. merge stands for training WKM and the agent model together with one single model. prompt means using few-shot prompts to replace the WKM for providing knowledge.
Table 2: Average Steps. The maximum number of steps in ALFWorld and WebShop is 40 and 10. In ScienceWorld, the number of steps ranges from 10 to 120 depending on the task type, with an average of around 40.

Method | ALFWorld Seen | ALFWorld Unseen | WebShop | ScienceWorld Seen | ScienceWorld Unseen
-------|---------------|-----------------|---------|-------------------|--------------------
NAT | 23.27 | 23.42 | 4.08 | 20.18 | 21.21
ETO | 19.82 | 22.29 | 3.99 | 24.13 | 26.35
KNOWAGENT | 18.51 | 24.56 | 4.01 | 21.06 | 24.74
WKM | 17.66 | 17.92 | 3.97 | 18.74 | 19.59
Table 3: Hallucinatory Action Rates on ALFWorld. We calculate the proportion of trajectories containing invalid actions regardless of their correctness.

Method | ALFWorld Seen | ALFWorld Unseen
-------|---------------|----------------
NAT | 45.71% | 50.00%
ETO | 34.29% | 36.57%
KNOWAGENT | 33.57% | 44.78%
WKM | 32.86% | 29.85%
Nevertheless, our approach, through LoRA training alone, surpasses GPT-4 on ALFWorld (44.29→73.57 on seen, 38.05→76.87 on unseen) and WebShop (62.76→66.64). For fine-tuning-based baselines, both NAT and ETO fall behind our method, implying that just integrating world knowledge for agent models is worth more than further fussy SFT or DPO on negative examples. Our method also performs better than KNOWAGENT, which brings human-designed fixed action knowledge and long action paths into trajectories. This suggests the effectiveness of our WKM, which is responsible for generating instance-level task knowledge and maintaining implicit action constraints. Furthermore, KNOWAGENT's performance on unseen tasks is not as impressive as on seen tasks, while WKM can keep its advantage. This phenomenon also demonstrates the generalization ability of WKM.
Approach Ablations. As shown in Figure 3, taking Mistral-7B as an example, we decompose the key components of WKM to examine the roles of the task and state knowledge separately. In a macro view, removing each module results in a clear drop in the agent's performance, which validates the power of our world knowledge. Furthermore, the improvement through task knowledge (w/ task) is more pronounced than that through state knowledge (w/ state), suggesting the necessity of global prior knowledge for agent planning. A more micro observation reveals that the impact of state knowledge is more significant on seen tasks compared to unseen tasks, while the influence of task knowledge is sustainable across seen and unseen tasks. This may be attributed to the fact that although our real-time state knowledge is generated by WKM, the state knowledge base is built on the training set, which may weaken generalization to some extent. Additionally, to validate our motivation of allowing the agent to learn task knowledge from both expert and generated trajectories, we exclude the rejected trajectories during the synthesis of task knowledge, instructing the agent model to synthesize knowledge solely based on the chosen trajectories. The results (w/o rejected) demonstrate that learning from the contrast between chosen and rejected trajectories is more effective than learning from chosen examples alone. This procedure is a little similar to DPO, but we achieve it through knowledge augmentation rather than directly converting it into a loss calculation between chosen and rejected trajectories. Additional results further evidence that training a WKM separately performs better than training one single model together with the agent model, as well as than using few-shot prompts to replace WKM for providing knowledge.
4.3 Analysis

World knowledge can mitigate blind trial-and-error and reduce hallucinatory actions. We compare the number of planning steps for each dataset between three strong baselines and WKM and calculate the average steps of each method. As depicted in Figure 9 (in Appendix F), WKM demonstrates the ability to complete a significant proportion of tasks using the shortest trajectory, indicating that guidance from world knowledge can effectively reduce the agent's blind trial-and-error in the environment. Taking a further perspective from an average standpoint in Table 2, it can be observed that WKM exhibits lower average planning steps compared to other baselines. As ALFWorld can respond to invalid actions, in Table 3 we count the percentage of hallucinatory actions that occurred in trajectories from ALFWorld for each method. The results confirm the effectiveness of our world knowledge model in decreasing hallucinatory actions. Furthermore, it is worth noting that most baselines show a prominent increase in the average number of steps and percentage of invalid actions when transitioning from seen tasks to unseen tasks, but WKM can still maintain a relatively low level. This reflects laterally that our world knowledge can still effectively guide the agent model on unseen tasks, highlighting the knowledge generalization brought by the world knowledge model. To see how our world knowledge works, please refer to our case study in Appendix H.
Our instance-level knowledge can generalize better to unseen tasks. To further explore the benefit of using a knowledge model to generate instance-level task knowledge, we carefully survey the task knowledge generated by our WKM and abstract it into dataset-level knowledge for each dataset. Then we retrain the agent model to adapt to the new dataset-level knowledge5. As illustrated in Figure 4, we compare the performance of dataset-level knowledge with our instance-level task knowledge (WKM w/o state) on ALFWorld and ScienceWorld. It can be observed that our model-generated instance-level knowledge not only surpasses human-designed knowledge on seen tasks but also exhibits even more remarkable performance on unseen tasks, with the improvement in performance on unseen tasks significantly greater than that on seen tasks. This phenomenon directly reflects the strong generalization ability of our knowledge model compared to rigidly designed knowledge by humans.
[Figure 4: Average reward of human-designed dataset-level knowledge (Human) vs. our instance-level task knowledge (WKM).]