Agent Planning with World Knowledge Model

Shuofei Qiao◆, Runnan Fang◆, Ningyu Zhang◆, Yuqi Zhu◆, Xiang Chen◆, Shumin Deng%, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen◆†

◆Zhejiang University  %National University of Singapore  Alibaba Group

{shuofei,zhangningyu}@zju.edu.cn

arXiv:2405.14205v2 [cs.CL] 15 Oct 2024
Abstract
Recent endeavors towards directly using large language models (LLMs) as agent models to execute interactive planning tasks have shown commendable results. Despite their achievements, however, they still struggle with brainless trial-and-error in global planning and generating hallucinatory actions in local planning due to their poor understanding of the "real" physical world. Imitating humans' mental world knowledge model, which provides global prior knowledge before the task and maintains local dynamic knowledge during the task, in this paper we introduce a parametric World Knowledge Model (WKM) to facilitate agent planning. Concretely, we steer the agent model to self-synthesize knowledge from both expert and sampled trajectories. Then we develop a WKM, providing prior task knowledge to guide the global planning and dynamic state knowledge to assist the local planning. Experimental results on three complex real-world simulated datasets with three state-of-the-art open-source LLMs, Mistral-7B, Gemma-7B, and Llama-3-8B, demonstrate that our method can achieve superior performance compared to various strong baselines. Besides, we analyze to illustrate that our WKM can effectively alleviate the blind trial-and-error and hallucinatory action issues, providing strong support for the agent's understanding of the world. Other interesting findings include: 1) our instance-level task knowledge can generalize better to unseen tasks, 2) a weak WKM can guide strong agent model planning, and 3) unified WKM training has promising potential for further development³.
[Figure 1: (a) a traditional agent model falls into blind trial-and-error and hallucinatory actions before finding the correct path; (b) with a world knowledge model, task knowledge and state knowledge combine the knowledge model's and the agent model's action probabilities (know_probs, agent_probs) so that the agent takes the correct first step.]

Figure 1: Traditional agent planning vs. agent planning with world knowledge model.
1 Introduction

The remarkable advances in Large Language Models (LLMs) have witnessed a rapid development of various natural language processing tasks [25, 16, 28, 47, 60, 33]. Recently, multiple attempts that

*Equal Contribution.
†Corresponding Author.
³The code is available at /zjunlp/WKM.

38th Conference on Neural Information Processing Systems (NeurIPS 2024).
[Figure 2 panels: Training Phase: (a) Task Knowledge Synthesis, where the agent model compares an expert trajectory τ_w (e.g., "put a clean egg in microwave", reward 1.0) against a sampled trajectory τ_l (reward 0.0) to produce task knowledge; (b) State Knowledge Summarization, where state knowledge such as "Your task is to ... You are checking ..." is summarized at each step and stored as (a_t, s_t, a_{t+1}) triplets in a state knowledge base; (c) Model Training of the world knowledge model on this data. Planning Phase: (d) Planning with WKM on the task "put two newspapers in drawer", where the generated task knowledge (locate all objects, go to the drawer, place each object inside, close the drawer, repeat) guides global planning, and each next action is chosen by combining (1−γ)·p_know + γ·p_agent over actions such as go/take/put/heat. State knowledge does not appear in the context of the agent model during training and inference.]

Figure 2: Overview of our WKM. We train a world knowledge model on the knowledge synthesized by the agent model itself from both expert and explored trajectories, providing prior task knowledge to guide global planning and dynamic state knowledge to assist local planning.
the action a_{t+1} based on h_t at each time step t+1:

$$a_{t+1}\sim\pi_\theta(\cdot\mid h_t).\tag{2}$$

Specifically, a_0 ∼ π_θ(·|u) is generated according to the task instruction u. The whole trajectory τ concludes when the task is completed or exceeds the maximum time steps. Then the production of the entire trajectory with time length n can be modeled as:

$$\pi_\theta(\tau\mid u)=\pi_\theta(a_0\mid u)\prod_{t=0}^{n}\pi_\theta(a_{t+1}\mid h_t).\tag{3}$$

Ultimately, the final reward r(u, τ) ∈ [0, 1] representing the task completion rate is calculated. Note that we follow a REACT-style [54] trajectory that includes rationales before each action. We use a_t to represent the action with rationales for convenience.
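To make this formulation concrete, the following is a minimal sketch of the rollout loop in Python. The names `agent_generate` and `env` are hypothetical stand-ins for the policy π_θ and the simulated environment, not the paper's actual interfaces.

```python
# A minimal sketch of the trajectory rollout in Eqs. (2)-(3).
# `agent_generate` (the policy pi_theta) and `env` are hypothetical
# stand-ins; the paper's real interfaces may differ.

def rollout(task_instruction: str, agent_generate, env, max_steps: int = 40):
    """Roll out one REACT-style trajectory h_t = (u, a_0, o_0, ..., a_t, o_t)."""
    history = [("task", task_instruction)]            # h_0 starts from u
    action = agent_generate(history)                  # a_0 ~ pi_theta(.|u)
    for _ in range(max_steps):
        observation, reward, done = env.step(action)  # o_t from the environment
        history.append(("action", action))
        history.append(("observation", observation))
        if done:                                      # task completed
            return history, reward                    # r(u, tau) in [0, 1]
        action = agent_generate(history)              # a_{t+1} ~ pi_theta(.|h_t)
    return history, 0.0                               # exceeded max time steps
```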
World Knowledge Model. The world knowledge model serves as humans' mental cognition of the physical environment, more intricate than the word knowledge model which LLM-powered agent models are trained to be [61, 10, 52, 13]. Our "world" here refers to the simulated environment of the task. Based on the static environment of the task and the dynamic changes during interaction with the agent, we define world knowledge as a combination of prior global knowledge and dynamic local knowledge, corresponding respectively to the blind trial-and-error problem in global planning and the hallucinatory action issue in local planning in traditional agent models. To attain precise and efficient agent planning, we develop a parametric WKM to simulate the mental WKM of humans.
3 Method

As shown in Figure 2, we steer the agent model to self-synthesize the task knowledge from the comparison of expert and sampled trajectories (§3.1). Then we prompt the agent model to self-summarize the state knowledge based on historical behavior and construct a state knowledge base (§3.2). The generated knowledge is then integrated into the expert trajectories for training the WKM. After the training process (§3.3), we augment the agent model with the world knowledge model to achieve effective and accurate planning (§3.4).
3.1 Task Knowledge Synthesis

The task knowledge serves as the prior knowledge to guide the agent model's global planning and prevent it from dropping into blind trial-and-error.
Experienced Agent Exploration. We primarily acquire task knowledge through the comparison of preference trajectories (chosen vs. rejected). In order to improve the quality of rejected trajectories and obtain more targeted task knowledge, we employ an experienced agent for exploration. Firstly, we train a vanilla language model with expert trajectories⁴ from the training set to obtain an experienced agent. Subsequently, the experienced agent explores the training-set tasks again to generate rejected trajectories. Our purpose is to extract superior task knowledge that cannot be acquired solely through supervised fine-tuning on chosen trajectories, thus further effectively boosting the agent's capabilities.

Self Knowledge Synthesis. With the expert trajectories as the chosen ones and the trajectories sampled from the experienced agent as the rejected ones, we prompt the agent model itself to synthesize the task knowledge. Supposing K is the task knowledge space:

$$\kappa\sim\pi_\theta(\cdot\mid\rho_{\mathrm{TaskKnow}},u,\tau_w,\tau_l),\tag{4}$$

where κ ∈ K is the task knowledge, ρ_TaskKnow stands for the prompt to instruct the task knowledge extraction, and τ_w, τ_l are the chosen and rejected trajectories respectively. Note that given the same task u, τ_w and τ_l always satisfy r(u, τ_w) = 1 ≥ r(u, τ_l). Even when r(u, τ_w) = r(u, τ_l), we still consider trajectories sampled from the experienced agent as rejected ones. This is because expert trajectories often have shorter step lengths, enabling the agent to learn more knowledge of efficient planning. For detailed prompts of task knowledge synthesis, please refer to Appendix I.1.
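As an illustration of Eq. (4), here is a minimal sketch of the synthesis step, assuming a generic `llm_generate` helper that wraps the agent model; the prompt text below is a paraphrase for illustration only, and the real ρ_TaskKnow is given in Appendix I.1.

```python
# Sketch of task knowledge synthesis (Eq. 4). `llm_generate` is a
# hypothetical helper wrapping the agent model; the prompt here is a
# paraphrase, not the paper's actual rho_TaskKnow (see Appendix I.1).

TASK_KNOW_PROMPT = (
    "Given a task, a successful expert trajectory, and a failed sampled "
    "trajectory, summarize general task knowledge (key principles and an "
    "action workflow) that explains why the expert trajectory succeeds."
)

def synthesize_task_knowledge(llm_generate, task: str,
                              chosen_traj: str, rejected_traj: str) -> str:
    """kappa ~ pi_theta(. | rho_TaskKnow, u, tau_w, tau_l)."""
    prompt = (f"{TASK_KNOW_PROMPT}\n\nTask: {task}\n\n"
              f"Expert (chosen) trajectory:\n{chosen_traj}\n\n"
              f"Sampled (rejected) trajectory:\n{rejected_traj}\n\n"
              "Task knowledge:")
    return llm_generate(prompt)
```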
3.2 State Knowledge Summarization

The state knowledge serves as the dynamic knowledge to constrain the agent model's local planning and prevent it from generating hallucinatory actions. We prompt the agent model to self-summarize state knowledge at each planning step based on the expert trajectories to guarantee quality. For detailed prompts of state knowledge summarization, please refer to Appendix I.2.

Supposing the prompt used to summarize state knowledge is ρ_StateKnow and the state knowledge s ∈ S is a part of the state space S, the generation of state knowledge at time t can be represented as:

$$s_t\sim\pi_\theta(\cdot\mid\rho_{\mathrm{StateKnow}},h_t).\tag{5}$$

State Knowledge Base Construction. To avoid confusion caused by excessive additional information, instead of explicitly concatenating the state knowledge to the context, we construct a state knowledge base for retrieval (we analyze in §4.3 how explicit state knowledge may affect the performance of the agent model). We combine the state knowledge s_t with the previous action a_t and next action a_{t+1} from the expert trajectory to form an action-state-action triplet (a_t, s_t, a_{t+1}). After iterating through all expert trajectories, we obtain a State Knowledge Base B = {(s, a_pre, a_next)^(i)}_{i=1}^{|B|}, where a_pre = a_t, a_next = a_{t+1}, and |B| is the size of the state knowledge base.
3.3 Model Training

We integrate the generated world knowledge into expert trajectories and train a world knowledge model. The agent model needs to be re-trained to adapt to the incorporation of task knowledge. Note that our agent model and knowledge model are both trained with LoRA sharing the same backbone. We list examples of training data for both the agent model and the WKM in Appendix E.

Agent Model Training. Given the expert trajectories dataset D = {(u, κ, τ_w)^(i)}_{i=1}^{|D|} with task knowledge κ generated in §3.1, we train the agent model to follow the task knowledge to generate actions. In an auto-regressive manner, the loss of the agent model can be formulated as:

$$\mathcal{L}_{\mathrm{agent}}(\pi_\theta)=-\mathbb{E}_{\tau_w\sim\mathcal{D}}\left[\pi_\theta(\tau_w\mid u,\kappa)\right]\tag{6}$$

Suppose X = (x_1, x_2, ..., x_{|X|}) is the token sequence of the trajectory τ_w; we have:

$$\pi_\theta(\tau_w\mid u,\kappa)=-\sum_{j=1}^{|X|}\left(\mathbb{1}(x_j\in\mathcal{A})\times\log\pi_\theta(x_j\mid u,\kappa,x_{<j})\right),\tag{7}$$

Here 1(x_j ∈ A) is the indicator function to mask tokens unrelated to actions. Please note that τ_w here does not include the state knowledge mentioned in §3.2.
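A minimal PyTorch-style sketch of this masked objective, under an assumed (batch, sequence, vocab) tensor layout: only tokens inside action spans contribute to the loss, and the same pattern with a state-knowledge mask yields the WKM objective of Eq. (10).

```python
import torch
import torch.nn.functional as F

# Sketch of the masked auto-regressive loss in Eq. (7): standard
# next-token cross-entropy where tokens outside action spans
# (indicator 1(x_j in A) == 0) are masked out. Swapping in a
# state-knowledge mask 1(x_j in S) gives the WKM loss of Eq. (10).

def masked_lm_loss(logits: torch.Tensor,       # (batch, seq_len, vocab)
                   input_ids: torch.Tensor,    # (batch, seq_len)
                   action_mask: torch.Tensor,  # (batch, seq_len), 1 on action tokens
                   ) -> torch.Tensor:
    shift_logits = logits[:, :-1, :]            # predict token j from x_{<j}
    shift_labels = input_ids[:, 1:]
    shift_mask = action_mask[:, 1:].float()
    nll = F.cross_entropy(shift_logits.reshape(-1, shift_logits.size(-1)),
                          shift_labels.reshape(-1), reduction="none")
    nll = nll.reshape(shift_labels.shape) * shift_mask
    return nll.sum() / shift_mask.sum().clamp(min=1)  # mean over action tokens
```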
⁴For details on how to collect expert trajectories, please refer to Appendix A.
World Knowledge Model Training. The main difference in the training data between the agent and knowledge model is the added state knowledge. Given the expert trajectories dataset with both task and state knowledge D′ = {(u, κ, τ)^(i)}_{i=1}^{|D′|}, where τ = (a_0, o_0, s_0, ..., a_n, o_n, s_n), the loss of the knowledge model π_ϕ can be formulated as:

$$\mathcal{L}_{\mathrm{know}}(\pi_\phi)=-\mathbb{E}_{\kappa,\tau\sim\mathcal{D}'}\left[\pi_\phi(\kappa\mid u)\,\pi_\phi(\tau\mid u,\kappa)\right]\tag{8}$$

Suppose X′ = (x′_1, x′_2, ..., x′_{|X′|}) is the token sequence of the expert trajectory with state knowledge τ and Y = (y_1, y_2, ..., y_{|Y|}) represents the token sequence of the task knowledge κ; we have:

$$\pi_\phi(\kappa\mid u)=-\sum_{i=1}^{|Y|}\log\pi_\phi(y_i\mid u,y_{<i})\tag{9}$$

$$\pi_\phi(\tau\mid u,\kappa)=-\sum_{j=1}^{|X'|}\left(\mathbb{1}(x'_j\in\mathcal{S})\times\log\pi_\phi(x'_j\mid u,\kappa,x'_{<j})\right),\tag{10}$$

where 1(x′_j ∈ S) is the indicator function to mask tokens unrelated to state knowledge.
3.4 Agent Planning with World Knowledge Model

At inference time, the agent model plans on the evaluation tasks with the aid of the world knowledge model. We redefine the historical trajectory h_t = (u, κ, a_0, o_0, a_1, o_1, ..., a_t, o_t). Given a specific task instruction u, the knowledge model first generates the task knowledge κ ∼ π_ϕ(·|u), then the agent model starts planning. Assuming the available action set A_u ⊆ A for the task u is (α_u^(1), α_u^(2), ..., α_u^(|A_u|)), at any time t ≥ 0, instead of directly generating the next action a_{t+1} ∈ A_u based on h_t, we first employ the world knowledge model to generate the current state knowledge s_t ∼ π_ϕ(·|h_t) and leverage s_t to query the state knowledge base B = {(s, a_pre, a_next)^(i)}_{i=1}^{|B|}. With the state knowledge as the key, we retrieve the N nearest triplets where a_pre = a_t based on semantic similarity and collect the corresponding next actions a_next. We count the probability of each action p_know(α_u^(i)) = N_i/N, where N_i is the occurrence number of action α_u^(i) in all the collected a_next. Therefore, we get the probability acquired from the state knowledge base:

$$P_{\mathrm{know}}(\mathcal{A}_u)=\left(p_{\mathrm{know}}(\alpha_u^{(1)}),p_{\mathrm{know}}(\alpha_u^{(2)}),\cdots,p_{\mathrm{know}}(\alpha_u^{(|\mathcal{A}_u|)})\right),\qquad\sum_{i=1}^{|\mathcal{A}_u|}p_{\mathrm{know}}(\alpha_u^{(i)})=1.\tag{11}$$

Afterward, we take the probability distribution of the first token of each action α_u^(i), 1 ≤ i ≤ |A_u|, from the last layer of the agent model and apply a softmax function to normalize it. We define the probability acquired from the agent model as:

$$P_{\mathrm{agent}}(\mathcal{A}_u)=\left(p_{\mathrm{agent}}(\alpha_u^{(1)}),p_{\mathrm{agent}}(\alpha_u^{(2)}),\cdots,p_{\mathrm{agent}}(\alpha_u^{(|\mathcal{A}_u|)})\right),\qquad\sum_{i=1}^{|\mathcal{A}_u|}p_{\mathrm{agent}}(\alpha_u^{(i)})=1.\tag{12}$$

Finally, we determine the next action by combining the above two probabilities:

$$a_{t+1}=\mathop{\arg\max}_{\alpha_u^{(i)}\in\mathcal{A}_u}\left(\gamma\cdot p_{\mathrm{agent}}(\alpha_u^{(i)})+(1-\gamma)\cdot p_{\mathrm{know}}(\alpha_u^{(i)})\right),\tag{13}$$

where γ is the hyperparameter that controls the proportion of P_agent(A_u). Based on the above, we enhance the agent planning with global guidance from task knowledge and local constraints from state knowledge generated by our WKM. Due to the WKM and retrieval, the inference stage incurs additional time overhead compared to the pure agent model; the approximate ratio is around 2.5:1.
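The following sketch illustrates the decision rule of Eqs. (11)-(13): counts over the retrieved next actions give P_know, a softmax over the agent's first-token logits gives P_agent, and the two are mixed with weight γ. The `retrieved_next_actions` and `first_token_logits` inputs are assumed interfaces for illustration, not the released implementation.

```python
import math
from collections import Counter

# Sketch of action selection with WKM (Eqs. 11-13). The caller is
# assumed to supply: the N retrieved a_next strings (from triplets with
# a_pre == a_t), and each candidate action's first-token logit from the
# agent model. Both interfaces are hypothetical.

def select_action(candidate_actions, retrieved_next_actions, first_token_logits,
                  gamma: float = 0.5):
    # P_know: relative frequency of each candidate among retrieved a_next (Eq. 11)
    counts = Counter(retrieved_next_actions)
    total = sum(counts.get(a, 0) for a in candidate_actions) or 1
    p_know = {a: counts.get(a, 0) / total for a in candidate_actions}

    # P_agent: softmax over the agent's first-token logits (Eq. 12)
    logits = [first_token_logits[a] for a in candidate_actions]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    p_agent = {a: e / z for a, e in zip(candidate_actions, exps)}

    # Eq. (13): argmax of the gamma-weighted mixture
    return max(candidate_actions,
               key=lambda a: gamma * p_agent[a] + (1 - gamma) * p_know[a])
```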
4 Experiments

4.1 Experimental Settings

Datasets and Metrics. We evaluate our method on three real-world simulated planning datasets: ALFWorld [41], WebShop [53], and ScienceWorld [50]. ALFWorld and ScienceWorld include unseen tasks to evaluate the agent's generalization ability. The reward of ALFWorld is binary, 0 or 1, indicating whether the agent has completed the task or not. WebShop and ScienceWorld provide dense rewards from 0 to 1 to measure the completion level of the task. For all the datasets, we apply average reward as the final metric. Please refer to Appendix B for detailed dataset information.
Table 1: Main Results. The best result for each backbone is marked in bold, and values in parentheses are the changes of WKM relative to the best result among the baselines. All the prompt-based baselines (REACT, Reflexion) are evaluated under one-shot prompting and all the fine-tuning-based baselines (NAT, ETO, KNOWAGENT, WKM) are trained through LoRA. WKM and the agent model are different LoRAs sharing the same backbone.

| Backbone | Method | ALFWorld Seen | ALFWorld Unseen | WebShop | ScienceWorld Seen | ScienceWorld Unseen |
|---|---|---|---|---|---|---|
| GPT-3.5-Turbo | REACT | 8.57 | 5.97 | 44.37 | 15.41 | 13.99 |
| GPT-4 | REACT | 44.29 | 38.05 | 62.76 | 67.32 | 65.09 |
| Mistral-7B | REACT | 7.86 | 5.22 | 14.63 | 20.72 | 17.65 |
| | Reflexion | 11.56 | 6.00 | 16.64 | 21.07 | 18.11 |
| | NAT | 64.43 | 68.96 | 61.01 | 57.12 | 50.79 |
| | ETO | 66.84 | 71.43 | 64.09 | 58.17 | 51.85 |
| | KNOWAGENT | 70.44 | 70.72 | 61.28 | 59.32 | 47.24 |
| | WKM | **73.57** (+3.13) | **76.87** (+5.44) | **65.48** (+1.39) | **62.12** (+2.80) | **53.62** (+1.77) |
| Gemma-7B | REACT | 6.43 | 2.24 | 5.93 | 3.58 | 3.51 |
| | Reflexion | 7.14 | 2.99 | 7.71 | 4.94 | 3.93 |
| | NAT | 67.86 | 65.88 | 55.82 | 47.63 | 44.98 |
| | ETO | 66.43 | 68.66 | 62.67 | 50.44 | 47.84 |
| | KNOWAGENT | 69.29 | 67.60 | 58.80 | 48.55 | 45.28 |
| | WKM | **70.71** (+1.42) | **70.40** (+1.74) | **63.75** (+1.08) | **53.68** (+3.24) | **49.24** (+1.40) |
| Llama-3-8B | REACT | 2.86 | 3.73 | 19.32 | 24.76 | 22.66 |
| | Reflexion | 4.29 | 4.48 | 22.73 | 27.23 | 25.41 |
| | NAT | 60.71 | 59.70 | 61.60 | 55.24 | 48.76 |
| | ETO | 64.29 | 64.18 | 64.57 | 57.90 | 52.33 |
| | KNOWAGENT | 66.71 | 62.69 | 64.40 | 58.67 | 49.18 |
| | WKM | **68.57** (+1.86) | **65.93** (+1.75) | **66.64** (+2.07) | **60.12** (+1.55) | **54.75** (+2.42) |
Models and Baselines. We evaluate on three state-of-the-art open-source models: 1) Mistral-7B [16], the Mistral-7B-Instruct-v0.2 version; 2) Gemma-7B [24], the Gemma-1.1-7B-it version; and 3) Llama-3-8B [25], the Meta-Llama-3-8B-Instruct version. We compare our method with two prompt-based baselines: REACT [54] and Reflexion [40]. Besides, we adopt two strong baselines that introduce rejected trajectories into the training process to learn from experience: NAT [49], which learns from rejected trajectories through SFT, and ETO [44], which learns from rejected trajectories through DPO [36]. Moreover, we compare with a knowledge-augmented planning method, KNOWAGENT. We also include ChatGPT (gpt-3.5-turbo-0125) [27] and GPT-4 (gpt-4-32K-0613) [28] for comparison. All the prompt-based baselines are tested under one-shot prompting and all the fine-tuning-based baselines are trained with LoRA [12]. Please refer to Appendix C for baseline and reproduction details.

Training and Inference Setups. We fine-tune the proposed approach with LoRA [12] using the LlamaFactory [62] framework. During training, the model is tuned on the entire trajectory rather than on each step of action. The learning rate is 1e-4 and the sequence length is 2048 for all the models. The number of training epochs is 3 and the batch size is 32. We adopt the AdamW optimizer [22] with a cosine learning rate scheduler. During inference, we apply the embedding layer of the WKM as the encoder and use the cosine similarity between sentences for retrieval. The number of retrieved action-state-action triplets N is set to 3000 and the P_agent(A_u) weight γ is set to {0.4, 0.5, 0.7}. All the training and inference experiments are conducted on 8 NVIDIA V100 32G GPUs within 12 hours. Please refer to Appendix D for the detailed hyperparameters used in our paper.
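As a sketch of this retrieval setup, the snippet below mean-pools the WKM's token-embedding layer to encode sentences and ranks state-knowledge entries by cosine similarity; `tokenizer` and `embedding` are assumed interfaces and may differ from the released code.

```python
import torch

# Sketch of state-knowledge retrieval: sentences are encoded by
# mean-pooling the WKM's token-embedding layer, then ranked by cosine
# similarity. `tokenizer` and `embedding` (the WKM embedding layer)
# are assumptions about the interface.

def encode(texts, tokenizer, embedding) -> torch.Tensor:
    vecs = []
    for text in texts:
        ids = torch.tensor(tokenizer.encode(text))
        vecs.append(embedding(ids).mean(dim=0))  # mean-pool token embeddings
    return torch.stack(vecs)

def retrieve(query: str, base_states, tokenizer, embedding, top_n: int = 3000):
    """Return indices of the top-N base entries most similar to the query state."""
    q = encode([query], tokenizer, embedding)           # (1, d)
    b = encode(base_states, tokenizer, embedding)       # (|B|, d)
    sims = torch.nn.functional.cosine_similarity(q, b)  # (|B|,)
    return sims.topk(min(top_n, len(base_states))).indices.tolist()
```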
4.2 Results

Main Results. As shown in Table 1, for prompt-based baselines on open-source models, both REACT and Reflexion exhibit poor performance, far behind our method and the fine-tuning-based baselines on various datasets. GPT-3.5-Turbo performs ordinarily on the two datasets other than WebShop, and it even falls behind the REACT performance of Mistral-7B and Llama-3-8B on ScienceWorld. GPT-4, however, exhibits strong performance across various datasets. Nevertheless, our approach, through LoRA training alone, surpasses GPT-4 on ALFWorld (44.29→73.57 on seen, 38.05→76.87 on unseen) and WebShop (62.76→66.64). For fine-tuning-based baselines, both NAT and ETO fall behind our method, implying that integrating world knowledge into agent models is worth more than further fussy SFT or DPO on negative examples. Our method also performs better than KNOWAGENT, which brings human-designed fixed action knowledge and long action paths into trajectories. This suggests the effectiveness of our WKM, which is responsible for generating instance-level task knowledge and maintaining implicit action constraints. Furthermore, KNOWAGENT's performance on unseen tasks is not as impressive as on seen tasks, while WKM keeps its advantage. This phenomenon also demonstrates the generalization ability of WKM.
[Figure 3: bar charts of average reward for each ablation variant on ALFWorld (seen/unseen), WebShop (test), and ScienceWorld (seen/unseen).]

Figure 3: Ablation study on Mistral-7B. w/o all means the vanilla experienced agent model trained with pure expert trajectories. w/ state is testing the agent model with only state knowledge base constraints. w/ task stands for guiding the agent model with only task knowledge. w/ task&state is our WKM with both task knowledge guidance and state knowledge constraints. w/o rejected means synthesizing task knowledge solely through expert trajectories. merge stands for training the WKM and the agent model together as one single model. prompt means using few-shot prompts to replace the WKM for providing knowledge.
Table 2: Average Steps. The maximum number of steps in ALFWorld and WebShop is 40 and 10, respectively. In ScienceWorld, the number of steps ranges from 10 to 120 depending on the task type, with an average of around 40.

| Method | ALFWorld Seen | ALFWorld Unseen | WebShop | ScienceWorld Seen | ScienceWorld Unseen |
|---|---|---|---|---|---|
| NAT | 23.27 | 23.42 | 4.08 | 20.18 | 21.21 |
| ETO | 19.82 | 22.29 | 3.99 | 24.13 | 26.35 |
| KNOWAGENT | 18.51 | 24.56 | 4.01 | 21.06 | 24.74 |
| WKM | 17.66 | 17.92 | 3.97 | 18.74 | 19.59 |

Table 3: Hallucinatory Action Rates on ALFWorld. We calculate the proportion of trajectories containing invalid actions regardless of their correctness.

| Method | ALFWorld Seen | ALFWorld Unseen |
|---|---|---|
| NAT | 45.71% | 50.00% |
| ETO | 34.29% | 36.57% |
| KNOWAGENT | 33.57% | 44.78% |
| WKM | 32.86% | 29.85% |
Approach Ablations. As shown in Figure 3, taking Mistral-7B as an example, we decompose the key components of WKM to examine the roles of the task and state knowledge separately. In a macro view, removing each module results in a clear drop in the agent's performance, which validates the power of our world knowledge. Furthermore, the improvement through task knowledge (w/ task) is more pronounced than that through state knowledge (w/ state), suggesting the necessity of global prior knowledge for agent planning. A more micro observation reveals that the impact of state knowledge is more significant on seen tasks compared to unseen tasks, while the influence of task knowledge is sustained across seen and unseen tasks. This may be attributed to the fact that, although our real-time state knowledge is generated by the WKM, the state knowledge base is built on the training set, which may weaken generalization to some extent. Additionally, to validate our motivation of allowing the agent to learn task knowledge from both expert and generated trajectories, we exclude the rejected trajectories during the synthesis of task knowledge, instructing the agent model to synthesize knowledge solely based on the chosen trajectories. The results (w/o rejected) demonstrate that learning from the contrast between chosen and rejected trajectories is more effective than learning from chosen examples alone. This procedure is a little similar to DPO, but we achieve it through knowledge augmentation rather than directly converting it into a loss calculation between chosen and rejected trajectories. Additional results further show that training a WKM separately performs better than training one single model together with the agent model, as well as than using few-shot prompts to replace the WKM for providing knowledge.
4.3 Analysis

World knowledge can mitigate blind trial-and-error and reduce hallucinatory actions. We compare the number of planning steps on each dataset between three strong baselines and WKM and calculate the average steps of each method. As depicted in Figure 9 (in Appendix F), WKM demonstrates the ability to complete a significant proportion of tasks using the shortest trajectory, indicating that guidance from world knowledge can effectively reduce the agent's blind trial-and-error in the environment. Taking a further perspective from an average standpoint in Table 2, it can be observed that WKM exhibits lower average planning steps compared to the other baselines. As ALFWorld can respond to invalid actions, in Table 3 we count the percentage of hallucinatory actions that occur in ALFWorld trajectories for each method. The results confirm the effectiveness of our world knowledge model in decreasing hallucinatory actions. Furthermore, it is worth noting that most baselines show a prominent increase in the average number of steps and the percentage of invalid actions when transitioning from seen tasks to unseen tasks, but WKM can still maintain a relatively low level on both. This indirectly reflects that our world knowledge can still effectively guide the agent model on unseen tasks, highlighting the knowledge generalization brought by the world knowledge model. To see how our world knowledge works, please refer to our case study in Appendix H.
Our instance-level knowledge can generalize better to unseen tasks. To further explore the benefit of using a knowledge model to generate instance-level task knowledge, we carefully survey the task knowledge generated by our WKM and abstract it into dataset-level knowledge for each dataset. Then we retrain the agent model to adapt to the new dataset-level knowledge⁵. As illustrated in Figure 4, we compare the performance of dataset-level knowledge with our instance-level task knowledge (WKM w/o state) on ALFWorld and ScienceWorld. It can be observed that our model-generated instance-level knowledge not only surpasses human-designed knowledge on seen tasks but also exhibits even more remarkable performance on unseen tasks, with the improvement on unseen tasks significantly greater than that on seen tasks. This phenomenon directly reflects the strong generalization ability of our knowledge model compared to knowledge rigidly designed by humans.
[Figure 4: average reward of human-designed dataset-level knowledge (Human) vs. WKM-generated instance-level task knowledge on seen and unseen tasks of ALFWorld and ScienceWorld.]