
HourVideo: 1-Hour Video-Language Understanding

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, Li Fei-Fei

arXiv:2411.04998v1 [cs.CV] 7 Nov 2024

Stanford University

Abstract

We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini 1.5 Pro (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at

.

1 Introduction

Humans demonstrate a remarkable ability to process visual stimuli over long time horizons, enabling them to perceive, plan and act in the real world. Consider the routine task of cooking a meal. This activity involves a continuous and adaptive visual process: identifying and using ingredients and tools, monitoring state changes of various dishes, and adjusting cooking duration/techniques based on visual cues such as color and texture. Such sustained visual processing is crucial to achieving the desired culinary outcomes. Naturally, endowing autonomous agents with this capability has been a long-standing goal in the field of Artificial Intelligence.

In recent years, large multimodal models [1-3] have emerged as a promising approach toward achieving this goal. Typically, these models are evaluated using multiple datasets that test capabilities such as object recognition [4, 5], image comprehension [6-8], and action recognition [9]. However, these benchmarks are often restricted to single images or short video clips, usually lasting from a few seconds to no more than three minutes [9-12]. While these benchmarks have spurred significant advancements, a deeper exploration into long-form video-language understanding is essential to develop multimodal systems that can form the basis for future autonomous agents and assistants.

A significant challenge in evaluating long-form video-language understanding capabilities is designing tasks that genuinely necessitate long-term comprehension, i.e., tasks that require long-range dependencies. Merely posing questions that can be answered by watching a brief segment of a lengthy video effectively reduces the task to a combination of temporal localization and short-clip understanding. Furthermore, while intriguing narrative inquiries can certainly be formulated for long-form videos such as television shows and films, it is imperative to ensure that the questions are not trivially answerable due to the vast prior knowledge encoded in modern large language models.

In this work, we introduce HourVideo, a benchmark dataset designed for long-form video-language understanding. To design tasks that require long-term comprehension, we first propose a novel task suite (Tab. 1), comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. For each task, we manually create question prototypes designed to ensure that correctly answering them requires identification and synthesis of information across multiple temporal segments within the long-form videos. Guided by our task suite, we curated 500 egocentric videos from the Ego4D dataset [13], covering 77 unique everyday activities and ranging from 20 to 120 minutes, to generate questions based on our prototypes. The combination of our comprehensive task suite and everyday mundane egocentric videos provides a robust framework to rigorously evaluate multimodal models' capabilities in understanding long-form videos. Finally, we developed a question-answer generation pipeline utilizing the expertise of trained human annotators (800+ hours of effort) and large language models (LLMs), resulting in a collection of 12,976 high-quality, five-way multiple-choice questions.

We comprehensively evaluate state-of-the-art multimodal models on HourVideo (Tab. 2, Fig. 4), including GPT-4V [2], Gemini 1.5 Pro [3], and LLaVA-NeXT [14] in a zero-shot setting. Our findings reveal that GPT-4V and LLaVA-NeXT achieve only marginal improvements over a random predictor (20%), obtaining accuracies of 25.7% and 22.3%, respectively. Gemini 1.5 Pro, designed specifically for long-context multimodal understanding, obtains an accuracy of 37.3%, which, while better, is still substantially lower than the average performance of human experts at 85.0%. These results suggest that while the multimodal community has made meaningful progress, a significant gap remains to be bridged before these systems can match human-level long-form video understanding capabilities. Progress in long-form video understanding could enable new applications including AR assistants, embodied agents, and interactive video platforms. We hope that HourVideo will serve as a benchmark to facilitate research in this direction and enable the development of multimodal models that can understand endless streams of visual data.

Correspondence to {keshik,agrim}@stanford.edu

38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks.

2 Benchmark Design and Construction

While open-ended question answering closely emulates human interaction, automating the evaluation of free-form natural language responses remains challenging. Given that our primary goal is to assess long-form video-language understanding capabilities, we opt for a five-way multiple-choice question-answering (MCQ) task. This approach simplifies the evaluation process by allowing us to calculate an aggregate question-answering accuracy metric. In the following sections, we describe our task suite and question-answer generation pipeline in detail, both of which are designed to curate diverse, high-quality five-way multiple-choice questions (MCQs).
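The aggregate metric described above is plain accuracy over option letters. The following is a minimal sketch, not the benchmark's evaluation toolkit: the function name `mcq_accuracy` and the letter normalization are assumptions made for illustration.

```python
from typing import Sequence

NUM_OPTIONS = 5  # five-way multiple choice, as stated in the text

# Expected accuracy of a uniform random guesser on five-way MCQs.
RANDOM_CHANCE = 1 / NUM_OPTIONS


def mcq_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions whose predicted option letter matches the key.

    Hypothetical helper: letters are compared case-insensitively after
    stripping whitespace, which is an assumption of this sketch.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    preds = ["A", "c", "E", "B"]
    keys = ["A", "C", "D", "B"]
    print(mcq_accuracy(preds, keys))  # 0.75 (three of four match)
    print(RANDOM_CHANCE)              # 0.2
```

With five options, a uniform random guesser is expected to score 20%, which is the baseline the reported model accuracies are compared against.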

2.1 Task Suite

Creating a comprehensive benchmark for long-form video-language understanding is challenging, primarily because formulating meaningful questions that require processing and synthesizing information across various temporal segments is highly nontrivial, even for expert human annotators. Moreover, we note that even benchmarks for image or short video clip understanding are difficult to construct. As a result, we typically observe two common strategies for benchmark creation: (1) pre-defined label spaces testing for a specific skill or within narrow domains (e.g., Kinetics [9] and Something-Something [15]); or (2) gluing together different datasets, each designed to test a specific model capability [16-19]. In contrast, a single benchmark that can comprehensively test a suite of model capabilities can significantly benefit the research community.

We draw inspiration from both lines of research methodologies and introduce a novel suite of tasks designed to benchmark long-form video-language understanding capabilities for one-hour-long videos. Our task suite encompasses a comprehensive set of perceptual and cognitive tasks, including summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. Our strategy draws inspiration from the two common approaches previously discussed: (1) designing narrowly focused question prototypes to significantly streamline the question-answer creation process, and (2) creating a diverse suite of tasks that holistically evaluate a broad spectrum of multimodal capabilities. Our task suite with manually designed question prototypes is shown in Table 1.

In particular, there are 18 sub-tasks in our proposed task suite, and example MCQs from HourVideo are shown in Fig. 1.

Summarization (Temporal Sequencing)
Describe the sequence of activities the camera wearer performed related to preparation and cooking of food.
A) The camera wearer takes out the ingredients, peels, cuts, and cooks the potatoes, continues to mash them in the pot.
B) Peeled, chopped, and cooked potatoes, interacted with individuals, adjusted cooking settings, and set the dining table.
C) Takes out the ingredients, peels, cuts, and cooks the potatoes, cools the potatoes with cold water, continues to mash them in the pot, and adjusts the cooker setting.
D) The camera wearer sliced, diced, and boiled potatoes, interacted with individuals, and modified cooking times.
E) The camera wearer peeled, chopped, and sautéed vegetables, interacted with individuals, and adjusted cooking settings, demonstrating a methodical approach to meal preparation.

Perception (Information Retrieval / Factual Recall)
List the locations the camera wearer visited.
A) Kitchen, BBQ Area, Storage Room, Garage, Room, Pavements
B) Kitchen, Garden, Storage Room, Garage, Room, Pavements
C) Kitchen, Bathroom, Storage Room, Garage, Room, Pavements
D) Kitchen, Room, Balcony, Storage Room, Garage, Living Room
E) Kitchen, Balcony, Storage Room, Garage, Room, Pavements

Perception (Tracking)
Identify the unique individuals the camera wearer interacted with.
A) 2 Adults B) 1 Adult C) 4 Adults D) 5 Adults E) 3 Adults

Visual Reasoning (Spatial)
Select the correct statement regarding the spatial proximity of objects in the video.
A) The camera wearer's seat is equidistant from both the driver's seat and the bus door on the bus.
B) The cashier is closer to the dining table where the camera wearer eats pizza than the trash bin.
C) The driver's seat is positioned directly across from the camera wearer's seat, while the bus door is behind the camera wearer.
D) The weighing station is adjacent to the entrance, with the banana section at the far end.
E) The entrance is nearer to the weighing station than the banana section at the store.

Visual Reasoning (Temporal)
Select the correct statement regarding frequencies of different tool usage in the video.
A) During the construction activities, the miter saw was used more frequently compared to the cordless drill.
B) During the woodworking tasks, the tape measure was used more often compared to the ruler.
C) During the deck construction, the hammer was used more frequently compared to the impact driver.
D) During woodworking activity, the tape measure was used more frequently compared to the circular saw.
E) During the woodworking activity, the miter saw was used more frequently compared to the tape measure.

Visual Reasoning (Predictive)
After shopping and interacting with the Cashier at the checkout, what will the camera wearer do next?
A) Looks at their phone while pushing a trolley, exits the store, hands over the trolley to a woman, then cycle back home.
B) Looks at their phone while pushing a trolley, exits the store, hands over the trolley to Woman, buys a drink at the exit, then run towards the bus stand.
C) Looks at their phone while pushing a trolley, exits the store, hands over the trolley to a man, then walk towards the bus stop.
D) Looks at their phone while pushing a trolley, exits the store, hands over the trolley to Woman, then run towards the bus stand.
E) Looks at their phone while pushing a trolley, exits the store, buys a drink at the exit, then run towards the bus stand.

Visual Reasoning (Causal)
Why did the child move the step stool near the kitchen countertop?
A) The child moved the step stool near the kitchen countertop to reach it and help with preparing dough and beating eggs.
B) The child moved the step stool near the kitchen countertop to access it and retrieve the cookie jar.
C) The child moved the step stool near the kitchen countertop to reach the sink over it and wash her hands.
D) The child moved the step stool near the kitchen countertop to reach it and help with preparing dough.
E) The child moved the step stool near the kitchen countertop to access the top drawer above it and find the measuring spoons.

Visual Reasoning (Counterfactual)
What if the camera wearer used the oven to make mashed potatoes?
A) Overall cooking time would have increased as the oven was also used by the camera wearer to bake cookies.
B) Overall cooking time would have increased as the oven would consume more time compared to using induction stove.
C) Overall cooking time would have increased as the oven was also used by the man to bake cookies.
D) Overall cooking time would have increased as the oven would consume more time compared to using gas cooker.
E) Overall cooking time would have increased as the oven would consume more time compared to using microwave.

Navigation (Room-to-Room)
How can the camera wearer get to the backyard from the kitchen?

Navigation (Object Retrieval)
How can the camera wearer retrieve the motorcycle from the kitchen?
A) Exit the kitchen towards the stairs and exit through the door. The motorbike is outside.
B) Exit the kitchen through the door into the backyard, and the motorbike is on the right.
C) Exit the kitchen and turn left. Walk through the living room and go through the door into the backyard. The motorbike is on the right.
D) Exit the kitchen to the living room and turn left. Go through the door to the backyard; the motorbike is on the right.
E) Exit the kitchen and turn left. Walk down the hallway and turn right before the stairs. Exit the door and the motorbike is outside.

Figure 1: Example MCQs from HourVideo for different tasks. The correct answers are underlined in the original figure.

Summarization
  Key Events/Objects: Summarize the key interactions of the camera wearer in the [supermarket].
  Temporal Sequencing: Describe the sequence of activities performed by the camera wearer to [prepare the dessert].
  Compare/Contrast: How did the camera wearer's activities in the [apartment] differ from those in the [restaurant]?

Perception
  Information Retrieval
    • Factual Recall: What [dairy products] did the camera wearer [pick up] in the [supermarket]?
    • Sequence Recall: What did the camera wearer do immediately after [weighing tomatoes] at the [supermarket]?
    • Temporal Distance: How long after starting to [eat pizza] did the camera wearer [dispose of the pizza box]?
  Tracking: List the unique [individuals] the camera wearer interacted with at the [drugstore].

Visual Reasoning
  Spatial
    • Relationship: Where was the [microwave] placed in relation to the [stove] in the [kitchen]?
    • Proximity: Is the [microwave] closer to the [fridge] compared to the [sink]?
    • Layout: Which is the correct [IMAGE] depicting the layout of the camera wearer's [apartment]?
  Temporal
    • Duration: Which activity did the camera wearer spend more time on: [cooking] or [playing the piano]?
    • Frequency: Did the camera wearer use the [circular saw] or [crosscut saw] more frequently to [cut wood]?
    • Pre-requisites: What preparation steps did the camera wearer take before [baking cookies]?
  Predictive: What is the most likely activity the camera wearer will do next after [doing laundry]?
  Causal: Why did the camera wearer [leave the garage for the second time]?
  Counterfactual: What if the camera wearer used the [oven] to [cook mashed potatoes]?

Navigation
  Room-to-Room: How did the camera wearer get from the [building entrance] to the [apartment]?
  Object Retrieval: How can the camera wearer retrieve the [TV remote] if they are in the [kitchen]?

Table 1: Our proposed task suite with question prototypes. This table shows all 4 tasks and 18 sub-tasks proposed in HourVideo, along with the corresponding handcrafted question prototypes designed to evaluate long-form video-language understanding capabilities.

2.2 Dataset Generation Pipeline

In this section, we provide an overview of the question-answer creation pipeline that we developed to create HourVideo. The pipeline is summarized in Fig. 2.

Video curation, Stage 1. A crucial design consideration for this benchmark is the selection of video sources and types. We chose the Ego4D [13] dataset for our videos for multiple reasons: (1) its egocentric perspective aligns well with the typical visual input for autonomous agents and assistants; (2) it features extensive visual narrations, which aid in creating diverse multiple-choice questions; and (3) it is readily accessible under the Ego4D license. We manually reviewed 1,470 videos, ranging from 20 to 120 minutes, from the Ego4D dataset, assessing their potential to generate relevant questions for various tasks in our task suite. We engaged five human experts for video curation. Following this process, we curated 500 egocentric videos.

[Figure 2 diagram. Stage 1: Video Curation (>100 hours). Stage 2: MCQ Generation (LLM), producing MCQ2. Stage 3: MCQ Refinement using Human Feedback (LLM + >400 hours), MCQ2 to MCQ3. Stage 4: Blind Filtering (LLM), MCQ3 to MCQ4. Stage 5: Expert MCQ Refinement (>300 hours), MCQ4 to MCQ5.]

Figure 2: Our dataset generation pipeline. We develop a dataset generation pipeline consisting of five stages to create HourVideo. We leverage over 800 hours of human effort in total, corresponding to the Video curation (Stage 1), MCQ Refinement using Human Feedback (Stage 3) and Expert MCQ Refinement (Stage 5) stages. We use LLMs for MCQ Generation (Stage 2), MCQ Refinement using Human Feedback (Stage 3) and Blind Filtering (Stage 4). We note that causal, counterfactual and navigation questions are manually generated by human experts (see Sec. 2.2 for details).

Candidate MCQ Generation, Stage 2. The objective of this stage is to produce high-quality MCQs for each task, requiring analysis and synthesis of information across multiple temporal segments in a long-form video. Initially, we manually develop question template(s) for each task in the suite. As shown in Table 1, transforming a question template into an actual question involves incorporating video-specific information tailored to the task and template. To facilitate this, we utilize the detailed narrations from the Ego4D dataset, transforming them into a structured format that can be processed by an LLM. Specifically, we segment the video at 20-minute intervals, with each segment's representation including a summary and a list of tools, food items, technology, humans, pets, and physical locations encountered by the camera wearer in the video. We note that synthesizing a structured representation and a question template into a valid question with correct and incorrect answers presents a significant challenge, even for advanced LLMs. Consequently, for each task, we formulate detailed prompts that offer question prototypes, comprehensive instructions, in-context examples, and step-by-step guidance on how to transform a question template into a valid candidate MCQ2. In total, we developed 25 task-specific prompts.
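The 20-minute segmentation and per-segment structured representation can be sketched as follows. This is an illustration under stated assumptions, not the authors' data format: the class `SegmentRepresentation`, its field names, and the choice to keep a shorter final segment are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

SEGMENT_SECONDS = 20 * 60  # 20-minute intervals, as described in the text


@dataclass
class SegmentRepresentation:
    """Hypothetical container mirroring the categories listed in the text."""
    start_s: float
    end_s: float
    summary: str = ""
    tools: List[str] = field(default_factory=list)
    food_items: List[str] = field(default_factory=list)
    technology: List[str] = field(default_factory=list)
    humans: List[str] = field(default_factory=list)
    pets: List[str] = field(default_factory=list)
    locations: List[str] = field(default_factory=list)


def segment_bounds(duration_s: float,
                   step_s: float = SEGMENT_SECONDS) -> List[Tuple[float, float]]:
    """Split [0, duration_s) into consecutive windows of at most step_s seconds.

    Assumption of this sketch: a trailing remainder shorter than step_s is
    kept as its own, shorter segment.
    """
    duration_s = float(duration_s)
    if duration_s <= 0 or step_s <= 0:
        raise ValueError("duration_s and step_s must be positive")
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + step_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


def empty_segments(duration_s: float) -> List[SegmentRepresentation]:
    """Create one unfilled representation per segment of the video."""
    return [SegmentRepresentation(s, e) for s, e in segment_bounds(duration_s)]


if __name__ == "__main__":
    # A 45-minute video yields two full segments and one 5-minute remainder.
    print(segment_bounds(45 * 60))
```

In such a setup, each representation would be filled from the narrations and then serialized into the task-specific prompt together with the question template.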

MCQ Refinement with LLMs using Human Feedback, Stage 3. The purpose of this phase is to refine MCQ2, created in the previous stage. MCQ2 may contain invalid questions, incorrect answers, trivial incorrect options, and various other issues. We identified that a significant source of these issues stemmed from relying on the noisy narrations in Ego4D. For example, different narrators within the same video could refer to a dishwasher as a "plate rack" or use other terms, and an individual might be described as an "adult," "person with a red and white shirt," "man Y," or "teenager" at various times in the narration. These inconsistencies, combined with our automatic question generation in the previous stage, could lead to generation of invalid MCQs. To address noisy MCQs, we implement a human feedback system where trained annotators are tasked with: 1) assessing the validity of each question to ensure it aligns with the video content, 2) verifying the accuracy of the given answer (if it is found incorrect, they provide the correct answer in free-form text), and 3) ensuring that all incorrect options are factually wrong and clearly distinguishable from the correct answer. We gather human feedback for all MCQ2, involving over 400 hours of human effort. We then design prompts to automatically refine MCQ2 using this human feedback to produce MCQ3. We engaged seven trained annotators in this stage.
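One way to picture the three annotator checks is as a small feedback record that is turned into notes for a refinement prompt. This is a hypothetical sketch: the `HumanFeedback` class, its fields, and the wording of the notes are assumptions, and the actual annotation schema and prompts are not given in this section.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HumanFeedback:
    """Hypothetical record of the three annotator checks for one MCQ."""
    question_valid: bool                     # check 1: question matches the video
    answer_correct: bool                     # check 2: marked answer is right
    corrected_answer: Optional[str] = None   # free-form correction, if any
    unclear_options: List[str] = field(default_factory=list)  # check 3 failures


def feedback_notes(fb: HumanFeedback) -> List[str]:
    """Collect the issues an annotator raised, in a fixed order."""
    notes: List[str] = []
    if not fb.question_valid:
        notes.append("question does not match the video content")
    if not fb.answer_correct:
        if fb.corrected_answer:
            notes.append(f"correct answer: {fb.corrected_answer}")
        else:
            notes.append("marked answer is incorrect; no correction given")
    for letter in fb.unclear_options:
        notes.append(f"option {letter} is not clearly wrong")
    return notes


if __name__ == "__main__":
    fb = HumanFeedback(True, False, "dishwasher", ["C"])
    print(feedback_notes(fb))
    # ['correct answer: dishwasher', 'option C is not clearly wrong']
```

An empty list of notes would indicate that the annotator found nothing to change for that MCQ.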

Blind filtering, Stage 4. Modern LLMs possess extensive prior knowledge and can thus easily answer certain questions without needing to analyze the videos. The objective of this phase is to eliminate questions that can be answered through prior knowledge or can be trivially answered without requiring any information from the video. To address this, we do blind filtering of MCQ3, utilizing two separate blind LLMs (GPT-4-turbo and GPT-4). Specifically, we exclude any MCQ that is correctly answered by at least one LLM without video input. Although this method may aggressively remove MCQs, it ensures that the remaining MCQ4 are of high quality and specifically tailored to test long-form video-language understanding.
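The exclusion rule above (drop an MCQ if at least one blind model answers it correctly) can be sketched as below. The blind models are represented by plain Python callables; the stub functions `always_a` and `always_b` stand in for real LLM calls and are purely illustrative, as is the dictionary layout of an MCQ.

```python
from typing import Callable, Dict, Iterable, List

MCQ = Dict[str, str]                # hypothetical layout: {"id": ..., "answer": ...}
BlindAnswerer = Callable[[MCQ], str]  # returns an option letter, no video input


def blind_filter(mcqs: Iterable[MCQ],
                 blind_models: List[BlindAnswerer]) -> List[MCQ]:
    """Keep only MCQs that no blind model answers correctly."""
    kept: List[MCQ] = []
    for mcq in mcqs:
        key = mcq["answer"].strip().upper()
        answered_blind = any(
            model(mcq).strip().upper() == key for model in blind_models
        )
        if not answered_blind:
            kept.append(mcq)
    return kept


def always_a(mcq: MCQ) -> str:
    """Stub standing in for a blind LLM; always picks option A."""
    return "A"


def always_b(mcq: MCQ) -> str:
    """Stub standing in for a second blind LLM; always picks option B."""
    return "B"


if __name__ == "__main__":
    candidates = [
        {"id": "q1", "answer": "A"},
        {"id": "q2", "answer": "C"},
        {"id": "q3", "answer": "b"},
    ]
    survivors = blind_filter(candidates, [always_a, always_b])
    print([m["id"] for m in survivors])  # ['q2']
```

Using "any" rather than "all" is what makes the filter aggressive: a single blind success is enough to remove a question.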

[Figure 3 panels]
(1) Top 20 scenarios (bar chart of counts): Cooking; Cleaning/laundry; Construction/renovation; Eating; Crafting/knitting; Carpenter; Talking with family members; Indoor Navigation (walking); Watching TV; On a screen (phone/laptop); Listening to music; Grocery shopping indoors; Playing with pets; Walking on street; Gardening; Baker; Doing yard work; Car-commuting, road trip; Bike mechanic; Working at desk.
(2) Number of MCQs per task/sub-task:
  Summarization (714): Key Events/Objects Identification 467; Temporal Sequencing 152; Compare/Contrast 95
  Perception (3777): Factual Recall 2479; Sequence Recall 854; Temporal Distance 267; Tracking 177
  Spatial (3173): Relationship 1889; Proximity 1239; Layout 45
  Temporal (4292): Duration 1945; Frequency 1815; Pre-requisites 532
  Predictive (407)
  Causal (150)
  Counterfactual (151)
  Navigation (312): Room-to-Room; Object Retrieval
(3) Histogram of video duration (in minutes), in 10-minute bins from 20-30 to 110-120.
(4) Histogram of the number of MCQs per video.

Figure 3: Dataset Statistics. (1): HourVideo includes 500 videos sourced from the Ego4D dataset, spanning 77 everyday scenarios. The bar chart shows the top 20 scenarios. (2): We report the number of MCQs per task/sub-task. In total, there are 12,976 questions in HourVideo. (3): We show the distribution of video duration in HourVideo. The average duration of videos in HourVideo is 45.7 minutes, with 113 videos extending beyond one hour. (4): We show the distribution of number of MCQs per video. On average, each video contains 26 MCQs.
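As a quick consistency check, the per-task counts shown in Figure 3 can be tallied against the stated total of 12,976 questions and the average of about 26 MCQs per video. The dictionary below simply transcribes the numbers from the figure; the variable names are illustrative.

```python
# Counts transcribed from Figure 3 (panel 2). Navigation, Predictive, Causal
# and Counterfactual are only reported at the task level in the figure.
SUBTASK_COUNTS = {
    "Summarization": {"Key Events/Objects Identification": 467,
                      "Temporal Sequencing": 152,
                      "Compare/Contrast": 95},
    "Perception": {"Factual Recall": 2479, "Sequence Recall": 854,
                   "Temporal Distance": 267, "Tracking": 177},
    "Spatial": {"Relationship": 1889, "Proximity": 1239, "Layout": 45},
    "Temporal": {"Duration": 1945, "Frequency": 1815, "Pre-requisites": 532},
}
TASK_LEVEL_COUNTS = {"Predictive": 407, "Causal": 150,
                     "Counterfactual": 151, "Navigation": 312}

NUM_VIDEOS = 500

# Sum sub-task counts per task, then add the task-level-only counts.
task_totals = {task: sum(subs.values()) for task, subs in SUBTASK_COUNTS.items()}
task_totals.update(TASK_LEVEL_COUNTS)

total_mcqs = sum(task_totals.values())
mcqs_per_video = total_mcqs / NUM_VIDEOS

if __name__ == "__main__":
    print(task_totals["Summarization"])  # 714
    print(total_mcqs)                    # 12976
    print(round(mcqs_per_video, 2))      # 25.95
```

The sub-task counts reproduce the bracketed task totals in the figure, and the overall sum matches the total given in the caption.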

Expert Refinement, Stage 5. The aim of this stage is to enhance the quality of MCQ4 by utilizing a selected group of expert human annotators. This stage serves as a comprehensive step to address various remaining issues that might have persisted through prior stages. Examples of expert refinement include transforming a broad question like "Where did the camera wearer leave the keys?" into a more precise query: "Where did the camera wearer leave the bike keys after returning home from shopping?" Over 300 hours of expert human effort are employed in this stage to carefully examine and refine MCQ4, culminating in a high-quality MCQ5. We engaged four human experts in this stage.

Manual Generation. Despite our extensive efforts to automate question generation fully or partially, we discovered that certain tasks did not align well with the pipeline we described earlier. Specifically, for causal, counterfactual, spatial layout and navigation tasks, we found it more effective to manually generate questions with human experts rather than processing them through our multi-stage pipeline. Consequently, for these tasks in our benchmark, we generated high-quality questions, albeit in a smaller quantity. Four human experts were engaged in this stage, generating a total of 658 MCQs (5.1%).
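The 658 figure appears consistent with the per-task counts reported in Figure 3 for the four manually generated tasks (causal, counterfactual, spatial layout, navigation), and the percentage follows from the dataset total of 12,976 questions:

```latex
\[
\underbrace{150}_{\text{causal}}
+ \underbrace{151}_{\text{counterfactual}}
+ \underbrace{45}_{\text{spatial layout}}
+ \underbrace{312}_{\text{navigation}}
= 658,
\qquad
\frac{658}{12{,}976} \approx 0.0507 \approx 5.1\%.
\]
```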

Implementation details. We used GPT-4 in our pipeline as it offers impressive capabilities to follow complex multi-step instructions. We used the Chain-of-Thought [20] prompting strategy and a temperature of 0.1 for all stages involving LLMs in our pipeline. We show an example MCQ life-cycle in Fig. B.2. See Supplementary B for more details on dataset generation. We include the exact prompts used for generating MCQ2 for the following tasks:
• Narration compilation (Fig. E.1),
• Summarization (Fig. E.2, E.3),
• Perception/Information Retrieval/Factual Recall (Fig. E.4, E.5),
• Visual Reasoning/Spatial/Relationship (Fig. E.6, E.7).

2.3 HourVideo Statistics

HourVideo consists of 500 videos from the Ego4D dataset, covering 77 daily life scenarios such as cooking, cleaning, eating, watching TV, baking, etc. (Fig. 3). The dataset includes 381 hours of video footage, with video durations ranging from 20 to 120 minutes (Fig. 3). On average, each video is approximately 45.7 minutes long,
