HourVideo: 1-Hour Video-Language Understanding
Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, Li Fei-Fei
arXiv:2411.04998v1 [cs.CV] 7 Nov 2024
Stanford University
Abstract
We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at
.
1 Introduction
Humans demonstrate a remarkable ability to process visual stimuli over long time horizons, enabling them to perceive, plan, and act in the real world. Consider the routine task of cooking a meal. This activity involves a continuous and adaptive visual process: identifying and using ingredients and tools, monitoring state changes of various dishes, and adjusting cooking duration/techniques based on visual cues such as color and texture. Such sustained visual processing is crucial to achieving the desired culinary outcomes. Naturally, endowing autonomous agents with this capability has been a long-standing goal in the field of Artificial Intelligence.
In recent years, large multimodal models [1–3] have emerged as a promising approach toward achieving this goal. Typically, these models are evaluated using multiple datasets that test capabilities such as object recognition [4, 5], image comprehension [6–8], and action recognition [9]. However, these benchmarks are often restricted to single images or short video clips, usually lasting from a few seconds to no more than three minutes [9–12]. While these benchmarks have spurred significant advancements, a deeper exploration into long-form video-language understanding is essential to develop multimodal systems that can form the basis for future autonomous agents and assistants.
A significant challenge in evaluating long-form video-language understanding capabilities is designing tasks that genuinely necessitate long-term comprehension, i.e., tasks that require long-range dependencies. Merely posing questions that can be answered by watching a brief segment of a lengthy video effectively reduces the task to a combination of temporal localization and short-clip understanding. Furthermore, while intriguing narrative inquiries can certainly be formulated for long-form videos such as television shows and films, it is imperative to ensure that the questions are not trivially answerable due to the vast prior knowledge encoded in modern large language models.

In this work, we introduce HourVideo, a benchmark dataset designed for long-form video-language understanding. To design tasks that require long-term comprehension, we first propose a novel task suite (Tab. 1), comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. For each task, we manually create question prototypes designed to ensure that correctly answering them requires identification and synthesis of information across multiple temporal segments within the long-form videos. Guided by our task suite, we curated 500 egocentric videos from the Ego4D dataset [13], covering 77 unique everyday activities and ranging from 20 to 120 minutes, to generate questions based on our prototypes. The combination of our comprehensive task suite and everyday mundane egocentric videos provides a robust framework to rigorously evaluate multimodal models' capabilities in understanding long-form videos. Finally, we developed a question-answer generation pipeline utilizing the expertise of trained human annotators (800+ hours of effort) and large language models (LLMs), resulting in a collection of 12,976 high-quality, five-way multiple-choice questions.

We comprehensively evaluate state-of-the-art multimodal models on HourVideo (Tab. 2, Fig. 4), including GPT-4V [2], Gemini 1.5 Pro [3], and LLaVA-NeXT [14] in a zero-shot setting. Our findings reveal that GPT-4V and LLaVA-NeXT achieve only marginal improvements over a random predictor (20%), obtaining accuracies of 25.7% and 22.3%, respectively. Gemini 1.5 Pro, designed specifically for long-context multimodal understanding, obtains an accuracy of 37.3%, which, while better, is still substantially lower than the average performance of human experts at 85.0%. These results suggest that while the multimodal community has made meaningful progress, a significant gap remains to be bridged before these systems can match human-level long-form video understanding capabilities. Progress in long-form video understanding could enable new applications including AR assistants, embodied agents, and interactive video platforms. We hope that HourVideo will serve as a benchmark to facilitate research in this direction and enable the development of multimodal models that can understand endless streams of visual data.

Correspondence to {keshik,agrim}@stanford.edu
38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks.
2 Benchmark Design and Construction
While open-ended question answering closely emulates human interaction, automating the evaluation of free-form natural language responses remains challenging. Given that our primary goal is to assess long-form video-language understanding capabilities, we opt for a five-way multiple-choice question-answering (MCQ) task. This approach simplifies the evaluation process by allowing us to calculate an aggregate question-answering accuracy metric. In the following sections, we describe our task suite and question-answer generation pipeline in detail, both of which are designed to curate diverse, high-quality five-way multiple-choice questions (MCQs).
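To make the aggregate metric concrete, the following is a minimal sketch of how accuracy over five-way MCQs could be computed and compared against a uniform random predictor. The function names and the use of option letters as labels are assumptions made for illustration; this is not the released evaluation toolkit.

```python
from typing import Sequence

NUM_OPTIONS = 5  # five-way multiple choice, as described in the text


def mcq_accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of questions whose predicted option letter matches the answer key."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


def random_chance(num_options: int = NUM_OPTIONS) -> float:
    """Expected accuracy of a predictor that picks an option uniformly at random."""
    return 1.0 / num_options
```

With five options, `random_chance()` evaluates to 0.2, which is the 20% random-predictor baseline quoted in the introduction.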
2.1 Task Suite
Creating a comprehensive benchmark for long-form video-language understanding is challenging, primarily because formulating meaningful questions that require processing and synthesizing information across various temporal segments is highly nontrivial, even for expert human annotators. Moreover, we note that even benchmarks for image or short video clip understanding are difficult to construct. As a result, we typically observe two common strategies for benchmark creation: (1) pre-defined label spaces testing for a specific skill or within narrow domains (e.g., Kinetics [9] and Something-Something [15]); or (2) gluing together different datasets, each designed to test a specific model capability [16–19]. In contrast, a single benchmark that can comprehensively test a suite of model capabilities can significantly benefit the research community.
We draw inspiration from both lines of research and introduce a novel suite of tasks designed to benchmark long-form video-language understanding capabilities for one-hour-long videos. Our task suite encompasses a comprehensive set of perceptual and cognitive tasks, including summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. Our strategy adapts the two common approaches discussed above: (1) designing narrowly focused question prototypes to significantly streamline the question-answer creation process, and (2) creating a diverse suite of tasks that holistically evaluate a broad spectrum of multimodal capabilities. Our task suite with manually designed question prototypes is shown in Table 1. In particular, there are 18 sub-tasks in our proposed task suite, and example MCQs from HourVideo are shown in Fig. 1.
Summarization (Temporal Sequencing)
Describe the sequence of activities the camera wearer performed related to preparation and cooking of food.
A) The camera wearer takes out the ingredients, peels, cuts, and cooks the potatoes, continues to mash them in the pot.
B) Peeled, chopped, and cooked potatoes, interacted with individuals, adjusted cooking settings, and set the dining table.
C) Takes out the ingredients, peels, cuts, and cooks the potatoes, cools the potatoes with cold water, continues to mash them in the pot, and adjusts the cooker setting.
D) The camera wearer sliced, diced, and boiled potatoes, interacted with individuals, and modified cooking times.
E) The camera wearer peeled, chopped, and sautéed vegetables, interacted with individuals, and adjusted cooking settings, demonstrating a methodical approach to meal preparation.

Visual Reasoning (Spatial)
Select the correct statement regarding the spatial proximity of objects in the video.
A) The camera wearer's seat is equidistant from both the driver's seat and the bus door on the bus.
B) The cashier is closer to the dining table where the camera wearer eats pizza than the trash bin.
C) The driver's seat is positioned directly across from the camera wearer's seat, while the bus door is behind the camera wearer.
D) The weighing station is adjacent to the entrance, with the banana section at the far end.
E) The entrance is nearer to the weighing station than the banana section at the store.

Visual Reasoning (Predictive)
After shopping and interacting with the Cashier at the checkout, what will the camera wearer do next?
A) Looks at their phone while pushing a trolley, exits the store, hands over the trolley to a woman, then cycle back home.
B) Looks at their phone while pushing a trolley, exits the store, hands over the trolley to Woman, buys a drink at the exit, then run towards the bus stand.
C) Looks at their phone while pushing a trolley, exits the store, hands over the trolley to a man, then walk towards the bus stop.
D) Looks at their phone while pushing a trolley, exits the store, hands over the trolley to Woman, then run towards the bus stand.
E) Looks at their phone while pushing a trolley, exits the store, buys a drink at the exit, then run towards the bus stand.

Visual Reasoning (Temporal)
Select the correct statement regarding frequencies of different tool usage in the video.
A) During the construction activities, the miter saw was used more frequently compared to the cordless drill.
B) During the woodworking tasks, the tape measure was used more often compared to the ruler.
C) During the deck construction, the hammer was used more frequently compared to the impact driver.
D) During woodworking activity, the tape measure was used more frequently compared to the circular saw.
E) During the woodworking activity, the miter saw was used more frequently compared to the tape measure.

Visual Reasoning (Causal)
Why did the child move the step stool near the kitchen countertop?
A) The child moved the step stool near the kitchen countertop to reach it and help with preparing dough and beating eggs.
B) The child moved the step stool near the kitchen countertop to access it and retrieve the cookie jar.
C) The child moved the step stool near the kitchen countertop to reach the sink over it and wash her hands.
D) The child moved the step stool near the kitchen countertop to reach it and help with preparing dough.
E) The child moved the step stool near the kitchen countertop to access the top drawer above it and find the measuring spoons.

Visual Reasoning (Counterfactual)
What if the camera wearer used the oven to make mashed potatoes?
A) Overall cooking time would have increased as the oven was also used by the camera wearer to bake cookies.
B) Overall cooking time would have increased as the oven would consume more time compared to using induction stove.
C) Overall cooking time would have increased as the oven was also used by the man to bake cookies.
D) Overall cooking time would have increased as the oven would consume more time compared to using gas cooker.
E) Overall cooking time would have increased as the oven would consume more time compared to using microwave.

Perception (Information Retrieval/Factual Recall; Tracking)
List the locations the camera wearer visited.
A) Kitchen, BBQ Area, Storage Room, Garage, Room, Pavements
B) Kitchen, Garden, Storage Room, Garage, Room, Pavements
C) Kitchen, Bathroom, Storage Room, Garage, Room, Pavements
D) Kitchen, Room, Balcony, Storage Room, Garage, Living Room
E) Kitchen, Balcony, Storage Room, Garage, Room, Pavements

Identify the unique individuals the camera wearer interacted with.
A) 2 Adults
B) 1 Adult
C) 4 Adults
D) 5 Adults
E) 3 Adults

Navigation (Room-to-Room)
How can the camera wearer get to the backyard from the kitchen?
(The answer options are given as images in the original figure.)

Navigation (Object Retrieval)
How can the camera wearer retrieve the motorcycle from the kitchen?
A) Exit the kitchen towards the stairs and exit through the door. The motorbike is outside.
B) Exit the kitchen through the door into the backyard, and the motorbike is on the right.
C) Exit the kitchen and turn left. Walk through the living room and go through the door into the backyard. The motorbike is on the right.
D) Exit the kitchen to the living room and turn left. Go through the door to the backyard; the motorbike is on the right.
E) Exit the kitchen and turn left. Walk down the hallway and turn right before the stairs. Exit the door and the motorbike is outside.

Figure 1: Example MCQs from HourVideo for different tasks. The correct answers are underlined in the original figure.
Summarization
  Key Events/Objects: Summarize the key interactions of the camera wearer in the [supermarket].
  Temporal Sequencing: Describe the sequence of activities performed by the camera wearer to [prepare the dessert].
  Compare/Contrast: How did the camera wearer's activities in the [apartment] differ from those in the [restaurant]?

Perception
  Information Retrieval
    • Factual Recall: What [dairy products] did the camera wearer [pick up] in the [supermarket]?
    • Sequence Recall: What did the camera wearer do immediately after [weighing tomatoes] at the [supermarket]?
    • Temporal Distance: How long after starting to [eat pizza] did the camera wearer [dispose of the pizza box]?
  Tracking: List the unique [individuals] the camera wearer interacted with at the [drugstore].

Visual Reasoning
  Spatial
    • Relationship: Where was the [microwave] placed in relation to the [stove] in the [kitchen]?
    • Proximity: Is the [microwave] closer to the [fridge] compared to the [sink]?
    • Layout: Which is the correct [IMAGE] depicting the layout of the camera wearer's [apartment]?
  Temporal
    • Duration: Which activity did the camera wearer spend more time on: [cooking] or [playing the piano]?
    • Frequency: Did the camera wearer use the [circular saw] or [crosscut saw] more frequently to [cut wood]?
    • Pre-requisites: What preparation steps did the camera wearer take before [baking cookies]?
  Predictive: What is the most likely activity the camera wearer will do next after [doing laundry]?
  Causal: Why did the camera wearer [leave the garage for the second time]?
  Counterfactual: What if the camera wearer used the [oven] to [cook mashed potatoes]?

Navigation
  Room-to-Room: How did the camera wearer get from the [building entrance] to the [apartment]?
  Object Retrieval: How can the camera wearer retrieve the [TV remote] if they are in the [kitchen]?

Table 1: Our proposed task suite with question prototypes. This table shows all 4 tasks and 18 sub-tasks proposed in HourVideo, along with the corresponding handcrafted question prototypes designed to evaluate long-form video-language understanding capabilities.
2.2 Dataset Generation Pipeline
In this section, we provide an overview of the question-answer creation pipeline that we developed to create HourVideo. The pipeline is summarized in Fig. 2.
Video curation, Stage 1. A crucial design consideration for this benchmark is the selection of video sources and types. We chose the Ego4D [13] dataset for our videos for multiple reasons: (1) its egocentric perspective aligns well with the typical visual input for autonomous agents and assistants; (2) it features extensive visual narrations, which aid in creating diverse multiple-choice questions; and (3) it is readily accessible under the Ego4D license. We manually reviewed 1,470 videos, ranging from 20 to 120 minutes, from the Ego4D dataset, assessing their potential to generate relevant questions for various tasks in our task suite. We engaged five human experts for video curation. Following this process, we curated 500 egocentric videos.
Figure 2: Our dataset generation pipeline. We develop a dataset generation pipeline consisting of five stages to create HourVideo. We leverage over 800 hours of human effort in total, corresponding to the Video Curation (Stage 1, >100 hours), MCQ Refinement using Human Feedback (Stage 3, >400 hours), and Expert MCQ Refinement (Stage 5, >300 hours) stages. We use LLMs for MCQ Generation (Stage 2), MCQ Refinement using Human Feedback (Stage 3), and Blind Filtering (Stage 4). We note that causal, counterfactual, and navigation questions are manually generated by human experts (see Sec. 2.2 for details).
Candidate MCQ Generation, Stage 2. The objective of this stage is to produce high-quality MCQs for each task, requiring analysis and synthesis of information across multiple temporal segments in a long-form video. Initially, we manually develop question template(s) for each task in the suite. As shown in Table 1, transforming a question template into an actual question involves incorporating video-specific information tailored to the task and template. To facilitate this, we utilize the detailed narrations from the Ego4D dataset, transforming them into a structured format that can be processed by an LLM. Specifically, we segment the video at 20-minute intervals, with each segment's representation including a summary and a list of tools, food items, technology, humans, pets, and physical locations encountered by the camera wearer in the video. We note that synthesizing a structured representation and a question template into a valid question with correct and incorrect answers presents a significant challenge, even for advanced LLMs. Consequently, for each task, we formulate detailed prompts that offer question prototypes, comprehensive instructions, in-context examples, and step-by-step guidance on how to transform a question template into a valid candidate MCQ2. In total, we developed 25 task-specific prompts.
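As an illustration of the structured format described above, the sketch below splits a video into 20-minute windows and attaches an empty per-segment record with the listed fields. The class name, field names, and helper functions are hypothetical; the exact schema used by the pipeline is not given in this section.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

SEGMENT_MINUTES = 20  # segmentation interval stated in the text


@dataclass
class SegmentRepresentation:
    """Hypothetical container for one segment's structured narration summary."""
    start_min: float
    end_min: float
    summary: str = ""
    tools: List[str] = field(default_factory=list)
    food_items: List[str] = field(default_factory=list)
    technology: List[str] = field(default_factory=list)
    humans: List[str] = field(default_factory=list)
    pets: List[str] = field(default_factory=list)
    locations: List[str] = field(default_factory=list)


def segment_bounds(duration_min: float,
                   step: float = SEGMENT_MINUTES) -> List[Tuple[float, float]]:
    """Split [0, duration_min] into consecutive windows of at most `step` minutes."""
    if duration_min <= 0 or step <= 0:
        raise ValueError("duration and step must be positive")
    bounds = []
    start = 0.0
    while start < duration_min:
        end = min(start + step, duration_min)
        bounds.append((start, end))
        start = end
    return bounds


def empty_segments(duration_min: float) -> List[SegmentRepresentation]:
    """One unfilled record per window; fields would be filled from the narrations."""
    return [SegmentRepresentation(s, e) for s, e in segment_bounds(duration_min)]
```

Under these assumptions, a 120-minute video yields six segment records, and a video of average length (45.7 minutes) yields three, the last one shorter than 20 minutes.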
MCQ Refinement with LLMs using Human Feedback, Stage 3. The purpose of this phase is to refine MCQ2, created in the previous stage. MCQ2 may contain invalid questions, incorrect answers, trivial incorrect options, and various other issues. We identified that a significant source of these issues stemmed from relying on the noisy narrations in Ego4D. For example, different narrators within the same video could refer to a dishwasher as a "plate rack" or use other terms, and an individual might be described as an "adult," "person with a red and white shirt," "man Y," or "teenager" at various times in the narration. These inconsistencies, combined with our automatic question generation in the previous stage, could lead to generation of invalid MCQs. To address noisy MCQs, we implement a human feedback system where trained annotators are tasked with: (1) assessing the validity of each question to ensure it aligns with the video content; (2) verifying the accuracy of the given answer and, if it is found incorrect, providing the correct answer in free-form text; and (3) ensuring that all incorrect options are factually wrong and clearly distinguishable from the correct answer. We gather human feedback for all MCQ2, involving over 400 hours of human effort. We then design prompts to automatically refine MCQ2 using this human feedback to produce MCQ3. We engaged seven trained annotators in this stage.
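The three annotator checks can be pictured as a small feedback record attached to each candidate MCQ. The sketch below is only one possible encoding: the class name, field names, and the `needs_refinement` helper are assumptions for illustration, and the actual annotation format is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatorFeedback:
    """Hypothetical record of one annotator's review of one candidate MCQ."""
    question_valid: bool                    # check 1: question matches the video
    answer_correct: bool                    # check 2: the given answer is accurate
    corrected_answer: Optional[str] = None  # free-form text when the answer is wrong
    distractors_ok: bool = True             # check 3: wrong options are wrong and distinct

    def __post_init__(self) -> None:
        # A valid question with a wrong answer must come with a correction.
        if self.question_valid and not self.answer_correct and not self.corrected_answer:
            raise ValueError("an incorrect answer needs a free-form correction")


def needs_refinement(fb: AnnotatorFeedback) -> bool:
    """True when any of the three checks failed."""
    return not (fb.question_valid and fb.answer_correct and fb.distractors_ok)
```

A record for which `needs_refinement` returns True would be the kind of input passed, together with the original MCQ, to the LLM refinement prompts.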
Blind filtering, Stage 4. Modern LLMs possess extensive prior knowledge and can thus easily answer certain questions without needing to analyze the videos. The objective of this phase is to eliminate questions that can be answered through prior knowledge or can be trivially answered without requiring any information from the video. To address this, we do blind filtering of MCQ3, utilizing two separate blind LLMs (GPT-4-turbo and GPT-4). Specifically, we exclude any MCQ that is correctly answered by at least one LLM without video input. Although this method may aggressively remove MCQs, it ensures that the remaining MCQ4 are of high quality and specifically tailored to test long-form video-language understanding.
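The exclusion rule above (drop an MCQ if at least one blind model answers it correctly) can be sketched as follows. The dictionary keys, the `BlindAnswerer` signature, and the toy answerers are hypothetical stand-ins; in the pipeline the answerers would be the two blind LLMs queried without video input.

```python
from typing import Callable, Dict, Iterable, List

# A blind answerer sees only the question text and options, never the video.
BlindAnswerer = Callable[[str, List[str]], str]


def blind_filter(mcqs: Iterable[Dict], answerers: List[BlindAnswerer]) -> List[Dict]:
    """Keep only the MCQs that no blind answerer gets right."""
    kept = []
    for mcq in mcqs:
        answered_blind = any(
            answer(mcq["question"], mcq["options"]) == mcq["answer"]
            for answer in answerers
        )
        if not answered_blind:
            kept.append(mcq)
    return kept


# Toy stand-ins for the two blind models (assumption: they return an option letter).
def always_a(question: str, options: List[str]) -> str:
    return "A"


def always_b(question: str, options: List[str]) -> str:
    return "B"


OPTIONS = ["o1", "o2", "o3", "o4", "o5"]
candidates = [
    {"question": "q1", "options": OPTIONS, "answer": "A"},
    {"question": "q2", "options": OPTIONS, "answer": "B"},
    {"question": "q3", "options": OPTIONS, "answer": "D"},
]
surviving = blind_filter(candidates, [always_a, always_b])
```

In this toy run only `q3` survives, since each of the other two questions is answered correctly by one of the blind stand-ins. Using "at least one" rather than "all" makes the filter stricter, which matches the aggressive removal described in the text.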
Panel (1), top 20 scenarios: Cooking; Cleaning/laundry; Construction/renovation; Eating; Crafting/knitting; Carpenter; Talking with family members; Indoor Navigation (walking); Watching tv; On a screen (phone/laptop); Listening to music; Grocery shopping indoors; Playing with pets; Walking on street; Gardening; Baker; Doing yard work; Car - commuting, road trip; Bike mechanic; Working at desk.

Panel (2), number of MCQs per task/sub-task:
Summarization (714)
  Key Events/Objects Identification: 467
  Temporal Sequencing: 152
  Compare/Contrast: 95
Perception (3777)
  Factual Recall: 2479
  Sequence Recall: 854
  Temporal Distance: 267
  Tracking: 177
Spatial (3173)
  Relationship: 1889
  Proximity: 1239
  Layout: 45
Temporal (4292)
  Duration: 1945
  Frequency: 1815
  Pre-requisites: 532
Predictive (407)
Causal (150)
Counterfactual (151)
Navigation (312)
  Room-to-Room, Object Retrieval

Panel (3): histogram of video duration (in minutes, 10-minute bins from 20-30 to 110-120) against count.
Panel (4): histogram of #MCQs per video against count.

Figure 3: Dataset Statistics. (1) HourVideo includes 500 videos sourced from the Ego4D dataset, spanning 77 everyday scenarios. The bar chart shows the top 20 scenarios. (2) We report the number of MCQs per task/sub-task. In total, there are 12,976 questions in HourVideo. (3) We show the distribution of video duration in HourVideo. The average duration of videos in HourVideo is 45.7 minutes, with 113 videos extending beyond one hour. (4) We show the distribution of the number of MCQs per video. On average, each video contains 26 MCQs.
Expert Refinement, Stage 5. The aim of this stage is to enhance the quality of MCQ4 by utilizing a selected group of expert human annotators. This stage serves as a comprehensive step to address various remaining issues that might have persisted through prior stages. Examples of expert refinement include transforming a broad question like "Where did the camera wearer leave the keys?" into a more precise query: "Where did the camera wearer leave the bike keys after returning home from shopping?" Over 300 hours of expert human effort are employed in this stage to carefully examine and refine MCQ4, culminating in a high-quality MCQ5. We engaged four human experts in this stage.
Manual Generation. Despite our extensive efforts to automate question generation fully or partially, we discovered that certain tasks did not align well with the pipeline we described earlier. Specifically, for causal, counterfactual, spatial layout, and navigation tasks, we found it more effective to manually generate questions with human experts rather than processing them through our multi-stage pipeline. Consequently, for these tasks in our benchmark, we generated high-quality questions, albeit in a smaller quantity. Four human experts were engaged in this stage, generating a total of 658 MCQs (5.1%).
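The figures quoted above can be cross-checked against the per-task counts reported in Fig. 3. The short sketch below is an arithmetic check only; the dictionary layout and variable names are illustrative, while the numbers are those given in the figure.

```python
# Per-task MCQ counts as reported in Fig. 3 (panel 2).
TASK_COUNTS = {
    "Summarization": 714,
    "Perception": 3777,
    "Spatial": 3173,
    "Temporal": 4292,
    "Predictive": 407,
    "Causal": 150,
    "Counterfactual": 151,
    "Navigation": 312,
}
SPATIAL_LAYOUT = 45  # the Layout sub-task, which is part of the Spatial count

total = sum(TASK_COUNTS.values())

# Manually generated tasks: causal, counterfactual, spatial layout, navigation.
manual = (
    TASK_COUNTS["Causal"]
    + TASK_COUNTS["Counterfactual"]
    + TASK_COUNTS["Navigation"]
    + SPATIAL_LAYOUT
)
manual_share = round(100 * manual / total, 1)

print(total, manual, manual_share)  # prints: 12976 658 5.1
```

The per-task counts sum to the 12,976 questions stated for the benchmark, and the four manually generated task types account for the 658 MCQs (5.1%) quoted in the text.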
Implementation details. We used GPT-4 in our pipeline as it offers impressive capabilities to follow complex multi-step instructions. We used the Chain-of-Thought [20] prompting strategy and a temperature of 0.1 for all stages involving LLMs in our pipeline. We show an example MCQ life-cycle in Fig. B.2. See Supplementary B for more details on dataset generation. We include the exact prompts used for generating MCQ2 for the following tasks:
• Narration compilation (Fig. E.1),
• Summarization (Fig. E.2, E.3),
• Perception/Information Retrieval/Factual Recall (Fig. E.4, E.5),
• Visual Reasoning/Spatial/Relationship (Fig. E.6, E.7).
2.3 HourVideo Statistics
HourVideo consists of 500 videos from the Ego4D dataset, covering 77 daily life scenarios such as cooking, cleaning, eating, watching TV, baking, etc. (Fig. 3). The dataset includes 381 hours of video footage, with video durations ranging from 20 to 120 minutes (Fig. 3). On average, each video is approximately 45.7 minutes long,