Integrating senses:
How AI is learning to see, hear, and interact
Roland Memisevic
Senior Director of Engineering at Qualcomm AI Research
Joint work with Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Reza Pourreza, Sanjay Haresh, and others
September 24, 2024
Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
Agenda
• Key concept: streaming architecture
• Importance of datasets for end-to-end training
• Efficient human-AI interaction and video-based reasoning
• Improving streaming video LLMs using auxiliary tasks
• Q&A
Generative AI capabilities continue to increase

MODALITY AND USE CASE
• Voice UI: Voice is a natural and intuitive interface for conversation
• Large multimodal models: Utilizing more sensing input modalities to better understand the world
• Video & 3D: Generating content for a richer and more realistic experience
• Agents: Execute multi-step tasks with reasoning autonomously to achieve a goal

CAPABILITY AND KPI
• Longer context window: Allows in-depth conversations
• Personalization: Fine-tuned models customized to consumers, enterprises, or industries (e.g., LoRA)
• Higher resolution: Process higher fidelity images for better accuracy

LoRA: low-rank adaptation
Full-stack AI optimization for LVMs
• Runs completely on the device
• Significantly reduces runtime latency and power consumption
• Continuously improves the Qualcomm® AI Stack

Designing an efficient diffusion model through knowledge distillation for high accuracy
Knowledge distillation for pruning and removing of attention blocks, resulting in an accurate model with improved performance and power efficiency

Qualcomm® AI Engine Direct for improved performance and minimized memory spillage

AI acceleration on the Qualcomm® Hexagon™ NPU of the Snapdragon® 8 Gen 3 Mobile Processor

LVM: Language vision model
Hybrid AI
Distribute workloads among cloud and edge/devices to deliver more powerful, efficient, and highly optimized experiences
• Central cloud: Ease of development & deployment | Training | Very large models | Aggregation | Absolute performance
• Edge cloud (on-prem or nearby): Immediacy | Reliability | Personalization | Privacy | Security | Fine-tuning | Aggregation
• On device: Immediacy | Reliability | Personalization | Privacy | Security | Cost | Energy
To scale, the center of gravity of AI processing is moving to the edge
World’s first large multimodal model (LMM) on an Android phone
LLMs can now see
• 7+ billion parameter LMM, LLaVA, with text, speech, and image inputs
• Multi-turn intuitive conversations about an image at a responsive token rate
• Full-stack AI optimization to achieve high performance at low power
• Enhanced privacy, reliability, personalization, and cost with on-device processing
LLM: Large Language Model; LLaVA: Large Language and Vision Assistant
Goal: Training AI models to see and interact with humans
SMART HOME | MOBILE | ROBOTICS
Researching visually-grounded LLMs with the ability to reason and interact with the environment

Situated vision-language models
• Process a live video stream in real time and dynamically interact with users
• Determine what to say and when to say it
• Enable the path to humanoids

Open-ended, asynchronous interaction with situated agents is an open challenge
• Limited to turn-based interactions about offline documents or images
• Limited to capturing momentary snapshots of reality in a VQA-style dialogue

[Figure: Visually-grounded LLM. Components shown: Vision, Action recognition, Orchestrator, LLM, Front end, TTS]

What to Say and When to Say it: Video-Language Model and Benchmark for Situated Interactions (2024); OpenEQA: Embodied Question Answering in the Era of Foundation Models (2024); VQA: visual question answering
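Neural networks have replaced increasingly complex computational pipelines
2010 | 2012 | 2014
• SPEECH TO TEXT: Audio → Pipeline → Text, replaced by Audio → Neural network → Text
• OBJECT RECOGNITION: Pixels → Pipeline → Objects, replaced by Pixels → Neural network → Objects
• LANGUAGE TRANSLATION: English → Pipeline → French, replaced by English → Neural network → French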
End-to-end backprop for agents
INPUT STREAM → (AUTO-REGRESSIVE) NEURAL NETWORK → BEHAVIOR STREAM
Key concept: Multi-modal streaming architecture
INPUT STREAM → (AUTO-REGRESSIVE) NEURAL NETWORK → BEHAVIOR STREAM (TRAINED END-TO-END)
EXTERNAL INPUT (e.g., camera) → AUTO-REGRESSIVE LLM → LANGUAGE OR ACTIONS
• An auto-regressive language model is a useful component of a multi-modal agent because it is already able to perform a dialogue with a user
• Additionally, language makes it easy to encode surrogate tasks for a degree of “common sense” to emerge

End-to-end learning requires a multi-modal streaming architecture
EXTERNAL INPUT (e.g., camera) → AUTO-REGRESSIVE LLM → LANGUAGE OR ACTIONS
[Figure: CONTEXT WINDOW of interleaved entries: FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN]
• Visual foundation models that combine an image feature extractor with a language model backbone have become increasingly common
• There are multiple ways to combine visual information with language model tokens, e.g.:
  • Cross-attention (e.g., Flamingo)
  • Dedicated vision tokens (e.g., LLaVA)
  …good for applications like captioning and visual question answering
However, a live agent that can utilize a real-time camera feed requires a system that can continuously attend to visual input
• Challenges:
  • Freely interleaved vision frames and language tokens
  • Dependencies between vision frame rate and token rate
  • Training data, allowing a model to learn what to say and when
• Recent work: “VideoLLM-online: Online Video Large Language Model for Streaming Video”, Chen et al., 2024, and our work, which I will present in the next slides
“Flamingo: a Visual Language Model for Few-Shot Learning”, Alayrac et al., 2022; “Visual Instruction Tuning”, Liu et al., 2023
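The first challenge, freely interleaved vision frames and language tokens, can be made concrete with a small sketch. It assumes frames and coach comments both carry timestamps and merges them into one sequence of the kind an auto-regressive model would be trained on. The helper `interleave`, the 2 fps frame times, and the example comments are hypothetical illustrations, not taken from the talk.

```python
from typing import List, Tuple

def interleave(frame_times: List[float],
               comments: List[Tuple[float, List[str]]]) -> List[str]:
    """Merge frames and timestamped comment tokens into one stream.

    A comment is emitted after the current frame if its timestamp falls
    before the next frame, so text only follows frames already seen.
    Comments earlier than the first frame are attached to the first frame.
    """
    stream: List[str] = []
    pending = sorted(comments, key=lambda c: c[0])
    i = 0
    for k, _ in enumerate(frame_times):
        stream.append("FRAME")
        next_t = frame_times[k + 1] if k + 1 < len(frame_times) else float("inf")
        while i < len(pending) and pending[i][0] < next_t:
            stream.extend(pending[i][1])
            i += 1
    return stream

# Hypothetical data: four frames at 2 fps and two short comments.
frames = [0.0, 0.5, 1.0, 1.5]
comments = [(0.6, ["keep", "going"]), (1.5, ["nice"])]
sequence = interleave(frames, comments)
print(sequence)
# ['FRAME', 'FRAME', 'keep', 'going', 'FRAME', 'FRAME', 'nice']
```

The second challenge shows up here as well: a higher frame rate adds more FRAME entries between the same words, so the ratio of frames to tokens in the context window depends on both rates.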
Importance of datasets for end-to-end training
Datasets for end-to-end training of visual assistants
Key requirement for end-to-end training: aligned video feed (frames) + assistant’s comments (tokens)
• “HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World”, Wang et al., 2024: 1st person videos showing a variety of tasks (20 tasks across 16 objects)
• “Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?”, Bao et al., 2023: 1st person videos showing preparation of cupcakes
• “Live Fitness Coaching as a Testbed for Situated Interactions”, Panchal et al., 2024: 3rd person videos showing fitness exercises and their corrections
FIT-Coach benchmark and dataset
A novel interactive visual coaching benchmark and dataset as a test-bed for real-time, real-world situated interaction
Aimed at the development of interactive multi-modal vision-language models based in the controlled but challenging fitness coaching domain

Fitness questions dataset
• 148 exercises
• 300k short-clip videos
• 470+ hours
• 1900 unique participants
• 1.1M+ high-level question-answer pairs
• 400k+ fine-grained question-answer pairs

Fitness feedback dataset
• 9+ hours of fitness coaching sessions
• 148 exercise sessions
• ∼3.5 minute long sessions with 5 to 6 exercises
• 21 unique participants

Live Fitness Coaching as a Testbed for Situated Interaction, Panchal, Bhattacharyya, et al., 2024
Fitness assistant dataset and benchmark
Short video clips showing the user performing individual exercises, along with labels for performance and common mistakes (~300k clips of duration ~5-10 seconds each)
Long-range videos showing the user exercising, along with aligned comments by the coach (~200 sessions across 5-6 exercises each)

| | SHORT CLIPS (Train) | SHORT CLIPS (Test) | LONG-RANGE (Train) | LONG-RANGE (Test†) |
|---|---|---|---|---|
| Number of videos | 290,775 | 16,429 | 153 | 69 |
| Unique Participants | 1,800+ | 100 | 21 | 7 |
| Average Duration (s) | 5.6±1.1 | 5.6±1.2 | 213.4±3.1 | 213.7±3.3 |
| Exercises per Video | 1 | 1 | 5-6 | 5-6 |
| Total Number of Exercises | 148 | 148 | 23 | 23 |
| Total Classes | 1866 | 1690 | - | - |
| Fitness Questions | | | | |
| Total High-level Questions | 1,193,056 | 78,390 | - | - |
| Total Fine-grained Questions | 404,082 | 80,694 | - | - |
| Fitness Feedbacks | | | | |
| Average Feedbacks per Exercise | 2.0±10.1 | 2.4±6.9 | 5.0±1.3 | 5.0±1.2 |
| Average Silence Period (s)†† | n/a | n/a | 5.2±1.4 | 5.3±1.2 |
| Average Feedback Length (words) | 9.0±6.1 | 9.1±5.0 | 6.3±3.8 | 6.6±4.0 |
[Figure panels: Long fitness sessions dataset; Short fitness clips dataset]
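Our dataset meets all the needs of interactive AI assistants

| DATASET | DOMAIN | HUMAN ACTIONS | INTERACTIVE | MISTAKES | CORRECTIVE FEEDBACKS | DOMAIN EXPERTISE | LENGTH |
|---|---|---|---|---|---|---|---|
| Action Recognition Datasets | | | | | | | |
| NTU RGB+D | Fitness | √ | x | x | x | √ | |
| FineGym | Fitness | √ | x | x | x | √ | 708 |
| Procedural Activity Datasets | | | | | | | |
| YouCook2 | Cooking | x | x | x | x | x | 176 |
| Epic-Kitchens | Cooking | x | x | x | x | x | 100 |
| HowTo100M | Daily-life | √ | x | x | x | x | 134k |
| Ego-4D | Daily-life | x | x | x | x | x | 3670 |
| Ego-Exo4D | Daily-life | x | x | √ | x | x | 1422 |
| Assembly-101 | Toy assm. | x | x | √ | x | x | 513 |
| Interactive AI Assistant Datasets | | | | | | | |
| WTAG | Cooking | x | x | √ | √ | x | 10 |
| HoloAssist | Obj. manip. | x | x | √ | √ | x | 166 |
| QEVD (Ours) | Fitness | √ | √ | √ | √ | √ | 474 |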
Efficient human-AI interaction and video-based reasoning
Detailed architecture: Learning what to say and when to say it
[Figure: AUTO-REGRESSIVE LLM. EXTERNAL INPUT (e.g., camera) → 3D CNN → visual stream; PROMPT → LANGUAGE BACKBONE (stacked SELF-ATTN and CROSS-ATTN layers) → LANGUAGE OR ACTIONS, with the special tokens <next> and <feedback> in the output]
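One way to read the `<next>` and `<feedback>` tokens in the diagram is as action tokens that decide when the model speaks: `<next>` means stay silent and consume more video, `<feedback>` means start generating a comment. The loop below is a minimal sketch of that control flow under this assumption. The `toy_policy` function, its threshold, and the scalar "frames" are hypothetical stand-ins for the real model and video input.

```python
from typing import Callable, Iterable, List, Tuple

NEXT, FEEDBACK = "<next>", "<feedback>"

def stream_decode(frames: Iterable[float],
                  policy: Callable[[List[float]], Tuple[str, str]]
                  ) -> List[Tuple[int, str]]:
    """Run a policy once per incoming frame and collect spoken feedback.

    `policy` sees all frames so far and returns (action_token, text).
    Text is only emitted when the action token is <feedback>.
    """
    history: List[float] = []
    spoken: List[Tuple[int, str]] = []
    for step, frame in enumerate(frames):
        history.append(frame)
        action, text = policy(history)
        if action == FEEDBACK:
            spoken.append((step, text))
        # On <next> nothing is said; the loop simply waits for the next frame.
    return spoken

def toy_policy(history: List[float]) -> Tuple[str, str]:
    """Hypothetical stand-in for the model: speak when the latest 'frame'
    (here one number, e.g. a joint angle) drops below a threshold."""
    if history[-1] < 90.0:
        return FEEDBACK, "Straighten your leg."
    return NEXT, ""

events = stream_decode([120.0, 110.0, 85.0, 115.0], toy_policy)
print(events)  # [(2, 'Straighten your leg.')]
```

In this sketch the decision of when to speak is an ordinary token prediction, which is what allows timing to be learned end-to-end from aligned video and comments.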
Steppable causal 3D convolutions enable efficient streaming motion perception
Existing vision-language models use a 2D CNN or vision transformer as the visual feature extractor
This makes them unsuitable for tasks such as fitness coaching, which involve understanding of human behaviors and motion patterns
We use a 3D CNN as the feature extractor, which we have shown to be well-suited to end-to-end learning (“Is end-to-end learning enough for fitness activity recognition?”, Mercier et al., 2023)
Efficient visual streaming at inference time can be enabled using steppable, causal convolutions
[Figure: Standard Conv, Causal Conv, and Steppable Conv, compared over previous time steps and the new time step]
Enhance your app with the ability to see & interact with humans via any RGB camera: /quic/sense
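The difference between a causal convolution and its steppable form can be shown along the time axis alone. The sketch below uses a 1D temporal convolution in NumPy rather than a full 3D CNN, which is a simplification made here for brevity. `causal_conv1d` processes a whole clip with left padding, and `SteppableConv1d` produces the same outputs one time step at a time by caching the last K-1 inputs. Both names are illustrative and are not the API of any released library.

```python
import numpy as np

def causal_conv1d(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Causal temporal convolution over a full clip.

    x: (T,) signal, w: (K,) kernel. The input is zero-padded on the left,
    so y[t] depends only on x[t-K+1 .. t] and never on future inputs.
    """
    k = len(w)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([padded[t:t + k] @ w for t in range(len(x))])

class SteppableConv1d:
    """The same convolution, computed one time step at a time."""

    def __init__(self, w: np.ndarray):
        self.w = np.asarray(w, dtype=float)
        # Buffer holding the previous K-1 inputs (zeros before the stream starts).
        self.buffer = np.zeros(len(self.w) - 1)

    def step(self, x_t: float) -> float:
        window = np.append(self.buffer, x_t)   # previous inputs + new input
        self.buffer = window[1:]               # drop the oldest input
        return float(window @ self.w)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.25, 0.25])

offline = causal_conv1d(x, w)
conv = SteppableConv1d(w)
online = np.array([conv.step(v) for v in x])
print(np.allclose(offline, online))  # True
```

Each call to `step` costs one kernel-sized dot product regardless of how long the stream has been running, which is the property that makes this form attractive for live camera input.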
Improving streaming video LLMs using auxiliary tasks

Language generation is not only a useful task, but it also helps a model acquire a degree of “common sense”
Using a language decoder to provide surrogate tasks to the model at training time
Pre-training a model on a difficult captioning task (Something-something by Goyal et al., 2017) allows us to improve prediction accuracy on a separate Home Cooking Task: “On the effectiveness of task granularity for transfer learning” (Mahdisoltani et al., 2018)
[Figure: bar chart for the Home Cooking Task. Categories: Generating complex textual descriptions; Generating simple textual descriptions; Classification on 178 class actions; Classification on 40 action groups; Baseline classification on images; Training from scratch. Values as extracted, mapping to bars not recoverable: 7.7, 34.3, 59.7, 55.8, 62.8, 47.1, 54.4]
“The something-something video database for learning and evaluating visual common sense” (Goyal et al., 2017)
A vision-language model can learn low-level visual skills by encoding visual information as language
Encoding visual information as language is a natural way to teach a vision-language model low-level visual skills, such as object identification, detection, etc.
The use of these visual skills at inference time is like performing chain-of-thought reasoning for visual inference tasks
“Look, Remember and Reason: Grounded reasoning in videos with language models”, Bhattacharyya et al., 2024
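As an illustration of encoding visual information as language, the sketch below serializes object bounding boxes into a short text string that could serve as a training target for a language decoder. The text format, the quantization into 100 bins, and the helper `boxes_to_text` are assumptions made for this example and are not the encoding used in the cited paper.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels

def boxes_to_text(objects: Dict[str, Box], width: int, height: int,
                  bins: int = 100) -> str:
    """Serialize object boxes as text, with coordinates quantized to `bins` levels."""
    parts: List[str] = []
    for name in sorted(objects):  # fixed order keeps targets deterministic
        x0, y0, x1, y1 = objects[name]
        q = [x0 * bins // width, y0 * bins // height,
             x1 * bins // width, y1 * bins // height]
        parts.append(f"{name} [{q[0]}, {q[1]}, {q[2]}, {q[3]}]")
    return "; ".join(parts)

# Hypothetical detections for one 640x480 frame.
frame_objects = {"cup": (64, 120, 128, 240), "hand": (320, 0, 480, 96)}
target = boxes_to_text(frame_objects, width=640, height=480)
print(target)  # cup [10, 25, 20, 50]; hand [50, 0, 75, 20]
```

Quantizing coordinates keeps the vocabulary of position tokens small and independent of image resolution, at the cost of some spatial precision.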
Example: CATER (Girdhar et al., 2020):

| Method | Static Camera Top 1 | Static Camera Top 5 | Moving Camera Top 1 | Moving Camera Top 5 |
|---|---|---|---|---|
| ALOE (Ding et al.) | 74.0 | 94.0 | 59.7 | 90.1 |
| TFC V3D (Zhang et al.) | 79.7 | 95.5 | - | - |
| LRR (w/o Surrogate Tasks) | 68.5 | 88.7 | 62.7 | 86.7 |
| LRR (fine-tuned) | 84.1 | 97.2 | 80.4 | 96.7 |
| LRR (joint) | 81.0 | 97.3 | 73.7 | 95.6 |

Example: Something-Else (Materzynska et al., 2020):

| Method | Base Top 1 | Base Top 5 | Compositional Top 1 | Compositional Top 5 |
|---|---|---|---|---|
| STIN+OIE+NL (Materzynska et al., 2020, MIT) | 78.1 | 94.5 | 56.2 | 81.3 |
| Video-ChatGPT (Maaz et al., 2023) | 52.6 | 75.8 | 38.6 | 67.8 |
| LRR (w/o Surrogate Tasks) | 52.6 | 75.8 | 50.1 | 70.8 |
| LRR (fine-tuned) | 80.2 | 96.1 | 62.0 | 86.3 |
| LRR (joint) | - | - | 61.1 | 85.4 |
Stochastic probing allows us to distill visual skills into the model
• Encoding the extracted low-level information as tokens grows the context window, which can be inefficient
• Relying on explicit representations of low-level computer vision features (such as bounding box positions) may also lead to brittleness
• We therefore propose to distill low-level visual skills into the model using a process we refer to as Stochastic Probing:
Stochastic probing: During training, prompt a model at random time steps to perform low-level visual tasks
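The definition above can be sketched as a data-preparation step: during training, each frame in the stream is followed, with some probability, by a probe prompt and its low-level answer, while at inference time no probes are inserted and the context stays short. The function `add_probes`, the `<probe>` marker, and the probability value are hypothetical choices for this illustration, not the authors' implementation.

```python
import random
from typing import List

def add_probes(stream: List[str], probe_prompt: str, probe_answers: List[str],
               p: float, rng: random.Random) -> List[str]:
    """Insert a low-level probe after randomly chosen frames.

    stream: sequence of "FRAME" markers and text tokens.
    probe_answers[i]: target text for the i-th frame (e.g. serialized boxes).
    With probability p, a frame is followed by the probe prompt and its answer.
    """
    out: List[str] = []
    frame_idx = 0
    for item in stream:
        out.append(item)
        if item == "FRAME":
            if rng.random() < p:
                out += [probe_prompt, probe_answers[frame_idx]]
            frame_idx += 1
    return out

stream = ["FRAME", "FRAME", "good", "FRAME"]
answers = ["a0", "a1", "a2"]  # placeholder targets, one per frame

# Training: probes appear at random frames (here roughly half of them).
train_seq = add_probes(stream, "<probe>", answers, p=0.5, rng=random.Random(0))
# Inference: p = 0 leaves the stream untouched.
infer_seq = add_probes(stream, "<probe>", answers, p=0.0, rng=random.Random(0))
print(train_seq)
print(infer_seq)
```

Because the probes are sampled rather than attached to every frame, the model has to be ready to answer them at any time step, which is one plausible reading of how the skill ends up in the weights instead of in the context window.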
ACRE

| Method | Compositional | Systematic | Inference Speed* (sec) |
|---|---|---|---|
| ALOE (Ding et al.) | 91.7 | 93.9 | - |
| LRR | 99.3 | 99.5 | 1.415 |
| LRR (Stochastic Probing) | 98.2 | 99.2 | 0.061 |

* timing on an A100 GPU
Stochastic probing boosts efficiency at inference time
Training on visual skills can boost performance over classic approaches
A similar approach: “Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes”, Hsieh et al., 2023
End-to-end training in conjunction with stochastic probing allows a model to provide useful and accurate feedback in real time
Qualitative results: end-to-end learning enables video LLMs to deliver accurate live feedback
Question: Provide an appropriate feedback for the user
Video-LLaMA: We see a young man standing in a kitchen, wearing a red shirt and white shorts.
Video-ChatGPT: The user has successfully demonstrated the ability to perform a balancing act on a pair of stools.
Coach-LLaMA: This is awesome. Let’s keep the intensity high!
[Figure panels: Ground truth, Stream-VLM, LLaMA-VID, LLaVA-Next]
Quantitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Zero-shot prompting results:

| METHOD | METEOR | ROUGE-L | BERT | LLM-Acc. |
|---|---|---|---|---|
| InstructBLIP | 0.047 | 0.040 | 0.839 | 1.64 |
| Video-LLaVA | 0.057 | 0.025 | 0.847 | 1.82 |
| Video-ChatGPT | 0.098 | 0.078 | 0.850 | 2.27 |
| Video-LLaMA | 0.101 | 0.077 | 0.859 | 2.28 |
| LLaMA-VID | 0.100 | 0.079 | 0.859 | 2.33 |
| LLaVA-Next | 0.104 | 0.078 | 0.858 | 2.39 |

Fine-tuning results:

| METHOD | METEOR | ROUGE-L | BERT | LLM-Acc. | T-F-Score |
|---|---|---|---|---|---|
| Socratic-Llama-2-7B | 0.094 | 0.071 | 0.860 | 2.39 | 0.50† |
| Video-ChatGPT* | 0.108 | 0.093 | 0.863 | 2.42 | 0.50† |
| LLaMA-VID* | 0.106 | 0.090 | 0.860 | 2.40 | 0.50† |
| STREAM-VLM | 0.125 | 0.116 | 0.863 | 2.56 | 0.59 |
| STREAM-VLM (w/o 3D CNN) | 0.090 | 0.083 | 0.857 | 2.17 | 0.51 |
| STREAM-VLM (w/o Action-Tokens) | 0.125 | 0.110 | 0.861 | 2.56 | 0.50† |
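For readers unfamiliar with the ROUGE-L column, the sketch below computes one common form of the metric: an F1 score over the longest common subsequence (LCS) of words shared by a generated feedback and a reference feedback. The slides do not state which ROUGE-L variant or tokenization was used, so the lowercase whitespace tokenization and the plain F1 form here are assumptions, and the example sentences are invented.

```python
from typing import List

def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

# Invented example: 3 shared words in order ("keep", "back", "straight").
score = rouge_l_f1("keep your back straight", "keep the back straight please")
print(round(score, 3))  # 0.667
```

Scores of this kind reward word overlap in order, which helps explain why all methods in the tables score low in absolute terms: many different sentences can be equally good coaching feedback.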
Outlook: ClevrSkills dataset for robotics foundation models

| DATASET/SIMULATOR | # TASKS | LANGUAGE | MULTIMODAL PROMPTS | ACTION GRANULARITY | COMPOSITIONALITY | # DEMONSTRATIONS |
|---|---|---|---|---|---|---|
| Real | | | | | | |
| RoboTurk | 3 | x | x | Action Deltas | x | 111 hrs |
| BridgeData | 71 | x | x | Action Deltas | x | 7.2k |
| Open-X | | √ | x | Action Deltas | x | 1M |
| RH20T | | √ | x | Action Deltas | x | 100k |
| FMB | 7 | x | x | Action Deltas | √ | 22.5k |
| Simulated | | | | | | |
| CALVIN | 34 | √ | x | Action Deltas | √† | - |
| Behaviour-1K | 1000 | x | x | Action Deltas | x | - |
| Maniskill2 | 20 | x | x | Action Deltas | x | ≈70k |
| VIMA | 17 | √ | √ | Poses | x | 650k |
| ClevrSkills (ours) | 36 | √ | √ | Action Deltas + Poses | √ | 330k |

Running AI on device saves memory