
Integrating senses: How AI is learning to see, hear, and interact

Roland Memisevic
Senior Director of Engineering at Qualcomm AI Research

Joint work with Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Reza Pourreza, Sanjay Haresh, and others

September 24, 2024

Snapdragon and Qualcomm branded products are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

Agenda

• Key concept: streaming architecture
• Importance of datasets for end-to-end training
• Efficient human-AI interaction and video-based reasoning
• Improving streaming video LLMs using auxiliary tasks
• Q&A

Generative AI capabilities continue to increase

MODALITY AND USE CASE
• Voice UI: Voice is a natural and intuitive interface for conversation
• Large multimodal models: Utilizing more sensing input modalities to better understand the world
• Video & 3D: Generating content for a richer and more realistic experience
• Agents: Execute multi-step tasks with reasoning autonomously to achieve a goal

CAPABILITY AND KPI
• Longer context window: Allows in-depth conversations
• Personalization: Fine-tuned models customized to consumers, enterprises, or industries (e.g., LoRA)
• Higher resolution: Process higher fidelity images for better accuracy

LoRA: low-rank adaptation
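As a rough illustration of the low-rank adaptation (LoRA) idea named on this slide: the pretrained weight matrix W stays frozen, and only a low-rank product B·A is trained, so the adapted layer computes (W + (alpha / r)·B·A)·x. The sketch below uses plain Python lists. The helper names, the `alpha` scaling convention, and the matrix sizes are illustrative assumptions and are not taken from the talk.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(w, a, b, x, alpha=1.0):
    """Compute (W + (alpha / r) * B @ A) @ x for a vector x.

    w: frozen d_out x d_in matrix; b: d_out x r; a: r x d_in (b and a are trainable).
    """
    r = len(a)                      # rank of the update
    delta = matmul(b, a)            # low-rank update, same shape as w
    scale = alpha / r
    w_eff = [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
             for w_row, d_row in zip(w, delta)]
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w_eff]


def trainable_params(d_out, d_in, r):
    """Number of trainable values in B (d_out x r) plus A (r x d_in)."""
    return r * (d_out + d_in)
```

For a hypothetical 4096 x 4096 layer and rank 8, `trainable_params(4096, 4096, 8)` gives 65,536 trainable values, compared with 16,777,216 for the full matrix, which is why per-user or per-task adapters are cheap to store and swap.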

Full-stack AI optimization for LVMs

• Runs completely on the device
• Significantly reduces runtime latency and power consumption
• Continuously improves the Qualcomm® AI Stack

Designing an efficient diffusion model through knowledge distillation for high accuracy
Knowledge distillation for pruning and removal of attention blocks, resulting in an accurate model with improved performance and power efficiency

Qualcomm® AI Engine Direct for improved performance and minimized memory spillage

AI acceleration on the Qualcomm® Hexagon™ NPU of the Snapdragon® 8 Gen 3 Mobile Processor

LVM: Language vision model

Hybrid AI

Distribute workloads among cloud and edge/devices to deliver more powerful, efficient, and highly optimized experiences

• Central cloud: Ease of development & deployment | Training | Very large models | Aggregation | Absolute performance
• Edge cloud (on-prem or nearby): Immediacy | Reliability | Personalization | Privacy | Security | Fine-tuning | Aggregation
• On device: Immediacy | Reliability | Personalization | Privacy | Security | Cost | Energy

To scale, the center of gravity of AI processing is moving to the edge

World's first large multimodal model (LMM) on an Android phone

LLMs can now see

• 7+ billion parameter LMM, LLaVA, with text, speech, and image inputs
• Multi-turn intuitive conversations about an image at a responsive token rate
• Full-stack AI optimization to achieve high performance at low power
• Enhanced privacy, reliability, personalization, and cost with on-device processing

LLM: Large Language Model; LLaVA: Large Language and Vision Assistant

Goal: Training AI models to see and interact with humans

SMART HOME | MOBILE | ROBOTICS

Researching visually-grounded LLMs with the ability to reason and interact with the environment

[Diagram: Visually-grounded LLM, with components Vision, Action recognition, Orchestrator, LLM, Frontend, TTS]

Situated vision-language models
• Process a live video stream in real time and dynamically interact with users
• Determine what to say and when to say it
• Enable the path to humanoids

Open-ended, asynchronous interaction with situated agents is an open challenge
• Limited to turn-based interactions about offline documents or images
• Limited to capturing momentary snapshots of reality in a VQA-style dialogue

What to Say and When to Say it: Video-Language Model and Benchmark for Situated Interactions (2024); OpenEQA: Embodied Question Answering in the Era of Foundation Models (2024)
VQA: visual question answering

Neural networks have replaced increasingly complex computational pipelines

[Figure: timeline 2010, 2012, 2014]
• SPEECH TO TEXT: Audio → Pipeline / Neural network → Text
• OBJECT RECOGNITION: Pixels → Pipeline / Neural network → Objects
• LANGUAGE TRANSLATION: English → Pipeline / Neural network → French

End-to-end backprop for agents

INPUT STREAM → (AUTO-REGRESSIVE) NEURAL NETWORK → BEHAVIOR STREAM

Key concept: Multi-modal streaming architecture

INPUT STREAM → (AUTO-REGRESSIVE) NEURAL NETWORK → BEHAVIOR STREAM
TRAINED END-TO-END

[Diagram: EXTERNAL INPUT (e.g., camera) → AUTO-REGRESSIVE LLM → LANGUAGE OR ACTIONS]

• An auto-regressive language model is a useful component of a multi-modal agent because it is already able to perform a dialogue with a user
• Additionally, language makes it easy to encode surrogate tasks for a degree of "common sense" to emerge

End-to-end learning requires a multi-modal streaming architecture

End-to-end learning requires a multi-modal streaming architecture

[Diagram: EXTERNAL INPUT (e.g., camera) → AUTO-REGRESSIVE LLM → LANGUAGE OR ACTIONS]
CONTEXT WINDOW: FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN TOKEN FRAME TOKEN TOKEN

• Visual foundation models that combine an image feature extractor with a language model backbone have become increasingly common
• There are multiple ways to combine visual information with language model tokens, e.g.:
  • Cross-attention (e.g., Flamingo)
  • Dedicated vision tokens (e.g., LLaVA)
  … good for applications like captioning and visual question answering

However, a live agent that can utilize a real-time camera feed requires a system that can continuously attend to visual input

• Challenges:
  • Freely interleaved vision frames and language tokens
  • Dependencies between vision frame rate and token rate
  • Training data, allowing a model to learn what to say and when
• Recent work: "VideoLLM-online: Online Video Large Language Model for Streaming Video", Chen et al., 2024, and our work, which I will present in the next slides

"Flamingo: a Visual Language Model for Few-Shot Learning", Alayrac et al., 2022; "Visual Instruction Tuning", Liu et al., 2023
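To make the interleaving challenge concrete, here is a minimal sketch of a context window in which frames and emitted tokens share one sequence. It assumes a simple coupling between frame rate and token rate (the model may emit at most token_rate / frame_rate tokens between two frames) and a drop-oldest eviction policy. Both choices, and all function names, are illustrative assumptions rather than the design presented in the talk.

```python
from collections import deque


def tokens_per_frame(token_rate_hz, frame_rate_hz):
    """Upper bound on tokens that fit between two consecutive frames."""
    return int(token_rate_hz // frame_rate_hz)


def build_context(frames, replies, budget, max_len):
    """Interleave frames with the tokens emitted after each frame.

    frames:  list of frame ids, in arrival order
    replies: dict mapping frame id -> list of tokens emitted after that frame
    budget:  maximum tokens kept per frame (see tokens_per_frame)
    max_len: window size; the oldest entries are evicted beyond this (assumption)
    """
    window = deque(maxlen=max_len)
    for f in frames:
        window.append(("FRAME", f))
        for tok in replies.get(f, [])[:budget]:
            window.append(("TOKEN", tok))
    return list(window)
```

With a 4 Hz camera and a 12 tokens/s decoder, `tokens_per_frame(12, 4)` allows three tokens per frame, which matches the FRAME TOKEN TOKEN TOKEN pattern in the context window diagram.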

Importance of datasets for end-to-end training

Datasets for end-to-end training of visual assistants

Key requirement for end-to-end training: aligned video feed (frames) + assistant's comments (tokens)

• "HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World", Wang et al. 2024: 1st-person videos showing a variety of tasks (20 tasks across 16 objects)
• "Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?", Bao et al. 2023: 1st-person videos showing preparation of cupcakes
• "Live Fitness Coaching as a Testbed for Situated Interactions", Panchal et al. 2024: 3rd-person videos showing fitness exercises and their corrections
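The alignment requirement above can be sketched as a small preprocessing step: each timestamped comment is mapped to the frame at which it starts, and every other frame gets an explicit "say nothing" target, so that the model can also learn when to stay silent. The function name, the `<silence>` marker, and the (start time, text) annotation format are hypothetical and chosen only for illustration.

```python
def align_comments(num_frames, fps, comments, silence="<silence>"):
    """Return one training target per frame.

    comments: list of (start_time_seconds, text) pairs from the annotation.
    A frame's target is the comment starting at that frame, else the silence
    marker. Comments that fall outside the video are dropped.
    """
    targets = [silence] * num_frames
    for start, text in comments:
        idx = int(start * fps)          # frame index at which the comment begins
        if 0 <= idx < num_frames:
            targets[idx] = text
    return targets


# Example: a 3-second clip sampled at 2 frames per second.
example = align_comments(6, 2, [(0.5, "Good start"), (2.0, "Keep your back straight")])
```

In this toy example, frames 1 and 4 carry the two comments and the remaining four frames carry the silence marker.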

FIT-Coach benchmark and dataset

Fitness questions dataset
• 148 exercises
• 300k short-clip videos
• 470+ hours
• 1900 unique participants
• 1.1M+ high-level question-answer pairs
• 400k+ fine-grained question-answer pairs

Fitness feedback dataset
• 9+ hours of fitness coaching sessions
• 148 exercise sessions
• ∼3.5-minute-long sessions with 5 to 6 exercises
• 21 unique participants

A novel interactive visual coaching benchmark and dataset as a test-bed for real-time, real-world situated interaction

Aimed at the development of interactive multi-modal vision-language models based in the controlled but challenging fitness coaching domain

Live Fitness Coaching as a Testbed for Situated Interaction, Panchal, Bhattacharyya, et al. 2024

Fitness assistant dataset and benchmark

• Short video clips showing the user performing individual exercises, along with labels for performance and common mistakes (~300k clips of duration ~5-10 seconds each)
• Long-range videos showing the user exercising, along with aligned comments by the coach (~200 sessions across 5-6 exercises each)

 | SHORT CLIPS Train | SHORT CLIPS Test | LONG-RANGE Train | LONG-RANGE Test†
Number of videos | 290,775 | 16,429 | 153 | 69
Unique Participants | 1,800+ | 100 | 21 | 7
Average Duration (s) | 5.6±1.1 | 5.6±1.2 | 213.4±3.1 | 213.7±3.3
Exercises per Video | 1 | 1 | 5-6 | 5-6
Total Number of Exercises | 148 | 148 | 23 | 23
Total Classes | 1866 | 1690 | - | -
Fitness Questions
Total High-level Questions | 1,193,056 | 78,390 | - | -
Total Fine-grained Questions | 404,082 | 80,694 | - | -
Fitness Feedbacks
Average Feedbacks per Exercise | 2.0±10.1 | 2.4±6.9 | 5.0±1.3 | 5.0±1.2
Average Silence Period (s)†† | n/a | n/a | 5.2±1.4 | 5.3±1.2
Average Feedback Length (words) | 9.0±6.1 | 9.1±5.0 | 6.3±3.8 | 6.6±4.0

Fitness assistant dataset and benchmark

[Figures: Long fitness sessions dataset; Short fitness clips dataset]

Our dataset meets all the needs of interactive AI assistants

[Table: criteria columns are HUMAN ACTIONS, INTERACTIVE, MISTAKES, CORRECTIVE FEEDBACKS, DOMAIN EXPERTISE. The x marks for each row are listed in extraction order; their column positions are not preserved.]

DATASET | DOMAIN | CRITERIA MARKED x | LENGTH
Action Recognition Datasets
NTU RGB+D, FineGym | Fitness, Fitness | x x x x x x | 708
Procedural Activity Datasets
YouCook2 | Cooking | x x x x x | 176
Epic-Kitchens | Cooking | x x x x x | 100
HowTo100M | Daily-life | x x x x | 134k
Ego-4D | Daily-life | x x x x x | 3670
Ego-Exo4D | Daily-life | x x x x | 1422
Assembly-101 | Toy assm. | x x x x | 513
Interactive AI Assistant Datasets
WTAG | Cooking | x x x | 10
HoloAssist | Obj. manip. | x x x | 166
QEVD (Ours) | Fitness | (none) | 474

Efficient human-AI interaction and video-based reasoning

Detailed architecture: Learning what to say and when to say it

[Diagram labels: EXTERNAL INPUT (e.g., camera); Visual stream; 3D CNN; AUTO-REGRESSIVE LLM with PROMPT and LANGUAGE BACKBONE; stacked SELF-ATTN and CROSS-ATTN layers; outputs LANGUAGE OR ACTIONS, including the special tokens <next> and <feedback>]
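The diagram's `<next>` and `<feedback>` tokens suggest a decoding loop in which the model itself decides whether to wait for more video or to speak. The sketch below assumes that `<next>` means "request the next frame" and `<feedback>` means "start a feedback utterance"; the slide does not spell out these semantics, so treat them as an interpretation. `run_stream` and `toy_policy` are hypothetical names, and `toy_policy` is a rule-based stand-in for the actual model.

```python
NEXT, FEEDBACK = "<next>", "<feedback>"


def run_stream(frames, step_fn, max_tokens=16):
    """Feed frames one at a time; after each frame, decode until <next>.

    step_fn(context) -> next token (stand-in for the auto-regressive model).
    Returns a list of (frame_index, feedback_text) pairs.
    """
    context, outputs = [], []
    for i, frame in enumerate(frames):
        context.append(("frame", frame))
        words, speaking = [], False
        for _ in range(max_tokens):
            tok = step_fn(context)
            context.append(("token", tok))
            if tok == NEXT:            # model asks for the next frame
                break
            if tok == FEEDBACK:        # model starts a feedback utterance
                speaking = True
            elif speaking:
                words.append(tok)
        if words:
            outputs.append((i, " ".join(words)))
    return outputs


def toy_policy(context):
    """Hypothetical rule-based model: comment only on frames labelled 'bad'."""
    last = max(i for i, (kind, _) in enumerate(context) if kind == "frame")
    frame = context[last][1]
    emitted = [value for _, value in context[last + 1:]]
    script = [FEEDBACK, "straighten", "your", "back", NEXT] if frame == "bad" else [NEXT]
    return script[len(emitted)]
```

Running `run_stream(["ok", "bad", "ok"], toy_policy)` stays silent on the first and third frames and produces one piece of feedback on the second, which is the "what to say and when to say it" behavior in miniature.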

Steppable causal 3D convolutions enable efficient streaming motion perception

• Existing vision-language models use a 2D CNN or vision transformer as the visual feature extractor
• This makes them unsuitable for tasks such as fitness coaching, which involve understanding of human behaviors and motion patterns
• We use a 3D CNN as the feature extractor, which we have shown to be well-suited to end-to-end learning ("Is end-to-end learning enough for fitness activity recognition?", Mercier et al. 2023)
• Efficient visual streaming at inference time can be enabled using steppable, causal convolutions

[Figure: Standard Conv, Causal Conv, and Steppable Conv, showing previous time steps and the new time step; diagram: EXTERNAL INPUT (e.g., camera) → LANGUAGE OR ACTIONS]

Enhance your app with the ability to see & interact with humans via any RGB camera: /quic/sense
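The difference between a causal convolution over a whole clip and its steppable form can be shown along the time axis alone. The sketch below is a deliberately simplified 1D version with scalar "frames": the talk's models use 3D convolutions over video, and the function and class names here are illustrative assumptions. The point is that a small buffer of recent inputs lets each new frame be processed once, with the same result as recomputing over the full sequence.

```python
from collections import deque


def causal_conv(xs, kernel):
    """Causal temporal convolution over a full sequence.

    The input is zero-padded on the left, so the output at time t depends
    only on inputs at times <= t.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(xs)
    return [sum(w * padded[t + j] for j, w in enumerate(kernel))
            for t in range(len(xs))]


class SteppableConv:
    """Same computation, one time step at a time.

    Keeps a buffer of the k most recent inputs, so each call to step()
    costs O(k) regardless of how long the stream has been running.
    """

    def __init__(self, kernel):
        self.kernel = list(kernel)
        self.buffer = deque([0.0] * len(self.kernel), maxlen=len(self.kernel))

    def step(self, x):
        self.buffer.append(x)          # oldest input falls out automatically
        return sum(w * v for w, v in zip(self.kernel, self.buffer))


# Streaming use: feed frames as they arrive.
conv = SteppableConv([1, 2, 3])
streamed = [conv.step(x) for x in [1, 2, 3, 4]]
```

Here `streamed` equals `causal_conv([1, 2, 3, 4], [1, 2, 3])`, so the streaming and offline paths agree while the streaming path never revisits old frames.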

Improving streaming video LLMs using auxiliary tasks

Language generation is not only a useful task, but it also helps a model acquire a degree of "common sense"

Using a language decoder to provide surrogate tasks to the model at training time

Pre-training a model on a difficult captioning task (Something-Something by Goyal et al. 2017)* allows us to improve prediction accuracy on a separate Home Cooking task: "On the effectiveness of task granularity for transfer learning" (Mahdisoltani et al. 2018)

[Bar chart: Home Cooking task accuracy by pre-training variant. Bar labels: Generating complex textual descriptions; Generating simple textual descriptions; Classification on 178 action classes; Classification on 40 action groups; Baseline classification on images; Training from scratch. Values in extraction order: 7.7, 34.3, 59.7, 55.8, 62.8, 47.1, 54.4]

* "The something-something video database for learning and evaluating visual common sense" (Goyal et al. 2017)

A vision-language model can learn low-level visual skills by encoding visual information as language

• Encoding visual information as language is a natural way to teach a vision-language model low-level visual skills, such as object identification, detection, etc.
• The use of these visual skills at inference time is like performing chain-of-thought reasoning for visual inference tasks

"Look, Remember and Reason: Grounded reasoning in videos with language models", Bhattacharyya et al. 2024

Example: CATER (Girdhar et al., 2020):

Method | Static Camera Top 1 | Static Camera Top 5 | Moving Camera Top 1 | Moving Camera Top 5
ALOE (Ding et al.) | 74.0 | 94.0 | 59.7 | 90.1
TFC V3D (Zhang et al.) | 79.7 | 95.5 | - | -
LRR (w/o Surrogate Tasks) | 68.5 | 88.7 | 62.7 | 86.7
LRR (fine-tuned) | 84.1 | 97.2 | 80.4 | 96.7
LRR (joint) | 81.0 | 97.3 | 73.7 | 95.6

Example: Something-Else (Materzynska et al., 2020):

Method | Base Top 1 | Base Top 5 | Compositional Top 1 | Compositional Top 5
STIN+OIE+NL (Materzynska et al., 2020, MIT) | 78.1 | 94.5 | 56.2 | 81.3
Video-ChatGPT (Maaz et al., 2023) | 52.6 | 75.8 | 38.6 | 67.8
LRR (w/o Surrogate Tasks) | 52.6 | 75.8 | 50.1 | 70.8
LRR (fine-tuned) | 80.2 | 96.1 | 62.0 | 86.3
LRR (joint) | - | - | 61.1 | 85.4

Stochastic probing allows us to distill visual skills into the model

• Encoding the extracted low-level information as tokens grows the context window and can be inefficient
• Relying on explicit representations of low-level computer vision features (such as bounding box positions) may also lead to brittleness
• We therefore propose to distill low-level visual skills into the model using a process we refer to as Stochastic Probing

Stochastic probing: During training, prompt a model at random time steps to perform low-level visual tasks

ACRE:

Method | Compositional | Systematic | Inference Speed* (sec)
ALOE (Ding et al.) | 91.7 | 93.9 | -
LRR | 99.3 | 99.5 | 1.415
LRR (Stochastic Probing) | 98.2 | 99.2 | 0.061

* timing on an A100 GPU

• Stochastic probing boosts efficiency at inference time
• Training on visual skills can boost performance over classic approaches

A similar approach: "Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes", Hsieh et al., 2023
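One way to picture stochastic probing during training is as a data augmentation step: at randomly chosen frames, a low-level visual question and its ground-truth answer are spliced into the training sequence, while at inference time no probes are inserted and the context stays short. The sketch below is a minimal version under that reading. The function name, the per-frame probability `p`, the ("probe", …) / ("target", …) item format, and the example question are all hypothetical, not the published training recipe.

```python
import random


def add_probes(sequence, probes, p=0.25, seed=None):
    """Insert low-level visual probes after randomly chosen frames.

    sequence: list of ("frame", frame_id) or ("token", text) items
    probes:   dict frame_id -> (question, answer), e.g. from ground-truth labels
    p:        probability of probing each eligible frame (assumption)
    """
    rng = random.Random(seed)
    out = []
    for item in sequence:
        out.append(item)
        if item[0] == "frame" and item[1] in probes and rng.random() < p:
            question, answer = probes[item[1]]
            out.append(("probe", question))    # prompt shown to the model
            out.append(("target", answer))     # supervised answer
    return out


# Toy training sequence with one hypothetical probe per frame.
seq = [("frame", 0), ("token", "hi"), ("frame", 1)]
labels = {0: ("Where is the ball?", "left"), 1: ("Where is the ball?", "right")}
```

Setting `p=0.0` reproduces the inference-time sequence with no probes, while `p=1.0` probes every labelled frame; intermediate values trade supervision density against training sequence length.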

End-to-end training in conjunction with stochastic probing allows a model to provide useful and accurate feedback in real time

Qualitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Question: Provide an appropriate feedback for the user

Video-LLaMA: We see a young man standing in a kitchen, wearing a red shirt and white shorts.
Video-ChatGPT: The user has successfully demonstrated the ability to perform a balancing act on a pair of stools.
Coach-LLaMA: This is awesome. Let's keep the intensity high!

[Figure panels: Ground truth, Stream-VLM, LLaMA-VID, LLaVA-Next]

Quantitative results: end-to-end learning enables video LLMs to deliver accurate live feedback

Zero-shot prompting results:

METHOD | METEOR | ROUGE-L | BERT | LLM-Acc.
InstructBLIP | 0.047 | 0.040 | 0.839 | 1.64
Video-LLaVA | 0.057 | 0.025 | 0.847 | 1.82
Video-ChatGPT | 0.098 | 0.078 | 0.850 | 2.27
Video-LLaMA | 0.101 | 0.077 | 0.859 | 2.28
LLaMA-VID | 0.100 | 0.079 | 0.859 | 2.33
LLaVA-Next | 0.104 | 0.078 | 0.858 | 2.39

Fine-tuning results:

METHOD | METEOR | ROUGE-L | BERT | LLM-Acc. | T-F-Score
Socratic-Llama-2-7B | 0.094 | 0.071 | 0.860 | 2.39 | 0.50†
Video-ChatGPT* | 0.108 | 0.093 | 0.863 | 2.42 | 0.50†
LLaMA-VID* | 0.106 | 0.090 | 0.860 | 2.40 | 0.50†
STREAM-VLM | 0.125 | 0.116 | 0.863 | 2.56 | 0.59
STREAM-VLM (w/o 3D CNN) | 0.090 | 0.083 | 0.857 | 2.17 | 0.51
STREAM-VLM (w/o Action-Tokens) | 0.125 | 0.110 | 0.861 | 2.56 | 0.50†

Outlook: CLEVRskills dataset for robotics foundation models

[Table: x marks are listed as extracted; blank cells had no surviving mark or value in the extraction.]

DATASET/SIMULATOR | #TASKS | LANGUAGE, MULTIMODAL PROMPTS | ACTION GRANULARITY | COMPOSITIONALITY | #DEMONSTRATIONS
Real
RoboTurk | 3 | x x | Action Deltas | x | 111 hrs
BridgeData | 71 | x x | Action Deltas | x | 7.2k
Open-X | | x | Action Deltas | x | 1M
RH20T | | x | Action Deltas | x | 100k
FMB | 7 | x x | Action Deltas | | 22.5k
Simulated
CALVIN | 34 | x | Action Deltas | √† |
Behaviour-1K | 1000 | x x | Action Deltas | x |
Maniskill2 | 20 | x x | Action Deltas | x | ≈70k
VIMA | 17 | | Poses | x | 650k
ClevrSkills (ours) | 36 | | Action Deltas + Poses | | 330k

Running AI on device saves memory
