Deep Multimodal Data Fusion

FEI ZHAO, The University of Alabama at Birmingham, Birmingham, AL, USA
CHENGCUI ZHANG, The University of Alabama at Birmingham, Birmingham, AL, USA
BAOCHENG GENG, The University of Alabama at Birmingham, Birmingham, AL, USA
Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become more and more sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making processes into one single model. The boundaries between those processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), based on where in the model the fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy grouping the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion are focused only on one specific task with a combination of two specific modalities. Unlike those, this survey covers a broader combination of modalities, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), and so on, and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.
CCS Concepts: • Computing methodologies → Artificial intelligence; Natural language processing; Computer vision; Machine learning;
Additional Key Words and Phrases: Data fusion, neural networks, multimodal deep learning
ACM Reference Format:
Fei Zhao, Chengcui Zhang, and Baocheng Geng. 2024. Deep Multimodal Data Fusion. ACM Comput. Surv. 56, 9, Article 216 (April 2024), 36 pages. https://doi.org/10.1145/3649447
1 INTRODUCTION
Data, without a doubt, is an extremely important catalyst in technological development, especially in the Artificial Intelligence (AI) field. About 90% of all data available in the world was generated in the last 20 years, and the rate of data growth is still accelerating. The explosion of data provides an unprecedented chance for AI to thrive.
With the advancement of sensor technologies, not only have the amount and quality of data increased and improved, but the diversity of data is also skyrocketing. The data captured from different sensors provide people with distinct "views" or "perspectives" of the same objects, activities, or
Authors' addresses: F. Zhao, The University of Alabama at Birmingham, University Hall 4105, 1402 10th Ave. S., Birmingham, AL, 35294, USA; e-mail: larry5@; C. Zhang, The University of Alabama at Birmingham, University Hall 4143, 1402 10th Ave. S., Birmingham, AL, 35294, USA; e-mail: czhang02@; B. Geng, The University of Alabama at Birmingham, University Hall 4147, 1402 10th Ave. S., Birmingham, AL, 35294, USA; e-mail: bgeng@.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@.
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM 0360-0300/2024/04-ART216
https://doi.org/10.1145/3649447
ACM Comput. Surv., Vol. 56, No. 9, Article 216. Publication date: April 2024.
Fig. 1. The world has been projected into multiple dimensions/domains.
phenomena. In other words, people are able to observe the same objects, activities, or phenomena in different "dimensions" or "domains" by using different sensors. These new "views" help people obtain a better understanding of the world. For example, 100 years ago, in the medical field, it was extremely difficult for physicians to diagnose whether a patient had a lung tumor due to the limited ways of observing organs. After the invention of the first computerized tomography (CT) scanner based on X-ray technology, the data captured from the machine provided much richer information about lungs, enabling physicians to make diagnoses based on CT images alone. With the advancement of technology, magnetic resonance imaging (MRI), a medical imaging technique that uses strong magnetic fields and radio waves, has been used to detect tumors as well. Nowadays, physicians are able to access multimodal data including CT, MRI, blood test data, and so on. The accuracy of diagnosis based on the combination of these data is much higher, compared with that based on a single modality alone, e.g., CT or MRI only. This is because the complementary and redundant information among CT, MRI, and blood test data can help physicians build a more comprehensive view of an observed object, activity, or phenomenon. The evolution of AI also follows a similar path. In its infancy, AI focused only on solving problems using a single modality. Nowadays, AI tools have become increasingly capable of solving real-world problems by using multimodality.
What is multimodality? In reality, when we experience the world, we see objects, hear sounds, feel textures, smell odors, and taste flavors [11]. The world is represented by information in different mediums, e.g., vision, sounds, and textures. A visualization is shown in Figure 1. Our receptors, such as eyes and ears, help us capture the information. Then, our brain is able to fuse the information from different receptors to form a prediction or a decision. The information obtained from each source/medium can be viewed as one modality. When the number of modalities is greater than one, we call it multimodality. However, instead of using eyes and ears, machines highly depend on sensors such as RGB cameras, microphones, or other types of sensors, as shown in Figure 2. Each sensor can map the observed objects/activities into its own dimension. In other words, the observed objects/activities can be projected into the dimension of each sensor. Then, machines or robots can collect the data from each sensor and make a prediction or decision based on them. In industry, there are numerous applications taking advantage of multimodality. For example, the autonomous vehicle, one of the hottest topics since the 2020s, is a typical application relying on multimodality. Such a system requires multiple types of data from different sensors, e.g., LiDAR sensors, Radar sensors, cameras, and GPS. The model fuses these data to make real-time predictions. In the medical field, more and more applications rely on the fusion of medical imaging and electronic health records to enable models to analyze imaging findings in the clinical context, e.g., CT and MRI fusion.
Fig. 2. The world has been projected into multiple dimensions/domains by different types of sensors.
Fig. 3. The kid is striking a drum. Even if the drum is not visible, based on the vision and audio information, we can still recognize the activity correctly.
Why do we need multimodality? In general, multimodal data refer to the data collected from different sensors, e.g., CT images, MRI images, and blood test data for cancer diagnosis; RGB data and LiDAR data for autonomous driving systems; RGB data and infrared data for skeleton detection with Kinect [28]. For the same observed object or activity, the data from different modalities can have distinct expressions and perspectives. Although the characteristics of these data can be independent and distinct, they often overlap semantically. This phenomenon is called information redundancy. Furthermore, information from different modalities can be complementary. Humans can unconsciously fuse multimodal data, obtain knowledge, and make predictions. The complementary and redundant information extracted from multiple modalities can help humans form a comprehensive understanding of the world. As the example shown in Figure 3, when a kid is drumming, even if we cannot see the drum, we are still able to recognize that a drum is being struck based on the sounds. In this process, we unconsciously fuse the vision and acoustic data, and extract their complementary information, to make a correct prediction. If there is only one modality available, e.g., the vision modality with the drum object out of sight, we can only tell that a kid is waving two sticks. With only the sound available, we would only be able to tell that a drum is being struck without knowing who is drumming. Therefore, in general, an independent interpretation based on an individual modality only presents partial information about the observed activity. However, a multimodality-based interpretation can deliver the "fuller picture" of the observed activity, which can be more robust and reliable than single-modality-based models. For instance, autonomous vehicles containing multiple sensors such as RGB cameras and LiDAR sensors need to detect objects on the road in extreme weather conditions where visibility is near zero, e.g., dense fog or heavy rain. A multimodal-based model can still detect objects where pure-vision-based models cannot. However, it is extremely hard for machines to understand and figure out how to fuse and take advantage of the complementary nature of multimodal data to improve prediction/classification accuracy.
How to fuse multimodal data? In the 1990s, as traditional Machine Learning (ML), a subclass of AI, flourished, ML-based models for addressing multimodal problems began to thrive. It became common for machines to extract knowledge from multimodal data and make decisions. However, back then most of the works were focused on feature engineering, e.g., how to obtain a better representation for each modality. During that time, many modality-specific hand-crafted feature extractors were proposed, which greatly rely on prior knowledge of the specific tasks and the corresponding data. Since these feature extractors work independently, they can hardly capture the complementary and redundant nature of multiple modalities. Therefore, such a feature engineering process inevitably results in a loss of information before the features are sent to the ML-based model. This has a negative impact on the performance of traditional ML-based models. Although traditional ML-based models have the ability to analyze multimodal information, there is a long way to go to achieve the ultimate goal of AI, which is to mimic humans or even surpass human performance. Therefore, how to fuse the data in a way that can automatically learn the complementary and redundant information and minimize manual interference remains an open problem in the traditional ML field.
Deep learning is a sub-field of ML. It allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction [88]. Its key advantage is that the hierarchical representations can be learned in an automated way, which does not require domain knowledge or human effort. For example, given the data X = {x1, x2, ..., xN} and Y = {y1, y2, ..., yN}, a two-layer neural network can be defined as the combination of weight matrices W1 and W2 and a non-linear function σ(·), as shown in Equation (1). After the training process, we can find W1 and W2 for which ŷi is close to yi for all i ≤ N. As the depth of the model continues to increase, so does its ability of feature representation.

ŷi(xi) = W2 · σ(W1 xi).    (1)
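As a concrete illustration, Equation (1) can be written out in a few lines of NumPy. This is only a sketch with hypothetical dimensions; ReLU is assumed here as the non-linearity σ(·), which the text leaves unspecified.

```python
import numpy as np

def relu(z):
    # One common choice for the non-linearity sigma(.)
    return np.maximum(0.0, z)

def two_layer_net(x, W1, W2):
    # Equation (1): y_hat(x) = W2 . sigma(W1 x)
    return W2 @ relu(W1 @ x)

# Hypothetical sizes: 4-d input, 3-d hidden layer, 2-d output.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((2, 3))
x = rng.standard_normal(4)
y_hat = two_layer_net(x, W1, W2)
print(y_hat.shape)  # (2,)
```

Training would then adjust W1 and W2 (e.g., by gradient descent on a loss) so that ŷi stays close to yi across the dataset.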
Since 2010, multimodal data fusion has entered the stage of deep learning in an all-around way. Deep learning-based multimodal data fusion methods have demonstrated outstanding results in various applications. For video-audio-based multimodal data fusion, the works from [35, 37, 51, 163] address the emotion recognition problem by using deep learning techniques, including convolutional neural networks, long short-term memory (LSTM) networks, attention mechanisms, and so on. Also, for video-text multimodal data fusion, the works from [41, 56, 68, 107, 123, 124, 195] address the text-to-video retrieval task by using Transformer, BERT, attention mechanisms, adversarial learning, and combinations of them. There are various other multimodal tasks, e.g., visual question answering (VQA) (text-image: [154, 220], text-video: [82, 223]), RGB-depth object segmentation [31, 39], medical data analysis [181, 185], and image captioning [216, 237]. Compared to traditional ML-based methods, deep neural network (DNN)-based methods show superior performance on representation learning and modality fusion if the amount of training data is large enough. Furthermore, a DNN is able to execute feature engineering by itself, which means a hierarchical representation can be automatically learned from data, instead of manually designing or handcrafting modality-specific features. Traditionally, the methods of multimodal data fusion are classified into four categories, based on the conventional fusion taxonomy shown in Figure 4, including early fusion, intermediate fusion, late fusion, and hybrid fusion: (1) early fusion: the raw data or pre-processed data obtained from each modality are fused before being sent to the model; (2) intermediate fusion: the features extracted from different modalities are fused together and sent to the model for decision making; (3) late fusion (also known as "decision fusion"): the individual decisions obtained from each modality are fused to form the final prediction, e.g., by majority vote, weighted average, or a meta ML model on top of individual decisions; (4) hybrid fusion: a combination of early, intermediate, and late fusion. With large amounts
Fig. 4. The conventional taxonomy categorizes fusion methods into three classes.
of multimodal data available, the need for more advanced methods (vs. handpicked ways of fusion) to fuse them has grown unprecedentedly. However, this conventional fusion taxonomy can only provide basic guidance for multimodal data fusion. In order to extract richer representations from multimodal data, the architecture of DNNs becomes more and more sophisticated, no longer extracting features from each modality separately and independently. Instead, representation learning, modality fusing, and decision making are interlaced in most cases. Therefore, there is no need to specify exactly in which part of the network the multimodal data fusion occurs. The method of fusing multimodal data has changed from traditional explicit ways, e.g., early fusion, intermediate fusion, and late fusion, to more implicit ways. To force the DNN to learn how to extract complementary and redundant information from multimodal data, researchers have invented various constraints on DNNs, including specifically designed network architectures, regularizations on loss functions, and so on. Therefore, the development of deep learning has significantly reshaped the landscape of multimodal data fusion, revealing the inadequacies of the traditional taxonomy of fusion methods. The inherent complexity of deep learning architectures often interlaces representation learning, modality fusing, and decision-making, defying the simplistic categorizations of the past. Furthermore, the shift from explicit to more implicit fusion methods, exemplified by attention mechanisms, has challenged the static nature of traditional fusion strategies. Techniques such as graph neural networks (GNNs) and generative neural networks (GenNNs) introduce novel ways of handling and fusing data that are not aligned with the early-to-late fusion framework. Additionally, the dynamic and adaptive fusion capabilities of deep models, coupled with the challenges posed by large-scale data, necessitate more sophisticated fusion methods than the conventional categories can encapsulate. Recognizing these complexities and the rapid evolution, it becomes imperative to introduce a taxonomy that delves deeper, capturing the subtleties of contemporary fusion methods.
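The conventional categories discussed above (early, intermediate, late) can be contrasted in a short sketch. This is a toy NumPy illustration with hypothetical shapes and a linear-plus-ReLU stand-in for the encoders, not any specific published model:

```python
import numpy as np

def encode(x, W):
    # Stand-in feature extractor: one linear layer with a ReLU.
    return np.maximum(0.0, W @ x)

rng = np.random.default_rng(0)
rgb, depth = rng.standard_normal(8), rng.standard_normal(8)
W_joint = rng.standard_normal((4, 16))                        # encoder for the fused input
W_rgb, W_depth = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))

# (1) Early fusion: concatenate raw inputs, then run one shared encoder.
early = encode(np.concatenate([rgb, depth]), W_joint)

# (2) Intermediate fusion: encode each modality separately, fuse the features.
intermediate = np.concatenate([encode(rgb, W_rgb), encode(depth, W_depth)])

# (3) Late fusion: each modality yields its own decision; average the decisions.
late = 0.5 * (encode(rgb, W_rgb).mean() + encode(depth, W_depth).mean())

print(early.shape, intermediate.shape)  # (4,) (8,)
```

Hybrid fusion would simply combine several of these points of fusion within one model; the point of the new taxonomy is that modern architectures rarely fix the fusion point so cleanly.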
For multimodal data fusion, there are several recent surveys available in the science community. Gao et al. [46] provide a review on multimodal neural networks and SOTA architectures. However, the review is focused only on a narrow research area: the object recognition task for RGB-depth images. Moreover, this survey is limited to convolutional neural networks. Zhang et al. [235] present a survey on deep multimodal fusion. However, the authors categorize the models using the conventional taxonomy: early fusion, late fusion, and hybrid fusion. Furthermore, this survey is focused on the image segmentation task only. Abdu et al. [2] provide a literature review of
Fig. 5. The diagram of our proposed fine-grained taxonomy of deep multimodal data fusion models.
multimodal sentiment analysis using deep learning approaches. It categorizes the deep learning-based approaches into three classes: early fusion, late fusion, and temporal-based fusion. However, similar to the above surveys, this review is narrowly focused on sentiment analysis. Gao et al. [45] provide a survey on multimodal data fusion. It introduces the basic concepts of deep learning and several architectures of deep multimodal models, including stacked autoencoder-based methods, recurrent neural network-based methods, convolutional neural network-based methods, and so on. However, it does not include the SOTA large pre-trained models, e.g., the BERT model, or GNN-based methods. Meng et al. [121] present a review of ML for data fusion. It emphasizes traditional ML techniques instead of deep learning techniques. Also, the authors classify the methods into three different categories: signal-level fusion, feature-level fusion, and decision-level fusion. This way of categorizing the fusion methods is similar to that of the conventional taxonomy (early fusion, intermediate fusion, and late fusion), which is not new to the community. There are several other reviews [4, 128, 227] in the field of multimodality, most of which focus on a specific combination of modalities, e.g., RGB-depth images.
Therefore, in this article, we provide a comprehensive survey and categorization of deep multimodal data fusion. The contributions of this review are three-fold:
—We provide a novel fine-grained taxonomy of the deep multimodal data fusion models, diverging from existing surveys that categorize fusion methods according to conventional taxonomies such as early, intermediate, late, and hybrid fusion. In this survey, we explore the latest advances and group the SOTA fusion methods into five categories: Encoder-Decoder Methods, Attention Mechanism Methods, GNN Methods, GenNN Methods, and other Constraint-based Methods, as shown in Figure 5.
—We provide a comprehensive review of deep multimodal data fusion covering various modalities, including Vision + Language, Vision + Other Sensors, and so on. Compared to the existing surveys [2, 4, 45, 46, 121, 128, 227, 235, 243] that usually focus on one single task (such as multimodal object recognition) with one specific combination of two modalities (such as RGB + depth data), this survey owns a broader scope covering various modalities and their corresponding tasks, including multimodal object segmentation, multimodal sentiment analysis, VQA, video captioning, and so on.
—We explore the new trends of deep multimodal data fusion, and compare and contrast SOTA models. Some outdated methods, such as deep belief networks, are excluded from this review. However, the large pre-trained models, which are rising stars of deep learning, are included, e.g., Transformer-based pre-trained models.
The rest of this article is organized as follows. Section 2 introduces Encoder-Decoder-based fusion methods, in which the methods are grouped into three sub-classes. Section 3 presents the SOTA attention mechanisms used in multimodal data fusion. In this section, the large pre-trained
Fig. 6. The general structure of the Encoder-Decoder method to fuse multimodal data. The input data of each encoder can be the raw data of each modality or the features of each modality. The encoders can be independent or share weights. The decoder can contain upsampling or downsampling operations, depending on the specific task.
models are introduced. In Section 4, we introduce GNN-based methods. In Section 5, we introduce GenNN-based methods, in which two main roles of GenNN-based methods in multimodal tasks are presented. Section 6 presents the other constraints adopted in SOTA deep multimodal models, such as Tensor-based Fusion. In Section 7, the current notable tasks, applications, and datasets in multimodal data fusion are introduced. Sections 8 and 9 discuss the future directions of multimodal data fusion and conclude this survey.
2 ENCODER-DECODER-BASED FUSION
The Encoder-Decoder architecture has been successfully adopted in single-modal tasks such as image segmentation, language translation, data reduction, and denoising. In such an architecture, the entire network can be divided into two major parts: the encoder part and the decoder part. The encoder usually works as a high-level feature extractor, which projects the input data into a latent space with relatively lower dimensions compared to the original input data. In other words, the input data are transformed into a latent representation by the encoder. During this process, the important semantic information of the input data is preserved, while the noise in the input data is removed. After the encoding process, the decoder generates a "prediction" from the latent representation of the input data. For example, in a semantic segmentation task, the expected output of the decoder can be a semantic segmentation map with the same resolution as the input data. In a seq-2-seq language translation task, the output can be the expected sequence in the target language. In data denoising tasks, most works use a decoder to reconstruct the raw input data.
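The encode-then-decode pattern described above can be made concrete with a minimal, untrained NumPy autoencoder. The dimensions and tanh activation are hypothetical choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder weights: project a 16-d input down to a 4-d latent representation.
W_enc = rng.standard_normal((4, 16))
# Decoder weights: map the 4-d latent code back to the 16-d input space.
W_dec = rng.standard_normal((16, 4))

def encoder(x):
    return np.tanh(W_enc @ x)   # latent code z (lower-dimensional)

def decoder(z):
    return W_dec @ z            # reconstruction / "prediction"

x = rng.standard_normal(16)
z = encoder(x)
x_hat = decoder(z)
print(z.shape, x_hat.shape)  # (4,) (16,)
```

A denoising setup would feed a corrupted x into the encoder and train both weight matrices so that x_hat matches the clean input; a segmentation or translation model swaps the reconstruction head for a task-specific decoder.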
Owing to the strong representation learning ability and good architectural flexibility of Encoder-Decoder models, the Encoder-Decoder has been adopted in more and more deep multimodal data fusion models in recent years. Based on the differences in modalities and tasks, the architectures of multimodal data fusion models vary from each other widely. In this survey, we summarize the general idea of the Encoder-Decoder fusion methods and discard some of the task-specific fusion strategies that cannot be generalized. The general structure of the Encoder-Decoder fusion is shown in Figure 6. As we can see, the high-level features obtained from different individual modalities are projected into a latent space. Then, the task-specific decoder generates the prediction from the learned latent representation of the input multimodal data. In real scenarios, there exist plenty of variations of this structure. We categorize them into three sub-classes: raw-data-level fusion, hierarchical feature fusion, and decision-level fusion.
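The general structure in Figure 6 (per-modality encoders, a shared latent space, a task-specific decoder) can be sketched as follows. Shapes are hypothetical, and the fusion here is a simple concatenation of latent codes, which is only one of many possible choices:

```python
import numpy as np

rng = np.random.default_rng(0)
# One encoder per modality, each projecting into the latent space.
W_img = rng.standard_normal((4, 32))   # image-feature encoder
W_txt = rng.standard_normal((4, 10))   # text-feature encoder
W_dec = rng.standard_normal((2, 8))    # task-specific decoder head

def fuse_and_decode(img_feat, txt_feat):
    z_img = np.tanh(W_img @ img_feat)      # latent code, modality 1
    z_txt = np.tanh(W_txt @ txt_feat)      # latent code, modality 2
    z = np.concatenate([z_img, z_txt])     # joint latent representation
    return W_dec @ z                       # decoder output / prediction

pred = fuse_and_decode(rng.standard_normal(32), rng.standard_normal(10))
print(pred.shape)  # (2,)
```

The three sub-classes below differ mainly in where along this pipeline the concatenation (or other fusion operation) happens.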
Fig. 7. Visualizations of different methods.
2.1 Raw-data-level Fusion
In this fusion, the raw data of each modality, or the data obtained from the independent pre-processing of each modality, are integrated at the input level. Then, the formed input vector of the multiple modalities is sent to one encoder for extracting high-level features. The data from individual modalities are fused at a low level (e.g., the input level), and only one encoder is applied to extract the high-level features of the multimodal data. For example, for the image segmentation task, Couprie et al. [27] propose the first deep learning-based multimodal fusion model. In this work, the authors fuse the multimodal data via a concatenation operation, in which the RGB image and the depth image are concatenated along the channel axis. Similarly, Liu et al. [109] concatenate the RGB image and depth image together. The authors utilize depth information to assist color information in detecting salient objects with a lower computational cost compared to the double-stream network, which consists of two separate sub-networks dealing with RGB data and depth data, respectively. The key advantages of this fusion are that (1) it can maximally preserve the original information of each modality, and (2) the desi
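The channel-axis concatenation used by these raw-data-level methods can be sketched as follows (hypothetical 32×32 inputs in channels-first layout; this is only the fusion step, not the exact pre-processing of the cited works):

```python
import numpy as np

# Hypothetical inputs: a 3-channel RGB image and a 1-channel depth map.
rgb = np.random.rand(3, 32, 32).astype(np.float32)
depth = np.random.rand(1, 32, 32).astype(np.float32)

# Raw-data-level fusion: stack along the channel axis, producing a single
# 4-channel tensor that one shared encoder then consumes.
fused = np.concatenate([rgb, depth], axis=0)
print(fused.shape)  # (4, 32, 32)
```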