
HIGH-FIDELITY IMAGE AND VIDEO EDITING WITH GENERATIVE MODELS

by

CHENYANG QI

A Thesis Submitted to

The Hong Kong University of Science and Technology

in Partial Fulfillment of the Requirements for

the Degree of Doctor of Philosophy

in Computer Science and Engineering

July 2024, Hong Kong

Copyright © by Chenyang Qi 2024

HKUST Library

Reproduction is prohibited without the author's prior written consent


Authorization

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

CHENYANG QI

24 July 2024


HIGH-FIDELITY IMAGE AND VIDEO EDITING WITH GENERATIVE MODELS

by

CHENYANG QI

This is to certify that I have examined the above Ph.D. thesis

and have found that it is complete and satisfactory in all respects,

and that any and all revisions required by

the thesis examination committee have been made.

Prof. Qifeng Chen, Thesis Supervisor

Prof. Xiaofang Zhou, Head of Department

Department of Computer Science and Engineering

24 July 2024


ACKNOWLEDGMENTS

It would have been impossible to complete my wonderful Ph.D. journey without the help of so many people.

First of all, I would like to express my gratitude to my advisor, Professor Qifeng Chen, for his patience, support, and encouragement. I still remember our first meeting four years ago. Although I had almost no experience in computer vision at that time, Professor Chen believed in my potential and gave me this invaluable opportunity to pursue knowledge at HKUST. Over these four years, he has provided kind guidance to me in idea brainstorming, technical design, result presentation, and career planning.

Secondly, I would like to thank my mentors during my internships: Xiaodong Cun, Yong Zhang, Xintao Wang, and Ying Shan at Tencent AI Lab; Zhengzhong Tu, Keren Ye, Hossein Talebi, Mauricio Delbracio, and Peyman Milanfar at Google Research; Bo Zhang, Dong Chen, and Fang Wen at Microsoft Research; and Taesung Park and Jimei Yang at Adobe. They taught me practical skills for solving real-world problems and bridging the gap between academia and industry.

Next, I would like to thank my labmates in the HKUST Visual Intelligence Lab, especially my collaborators Chenyang Lei, Jiaxin Xie, Xin Yang, Ka Leong Cheng, Yue Ma, Liya Ji, Junming Chen, Na Fan, and Zian Qian. We have helped each other in our research, and I have learned a lot from their insights. Also, thanks to Yue Wu, Qiang Wen, Tengfei Wang, Yingqing He, Yazhou Xing, Guotao Meng, Zifan Shi, Maosheng Ye, Yueqi Xie, and all other labmates. It has been a joyful time being friends and partners with you.

Further, I would like to express my sincere gratitude to Prof. Yincong Chen, Prof. Dan Xu, Prof. Xiaomeng Li, Prof. Chi-Ying Tsui, Prof. Ling Shi, Prof. Chiew Lan Tai, and Prof. Yinqiang Zheng, who served on the qualifying examination committee and the thesis committee of my Ph.D. program at HKUST.

Last but not least, I appreciate the endless support from my family and my girlfriend. Your encouragement has given me the power to face the difficulties in my research. My girlfriend Xilin Zhang has also helped in revising my drafts before almost every deadline.

Thanks to everyone who has offered their kind support and help in my academic journey!


TABLE OF CONTENTS

Title Page i
Authorization Page ii
Signature Page iii
Acknowledgments iv
Table of Contents v
List of Figures viii
List of Tables xii
Abstract xiv

Chapter 1 Introduction 1
1.1 Background 1
1.2 Dissertation Overview 4

Chapter 2 Thumbnail Rescaling Using Quantized Autoencoder 6
2.1 Introduction 6
2.2 Related Work 10
2.2.1 Image Super-resolution 10
2.2.2 Image Rescaling 10
2.2.3 Image Compression 11
2.3 Method 11
2.3.1 JPEG Preliminary 11
2.3.2 Overview of HyperThumbnail 13
2.3.3 Quantization Prediction Module 13
2.3.4 Frequency-aware Decoder 14
2.3.5 Training Objectives 14


2.4 Experiments 16
2.4.1 Implementation Details 16
2.4.2 Experimental Setup 18
2.4.3 Compare with Baselines 19
2.4.4 Additional Qualitative Results 23
2.4.5 Real-time Inference on 6K Images 27
2.4.6 Extension for Optimization-based Rescaling 27
2.5 Ablation Study 28
2.6 Conclusion 30

Chapter 3 Text-driven Image Restoration via Diffusion Priors 31
3.1 Introduction 32
3.2 Related Work 34
3.3 Method 36
3.3.1 Preliminaries 37
3.3.2 Text-driven Image Restoration 37
3.3.3 Decoupling Semantic and Restoration Prompts 38
3.3.4 Learning to Control the Restoration 40
3.4 Experiments 43
3.4.1 Text-based Training Data and Benchmarks 43
3.4.2 Comparing with Baselines 44
3.4.3 Prompting the SPIRE 46
3.5 Ablation Study 47
3.6 Conclusion 47

Chapter 4 Text-driven Video Editing Using Diffusion Priors 52
4.1 Introduction 53
4.2 Related Work 55
4.3 Methods 56
4.3.1 Preliminary: Latent Diffusion and Inversion 57
4.3.2 FateZero Video Editing 59
4.3.3 Shape-Aware Video Editing 62
4.4 Experiments 62


4.4.1 Implementation Details 62
4.4.2 Pseudo Algorithm Code 63
4.4.3 Applications 64
4.4.4 Baseline Comparisons 66
4.4.5 Ablation Studies 68
4.5 Conclusion 69

Chapter 5 Conclusion and Discussion 72

References 74

Appendix A List of Publications 91


LIST OF FIGURES

1.1 The traditional paradigm [107,146] (a) of visual editing first applies degradation operators to training data x to synthesize conditions y, such as low-resolution images, segmentation maps, or sketch maps. Although this method is straightforward, it faces difficulties in collecting open-domain paired training data and in designing a flexible framework that unifies all translation tasks. (b) We propose a new paradigm that utilizes pretrained generative models conditioned on editing instructions to adapt flexibly to various editing tasks. 3

2.1 The application of 6K image rescaling in the context of cloud photo storage on smartphones (e.g., iCloud). As more high-resolution (HR) images are uploaded to cloud storage nowadays, challenges are brought to cloud service providers (CSPs) in fulfilling latency-sensitive image reading requests (e.g., zoom-in) through the internet. To facilitate faster transmission and high-quality visual content, our HyperThumbnail framework helps CSPs encode an HR image into an LR JPEG thumbnail, which users can cache locally. When the internet is unstable or unavailable, our method can still reconstruct a high-fidelity HR image from the JPEG thumbnail in real time. 7

2.2 The overview of our approach. Given an HR input image x, we first encode x to its LR representation y with the encoder E, where the scaling factor is s. Second, we transform y to DCT coefficients C and predict the quantization tables QL, QC with our quantization prediction module (QPM) to estimate the bitrate of the quantized coefficients C at the training stage. After rounding and truncation, which we denote as [·], the [QL], [QC], and [C] can be written and read with an off-the-shelf JPEG API at the testing stage. To restore the HR image, we extract features from C with a frequency feature extractor f and produce the high-fidelity image with the decoder D. 12
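For intuition, the caption above already pins down the data flow: downscale, block DCT, predicted quantization tables, rounding, and a learned upscaling decoder. The following is a minimal, self-contained PyTorch sketch of such a pipeline. The module sizes, the single shared quantization table (the thesis predicts separate luma and chroma tables QL and QC), the straight-through rounding trick, and the omission of entropy coding and JPEG serialization are all simplifying assumptions for illustration, not the HyperThumbnail architecture.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def dct_basis(n: int = 8) -> torch.Tensor:
    """Orthonormal DCT-II basis, as in JPEG's 8x8 block transform."""
    k = torch.arange(n, dtype=torch.float32)
    basis = math.sqrt(2.0 / n) * torch.cos(
        (2 * k[None, :] + 1) * k[:, None] * math.pi / (2 * n))
    basis[0] /= math.sqrt(2.0)
    return basis

def block_dct(img: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Split (B, C, H, W) into 8x8 blocks and apply the 2D DCT per block."""
    blocks = img.unfold(2, 8, 8).unfold(3, 8, 8)   # (B, C, H/8, W/8, 8, 8)
    return d @ blocks @ d.t()

def block_idct(coef: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    blocks = d.t() @ coef @ d
    b, c, nh, nw, _, _ = blocks.shape
    return blocks.permute(0, 1, 2, 4, 3, 5).reshape(b, c, nh * 8, nw * 8)

class RescalerSketch(nn.Module):
    def __init__(self, scale: int = 4):
        super().__init__()
        self.scale = scale
        self.encoder = nn.Sequential(               # HR -> LR thumbnail y
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))
        self.qpm = nn.Sequential(                   # predicts one 8x8 table
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64))
        self.decoder = nn.Sequential(               # thumbnail -> HR restore
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3 * scale ** 2, 3, padding=1), nn.PixelShuffle(scale))
        self.register_buffer("dct", dct_basis(8))

    def forward(self, x: torch.Tensor):
        y = self.encoder(F.interpolate(x, scale_factor=1 / self.scale,
                                       mode="bilinear"))
        q = F.softplus(self.qpm(y)).view(-1, 1, 1, 1, 8, 8) + 1.0  # entries >= 1
        coef = block_dct(y, self.dct) / q
        coef = coef + (coef.round() - coef).detach()  # straight-through rounding
        y_hat = block_idct(coef * q, self.dct)        # the decodable LR thumbnail
        return y_hat, self.decoder(y_hat)

x = torch.rand(1, 3, 256, 256)                        # toy HR input
thumb, restored = RescalerSketch()(x)
print(thumb.shape, restored.shape)                    # (1,3,64,64), (1,3,256,256)
```

Because the rounding is differentiable in the backward pass, the encoder, QPM, and decoder can be trained jointly against reconstruction and rate objectives, which is the property the caption's training-stage bitrate estimate relies on.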

2.3 Reconstructed HR images and LR thumbnails by different methods on the DIV2K [6] validation dataset. We crop the restored HR images to ease the comparison and visualize the LR counterparts at the bottom-right. The bpp is calculated on the whole image, and the PSNR is evaluated on the cropped area of the reconstructed HR images. 17

2.4 Downscaled LR thumbnails by different methods on the Set14 image comic. With a similar target bpp, our model introduces the fewest artifacts in the thumbnail in comparison to baselines. 20

2.5 Model runtime. We profile the 4× encoder and decoder at different target resolutions in half-precision mode. In particular, we convert our decoder from PyTorch to TensorRT for further inference time reduction. 21


2.6 The rate-HR-distortion curve on the Kodak [1] dataset. Our method (s = 2, 4) outperforms JPEG and IRN [153] in RD performance. For the 'QPM + JPEG' curve, where s = 1, we follow the standard JPEG algorithm and adopt the QPM module as a plugin for table prediction. 23

2.7 Visual results of performing 4× rescaling on the DIV2K [6] and FiveK [18] datasets with baseline methods and our models. The images are cropped to ease the comparison. Please zoom in for details. 25

2.8 More results of 4× rescaling with our framework on real-world 6K images [18]. Please zoom in for details. Note that the images here are compressed due to the file size limit of the camera-ready version. 26

2.9 Quantization tables on Kodak [1] images. We visualize the quantization table QL (the green table) and QC (the orange table) for kodim04 and kodim09 under different quantization approaches. The model trained with QPM achieves the best RD performance in every aspect. For more analysis, please refer to Sec. 2.5 of this chapter. 27

2.10 QPM versus image-invariant quantization. We first train our models with QPM, with a fixed JPEG table, or with an optimized table, respectively. Then, we evaluate the models at different target bitrates on the Kodak [1] dataset. (a) The RD curve between the reconstructed HR image and the input x; (b) the RD curve between the LR thumbnail and the bicubic-downsampled LR y_ref. 29

2.11 Guidance loss ablation on the Kodak [1] image kodim17. We visualize the HR images with their LR counterparts at the bottom-right. (b) and (c) are produced by 4× HyperThumbnail models trained with different λ1, and the bpp is 0.4. 29

3.1 We present SPIRE: Semantic Prompt-Driven Image Restoration, a text-based foundation model for all-in-one, instructed image restoration. SPIRE allows users to flexibly leverage either a semantic-level content prompt, a degradation-aware restoration prompt, or both, to obtain their desired enhancement results based on personal preferences. In other words, SPIRE can be easily prompted to conduct blind restoration, semantic restoration, or task-specific granular treatment. Our framework also enables a new paradigm of instruction-based image restoration, providing a reliable evaluation benchmark to facilitate vision-language models for low-level computational photography applications. 31

3.2 Framework of SPIRE. In the training phase, we begin by synthesizing a degraded version y of a clean image x. Our degradation synthesis pipeline also creates a restoration prompt c_r, which contains numeric parameters that reflect the intensity of the introduced degradation. Then, we inject the synthetic restoration prompt into a ControlNet adaptor, which uses our proposed modulation fusion blocks (γ, β) to connect with the frozen backbone driven by the semantic prompt c_s. During test time, users can employ the SPIRE framework either as a blind restoration model with the restoration prompt "Remove all degradation" and an empty semantic prompt ∅, or manually adjust the restoration prompt c_r and semantic prompt c_s to obtain what they ask for. 35
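As a concrete illustration of the modulation fusion blocks (γ, β) named above, the sketch below shows one plausible FiLM-style block: the adaptor feature, which carries the restoration prompt c_r, predicts a per-channel scale γ and shift β applied to a frozen-backbone feature. The layer shapes and the zero initialization (in the spirit of ControlNet) are assumptions, not the exact SPIRE implementation.

```python
import torch
import torch.nn as nn

class ModulationFusionBlock(nn.Module):
    """FiLM-style fusion: adaptor features predict a per-channel scale (gamma)
    and shift (beta) that modulate a frozen-backbone feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_gamma = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_beta = nn.Conv2d(channels, channels, kernel_size=1)
        # Zero-init so the block is an identity at step 0 and the frozen
        # backbone's behavior is untouched when training starts.
        for proj in (self.to_gamma, self.to_beta):
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, backbone_feat: torch.Tensor,
                adaptor_feat: torch.Tensor) -> torch.Tensor:
        gamma = self.to_gamma(adaptor_feat)   # per-channel, per-location scale
        beta = self.to_beta(adaptor_feat)     # per-channel, per-location shift
        return backbone_feat * (1.0 + gamma) + beta

# Toy usage on one skip-connection feature map of the frozen U-Net.
fuse = ModulationFusionBlock(channels=64)
f_skip = torch.randn(1, 64, 32, 32)    # frozen backbone skip feature
f_adapt = torch.randn(1, 64, 32, 32)   # adaptor feature carrying c_r
assert torch.equal(fuse(f_skip, f_adapt), f_skip)  # identity at init
```

The identity-at-initialization property is what lets the adaptor be bolted onto a pretrained diffusion backbone without degrading it before training begins.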


3.3 Degradation ambiguities in real-world problems. By adjusting the restoration prompt, our method can preserve the motion effect that is coupled with the added Gaussian blur, while fully blind restoration models do not provide this level of flexibility. 42

3.4 Prompt-space walking visualization for the restoration prompt. Given the same degraded input (upper left) and an empty semantic prompt ∅, our method can decouple the restoration direction and strength by only prompting the quantitative number in natural language. An interesting finding is that our model learns a continuous range of restoration strengths from discrete language tokens. 49

3.5 Restoration prompting for out-of-domain images. 49

3.6 Visual comparison with other baselines. Our method of integrating both the semantic prompt c_s and the restoration prompt c_r outperforms image-to-image restoration (DiffBIR, retrained ControlNet-SR) and the naive zero-shot combination with a semantic prompt. It achieves sharper and cleaner results while maintaining consistency with the degraded image. 50

3.7 Test-time semantic prompting. Our framework restores degraded images guided by flexible semantic prompts, while unrelated background elements and global tones remain aligned with the degraded input conditioning. In addition, we show more semantic prompting for images with multiple objects. 50

3.8 Main visual comparison with baselines. (Zoom in for details.) 51

4.1 Zero-shot text-driven video editing. We present a zero-shot approach for shape-aware local object editing and video style editing from pre-trained diffusion models [150,117] without any optimization for each target prompt. 53

4.2 The overview of our approach. Our input is the user-provided source prompt p_src, target prompt p_edit, and clean latent z = {z^1, z^2, ..., z^n} encoded from the input source video x = {x^1, x^2, ..., x^n} with n frames in a video sequence. On the left, we first invert the video into the noisy latent z_T using the DDIM inversion pipeline with the source prompt p_src and an inflated 3D U-Net ε_θ. During each inversion time step t, we store both the spatial-temporal self-attention maps s_t^src and the cross-attention maps c_t^src. At the editing stage of the DDIM denoising, we denoise the latent z_T back to the clean image z_0 conditioned on the target prompt p_edit. At each denoising time step t, we fuse the attention maps (s_t^edit and c_t^edit) in ε_θ with the stored attention maps (s_t^src, c_t^src) using the proposed Attention Blending Block. Right: specifically, we replace the cross-attention maps c_t^edit of unedited words (e.g., road and countryside) with their source maps c_t^src. In addition, we blend the self-attention maps during inversion s_t^src and editing s_t^edit with an adaptive spatial mask obtained from the cross-attention maps c_t^src of edited words (e.g., silver and jeep), which represents the areas that the user wants to edit. 57
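The blending logic in this caption reduces to a few tensor operations. Below is a minimal sketch written from the caption alone; the tensor shapes, the threshold tau, and the helper name blend_attention are illustrative assumptions rather than the thesis code.

```python
import torch

def blend_attention(
    cross_edit: torch.Tensor,  # (heads, hw, n_tokens), editing stage
    cross_src: torch.Tensor,   # (heads, hw, n_tokens), stored at inversion
    self_edit: torch.Tensor,   # (heads, hw, hw), editing stage
    self_src: torch.Tensor,    # (heads, hw, hw), stored at inversion
    edited_tokens: list,       # indices of edited words, e.g. "silver", "jeep"
    tau: float = 0.3,          # mask threshold (assumed)
):
    # 1) Unedited words reuse their source cross-attention maps, preserving
    #    the original layout and motion; edited words keep the new maps.
    n_tokens = cross_src.shape[-1]
    unedited = [i for i in range(n_tokens) if i not in edited_tokens]
    cross_fused = cross_edit.clone()
    cross_fused[..., unedited] = cross_src[..., unedited]

    # 2) Adaptive spatial mask from the source cross-attention of the edited
    #    words: high response marks where the user wants the change.
    heat = cross_src[..., edited_tokens].mean(dim=(0, 2))        # (hw,)
    mask = (heat / (heat.max() + 1e-8) > tau).float()

    # 3) Use the editing self-attention inside the mask (new shape) and the
    #    inversion self-attention outside it (original background/details).
    mask_q = mask.view(1, -1, 1)                                 # per query row
    self_fused = mask_q * self_edit + (1.0 - mask_q) * self_src
    return cross_fused, self_fused

heads, hw, n_tok = 8, 16 * 16, 10
cf, sf = blend_attention(torch.rand(heads, hw, n_tok), torch.rand(heads, hw, n_tok),
                         torch.rand(heads, hw, hw), torch.rand(heads, hw, hw),
                         edited_tokens=[4, 5])
print(cf.shape, sf.shape)
```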

4.3 Zero-shot local attribute editing (cat → tiger) using Stable Diffusion. In contrast to fusion with attention during reconstruction (a) in previous work [49,136,108], our inversion attention fusion (b) provides more accurate structure guidance and editing ability, as visualized on the right side. 58


4.4 Study of blended self-attention in zero-shot shape editing (rabbit → tiger) using Stable Diffusion. Fourth and fifth columns: ignoring self-attention cannot preserve the original structure and background, and naive replacement causes artifacts. Third column: blending the self-attention using the cross-attention map (the second row) obtains both the new shape from the target text with a similar pose and the background from the input frame. 59

4.5 Zero-shot object shape editing on a pre-trained video diffusion model [150]: our framework can directly edit the shape of the object in videos driven by text prompts using a trained video diffusion model [150]. 62

4.6 Zero-shot attribute and style editing results using Stable Diffusion [117]. Our framework supports abstract attribute and style editing like 'Swarovski crystal', 'Ukiyo-e', and 'Makoto Shinkai'. Best viewed with zoom-in. 63

4.7 Qualitative comparison of our methods with other baselines. Inputs are in Fig. 4.5 and Fig. 4.8. Our results have the best temporal consistency, image fidelity, and editing quality. Best viewed with zoom-in. 64

4.8 Application of latent blending. Extending our attention blending strategy to the high-resolution latents, our framework can preserve the accurate low-level color and texture of the input. 65

4.9 Inversion attention compared with reconstruction attention using the prompt 'deserted shore' → 'glacier shore'. The attention maps obtained from the reconstruction stage fail to detect the boat's position and cannot provide suitable motion guidance for zero-shot video editing. 67

4.10 Ablation study of blended self-attention. Without self-attention fusion, the generated video cannot preserve the details of the input videos (e.g., fence, trees, and car identity). If we replace the full self-attention without a spatial mask, the structure of the original jeep misleads the generation of the Porsche car. 69


LIST OF TABLES

1.1 The comparison of different generative models. 2

2.1 The comparison of different methods related to image rescaling. (a) Super-resolution from a downsampled JPEG does not optimize rate-distortion performance and can hardly maintain high fidelity due to the information lost in downsampling. (b) SOTA flow-based image rescaling methods also ignore the file size constraints and are not real-time for 6K reconstruction due to the limited speed of invertible networks. (c) Our framework optimizes rate-distortion performance while maintaining high-fidelity and real-time 6K image rescaling. 8

2.2 Quantitative evaluation of upscaling efficiency and reconstruction fidelity. We keep the bpp around 0.3 on Kodak [1] for different methods, and the distortion is measured by the PSNR of the reconstructed HR images. Our approach outperforms other methods with better HR reconstruction and a significantly lower runtime. We measure the running time and GMacs of all models by upscaling a 960×540 LR image to a 3840×2160 HR image. The measurements are made on an Nvidia RTX 3090 GPU with PyTorch 1.11.0 in half-precision mode for a fair comparison. 16

2.3 Architectures of our encoder. 17

2.4 Architectures of our efficient decoder. 18

2.5 Quantitative evaluation of the 4× downsampled LR thumbnails by different methods. The target bitrate is around 0.3 bpp on Kodak [1] for all methods, and we take the bicubic LR as the ground truth. Our thumbnail preserves visual contents better. 21

2.6 Comparison of our HyperThumbnail framework against learned compression with a JPEG thumbnail. In the additional baseline, we provide a JPEG thumbnail alongside the learned-compression bitstream, and take the sum of the bitstream size and the JPEG size to calculate the final bpp. Our framework has better rate-distortion performance than the 'Compression + JPEG' baseline. 22
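For concreteness, the final bpp of the 'Compression + JPEG' baseline is simply the total payload in bits divided by the HR pixel count. A toy computation with made-up byte counts:

```python
# Toy bpp computation for the "Compression + JPEG" baseline: both payloads are
# summed and normalized by the HR pixel count. Byte counts are hypothetical.
bitstream_bytes = 90_000          # learned-compression latent (made up)
jpeg_bytes = 30_000               # LR JPEG thumbnail (made up)
height, width = 2160, 3840        # HR resolution
bpp = 8 * (bitstream_bytes + jpeg_bytes) / (height * width)
print(f"{bpp:.3f} bpp")           # -> 0.116 bpp
```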

2.7 Ablation study of our encoder-decoder architectures on the downsampling/upsampling time and the PSNR of the reconstructed HR image / LR thumbnail. 24

2.8 Quantitative evaluation for optimization-based rescaling. 27

2.9 HR reconstruction PSNR with different decoder capacity. 30

3.1 Quantitative results on the MS-COCO dataset (with c_s) using our parameterized degradation (left) and the Real-ESRGAN degradation (right). We also denote the prompt choice at test time: 'Sem' stands for semantic prompt; 'Res' stands for restoration prompt. The first group of baselines is tested without prompts; the second group is combined with the semantic prompt in a zero-shot way. 42


3.2 Our training degradation is randomly sampled from these two pipelines with 50% probability each. (1) Degraded images y synthesized by Real-ESRGAN are paired with the same restoration prompt c_r = "Remove all degradation". (2) In the other 50% of iterations, images generated by our parameterized pipeline are paired with either a restoration type prompt (e.g., "Deblur") or a restoration parameter prompt (e.g., "Deblur with sigma 0.3;").
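The 50/50 sampling described here is easy to mirror in code. The toy sketch below uses the prompts quoted in the caption; the sigma range and the helper name sample_training_pair are hypothetical placeholders:

```python
# Toy re-creation of the training-prompt sampling in Table 3.2. Only the two
# quoted prompts come from the caption; everything else is a placeholder.
import random

def sample_training_pair():
    if random.random() < 0.5:
        # Real-ESRGAN degradations are always paired with the blind prompt.
        return "real_esrgan", "Remove all degradation"
    # Parameterized pipeline: either a type prompt or a parameter prompt.
    sigma = round(random.uniform(0.1, 1.0), 1)
    return "parameterized", random.choice(
        ["Deblur", f"Deblur with sigma {sigma};"])

print(sample_training_pair())
```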

3.3 Numerical results on the DIV2K test set without any prompt.

3.4 Ablation of the architecture and the degradation strength in c_r.

3.5 Ablation of prompts provided during both training and testing. We use an image-to-image model with our modulation fusion layer as our baseline. Providing semantic prompts significantly increases the image quality (1.9 lower FID) and semantic similarity (0.002 higher CLIP-Image), but results in worse pixel-level similarity. In contrast, the degradation type information embedded in restoration prompts improves both pixel-level fidelity and image quality. Utilizing degradation parameters in the restoration instructions further improves these metrics.

3.6 Ablation of the architecture. Modulating the skip feature f_skip improves the fidelity of the restored image with 3% extra parameters in the adaptor, while further modulating the backbone features f_up does not bring an obvious advantage.

4.1 Quantitative evaluation against baselines. In our user study, the results of our method are preferred over those from the baselines. For CLIP-Score, we achieve the best temporal consistency and comparable frame-wise editing accuracy.
