
HIGH-FIDELITY IMAGE AND VIDEO EDITING WITH GENERATIVE MODELS

by

CHENYANG QI

A Thesis Submitted to

The Hong Kong University of Science and Technology

in Partial Fulfillment of the Requirements for

the Degree of Doctor of Philosophy

in Computer Science and Engineering

July 2024, Hong Kong

Copyright © by Chenyang Qi 2024

HKUST Library

Reproduction is prohibited without the author's prior written consent


Authorization

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

CHENYANG QI

24 July 2024


HIGH-FIDELITY IMAGE AND VIDEO EDITING WITH GENERATIVE MODELS

by

CHENYANG QI

This is to certify that I have examined the above Ph.D. thesis

and have found that it is complete and satisfactory in all respects,

and that any and all revisions required by

the thesis examination committee have been made.

Prof. Qifeng Chen, Thesis Supervisor

Prof. Xiaofang Zhou, Head of Department

Department of Computer Science and Engineering

24 July 2024


ACKNOWLEDGMENTS

It would have been impossible to complete my wonderful Ph.D. journey without the help of so many people.

First of all, I would like to express my gratitude to my advisor, Professor Qifeng Chen, for his patience, support, and encouragement. I still remember our first meeting four years ago. Although I had almost no experience in computer vision at that time, Professor Chen believed in my potential and gave me this invaluable opportunity to pursue knowledge at HKUST. Over these four years, he has provided kind guidance to me in idea brainstorming, technical design, result presentation, and career planning.

Secondly, I would like to thank my mentors during my internships: Xiaodong Cun, Yong Zhang, Xintao Wang, and Ying Shan at Tencent AI Lab; Zhengzhong Tu, Keren Ye, Hossein Talebi, Mauricio Delbracio, and Peyman Milanfar at Google Research; Bo Zhang, Dong Chen, and Fang Wen at Microsoft Research; and Taesung Park and Jimei Yang at Adobe. They taught me practical skills for solving real-world problems and bridging the gap between academia and industry.

Next, I would like to thank my labmates in the HKUST Visual Intelligence Lab, especially my collaborators Chenyang Lei, Jiaxin Xie, Xin Yang, Ka Leong Cheng, Yue Ma, Liya Ji, Junming Chen, Na Fan, and Zian Qian. We have helped each other in our research, and I have learned a lot from their insights. Also, thanks to Yue Wu, Qiang Wen, Tengfei Wang, Yingqing He, Yazhou Xing, Guotao Meng, Zifan Shi, Maosheng Ye, Yueqi Xie, and all other labmates. It has been a joyful time being friends and partners with you.

Further, I would like to express my sincere gratitude to Prof. Yincong Chen, Prof. Dan Xu, Prof. Xiaomeng Li, Prof. Chi-Ying Tsui, Prof. Ling Shi, Prof. Chiew Lan Tai, and Prof. Yinqiang Zheng, who served on the qualifying examination committee and the thesis committee of my Ph.D. program at HKUST.

Last but not least, I appreciate the endless support from my family and my girlfriend. Your encouragement has given me the power to face the difficulties in my research. My girlfriend Xilin Zhang has also helped in revising my drafts before almost every deadline.

Thanks to everyone who has offered their kind support and help in my academic journey!


TABLE OF CONTENTS

Title Page i
Authorization Page ii
Signature Page iii
Acknowledgments iv
Table of Contents v
List of Figures viii
List of Tables xii
Abstract xiv

Chapter 1 Introduction 1
1.1 Background 1
1.2 Dissertation Overview 4

Chapter 2 Thumbnail Rescaling Using Quantized Autoencoder 6
2.1 Introduction 6
2.2 Related Work 10
2.2.1 Image Super-resolution 10
2.2.2 Image Rescaling 10
2.2.3 Image Compression 11
2.3 Method 11
2.3.1 JPEG Preliminary 11
2.3.2 Overview of HyperThumbnail 13
2.3.3 Quantization Prediction Module 13
2.3.4 Frequency-aware Decoder 14
2.3.5 Training Objectives 14


2.4 Experiments 16
2.4.1 Implementation Details 16
2.4.2 Experimental Setup 18
2.4.3 Compare with Baselines 19
2.4.4 Additional Qualitative Results 23
2.4.5 Real-time Inference on 6K Images 27
2.4.6 Extension for Optimization-based Rescaling 27
2.5 Ablation Study 28
2.6 Conclusion 30

Chapter 3 Text-driven Image Restoration via Diffusion Priors 31
3.1 Introduction 32
3.2 Related Work 34
3.3 Method 36
3.3.1 Preliminaries 37
3.3.2 Text-driven Image Restoration 37
3.3.3 Decoupling Semantic and Restoration Prompts 38
3.3.4 Learning to Control the Restoration 40
3.4 Experiments 43
3.4.1 Text-based Training Data and Benchmarks 43
3.4.2 Comparing with Baselines 44
3.4.3 Prompting the SPIRE 46
3.5 Ablation Study 47
3.6 Conclusion 47

Chapter 4 Text-driven Video Editing Using Diffusion Priors 52
4.1 Introduction 53
4.2 Related Work 55
4.3 Methods 56
4.3.1 Preliminary: Latent Diffusion and Inversion 57
4.3.2 FateZero Video Editing 59
4.3.3 Shape-Aware Video Editing 62
4.4 Experiments 62


4.4.1 Implementation Details 62
4.4.2 Pseudo Algorithm Code 63
4.4.3 Applications 64
4.4.4 Baseline Comparisons 66
4.4.5 Ablation Studies 68
4.5 Conclusion 69

Chapter 5 Conclusion and Discussion 72

References 74

Appendix A List of Publications 91


LIST OF FIGURES

1.1 The traditional paradigm [107,146] (a) of visual editing first applies degradation operators to training data x to synthesize conditions y, such as low-resolution images, segmentation maps, or sketch maps. Although this method is straightforward, it faces difficulties in collecting open-domain paired training data and in designing a flexible framework that unifies all translation tasks. (b) We propose a new paradigm that utilizes pretrained generative models conditioned on editing instructions to adapt flexibly to various editing tasks. 3

2.1 The application of 6K image rescaling in the context of cloud photo storage on smartphones (e.g., iCloud). As more high-resolution (HR) images are uploaded to cloud storage nowadays, challenges are brought to cloud service providers (CSPs) in fulfilling latency-sensitive image reading requests (e.g., zoom-in) through the internet. To facilitate faster transmission and high-quality visual content, our HyperThumbnail framework helps CSPs encode an HR image into an LR JPEG thumbnail, which users can cache locally. When the internet is unstable or unavailable, our method can still reconstruct a high-fidelity HR image from the JPEG thumbnail in real time. 7

2.2 The overview of our approach. Given an HR input image x, we first encode x to its LR representation y with the encoder E, where the scaling factor is s. Second, we transform y to DCT coefficients C and predict the quantization tables QL, QC with our quantization prediction module (QPM) to estimate the bitrate of the quantized coefficients C at the training stage. After rounding and truncation, which we denote as [·], the [QL], [QC], and [C] can be written and read with an off-the-shelf JPEG API at the testing stage. To restore the HR image, we extract features from C with a frequency feature extractor f and produce the high-fidelity image with the decoder D. 12
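For intuition, the caption above already pins down the data flow: downscale, block DCT, predicted quantization tables, rounding, and a learned upscaling decoder. The following is a minimal, self-contained PyTorch sketch of such a pipeline. The module sizes, the single shared quantization table (the thesis predicts separate luma and chroma tables QL and QC), the straight-through rounding trick, and the omission of entropy coding and JPEG serialization are all simplifying assumptions for illustration, not the HyperThumbnail architecture.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def dct_basis(n: int = 8) -> torch.Tensor:
    """Orthonormal DCT-II basis, as in JPEG's 8x8 block transform."""
    k = torch.arange(n, dtype=torch.float32)
    basis = math.sqrt(2.0 / n) * torch.cos(
        (2 * k[None, :] + 1) * k[:, None] * math.pi / (2 * n))
    basis[0] /= math.sqrt(2.0)
    return basis

def block_dct(img: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Split (B, C, H, W) into 8x8 blocks and apply the 2D DCT per block."""
    blocks = img.unfold(2, 8, 8).unfold(3, 8, 8)   # (B, C, H/8, W/8, 8, 8)
    return d @ blocks @ d.t()

def block_idct(coef: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    blocks = d.t() @ coef @ d
    b, c, nh, nw, _, _ = blocks.shape
    return blocks.permute(0, 1, 2, 4, 3, 5).reshape(b, c, nh * 8, nw * 8)

class RescalerSketch(nn.Module):
    def __init__(self, scale: int = 4):
        super().__init__()
        self.scale = scale
        self.encoder = nn.Sequential(               # HR -> LR thumbnail y
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1))
        self.qpm = nn.Sequential(                   # predicts one 8x8 table
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 64))
        self.decoder = nn.Sequential(               # thumbnail -> HR restore
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3 * scale ** 2, 3, padding=1), nn.PixelShuffle(scale))
        self.register_buffer("dct", dct_basis(8))

    def forward(self, x: torch.Tensor):
        y = self.encoder(F.interpolate(x, scale_factor=1 / self.scale,
                                       mode="bilinear"))
        q = F.softplus(self.qpm(y)).view(-1, 1, 1, 1, 8, 8) + 1.0  # entries >= 1
        coef = block_dct(y, self.dct) / q
        coef = coef + (coef.round() - coef).detach()  # straight-through rounding
        y_hat = block_idct(coef * q, self.dct)        # the decodable LR thumbnail
        return y_hat, self.decoder(y_hat)

x = torch.rand(1, 3, 256, 256)                        # toy HR input
thumb, restored = RescalerSketch()(x)
print(thumb.shape, restored.shape)                    # (1,3,64,64), (1,3,256,256)
```

Because the rounding is differentiable in the backward pass, the encoder, QPM, and decoder can be trained jointly against reconstruction and rate objectives, which is the property the caption's training-stage bitrate estimate relies on.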

2.3 Reconstructed HR images and LR thumbnails by different methods on the DIV2K [6] validation dataset. We crop the restored HR images to ease the comparison and visualize the LR counterparts at the bottom-right. The bpp is calculated on the whole image, and the PSNR is evaluated on the cropped area of the reconstructed HR images. 17

2.4 Downscaled LR thumbnails by different methods on the Set14 image comic. With a similar target bpp, our model introduces the fewest artifacts in the thumbnail in comparison to baselines. 20

2.5 Model runtime. We profile the 4× encoder and decoder at different target resolutions in half-precision mode. In particular, we convert our decoder from PyTorch to TensorRT for further inference time reduction. 21


2.6 The rate-HR-distortion curve on the Kodak [1] dataset. Our method (s = 2, 4) outperforms JPEG and IRN [153] in RD performance. For the 'QPM + JPEG' curve, where s = 1, we follow the standard JPEG algorithm and adopt the QPM module as a plugin for table prediction. 23

2.7 Visual results of performing 4× rescaling on the DIV2K [6] and FiveK [18] datasets with baseline methods and our models. The images are cropped to ease the comparison. Please zoom in for details. 25

2.8 More results of 4× rescaling with our framework on real-world 6K images [18]. Please zoom in for details. Note that the images here are compressed due to the file size limit of the camera-ready version. 26

2.9 Quantization tables on Kodak [1] images. We visualize the quantization table QL (the green table) and QC (the orange table) for kodim04 and kodim09 under different quantization approaches. The model trained with QPM achieves the best RD performance in every aspect. For more analysis, please refer to Sec. 2.5 of this chapter. 27

2.10 QPM versus image-invariant quantization. We first train our models with QPM, with a fixed JPEG table, or with an optimized table, respectively. Then, we evaluate the models at different target bitrates on the Kodak [1] dataset. (a) The RD curve between the reconstructed HR image and the input x; (b) the RD curve between the LR thumbnail and the bicubic-downsampled LR y_ref. 29

2.11 Guidance loss ablation on the Kodak [1] image kodim17. We visualize the HR images with their LR counterparts at the bottom-right. (b) and (c) are produced by 4× HyperThumbnail models trained with different λ1, and the bpp is 0.4. 29

3.1 We present SPIRE: Semantic Prompt-Driven Image Restoration, a text-based foundation model for all-in-one, instructed image restoration. SPIRE allows users to flexibly leverage either a semantic-level content prompt, a degradation-aware restoration prompt, or both, to obtain their desired enhancement results based on personal preferences. In other words, SPIRE can be easily prompted to conduct blind restoration, semantic restoration, or task-specific granular treatment. Our framework also enables a new paradigm of instruction-based image restoration, providing a reliable evaluation benchmark to facilitate vision-language models for low-level computational photography applications. 31

3.2 Framework of SPIRE. In the training phase, we begin by synthesizing a degraded version y of a clean image x. Our degradation synthesis pipeline also creates a restoration prompt c_r, which contains numeric parameters that reflect the intensity of the introduced degradation. Then, we inject the synthetic restoration prompt into a ControlNet adaptor, which uses our proposed modulation fusion blocks (γ, β) to connect with the frozen backbone driven by the semantic prompt c_s. During test time, users can employ the SPIRE framework either as a blind restoration model with the restoration prompt "Remove all degradation" and an empty semantic prompt ∅, or manually adjust the restoration prompt c_r and semantic prompt c_s to obtain what they ask for. 35
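As a concrete illustration of the modulation fusion blocks (γ, β) named above, the sketch below shows one plausible FiLM-style block: the adaptor feature, which carries the restoration prompt c_r, predicts a per-channel scale γ and shift β applied to a frozen-backbone feature. The layer shapes and the zero initialization (in the spirit of ControlNet) are assumptions, not the exact SPIRE implementation.

```python
import torch
import torch.nn as nn

class ModulationFusionBlock(nn.Module):
    """FiLM-style fusion: adaptor features predict a per-channel scale (gamma)
    and shift (beta) that modulate a frozen-backbone feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_gamma = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_beta = nn.Conv2d(channels, channels, kernel_size=1)
        # Zero-init so the block is an identity at step 0 and the frozen
        # backbone's behavior is untouched when training starts.
        for proj in (self.to_gamma, self.to_beta):
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, backbone_feat: torch.Tensor,
                adaptor_feat: torch.Tensor) -> torch.Tensor:
        gamma = self.to_gamma(adaptor_feat)   # per-channel, per-location scale
        beta = self.to_beta(adaptor_feat)     # per-channel, per-location shift
        return backbone_feat * (1.0 + gamma) + beta

# Toy usage on one skip-connection feature map of the frozen U-Net.
fuse = ModulationFusionBlock(channels=64)
f_skip = torch.randn(1, 64, 32, 32)    # frozen backbone skip feature
f_adapt = torch.randn(1, 64, 32, 32)   # adaptor feature carrying c_r
assert torch.equal(fuse(f_skip, f_adapt), f_skip)  # identity at init
```

The identity-at-initialization property is what lets the adaptor be bolted onto a pretrained diffusion backbone without degrading it before training begins.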


3.3 Degradation ambiguities in real-world problems. By adjusting the restoration prompt, our method can preserve the motion effect that is coupled with the added Gaussian blur, while fully blind restoration models do not provide this level of flexibility. 42

3.4 Prompt-space walking visualization for the restoration prompt. Given the same degraded input (upper left) and an empty semantic prompt ∅, our method can decouple the restoration direction and strength by only prompting the quantitative number in natural language. An interesting finding is that our model learns a continuous range of restoration strengths from discrete language tokens. 49

3.5 Restoration prompting for out-of-domain images. 49

3.6 Visual comparison with other baselines. Our method of integrating both the semantic prompt c_s and the restoration prompt c_r outperforms image-to-image restoration (DiffBIR, retrained ControlNet-SR) and the naive zero-shot combination with a semantic prompt. It achieves sharper and cleaner results while maintaining consistency with the degraded image. 50

3.7 Test-time semantic prompting. Our framework restores degraded images guided by flexible semantic prompts, while unrelated background elements and global tones remain aligned with the degraded input conditioning. In addition, we show more semantic prompting for images with multiple objects. 50

3.8 Main visual comparison with baselines. (Zoom in for details.) 51

4.1 Zero-shot text-driven video editing. We present a zero-shot approach for shape-aware local object editing and video style editing from pre-trained diffusion models [150,117] without any optimization for each target prompt. 53

4.2 The overview of our approach. Our input is the user-provided source prompt p_src, target prompt p_edit, and clean latent z = {z^1, z^2, ..., z^n} encoded from the input source video x = {x^1, x^2, ..., x^n} with n frames in a video sequence. On the left, we first invert the video into the noisy latent z_T using the DDIM inversion pipeline with the source prompt p_src and an inflated 3D U-Net ε_θ. During each inversion time step t, we store both the spatial-temporal self-attention maps s_t^src and the cross-attention maps c_t^src. At the editing stage of the DDIM denoising, we denoise the latent z_T back to the clean image z_0 conditioned on the target prompt p_edit. At each denoising time step t, we fuse the attention maps (s_t^edit and c_t^edit) in ε_θ with the stored attention maps (s_t^src, c_t^src) using the proposed Attention Blending Block. Right: specifically, we replace the cross-attention maps c_t^edit of unedited words (e.g., road and countryside) with their source maps c_t^src. In addition, we blend the self-attention maps during inversion s_t^src and editing s_t^edit with an adaptive spatial mask obtained from the cross-attention maps c_t^src of edited words (e.g., silver and jeep), which represents the areas that the user wants to edit. 57
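The blending logic in this caption reduces to a few tensor operations. Below is a minimal sketch written from the caption alone; the tensor shapes, the threshold tau, and the helper name blend_attention are illustrative assumptions rather than the thesis code.

```python
import torch

def blend_attention(
    cross_edit: torch.Tensor,  # (heads, hw, n_tokens), editing stage
    cross_src: torch.Tensor,   # (heads, hw, n_tokens), stored at inversion
    self_edit: torch.Tensor,   # (heads, hw, hw), editing stage
    self_src: torch.Tensor,    # (heads, hw, hw), stored at inversion
    edited_tokens: list,       # indices of edited words, e.g. "silver", "jeep"
    tau: float = 0.3,          # mask threshold (assumed)
):
    # 1) Unedited words reuse their source cross-attention maps, preserving
    #    the original layout and motion; edited words keep the new maps.
    n_tokens = cross_src.shape[-1]
    unedited = [i for i in range(n_tokens) if i not in edited_tokens]
    cross_fused = cross_edit.clone()
    cross_fused[..., unedited] = cross_src[..., unedited]

    # 2) Adaptive spatial mask from the source cross-attention of the edited
    #    words: high response marks where the user wants the change.
    heat = cross_src[..., edited_tokens].mean(dim=(0, 2))        # (hw,)
    mask = (heat / (heat.max() + 1e-8) > tau).float()

    # 3) Use the editing self-attention inside the mask (new shape) and the
    #    inversion self-attention outside it (original background/details).
    mask_q = mask.view(1, -1, 1)                                 # per query row
    self_fused = mask_q * self_edit + (1.0 - mask_q) * self_src
    return cross_fused, self_fused

heads, hw, n_tok = 8, 16 * 16, 10
cf, sf = blend_attention(torch.rand(heads, hw, n_tok), torch.rand(heads, hw, n_tok),
                         torch.rand(heads, hw, hw), torch.rand(heads, hw, hw),
                         edited_tokens=[4, 5])
print(cf.shape, sf.shape)
```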

4.3 Zero-shot local attribute editing (cat → tiger) using Stable Diffusion. In contrast to fusion with attention during reconstruction (a) in previous work [49,136,108], our inversion attention fusion (b) provides more accurate structure guidance and editing ability, as visualized on the right side. 58


4.4 Study of blended self-attention in zero-shot shape editing (rabbit → tiger) using Stable Diffusion. Fourth and fifth columns: ignoring self-attention cannot preserve the original structure and background, and naive replacement causes artifacts. Third column: blending the self-attention using the cross-attention map (the second row) obtains both the new shape from the target text with a similar pose and the background from the input frame. 59

4.5 Zero-shot object shape editing on a pre-trained video diffusion model [150]: our framework can directly edit the shape of the object in videos driven by text prompts using a trained video diffusion model [150]. 62

4.6 Zero-shot attribute and style editing results using Stable Diffusion [117]. Our framework supports abstract attribute and style editing like 'Swarovski crystal', 'Ukiyo-e', and 'Makoto Shinkai'. Best viewed with zoom-in. 63

4.7 Qualitative comparison of our methods with other baselines. Inputs are in Fig. 4.5 and Fig. 4.8. Our results have the best temporal consistency, image fidelity, and editing quality. Best viewed with zoom-in. 64

4.8 Application of latent blending. Extending our attention blending strategy to the high-resolution latents, our framework can preserve the accurate low-level color and texture of the input. 65

4.9 Inversion attention compared with reconstruction attention using the prompt 'deserted shore' → 'glacier shore'. The attention maps obtained from the reconstruction stage fail to detect the boat's position and cannot provide suitable motion guidance for zero-shot video editing. 67

4.10 Ablation study of blended self-attention. Without self-attention fusion, the generated video cannot preserve the details of the input videos (e.g., fence, trees, and car identity). If we replace the full self-attention without a spatial mask, the structure of the original jeep misleads the generation of the Porsche car. 69


LIST OF TABLES

1.1 The comparison of different generative models. 2

2.1 The comparison of different methods related to image rescaling. (a) Super-resolution from a downsampled JPEG does not optimize rate-distortion performance and can hardly maintain high fidelity due to the information lost in downsampling. (b) SOTA flow-based image rescaling methods also ignore the file size constraints and are not real-time for 6K reconstruction due to the limited speed of invertible networks. (c) Our framework optimizes rate-distortion performance while maintaining high-fidelity and real-time 6K image rescaling. 8

2.2 Quantitative evaluation of upscaling efficiency and reconstruction fidelity. We keep the bpp around 0.3 on Kodak [1] for different methods, and the distortion is measured by the PSNR of the reconstructed HR images. Our approach outperforms other methods with better HR reconstruction and a significantly lower runtime. We measure the running time and GMacs of all models by upscaling a 960×540 LR image to a 3840×2160 HR image. The measurements are made on an Nvidia RTX 3090 GPU with PyTorch 1.11.0 in half-precision mode for a fair comparison. 16

2.3 Architectures of our encoder. 17

2.4 Architectures of our efficient decoder. 18

2.5 Quantitative evaluation of the 4× downsampled LR thumbnails by different methods. The target bitrate is around 0.3 bpp on Kodak [1] for all methods, and we take the bicubic LR as the ground truth. Our thumbnail preserves visual contents better. 21

2.6 Comparison of our HyperThumbnail framework against learned compression with a JPEG thumbnail. In the additional baseline, we provide a JPEG thumbnail alongside the learned-compression bitstream, and take the sum of the bitstream size and the JPEG size to calculate the final bpp. Our framework has better rate-distortion performance than the 'Compression + JPEG' baseline. 22
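For concreteness, the final bpp of the 'Compression + JPEG' baseline is simply the total payload in bits divided by the HR pixel count. A toy computation with made-up byte counts:

```python
# Toy bpp computation for the "Compression + JPEG" baseline: both payloads are
# summed and normalized by the HR pixel count. Byte counts are hypothetical.
bitstream_bytes = 90_000          # learned-compression latent (made up)
jpeg_bytes = 30_000               # LR JPEG thumbnail (made up)
height, width = 2160, 3840        # HR resolution
bpp = 8 * (bitstream_bytes + jpeg_bytes) / (height * width)
print(f"{bpp:.3f} bpp")           # -> 0.116 bpp
```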

2.7 Ablation study of our encoder-decoder architectures on the downsampling/upsampling time and the PSNR of the reconstructed HR image / LR thumbnail. 24

2.8 Quantitative evaluation for optimization-based rescaling. 27

2.9 HR reconstruction PSNR with different decoder capacity. 30

3.1 Quantitative results on the MS-COCO dataset (with c_s) using our parameterized degradation (left) and the Real-ESRGAN degradation (right). We also denote the prompt choice at test time: 'Sem' stands for semantic prompt; 'Res' stands for restoration prompt. The first group of baselines is tested without prompts; the second group is combined with the semantic prompt in a zero-shot way. 42


3.2 Our training degradation is randomly sampled from these two pipelines with 50% probability each. (1) Degraded images y synthesized by Real-ESRGAN are paired with the same restoration prompt c_r = "Remove all degradation". (2) In the other 50% of iterations, images generated by our parameterized pipeline are paired with either a restoration type prompt (e.g., "Deblur") or a restoration parameter prompt (e.g., "Deblur with sigma 0.3;").
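The 50/50 sampling described here is easy to mirror in code. The toy sketch below uses the prompts quoted in the caption; the sigma range and the helper name sample_training_pair are hypothetical placeholders:

```python
# Toy re-creation of the training-prompt sampling in Table 3.2. Only the two
# quoted prompts come from the caption; everything else is a placeholder.
import random

def sample_training_pair():
    if random.random() < 0.5:
        # Real-ESRGAN degradations are always paired with the blind prompt.
        return "real_esrgan", "Remove all degradation"
    # Parameterized pipeline: either a type prompt or a parameter prompt.
    sigma = round(random.uniform(0.1, 1.0), 1)
    return "parameterized", random.choice(
        ["Deblur", f"Deblur with sigma {sigma};"])

print(sample_training_pair())
```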

3.3 Numerical results on the DIV2K test set without any prompt.

3.4 Ablation of the architecture and the degradation strength in c_r.

3.5 Ablation of prompts provided during both training and testing. We use an image-to-image model with our modulation fusion layer as our baseline. Providing semantic prompts significantly increases the image quality (1.9 lower FID) and semantic similarity (0.002 higher CLIP-Image), but results in worse pixel-level similarity. In contrast, the degradation type information embedded in restoration prompts improves both pixel-level fidelity and image quality. Utilizing degradation parameters in the restoration instructions further improves these metrics.

3.6 Ablation of the architecture. Modulating the skip feature f_skip improves the fidelity of the restored image with 3% extra parameters in the adaptor, while further modulating the backbone features f_up does not bring an obvious advantage.

4.1 Quantitative evaluation against baselines. In our user study, the results of our method are preferred over those from the baselines. For CLIP-Score, we achieve the best temporal consistency and comparable frame-wise editing accuracy.
