




HIGH-FIDELITY IMAGE AND VIDEO EDITING WITH GENERATIVE MODELS
by
CHENYANG QI
A Thesis Submitted to
The Hong Kong University of Science and Technology
in Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy
in Computer Science and Engineering
July 2024, Hong Kong
Copyright © by Chenyang Qi 2024
HKUST Library
Reproduction is prohibited without the author's prior written consent
Authorization
I hereby declare that I am the sole author of the thesis.
I authorize the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research.
I further authorize the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.
CHENYANG QI
24 July 2024
HIGH-FIDELITY IMAGE AND VIDEO EDITING WITH GENERATIVE MODELS
by
CHENYANG QI
This is to certify that I have examined the above Ph.D. thesis
and have found that it is complete and satisfactory in all respects,
and that any and all revisions required by
the thesis examination committee have been made.
Prof. Qifeng Chen, Thesis Supervisor
Prof. Xiaofang ZHOU, Head of Department
Department of Computer Science and Engineering
24 July 2024
ACKNOWLEDGMENTS
It would have been impossible to complete my wonderful Ph.D. journey without the help of so many people.
First of all, I would like to express my gratitude to my advisor, Professor Qifeng Chen, for his patience, support, and encouragement. I still remember our first meeting four years ago. Although I had almost no experience in computer vision at that time, Professor Chen believed in my potential and gave me this invaluable opportunity to pursue knowledge at HKUST. Over these four years, he has provided kind guidance on idea brainstorming, technical design, result presentation, and career planning.
Secondly, I would like to thank my mentors during my internships: Xiaodong Cun, Yong Zhang, Xintao Wang, and Ying Shan at Tencent AI Lab; Zhengzhong Tu, Keren Ye, Hossein Talebi, Mauricio Delbracio, and Peyman Milanfar at Google Research; Bo Zhang, Dong Chen, and Fang Wen at Microsoft Research; and Taesung Park and Jimei Yang at Adobe. They taught me practical skills to solve real-world problems and bridge the gap between academia and industry.
Next, I would like to thank my labmates in the HKUST Visual Intelligence Lab, especially my collaborators Chenyang Lei, Jiaxin Xie, Xin Yang, Ka Leong Cheng, Yue Ma, Liya Ji, Junming Chen, Na Fan, and Zian Qian. We have helped each other in our research, and I have learned a lot from their insights. Also, thanks to Yue Wu, Qiang Wen, Tengfei Wang, Yingqing He, Yazhou Xing, Guotao Meng, Zifan Shi, Maosheng Ye, Yueqi Xie, and all other labmates. It has been a joyful time being friends and partners with you.
Further, I would like to express my sincere gratitude to Prof. Yingcong Chen, Prof. Dan Xu, Prof. Xiaomeng Li, Prof. Chi-Ying Tsui, Prof. Ling Shi, Prof. Chiew Lan Tai, and Prof. Yinqiang Zheng, who served on the qualifying examination committee and thesis committee of my Ph.D. program at HKUST.
Last but not least, I appreciate the endless support from my family and my girlfriend. Your encouragement has given me the power to face the difficulties in my research. My girlfriend, Xilin Zhang, has also helped in revising my drafts before almost every deadline.
Thanks to everyone who has offered their kind support and help in my academic journey!
TABLE OF CONTENTS
Title Page
Authorization Page
Signature Page
Acknowledgments
Table of Contents
List of Figures
List of Tables
Abstract
Chapter 1 Introduction
1.1 Background
1.2 Dissertation Overview
Chapter 2 Thumbnail Rescaling Using Quantized Autoencoder
2.1 Introduction
2.2 Related Work
2.2.1 Image Super-resolution
2.2.2 Image Rescaling
2.2.3 Image Compression
2.3 Method
2.3.1 JPEG Preliminary
2.3.2 Overview of HyperThumbnail
2.3.3 Quantization Prediction Module
2.3.4 Frequency-aware Decoder
2.3.5 Training Objectives
2.4 Experiments
2.4.1 Implementation Details
2.4.2 Experimental Setup
2.4.3 Comparison with Baselines
2.4.4 Additional Qualitative Results
2.4.5 Real-time Inference on 6K Images
2.4.6 Extension for Optimization-based Rescaling
2.5 Ablation Study
2.6 Conclusion
Chapter 3 Text-Driven Image Restoration via Diffusion Priors
3.1 Introduction
3.2 Related Work
3.3 Method
3.3.1 Preliminaries
3.3.2 Text-Driven Image Restoration
3.3.3 Decoupling Semantic and Restoration Prompts
3.3.4 Learning to Control the Restoration
3.4 Experiments
3.4.1 Text-based Training Data and Benchmarks
3.4.2 Comparison with Baselines
3.4.3 Prompting the SPIRE
3.5 Ablation Study
3.6 Conclusion
Chapter 4 Text-Driven Video Editing Using Diffusion Priors
4.1 Introduction
4.2 Related Work
4.3 Methods
4.3.1 Preliminary: Latent Diffusion and Inversion
4.3.2 FateZero Video Editing
4.3.3 Shape-Aware Video Editing
4.4 Experiments
4.4.1 Implementation Details
4.4.2 Pseudo-Algorithm Code
4.4.3 Applications
4.4.4 Baseline Comparisons
4.4.5 Ablation Studies
4.5 Conclusion
Chapter 5 Conclusion and Discussion
References
Appendix A List of Publications
LIST OF FIGURES
1.1 The traditional paradigm [107, 146] (a) of visual editing first applies degradation operators to training data x to synthesize conditions y, such as low-resolution images, segmentation maps, or sketch maps. Although this method is straightforward, it faces difficulties in collecting open-domain paired training data and in designing a flexible framework that unifies all translation tasks. (b) We propose a new paradigm that utilizes pretrained generative models conditioned on editing instructions to adapt to various editing tasks flexibly.
2.1 The application of 6K image rescaling in the context of cloud photo storage on smartphones (e.g., iCloud). As more high-resolution (HR) images are uploaded to cloud storage nowadays, challenges are brought to cloud service providers (CSPs) in fulfilling latency-sensitive image reading requests (e.g., zoom-in) through the internet. To facilitate faster transmission and high-quality visual content, our HyperThumbnail framework helps CSPs encode an HR image into an LR JPEG thumbnail, which users can cache locally. When the internet is unstable or unavailable, our method can still reconstruct a high-fidelity HR image from the JPEG thumbnail in real time.
2.2 The overview of our approach. Given an HR input image x, we first encode x into its LR representation y with the encoder E, where the scaling factor is s. Second, we transform y into DCT coefficients C and predict the quantization tables Q_L and Q_C with our quantization prediction module (QPM), which lets us estimate the bitrate of the quantized coefficients C at the training stage. After rounding and truncation, which we denote as [·], the [Q_L], [Q_C], and [C] can be written and read with an off-the-shelf JPEG API at the testing stage. To restore the HR image, we extract features from C with a frequency feature extractor f and produce the high-fidelity image with the decoder D.
2.3 Reconstructed HR images and LR thumbnails by different methods on the DIV2K [6] validation dataset. We crop the restored HR images to ease the comparison and visualize the LR counterparts at the bottom-right. The bpp is calculated on the whole image, and the PSNR is evaluated on the cropped area of the reconstructed HR images.
2.4 Downscaled LR thumbnails by different methods on the Set14 image comic. With a similar target bpp, our model introduces the fewest artifacts in the thumbnail in comparison to baselines.
2.5 Model runtime. We profile the 4× encoder and decoder at different target resolutions in half-precision mode. In particular, we convert our decoder from PyTorch to TensorRT for a further reduction in inference time.
2.6 The rate-HR-distortion curve on the Kodak [1] dataset. Our method (s = 2, 4) outperforms JPEG and IRN [153] in RD performance. For the 'QPM+JPEG' curve, where s = 1, we follow the standard JPEG algorithm and adopt the QPM module as a plugin for table prediction.
2.7 Visual results of performing 4× rescaling on the DIV2K [6] and FiveK [18] datasets with baseline methods and our models. The images are cropped to ease the comparison. Please zoom in for details.
2.8 More results of 4× rescaling with our framework on real-world 6K images [18]. Please zoom in for details. Note that the images here are compressed due to the file size limit of the camera-ready version.
2.9 Quantization tables on Kodak [1] images. We visualize the quantization tables Q_L (the green table) and Q_C (the orange table) for kodim04 and kodim09 under different quantization approaches. The model trained with QPM achieves the best RD performance in every aspect. For more analysis, please refer to Sec. 2.5 of this chapter.
2.10 QPM versus image-invariant quantization. We first train our models with QPM, with a fixed JPEG table, or with an optimized table, respectively. Then, we evaluate them at different target bitrates on the Kodak [1] dataset. (a) The RD curve of the reconstructed HR image against the input x; (b) the RD curve of the LR thumbnail against the bicubic-downsampled LR y_ref.
2.11 Guidance loss ablation on the Kodak [1] image kodim17. We visualize the HR images with their LR counterparts at the bottom-right. (b) and (c) are produced by 4× HyperThumbnail models trained with different λ_1, and the bpp is 0.4.
3.1 We present SPIRE: Semantic Prompt-Driven Image Restoration, a text-based foundation model for all-in-one, instructed image restoration. SPIRE allows users to flexibly leverage either a semantic-level content prompt, a degradation-aware restoration prompt, or both, to obtain their desired enhancement results based on personal preferences. In other words, SPIRE can be easily prompted to conduct blind restoration, semantic restoration, or task-specific granular treatment. Our framework also enables a new paradigm of instruction-based image restoration, providing a reliable evaluation benchmark to facilitate vision-language models for low-level computational photography applications.
3.2 Framework of SPIRE. In the training phase, we begin by synthesizing a degraded version y of a clean image x. Our degradation synthesis pipeline also creates a restoration prompt c_r, which contains numeric parameters that reflect the intensity of the introduced degradation. Then, we inject the synthetic restoration prompt into a ControlNet adaptor, which uses our proposed modulation fusion blocks (γ, β) to connect with the frozen backbone driven by the semantic prompt c_s. At test time, users can employ the SPIRE framework either as a blind restoration model, with the restoration prompt "Remove all degradation" and an empty semantic prompt ∅, or manually adjust the restoration prompt c_r and semantic prompt c_s to obtain what they ask for.
3.3 Degradation ambiguities in real-world problems. By adjusting the restoration prompt, our method can preserve the motion effect that is coupled with the added Gaussian blur, while fully blind restoration models do not provide this level of flexibility.
3.4 Prompt-space walking visualization for the restoration prompt. Given the same degraded input (upper left) and an empty semantic prompt ∅, our method can decouple the restoration direction and strength by prompting only the quantitative number in natural language. An interesting finding is that our model learns a continuous range of restoration strengths from discrete language tokens.
3.5 Restoration prompting for out-of-domain images.
3.6 Visual comparison with other baselines. Our method of integrating both the semantic prompt c_s and the restoration prompt c_r outperforms image-to-image restoration (DiffBIR, retrained ControlNet-SR) and the naive zero-shot combination with a semantic prompt. It achieves sharper and cleaner results while maintaining consistency with the degraded image.
3.7 Test-time semantic prompting. Our framework restores degraded images guided by flexible semantic prompts, while unrelated background elements and global tones remain aligned with the degraded input conditioning. We also show more semantic prompting for images with multiple objects.
3.8 Main visual comparison with baselines. (Zoom in for details.)
4.1 Zero-shot text-driven video editing. We present a zero-shot approach for shape-aware local object editing and video style editing from pre-trained diffusion models [150, 117], without any optimization for each target prompt.
4.2 The overview of our approach. Our input is a user-provided source prompt p_src, a target prompt p_edit, and clean latents z = {z^1, z^2, ..., z^n} encoded from the input source video x = {x^1, x^2, ..., x^n} with n frames in a video sequence. On the left, we first invert the video into a noisy latent z_T with the DDIM inversion pipeline, using the source prompt p_src and an inflated 3D U-Net ε_θ. During each inversion timestep t, we store both the spatial-temporal self-attention maps s^src and the cross-attention maps c^src. At the editing stage of DDIM denoising, we denoise the latent z_T back to a clean image z_0, conditioned on the target prompt p_edit. At each denoising timestep t, we fuse the attention maps (s^edit and c^edit) in ε_θ with the stored attention maps (s^src, c^src) using the proposed Attention Blending Block. Right: specifically, we replace the cross-attention maps c^edit of unedited words (e.g., road and countryside) with their source maps c^src. In addition, we blend the self-attention maps from inversion (s^src) and editing (s^edit) with an adaptive spatial mask obtained from the cross-attention maps c^src of edited words (e.g., silver and jeep), which represents the areas that the user wants to edit.
4.3 Zero-shot local attribute editing (cat → tiger) using Stable Diffusion. In contrast to fusion with attention during reconstruction (a) in previous work [49, 136, 108], our inversion attention fusion (b) provides more accurate structure guidance and editing ability, as visualized on the right side.
4.4 Study of blended self-attention in zero-shot shape editing (rabbit → tiger) using Stable Diffusion. Fourth and fifth columns: ignoring self-attention cannot preserve the original structure and background, and naive replacement causes artifacts. Third column: blending the self-attention using the cross-attention map (the second row) obtains both the new shape from the target text, with a similar pose, and the background from the input frame.
4.5 Zero-shot object shape editing on a pre-trained video diffusion model [150]: our framework can directly edit the shape of objects in videos, driven by text prompts, using a pre-trained video diffusion model [150].
4.6 Zero-shot attribute and style editing results using Stable Diffusion [117]. Our framework supports abstract attribute and style editing like 'Swarovski crystal', 'Ukiyo-e', and 'Makoto Shinkai'. Best viewed with zoom-in.
4.7 Qualitative comparison of our methods with other baselines. Inputs are in Fig. 4.5 and Fig. 4.8. Our results have the best temporal consistency, image fidelity, and editing quality. Best viewed with zoom-in.
4.8 Application of latent blending. Extending our attention blending strategy to the high-resolution latent, our framework can preserve the accurate low-level color and texture of the input.
4.9 Inversion attention compared with reconstruction attention, using the prompt 'deserted shore' → 'glacier shore'. The attention maps obtained from the reconstruction stage fail to detect the boat's position and cannot provide suitable motion guidance for zero-shot video editing.
4.10 Ablation study of blended self-attention. Without self-attention fusion, the generated video cannot preserve the details of the input videos (e.g., fence, trees, and car identity). If we replace the full self-attention without a spatial mask, the structure of the original jeep misleads the generation of the Porsche car.
LIST OF TABLES
1.1 The comparison of different generative models.
2.1 The comparison of different methods related to image rescaling. (a) Super-resolution from downsampled JPEG does not optimize rate-distortion performance and can hardly maintain high fidelity due to the information lost in downsampling. (b) SOTA flow-based image rescaling methods also ignore the file size constraints and are not real-time for 6K reconstruction due to the limited speed of invertible networks. (c) Our framework optimizes rate-distortion performance while maintaining high-fidelity and real-time 6K image rescaling.
2.2 Quantitative evaluation of upscaling efficiency and reconstruction fidelity. We keep the bpp around 0.3 on Kodak [1] for different methods, and the distortion is measured by the PSNR on the reconstructed HR images. Our approach outperforms other methods with better HR reconstruction and a significantly lower runtime. We measure the running time and GMACs of all models by upscaling a 960×540 LR image to a 3840×2160 HR image. The measurements are made on an Nvidia RTX 3090 GPU with PyTorch 1.11.0 in half-precision mode for a fair comparison.
2.3 Architectures of our encoder.
2.4 Architectures of our efficient decoder.
2.5 Quantitative evaluation of the 4× downsampled LR thumbnails by different methods. The target bitrate is around 0.3 bpp on Kodak [1] for all methods, and we take the bicubic LR as the ground truth. Our thumbnail preserves visual contents better.
2.6 Comparison of our HyperThumbnail framework against learned compression with a JPEG thumbnail. In this additional baseline, we provide a JPEG thumbnail alongside learned compression and take the sum of the bitstream size and the JPEG size to calculate the final bpp. Our framework has better rate-distortion performance than the "Compression + JPEG" baseline.
2.7 Ablation study of our encoder-decoder architectures on the downsampling/upsampling time and the PSNR of the reconstructed HR image / LR thumbnail.
2.8 Quantitative evaluation for optimization-based rescaling.
2.9 HR reconstruction PSNR with different decoder capacities.
3.1 Quantitative results on the MS-COCO dataset (with c_s) using our parameterized degradation (left) and the Real-ESRGAN degradation (right). We also denote the prompt choice at test time: 'Sem' stands for semantic prompt; 'Res' stands for restoration prompt. The first group of baselines is tested without prompts; the second group is combined with the semantic prompt in a zero-shot way.
3.2 Our training degradations are randomly sampled from these two pipelines with 50% probability each. (1) Degraded images y synthesized by Real-ESRGAN are paired with the same restoration prompt c_r = "Remove all degradation". (2) In the other 50% of iterations, images generated by our parameterized pipeline are paired with either a restoration-type prompt (e.g., "Deblur") or a restoration-parameter prompt (e.g., "Deblur with sigma 0.3;").
3.3 Numerical results on the DIV2K test set without any prompt.
3.4 Ablation of architecture and degradation strength in c_r.
3.5 Ablation of prompts provided during both training and testing. We use an image-to-image model with our modulation fusion layer as our baseline. Providing semantic prompts significantly increases the image quality (1.9 lower FID) and semantic similarity (0.002 CLIP-Image), but results in worse pixel-level similarity. In contrast, the degradation-type information embedded in restoration prompts improves both pixel-level fidelity and image quality. Utilizing degradation parameters in the restoration instructions further improves these metrics.
3.6 Ablation of the architecture. Modulating the skip feature f_skip improves the fidelity of the restored image with 3% extra parameters in the adaptor, while further modulating the backbone features f_up does not bring an obvious advantage.
4.1 Quantitative evaluation against baselines. In our user study, the results of our method are preferred over those from the baselines. For CLIP-Score, we achieve the best temporal consistency and comparable frame-wise editing accuracy.