




Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Seed Vision Team, ByteDance

Abstract

Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5, and Midjourney still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese characteristics. We introduce Seedream 2.0, a native Chinese-English bilingual image generation foundation model that adeptly manages text prompts in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances accuracy and richness in image description. In particular, Seedream is integrated with a self-developed bilingual large language model (LLM) as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Besides, a Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled RoPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves strong performance in multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its outputs with human preferences, as revealed by its outstanding ELO score. In addition, it can be adapted to instruction-based image editing with editing capability that balances instruction-following and image consistency.

Correspondence: Authors are listed in Appendix A.
Official Page: /tech/seedream

Figure 1 Seedream 2.0 demonstrates outstanding performance across all evaluation aspects in both English and Chinese.

Figure 2 Seedream 2.0 Visualization.

Contents
1 Introduction
2 Data Pre-Processing
2.1 Data Composition
2.2 Data Cleaning Process
2.3 Active Learning Engine
2.4 Image Captioning
2.5 Text Rendering Data
3 Model Pre-Training
3.1 Diffusion Transformer
3.2 Text Encoder
3.3 Character-level Text Encoder
4 Model Post-Training
4.1 Continuing Training (CT)
4.1.1 Data
4.1.2 Training Strategy
4.2 Supervised Fine-Tuning (SFT)
4.2.1 Data
4.2.2 Training Strategy
4.3 Human Feedback Alignment (RLHF)
4.4 Prompt Engineering (PE)
4.4.1 Fine-tune LLM
4.4.2 PE RLHF
5 Align to Instruction-Based Image Editing
5.1 Preliminaries
5.2 Enhanced Human ID Preservation
6 Model Acceleration
6.1 CFG and Step Distillation
6.2 Quantization
7 Model Performance
7.1 Human Evaluation
7.1.2 Human Evaluation Results
7.2 Automatic Evaluation
7.2.1 Text-Image Alignment
7.2.2 Image Quality
7.3 Text Rendering
7.5 Visualization
8 Conclusion
A Contributions and Acknowledgments

1 Introduction

With the significant advancement of diffusion models, the field of image generation has experienced rapid expansion. Recent powerful models such as Flux [13], SD3.5 [7], Ideogram 2.0, and Midjourney 6.1 have initiated a wave of widespread commercial applications. However, despite the remarkable progress made by existing foundation models, they still encounter several challenges:

• Model Bias: Existing models are often biased toward particular capabilities while sacrificing performance in other aspects, such as prompt-following or structural correctness.
• Inadequate Text Rendering Capacity: The ability to perform accurate text rendering for long content or in multiple languages (especially Chinese) is rather limited, while text rendering is a key ability for many practical applications.
• Deficiency in Understanding Chinese Characteristics: There is a lack of deep understanding of the distinctive characteristics of local culture, such as Chinese culture, which is of great importance to local users.

To address these important issues, we introduce Seedream 2.0, a cutting-edge text-to-image model. It can proficiently handle both Chinese and English prompts, and supports bilingual image generation and text rendering tasks, with outstanding performance in multiple aspects. Specifically, we design a data architecture with the ability to continuously integrate new knowledge, and develop a strong caption system that considers both accuracy and richness. Importantly, we have integrated a self-developed large language model (LLM) with a decoder-only architecture as the text encoder. Through multiple rounds of calibration, the text encoder obtains enhanced bilingual alignment capabilities, endowing it with native support for learning from original data in both Chinese and English. We also apply a glyph-aligned ByT5 model, which enables our model to flexibly undertake character-level text rendering. Moreover, a Scaled RoPE is proposed to generalize our generation process to untrained image resolutions. During the post-training stage, we further enhance the model's capabilities through multiple phases of SFT training and RLHF iterations.

Our key contributions are fourfold:

• Strong Model Capability: Through multi-level optimization consisting of data construction, model pre-training, and post-training, our model stands at the forefront across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness.
• Excellent Text Rendering Proficiency: Using a custom character-level text encoder tailored for text rendering tasks, our model exhibits excellent capabilities for text generation, particularly excelling in the production of long textual content with complicated Chinese characters.
• Profound Understanding of Chinese Characteristics: By integrating a self-developed multi-language LLM text encoder, our model can learn directly from massive high-quality data in Chinese, making it adept at Chinese-specific concepts and vocabulary. Furthermore, our model demonstrates exceptional performance in Chinese text rendering, which is not well developed in the community.
• Highly Aligned with Human Preferences: Following multiple iterations of RLHF optimization across various post-training modules, our model consistently aligns its outputs with human preferences, as revealed by a great advantage in ELO scoring.

Seedream 2.0 has been deployed in products such as Doubao (豆包)¹ and Dreamina (即梦)². We ardently encourage a broader audience to explore the extensive capabilities and potential of our model, with the aspiration that it can emerge as an effective tool for improving productivity in multiple aspects of work and daily life.

¹ /chat/create-image
² /ai-tool/image/generate

2 Data Pre-Processing

This section details our data pipeline for pre-training, encompassing pre-processing steps such as data composition, data cleaning and filtering, active learning, captioning, and data for text rendering. These processes ensure a final pre-training dataset that is of high quality and large scale.

2.1 Data Composition

Our pre-training data is meticulously curated from four main components, ensuring a balanced and comprehensive distribution.

Figure 3 Pre-training data system.

High-Quality Data. This component includes data with exceptionally high image quality and rich knowledge content.

Figure 4 Overview of our knowledge injection process.

Distribution Maintenance Data. This component maintains the useful distribution of the original data while reducing low-quality data through:
• Downsampling by Data Source: Reducing the proportion of overrepresented sources while preserving their relative magnitude relationships.
• Clustering-based Sampling: Sampling data based on clusters at multiple hierarchical levels, from clusters representing broader semantics (such as visual designs) to those representing finer semantics (e.g., CD/book covers and posters).

Knowledge Injection Data. This segment involves the injection of knowledge using a developed taxonomy and a multimodal retrieval engine, as shown in Figure 4. It includes data with distinctive Chinese contexts. Additionally, a small batch of such data was manually collected, covering cultural categories including folk culture. Our multimodal retrieval engine was employed to augment and incorporate this Chinese knowledge into our generative model.

Targeted Supplementary Data. We supplement the dataset with data on which text-to-image models exhibit suboptimal performance, such as action-oriented data and counterfactual data (e.g., "a man with a balloon for a neck"). Our active learning engine categorizes and integrates these challenging data points into the final training set.

2.2 Data Cleaning Process

We adopt a three-stage data cleaning and filtering methodology, as depicted in Figure 5.

Figure 5 Overview of our data cleaning process.

First Stage: General Quality Assessment. We label the entire database using the following criteria:
• General Quality Score: Evaluating image clarity, motion blur, and meaningless content.
• General Structure Score: Assessing structural elements of the image.
• OCR Detection: Identifying and cataloging text within images.
Samples that do not meet quality standards are eliminated.

Second Stage: Detailed Quality Assessment. This stage involves professional aesthetic scoring, feature embedding extraction, deduplication, and clustering. Clustering is structured at multiple hierarchical levels, enabling flexible adjustment of the data distribution.

Third Stage: Captioning and Re-captioning. We stratify the remaining data and annotate captions or re-captions. Higher-level data generally receive richer new captions, described from different perspectives. Details on the captioning process are provided in Section 2.4.

2.3 Active Learning Engine

We developed an active learning system to improve our image classifiers, as illustrated in Figure 6. It is an iterative procedure that progressively refines our classifiers, ensuring a high-quality dataset for training: starting from a small labeled subset, a classifier is trained and applied to unlabeled images, human labelers correct its predictions, and the newly labeled images are added back to the current labeled dataset for the next iteration.

Figure 6 Flow diagram of Active Learning Lifecycle.

2.4 Image Captioning

We annotate images with both generic and specialized captions.

2.4.1 Generic Captions

We formulate short and long captions in Chinese and English, ensuring accurate and detailed descriptions:
• Short Captions: Accurately describe the main content of an image, capturing the core knowledge and content.
• Long Captions: More descriptive, detailing as many aspects of the image as possible, including inferences and imaginative elements.

Figure 7 Caption examples in our training data.

2.4.2 Specialized Captions

In addition to generic captions, we also use specialized captions:
• Artistic Captions: Describe aesthetic elements such as style, color, composition, and light interaction.
• Textual Captions: Focus on the textual information present in the images.
• Surreal Captions: Capture the surreal and fantastical aspects of images, offering a more imaginative description.
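The active-learning lifecycle of Section 2.3 (Figure 6) can be sketched as a small, self-contained loop. Everything below is illustrative: a toy 1-D "quality score" threshold classifier stands in for the real image classifiers, an `oracle` function stands in for the human labelers, and uncertainty sampling (querying the samples nearest the decision boundary) is one common query strategy, not necessarily the one used in the paper.

```python
# Minimal sketch of an active-learning loop: label a small seed set, train,
# query the most uncertain unlabeled samples for human labeling, repeat.
# The 1-D threshold "classifier" and all names here are illustrative.

def train_classifier(labeled):
    """Fit a threshold classifier: midpoint between the class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    return 1 if x >= threshold else 0

def most_uncertain(threshold, unlabeled, k):
    """Uncertainty sampling: the k samples closest to the decision boundary."""
    return sorted(unlabeled, key=lambda x: abs(x - threshold))[:k]

def active_learning(pool, oracle, seed, rounds=3, k=2):
    labeled = [(x, oracle(x)) for x in seed]      # small initial labeled subset
    unlabeled = [x for x in pool if x not in seed]
    for _ in range(rounds):
        thr = train_classifier(labeled)           # train on current labels
        picks = most_uncertain(thr, unlabeled, k) # send to "human labelers"
        labeled += [(x, oracle(x)) for x in picks]
        unlabeled = [x for x in unlabeled if x not in picks]
    return train_classifier(labeled)              # final refined classifier

# Toy run: quality scores in [0, 1]; the oracle labels >= 0.5 as "high quality".
pool = [i / 10 for i in range(11)]
thr = active_learning(pool, lambda x: 1 if x >= 0.5 else 0, seed=[0.0, 1.0])
print(f"final threshold: {thr:.4f}")
```

Each round spends labeling effort only where the current classifier is least confident, which is why the procedure converges on a usable decision boundary with far fewer human labels than exhaustive annotation.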
Figure 8 Text Rendering: Data Pre-processing Pipeline.

2.5 Text Rendering Data

We construct a large-scale visual text rendering dataset by filtering in-house data and using OCR tools to select images with rich visual text content, as depicted in Figure 8. The main data processing steps are as follows:
• Filter low-quality data from in-house sources.
• Employ OCR to detect and extract text regions, followed by cropping of watermarks.
• Remove low-quality text boxes, retaining clear and relevant text regions.
• Process extracted text using a re-caption model to generate high-quality descriptions.
• Further refine the descriptions to produce high-quality image-caption pairs, which are finally used for visual text-rendering tasks.

3 Model Pre-Training

Figure 9 Overview of Seedream 2.0 Training and Inference Pipeline.

Figure 10 Overview of Model Architecture.

3.1 Diffusion Transformer

For an input image I, a self-developed Variational Auto-Encoder (VAE) is used to encode the input image, resulting in a latent space representation x ∈ R^{C×H×W}. The latent vector x is then patchified into a number of patches. This process ultimately transforms the input image into a sequence of image tokens, which are concatenated with text tokens encoded by a text encoder and then fed into transformer blocks. The design of the DiT blocks mainly adheres to the design principles of MMDiT in Stable Diffusion 3 (SD3) [7]. Each transformer block incorporates only a single self-attention layer, which concurrently processes both image and text tokens. Considering the disparities between the image and text modalities, distinct MLPs are employed to handle them separately. Adaptive layer norm is utilized to modulate each attention and MLP layer. We resort to QK-Norm to improve training stability and Fully Sharded Data Parallel (FSDP) [44] to conduct distributed model training. In this paper, we add a learned positional embedding on text tokens, and apply a 2D Rotary Positional Embedding (RoPE) [29] on image tokens. Unlike previous works, we develop a variant of 2D RoPE, namely Scaled RoPE, which allows the model to generalize to untrained aspect ratios and resolutions to a certain extent during inference.

3.2 Text Encoder

To perform effective prompt encoding for text-to-image generation models, existing methodologies typically resort to employing CLIP or T5 as the text encoder for diffusion models. The CLIP text encoder ([24]) is capable of capturing discriminative information that is well aligned with visual representations or embeddings, while the T5 encoder ([25]) has a strong ability to understand complicated and fine-grained text information. However, neither the CLIP nor the T5 encoder has a strong ability to understand Chinese text, whereas decoder-only LLMs often have excellent multi-language capabilities. A text encoder plays a key role in diffusion models, particularly for the performance of text alignment in image generation. We therefore aim to develop a strong text encoder by taking advantage of the power of LLMs, which exceeds that of CLIP or T5.

However, text embeddings generated by decoder-only LLMs differ considerably in feature distribution from those of the CLIP or T5 text encoders, making them difficult to align with image representations in diffusion models. This results in significant instability when training a diffusion model with such an LLM-based text encoder. We develop a new approach to fine-tune a decoder-only LLM using text-image pair data. To further enhance the capabilities for generating certain challenging scenarios, such as those involving Chinese stylistic nuances and specialized professional vocabulary, we collect a large amount of such data and include it in our training set. Using the strong capabilities of the LLM and meticulously crafted training strategies, our text encoder demonstrates superior performance over other models across multiple perspectives, including strong bilingual capabilities that enable excellent performance in long-text understanding and complicated instruction following. In particular, this excellent bilingual ability allows our model to learn meaningful native knowledge directly from massive data in both Chinese and English, which is the key for our model to generate images with accurate cultural nuances and aesthetic expressions described in both Chinese and English.

3.3 Character-level Text Encoder

Considering the complexity of bilingual text glyphs (especially Chinese characters), we apply a glyph-aligned ByT5 [19, 37] model to encode glyph-level features or embeddings, ensuring the consistency of the glyph features of the rendered text with the prompt; these features are concatenated and then input into the DiT blocks.

Rendering Content. Experimental results have demonstrated that using a ByT5 model alone to encode the features of rendered text, particularly long text, can lead to repeated characters and disordered layout generation, owing to the model's insufficient understanding of holistic semantics. To address this issue, we encode the glyph features of the rendered text with both the LLM text encoder and a ByT5 model. We then employ an MLP layer to project the ByT5 embeddings into a space that aligns with the features of the LLM text encoder. After splicing the LLM and ByT5 features, we send the complete text features to the DiT blocks for training. In contrast to other approaches that typically use both LLM features and OCR-rendered image features as conditions, our approach uses only textual features as conditions. This allows our model to maintain the same training and inference process as the original text-to-image generation, significantly reducing the complexity of the training and inference pipeline.

Rendering Features. The font, color, size, position, and other characteristics of the rendered text are described using a re-caption model, and the resulting description is encoded through the LLM text encoder. Traditional text rendering approaches [4, 18, 32] typically rely on a layout of preset text boxes as a conditional input to a diffusion model. For example, TextDiffuser-2 [4] employs an additional LLM for layout planning and encoding. In contrast, our approach directly describes the rendering features of the text through the re-caption model, allowing for end-to-end training. This enables our model to learn the rendering features of text effectively and directly from training data, and also makes it easier to control the rendering features of the text, enabling the creation of more sophisticated and high-quality text rendering outputs.

4 Model Post-Training

Our post-training process consists of multiple sequential phases: 1) Continuing Training (CT) and Supervised Fine-Tuning (SFT) stages remarkably enhance the aesthetics of the generated images; 2) Human Feedback Alignment (RLHF) improves overall performance using reward models and feedback learning algorithms; 3) Prompt Engineering (PE) further improves the performance on aesthetics and diversity by leveraging a fine-tuned LLM; 4) Finally, a refiner model is developed to scale up the resolution of an output image generated from our base model, and at the same time fix some minor structural
errors. The visualization results during different post-training stages are presented in Figure 11.

4.1 Continuing Training (CT)

Pre-trained diffusion models often struggle to produce images that meet the desired aesthetic criteria, due to the disparate aesthetic standards inherent in the pre-training datasets. To confront this challenge, we extend the training phase by transitioning to a smaller but higher-quality dataset. This continuing training (CT) phase is designed not only to markedly enhance the aesthetics of the generated images, but also to maintain fundamental performance on prompt-following and structural accuracy. The data of the CT stage consists of two parts.

Figure 11 Visualization during different post-training stages.

4.1.1 Data

• High-quality Pre-training Data: We filter a large number of high-quality images from our pre-training datasets; this selection is automatic, relying on quality models without any manual effort.
• Manually Curated Data: In addition to the high-quality data collected from pre-training datasets, we meticulously amass datasets with elevated aesthetic quality from diverse specific domains such as art, photography, and design. The images within these datasets are required to possess a certain aesthetic charm and align with the anticipated image generation outcomes. Following multiple rounds of refinement, a refined dataset comprising millions of manually cherry-picked images was constructed. To avoid overfitting to such a small dataset, we continually train our model by jointly using it with the selected high-quality pre-training data, with a reasonable sampling ratio.

4.1.2 Training Strategy

Directly performing CT on the aforementioned datasets can considerably improve performance in terms of aesthetics, but the generated images still exhibit a notable disparity from real images with appealing aesthetics. To further improve aesthetic performance, we introduce VMix ([34]), which enables our model to learn fine-grained aesthetic characteristics directly during the denoising process. We tag each image along various aesthetic dimensions, namely color, lighting, texture, and composition, and these tags are then used as supplementary conditions during our CT training process. Experimental results show that this method can further enhance the aesthetic appeal of the generated images.

4.2 Supervised Fine-Tuning (SFT)

4.2.1 Data

In the SFT stage, we further fine-tune our model toward generating high-fidelity images with excellent artistic beauty, using a small amount of carefully collected images. With these collected images, we specifically trained a caption model capable of precisely describing beauty and artistry through multi-round manual rectifications. Furthermore, we also assigned style labels and fine-grained aesthetic labels (used in the VMix approach) to these images, which ensures that information from the majority of mainstream genres is included.

4.2.2 Training Strategy

In addition to the constructed SFT data, we also include a certain amount of model-generated images, labeled as "negative samples", during SFT training. By combining them with real image samples, the model learns to discriminate between real and fake images, enabling it to generate more natural and realistic images, thereby enhancing the quality and authenticity of the generated results. Fine-tuning on data with high artistic standards can substantially enhance artistic beauty, but it inevitably degrades performance on image-text alignment, which is fundamental to the text-to-image generation task. To address this issue, we developed a data resampling algorithm that allows the model to enhance aesthetics while still maintaining image-text alignment capacity.

4.3 Human Feedback Alignment (RLHF)

In our work, we introduce a pioneering RLHF optimization procedure tailored for diffusion models ([14, 41, 42]), incorporating preference data, reward models (RMs), and feedback learning. As shown in Figure 12, the RLHF phase plays a pivotal role in enhancing the overall performance of our diffusion models in various aspects, including image-text alignment, aesthetics, structural correctness, text rendering, etc.

Figure 12 The reward curves show that the values across diverse reward models all exhibit a stable and consistent upward trend throughout the alignment process. Some visualization examples reveal that the human feedback alignment stage is crucial.

4.3.1 Preference Data

• Prompt System: We have developed a versatile prompt system for use in both the RM training and feedback learning phases. Our curated collection comprises 1 million multi-dimensional prompts sourced from training captions and user input. Through rigorous curation processes that filter out ambiguous or vague expressions, we guarantee a prompt system that is not only comprehensive but also rich in diversity and depth of content.
• RM Data Collection: We collect high-quality data for preference annotation, comprising images crafted by various trained models and data sources. Through the construction of a cross-version and cross-model annotation pipeline, we enhance the domain adaptability of the RMs and extend the upper threshold of their capability.
• Annotation Rules: In the annotation phase, we engage in multi-dimensional fusion annotation (covering image-text matching, text rendering, aesthetics, etc.). These integrated annotation procedures are designed to elevate the multi-dimensional capabilities of a single reward model, forestall deficiencies in the RLHF stage, and foster the achievement of Pareto optimality across all dimensions within RLHF.

4.3.2 Reward Model

• Model Architecture: We use a CLIP model that supports both Chinese and English as our RM. Leveraging the strong alignment capabilities of the CLIP model, we forgo the additional reward-head outputs used by methods like ImageReward, opting to utilize the output of the CLIP model as the reward itself. A ranking loss is primarily applied as the training loss of our RMs.
• Multi-aspect Reward Models: To enhance overall performance, we developed and trained three distinct RMs: an image-text alignment RM, an aesthetic RM, and a text-rendering RM. In particular, the text-rendering RM is selectively engaged when a prompt tag relates to text rendering, significantly improving the precision of character-level text generation.

4.3.3 Feedback Learning

• Learning Algorithm: We refine our diffusion model through direct optimization of output scores computed from multiple RMs, akin to the REFL ([36]) paradigm. Delving into various feedback learning algorithms such as DPO ([33]) and DDPO ([1]), our investigation revealed the advantages of our direct approach toward multi-reward optimization.
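Two pieces of the RLHF recipe in Section 4.3 lend themselves to a compact numerical sketch: the pairwise ranking loss used to train the reward models, and the gated combination of the three RMs, where the text-rendering RM contributes only for prompts tagged as text rendering. The weights, function names, and the specific logistic (Bradley-Terry) form of the ranking loss below are illustrative assumptions, not the paper's implementation.

```python
import math

# (1) Pairwise ranking loss for reward-model training: given RM scores for a
# human-preferred and a rejected image, penalize -log(sigmoid(r_w - r_l)).
def ranking_loss(score_preferred, score_rejected):
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# (2) Gated multi-RM aggregation: the alignment and aesthetic RMs always
# contribute; the text-rendering RM is engaged only for text-render prompts.
def total_reward(alignment_score, aesthetic_score, text_score,
                 is_text_prompt, weights=(1.0, 1.0, 1.0)):
    w_align, w_aes, w_text = weights  # illustrative equal weights
    reward = w_align * alignment_score + w_aes * aesthetic_score
    if is_text_prompt:
        reward += w_text * text_score
    return reward

# Correct ranking -> small loss; inverted ranking -> large loss.
loss_good = ranking_loss(2.0, -1.0)
loss_bad = ranking_loss(-1.0, 2.0)
print(loss_good < loss_bad)  # → True
```

In a REFL-style loop, `total_reward` would be evaluated on decoded samples and maximized directly by backpropagating through the reward models into the diffusion model; the gate keeps the text-rendering RM from distorting gradients on prompts where no text is requested.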