




Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model

Seed Vision Team, ByteDance

Abstract

Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5, and Midjourney still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese characteristics. We introduce Seedream 2.0, a native Chinese-English bilingual image generation foundation model that adeptly manages text prompts in both Chinese and English, supporting bilingual image generation and text rendering. We develop a powerful data system that facilitates knowledge integration, and a caption system that balances accuracy and richness in image description. In particular, Seedream is integrated with a self-developed bilingual large language model (LLM) as a text encoder, allowing it to learn native knowledge directly from massive data. This enables it to generate high-fidelity images with accurate cultural nuances and aesthetic expressions described in either Chinese or English. Besides, a Glyph-Aligned ByT5 is applied for flexible character-level text rendering, while a Scaled RoPE generalizes well to untrained resolutions. Multi-phase post-training optimizations, including SFT and RLHF iterations, further improve the overall capability. Through extensive experimentation, we demonstrate that Seedream 2.0 achieves strong performance in multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its outputs with human preferences, as revealed by its outstanding ELO score. In addition, it can be adapted to instruction-based image editing with editing capability that balances instruction-following and image consistency.

Correspondence: Authors are listed in Appendix A.
Official Page: /tech/seedream

Figure 1 Seedream 2.0 demonstrates outstanding performance across all evaluation aspects in both English and Chinese.

Figure 2 Seedream 2.0 Visualization.

Contents
1 Introduction
2 Data Pre-Processing
2.1 Data Composition
2.2 Data Cleaning Process
2.3 Active Learning Engine
2.4 Image Captioning
2.5 Text Rendering Data
3 Model Pre-Training
3.1 Diffusion Transformer
3.2 Text Encoder
3.3 Character-level Text Encoder
4 Model Post-Training
4.1 Continuing Training (CT)
4.1.1 Data
4.1.2 Training Strategy
4.2 Supervised Fine-Tuning (SFT)
4.2.1 Data
4.2.2 Training Strategy
4.3 Human Feedback Alignment (RLHF)
4.4 Prompt Engineering (PE)
4.4.1 Fine-tune LLM
4.4.2 PE RLHF
5 Align to Instruction-Based Image Editing
5.1 Preliminaries
5.2 Enhanced Human ID Preservation
6 Model Acceleration
6.1 CFG and Step Distillation
6.2 Quantization
7 Model Performance
7.1 Human Evaluation
7.1.2 Human Evaluation Results
7.2 Automatic Evaluation
7.2.1 Text-Image Alignment
7.2.2 Image Quality
7.3 Text Rendering
7.5 Visualization
8 Conclusion
A Contributions and Acknowledgments

1 Introduction

With the significant advancement of diffusion models, the field of image generation has experienced rapid expansion. Recent powerful models such as Flux [13], SD3.5 [7], Ideogram 2.0, and Midjourney 6.1 have initiated a wave of widespread commercial applications. However, despite the remarkable progress made by existing foundation models, they still encounter several challenges:

• Model Bias: Existing models are often biased toward particular capabilities while sacrificing performance in other aspects, such as prompt-following or structural correctness.
• Inadequate Text Rendering Capacity: The ability to perform accurate text rendering for long content or in multiple languages (especially Chinese) is rather limited, while text rendering is a key ability for many practical applications.
• Deficiency in Understanding Chinese Characteristics: There is a lack of deep understanding of the distinctive characteristics of local culture, such as Chinese culture, which is of great importance to local users.

To address these important issues, we introduce Seedream 2.0, a cutting-edge text-to-image model. It can proficiently handle both Chinese and English prompts, and supports bilingual image generation and text rendering tasks, with outstanding performance in multiple aspects. Specifically, we design a data architecture with the ability to continuously integrate new knowledge, and develop a strong caption system that considers both accuracy and richness. Importantly, we have integrated a self-developed large language model (LLM) with a decoder-only architecture as the text encoder. Through multiple rounds of calibration, the text encoder obtains enhanced bilingual alignment capabilities, endowing it with native support for learning from original data in both Chinese and English. We also apply a glyph-aligned ByT5 model, which enables our model to flexibly undertake character-level text rendering. Moreover, a Scaled RoPE is proposed to generalize our generation process to untrained image resolutions. During the post-training stage, we further enhance the model's capabilities through multiple phases of SFT training and RLHF iterations.

Our key contributions are fourfold:

• Strong Model Capability: Through multi-level optimization consisting of data construction, model pre-training, and post-training, our model stands at the forefront across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness.
• Excellent Text Rendering Proficiency: Using a custom character-level text encoder tailored for text rendering tasks, our model exhibits excellent capabilities for text generation, particularly excelling in the production of long textual content with complicated Chinese characters.
• Profound Understanding of Chinese Characteristics: By integrating a self-developed multi-language LLM text encoder, our model can learn directly from massive high-quality data in Chinese, making it adept at Chinese-specific concepts and vocabulary. Furthermore, our model demonstrates exceptional performance in Chinese text rendering, which is not well developed in the community.
• Highly Aligned with Human Preferences: Following multiple iterations of RLHF optimization across various post-training modules, our model consistently aligns its outputs with human preferences, as revealed by a great advantage in ELO scoring.

Seedream 2.0 has been deployed in products such as Doubao (豆包)¹ and Dreamina (即梦)². We ardently encourage a broader audience to explore the extensive capabilities and potential of our model, with the aspiration that it can emerge as an effective tool for improving productivity in multiple aspects of work and daily life.

¹ /chat/create-image
² /ai-tool/image/generate

2 Data Pre-Processing

This section details our data pipeline for pre-training, encompassing pre-processing steps such as data composition, data cleaning and filtering, active learning, captioning, and data for text rendering. These processes ensure a final pre-training dataset that is of high quality and large scale.

2.1 Data Composition

Our pre-training data is meticulously curated from four main components, ensuring a balanced and comprehensive distribution.

Figure 3 Pre-training data system.

High-Quality Data. This component includes data with exceptionally high image quality and rich knowledge content.

Figure 4 Overview of our knowledge injection process.

Distribution Maintenance Data. This component maintains the useful distribution of the original data while reducing low-quality data through:
• Downsampling by Data Source: Reducing the proportion of overrepresented sources while preserving their relative magnitude relationships.
• Clustering-based Sampling: Sampling data based on clusters at multiple hierarchical levels, from clusters representing broader semantics (such as visual designs) to those representing finer semantics (e.g., CD/book covers and posters).

Knowledge Injection Data. This segment involves the injection of knowledge using a developed taxonomy and a multimodal retrieval engine, as shown in Figure 4. It includes data with distinctive Chinese contexts. Additionally, a small batch of such data was manually collected, covering cultural categories including folk culture. Our multimodal retrieval engine was employed to augment and incorporate this Chinese knowledge into our generative model.

Targeted Supplementary Data. We supplement the dataset with data on which text-to-image models exhibit suboptimal performance, such as action-oriented data and counterfactual data (e.g., "a man with a balloon for a neck"). Our active learning engine categorizes and integrates these challenging data points into the final training set.

2.2 Data Cleaning Process

We adopt a three-stage data cleaning and filtering methodology, as depicted in Figure 5.

Figure 5 Overview of our data cleaning process.

First Stage: General Quality Assessment. We label the entire database using the following criteria:
• General Quality Score: Evaluating image clarity, motion blur, and meaningless content.
• General Structure Score: Assessing structural elements of the image.
• OCR Detection: Identifying and cataloging text within images.
Samples that do not meet quality standards are eliminated.

Second Stage: Detailed Quality Assessment. This stage involves professional aesthetic scoring, feature embedding extraction, deduplication, and clustering. Clustering is structured at multiple hierarchical levels, enabling flexible adjustment of the data distribution.

Third Stage: Captioning and Re-captioning. We stratify the remaining data and annotate captions or re-captions. Higher-level data generally receive richer new captions, described from different perspectives. Details on the captioning process are provided in Section 2.4.

2.3 Active Learning Engine

We developed an active learning system to improve our image classifiers, as illustrated in Figure 6. It is an iterative procedure that progressively refines our classifiers, ensuring a high-quality dataset for training: starting from a small labeled subset, a classifier is trained and applied to unlabeled images, human labelers correct its predictions, and the newly labeled images are added back to the current labeled dataset for the next iteration.

Figure 6 Flow diagram of Active Learning Lifecycle.

2.4 Image Captioning

We annotate images with both generic and specialized captions.

2.4.1 Generic Captions

We formulate short and long captions in Chinese and English, ensuring accurate and detailed descriptions:
• Short Captions: Accurately describe the main content of an image, capturing the core knowledge and content.
• Long Captions: More descriptive, detailing as many aspects of the image as possible, including inferences and imaginative elements.

Figure 7 Caption examples in our training data.

2.4.2 Specialized Captions

In addition to generic captions, we also use specialized captions:
• Artistic Captions: Describe aesthetic elements such as style, color, composition, and light interaction.
• Textual Captions: Focus on the textual information present in the images.
• Surreal Captions: Capture the surreal and fantastical aspects of images, offering a more imaginative description.
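The active-learning lifecycle of Section 2.3 (Figure 6) can be sketched as a small, self-contained loop. Everything below is illustrative: a toy 1-D "quality score" threshold classifier stands in for the real image classifiers, an `oracle` function stands in for the human labelers, and uncertainty sampling (querying the samples nearest the decision boundary) is one common query strategy, not necessarily the one used in the paper.

```python
# Minimal sketch of an active-learning loop: label a small seed set, train,
# query the most uncertain unlabeled samples for human labeling, repeat.
# The 1-D threshold "classifier" and all names here are illustrative.

def train_classifier(labeled):
    """Fit a threshold classifier: midpoint between the class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict(threshold, x):
    return 1 if x >= threshold else 0

def most_uncertain(threshold, unlabeled, k):
    """Uncertainty sampling: the k samples closest to the decision boundary."""
    return sorted(unlabeled, key=lambda x: abs(x - threshold))[:k]

def active_learning(pool, oracle, seed, rounds=3, k=2):
    labeled = [(x, oracle(x)) for x in seed]      # small initial labeled subset
    unlabeled = [x for x in pool if x not in seed]
    for _ in range(rounds):
        thr = train_classifier(labeled)           # train on current labels
        picks = most_uncertain(thr, unlabeled, k) # send to "human labelers"
        labeled += [(x, oracle(x)) for x in picks]
        unlabeled = [x for x in unlabeled if x not in picks]
    return train_classifier(labeled)              # final refined classifier

# Toy run: quality scores in [0, 1]; the oracle labels >= 0.5 as "high quality".
pool = [i / 10 for i in range(11)]
thr = active_learning(pool, lambda x: 1 if x >= 0.5 else 0, seed=[0.0, 1.0])
print(f"final threshold: {thr:.4f}")
```

Each round spends labeling effort only where the current classifier is least confident, which is why the procedure converges on a usable decision boundary with far fewer human labels than exhaustive annotation.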
Figure 8 Text Rendering: Data Pre-processing Pipeline.

2.5 Text Rendering Data

We construct a large-scale visual text rendering dataset by filtering in-house data and using OCR tools to select images with rich visual text content, as depicted in Figure 8. The main data processing steps are as follows:
• Filter low-quality data from in-house sources.
• Employ OCR to detect and extract text regions, followed by cropping of watermarks.
• Remove low-quality text boxes, retaining clear and relevant text regions.
• Process extracted text using a re-caption model to generate high-quality descriptions.
• Further refine the descriptions to produce high-quality image-caption pairs, which are finally used for visual text-rendering tasks.

3 Model Pre-Training

Figure 9 Overview of Seedream 2.0 Training and Inference Pipeline.

Figure 10 Overview of Model Architecture.

3.1 Diffusion Transformer

For an input image I, a self-developed Variational Auto-Encoder (VAE) is used to encode the input image, resulting in a latent space representation x ∈ R^{C×H×W}. The latent vector x is then patchified into a number of patches. This process ultimately transforms the input image into a sequence of image tokens, which are concatenated with text tokens encoded by a text encoder and then fed into transformer blocks. The design of the DiT blocks mainly adheres to the design principles of MMDiT in Stable Diffusion 3 (SD3) [7]. Each transformer block incorporates only a single self-attention layer, which concurrently processes both image and text tokens. Considering the disparities between the image and text modalities, distinct MLPs are employed to handle them separately. Adaptive layer norm is utilized to modulate each attention and MLP layer. We resort to QK-Norm to improve training stability and Fully Sharded Data Parallel (FSDP) [44] to conduct distributed model training. In this paper, we add a learned positional embedding on text tokens, and apply a 2D Rotary Positional Embedding (RoPE) [29] on image tokens. Unlike previous works, we develop a variant of 2D RoPE, namely Scaled RoPE, which allows the model to generalize to untrained aspect ratios and resolutions to a certain extent during inference.

3.2 Text Encoder

To perform effective prompt encoding for text-to-image generation models, existing methodologies typically resort to employing CLIP or T5 as the text encoder for diffusion models. The CLIP text encoder ([24]) is capable of capturing discriminative information that is well aligned with visual representations or embeddings, while the T5 encoder ([25]) has a strong ability to understand complicated and fine-grained text information. However, neither the CLIP nor the T5 encoder has a strong ability to understand Chinese text, whereas decoder-only LLMs often have excellent multi-language capabilities. A text encoder plays a key role in diffusion models, particularly for the performance of text alignment in image generation. We therefore aim to develop a strong text encoder by taking advantage of the power of LLMs, which exceeds that of CLIP or T5.

However, text embeddings generated by decoder-only LLMs differ considerably in feature distribution from those of the CLIP or T5 text encoders, making them difficult to align with image representations in diffusion models. This results in significant instability when training a diffusion model with such an LLM-based text encoder. We develop a new approach to fine-tune a decoder-only LLM using text-image pair data. To further enhance the capabilities for generating certain challenging scenarios, such as those involving Chinese stylistic nuances and specialized professional vocabulary, we collect a large amount of such data and include it in our training set. Using the strong capabilities of the LLM and meticulously crafted training strategies, our text encoder demonstrates superior performance over other models across multiple perspectives, including strong bilingual capabilities that enable excellent performance in long-text understanding and complicated instruction following. In particular, this excellent bilingual ability allows our model to learn meaningful native knowledge directly from massive data in both Chinese and English, which is the key for our model to generate images with accurate cultural nuances and aesthetic expressions described in both Chinese and English.

3.3 Character-level Text Encoder

Considering the complexity of bilingual text glyphs (especially Chinese characters), we apply a glyph-aligned ByT5 [19, 37] model to encode glyph-level features or embeddings, ensuring the consistency of the glyph features of the rendered text with the prompt; these features are concatenated and then input into the DiT blocks.

Rendering Content. Experimental results have demonstrated that using a ByT5 model alone to encode the features of rendered text, particularly long text, can lead to repeated characters and disordered layout generation, owing to the model's insufficient understanding of holistic semantics. To address this issue, we encode the glyph features of the rendered text with both the LLM text encoder and a ByT5 model. We then employ an MLP layer to project the ByT5 embeddings into a space that aligns with the features of the LLM text encoder. After splicing the LLM and ByT5 features, we send the complete text features to the DiT blocks for training. In contrast to other approaches that typically use both LLM features and OCR-rendered image features as conditions, our approach uses only textual features as conditions. This allows our model to maintain the same training and inference process as the original text-to-image generation, significantly reducing the complexity of the training and inference pipeline.

Rendering Features. The font, color, size, position, and other characteristics of the rendered text are described using a re-caption model, and the resulting description is encoded through the LLM text encoder. Traditional text rendering approaches [4, 18, 32] typically rely on a layout of preset text boxes as a conditional input to a diffusion model. For example, TextDiffuser-2 [4] employs an additional LLM for layout planning and encoding. In contrast, our approach directly describes the rendering features of the text through the re-caption model, allowing for end-to-end training. This enables our model to learn the rendering features of text effectively and directly from training data, and also makes it easier to control the rendering features of the text, enabling the creation of more sophisticated and high-quality text rendering outputs.

4 Model Post-Training

Our post-training process consists of multiple sequential phases: 1) Continuing Training (CT) and Supervised Fine-Tuning (SFT) stages remarkably enhance the aesthetics of the generated images; 2) Human Feedback Alignment (RLHF) improves overall performance using reward models and feedback learning algorithms; 3) Prompt Engineering (PE) further improves the performance on aesthetics and diversity by leveraging a fine-tuned LLM; 4) Finally, a refiner model is developed to scale up the resolution of an output image generated from our base model, and at the same time fix some minor structural
errors. The visualization results during different post-training stages are presented in Figure 11.

4.1 Continuing Training (CT)

Pre-trained diffusion models often struggle to produce images that meet the desired aesthetic criteria, due to the disparate aesthetic standards inherent in the pre-training datasets. To confront this challenge, we extend the training phase by transitioning to a smaller but higher-quality dataset. This continuing training (CT) phase is designed not only to markedly enhance the aesthetics of the generated images, but also to maintain fundamental performance on prompt-following and structural accuracy. The data of the CT stage consists of two parts.

Figure 11 Visualization during different post-training stages.

4.1.1 Data

• High-quality Pre-training Data: We filter a large number of high-quality images from our pre-training datasets; this selection is automatic, relying on quality models without any manual effort.
• Manually Curated Data: In addition to the high-quality data collected from pre-training datasets, we meticulously amass datasets with elevated aesthetic quality from diverse specific domains such as art, photography, and design. The images within these datasets are required to possess a certain aesthetic charm and align with the anticipated image generation outcomes. Following multiple rounds of refinement, a refined dataset comprising millions of manually cherry-picked images was constructed. To avoid overfitting to such a small dataset, we continually train our model by jointly using it with the selected high-quality pre-training data, with a reasonable sampling ratio.

4.1.2 Training Strategy

Directly performing CT on the aforementioned datasets can considerably improve performance in terms of aesthetics, but the generated images still exhibit a notable disparity from real images with appealing aesthetics. To further improve aesthetic performance, we introduce VMix ([34]), which enables our model to learn fine-grained aesthetic characteristics directly during the denoising process. We tag each image along various aesthetic dimensions, namely color, lighting, texture, and composition, and these tags are then used as supplementary conditions during our CT training process. Experimental results show that this method can further enhance the aesthetic appeal of the generated images.

4.2 Supervised Fine-Tuning (SFT)

4.2.1 Data

In the SFT stage, we further fine-tune our model toward generating high-fidelity images with excellent artistic beauty, using a small amount of carefully collected images. With these collected images, we specifically trained a caption model capable of precisely describing beauty and artistry through multi-round manual rectifications. Furthermore, we also assigned style labels and fine-grained aesthetic labels (used in the VMix approach) to these images, which ensures that information from the majority of mainstream genres is included.

4.2.2 Training Strategy

In addition to the constructed SFT data, we also include a certain amount of model-generated images, labeled as "negative samples", during SFT training. By combining them with real image samples, the model learns to discriminate between real and fake images, enabling it to generate more natural and realistic images, thereby enhancing the quality and authenticity of the generated results. Fine-tuning on data with high artistic standards can substantially enhance artistic beauty, but it inevitably degrades performance on image-text alignment, which is fundamental to the text-to-image generation task. To address this issue, we developed a data resampling algorithm that allows the model to enhance aesthetics while still maintaining image-text alignment capacity.

4.3 Human Feedback Alignment (RLHF)

In our work, we introduce a pioneering RLHF optimization procedure tailored for diffusion models ([14, 41, 42]), incorporating preference data, reward models (RMs), and feedback learning. As shown in Figure 12, the RLHF phase plays a pivotal role in enhancing the overall performance of our diffusion models in various aspects, including image-text alignment, aesthetics, structural correctness, text rendering, etc.

Figure 12 The reward curves show that the values across diverse reward models all exhibit a stable and consistent upward trend throughout the alignment process. Some visualization examples reveal that the human feedback alignment stage is crucial.

4.3.1 Preference Data

• Prompt System: We have developed a versatile prompt system for use in both the RM training and feedback learning phases. Our curated collection comprises 1 million multi-dimensional prompts sourced from training captions and user input. Through rigorous curation processes that filter out ambiguous or vague expressions, we guarantee a prompt system that is not only comprehensive but also rich in diversity and depth of content.
• RM Data Collection: We collect high-quality data for preference annotation, comprising images crafted by various trained models and data sources. Through the construction of a cross-version and cross-model annotation pipeline, we enhance the domain adaptability of the RMs and extend the upper threshold of their capability.
• Annotation Rules: In the annotation phase, we engage in multi-dimensional fusion annotation (covering image-text matching, text rendering, aesthetics, etc.). These integrated annotation procedures are designed to elevate the multi-dimensional capabilities of a single reward model, forestall deficiencies in the RLHF stage, and foster the achievement of Pareto optimality across all dimensions within RLHF.

4.3.2 Reward Model

• Model Architecture: We use a CLIP model that supports both Chinese and English as our RM. Leveraging the strong alignment capabilities of the CLIP model, we forgo the additional reward-head outputs used by methods like ImageReward, opting to utilize the output of the CLIP model as the reward itself. A ranking loss is primarily applied as the training loss of our RMs.
• Multi-aspect Reward Models: To enhance overall performance, we developed and trained three distinct RMs: an image-text alignment RM, an aesthetic RM, and a text-rendering RM. In particular, the text-rendering RM is selectively engaged when a prompt tag relates to text rendering, significantly improving the precision of character-level text generation.

4.3.3 Feedback Learning

• Learning Algorithm: We refine our diffusion model through direct optimization of output scores computed from multiple RMs, akin to the REFL ([36]) paradigm. Delving into various feedback learning algorithms such as DPO ([33]) and DDPO ([1]), our investigation revealed the advantages of our direct approach toward multi-reward optimization.
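Two pieces of the RLHF recipe in Section 4.3 lend themselves to a compact numerical sketch: the pairwise ranking loss used to train the reward models, and the gated combination of the three RMs, where the text-rendering RM contributes only for prompts tagged as text rendering. The weights, function names, and the specific logistic (Bradley-Terry) form of the ranking loss below are illustrative assumptions, not the paper's implementation.

```python
import math

# (1) Pairwise ranking loss for reward-model training: given RM scores for a
# human-preferred and a rejected image, penalize -log(sigmoid(r_w - r_l)).
def ranking_loss(score_preferred, score_rejected):
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# (2) Gated multi-RM aggregation: the alignment and aesthetic RMs always
# contribute; the text-rendering RM is engaged only for text-render prompts.
def total_reward(alignment_score, aesthetic_score, text_score,
                 is_text_prompt, weights=(1.0, 1.0, 1.0)):
    w_align, w_aes, w_text = weights  # illustrative equal weights
    reward = w_align * alignment_score + w_aes * aesthetic_score
    if is_text_prompt:
        reward += w_text * text_score
    return reward

# Correct ranking -> small loss; inverted ranking -> large loss.
loss_good = ranking_loss(2.0, -1.0)
loss_bad = ranking_loss(-1.0, 2.0)
print(loss_good < loss_bad)  # → True
```

In a REFL-style loop, `total_reward` would be evaluated on decoded samples and maximized directly by backpropagating through the reward models into the diffusion model; the gate keeps the text-rendering RM from distorting gradients on prompts where no text is requested.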