




PRIVACY ENHANCING TECHNOLOGY (PET):
PROPOSED GUIDE ON SYNTHETIC DATA GENERATION
Published 15 July 2024
Version Number 1.0
TABLE OF CONTENTS
I. Introduction to Privacy Enhancing Technology (PET)
II. Synthetic Data
  What is Synthetic Data?
  Under What Circumstances is Synthetic Data Useful?
  Case Studies
III. Recommendations
Annex A: Handbook on Key Considerations and Best Practices in Synthetic Data Generation
Annex B: Data Dictionary Format
Annex C: Examples of Methods of Synthetic Data Generation
Annex D: Re-identification Risks
Annex E: Examples of Approaches in Evaluation of Re-identification Risks
ACKNOWLEDGEMENTS
I. Introduction to Privacy Enhancing Technology (PET)
Privacy Enhancing Technologies (PETs) are a suite of tools and techniques that allow the processing, analysis, and extraction of insights from data without revealing the underlying personal or commercially sensitive data. By incorporating PETs, companies can maintain a competitive edge in the market through leveraging their existing data assets for innovation while complying with data protection regulations, reducing the risk of data breaches and demonstrating a commitment to data protection. PETs are not just a defensive measure; they are a proactive step towards fostering a culture of data protection and securing a company's reputation in the digital age.
PETs can generally be classified into three key categories [1]: data obfuscation, encrypted data processing, and federated analytics. PETs can also be combined to address the varying needs of organisations. Table 1 maps out the current types of PETs in the market and their key applications.
Table 1. Types of PETs and their applications
(Columns: Categories of PETs; PETs; Examples of applications (non-exhaustive))
Data obfuscation
  Anonymisation/pseudonymisation techniques
    • Secure storage
    • Data sharing and retention
    • Software testing
  Synthetic data generation
    • Privacy-preserving AI machine learning
    • Data sharing and analysis
    • Software testing
  Differential privacy
    • Expanding research opportunities
    • Data sharing
  Zero knowledge proofs
    • Verifying information without requiring disclosure (e.g., age verification)
Encrypted data processing
  Homomorphic encryption
    • Secure data stored in cloud
    • Computing on private data that is not disclosed
  Multi-party computation (including private set intersection)
    • Computing on private data that is not disclosed
  Trusted execution environments
    • Computing using models that need to remain private
    • Computing on private data that is not disclosed
Federated analytics
  Federated learning
    • Privacy-preserving AI machine learning
  Distributed analysis
[1] Adapted from OECD, “Emerging Privacy Enhancing Technologies: Current Regulatory and Policy Approaches,” OECD Digital Economy Papers (OECD, 2023).
II. Synthetic Data
This guide focuses on the use of synthetic data [2] to generate structured data. While synthetic data is generally fictitious data that may not be considered personal data on its own, it is not inherently risk-free due to possible re-identification risks [3]. As such, this guide proposes good practices that organisations may adopt to generate synthetic data to minimise such risks for a set of common use case archetypes. The guide also includes a set of good practices and risk assessments/considerations for generating synthetic data as well as governance controls, contractual process, and technical measures to mitigate residual risks.
The target audience for this guide comprises CIOs, CTOs, CDOs, data scientists, data protection practitioners, and technical decision-makers who may directly or indirectly be involved in the generation and use of synthetic data.
Synthetic data is a technology that is being actively researched and developed at the time of publication. Hence, this guide is not intended to provide a comprehensive or in-depth review of the technology or its assessment methods. The guide is intended to be a living document, and will be updated to ensure its recommendations remain relevant.
[2] There are two types of synthetic data: fully synthetic data and partially synthetic data. This guide discusses the use of fully synthetic data.
[3] In this guide, we generally refer to privacy risks as re-identification risks.
What is Synthetic Data?
Synthetic data is commonly referred to as artificial data that has been generated using a purpose-built mathematical model (including artificial intelligence (AI)/machine learning (ML) models) or algorithm. It can be derived by training a model (or algorithm) on a source dataset to mimic the characteristics and structure of the source data. Good quality synthetic data can retain the statistical properties and patterns of the source data to a high extent. As a result, performing analysis on synthetic data can produce results similar to those yielded with source data.
Characteristics of synthetic data
Figure 1 shows an example of what synthetic data may look like compared with the source data. Generated synthetic data will generally have different data points from the source data, as seen from the tabular data. However, the synthetic data will have statistical properties that are close to those of the source data, i.e., capturing the distribution and structure of the source data as seen from the trend lines in Figure 1.
Figure 1: Source data versus synthetic data. [4]
As such, synthetic data may not always be inherently risk-free, as information about an individual in the source dataset, or confidential data, can still be leaked due to the resemblance of the synthetic data to the source data. There will also be trade-offs [5] between data utility and data protection risks in synthetic data generation. However, such risks can be minimised by taking data protection into consideration during the synthetic data generation process.
[4] Diagram taken with modification from Khaled El Emam, Lucy Mosquera, and Richard Hoptroff, Practical Synthetic Data Generation (O’Reilly Media, Inc, 2020).
[5] Trade-off between data utility and data protection risks is further discussed in Annex A: Step 1 and Step 3 in this guide.
Under What Circumstances is Synthetic Data Useful?
Synthetic data can be used in a variety of use cases ranging from generating training datasets for AI models to data analysis and collaboration. The use of synthetic data can not only accelerate research, innovation, collaboration, and decision-making but also mitigate concerns about cybersecurity incidents and data breaches, enabling better compliance with data protection/privacy regulations. Table 2 discusses a few common use case archetypes, their key benefits, and good practices that organisations can focus on when generating synthetic data.
Table 2. Use case archetypes for synthetic data.
(Columns: Types of Use Cases; Key Benefits; Good Practices to Generate Synthetic Data)
Use case archetype 1: Generating training dataset for AI models
  Type of use case: Augmenting data for AI/ML models
  Key benefits:
    • Synthetic data addresses the challenge of the user having to obtain large volumes of labelled data needed for training and testing AI/ML models due to costs, legal regulations, and proprietary rights.
    • Augmenting training datasets with synthetically generated labelled data can be more cost-effective, especially when the source datasets are sparse.
  Type of use case: Increasing data diversity for AI/ML models
  Key benefits:
    • Synthetic data can be used to simulate rare events or augment under-represented groups in training AI models.
    • Diverse datasets can be useful in improving performance of AI/ML models.
  Good practices to generate synthetic data:
    • Add noise* to or reduce granularity of the synthetic data points.
    • Such fictitious new data points will generally not be considered personal data.
    * If the statistical properties/characteristics of the synthetic data are representative of the population in question and not significantly skewed towards a specific individual/group of individuals used as source training data, adding noise might not be necessary as re-identification risks are generally low.
Use case archetype 2: Data analysis and collaboration
  Type of use case: Data sharing and analysis
  Key benefits:
    • Underlying trends or patterns, and biases of the data are useful for data analytics regardless of whether the data source is real or synthetic.
    • Synthetic data can enable data sharing for analysis especially in industries and sectors, e.g., healthcare, where the source data can be sensitive.
  Type of use case: Previewing data for collaboration
  Key benefits:
    • Synthetic data can be used in data exploration, analysis, and collaboration to provide stakeholders with a representative preview of the source data without exposing sensitive information.
    • This enables stakeholders to explore and understand the structure, relationships, and potential insights within the data to gain assurance of the data quality before finalising any agreement or collaboration.
  Good practices to generate synthetic data:
    • Balance the trade-offs between data utility and data protection by incorporating data protection measures throughout the synthetic data generation process, for example:
      Data preparation
        • Remove outliers from source data
        • Pseudonymise source data
        • Employ data minimisation and generalise granular data
      Synthetic data generation
        • Add noise before or after synthetic data generation
      Post synthetic data generation
        • Incorporate technical, contractual, and governance measures to mitigate any residual re-identification risks
Use case archetype 3: Software testing
  Type of use case: System development/software testing
  Key benefits:
    • Organisations can use synthetic data instead of production data to facilitate software development.
    • Use of synthetic data can help organisations avoid data breaches in the event of the development environment being compromised.
  Good practices to generate synthetic data:
    • Focus on generating synthetic data that follows the semantics of the source data (e.g., format, min/max values, and categories) rather than its statistical characteristics and properties.
Refer to Annex A for proposed considerations and good practices to generate synthetic data.
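For the software testing archetype in Table 2, the following is a minimal sketch of generating test records that follow only the semantics of the source data (format, min/max values, categories) and not its statistical properties. The schema, its field names, and its limits are hypothetical and would in practice come from the organisation's own data dictionary.

```python
import random
import string

# Hypothetical schema for illustration only; field names and limits are assumptions.
SCHEMA = {
    "customer_id": {"type": "string", "length": 8},
    "age": {"type": "int", "min": 18, "max": 99},
    "balance": {"type": "float", "min": 0.0, "max": 50000.0},
    "country": {"type": "category", "values": ["SG", "MY", "ID"]},
}

ALPHABET = string.ascii_uppercase + string.digits


def generate_value(spec, rng):
    """Draw one value that satisfies the format and value constraints in `spec`."""
    kind = spec["type"]
    if kind == "int":
        return rng.randint(spec["min"], spec["max"])
    if kind == "float":
        return round(rng.uniform(spec["min"], spec["max"]), 2)
    if kind == "category":
        return rng.choice(spec["values"])
    if kind == "string":
        return "".join(rng.choices(ALPHABET, k=spec["length"]))
    raise ValueError(f"unsupported type: {kind}")


def generate_rows(schema, n, seed=0):
    """Generate n test records from the schema alone, without reading any source record."""
    rng = random.Random(seed)
    return [
        {name: generate_value(spec, rng) for name, spec in schema.items()}
        for _ in range(n)
    ]


rows = generate_rows(SCHEMA, 100)
```

Because the generator never reads a source record, its output carries the structure needed to exercise software but none of the trends in the production data, which is the stated focus for this use case.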
Case Studies
(A) Training AI model for fraud detection in the financial sector [6]
Problem: Since the number of fraudulent transactions in the source data was small compared to normal, non-fraudulent transactions, the source data did not train models very well for fraud detection.
Solution: J.P. Morgan successfully used synthetic data for fraud detection model training. AI models were provided with samples of normal and fraudulent transactions to understand the tell-tale signs of suspicious transactions.
Benefit: Synthetic data proved to be more effective in terms of training models to detect anomalous behaviour. This is because the synthetic data used was designed to contain a higher percentage of fraudulent transactions.
(B) Training AI model for research into AI bias [7]
Problem: Multi-label classification and regression models are frequently utilised at Mastercard for various applications, including fraud prevention, anti-money laundering and marketing use cases for portfolio optimisation. These models, while powerful, require careful attention to proxies of demographic attributes within their training data, from which they could learn unintended biases. Ensuring the accuracy and fairness of these models is complex due to their multi-label setting, the confidentiality of the demographic attributes, and the challenges in accessing the training dataset for model development.
Solution: Mastercard partnered with researchers to develop new AI bias testing methods adapted to multi-label settings. To protect the privacy of the data shared externally, synthetic data was created to support model training and methodological research into fair multi-label models.
Benefit: Synthetic data was measured to be sufficiently private to be shared with external researchers while capturing real relationships within the source data. Synthetic data enabled new insights that would not have been possible without the privacy-protecting characteristics inherent to synthetic data.
[6] J.P. Morgan, “Synthetic Data for Real Insights,” Technology Blog, n.d., /technology/technology-blog/synthetic-data-for-real-insights
[7] Contributed by Mastercard
(C) Safeguarding patient data for data analysis [8]
Problem: Prior to utilising synthetic data, Johnson & Johnson (J&J) allowed external researchers or consortia to access healthcare data for research proposals validated by J&J. To safeguard patient privacy, the data was transformed into anonymised healthcare data. However, feedback received indicated that the overall usefulness of the anonymised data, which relied on traditional anonymisation techniques, was not always satisfactory and did not always meet the requirements of the researchers or consortia.
Solution: J&J has introduced high-quality AI-generated synthetic data as an additional option to process their healthcare data.
Benefit: Researchers and clients have experienced significantly improved analysis. When employed properly, this form of synthetic data can effectively represent the target population and offer various analytical and scientific benefits.
(D) Facilitating data collaboration [9]
Problem: A pharmaceutical company wanted to purchase heart-related health data from a research institute to test out a new hypothesis. The health data, which was collected by the research institute from consenting subjects, was hosted under a highly regulated environment as required of the healthcare sector. However, such an environment presents significant challenges for many data engagement activities.
Solution: A*STAR was engaged by the pharmaceutical company to build a pipeline to create synthetic copies of the actual data, which can then be brought outside of this regulated environment.
Benefit: This allowed the pharmaceutical company to preview the data and be assured of the data quality prior to the high-value purchase and access to the actual data.
[8] Contributed by Johnson & Johnson (J&J)
[9] Contributed by A*STAR
III. Recommendations
Synthetic data has the potential to drive the growth of AI/ML by enabling AI model training while protecting the underlying personal data. It also addresses dataset-related challenges for AI model training, such as insufficient and biased data, through enabling the augmentation and increased diversity of training datasets.
In addition, synthetic data can be used to facilitate and support organisations’ data analytics, collaboration and software development needs. An added benefit of using synthetic data in place of production data to facilitate software development is that data breaches can be avoided in the event the development environment is compromised.
PDPC recommends a set of good practices and risk assessments/considerations for generating synthetic data, and for reducing any residual re-identification risks through governance controls, contractual process, and technical measures (refer to Annex A).
Annex A: Handbook on Key Considerations and Best Practices in Synthetic Data Generation
In this handbook, we describe the key considerations and best practices for organisations to reduce re-identification risks of synthetic tabular data through a five-step approach.
For any other complex synthetic datasets that are unstructured, organisations are advised to consider hiring synthetic data experts, data scientists or independent risk assessors to assess and mitigate the risks of the generated synthetic data.
Overview of five-step approach to generate synthetic data
Step 1: Know your data
Before embarking on any synthetic data project, it is necessary to have a clear understanding of the purpose and use cases of the synthetic data and the source data that the synthetic data is to mimic. This will help to determine whether use of synthetic data might be relevant and identify the possible risks of using the synthetic data. Some of the considerations may include:
• Where general trends/insights of source data are sensitive, organisations should note that the use of synthetic data will not offer any protection to the trends/insights since they will be replicated in the synthetic data.
• Where the synthetic data is intended to be released publicly, organisations may have to prioritise data protection over data utility.
• Where relevant, organisations should also impose proper contractual obligations on recipients of synthetic data to prevent re-identification attacks on the data.
With this knowledge, the management and data owner, with the help of relevant stakeholders such as the data analytics team, should establish objectives prior to synthetic data generation to determine an acceptable risk threshold [10] of the generated synthetic data and the expected utility of the data. This will help provide organisations with the appropriate benchmarks to assess any trade-offs between data protection risks and data utility.
These benchmarks may be adjusted appropriately to meet the business objectives, taking into consideration any trade-offs between data utility and data protection risks after the synthetic data generation process, as well as safeguards and controls to mitigate or lower any residual risks posed by the generated synthetic data. The acceptance criteria should be incorporated into the organisation's risk assessments (e.g., enterprise risk management framework [11], if applicable) or a Data Protection Impact Assessment (“DPIA”) [12].
Step 2: Prepare your data
When preparing the source data [13] for generating synthetic data, it is important to consider the following:
• What are the key insights that need to be preserved in the synthetic data?
• Which data attributes are necessary for the synthetic data to meet the business objectives?
[10] The re-identification risk threshold represents the level of re-identification risk that is acceptable for a given synthetic dataset. There is currently no universally accepted numerical value for risk threshold. For further details refer to Step 4 (Assess re-identification risks).
[11] Organisations may refer to ISO 27001 for more information on developing an enterprise risk management framework.
[12] An example of this is PDPC’s Guide to Data Protection Impact Assessments. A DPIA is applicable in the case where personal data is involved. The DPIA may not be relevant in situations where the synthetic data generation does not involve personal data processing.
[13] This step assumes that the source data has been properly cleaned (such as fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data) and is of acceptable quality for the generation of synthetic data.
Understanding key insights to be preserved
To ensure that the synthetic data can meet the business objectives, organisations need to understand and identify the trends, key statistical properties, and attribute relationships in the source data that need to be preserved for analysis, e.g., relationships between the demographic characteristics of a population and their health conditions.
Organisations should consider, at this point, whether outlier trends and insights need to be preserved to meet the business objectives. Key considerations could include the following:
• If outliers are not necessary to meet the business objectives and the risk of re-identification is high, organisations should consider removing the outliers. This can be done prior to synthetic data generation or at subsequent stages of the synthetic data generation.
• If the objective is to mimic the characteristics of the source data as closely as possible, including outliers, then the organisation may have to preserve the outlier trend/insight to meet the business objectives. In such instances, the organisation should note that the re-identification risks of individuals in the outlier data may be high and hence put in place risk mitigation measures.
• If the business objective is to balance the number of data points in different data categories, then the synthetic data generation process itself can help mitigate the issue of outliers simply by generating more outliers. For example, in a dataset, the number of outlier data points comprising male individuals may be balanced with outlier data points comprising female individuals.
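Where outliers are not needed for the business objectives, one possible way to flag them before generation is sketched below. The guide does not prescribe a detection rule; the interquartile-range (IQR) rule with a multiplier of 1.5 is an assumption chosen for illustration, and the records and the `income` field are made up.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits; values outside them are treated as outliers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(records, field, k=1.5):
    """Separate records into those kept and those flagged as outliers on `field`."""
    low, high = iqr_bounds([r[field] for r in records], k)
    kept = [r for r in records if low <= r[field] <= high]
    flagged = [r for r in records if not (low <= r[field] <= high)]
    return kept, flagged


# Made-up monthly incomes with one extreme value.
incomes = [3200, 3400, 3500, 3600, 3700, 3900, 4100, 4300, 250000]
records = [{"id": i, "income": v} for i, v in enumerate(incomes)]
kept, flagged = split_outliers(records, "income")
```

Keeping the flagged records in a separate list, instead of deleting them silently, lets the data owner review whether each one really is unnecessary for the business objectives.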
Selecting data attributes
Based on the key insights needed, organisations should apply data minimisation to extract only the relevant data attributes from the source data. Thereafter, remove or pseudonymise all direct identifiers [14] from the extracted data.
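The two operations in this paragraph, data minimisation followed by removal or pseudonymisation of direct identifiers, can be sketched as follows. This is an illustration only: the record, the field names, and the choice of a keyed hash (HMAC-SHA256) as the pseudonymisation technique are assumptions, and the key shown is a placeholder.

```python
import hashlib
import hmac

# Assumption: in practice the key would come from a secrets manager, not source code.
SECRET_KEY = b"replace-with-a-managed-secret"


def pseudonymise(value, key=SECRET_KEY):
    """Replace an identifier with a keyed hash (HMAC-SHA256) so it cannot be read directly."""
    return hmac.new(key, str(value).encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_record(record, keep, pseudonymise_fields=()):
    """Keep only the attributes in `keep`; pseudonymise those named in `pseudonymise_fields`."""
    prepared = {}
    for field in keep:
        value = record[field]
        prepared[field] = pseudonymise(value) if field in pseudonymise_fields else value
    return prepared


source_record = {
    "name": "Jane Doe",          # direct identifier, removed
    "patient_id": "P-000123",    # direct identifier, pseudonymised
    "age": 47,
    "diagnosis": "hypertension",
    "favourite_colour": "blue",  # not needed for the analysis, removed
}
prepared = prepare_record(
    source_record,
    keep=["patient_id", "age", "diagnosis"],
    pseudonymise_fields={"patient_id"},
)
```

A keyed hash maps the same identifier to the same pseudonym, so records belonging to one individual can still be linked, while anyone without the key cannot recompute the mapping from a list of candidate identifiers.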
Where granular information is not necessary, organisations may generalise or further add noise to the data at this point or at a later step to reduce the risk of re-identification. For example, organisations can generalise exact height and weight information into height and weight bands to reduce the possibility of height and weight combinations being used to identify any outliers.
[14] Refer to PDPC’s Guide to Basic Anonymisation on how to identify direct identifiers in a dataset.
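The height and weight example can be sketched as a simple banding function. The band widths (10 cm and 5 kg) and the sample record are assumptions for illustration; suitable widths depend on how much detail the business objectives require.

```python
def to_band(value, width):
    """Map an exact measurement to a half-open band [lower, lower + width)."""
    lower = int(value // width) * width
    return f"[{lower}, {lower + width})"


# Made-up record with exact measurements.
record = {"height_cm": 173, "weight_kg": 68.4}
generalised = {
    "height_band_cm": to_band(record["height_cm"], 10),
    "weight_band_kg": to_band(record["weight_kg"], 5),
}
```

After banding, every individual between 170 cm and just under 180 cm shares the same height value, so an unusual height and weight combination is less likely to single out one person.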
Organisations should also standardise and document the details of each data attribute (such as data definitions, standards, metrics, etc.) in a data dictionary. This enables the organisation to subsequently validate the integrity of the generated synthetic data to detect anomalies and fix any data inconsistencies. Refer to the following checklist in Table 3 for key considerations.
Table 3: Checklist for data preparation
Data Preparation Checklist
Understand key insights
  i. Identify trends and entity relationships to be preserved for synthetic data generation.
  ii. Remove outliers if such trends/insights are not necessary. This can be performed post generation.
Select data attributes
  iii. Apply data minimisation to select only data attributes that are necessary to meet business needs.
  iv. Remove or pseudonymise direct identifiers (e.g., name, national identification numbers).
  v. Generalise granular data or add noise (e.g., using differential privacy [15]) to the data/model if such detailed information is not necessary. This can also be performed post generation.
  vi. Standardise and document format, constraints, and categories of source data in data dictionary (refer to Annex B for a reference template):
    Format
    • Standardise strings to lower or proper case
    • Data types, column names, structures, relationships
    • Frequency of data record
    Constraints
    • Constraints of values for each data type, e.g., min-max values, non-negative values, non-null values
    Category
    • Types of data categories
    • Expected or valid values for data attributes within each data category. Example of a data category is “country”.
[15] The use of differential privacy to add noise to synthetic data is widely discussed as a mechanism to reduce re-identification risks. However, there is currently no universal standard on how to implement differential privacy. Moreover, the noise added may also reduce the utility of the synthetic data, making it less accurate or useful for certain types of analysis.
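To illustrate the trade-off described in the differential privacy footnote, the sketch below applies the Laplace mechanism, one well-known way of adding calibrated noise, to a single counting query. It is not a differentially private synthetic data generator, and the guide does not prescribe this or any other implementation. The count of 120 and the epsilon values are assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Laplace mechanism for a count: noise scale is sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(values, truth):
    """Average distance between the noisy answers and the true answer."""
    return sum(abs(v - truth) for v in values) / len(values)


rng = random.Random(7)
true_count = 120  # hypothetical number of records in one category

# Smaller epsilon means stronger protection and therefore more noise.
strong_privacy = [noisy_count(true_count, 0.1, rng) for _ in range(2000)]
weak_privacy = [noisy_count(true_count, 1.0, rng) for _ in range(2000)]
```

A count has sensitivity 1 because adding or removing one individual changes it by at most 1. Comparing `mean_abs_error(strong_privacy, true_count)` with `mean_abs_error(weak_privacy, true_count)` shows the loss of accuracy that accompanies the stronger setting.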
Step 3: Generate synthetic data
There are many different methods [16] to generate synthetic data, for example, sequential tree-based synthesisers, copulas, and deep generative models (DGMs). Organisations need to consider which methods are most appropriate, based on their use cases, data objectives, and types of data. Please refer to Annex C for more information on these synthetic data generation methods. Thereafter, organisations may consider splitting the source data into two separate sets, e.g., 80% as training dataset and 20% as control dataset [17], for assessing re-identification risks of the synthetic data.
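The 80%/20% split mentioned above can be sketched as a shuffled partition. The fixed seed and the toy records are assumptions; the point is that the control dataset is set aside before generation and is never shown to the generator.

```python
import random


def split_source(records, control_fraction=0.2, seed=42):
    """Shuffle a copy of the records and return (training, control) lists."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_control = round(len(shuffled) * control_fraction)
    return shuffled[n_control:], shuffled[:n_control]


# Made-up source records identified only by a running number.
records = [{"id": i} for i in range(1000)]
training, control = split_source(records)
```

Only `training` would be passed to the chosen synthesiser; `control` is held back for the later assessment of re-identification risks.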
After generating synthetic data, it is a good practice for organisations to perform the following checks on the quality of the generated synthetic data:
• Data integrity
• Data fidelity
• Data utility
Data integrity
Data integrity ensures the accuracy, completeness, consistency, and validity of the synthetic data as compared with the source data. Organisations can validate the integrity of the generated synthetic data against the data dictionary of the source data.
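A minimal sketch of such an integrity check follows. The dictionary layout here is hypothetical and is not the Annex B template; it only records a type, value constraints, and valid categories per attribute, and the sample records are made up.

```python
# Hypothetical data dictionary; attribute names and limits are assumptions.
DATA_DICTIONARY = {
    "age": {"type": int, "min": 0, "max": 120, "nullable": False},
    "country": {"type": str, "allowed": {"SG", "MY", "ID"}, "nullable": False},
    "income": {"type": float, "min": 0.0, "nullable": True},
}


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of integrity problems found in one synthetic record."""
    errors = []
    for name, rule in dictionary.items():
        value = record.get(name)
        if value is None:
            if not rule.get("nullable", False):
                errors.append(f"{name}: missing value")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: above maximum {rule['max']}")
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{name}: value not in allowed categories")
    for name in record:
        if name not in dictionary:
            errors.append(f"{name}: attribute not in data dictionary")
    return errors


valid_record = {"age": 34, "country": "SG", "income": 4200.0}
invalid_record = {"age": 150, "country": "XX", "income": None, "nickname": "x"}
valid_errors = validate_record(valid_record)
invalid_errors = validate_record(invalid_record)
```

Running the check over every generated record and counting the failures gives a simple integrity figure that can be tracked between generation runs.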
Data fidelity
Data fidelity examines whether synthetic data closely follows the characteristics and statistical attributes of the source data. Several metrics exist for measuring data fidelity; they are typically computed by statistically comparing the generated synthetic data directly with the source data. Organisations should use the performance metric(s) for data fidelity [18] (see Table 4) that best meet their data objectives.
[16] This guide may not comprehensively cover all other synthetic data generation methods, such as Bayesian models and variational autoencoders (VAEs).
[17] Refer to Approach 2 in Annex E for more details on the assessment and evaluation framework for quantifying re-identification risk.
[18] There are other generic metrics in addition to those listed in Table 4. See Khaled El Emam et al., “Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study,” JMIR Medical Informatics 10, no. 4 (2022).
Table 4: Performance metrics for data fidelity
Performance metrics generally used for assessing data fidelity
Histogram-based similarity
  Measures the similarity between source and synthetic data’s distributions through a histogram comparison of each feature. This ensures the synthetic data preserves important statistical properties such as central tendency (mean, median), dispersion (variance, range), and distribution shape (skewness, kurtosis).
Correlational similarity
  Measures the preservation of relationships between features in the source and synthetic datasets. For example, if higher education typically leads to higher income in the source data, this pattern should also be evident in synthetic data.
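The two metrics in Table 4 can be computed in several ways; the guide does not fix a formula. The sketch below assumes histogram intersection over shared bins for the first metric and the absolute difference in Pearson correlation for the second. The education and income figures are made up.

```python
import math


def histogram_similarity(source, synthetic, bins=10):
    """Histogram intersection on shared bins: 1.0 means identical binned distributions."""
    lo = min(min(source), min(synthetic))
    hi = max(max(source), max(synthetic))
    if hi == lo:
        return 1.0
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    p, q = proportions(source), proportions(synthetic)
    return sum(min(a, b) for a, b in zip(p, q))


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_gap(source_x, source_y, synth_x, synth_y):
    """Absolute difference between source and synthetic correlations (0 is best)."""
    return abs(pearson(source_x, source_y) - pearson(synth_x, synth_y))


# Made-up years of education and income (in thousands).
source_edu, source_inc = [10, 12, 14, 16, 18], [30, 35, 42, 50, 61]
synth_edu, synth_inc = [11, 12, 15, 16, 17], [31, 36, 44, 49, 58]

edu_similarity = histogram_similarity(source_edu, synth_edu, bins=4)
gap = correlation_gap(source_edu, source_inc, synth_edu, synth_inc)
```

In this toy data both datasets show a strong positive relationship between education and income, so the correlation gap is small, which is the pattern the correlational similarity row asks to see preserved.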
Data utility
Data utility refers to how well synthetic data can replace or add to source data for the specific data objective of the organisation.
There are different approaches to evaluate the utility of synthetic data. The true test of utility is how it performs in real-world tasks. One common approach to check this is by training identical AI/ML models on synthetic and training data.
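As a rough sketch of that approach, the example below fits the same simple model once on source data and once on synthetic data, then compares their errors. Evaluating both on a held-out set of source-like records is an assumption of this sketch, as are the toy datasets, which are simulated here and stand in for real source data and generator output.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of a fitted line on (xs, ys)."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def toy_dataset(n, noise, rng):
    """Simulated data following y = 2x + 1 plus Gaussian noise (illustration only)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0, noise) for x in xs]
    return xs, ys


rng = random.Random(0)
source_x, source_y = toy_dataset(200, 1.0, rng)
synth_x, synth_y = toy_dataset(200, 1.2, rng)    # stand-in for generator output
holdout_x, holdout_y = toy_dataset(100, 1.0, rng)

error_source = mean_abs_error(fit_line(source_x, source_y), holdout_x, holdout_y)
error_synth = mean_abs_error(fit_line(synth_x, synth_y), holdout_x, holdout_y)
utility_gap = abs(error_synth - error_source)
```

A small `utility_gap` suggests the synthetic data supports this particular task about as well as the source data; what counts as small enough depends on the utility benchmark the organisation set in Step 1.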