Singapore: Guide on Synthetic Data Generation


PRIVACY ENHANCING TECHNOLOGY (PET):

PROPOSED GUIDE ON SYNTHETIC DATA GENERATION

Published 15 July 2024, Version Number 1.0


TABLE OF CONTENTS

I. Introduction to Privacy Enhancing Technology (PET)

II. Synthetic Data

What is Synthetic Data?

Under What Circumstances is Synthetic Data Useful?

Case Studies

III. Recommendations

Annex A: Handbook on Key Considerations and Best Practices in Synthetic Data Generation

Annex B: Data Dictionary Format

Annex C: Examples of Methods of Synthetic Data Generation

Annex D: Re-identification Risks

Annex E: Examples of Approaches in Evaluation of Re-identification Risks

ACKNOWLEDGEMENTS

I. Introduction to Privacy Enhancing Technology (PET)

Privacy Enhancing Technologies (PETs) are a suite of tools and techniques that allow the processing, analysis, and extraction of insights from data without revealing the underlying personal or commercially sensitive data. By incorporating PETs, companies can maintain a competitive edge in the market through leveraging their existing data assets for innovation while complying with data protection regulations, reducing the risk of data breaches and demonstrating a commitment to data protection. PETs are not just a defensive measure; they are a proactive step towards fostering a culture of data protection and securing a company's reputation in the digital age.

PETs can generally be classified into three key categories¹: data obfuscation, encrypted data processing, and federated analytics. PETs can also be combined to address varying needs of organisations. The following Table 1 maps out the current types of PETs in the market and their key applications.

Table 1. Types of PETs and their applications

Each category of PETs is listed with its types of PETs and examples of applications (non-exhaustive).

Category: Data obfuscation

  Anonymisation/pseudonymisation techniques
  • Secure storage
  • Data sharing and retention
  • Software testing

  Synthetic data generation
  • Privacy-preserving AI machine learning
  • Data sharing and analysis
  • Software testing

  Differential privacy
  • Expanding research opportunities
  • Data sharing

  Zero knowledge proofs
  • Verifying information without requiring disclosure (e.g., age verification)

Category: Encrypted data processing

  Homomorphic encryption
  • Secure data stored in cloud
  • Computing on private data that is not disclosed

  Multi-party computation (including private set intersection)
  • Computing on private data that is not disclosed

  Trusted execution environments
  • Computing using models that need to remain private
  • Computing on private data that is not disclosed

Category: Federated analytics

  Federated learning
  • Privacy-preserving AI machine learning

  Distributed analysis

¹ Adapted from OECD, “Emerging Privacy Enhancing Technologies: Current Regulatory and Policy Approaches,” OECD Digital Economy Papers (OECD, 2023).

II. Synthetic Data

This guide focuses on the use of synthetic data² to generate structured data. While synthetic data is generally fictitious data that may not be considered personal data on its own, it is not inherently risk-free due to possible re-identification risks³.

As such, this guide proposes good practices that organisations may adopt to generate synthetic data to minimise such risks for a set of common use case archetypes. The guide also includes a set of good practices and risk assessments/considerations for generating synthetic data as well as governance controls, contractual process, and technical measures to mitigate residual risks.

The target audience for this guide is CIOs, CTOs, CDOs, data scientists, data protection practitioners, and technical decision-makers who may directly or indirectly be involved in the generation and use of synthetic data.

Synthetic data is a technology that is being actively researched and developed at the time of publication. Hence, this guide is not intended to provide a comprehensive or in-depth review of the technology or its assessment methods. The guide is intended to be a living document, and will be updated to ensure its recommendations remain relevant.

² There are two types of synthetic data: fully synthetic data and partially synthetic data. This guide discusses the use of fully synthetic data.

³ In this guide, we generally refer to privacy risks as re-identification risks.

What is Synthetic Data?

Synthetic data is commonly referred to as artificial data that has been generated using a purpose-built mathematical model (including artificial intelligence (AI)/machine learning (ML) models) or algorithm. It can be derived by training a model (or algorithm) on a source dataset to mimic the characteristics and structure of the source data. Good quality synthetic data can retain the statistical properties and patterns of the source data to a high extent. As a result, performing analysis on synthetic data can produce results similar to those yielded with source data.
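To make the idea of training a model on source data and then sampling from it more concrete, the sketch below fits a deliberately simple two-attribute model and draws fresh records from it. It is an illustration only: the attributes (age and systolic blood pressure), the simulated "source" data, the linear-Gaussian model, and the function names `fit_model` and `sample_synthetic` are all assumptions made for this example, not a method prescribed by this guide. Practical generators are considerably more sophisticated.

```python
import random
import statistics


def fit_model(rows):
    """Fit a toy model: x ~ Normal(mean_x, sd_x), y = a + b*x + Normal noise."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    # Ordinary least-squares fit of y on x captures the x-y relationship.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in rows]
    return {
        "mean_x": mean_x,
        "sd_x": statistics.stdev(xs),
        "slope": slope,
        "intercept": intercept,
        "sd_residual": statistics.stdev(residuals),
    }


def sample_synthetic(model, n, rng):
    """Draw n new (x, y) records from the fitted model."""
    records = []
    for _ in range(n):
        x = rng.gauss(model["mean_x"], model["sd_x"])
        noise = rng.gauss(0.0, model["sd_residual"])
        records.append((x, model["intercept"] + model["slope"] * x + noise))
    return records


rng = random.Random(0)

# Hypothetical source data: (age, systolic blood pressure) for 500 people.
source = []
for _ in range(500):
    age = rng.uniform(20, 80)
    source.append((age, 90 + 0.6 * age + rng.gauss(0, 5)))

model = fit_model(source)
synthetic = sample_synthetic(model, 500, rng)

print("source mean age:   ", round(statistics.fmean(x for x, _ in source), 1))
print("synthetic mean age:", round(statistics.fmean(x for x, _ in synthetic), 1))
print("fitted slope:      ", round(model["slope"], 2))
```

The synthetic records are new data points, yet summary statistics such as the mean age and the age/blood-pressure relationship stay close to those of the source, which is the property described above.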

Characteristics of synthetic data

Figure 1 shows an example of what synthetic data may look like compared with the source data. Generated synthetic data will generally have different data points from the source data, as seen from the tabular data. However, the synthetic data will have statistical properties that are close to those of the source data, i.e., capturing the distribution and structure of the source data as seen from the trend lines in Figure 1.

Figure 1: Source data versus synthetic data⁴

As such, synthetic data may not always be inherently risk-free as information about an individual in the source dataset, or confidential data, can still be leaked due to the resemblance of the synthetic data to the source data. There will also be trade-offs⁵ between data utility and data protection risks in synthetic data generation. However, such risks can be minimised by taking data protection into consideration during the synthetic data generation process.

⁴ Diagram taken with modification from Khaled El Emam, Lucy Mosquera, and Richard Hoptroff, Practical Synthetic Data Generation (O’Reilly Media, Inc, 2020).

⁵ Trade-off between data utility and data protection risks is further discussed in Annex A: Step 1 and Step 3 in this guide.

Under What Circumstances is Synthetic Data Useful?

Synthetic data can be used in a variety of use cases ranging from generating training datasets for AI models to data analysis and collaboration. The use of synthetic data not only can accelerate research, innovation, collaboration, and decision-making but also mitigate concerns about cybersecurity incidents and data breaches, enabling better compliance with data protection/privacy regulations. Table 2 discusses a few common use case archetypes, their key benefits, and good practices that organisations can focus on when generating synthetic data.

Table 2. Use case archetypes for synthetic data

Each type of use case is listed with its key benefits and, where given, good practices to generate synthetic data.

Use case archetype 1: Generating training dataset for AI models

Type of use case: Augmenting data for AI/ML models

  Key benefits
  • Synthetic data addresses the challenge of the user having to obtain large volumes of labelled data needed for training and testing AI/ML models due to costs, legal regulations, and proprietary rights.
  • Augmenting training datasets with synthetically generated labelled data can be more cost-effective, especially when the source datasets are sparse.

  Good practices to generate synthetic data
  • Add noise* to or reduce granularity of the synthetic data points.
  • Such fictitious new data points will generally not be considered personal data.

  * If the statistical properties/characteristics of the synthetic data are representative of the population in question and not significantly skewed towards a specific individual/group of individuals used as source training data, adding of noise might not be necessary as re-identification risks are generally low.

Type of use case: Increasing data diversity for AI/ML models

  Key benefits
  • Synthetic data can be used to simulate rare events or augment under-represented groups in training AI models.
  • Diverse datasets can be useful in improving performance of AI/ML models.

Use case archetype 2: Data analysis and collaboration

Type of use case: Data sharing and analysis

  Key benefits
  • Underlying trends or patterns, and biases of the data are useful for data analytics regardless of whether the data source is real or synthetic.
  • Synthetic data can enable data sharing for analysis especially in industries and sectors, e.g., healthcare, where the source data can be sensitive.

  Good practices to generate synthetic data
  • Balance the trade-offs between data utility and data protection by incorporating data protection measures throughout the synthetic data generation process, for example:

    Data preparation
    • Remove outliers from source data
    • Pseudonymise source data
    • Employ data minimisation and generalise granular data

    Synthetic data generation
    • Add noise before or after synthetic data generation

    Post synthetic data generation
    • Incorporate technical, contractual, and governance measures to mitigate any residual re-identification risks

Type of use case: Previewing data for collaboration

  Key benefits
  • Synthetic data can be used in data exploration, analysis, and collaboration to provide stakeholders with a representative preview of the source data without exposing sensitive information.
  • This enables stakeholders to explore and understand the structure, relationships, and potential insights within the data to gain assurance of the data quality before finalising any agreement or collaboration.

Use case archetype 3: Software testing

Type of use case: System development/software testing

  Key benefits
  • Organisations can use synthetic data instead of production data to facilitate software development.
  • Use of synthetic data can help organisations avoid data breaches in the event of the development environment being compromised.

  Good practices to generate synthetic data
  • Focus on generating synthetic data that follows the semantics (e.g., format, min/max values and categories) of source data instead of the statistical characteristics and properties.

Refer to Annex A for proposed considerations and good practices to generate synthetic data.
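For the software testing archetype, the good practice in Table 2 is to follow the semantics of the source data (format, min/max values, categories) rather than its statistical properties. A minimal sketch of that idea is shown below. The schema, attribute names, and the `generate_record` helper are hypothetical and chosen only for illustration; values are drawn uniformly at random, so no statistical characteristics of any real data are reproduced.

```python
import random

# Hypothetical schema: format, min/max values, and categories per attribute.
SCHEMA = {
    "customer_id": {"type": "id", "prefix": "C", "digits": 6},
    "age": {"type": "int", "min": 18, "max": 99},
    "balance": {"type": "float", "min": 0.0, "max": 50000.0},
    "account_type": {"type": "category", "values": ["savings", "current", "fixed"]},
}


def generate_record(schema, rng):
    """Generate one test record that respects the schema's semantics only."""
    record = {}
    for name, spec in schema.items():
        kind = spec["type"]
        if kind == "id":
            digits = "".join(rng.choice("0123456789") for _ in range(spec["digits"]))
            record[name] = spec["prefix"] + digits
        elif kind == "int":
            record[name] = rng.randint(spec["min"], spec["max"])
        elif kind == "float":
            record[name] = round(rng.uniform(spec["min"], spec["max"]), 2)
        elif kind == "category":
            record[name] = rng.choice(spec["values"])
        else:
            raise ValueError(f"unknown attribute type: {kind}")
    return record


rng = random.Random(42)
test_records = [generate_record(SCHEMA, rng) for _ in range(100)]
print(test_records[0])
```

Because the records are generated from the schema alone, they can exercise input validation and data flows in a development environment without any production data being present.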

Case Studies

(A) Training AI model for fraud detection in the financial sector⁶

Problem: Since the number of fraudulent transactions in the source data is small compared to normal, non-fraudulent transactions, the source data did not train models very well for fraud detection.

Solution: J.P. Morgan successfully used synthetic data for fraud detection model training. AI models were provided with samples of normal and fraudulent transactions to understand the tell-tale signs of suspicious transactions.

Benefit: Synthetic data proved to be more effective in terms of training models to detect anomalous behaviour. This is because the synthetic data used was designed to contain a higher percentage of fraudulent transactions.

(B) Training AI model for research into AI bias⁷

Problem: Multi-label classification and regression models are frequently utilised at Mastercard for various applications, including fraud prevention, anti-money laundering and marketing use cases for portfolio optimisation. These models, while powerful, require careful attention to proxies of demographic attributes within their training data, which could lead the models to learn unintended biases. Ensuring the accuracy and fairness of these models is complex due to their multi-label setting, the confidentiality of the demographic attributes, and the challenges in accessing the training dataset for model development.

Solution: Mastercard partnered with researchers to develop new AI bias testing methods adapted to multi-label settings. To protect the privacy of the data shared externally, synthetic data was created to support model training and methodological research into fair multi-label models.

Benefit: Synthetic data was measured to be sufficiently private to be shared with external researchers while capturing real relationships within the source data. Synthetic data enabled new insights that would not have been possible without the privacy protecting characteristics inherent to synthetic data.

⁶ J.P. Morgan, “Synthetic Data for Real Insights,” Technology Blog, n.d., technology/technology-blog/synthetic-data-for-real-insights

⁷ Contributed by Mastercard

(C) Safeguarding patient data for data analysis⁸

Problem: Prior to utilising synthetic data, Johnson & Johnson (J&J) allowed external researchers or consortia to access healthcare data for research proposals validated by J&J. To safeguard patient privacy, the data was transformed into anonymised healthcare data. However, feedback received indicated that the overall usefulness of the anonymised data, which relied on traditional anonymisation techniques, was not always satisfactory and did not always meet the requirements of the researchers or consortia.

Solution: J&J has introduced high-quality AI generated synthetic data as an additional option to process their healthcare data.

Benefit: Researchers and clients have experienced significantly improved analysis. When employed properly, this form of synthetic data can effectively represent the target population and offer various analytical and scientific benefits.

(D) Facilitating data collaboration⁹

Problem: A pharmaceutical company wanted to purchase heart-related health data from a research institute to test out a new hypothesis. The health data, which was collected by the research institute from consenting subjects, was hosted under a highly regulated environment as required of the healthcare sector. However, this presents significant challenges for many data engagement activities.

Solution: A*STAR was engaged by the pharmaceutical company to build a pipeline to create synthetic copies of the actual data, which can then be brought outside of this regulated environment.

Benefit: This allowed the pharmaceutical company to preview the data and be assured of the data quality prior to the high-value purchase and access to the actual data.

⁸ Contributed by Johnson & Johnson (J&J)

⁹ Contributed by A*STAR

III. Recommendations

Synthetic data has the potential to drive the growth of AI/ML by enabling AI model training while protecting the underlying personal data. It also addresses dataset related challenges for AI model training, such as insufficient and biased data, through enabling the augmentation and increased diversity of training datasets.

In addition, synthetic data can be used to facilitate and support organisations’ data analytics, collaboration and software development needs. An added benefit of using synthetic data in place of production data to facilitate software development is that data breaches can be avoided in the event the development environment is compromised.

PDPC recommends a set of good practices and risk assessments/considerations for generating synthetic data and for reducing any residual risks from re-identification through governance controls, contractual process, and technical measures (refer to Annex A).

Annex A: Handbook on Key Considerations and Best Practices in Synthetic Data Generation

In this handbook, we describe the key considerations and best practices for organisations to reduce re-identification risks of synthetic tabular data through a five-step approach.

For any other complex synthetic datasets that are unstructured, organisations are advised to consider hiring synthetic data experts, data scientists or independent risk assessors to assess and mitigate the risks of the generated synthetic data.

Overview of five-step approach to generate synthetic data

Step 1: Know your data

Before embarking on any synthetic data project, it is necessary to have a clear understanding of the purpose and use cases of the synthetic data and the source data that the synthetic data is to mimic. This will help to determine whether use of synthetic data might be relevant and identify the possible risks of using the synthetic data. Some of the considerations may include:

• Where general trends/insights of source data are sensitive, organisations should take note that the use of synthetic data will not offer any protection to the trends/insights since they will be replicated in the synthetic data.

• Where the synthetic data is intended to be released publicly, organisations may have to prioritise data protection over data utility in such circumstances.

• Where relevant, organisations should also put in place proper contractual obligations on recipients of synthetic data to prevent re-identification attacks on the data.

With this knowledge, the management and data owner, with the help of relevant stakeholders such as the data analytics team, should establish objectives prior to synthetic data generation to determine an acceptable risk threshold¹⁰ of the generated synthetic data and the expected utility of the data. This will help provide organisations with the appropriate benchmarks to assess any trade-offs between data protection risks and data utility.

These benchmarks may be adjusted appropriately to meet the business objectives, taking into consideration any trade-offs between data utility and data protection risks after the synthetic data generation process, as well as safeguards and controls to mitigate or lower any residual risks posed by the generated synthetic data. The acceptance criteria should be incorporated into the organisation's risk assessments (e.g., enterprise risk management framework¹¹ if applicable) or a Data Protection Impact Assessment (“DPIA”)¹².

Step 2: Prepare your data

When preparing the source data¹³ for generating synthetic data, it is important to consider the following:

• What are the key insights that need to be preserved in the synthetic data?

• Which are the necessary data attributes for the synthetic data to meet the business objectives?

¹⁰ The re-identification risk threshold represents the level of re-identification risk that is acceptable for a given synthetic dataset. There is currently no universally accepted numerical value for risk threshold. For further details refer to Step 4 (Assess re-identification risks).

¹¹ Organisations may refer to ISO 27001 for more information on developing an enterprise risk management framework.

¹² An example of this is PDPC’s Guide to Data Protection Impact Assessments. A DPIA is applicable in the case where personal data is involved. The DPIA may not be relevant in situations where the synthetic data generation does not involve personal data processing.

¹³ This step assumes that the source data has been properly cleaned (such as fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data) and is of acceptable quality for the generation of synthetic data.

Understanding key insights to be preserved

To ensure that the synthetic data can meet the business objectives, organisations need to understand and identify the trends, key statistical properties, and attribute-relationships in the source data that need to be preserved for analysis, e.g., identify relationships between demographic characteristics of population and their health conditions.

Organisations should consider, at this point, whether outlier trends and insights need to be preserved for the business objectives. Key considerations could include the following:

• If outliers are not necessary to meet the business objectives and the risk of re-identification is high, organisations should consider removing the outliers. This can be done prior to synthetic data generation or at subsequent stages of the synthetic data generation.

• If the objective is to mimic the characteristics of the source data as closely as possible, including outliers, then the organisation may have to preserve the outlier trend/insight to meet the business objectives. In such instance, the organisation should note that the re-identification risks of individuals in the outlier data may be high and hence put in place risk mitigation measures.

• If the business objective is to balance the number of data points in different data categories, then the synthetic data generation process itself can help mitigate the issue of outliers simply by generating more outliers. For example, in a dataset, the number of outlier data points comprising male individuals may be balanced with outlier data points comprising female individuals.
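Where an organisation decides that outliers are not needed, one simple way to flag them before generation is an interquartile-range (IQR) fence. The sketch below is an assumption for illustration: the guide does not prescribe a particular outlier rule, and the function name `remove_outliers_iqr`, the multiplier `k = 1.5`, and the income figures are all hypothetical.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Split values into (kept, removed) using the IQR fence rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if not (low <= v <= high)]
    return kept, removed


# Hypothetical monthly incomes; the last value is an extreme outlier.
incomes = [3200, 3500, 3700, 4000, 4100, 4300, 4500, 4800, 5000, 250000]
kept, removed = remove_outliers_iqr(incomes)
print("kept:", kept)
print("removed:", removed)
```

Records flagged this way should be reviewed against the business objectives before being dropped, since the second and third considerations above describe cases where outliers need to be preserved or balanced instead.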

Selecting data attributes

Based on the key insights needed, organisations should apply data minimisation to extract only the relevant data attributes from the source data. Thereafter, remove or pseudonymise all direct identifiers¹⁴ from the extracted data.
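The two actions described here, data minimisation followed by removal or pseudonymisation of direct identifiers, can be sketched as follows. This is an illustrative assumption rather than a technique mandated by the guide: it uses a keyed hash (HMAC-SHA256) as the pseudonymisation function, and the key, attribute names, record values, and helper names `pseudonymise` and `prepare_record` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be generated and stored securely.
SECRET_KEY = b"replace-with-a-securely-stored-key"


def pseudonymise(value, key=SECRET_KEY):
    """Replace an identifier with a keyed hash (truncated to 16 hex characters)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def prepare_record(record, needed, direct_identifiers):
    """Keep only the needed attributes and pseudonymise direct identifiers."""
    prepared = {}
    for attr in needed:
        value = record[attr]
        if attr in direct_identifiers:
            prepared[attr] = pseudonymise(str(value))
        else:
            prepared[attr] = value
    return prepared


# Hypothetical source record with more attributes than the use case needs.
record = {
    "name": "Example Person",
    "national_id": "S0000000X",
    "age": 47,
    "height_cm": 171,
    "favourite_colour": "blue",
}

prepared = prepare_record(
    record,
    needed=["national_id", "age", "height_cm"],
    direct_identifiers={"national_id"},
)
print(prepared)
```

Attributes outside `needed` (here the name and favourite colour) never reach the prepared data, and the same identifier always maps to the same pseudonym, so records belonging to one individual can still be linked.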

Where granular information is not necessary, organisations may generalise or further add noise to the data at this point or at a later step to reduce the risk of re-identification. For example, organisations can generalise exact height and weight information into height and weight bands to reduce the possibility of height and weight combinations being used to identify any outliers.

¹⁴ Refer to PDPC’s Guide to Basic Anonymisation on how to identify direct identifiers in a dataset.
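The height and weight example can be sketched as a small banding function. The band width of 10, the attribute names, and the helper `to_band` are assumptions for illustration, and the sketch assumes whole-number measurements.

```python
def to_band(value, width):
    """Map a whole-number measurement to a band label such as '170-179'."""
    low = int(value // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical records; the second person is unusually tall and heavy.
records = [
    {"height_cm": 171, "weight_kg": 64},
    {"height_cm": 208, "weight_kg": 131},
]

banded = [
    {
        "height_band_cm": to_band(r["height_cm"], 10),
        "weight_band_kg": to_band(r["weight_kg"], 10),
    }
    for r in records
]
print(banded)
```

Wider bands lower the chance that a rare height and weight combination singles out an individual, at the cost of some analytical detail, which is the utility versus protection trade-off discussed in Step 1.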

Organisations should also standardise and document the details on each data attribute (such as data definitions, standards, metrics etc.) in a data dictionary. This enables the organisation to subsequently validate the integrity of the generated synthetic data to detect anomalies and fix any data inconsistencies. Refer to the following checklist in Table 3 for key considerations.

Table 3: Checklist for data preparation

Data Preparation Checklist

Understand key insights

i. Identify trends and entity relationships to be preserved for synthetic data generation.

ii. Remove outliers if such trends/insights are not necessary. This can be performed post generation.

Select data attributes

iii. Apply data minimisation to select only data attributes that are necessary to meet business needs.

iv. Remove or pseudonymise direct identifiers (e.g., name, national identification numbers).

v. Generalise granular data or add noise (e.g., using differential privacy¹⁵) to the data/model if such detailed information is not necessary. This can also be performed post generation.

vi. Standardise and document format, constraints, and categories of source data in data dictionary (refer to Annex B for a reference template):

Format
• Standardise strings to lower or proper case
• Data types, column names, structures, relationships
• Frequency of data record

Constraints
• Constraints of values for each data type, e.g., min-max values, non-negative values, non-null values
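As a rough illustration of how a documented data dictionary can later be used to validate generated synthetic data, the sketch below checks records against type, min-max, non-null, and category constraints. The dictionary layout, attribute names, and the `validate` helper are hypothetical and are not the Annex B template.

```python
# Hypothetical data dictionary: data type, value constraints, and categories.
DATA_DICTIONARY = {
    "age": {"type": int, "min": 0, "max": 120, "nullable": False},
    "gender": {"type": str, "categories": {"female", "male"}, "nullable": False},
    "weight_kg": {"type": float, "min": 0.0, "max": 300.0, "nullable": True},
}


def validate(record, dictionary):
    """Return a list of constraint violations for one record."""
    errors = []
    for name, rule in dictionary.items():
        value = record.get(name)
        if value is None:
            if not rule.get("nullable", True):
                errors.append(f"{name}: missing value")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: above maximum {rule['max']}")
        if "categories" in rule and value not in rule["categories"]:
            errors.append(f"{name}: unexpected category {value!r}")
    return errors


good = {"age": 34, "gender": "female", "weight_kg": 61.5}
bad = {"age": -3, "gender": "Female", "weight_kg": None}

print(validate(good, DATA_DICTIONARY))
print(validate(bad, DATA_DICTIONARY))
```

The second record fails on a negative age and on a category that was not standardised to lower case, which are the kinds of anomalies the checklist items on format and constraints are meant to surface.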
