
Data Mining: Concepts and Techniques

— Slides for Textbook —

— Chapter 3 —

© Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
12/9/2022

Chapter 3: Data Preprocessing
* Why preprocess the data?
* Data cleaning
* Data integration and transformation
* Data reduction
* Discretization and concept hierarchy generation
* Summary

Why Data Preprocessing?
* Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
* No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - Data warehouse needs consistent integration of quality data

Multi-Dimensional Measure of Data Quality
* A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
* Broad categories: intrinsic, contextual, representational, and accessibility.

Major Tasks in Data Preprocessing
* Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
* Data integration
  - Integration of multiple databases, data cubes, or files
* Data transformation
  - Normalization and aggregation
* Data reduction
  - Obtains reduced representation in volume but produces the same or similar analytical results
* Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

Forms of Data Preprocessing
[Figure: Forms of data preprocessing]

Data Cleaning
* Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data

Missing Data
* Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
* Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
* Missing data may need to be inferred.

How to Handle Missing Data?
* Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
* Fill in the missing value manually: tedious + infeasible?
* Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
* Use the attribute mean to fill in the missing value
* Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
* Use the most probable value to fill in the missing value: inference-based, such as Bayesian formula or decision tree

Noisy Data
* Noise: random error or variance in a measured variable
* Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
* Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
* Binning method:
  - first sort data and partition into (equi-depth) bins
  - then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and check by human
* Regression
  - smooth by fitting the data into regression functions

Simple Discretization Methods: Binning
* Equal-width (distance) partitioning:
  - It divides the range into N intervals of equal size: uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B - A)/N.
  - The most straightforward
  - But outliers may dominate presentation
  - Skewed data is not handled well.
* Equal-depth (frequency) partitioning:
  - It divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky.

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Cluster Analysis
[Figure: Cluster analysis]

Regression
[Figure: regression line y = x + 1 on x-y axes; the observed value Y1 at X1 is replaced by the fitted value Y1']

Data Integration
* Data integration: combines data from multiple sources into a coherent store
* Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
* Detecting and resolving data value conflicts
  - for the same real world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data in Data Integration
* Redundant data occur often when integrating multiple databases
  - The same attribute may have different names in different databases
  - One attribute may be a "derived" attribute in another table, e.g., annual revenue
* Redundant data may be detected by correlation analysis
* Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Transformation
* Smoothing: remove noise from data
* Aggregation: summarization, data cube construction
* Generalization: concept hierarchy climbing
* Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
* Attribute/feature construction
  - New attributes constructed from the given ones

Data Transformation: Normalization
* min-max normalization: $v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$
* z-score normalization: $v' = \frac{v - \mathrm{mean}_A}{\mathrm{stand\_dev}_A}$
* normalization by decimal scaling: $v' = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$

Data Reduction Strategies
* Warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
* Data reduction
  - Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
* Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction
  - Numerosity reduction
  - Discretization and concept hierarchy generation

Data Cube Aggregation
* The lowest level of a data cube
  - the aggregated data for an individual entity of interest
  - e.g., a customer in a phone calling data warehouse.
* Multiple levels of aggregation in data cubes
  - Further reduce the size of data to deal with
* Reference appropriate levels
  - Use the smallest representation which is enough to solve the task
* Queries regarding aggregated information should be answered using the data cube, when possible

Dimensionality Reduction
* Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of different classes given the values for those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes in the discovered patterns, making them easier to understand
* Heuristic methods (due to exponential # of choices):
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
  - decision-tree induction

Example of Decision Tree Induction
* Initial attribute set: {A1, A2, A3, A4, A5, A6}
* [Figure: decision tree with root test A4?, internal tests A1? and A6?, and leaves Class 1, Class 2, Class 1, Class 2]
* Reduced attribute set: {A1, A4, A6}

Heuristic Feature Selection Methods
* There are $2^d$ possible feature subsets of d features
* Several heuristic feature selection methods:
  - Best single features under the feature independence assumption: choose by significance tests.
  - Best step-wise feature selection: the best single feature is picked first, then the next best feature conditioned on the first, ...
  - Step-wise feature elimination: repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Optimal branch and bound: use feature elimination and backtracking

Data Compression
* String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
* Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of signal can be reconstructed without reconstructing the whole
* Time sequence is not audio
  - Typically short and varies slowly with time

Data Compression
[Figure: Original Data and Compressed Data (lossless); Original Data and Original Data Approximated (lossy)]

Wavelet Transforms
* Discrete wavelet transform (DWT): linear signal processing
* Compressed approximation: store only a small fraction of the strongest of the wavelet coefficients
* Similar to discrete Fourier transform (DFT), but better lossy compression, localized in space
* Method:
  - Length, L, must be an integer power of 2 (padding with 0s, when necessary)
  - Each transform has 2 functions: smoothing, difference
  - Applies to pairs of data, resulting in two sets of data of length L/2
  - Applies the two functions recursively, until it reaches the desired length
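The wavelet method above (pad to a power of 2, apply a smoothing and a difference function to pairs, recurse on the smoothed half) can be sketched in a few lines of Python. This is only an illustrative sketch, not the slides' own implementation: it assumes the unnormalized Haar variant, where smoothing is the pairwise average and difference is half the pairwise difference, and the function names `pad_to_power_of_two`, `haar_dwt`, `haar_idwt`, and `compress` are hypothetical.

```python
def pad_to_power_of_two(values):
    """Pad with zeros so that the length L is an integer power of 2."""
    n = 1
    while n < len(values):
        n *= 2
    return list(values) + [0.0] * (n - len(values))


def haar_dwt(values):
    """Unnormalized Haar transform (an assumed variant).

    Smoothing = average of each pair, difference = half the pair difference.
    Returns [overall average, coarsest detail, ..., finest details].
    """
    data = pad_to_power_of_two(values)
    coeffs = []
    while len(data) > 1:
        smooth = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        detail = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        coeffs = detail + coeffs  # coarser details end up in front
        data = smooth             # recurse on the smoothed half (length L/2)
    return data + coeffs


def haar_idwt(coeffs):
    """Invert haar_dwt: rebuild each pair as (s + d, s - d), level by level."""
    data = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        detail = coeffs[pos:pos + len(data)]
        pos += len(data)
        rebuilt = []
        for s, d in zip(data, detail):
            rebuilt += [s + d, s - d]
        data = rebuilt
    return data


def compress(coeffs, keep):
    """Lossy step: keep the `keep` largest-magnitude coefficients, zero the rest."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(ranked[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]


if __name__ == "__main__":
    signal = [2, 2, 0, 2, 3, 5, 4, 4]
    coeffs = haar_dwt(signal)
    print(coeffs)
    print(haar_idwt(compress(coeffs, 3)))  # approximation from 3 coefficients
```

For the signal 2, 2, 0, 2, 3, 5, 4, 4 the transform yields 2.75, -1.25, 0.5, 0, 0, -1, -1, 0, and inverting the full coefficient list restores the signal exactly. Ranking unnormalized coefficients by magnitude is a simplification; an orthonormal Haar transform (scaling by 1/√2 per level) would make magnitude a more faithful measure of coefficient strength.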

[Figure: Haar-2 and Daubechies-4 wavelets]

Principal Component Analysis
* Given N data vectors from k dimensions, find c <= k orthogonal vectors that can be best used to represent the data
  - The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
* Each data vector is a linear combination of the c principal component vectors
* Works for numeric data only
* Used when the number of dimensions is large

Principal Component Analysis
[Figure: original axes X1, X2 and principal component axes Y1, Y2]

Numerosity Reduction
* Parametric methods
  - Assume the data fits some model, estimate model parameters, store only the parameters, and discard the data (except possible outliers)
  - Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces
* Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling

Regression and Log-Linear Models
* Linear regression: data are modeled to fit a straight line
  - Often uses the least-square method to fit the line
* Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
* Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models
* Linear regression: $Y = \alpha + \beta X$
  - Two parameters, $\alpha$ and $\beta$, specify the line and are to be estimated from the data at hand
  - by applying the least squares criterion to the known values of Y1, Y2, …, X1, X2, ….
* Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$.
  - Many nonlinear functions can be transformed into the above.
* Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  - Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$

Histograms
* A popular data reduction technique
* Divide data into buckets and store the average (sum) for each bucket
* Can be constructed optimally in one dimension using dynamic programming
* Related to quantization problems.

Clustering
* Partition the data set into clusters, and one can store the cluster representation only
* Can be very effective if data is clustered but not if data is "smeared"
* Can have hierarchical clustering and be stored in multi-dimensional index tree structures
* There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8

Sampling
* Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
* Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
* Develop adaptive sampling methods
  - Stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
  - Used in conjunction with skewed data
* Sampling may not reduce database I/Os (page at a time).
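The three sampling schemes named in this part of the chapter can be sketched with Python's standard `random` module. This is a minimal illustration under stated assumptions: the function names, the `fraction` parameter, and the rule of taking at least one row per stratum are all hypothetical choices, not something the slides prescribe.

```python
import random
from collections import defaultdict


def srswor(data, n, rng):
    """Simple random sample WITHOUT replacement: no row is drawn twice."""
    return rng.sample(data, n)


def srswr(data, n, rng):
    """Simple random sample WITH replacement: a row may be drawn repeatedly."""
    return [rng.choice(data) for _ in range(n)]


def stratified_sample(data, key, fraction, rng):
    """Draw about the same fraction from every stratum (class), so the class
    percentages of the full data set are approximately preserved.
    Assumption: every non-empty stratum contributes at least one row."""
    strata = defaultdict(list)
    for row in data:
        strata[key(row)].append(row)
    sample = []
    for rows in strata.values():
        k = max(1, round(fraction * len(rows)))
        sample.extend(rng.sample(rows, k))
    return sample


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    # Skewed data: 95 rows of a common class, 5 rows of a rare class.
    data = [(i, "common") for i in range(95)] + [(i, "rare") for i in range(95, 100)]
    plain = srswor(data, 20, rng)
    strat = stratified_sample(data, key=lambda row: row[1], fraction=0.2, rng=rng)
    print(sum(1 for _, c in plain if c == "rare"), "rare rows in the simple sample")
    print(sum(1 for _, c in strat if c == "rare"), "rare rows in the stratified sample")
```

With skewed data like this, a simple random sample of 20 rows can easily miss the rare class entirely, while the stratified sample always includes it, which is the motivation the slide gives for using stratified sampling with skewed data.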
Sampling
[Figure: raw data sampled by SRSWOR (simple random sample without replacement) and by SRSWR (simple random sample with replacement)]

Sampling
[Figure: raw data and the corresponding cluster/stratified sample]

Hierarchical Reduction
* Use multi-resolution structure with different degrees of reduction
* Hierarchical clustering is often performed but tends to define partitions of data sets rather than "clusters"
* Parametric methods are usually not amenable to hierarchical representation
* Hierarchical aggregation
  - An index tree hierarchically divides a data set into partitions by value range of some attributes
  - Each partition can be considered as a bucket
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram

Discretization
* Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
* Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes.
  - Reduce data size by discretization
  - Prepare for further analysis

Discretization and Concept Hierarchy
* Discretization
  - reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
* Concept hierarchies
  - reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).

Discretization and Concept Hierarchy Generation for Numeric Data
* Binning (see sections before)
* Histogram analysis (see sections before)
* Clustering analysis (see sections before)
* Entropy-based discretization
* Segmentation by natural partitioning

Entropy-Based Discretization
* Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is $E(S, T) = \frac{|S_1|}{|S|}\,Ent(S_1) + \frac{|S_2|}{|S|}\,Ent(S_2)$
* The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
* The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., when the information gain $Ent(S) - E(S, T)$ of the best split falls below a threshold $\delta$
* Experiments show that it may reduce data size and improve classification accuracy

Segmentation by Natural Partitioning
* The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals.
  - If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  - If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  - If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Example of 3-4-5 Rule
* Step 1: profit data with Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
* Step 2: msd = 1,000, Low = -$1,000, High = $2,000
* Step 3: (-$1,000 to $2,000) is partitioned into (-$1,000 to 0), (0 to $1,000), ($1,000 to $2,000)
* Step 4: the range is adjusted to (-$400 to $5,000) so that it covers Min and Max, and each interval is partitioned further:
  - (-$400 to 0): (-$400 to -$300), (-$300 to -$200), (-$200 to -$100), (-$100 to 0)
  - (0 to $1,000): (0 to $200), ($200 to $400), ($400 to $600), ($600 to $800), ($800 to $1,000)
  - ($1,000 to $2,000): ($1,000 to $1,200), ($1,200 to $1,400), ($1,400 to $1,600), ($1,600 to $1,800), ($1,800 to $2,000)
  - ($2,000 to $5,000): ($2,000 to $3,000), ($3,000 to $4,000), ($4,000 to $5,000)

Concept Hierarchy Generation for Categorical Data
* Specification of a partial ordering of attributes explicitly at the schema level by users or experts
* Specification of a portion of a hierarchy by explicit data grouping
* Specification of a set of attributes, but not of their partial ordering
* Specification of only a partial set of attributes

Specification of a Set of Attributes
* A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
* Example, from the highest level to the lowest:
  - country: 15 distinct values
  - province_or_state: 65 distinct values
  - city: 3567 distinct values
  - street: 674,339 distinct values

Summary
* Data preparation is a big issue for both warehousing and mining
* Data preparation includes
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
* A lot of methods have been developed, but this is still an active area of research
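As a concrete companion to the entropy-based discretization covered in this chapter, the sketch below picks the boundary T that minimizes the weighted entropy |S1|/|S| · Ent(S1) + |S2|/|S| · Ent(S2) and then recurses on both partitions. Several details are assumptions made for illustration only: candidate boundaries are midpoints between consecutive distinct sorted values, recursion stops when the information gain is at most a threshold `delta`, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Ent(S): class entropy of a list of labels, in bits (0 for an empty list)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_boundary(values, labels):
    """Return (T, E(S, T)) for the boundary with the lowest weighted entropy,
    or None if all values are identical. Candidate boundaries are midpoints
    between consecutive distinct sorted values (an assumed convention)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or e < best[1]:
            best = (t, e)
    return best


def discretize(values, labels, delta=0.1):
    """Recursively split while the information gain Ent(S) - E(S, T) exceeds
    delta (an assumed stopping rule). Returns the cut points in sorted order."""
    split = best_boundary(values, labels)
    if split is None:
        return []
    t, e = split
    if entropy(labels) - e <= delta:
        return []
    left = [(v, lab) for v, lab in zip(values, labels) if v <= t]
    right = [(v, lab) for v, lab in zip(values, labels) if v > t]
    cuts = discretize([v for v, _ in left], [lab for _, lab in left], delta)
    cuts.append(t)
    cuts += discretize([v for v, _ in right], [lab for _, lab in right], delta)
    return cuts


if __name__ == "__main__":
    values = [1, 2, 3, 10, 11, 12, 20, 21, 22]
    labels = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
    print(discretize(values, labels))
```

On this toy data the three classes occupy separate value ranges, so the procedure finds two cut points (6.5 and 16.0) and then stops, because every resulting interval is pure and no further split yields any gain.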
http://www.cs.sfu.ca/~han

Thank you!!!

