
Data Preprocessing
Knowledge Discovery in Databases
School of Software, Nanjing University

Chapter 3: Data Preprocessing
- Why data preprocessing?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a reduced representation of the data volume that produces the same or similar analytical results
- Data discretization: part of data reduction, but of particular importance, especially for numerical data

Forms of data preprocessing (figure)

Chapter 3 outline: Why data preprocessing? | Data cleaning | Data integration and transformation | Data reduction | Discretization and concept hierarchy generation | Summary

Data Cleaning
- Data cleaning tasks:
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data

Missing Data
- Data is not always available: e.g., many tuples have no recorded value for several attributes, such as "customer income" in sales data
- Missing data may be due to:
  - equipment malfunction
  - values inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not considered important at the time of entry
  - no record kept of the history or changes of the data
- Missing data may need to be inferred

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming a classification task); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious, and often infeasible
- Use a global constant to fill in the missing value, e.g., "unknown": risks creating an artificial class
- Use the attribute mean to fill in the missing value
- Use the attribute mean of all samples belonging to the same class: smarter
- Use the most probable value to fill in the missing value: inference-based, e.g., a Bayesian formula or a decision tree
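A minimal sketch of the two mean-based strategies above, assuming the data is a small list of (class label, income) records with None marking a missing value; the records and the "income" field are illustrative, not from the slides:

    from statistics import mean

    # Toy records: (class label, customer income); None marks a missing value.
    records = [("high", 8200), ("high", None), ("high", 7600),
               ("low", 2100), ("low", 2500), ("low", None)]

    # Strategy 1: fill every hole with the overall attribute mean.
    overall_mean = mean(v for _, v in records if v is not None)          # 5100

    # Strategy 2 (smarter): fill with the mean of samples in the same class.
    class_means = {c: mean(v for cls, v in records if cls == c and v is not None)
                   for c in {cls for cls, _ in records}}
    filled = [(c, v if v is not None else class_means[c]) for c, v in records]
    print(overall_mean, class_means, filled)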

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations (e.g., limited input cache capacity)
  - inconsistency in naming conventions
- Other data problems that require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning: first sort the data and partition it into (equi-depth) bins, then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and have them checked by a human
- Regression: smooth by fitting the data to regression functions

Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - divides the range into N intervals of equal size (uniform grid)
  - if A and B are the lowest and highest values of the attribute, the interval width is W = (B - A) / N
  - the most straightforward approach, but outliers may dominate the presentation and skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - divides the range into N intervals, each containing approximately the same number of samples
  - good data scaling; managing categorical attributes can be tricky

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
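The equi-depth binning and the two smoothing rules can be reproduced in a few lines; a minimal sketch that regenerates the bins above (the bin count of 3 comes from the example):

    prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    depth = len(prices) // 3                                   # equi-depth: 4 values per bin
    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # Smoothing by bin means: every value in a bin is replaced by the (rounded) bin mean.
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smoothing by bin boundaries: each value moves to the closer of the two bin boundaries.
    by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

    print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
    print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]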

Cluster Analysis (figure)
Regression (figure: data points in the x-y plane smoothed by a fitted line, e.g., y = x + 1)

Chapter 3 outline: Why data preprocessing? | Data cleaning | Data integration and transformation | Data reduction | Discretization and concept hierarchy generation | Summary

Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration: integrate metadata from different sources
  - Entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
- Detecting and resolving data value conflicts:
  - for the same real-world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data in Data Integration
- Redundant data occur often when multiple databases are integrated:
  - the same attribute may have different names in different databases
  - one attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant attributes may be detected by correlation analysis (see the sketch below)
- Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
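A minimal sketch of such a correlation check between two numeric attributes: a Pearson coefficient near +1 or -1 suggests one attribute is largely redundant given the other (the revenue figures are illustrative):

    from math import sqrt

    def pearson(xs, ys):
        """Pearson correlation coefficient between two numeric attributes."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sqrt(sum((x - mx) ** 2 for x in xs))
        sy = sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    monthly_revenue = [10, 12, 9, 14, 20, 18]
    annual_revenue = [120, 144, 108, 168, 240, 216]    # derived: 12 * monthly

    print(pearson(monthly_revenue, annual_revenue))    # ~1.0 -> redundant attribute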

Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones

Data Transformation: Normalization
- min-max normalization: v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- z-score normalization: v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
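A minimal sketch of the three normalization formulas above; the sample values and the target range [0, 1] are illustrative:

    from statistics import mean, pstdev

    values = [200, 300, 400, 600, 1000]

    # Min-max normalization to the new range [0.0, 1.0].
    lo, hi, new_lo, new_hi = min(values), max(values), 0.0, 1.0
    min_max = [(v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo for v in values]

    # Z-score normalization: subtract the mean, divide by the standard deviation.
    m, s = mean(values), pstdev(values)
    z_score = [(v - m) / s for v in values]

    # Decimal scaling: divide by 10^j, with j the smallest integer making max(|v'|) < 1.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    decimal = [v / 10 ** j for v in values]

    print(min_max)    # [0.0, 0.125, 0.25, 0.5, 1.0]
    print(decimal)    # j = 4 -> [0.02, 0.03, 0.04, 0.06, 0.1]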

(Attribute subset selection example, figure omitted) Reduced attribute set: {A1, A4, A6}

Data Compression
- String compression:
  - there are extensive theories and well-tuned algorithms
  - typically lossless, but only limited manipulation is possible without expansion
- Audio/video compression:
  - typically lossy compression, with progressive refinement
  - sometimes small fragments of the signal can be reconstructed without reconstructing the whole

Data Compression (figure: original data reduced to compressed data losslessly, or to an approximation lossily)

Numerosity Reduction
- Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - regression
  - log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces
- Non-parametric methods: do not assume models
  - major families: histograms, clustering, sampling

Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line, often using the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models
- Linear regression: Y = α + β X
  - the two parameters α and β specify the line and are estimated from the data at hand, applying the least-squares criterion to the known values Y1, Y2, ..., X1, X2, ... (see the sketch below)
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  - many nonlinear functions can be transformed into the above
- Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = αab · βac · χad · δbcd
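A minimal sketch of the least-squares fit for Y = α + βX; the sample points are illustrative:

    xs = [1.0, 2.0, 3.0, 4.0, 5.0]
    ys = [2.1, 2.9, 4.2, 5.1, 5.9]        # roughly y = 1 + x

    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n

    # Least-squares estimates: beta = cov(X, Y) / var(X), alpha = mean(Y) - beta * mean(X).
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    alpha = my - beta * mx

    # The raw points can now be replaced by the two parameters (alpha, beta).
    print(alpha, beta)                    # approximately 1.1 and 0.98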

Histograms
- A popular data reduction technique: divide the data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
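A minimal sketch of an equal-width histogram used as a reduced representation: only per-bucket boundaries, counts, and averages are kept, not the raw values (the prices are reused from the binning example):

    data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
    n_buckets = 3
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_buckets

    buckets = [[] for _ in range(n_buckets)]
    for v in data:
        i = min(int((v - lo) / width), n_buckets - 1)   # clamp the maximum into the last bucket
        buckets[i].append(v)

    # Keep only (lower bound, upper bound, count, average) per bucket.
    summary = [(lo + i * width, lo + (i + 1) * width, len(b), sum(b) / len(b))
               for i, b in enumerate(buckets)]
    print(summary)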

Clustering
- Partition the data set into clusters, and store only the cluster representation
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Can use hierarchical clustering and store the result in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, detailed further in Chapter 8

Sampling
- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data: simple random sampling may perform very poorly in the presence of skew
- Develop adaptive sampling methods:
  - stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database
  - used in conjunction with skewed data

Sampling (figure: SRSWOR, a simple random sample without replacement, and SRSWR, with replacement, drawn from the raw data)
Sampling (figure: raw data vs. a cluster/stratified sample)
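A minimal sketch of SRSWOR, SRSWR, and a stratified sample using the standard library; the skewed two-class data set is illustrative:

    import random
    from collections import defaultdict

    data = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]   # skewed: 90% A, 10% B

    srswor = random.sample(data, 10)                     # without replacement
    srswr = [random.choice(data) for _ in range(10)]     # with replacement

    # Stratified sample: keep each class's share of the database (here a 10% sample).
    by_class = defaultdict(list)
    for label, value in data:
        by_class[label].append((label, value))
    stratified = []
    for label, rows in by_class.items():
        stratified += random.sample(rows, max(1, round(0.10 * len(rows))))

    print(len(srswor), len(srswr), len(stratified))      # 10 10 10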

Chapter 3 outline: Why data preprocessing? | Data cleaning | Data integration and transformation | Data reduction | Discretization and concept hierarchy generation | Summary

Discretization
- Three types of attributes:
  - nominal: values from an unordered set
  - ordinal: values from an ordered set
  - continuous: real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - some classification algorithms only accept categorical attributes
  - reduce data size by discretization
  - prepare for further analysis

Discretization and Concept Hierarchy
- Discretization: reduce the number of values of a given continuous attribute by dividing its range into intervals; interval labels can then be used to replace actual data values
- Concept hierarchies: reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior), as in the sketch below
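A minimal sketch of such an age hierarchy as a simple mapping; the cut-off ages are illustrative assumptions, not from the slides:

    def age_concept(age):
        """Map a numeric age to a higher-level concept (illustrative cut-offs)."""
        if age < 40:
            return "young"
        if age < 60:
            return "middle-aged"
        return "senior"

    ages = [23, 37, 45, 58, 62, 71]
    print([age_concept(a) for a in ages])
    # ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']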

Discretization and Concept Hierarchy Generation for Numeric Data
- Binning (see the sections before)
- Histogram analysis (see the sections before)
- Clustering analysis (see the sections before)
- Entropy-based discretization (introduced below)
- Segmentation by natural partitioning

Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1| / |S|) Ent(S1) + (|S2| / |S|) Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
- The process is applied recursively to the partitions obtained, until some stopping criterion is met, e.g., until the information gain Ent(S) - E(S, T) of the best split falls below a threshold δ
- Experiments show that it may reduce data size and improve classification accuracy
- See the chapter "Concept Description and Discrimination Mining"
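A minimal sketch of picking the boundary T that minimizes E(S, T) for one labeled attribute; the toy (value, class) samples are illustrative:

    from math import log2
    from collections import Counter

    def ent(labels):
        """Entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    # (attribute value, class label) samples, already sorted by value.
    samples = [(1, "no"), (2, "no"), (3, "no"), (7, "yes"), (8, "yes"), (9, "no"), (10, "yes")]
    labels = [c for _, c in samples]

    best_t, best_e = None, None
    for i in range(1, len(samples)):                        # candidate boundary between i-1 and i
        t = (samples[i - 1][0] + samples[i][0]) / 2
        s1, s2 = labels[:i], labels[i:]
        e = len(s1) / len(labels) * ent(s1) + len(s2) / len(labels) * ent(s2)
        if best_e is None or e < best_e:
            best_t, best_e = t, e

    print(best_t, ent(labels) - best_e)                     # chosen boundary and its information gain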

Segmentation by Natural Partitioning
- The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
  * if an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  * if it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  * if it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Example of the 3-4-5 rule (profit data; a code sketch follows below)
- Step 1: Min = -$351, Max = $4,700; Low (5%-tile) = -$159, High (95%-tile) = $1,838
- Step 2: msd = 1,000, so Low rounds down to -$1,000 and High rounds up to $2,000; this covers 3 distinct values at the most significant digit, giving 3 equi-width intervals
- Step 3: (-$1,000, 0], (0, $1,000], ($1,000, $2,000]
- Step 4: adjust the boundary intervals so that Min and Max are covered: (-$400, 0], (0, $1,000], ($1,000, $2,000], ($2,000, $5,000]
- Each top-level interval is then partitioned again by the same rule:
  - (-$400, 0] into (-$400, -$300], (-$300, -$200], (-$200, -$100], (-$100, 0]
  - (0, $1,000] into (0, $200], ($200, $400], ($400, $600], ($600, $800], ($800, $1,000]
  - ($1,000, $2,000] into ($1,000, $1,200], ($1,200, $1,400], ($1,400, $1,600], ($1,600, $1,800], ($1,800, $2,000]
  - ($2,000, $5,000] into ($2,000, $3,000], ($3,000, $4,000], ($4,000, $5,000]
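A minimal sketch of the top-level step of the 3-4-5 rule: round the 5% / 95% tiles outward at the most significant digit, count the distinct msd values covered, and cut 3, 4, or 5 equi-width intervals; the recursive refinement and the Step 4 min/max adjustment are omitted:

    from math import floor, ceil, log10

    def partition_345(low, high):
        """One level of the 3-4-5 rule on the range [low, high]."""
        msd = 10 ** floor(log10(max(abs(low), abs(high))))        # most significant digit unit
        lo, hi = floor(low / msd) * msd, ceil(high / msd) * msd   # round outward at the msd
        distinct = round((hi - lo) / msd)                         # distinct msd values covered
        if distinct in (3, 6, 7, 9):
            n = 3
        elif distinct in (2, 4, 8):
            n = 4
        else:                                                     # 1, 5, or 10
            n = 5
        width = (hi - lo) / n
        return [(lo + i * width, lo + (i + 1) * width) for i in range(n)]

    # Profit example from the slide: Low = -$159, High = $1,838 -> msd = 1,000.
    print(partition_345(-159, 1838))
    # [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]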

Concept Hierarchy Generation for Categorical Data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes

Specification of a Set of Attributes
- A concept hierarchy can be generated automatically based on the number of distinct values per attribute in the given attribute set
- The attribute with the most distinct values is placed at the lowest level of the hierarchy
- (figure) Example: street < city < province_or_state < country
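A minimal sketch of this distinct-value heuristic: count distinct values per attribute and order the attributes from most distinct (lowest level) to fewest (top); the toy location table is illustrative:

    rows = [
        {"country": "China",  "province_or_state": "Jiangsu",  "city": "Nanjing",  "street": "Hankou Rd"},
        {"country": "China",  "province_or_state": "Jiangsu",  "city": "Nanjing",  "street": "Zhongshan Rd"},
        {"country": "China",  "province_or_state": "Jiangsu",  "city": "Suzhou",   "street": "Renmin Rd"},
        {"country": "China",  "province_or_state": "Zhejiang", "city": "Hangzhou", "street": "Wensan Rd"},
        {"country": "Canada", "province_or_state": "Ontario",  "city": "Toronto",  "street": "King St"},
    ]

    attrs = ["country", "province_or_state", "city", "street"]
    distinct = {a: len({r[a] for r in rows}) for a in attrs}

    # Most distinct values -> lowest level of the hierarchy; fewest -> top.
    hierarchy = sorted(attrs, key=lambda a: distinct[a], reverse=True)
    print(" < ".join(hierarchy))    # street < city < province_or_state < country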
