
Data Mining Techniques (Lecture 4)

Some advice on doing research:
- Learn to read the literature: read your way in, and find your way back out again; read the original papers whenever possible. Know everything about something, and know something about everything.
- Wherever there are people there is a community, and academia is a community of its own; you cannot get by without understanding how it works.
- Learn the basic skills: how to survey the literature, how to write papers, how to write grant proposals, and how to give academic talks. Become familiar with the standard methodology of scientific research.
- Cultivate an open mind and a habit of academic exchange; value teamwork, appreciation and tolerance.
- Stay curious: broad interests and curiosity are the precondition for a rich imagination, and the driving force behind innovation.
- "There is no royal road to science, and only those who do not dread the fatiguing climb of its steep paths have a chance of gaining its luminous summits."

Classification vs. Prediction

Outline
- What are classification and prediction?
- Decision tree induction
- Bayesian classification
- Classification by artificial neural network
- KNN
- Some other methods
- Prediction

What is Classification?
- The goal of data classification is to organize and categorize data into distinct classes.
- A model is first created based on the data distribution.
- The model is then used to classify new data.
- Given the model, a class can be predicted for new data.
- Illustration: from the features (color, weight), predict the class label "sweet?".

What is Prediction?
- Prediction (numeric prediction): the models constructed predict continuous-valued functions, i.e., they predict unknown or missing values.
- Example: a marketing manager would like to predict how much a given customer will spend during a sale of electronic products.

Classification vs. Prediction
- The two tasks are similar in many respects.
- They differ in the values being predicted: categorical labels vs. continuous values.
- They differ in the methods used to build their respective models.
- Typical applications: credit approval, target marketing, fraud detection, medical diagnosis.

Classification: A Two-Step Process
- Model construction: find a relation between a set of variables (features) and target variables (labels) from a finite set of examples. Given Y = {y1, y2, ..., yn} and X = {x1, x2, ..., xm}, find a function y = f(x) such that for each xi in X there is exactly one yj in Y with yj = f(xi).
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, decision trees, or mathematical formulae.
- Model usage: classify future or unknown objects.
- Estimate the accuracy of the model: the known label of each test sample is compared with the model's classification result; the accuracy rate is the percentage of test set samples that are correctly classified by the model.
- The test set must be independent of the training set, otherwise over-fitting will occur.

Step 1: Model Construction. The training data are fed to a classification algorithm, which produces a classifier (model), for example the rule: IF rank = "professor" OR years > 6 THEN tenured = "yes".

Step 2: Using the Model in Prediction. The classifier is applied to the testing data to estimate accuracy (here, error = 1/4 = 0.25; is the model acceptable?), and then to unseen data, e.g. (Jeff, Professor, 5) -> Tenured?
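To make the two-step process concrete, here is a minimal sketch using scikit-learn (not part of the lecture); the tiny "tenured" data set and the feature encoding are illustrative assumptions:

```python
# Minimal sketch of the two-step classification process with scikit-learn.
# The tiny "tenured" data set below is made up for illustration; features
# are [is_professor (0/1), years] and the label is tenured (0/1).
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: model construction - learn y = f(x) from the training set.
X_train = [[1, 7], [0, 3], [1, 2], [0, 8], [0, 5]]
y_train = [1, 0, 1, 1, 0]
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage - estimate accuracy on an independent test set,
# then classify unseen objects.
X_test, y_test = [[1, 5], [0, 2], [0, 7], [0, 6]], [1, 0, 1, 0]
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("unseen (professor, 5 years):", model.predict([[1, 5]]))
```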

Issues: Data Preparation
In order to help improve the accuracy, efficiency, and scalability of classification:
- Data cleaning: preprocess the data to remove or reduce noise and to treat missing values.
- Relevance analysis (feature selection): remove irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize the data. For example, numeric values of the attribute income may be generalized to discrete ranges such as low, medium, and high; the nominal-valued attribute street can be generalized to a higher-level concept such as city.
- Generalization compresses the original training data, so fewer input/output operations may be involved during learning. A small sketch of such a transformation follows this list.
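A minimal sketch (my own illustration) of the income generalization and normalization described above, using pandas; the cut-points for low/medium/high are assumptions:

```python
# Sketch: generalize a numeric attribute into discrete ranges and
# min-max normalize it. Boundaries and data are made up for illustration.
import pandas as pd

df = pd.DataFrame({"income": [12000, 38000, 55000, 91000, 150000]})

# Generalization: numeric income -> {low, medium, high}
df["income_level"] = pd.cut(
    df["income"],
    bins=[0, 30000, 80000, float("inf")],
    labels=["low", "medium", "high"],
)

# Normalization: rescale income to [0, 1]
df["income_norm"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)
print(df)
```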

Measurements of Classification Methods (I)
- Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data.
- Speed: the computation costs involved in generating and using the model.
- Robustness: the ability of the model to make correct predictions given noisy data or data with missing values.

Measurements of Classification Methods (II)
- Scalability: the ability to construct the model efficiently given large amounts of data.
- Interpretability: the level of understanding and insight that is provided by the model.
- Simplicity: decision tree size, rule compactness.

Classification Methods
- Decision tree induction
- Bayesian classification
- Neural networks
- K-nearest neighbour
- Associative classifiers
- Support vector machines
- Case-based reasoning
- Genetic algorithms
- Rough set theory
- Fuzzy sets
- Etc.

Outline
- What are classification and prediction?
- Decision tree induction
- Bayesian classification
- Classification by artificial neural network
- KNN
- Associative classifiers
- Support vector machines
- Some other methods
- Prediction

Classification Trees
- Classification trees are methodologies for classifying data into discrete classes using tree-structured algorithms.
- Breiman invented the terminology in the early 1980s.
- The technique has found immense use in medicine, market research, statistics, marketing and customer relationship management.

Decision Tree
- The main purpose of a decision tree is to expose the structural information contained in the data.
- A flow-chart-like tree structure:
  - each leaf node assigns a classification;
  - each internal node represents a decision or split;
  - each branch corresponds to an attribute value (or value range).

Training data and model (example). A decision tree built from a small "cheat" data set: the root tests Refund (Yes -> NO; No -> test MarSt); MarSt = Married -> NO; MarSt = Single or Divorced -> test TaxInc (below 80K -> NO, at or above 80K -> YES). Attributes label the internal nodes, value ranges label the branches, and class labels sit at the leaves.

Another Example of a Decision Tree. A second tree for the same data tests MarSt first, then Refund and TaxInc. There could be more than one tree that fits the same data!

Apply Model to Test Data. Start from the root of the tree and follow, at each internal node, the branch that matches the test record's attribute value, until a leaf is reached; the leaf's class label is the prediction. For the example test record this process ends at a leaf labelled NO, so Cheat is assigned to "No".

What Is A Decision Tree?
- A decision tree is a predictive model that is used to make predictions through a classification process.
- The model is represented as an upside-down tree: the root at the top (or on the left-hand side) and the leaves at the bottom (or on the right-hand side).
- A decision tree recursively partitions the labelled points using axis-parallel splits, so that each region is as "pure" as possible in terms of the labels.
- Decision trees represent rules. By following the tree, you can decipher the rules and understand why a record is classified in a certain way. These rules can then be used to predict unknown records.

Why are Decision Trees so Popular?
- The construction of a decision tree does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
- Decision trees can handle high-dimensional data.
- They are intuitive and generally easy for humans to understand.
- The learning and classification steps of a decision tree are simple and fast.

Classification by Decision Tree Induction
Decision tree generation consists of two phases:
- Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively based on selected attributes.
- Tree pruning: identify and remove branches that reflect noise or outliers.
Use of the decision tree: to classify an unknown sample, test the attribute values of the sample against the decision tree.

Decision Tree Algorithms
Many algorithms exist:
- Hunt's algorithm (one of the earliest)
- ID3 (information gain)
- C4.5 (gain ratio, handling of missing values)
- CART (Gini index)
They differ in the approach used to select which attribute to test at each node of the tree.

Basic Algorithm (a greedy algorithm)
Construct the tree in a top-down, recursive, divide-and-conquer manner:
- take a given subset of the data;
- evaluate all possible splits;
- choose the best split to partition the data into two subsets;
- stop when certain stopping conditions are met.
A minimal sketch of this recursive procedure follows.
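The following is a minimal, self-contained sketch of the greedy top-down procedure (my own illustration, not the lecture's code). It grows a binary tree over categorical attributes and uses classification error as a placeholder impurity measure, since entropy and the Gini index are only introduced later in the lecture:

```python
# Sketch of greedy top-down decision tree induction on categorical data.
# Classification error is used as a stand-in impurity measure here;
# entropy or the Gini index (introduced later in the lecture) would
# normally be used instead.
from collections import Counter

def impurity(labels):
    """Classification error: 1 - fraction of the majority class."""
    return 1.0 - Counter(labels).most_common(1)[0][1] / len(labels)

def best_split(rows, labels, attrs):
    """Evaluate all (attribute == value) binary splits; return the best."""
    best = None
    for a in attrs:
        for v in {r[a] for r in rows}:
            left = [i for i, r in enumerate(rows) if r[a] == v]
            right = [i for i, r in enumerate(rows) if r[a] != v]
            if not left or not right:
                continue
            score = (len(left) * impurity([labels[i] for i in left]) +
                     len(right) * impurity([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, a, v, left, right)
    return best

def build_tree(rows, labels, attrs, depth=0, max_depth=3):
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping conditions: pure node or maximum depth reached.
    if impurity(labels) == 0.0 or depth >= max_depth:
        return majority
    split = best_split(rows, labels, attrs)
    # Stop if no split improves on the parent's impurity.
    if split is None or split[0] >= impurity(labels):
        return majority
    _, a, v, left, right = split
    grow = lambda idx: build_tree([rows[i] for i in idx],
                                  [labels[i] for i in idx],
                                  attrs, depth + 1, max_depth)
    return {"attr": a, "value": v, "match": grow(left), "other": grow(right)}

# Hypothetical toy data in the spirit of the lecture's examples.
rows = [{"refund": "yes", "marst": "single"},
        {"refund": "no",  "marst": "married"},
        {"refund": "no",  "marst": "single"},
        {"refund": "no",  "marst": "divorced"}]
labels = ["no", "no", "yes", "yes"]
print(build_tree(rows, labels, ["refund", "marst"]))
```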

Issues in Decision Tree Algorithms
- Determine how to split the records:
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determine when to stop splitting.

How to Specify the Attribute Test Condition?
- Depends on the attribute type: categorical or continuous.
- Depends on the number of ways to split: 2-way split or multi-way split.

Splitting Based on a Categorical Attribute
- Multi-way split: use as many partitions as there are distinct values, e.g. Income -> {low} | {medium} | {high}.
- Binary split: divide the values into two subsets and find the optimal partitioning, e.g. Income -> {low, medium} | {high}, or {low, high} | {medium}.

Splitting Based on Continuous Attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute: static (discretize once at the beginning) or dynamic (ranges found by equal-interval bucketing, equal-frequency bucketing/percentiles, or clustering).
- Binary decision: (A < v) or (A >= v); consider all possible split points and find the best cut. Example: (i) binary split, TaxableIncome > 80K? yes/no; (ii) multi-way split into several income ranges.
A small sketch of generating candidate cut points is given below.
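A minimal sketch (my own illustration, not from the slides) of the two handling strategies: equal-frequency bucketing of a continuous attribute, and generating candidate cut points for a binary decision A <= v as midpoints between consecutive sorted values:

```python
# Sketch: handling a continuous attribute for splitting.
# The income values below are made up for illustration.
import numpy as np

income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220], dtype=float)

# (a) Discretization by equal-frequency bucketing (percentiles).
quartile_edges = np.percentile(income, [0, 25, 50, 75, 100])
buckets = np.digitize(income, quartile_edges[1:-1])   # bucket index 0..3
print("bucket edges:", quartile_edges)
print("bucket index per record:", buckets)

# (b) Binary decision (A <= v): candidate cut points are the midpoints
# between consecutive distinct sorted values; the best one is chosen by
# an impurity measure such as entropy or the Gini index.
values = np.unique(income)
candidates = (values[:-1] + values[1:]) / 2
print("candidate thresholds:", candidates)
```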

Issues in decision tree algorithms (recap): how to specify the attribute test condition; how to determine the best split; when to stop splitting. Next question: how to determine the best split?

Finding the Splits
- The central focus of the decision-tree growing algorithm is selecting which attribute to test at each node of the tree: which attribute makes the best split?
- The measure used to evaluate a potential split is purity:
  - low purity means the set contains a representative distribution of all classes;
  - high purity means members of a single class predominate.
- We want a measure that tells us the purity of a node, and how much additional purity we gain by splitting the data at a non-pure node into daughter nodes based on the attribute values of a variable.
(Illustration in the slides: the original data, of low purity, is split into subsets of high purity.)

Information Theory and Information Entropy
- Information theory is a branch of the mathematical theory of probability and mathematical statistics.
- It deals with the concepts of information and information entropy, communication systems, data transmission and rate-distortion theory, cryptography, signal-to-noise ratios, data compression and related topics.
- Claude Shannon (1916-2001) is the father of information theory.
- Entropy originated in thermodynamics but later found its way into information theory:
  - thermodynamic entropy is a measure of the amount of energy in a physical system that cannot be used to do work; it is also a measure of the disorder present in the system;
  - information entropy is a measure of the disorder present in a system.

Entropy (Binary Case)
Assume there are two classes, P and N, and let the set of examples D contain p elements of class P and n elements of class N. The amount of information needed to decide whether an arbitrary example in D belongs to P or N is defined as
  I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)).

Entropy (General Case)
Suppose the class label attribute has m distinct values Ci (i = 1, ..., m), and let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|. The expected information (entropy) needed to classify a tuple in D is
  Info(D) = - Σ_{i=1..m} pi log2(pi).
A logarithm to base 2 is used because the information is encoded in bits.

Example of Entropy (6 records, two classes C1 and C2)
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = -0·log2(0) - 1·log2(1) = 0.
- P(C1) = 1/6, P(C2) = 5/6: Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65.
- P(C1) = 2/6, P(C2) = 4/6: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92.
- Maximum (log2 nc, for nc classes) when records are equally distributed among all classes, implying the least information.
- Minimum (0.0) when all records belong to one class, implying the most information.
A small sketch that reproduces these values is given below.
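A minimal sketch (my own, not from the slides) that computes these entropy values:

```python
# Sketch: entropy from class counts, reproducing the values in the example.
import math

def entropy(probs):
    """Info(D) = -sum(p_i * log2(p_i)), with the convention 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    n = sum(counts)
    print(counts, "->", round(entropy([c / n for c in counts]), 2))
# (0, 6) -> 0.0   (1, 5) -> 0.65   (2, 4) -> 0.92   (3, 3) -> 1.0 (maximum)
```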

Information Gain
- Let attribute A have v distinct values, splitting D into partitions D1, ..., Dv. The information still needed (after using A to split D into v partitions) to classify D is
    Info_A(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj),
  i.e. how much more information we would still need in order to arrive at an exact classification. The smaller this value, the greater the purity of the subset partitions.
- Gain(A) = Info(D) - Info_A(D) tells us how much information is gained by branching on attribute A.
- The attribute A with the highest information gain is chosen as the splitting attribute at node N: it minimizes the information needed to classify the samples in the resulting partitions and reflects the least randomness or impurity in these partitions.

Example: Information Gain
- Class P: buys_computer = "yes"; class N: buys_computer = "no" (9 "yes" and 5 "no" tuples, so Info(D) = I(9, 5) = 0.940 bits).
- "age <= 30" has 5 out of the 14 samples, with 2 yes'es and 3 no's, so this branch contributes (5/14) × I(2, 3) to Info_age(D). Summing over all age branches gives Info_age(D), and hence Gain(age) = Info(D) - Info_age(D).
- Similarly, the gains of the other attributes (income, student, credit_rating) are computed, and the attribute with the highest gain is chosen as the split.
A sketch that carries out this computation is given below.
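A minimal sketch of this calculation. The per-branch counts for the "31..40" and ">40" age groups (4 yes/0 no and 3 yes/2 no) are taken from the standard textbook example and are an assumption here, since only the "age <= 30" branch counts appear in the slides:

```python
# Sketch: information gain of the attribute "age" on the buys_computer data.
# Counts marked (assumed) follow the standard textbook example; only the
# first branch (2 yes / 3 no) is stated explicitly in the slides.
import math

def info(counts):
    """Entropy of a class-count vector."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

whole = [9, 5]                       # 9 "yes", 5 "no" tuples in D
age_branches = [
    [2, 3],                          # age <= 30: 2 yes, 3 no (from the slides)
    [4, 0],                          # age 31..40 (assumed)
    [3, 2],                          # age > 40 (assumed)
]

info_D = info(whole)
info_age = sum(sum(b) / sum(whole) * info(b) for b in age_branches)
gain_age = info_D - info_age
print(f"Info(D) = {info_D:.3f}, Info_age(D) = {info_age:.3f}, Gain(age) = {gain_age:.3f}")
# Prints approximately: Info(D) = 0.940, Info_age(D) = 0.694, Gain(age) = 0.247
```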

Information Gain (ID3 applications)
ID3 has been incorporated in a number of commercial rule-induction packages. Specific applications include medical diagnosis, credit risk assessment of loan applications, diagnosis of equipment malfunctions by their cause, classification of soybean diseases, and web search classification.

Gain Ratio (C4.5)
- The information gain measure is biased towards attributes with a large number of values. For instance, consider an attribute that acts as a unique identifier, such as customer_ID. A split on this attribute would result in a large number of partitions, each containing just one tuple. Because each partition is pure, the information required to classify data set D based on this partitioning would be 0; clearly, such a partitioning is useless for classification.
- C4.5 (a successor of ID3) uses the gain ratio to overcome this problem, normalizing the information gain by the split information:
    SplitInfo_A(D) = - Σ_{j=1..v} (|Dj| / |D|) log2(|Dj| / |D|),
    GainRatio(A) = Gain(A) / SplitInfo_A(D).
- Example: gain_ratio(income) = 0.029 / 0.926 = 0.031.
- The attribute with the maximum gain ratio is selected as the splitting attribute.
A short sketch of the gain-ratio computation is given below.
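A minimal sketch (my own illustration) of SplitInfo and GainRatio; the partition of D and its per-branch class counts are hypothetical, so the printed value is not one of the slide's numbers:

```python
# Sketch: gain ratio = information gain / split information, for one attribute.
# The partition sizes and per-branch class counts below are hypothetical.
import math

def info(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_ratio(class_counts_D, branches):
    """branches: list of per-branch class-count vectors for one attribute."""
    n = sum(class_counts_D)
    info_D = info(class_counts_D)
    info_A = sum(sum(b) / n * info(b) for b in branches)
    split_info = info([sum(b) for b in branches])   # entropy of branch sizes
    return (info_D - info_A) / split_info

# Hypothetical attribute splitting 14 tuples (9 yes / 5 no) into 3 branches.
branches = [[6, 1], [2, 2], [1, 2]]
print(round(gain_ratio([9, 5], branches), 3))   # roughly 0.109 for this made-up partition
```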

Gini Index
- The Gini index is named after the Italian statistician and economist Corrado Gini.
- It is based on the sum of the squares of the class proportions: the Gini index of a node t is
    Gini(t) = 1 - Σ_j [p(j | t)]²,
  where p(j | t) is the proportion of class j at node t; it measures the impurity of the node.
- The basic idea is to choose a split at each node so that the data in each subset (child node) is "purer" than the data in the parent node.

Example of Gini Index (6 records, two classes C1 and C2)
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0.
- P(C1) = 1/6, P(C2) = 5/6: Gini = 1 - (1/6)² - (5/6)² = 0.278.
- P(C1) = 2/6, P(C2) = 4/6: Gini = 1 - (2/6)² - (4/6)² = 0.444.
- Maximum (1 - 1/nc, for nc classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.
A sketch that reproduces these values follows.
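A minimal sketch (my own) of the Gini index computation, reproducing the example values:

```python
# Sketch: Gini index of a node from class counts, reproducing the example.
def gini(counts):
    """Gini(t) = 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, "->", round(gini(counts), 3))
# (0, 6) -> 0.0   (1, 5) -> 0.278   (2, 4) -> 0.444   (3, 3) -> 0.5 (maximum)
```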

Gini Index for Splits (CART, IBM IntelligentMiner)
- If a data set D is split on attribute A into two subsets D1 and D2, the Gini index of the split is defined as
    gini_A(D) = (|D1| / |D|) × gini(D1) + (|D2| / |D|) × gini(D2).
- Reduction in impurity: Δgini(A) = gini(D) - gini_A(D).
- The attribute that provides the smallest gini_split(D) (equivalently, the largest reduction in impurity) is chosen to split the node; this requires enumerating all the possible splitting points for each attribute.

Induction of a Decision Tree Using the Gini Index
- Example: D has 9 tuples with buys_computer = "yes" and 5 with "no". Compute the Gini index for each attribute.
- Start with the attribute income and consider each of the possible splitting subsets. The subset {medium, high} partitions D into 10 tuples in D1 and 4 tuples ({low}) in D2.
- Similarly, the Gini index values for splits on the remaining subsets are 0.490 for {low, high} vs. {medium} and 0.475 for {low, medium} vs. {high}; gini_{medium,high} = 0.450 is the best split on income, since it is the lowest.
A small sketch of the weighted (split) Gini computation is given below.
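A minimal sketch of the split computation. The slides give only the subset sizes (10 and 4); the per-subset class counts below follow the standard textbook distribution and are an assumption here, though with these counts the result does come out as the 0.450 quoted above:

```python
# Sketch: weighted Gini index of a binary split D -> (D1, D2).
# Only the subset sizes 10 and 4 come from the slides; the class counts
# per subset are assumed from the standard textbook example.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(d1_counts, d2_counts):
    n1, n2 = sum(d1_counts), sum(d2_counts)
    n = n1 + n2
    return n1 / n * gini(d1_counts) + n2 / n * gini(d2_counts)

d1 = [6, 4]   # income in {medium, high}: 10 tuples (6 yes, 4 no - assumed)
d2 = [3, 1]   # income in {low}: 4 tuples (3 yes, 1 no - assumed)
print(round(gini_split(d1, d2), 3))   # prints 0.45 for this candidate split
```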

Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Information gain: all attributes are assumed to be categorical (it can be modified for continuous-valued attributes); it is biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: all attributes are assumed to be continuous-valued; it assumes there exist several possible split values for each attribute, and may need other tools, such as clustering, to get the possible split values (it can be modified for categorical attributes); it is biased towards multivalued attributes, has difficulty when the number of classes is large, and tends to favor tests that result in equal-sized partitions with purity in both partitions.

Issues in decision tree algorithms (recap): how to split the records; how to specify the attribute test condition; how to determine the best split. Next question: when to stop splitting?

Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination (to be discussed later).

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction; the leaf node holds the class prediction.
- Rules are easier for humans to understand.
- Example:
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
  IF age = "..."
A sketch of extracting such rules from a fitted tree follows.
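A minimal sketch (assuming scikit-learn; not part of the lecture) showing how the IF-THEN structure can be read off a fitted tree with sklearn.tree.export_text; the tiny encoded data set is made up for illustration:

```python
# Sketch: fit a small tree and print its paths as IF-THEN style rules.
# The tiny encoded buys_computer-like data set is made up for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age_group (0: <=30, 1: 31..40, 2: >40), student (0/1)]
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
y = ["no", "yes", "yes", "yes", "yes", "yes"]

tree = DecisionTreeClassifier().fit(X, y)

# Each root-to-leaf path printed below corresponds to one IF-THEN rule.
print(export_text(tree, feature_names=["age_group", "student"]))
```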

... 7.5, therefore instead of T4 with a leaf node ...

Strengths and Weaknesses of Decision Tree Methods
Strengths:
- Generate understandable rules.
- Perform classification without requiring much computation.
- Handle both continuous and categorical variables.
- Provide a clear indication of which fields are most important for prediction or classification.
Weaknesses:
- Less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
- Prone to errors ...
