
Ch5 Data Mining Classification: Alternative Techniques

Rule-Based Classifier
- Classify records by using a collection of "if…then…" rules.
- Rule: (Condition) → y, where Condition is a conjunction of attribute tests and y is the class label.
- LHS: rule antecedent or condition; RHS: rule consequent.
- Examples of classification rules:
  (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
  (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

Rule-Based Classifier (Example)
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
human          warm        yes         no       no             mammals
python         cold        no          no       no             reptiles
salmon         cold        no          no       yes            fishes
whale          warm        yes         no       yes            mammals
frog           cold        no          no       sometimes      amphibians
komodo         cold        no          no       no             reptiles
bat            warm        yes         yes      no             mammals
pigeon         warm        no          yes      no             birds
cat            warm        yes         no       no             mammals
leopard shark  cold        yes         no       yes            fishes
turtle         cold        no          no       sometimes      reptiles
penguin        warm        no          no       sometimes      birds
porcupine      warm        yes         no       no             mammals
eel            cold        no          no       yes            fishes
salamander     cold        no          no       sometimes      amphibians
gila monster   cold        no          no       no             reptiles
platypus       warm        no          no       no             mammals
owl            warm        no          yes      no             birds
dolphin        warm        yes         no       yes            mammals
eagle          warm        no          yes      no             birds

Application of Rule-Based Classifier
- A rule r covers an instance x if the attributes of the instance satisfy the condition of the rule.
- Using rules R1–R5 above:
  Rule R1 covers the hawk => Bird.
  Rule R3 covers the grizzly bear => Mammal.

Name          Blood Type  Give Birth  Can Fly  Live in Water  Class
hawk          warm        no          yes      no             ?
grizzly bear  warm        yes         no       no             ?

Rule Coverage and Accuracy
- Coverage of a rule: the fraction of records that satisfy the antecedent of the rule.
- Accuracy of a rule: the fraction of the records satisfying the antecedent that also satisfy the consequent.
- Example: on the data set below, (Status=Single) → No has Coverage = 40%, Accuracy = 50%.

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
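The 40% coverage and 50% accuracy quoted for (Status=Single) → No can be reproduced directly. A minimal sketch in Python; the record encoding and the helper name `coverage_and_accuracy` are my own:

```python
# The ten-record tax data set from the table above.
records = [
    {"Refund": "Yes", "Status": "Single",   "Income": 125, "Class": "No"},
    {"Refund": "No",  "Status": "Married",  "Income": 100, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 70,  "Class": "No"},
    {"Refund": "Yes", "Status": "Married",  "Income": 120, "Class": "No"},
    {"Refund": "No",  "Status": "Divorced", "Income": 95,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 60,  "Class": "No"},
    {"Refund": "Yes", "Status": "Divorced", "Income": 220, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 85,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 75,  "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 90,  "Class": "Yes"},
]

def coverage_and_accuracy(records, antecedent, consequent):
    # Coverage: fraction of all records matching the antecedent.
    covered = [r for r in records
               if all(r[a] == v for a, v in antecedent.items())]
    # Accuracy: fraction of covered records whose class matches the consequent.
    correct = [r for r in covered if r["Class"] == consequent]
    return len(covered) / len(records), len(correct) / len(covered)

cov, acc = coverage_and_accuracy(records, {"Status": "Single"}, "No")
print(cov, acc)  # 0.4 0.5
```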

How Does a Rule-Based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
- A lemur triggers rule R3, so it is classified as a mammal.
- A turtle triggers both R4 and R5.
- A dogfish shark triggers none of the rules.

Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
lemur          warm        yes         no       no             ?
turtle         cold        no          no       sometimes      ?
dogfish shark  cold        yes         no       yes            ?

Characteristics of Rule-Based Classifiers
- Mutually exclusive rules: the rules are independent of each other, so every record is covered by at most one rule.
- Exhaustive rules: the classifier accounts for every possible combination of attribute values, so each record is covered by at least one rule.

From Decision Trees to Rules
[Figure: decision tree. Refund? Yes → NO; No → Marital Status? {Single, Divorced} → Taxable Income? (< 80K → NO, > 80K → YES); {Married} → NO.]
Classification rules:
(Refund=Yes) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) => Yes
(Refund=No, Marital Status={Married}) => No
- The rules are mutually exclusive and exhaustive, and the rule set contains as much information as the tree.

Rules Can Be Simplified
[Figure: the same decision tree and the ten-record tax data set shown earlier, with the class column labeled Cheat.]
- Initial rule: (Refund=No) ∧ (Status=Married) → No
- Simplified rule: (Status=Married) → No

Effect of Rule Simplification
- Rules are no longer mutually exclusive: a record may trigger more than one rule. Solutions:
  - Ordered rule set.
  - Unordered rule set: use a voting scheme. The consequent of each triggered rule counts as one vote for the corresponding class, and the votes are tallied to determine the class label of the test record; in some cases the votes can be weighted by rule accuracy.
    Pros: less susceptible to errors caused by a single badly chosen rule; no rule order to maintain, so building the model is relatively cheap.
    Cons: classification is expensive, because the attributes of the test record must be compared against the antecedent of every rule in the set.
- Rules are no longer exhaustive: a record may not trigger any rule. Solution: add a default rule ( ) → y, i.e. use a default class.
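The unordered, accuracy-weighted voting scheme just described can be sketched as follows. The rule encoding, the three example rules, and their vote weights are my own illustrative choices, not rules from the slides:

```python
from collections import defaultdict

# (antecedent, class label, rule accuracy used as the vote weight)
rules = [
    ({"Refund": "No", "Status": "Married"}, "No", 1.0),
    ({"Status": "Married"}, "No", 1.0),
    ({"Status": "Single"}, "Yes", 0.5),
]

def classify_unordered(record, rules, default="No"):
    votes = defaultdict(float)
    for antecedent, label, weight in rules:
        if all(record.get(a) == v for a, v in antecedent.items()):
            votes[label] += weight   # each triggered rule casts a weighted vote
    if not votes:
        return default               # no rule fires -> fall back to the default class
    return max(votes, key=votes.get)

print(classify_unordered({"Refund": "No", "Status": "Married"}, rules))  # No
```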

Ordered Rule Set
- Rules are ranked in descending order according to their priority. Priority can be defined in several ways: by accuracy, by coverage, by total description length, or by the order in which the rules were generated.
- An ordered rule set is also known as a decision list.
- When a test record is presented to the classifier, it is assigned the class label of the highest-ranked rule it triggers; if none of the rules fire, it is assigned the default class.

Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
turtle  cold        no          no       sometimes      ?

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Rule Ordering Schemes
- Rule-based ordering: individual rules are ranked by some measure of rule quality. This guarantees that every test record is classified by the "best" rule that covers it. A potential drawback is that lower-ranked rules are harder to interpret, since each rule implicitly assumes that all rules ranked above it do not hold.
- Class-based ordering: rules that belong to the same class appear together. A lower-quality rule may fire first simply because its class has higher priority, causing a higher-quality rule to be ignored.

Class-based ordering:
(Refund=Yes) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) => No
(Refund=No, Marital Status={Married}) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) => Yes

Rule-based ordering:
(Refund=Yes) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) => No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) => Yes
(Refund=No, Marital Status={Married}) => No
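The first-match semantics of an ordered rule set (decision list) can be sketched with rules R1–R5 from the slide above; the dictionary encoding and the choice of default class are my own:

```python
# Ordered rule set R1-R5 from the vertebrate example.
rules = [
    ({"Give Birth": "no", "Can Fly": "yes"}, "Birds"),          # R1
    ({"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),   # R2
    ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),   # R3
    ({"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),        # R4
    ({"Live in Water": "sometimes"}, "Amphibians"),             # R5
]

def classify_ordered(record, rules, default="Mammals"):
    # Return the class of the highest-ranked rule the record triggers.
    for antecedent, label in rules:
        if all(record.get(a) == v for a, v in antecedent.items()):
            return label
    return default  # no rule fired

turtle = {"Blood Type": "cold", "Give Birth": "no",
          "Can Fly": "no", "Live in Water": "sometimes"}
print(classify_ordered(turtle, rules))  # Reptiles (R4 outranks R5)
```

The turtle triggers both R4 and R5, but under rule ordering only the higher-ranked R4 decides the label.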

Building Classification Rules
- Direct method: extract rules directly from the data, e.g. RIPPER, CN2, Holte's 1R.
- Indirect method: extract rules from other classification models (decision trees, neural networks, etc.), e.g. C4.5rules.

Direct Method: Sequential Covering
1. Start from an empty rule.
2. Grow a rule using the Learn-One-Rule function.
3. Remove the training records covered by the rule.
4. Repeat steps 2 and 3 until the stopping criterion is met.

Sequential covering algorithm:
1.  Let E be the training records and A the set of attribute-value pairs {(Aj, vj)}.
2.  Let Y0 be an ordered set of classes {y1, y2, …, yk}.
3.  Let R = { } be the initial rule list.
4.  for each class y ∈ Y0 − {yk} do
5.      while the stopping condition is not met do
6.          r ← Learn-One-Rule(E, A, y)
7.          Remove the training records covered by r from E
8.          Append r to the end of the rule list: R ← R ∨ r
9.      end while
10. end for
11. Append the default rule ( ) → yk to the end of the rule list R.
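The loop above can be rendered as a toy Python sketch. The greedy `learn_one_rule` here (add the single attribute-value conjunct with the best accuracy until the rule covers no negatives or runs out of conjuncts) is a deliberately simplified stand-in for a real Learn-One-Rule, and the four weather-style records are invented for illustration:

```python
def covers(rule, record):
    return all(record.get(a) == v for a, v in rule.items())

def learn_one_rule(examples, positive_class):
    rule = {}
    while True:
        neg = [r for r in examples
               if covers(rule, r) and r["Class"] != positive_class]
        if not neg:                        # rule is consistent: stop growing
            return rule
        best, best_acc = None, -1.0
        # Candidate conjuncts come from the attribute values of positives.
        for r in examples:
            if r["Class"] != positive_class:
                continue
            for a, v in r.items():
                if a == "Class" or a in rule:
                    continue
                cand = dict(rule, **{a: v})
                cov = [x for x in examples if covers(cand, x)]
                if not cov:
                    continue
                acc = sum(x["Class"] == positive_class for x in cov) / len(cov)
                if acc > best_acc:
                    best, best_acc = cand, acc
        if best is None:                   # no conjunct left to add
            return rule
        rule = best

def sequential_covering(examples, positive_class):
    rules, remaining = [], list(examples)
    while any(r["Class"] == positive_class for r in remaining):
        rule = learn_one_rule(remaining, positive_class)
        rules.append((rule, positive_class))
        # Remove the training records covered by the new rule.
        remaining = [r for r in remaining if not covers(rule, r)]
    return rules

data = [
    {"Outlook": "sunny", "Windy": "no",  "Class": "Play"},
    {"Outlook": "sunny", "Windy": "yes", "Class": "Play"},
    {"Outlook": "rain",  "Windy": "no",  "Class": "Play"},
    {"Outlook": "rain",  "Windy": "yes", "Class": "Stay"},
]
found = sequential_covering(data, "Play")
print(found)  # [({'Outlook': 'sunny'}, 'Play'), ({'Windy': 'no'}, 'Play')]
```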

Example of Sequential Covering
[Figure: (i) original data; (ii) step 1 learns rule R1; (iii) step 2 removes the records covered by R1 and learns R2; (iv) step 3 continues the process.]

Aspects of Sequential Covering
- Rule growing
- Instance elimination
- Rule evaluation
- Stopping criterion
- Rule pruning

Rule Growing
Two common strategies:
(a) General-to-specific: start with an initial rule r: ( ) → y, which has very poor quality but covers all examples. Add new conjuncts to improve the rule's quality, exploring all candidate conjuncts and greedily choosing the best one; continue until the stopping criterion is met.
[Figure: candidate conjuncts such as Refund=No, Status=Single, Status=Divorced, Status=Married, and Income>80K, each scored by the Yes/No class counts of the records it covers.]
(b) Specific-to-general: randomly choose a positive example as the seed for rule growing. In each refinement step, generalize the rule by deleting one of its conjuncts so that it covers more positive examples; continue until the stopping criterion is met.
Example: (Refund=No, Status=Single, Income=85K) (Class=Yes) and (Refund=No, Status=Single, Income=90K) (Class=Yes) generalize to (Refund=No, Status=Single) (Class=Yes).

Rule Growing (Examples)
CN2 algorithm:
- Start from an empty conjunct: { }.
- Add the conjunct that minimizes the entropy measure: {A}, {A,B}, …
- Determine the rule consequent by taking the majority class of the instances covered by the rule.
RIPPER algorithm:
- Start from an empty rule: ( ) => class.
- Add the conjunct that maximizes FOIL's information gain measure:
  R0: ( ) => class (initial rule)
  R1: A => class (rule after adding a conjunct)
  Gain(R0, R1) = t × [ log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0)) ]
  where
  t:  number of positive instances covered by both R0 and R1
  p0: number of positive instances covered by R0
  n0: number of negative instances covered by R0
  p1: number of positive instances covered by R1
  n1: number of negative instances covered by R1
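FOIL's information gain, as defined above, translates directly into code. The example counts in the call are invented to show the shape of the computation:

```python
from math import log2

def foil_gain(p0, n0, p1, n1, t=None):
    """FOIL's information gain for extending rule R0 (covering p0+/n0-)
    to rule R1 (covering p1+/n1-); t is the number of positives covered
    by both rules."""
    if t is None:
        t = p1  # positives covered by the more specific R1 are also covered by R0
    return t * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

# e.g. a conjunct narrowing 50+/50- down to 30+/5- :
print(round(foil_gain(50, 50, 30, 5), 3))
```

The gain is positive when the added conjunct raises the fraction of positives among the covered instances, and it scales with t, so conjuncts that keep many positives are preferred.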

Instance Elimination
- Why do we need to eliminate instances? Otherwise, the next rule learned would be identical to the previous rule.
- Why do we remove positive instances? To ensure that the next rule is different.
- Why do we remove negative instances? To prevent underestimating the accuracy of the next rule. Compare rules R2 and R3 in the diagram.
[Figure: positive (+) and negative (−) classes, with regions covered by rules R1, R2, and R3.]

Rule Evaluation
- Statistical test: for example, compute the likelihood ratio statistic

  R = 2 Σ(i=1..k) fi log2(fi / ei)

  where k is the number of classes, fi is the observed frequency of class i among the examples covered by the rule, and ei is the expected frequency of class i for a rule that makes random predictions.
- Interpretation: the statistic compares the given rule against a rule that makes random predictions. A large R value suggests that the number of correct predictions made by the rule is significantly larger than random.
- Worked example: consider a training set that contains 60 positive and 100 negative examples.
  Rule 1 covers 50 positive examples and 5 negative examples.
  Rule 2 covers 2 positive examples and no negative examples.
  Rule 1 covers 55 examples, so the expected frequency for the positive class is e1 = 55 × 60/160 = 20.625, while the expected frequency for the negative class is e2 = 55 × 100/160 = 34.375:
  R(r1) = 2 × [50 × log2(50 / 20.625) + 5 × log2(5 / 34.375)] = 99.9

Rule Evaluation (Metrics)
  Accuracy   = nc / n
  Laplace    = (nc + 1) / (n + k)
  M-estimate = (nc + k·p) / (n + k)
  where n is the number of instances covered by the rule, nc the number of positive instances covered by the rule, k the number of classes, and p the prior probability of the positive class.
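The three metrics and the R(r1) computation from the worked example can be checked in a few lines; the function names are my own:

```python
from math import log2

# Rule-evaluation measures: n covered instances, nc positives among them,
# k classes, p prior probability of the positive class.
def accuracy(nc, n):          return nc / n
def laplace(nc, n, k):        return (nc + 1) / (n + k)
def m_estimate(nc, n, k, p):  return (nc + k * p) / (n + k)

# Likelihood-ratio statistic for Rule 1 of the worked example:
# 50+/5- covered, class priors 60+/100- out of 160 training examples.
f = [50, 5]
e = [55 * 60 / 160, 55 * 100 / 160]   # 20.625, 34.375
R = 2 * sum(fi * log2(fi / ei) for fi, ei in zip(f, e))
print(round(R, 1))  # 99.9, matching the computation above
```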

Stopping Criterion and Rule Pruning
- Stopping criterion: compute the gain; if the gain is not significant, discard the new rule.
- Rule pruning: similar to the post-pruning of decision trees. Reduced error pruning: remove one of the conjuncts in the rule, compare the error rate on the validation set before and after pruning, and prune the conjunct if the error improves.

Summary of the Direct Method
1. Grow a single rule.
2. Remove the instances covered by the rule.
3. Prune the rule (if necessary).
4. Add the rule to the current rule set.
5. Repeat.

Direct Method: RIPPER
- For a 2-class problem, choose one of the classes as the positive class and the other as the negative class. Learn rules for the positive class; the negative class becomes the default class.

- For a multi-class problem: order the classes by increasing class prevalence (the fraction of instances that belong to each class); learn the rule set for the smallest class first, treating the rest as the negative class; then repeat with the next smallest class as the positive class.

Direct Method: RIPPER (growing a rule)
- Start from an empty rule.
- Add conjuncts as long as they improve FOIL's information gain.
- Stop when the rule no longer covers negative examples.
- Prune the rule immediately using incremental reduced error pruning.
- Pruning measure: v = (p − n) / (p + n), where p and n are the numbers of positive and negative examples in the validation set covered by the rule.
- Pruning method: delete any final sequence of conditions that maximizes v.
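RIPPER's pruning step, delete the final sequence of conjuncts that maximizes v on the validation set, can be sketched as follows. The `val_counts` callback and the example counts are hypothetical stand-ins for evaluating a truncated rule on real validation data:

```python
def v_metric(p, n):
    # p/n: positive/negative validation examples covered by the rule.
    return (p - n) / (p + n)

def prune_rule(conjuncts, val_counts):
    """val_counts(prefix) -> (p, n) on the validation set."""
    best, best_v = conjuncts, v_metric(*val_counts(conjuncts))
    for cut in range(len(conjuncts) - 1, 0, -1):  # drop final conjuncts
        prefix = conjuncts[:cut]
        v = v_metric(*val_counts(prefix))
        if v >= best_v:            # on ties, prefer the shorter rule
            best, best_v = prefix, v
    return best

# Hypothetical validation counts (p, n) for each rule-prefix length:
counts = {3: (10, 2), 2: (12, 2), 1: (15, 8)}
rule = ["A", "B", "C"]
print(prune_rule(rule, lambda prefix: counts[len(prefix)]))  # ['A', 'B']
```

Here dropping the last conjunct raises v from 8/12 to 10/14, while dropping two conjuncts lets in too many negatives, so the two-conjunct rule survives.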

Direct Method: RIPPER (building a rule set)
- Use the sequential covering algorithm: find the best rule that covers the current set of positive examples, then eliminate both the positive and negative examples covered by the rule.
- Each time a rule is added to the rule set, compute the new description length; stop adding new rules when the new description length is d bits longer than the smallest description length obtained so far.

Direct Method: RIPPER (optimizing the rule set)
- For each rule r in the rule set R, consider two alternative rules: the replacement rule r* (grow a new rule from scratch) and the revised rule r′ (add conjuncts to extend r).
- Compare the rule set containing r against the rule sets containing r* and r′, and choose the rule set that minimizes the description length (MDL principle).
- Repeat rule generation and rule optimization for the remaining positive examples.

Indirect Methods
[Figure: decision tree with P at the root. P=No leads to Q (No → −, Yes → +); P=Yes leads to R (No → +, Yes → Q with No → −, Yes → +).]
Rule set:
r1: (P=No, Q=No) => −
r2: (P=No, Q=Yes) => +
r3: (P=Yes, R=No) => +
r4: (P=Yes, R=Yes, Q=No) => −
r5: (P=Yes, R=Yes, Q=Yes) => +
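Extracting one rule per root-to-leaf path, as the indirect method does for the P/Q/R tree above, can be sketched as follows; the nested-tuple tree encoding is my own:

```python
# Internal node: (attribute, {branch value: child}); leaf: a class label string.
tree = ("P", {
    "No":  ("Q", {"No": "-", "Yes": "+"}),
    "Yes": ("R", {"No": "+",
                  "Yes": ("Q", {"No": "-", "Yes": "+"})}),
})

def tree_to_rules(node, antecedent=()):
    if isinstance(node, str):                 # leaf: emit one rule
        return [(antecedent, node)]
    attr, branches = node
    rules = []
    for value, child in branches.items():     # one rule per path
        rules.extend(tree_to_rules(child, antecedent + ((attr, value),)))
    return rules

for antecedent, label in tree_to_rules(tree):
    print(", ".join(f"{a}={v}" for a, v in antecedent), "=>", label)
```

Run on the tree above, this prints exactly the five rules r1–r5.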

Indirect Method: C4.5rules
- Extract rules from an unpruned decision tree.
- For each rule r: A → y, consider an alternative rule r′: A′ → y, where A′ is obtained by removing one of the conjuncts in A. Compare the pessimistic error rate of r against that of each r′; prune r if some r′ has a lower pessimistic error rate. Repeat until the generalization error can no longer be improved.
- Instead of ordering individual rules, order subsets of rules (class ordering): each subset is the collection of rules with the same rule consequent (class), and the class whose subset has the smallest description length is given the highest priority (Occam's Razor).
- Occam's Razor: a principle urging one to select, among competing hypotheses, the one that makes the fewest assumptions.
(Discussion questions from the notes: recall the regression and classification problems; why can regularization prevent over-fitting? Why is SVM such a popular classifier?)

Example

Name           Give Birth  Lay Eggs  Can Fly  Live in Water  Have Legs  Class
human          yes         no        no       no             yes        mammals
python         no          yes       no       no             no         reptiles
salmon         no          yes       no       yes            no         fishes
whale          yes         no        no       yes            no         mammals
frog           no          yes       no       sometimes      yes        amphibians
komodo         no          yes       no       no             yes        reptiles
bat            yes         no        yes      no             yes        mammals
pigeon         no          yes       yes      no             yes        birds
cat            yes         no        no       no             yes        mammals
leopard shark  yes         no        no       yes            no         fishes
turtle         no          yes       no       sometimes      yes        reptiles
penguin        no          yes       no       sometimes      yes        birds
porcupine      yes         no        no       no             yes        mammals
eel            no          yes       no       yes            no         fishes
salamander     no          yes       no       sometimes      yes        amphibians
gila monster   no          yes       no       no             yes        reptiles
platypus       no          yes       no       no             yes        mammals
owl            no          yes       yes      no             yes        birds
dolphin        yes         no        no       yes            no         mammals
eagle          no          yes       yes      no             yes        birds

C4.5 versus C4.5rules versus RIPPER
C4.5rules:
(Give Birth=No, Can Fly=Yes) → Birds
(Give Birth=No, Live in Water=Yes) → Fishes
(Give Birth=Yes) → Mammals
(Give Birth=No, Can Fly=No, Live in Water=No) → Reptiles
( ) → Amphibians
RIPPER:
(Live in Water=Yes) → Fishes
(Have Legs=No) → Reptiles
(Give Birth=No, Can Fly=No, Live In Water=No) → Reptiles
(Can Fly=Yes, Give Birth=No) → Birds
( ) → Mammals
[Figure: the C4.5 decision tree. Give Birth? Yes → Mammals; No → Live in Water? Yes → Fishes, Sometimes → Amphibians, No → Can Fly? Yes → Birds, No → Reptiles.]

Confusion matrices (rows: actual class; columns: predicted class).

C4.5 and C4.5rules:
            Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians  0           0       0         0      2
Fishes      0           3       0         0      0
Reptiles    0           0       3         0      1
Birds       0           0       1         2      1
Mammals     0           2       1         0      4

RIPPER:
            Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians  2           0       0         0      0
Fishes      0           2       0         0      1
Reptiles    1           0       3         0      0
Birds       1           0       0         3      0
Mammals     0           0       1         0      6

Advantages of Rule-Based Classifiers
- As highly expressive as decision trees.
- Easy to interpret, easy to generate.
- Can classify new instances rapidly.
- Performance comparable to decision trees.
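The overall accuracy of each classifier can be read off the confusion matrices above by summing the diagonal; a minimal check (row order Amphibians, Fishes, Reptiles, Birds, Mammals):

```python
c45 = [
    [0, 0, 0, 0, 2],
    [0, 3, 0, 0, 0],
    [0, 0, 3, 0, 1],
    [0, 0, 1, 2, 1],
    [0, 2, 1, 0, 4],
]
ripper = [
    [2, 0, 0, 0, 0],
    [0, 2, 0, 0, 1],
    [1, 0, 3, 0, 0],
    [1, 0, 0, 3, 0],
    [0, 0, 1, 0, 6],
]

def overall_accuracy(m):
    # Correct predictions sit on the diagonal (actual == predicted).
    return sum(m[i][i] for i in range(len(m))) / sum(map(sum, m))

print(overall_accuracy(c45), overall_accuracy(ripper))  # 0.6 0.8
```

On these 20 test records RIPPER gets 16 right versus 12 for C4.5/C4.5rules.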

Instance-Based Classifiers
- Store the training records and use them directly to predict the class label of unseen cases.
[Figure: a set of stored cases with attributes Atr1 … AtrN and class labels A, B, C; an unseen case with the same attributes is matched against them.]
- Examples:
  - Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record exactly match one of the training examples.
  - Nearest neighbor: uses the k "closest" points (nearest neighbors) to perform classification.

Nearest Neighbor Classifiers
- Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
[Figure: a test record is compared against the training records, and the k "nearest" records are chosen.]
- Requires three things:
  - The set of stored records.
  - A distance metric to compute the distance between records.
  - The value of k, the number of nearest neighbors to retrieve.
- To classify an unknown record:
  - Compute its distance to the training records.
  - Identify the k nearest neighbors.
  - Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g. by taking a majority vote).
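The three ingredients listed above, stored records, a distance metric, and k, fit in a few lines. This sketch assumes Euclidean distance and majority voting, and the six training points are invented:

```python
from collections import Counter
from math import dist

def knn_classify(train, x, k=3):
    """train: list of (feature_vector, label); x: feature vector to classify."""
    # 1) compute distances and keep the k nearest records
    neighbors = sorted(train, key=lambda rec: dist(rec[0], x))[:k]
    # 2) majority vote over the neighbors' class labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((0, 1), "A"), ((1, 0), "A"),
         ((5, 5), "B"), ((5, 6), "B"), ((6, 5), "B")]
print(knn_classify(train, (1, 1), k=3))  # A
```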
