
Foundations of Machine Learning

Decision Tree

Top 10 algorithms in data mining: C4.5, K-Means, SVM, Apriori, EM (Maximum Likelihood), PageRank, AdaBoost, KNN, Naïve Bayes, CART

Main classification methods: Logistic Regression, Linear Discriminant Analysis, Decision Tree Induction, Nearest Neighbor, Bayes Classification Methods, Classification by Backpropagation, Support Vector Machines, Ensemble Methods, ...

Illustrating the Classification Task

Example of a Decision Tree

Another Example of a Decision Tree

Decision Tree Classification Task

Apply Model to Test Data
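The slides at this point walk a test record down an example tree (the figures are not included in this extract). As a stand-in, here is a tiny hypothetical tree encoded as nested dictionaries, and a routine that applies it to a test record; the attribute names and values are purely illustrative, not taken from the slides:

# A hypothetical decision tree: internal nodes test one attribute,
# leaves carry a class label.
tree = {
    "attribute": "refund",
    "children": {
        "yes": "no",                       # leaf: class "no"
        "no": {
            "attribute": "marital_status",
            "children": {
                "married": "no",
                "single": "yes",
                "divorced": "yes",
            },
        },
    },
}

def classify(node, record):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["children"][record[node["attribute"]]]
    return node

print(classify(tree, {"refund": "no", "marital_status": "single"}))  # -> "yes"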

Decision Tree Classification Task

Algorithm for Decision Tree Induction
Many algorithms:
- Hunt's Algorithm (one of the earliest)
- ID3 (Iterative Dichotomiser)
- C4.5
- CART (Classification and Regression Tree)
- SLIQ (Supervised Learning In Quest)
- SPRINT (Scalable PaRallelizable INduction of decision Trees)
- ...

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on the selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
- There are no samples left.
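A minimal sketch of this top-down, recursive, divide-and-conquer procedure, assuming categorical attributes; the helper names and the (features_dict, label) data layout are mine, not from the slides, and the attribute-selection heuristic is passed in as a function:

from collections import Counter

def build_tree(examples, attributes, choose_best_attribute):
    """Grow a decision tree top-down; examples is a list of (features_dict, label) pairs."""
    if not examples:
        # No samples left: the caller should fall back to its own majority class.
        return None
    labels = [label for _, label in examples]
    if len(set(labels)) == 1:
        # All samples at this node belong to the same class.
        return labels[0]
    if not attributes:
        # No remaining attributes: label the leaf by majority voting.
        return Counter(labels).most_common(1)[0][0]
    # Select the test attribute with a heuristic measure (e.g., information gain).
    best = choose_best_attribute(examples, attributes)
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    # Partition the examples on the selected attribute and recurse on each partition.
    for value in {feats[best] for feats, _ in examples}:
        subset = [(feats, label) for feats, label in examples if feats[best] == value]
        node["children"][value] = build_tree(subset, remaining, choose_best_attribute)
    return node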

Algorithm for Decision Tree Induction
Greedy strategy: split the records based on the attribute test that optimizes a chosen criterion (i.e., split on the best splitting attribute).
Issues:
- How to select the best attribute? How to split the records? How to determine the best split?
- When to stop splitting?

How to Split the Records
Depends on the attribute type:
- Nominal
- Ordinal
- Continuous
Depends on the number of ways to split:
- 2-way split
- Multi-way split

Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.

Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.

Splitting Based on Continuous Attributes
- Discretization to form an ordinal categorical attribute.
- Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. This can be more computation-intensive.

How to Determine the Best Split
Before splitting: 10 records of class C0 and 10 records of class C1.
Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity (purity).

Measures of Node Impurity
- Entropy
- Gini index
- ...

Brief Review of Entropy
Entropy measures the impurity of a set of class labels: H = -Σ_i p_i log2(p_i).

For m = 2 classes, entropy ranges from 0 (all tuples in one class) to 1 bit (both classes equally likely).

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
- Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -Σ_{i=1..m} p_i log2(p_i)
- Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
- Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
- Information gain is the difference between the entropy of the parent node and the weighted average of the children nodes' entropies.
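These formulas can be checked with a few lines of Python. The parent node below has 9 "yes" and 5 "no" tuples, as in the buys_computer example that follows; the three-way partition counts (2/3, 4/0, 3/2) are assumed from the standard textbook version of that example, so treat them as an assumption rather than slide content:

from collections import Counter
from math import log2

def info(labels):
    """Info(D): entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_after_split(partitions):
    """Info_A(D): weighted average entropy of the partitions produced by attribute A."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * info(p) for p in partitions)

def gain(labels, partitions):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(labels) - info_after_split(partitions)

# Parent node: 9 "yes" and 5 "no" tuples.
D = ["yes"] * 9 + ["no"] * 5
# Partition of D by an attribute with three values; class counts 2/3, 4/0, 3/2
# are assumed from the standard textbook version of this example.
partitions = [["yes"] * 2 + ["no"] * 3,
              ["yes"] * 4,
              ["yes"] * 3 + ["no"] * 2]
print(round(info(D), 3))              # 0.94
print(round(gain(D, partitions), 3))  # 0.247 (0.246 in the worked example, which rounds intermediate values)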

Attribute Selection: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no".
Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940.
The term (5/14) I(2,3) means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence
Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694, and
Gain(age) = Info(D) - Info_age(D) = 0.246.
Similarly, Gain(income) = 0.029, and the gains of the remaining attributes are computed in the same way; age has the highest information gain and is selected as the splitting attribute.

Computing Information Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute.
- We must determine the best split point for A:
  - Sort the values of A in increasing order.
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
  - The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split_point, and D2 is the set of tuples in D satisfying A > split_point.

Gain Ratio for Attribute Selection (C4.5)
- The information gain measure is biased towards attributes with a large number of values.
- C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):
  GainRatio(A) = Gain(A) / SplitInfo(A)
- Example: gain_ratio(income) = 0.029 / 1.557 = 0.019.
- The attribute with the maximum gain ratio is selected as the splitting attribute.

Gini Index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the Gini index gini(D) is defined as
  gini(D) = 1 - Σ_{j=1..n} p_j^2
  where p_j is the relative frequency of class j in D.
- If D is split on A into two subsets D1 and D2, the Gini index of the split is defined as
  gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
- Reduction in impurity:
  Δgini(A) = gini(D) - gini_A(D)
- The attribute providing the smallest gini_split(D) (equivalently, the largest reduction in impurity) is chosen to split the node (for CART, all possible splitting points for each attribute need to be enumerated).

Computation of Gini Index
- Example: D has 9 tuples with buys_computer = "yes" and 5 with "no", so gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459.
- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}.
- Gini for the {low, high} split is 0.458 and for the {medium, high} split is 0.450; D is therefore split on {low, medium} (versus {high}), since that split has the lowest Gini index.
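The Gini computations above can be reproduced in a few lines of Python. Only the 10/4 partition sizes appear on the slide; the per-class counts inside the partitions (7/3 and 2/2) are assumed from the standard AllElectronics textbook example:

def gini(counts):
    """Gini index of a node given its per-class counts: 1 - sum(p_j^2)."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    """Weighted Gini of a binary split: |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)."""
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return n1 / n * gini(left_counts) + n2 / n * gini(right_counts)

# Parent node from the slide: 9 "yes" and 5 "no" tuples.
print(round(gini([9, 5]), 3))                 # 0.459
# income in {low, medium} vs {high}: 10 vs 4 tuples; the 7/3 and 2/2
# class counts are an assumption taken from the textbook example.
print(round(gini_split([7, 3], [2, 2]), 3))   # 0.443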

Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions.

Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data:
- too many branches, some of which may reflect anomalies due to noise or outliers;
- poor accuracy on unseen samples.
Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if doing so would make the goodness measure fall below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree to obtain a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".
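As a concrete illustration, scikit-learn exposes prepruning through growth-limiting parameters and postpruning through cost-complexity pruning. The parameter values below are illustrative choices, not prescribed by the slides:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early via max_depth / min_samples_leaf thresholds.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_tr, y_tr)
print("prepruned accuracy:", pre.score(X_te, y_te))

# Postpruning: grow a full tree, then prune by cost complexity (ccp_alpha),
# keeping the alpha that does best on held-out data.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_te, y_te),
)
print("postpruned accuracy:", best.score(X_te, y_te))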

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.

Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed.
- Why is decision tree induction popular?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods

Decision Tree Classifier in sklearn
class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
(This is the signature from an older scikit-learn release; recent versions have removed min_impurity_split and presort.)
/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
/pinard/p/6056319.html

Example: The Iris dataset (鸢尾属植物)
The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today, the species would be defined by their genomic signatures, but in the 1930s DNA had not even been identified as the carrier of genetic information.
The data are measurements of iris flowers: 150 samples divided into 3 classes of 50 samples each, with 4 attributes per sample. It is a very commonly used training and test set in data mining and classification.
The three classes are: setosa, versicolor, virginica.
The following four attributes of each plant were measured:
- Sepal length (萼片长度)
- Sepal width (萼片宽度)
- Petal length (花瓣长度)
- Petal width (花瓣宽度)
The first step is visualization (可视化):

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np

# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']

for t, marker, c in zip(range(3), ">ox", "rgb"):
    # We plot each class on its own to get differently colored markers
    plt.scatter(features[target == t, 0],
                features[target == t, 1],
                marker=marker, c=c)
plt.show()

sklearn.tree.DecisionTreeClassifier
>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])

sklearn.tree.DecisionTreeClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score   # sklearn.cross_validation in older releases
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([1.  , 0.93..., 0.86..., 0.93..., 0.93..., 0.93..., 0.93..., 1.  , 0.93..., 1.  ])
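After cross-validation, it is often useful to inspect the learned tree itself. A short sketch (not from the slides) using sklearn's export_text to print the fitted splits as readable rules:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
# Print the learned splits as indented if/else-style rules.
print(export_text(clf, feature_names=iris.feature_names))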

Example: Internet Advertisements Data Set
/ml/datasets/Internet+Advertisements
Data Set Characteristics: Multivariate
Number of Instances: 3279
Area: Computer
Attribute Characteristics: Categorical, Integer, Real
Number of Attributes: 1558
Date Donated: 1998-07-01
Associated Tasks: Classification
Missing Values? Yes
Number of Web Hits: 307075

Decision Tree Regressor
A regression tree, as the name suggests, uses a tree model for a regression problem: every leaf outputs a prediction value. The prediction is usually the mean of the target values of the training examples that fall into that leaf, i.e., c_m = ave(y_i | x_i ∈ leaf_m).
How CART behaves in classification versus regression:
- Similarities: in both settings CART is a binary tree; every non-leaf node has exactly two children, and all inputs falling into the same leaf receive the same output.
- Differences: in classification, CART uses the Gini index as the criterion for choosing the feature and the split; in regression, it uses MSE (mean squared error) or MAE (mean absolute error). In classification each leaf represents a class; in regression each leaf holds a prediction value, which is continuous.
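A minimal scikit-learn sketch of such a regression tree; the data here are synthetic, purely for illustration:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data: y = sin(x) plus noise.
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# Each leaf of the fitted tree predicts the mean target value of the training
# samples that reach it; the default splitting criterion is squared error (MSE).
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, y)
print(reg.predict([[2.5]]))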

Decision Tree Regressor
Given a dataset D = {(x1, y1), (x2, y2), ..., (xi, yi), ..., (xn, yn)}, where each xi is an m-dimensional vector (i.e., xi has m features), the goal of regression is to construct a function f(x) that fits the elements of D so that the MSE is minimized:
  min_f (1/n) Σ_{i=1..n} (f(x_i) - y_i)^2
Suppose a fully built CART regression tree has M leaves. This means CART has partitioned the input space into M regions R1, R2, ..., RM, and that the tree produces at most M distinct prediction values. CART therefore minimizes the MSE as:
  min Σ_{m=1..M} Σ_{x_i ∈ R_m} (y_i - c_m)^2

Decision Tree Regressor
At each split, CART chooses a splitting variable and a splitting point (that is, a feature and the split that divides that feature's space into two) ...
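The passage above introduces the choice of splitting variable and splitting point. A hedged sketch of the exhaustive search that a CART-style regression split performs: for every candidate feature j and threshold s, split the data in two and keep the pair whose leaf means give the smallest total squared error (variable names are mine, not from the slides):

import numpy as np

def best_split(X, y):
    """Search for the (feature, threshold) pair minimizing the total squared
    error of the two resulting leaves (CART regression split criterion)."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        # Candidate thresholds: midpoints between adjacent sorted unique values.
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # Each leaf predicts its mean; the cost is the sum of squared errors.
            cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if cost < best[2]:
                best = (j, s, cost)
    return best

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.2, 4.8])
print(best_split(X, y))   # splits at x = 6.5, separating the two clusters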
