FAFU Machine Learning 06: Decision Trees
Machine Learning: Decision Tree Basics

The top 10 algorithms in data mining: C4.5, k-Means, SVM, Apriori, EM (maximum likelihood), PageRank, AdaBoost, kNN, Naive Bayes, CART.

Main classification methods: logistic regression, linear discriminant analysis, decision tree induction, nearest neighbor, Bayesian classification methods, back-propagation classification, support vector machines, ensemble methods, ...

Illustrating the classification task: a decision tree example; another example of a decision tree; applying the model to test data.

Decision Tree Induction: Algorithms
Many algorithms exist: Hunt's algorithm (one of the earliest), ID3 (Iterative Dichotomiser), C4.5, CART (Classification and Regression Trees), and scalable, parallelizable induction algorithms for supervised learning on large data.

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping the partitioning:
- All samples at a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is used to label the leaf.
- There are no samples left.
(A minimal code sketch of this greedy procedure is given at the end of this section.)

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
Issues:
- Determine how to select the best attribute.
- How to split the records?
- How to determine the best split?
- Determine when to stop splitting.

How to split the records? It depends on the attribute type (nominal, ordinal, continuous) and on the number of ways to split (2-way split vs. multi-way split).

Splitting based on nominal attributes:
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.

Splitting based on ordinal attributes:
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.

Splitting based on continuous attributes:
- Discretization to form an ordinal categorical attribute.
- Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut; this can be more computation-intensive.
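Putting the pieces above together, here is a minimal sketch of greedy top-down induction with the stopping conditions listed earlier. It is illustrative only: the names (build_tree, weighted_child_entropy, majority_label) and the dictionary-based node representation are ours, not from the slides, and the attribute-selection measure is plain entropy-based information gain.

from collections import Counter
from math import log2

def entropy(labels):
    # Info(D) = -sum_i p_i * log2(p_i) over the class distribution of D
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def weighted_child_entropy(rows, labels, attribute):
    # Info_A(D): expected information after splitting D on attribute A
    n = len(labels)
    total = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attribute] == value]
        total += len(subset) / n * entropy(subset)
    return total

def majority_label(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes):
    # rows: list of dicts {attribute: categorical value}; labels: class labels
    if len(set(labels)) == 1:            # all samples belong to one class -> leaf
        return {"label": labels[0]}
    if not attributes:                   # no attributes left -> majority vote
        return {"label": majority_label(labels)}
    # Greedy step: smallest Info_A(D), i.e. largest information gain
    best = min(attributes, key=lambda a: weighted_child_entropy(rows, labels, a))
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    for value in set(r[best] for r in rows):   # multi-way split, one branch per value
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node["children"][value] = build_tree([rows[i] for i in idx],
                                             [labels[i] for i in idx], remaining)
    return node

# Toy usage on made-up data:
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
        {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
labels = ["no", "yes", "no", "yes"]
print(build_tree(rows, labels, ["outlook", "windy"]))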

How to determine the best split? Before splitting, suppose a node holds 10 records of class C0 and 10 records of class C1 (the original slide shows several candidate splits of these 20 records). Greedy approach: nodes with a homogeneous class distribution are preferred.

This requires a measure of node impurity (or, equivalently, purity).

Measures of Node Impurity: entropy, Gini index, ...

Brief review of entropy: for a discrete random variable with m outcomes occurring with probabilities p_1, ..., p_m, the entropy is

H = -\sum_{i=1}^{m} p_i \log_2 p_i

For a two-class problem (m = 2) with class probabilities p and 1 - p, the entropy ranges from 0 (a pure node) to 1 bit, and is maximal at p = 0.5.
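As a quick numeric illustration of the entropy measure above (a minimal sketch; the helper name entropy_from_counts is ours, not from the slides):

from math import log2

def entropy_from_counts(counts):
    # H = -sum p_i * log2(p_i), where p_i = count_i / total
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy_from_counts([10, 10]))  # 1.0  (maximally impure two-class node)
print(entropy_from_counts([20, 0]))   # 0.0  (a pure node)
print(entropy_from_counts([9, 5]))    # about 0.940

The first two calls correspond to the 10/10 node from the "before splitting" example and to a perfectly pure node; the third uses the 9-yes / 5-no distribution that also appears later in the Gini example.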

Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.

Expected information (entropy) needed to classify a tuple in D:

Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i

Information needed (after using A to split D into v partitions) to classify D:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \, Info(D_j)

Information gained by branching on attribute A:

Gain(A) = Info(D) - Info_A(D)

In other words, information gain is the difference between the entropy of the parent node and the weighted average of the child nodes' entropies.

Attribute selection with information gain, on the buys_computer data: class P is buys_computer = "yes" and class N is buys_computer = "no". The branch "age <= 30" covers 5 of the 14 samples, with 2 yes's and 3 no's, so it contributes (5/14) · Info(2, 3) to Info_age(D). The other age branches, and the other attributes, are handled in the same way, and Gain(A) is computed for each candidate attribute.
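A minimal sketch of these formulas in code (the count-based helper names are ours, not from the slides). The parent distribution of 9 "yes" / 5 "no" and the 2/3 split for "age <= 30" come from the text; the counts of the remaining two branches are illustrative placeholders that merely sum to the parent counts:

from math import log2

def info(counts):
    # Info(D): entropy of a class-count vector
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_after_split(branches):
    # Info_A(D): weighted average of the branch entropies
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * info(b) for b in branches)

def gain(parent_counts, branches):
    # Gain(A) = Info(D) - Info_A(D)
    return info(parent_counts) - info_after_split(branches)

parent = [9, 5]                       # 9 "yes" / 5 "no" tuples
age_branches = [[2, 3], [4, 0], [3, 2]]  # first branch is "age <= 30"; others are placeholders
print(round(info(parent), 3))               # 0.940
print(round(gain(parent, age_branches), 3)) # about 0.246 with these placeholder counts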

Computing information gain for continuous-valued attributes. Let attribute A be continuous-valued; we must determine the best split point for A:
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.

Gain Ratio for Attribute Selection (C4.5). The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio to overcome this problem (a normalization of information gain):

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}

GainRatio(A) = Gain(A) / SplitInfo_A(D)

For example, gain_ratio(income) = 0.029 / 1.557 = 0.019. The attribute with the maximum gain ratio is selected as the splitting attribute.

Gini Index (CART, IBM IntelligentMiner). If a data set D contains examples from n classes, the Gini index gini(D) is defined as

gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D. If D is split on A into two subsets D1 and D2, the Gini index of the split is defined as

gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

Reduction in impurity: Δgini(A) = gini(D) - gini_A(D). The attribute that provides the smallest gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (for CART, all possible splitting points need to be enumerated for each attribute).

Computation of the Gini index. Example: D has 9 tuples with buys_computer = "yes" and 5 with "no", so

gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}. The Gini index for the {low, high} partition is 0.458 and for the {medium, high} partition 0.450, so D is split on {low, medium} (versus {high}), since that partition has the lowest Gini index.
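A minimal sketch of the Gini computation above (the function names are ours; the per-subset class counts for D1/D2 are not given in the text, so the second call scores a made-up binary partition just to show the mechanics):

def gini(counts):
    # gini(D) = 1 - sum_j p_j^2 over the class distribution of D
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(branches):
    # gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2)
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * gini(b) for b in branches)

print(round(gini([9, 5]), 3))                 # 0.459, as in the buys_computer example
d1, d2 = [7, 3], [2, 2]                       # hypothetical class counts for a binary partition
print(round(gini_split([d1, d2]), 3))         # about 0.443 for these illustrative counts

Returning to the continuous-valued case described above (midpoints of adjacent sorted values as candidate split points), here is a hedged sketch of the split-point search; info() is re-defined so the snippet is self-contained, and the data are purely illustrative:

from math import log2

def info(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def best_split_point(values, labels):
    # Try the midpoint of every pair of adjacent sorted values and keep the one
    # with the minimum expected information requirement Info_A(D).
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = None
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue
        mid = (a + b) / 2
        left = [y for v, y in pairs if v <= mid]
        right = [y for v, y in pairs if v > mid]
        expected = (len(left) * info([left.count(c) for c in classes]) +
                    len(right) * info([right.count(c) for c in classes])) / len(pairs)
        if best is None or expected < best[1]:
            best = (mid, expected)
    return best

values = [25, 32, 40, 47, 51, 60]
labels = ["no", "no", "yes", "yes", "yes", "no"]
print(best_split_point(values, labels))  # (36.0, ...) for this toy data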

Comparing Attribute Selection Measures. The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions.

Overfitting and Tree Pruning. Overfitting: an induced tree may overfit the training data. There may be too many branches, some of which reflect anomalies due to noise or outliers, leading to poor accuracy on unseen samples. Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees, and use a set of data different from the training data to decide which is the "best pruned tree".

Enhancements to Basic Decision Tree Induction:
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented. This reduces fragmentation, repetition, and replication.

Classification in Large Databases. Classification is a classical problem extensively studied by statisticians and machine learning researchers. Scalability means classifying data sets with millions of examples and hundreds of attributes at reasonable speed. Why is decision tree induction popular?
- Relatively fast learning speed (compared with other classification methods).
- Convertible to simple and easy-to-understand classification rules.
- Can use SQL queries for accessing databases.
- Comparable classification accuracy with other methods.

Decision Tree classifier in sklearn:

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
/pinard/p/6056319.html
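Several of the constructor parameters above act as pre-pruning controls (max_depth, min_samples_split, min_samples_leaf, min_impurity_decrease); newer scikit-learn releases also expose ccp_alpha for cost-complexity post-pruning. A small sketch comparing an unconstrained tree with a pre-pruned one (the exact depths and scores depend on your scikit-learn version and the random split):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), random_state=0)

# Unconstrained tree (grows until leaves are pure) vs. a pre-pruned one
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                random_state=0).fit(X_train, y_train)

print(full.get_depth(), full.score(X_test, y_test))
print(pruned.get_depth(), pruned.score(X_test, y_test))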

Example: the Iris data set (the iris genus). The Iris data set is a classic data set from the 1930s; it is one of the first modern examples of statistical classification. The setting is that of iris flowers, of which there are multiple species that can be identified by their morphology. Today the species would be defined by their genomic signatures, but in the 1930s DNA had not even been identified as the carrier of genetic information. The data set contains 150 samples in 3 classes of 50 samples each, and each sample has 4 attributes; it is a very commonly used training and test set in data mining and classification.

The three classes are: setosa, versicolor, virginica. The following four attributes of each plant were measured: sepal length, sepal width, petal length, petal width.

The first step is visualization, e.g. plotting the first two features for each class:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np

# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']

for t, marker, c in zip(range(3), ">ox", "rgb"):
    # We plot each class on its own to get differently colored markers
    plt.scatter(features[target == t, 0],
                features[target == t, 1],
                marker=marker, c=c)
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.show()

sklearn.tree.DecisionTreeClassifier, minimal usage:

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])
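Once fitted, the learned tree can be inspected directly. A small sketch, reusing the clf fitted just above and assuming scikit-learn 0.21 or newer (where sklearn.tree.export_text and sklearn.tree.plot_tree are available); the feature and class names are made up for this toy example:

from sklearn.tree import export_text, plot_tree
import matplotlib.pyplot as plt

# Text rendering of the learned rules (one line per node/leaf)
print(export_text(clf, feature_names=["x0", "x1"]))

# Graphical rendering of the same tree
plot_tree(clf, feature_names=["x0", "x1"], class_names=["0", "1"], filled=True)
plt.show()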

sklearn.tree.DecisionTreeClassifier with cross-validation on the Iris data:

>>> from sklearn.datasets import load_iris
>>> from sklearn.cross_validation import cross_val_score  # removed in newer scikit-learn; use sklearn.model_selection instead
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
...
array([1., 0.93..., 0.86..., 0.93..., 0.93..., 0.93..., 0.93..., 1., 0.93..., 1.])
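Cross-validation like this is also the usual way to choose the pruning parameters discussed earlier. A minimal sketch using GridSearchCV (the parameter grid values are illustrative, not from the notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4, None], "min_samples_leaf": [1, 3, 5]},
    cv=10)
grid.fit(iris.data, iris.target)
print(grid.best_params_, grid.best_score_)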

Example: the Internet Advertisements Data Set (/ml/datasets/Internet+Advertisements).
Data Set Characteristics: Multivariate
Number of Instances: 3279
Area: Computer
Attribute Characteristics: Categorical, Integer, Real
Number of Attributes: 1558
Date Donated: 1998-07-01
Associated Tasks: Classification
Missing Values? Yes
Number of Web Hits: 307075

Decision Tree regressor. A regression tree, as the name suggests, uses a tree model for regression: each leaf outputs a prediction value. The prediction is usually the mean of the target values of the training examples that fall in that leaf, i.e. c_m = ave(y_i | x_i ∈ leaf_m).

How CART behaves the same, and differently, in classification and regression:
- Same: in both cases CART is a binary tree; every non-leaf node has exactly two children, and all inputs that fall into the same leaf receive the same output.
- Different: in classification, CART uses the Gini index as the criterion for choosing the feature and the split; in regression, CART uses MSE (mean squared error) or MAE (mean absolute error) as the criterion. In classification, each leaf represents a class; in regression, each leaf represents a prediction value, which is continuous.

Decision Tree regressor: formulation. Given a data set D = {(x_1, y_1), (x_2, y_2), ..., (x_i, y_i), ..., (x_n, y_n)}, where each x_i is an m-dimensional vector (i.e. x_i has m features), the goal of regression is to construct a function f(x) that fits the elements of D so that the mean squared error is minimized:

\min \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2

Suppose a fully built CART regression tree has M leaves. This means CART partitions the input space into M units R_1, R_2, ..., R_M, and that the tree produces at most M distinct prediction values. The model whose MSE CART minimizes can be written as

f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m),  with  c_m = ave(y_i \mid x_i \in R_m)

At each split, CART chooses a splitting variable and a splitting point (i.e. a feature and the value that cuts that feature's space in two) so that the MSE on the training set, namely the sum of the MSEs of the resulting leaves, is minimized. A heuristic search is used: enumerate all candidate splitting variables and splitting points, and pick the combination with the smallest total leaf MSE. Choosing the j-th feature x^{(j)} and a value s as the splitting variable and splitting point partitions the parent node's input space into

R_1(j, s) = \{ x \mid x^{(j)} \le s \}  and  R_2(j, s) = \{ x \mid x^{(j)} > s \}

and CART selects j and s according to

\min_{j, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j, s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j, s)} (y_i - c_2)^2 \right]
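A minimal sketch of a regression tree in scikit-learn (the noisy sine-curve data is purely illustrative; the default splitting criterion is MSE, named 'squared_error' in recent versions and 'mse' in older ones):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative 1-D regression data: y = sin(x) plus a little noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# Each leaf predicts the mean target value of the training samples it contains
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X, y)

X_test = np.arange(0.0, 5.0, 0.5).reshape(-1, 1)
print(reg.predict(X_test))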
