Foundations of Machine Learning

Decision Tree

Top 10 algorithms in data mining: C4.5, K-Means, SVM, Apriori, EM (Maximum Likelihood), PageRank, AdaBoost, KNN, Naïve Bayes, CART.

Main classification methods: Logistic Regression, Linear Discriminant Analysis, Decision Tree Induction, Nearest Neighbor, Bayes Classification Methods, Classification by Backpropagation, Support Vector Machines, Ensemble Methods, …
Illustrating the Classification Task

Example of a Decision Tree

Another Example of a Decision Tree

Decision Tree Classification Task

Apply Model to Test Data
Algorithm for Decision Tree Induction

Many algorithms:
- Hunt's Algorithm (one of the earliest)
- ID3 (Iterative Dichotomiser)
- C4.5
- CART (Classification and Regression Tree)
- SLIQ (Supervised Learning In Quest)
- SPRINT (Scalable PaRallelizable INduction of decision Trees)
- …

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (continuous-valued attributes are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping the partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is then used to label the leaf.
- There are no samples left.

Greedy strategy: split the records based on the attribute test that optimizes a chosen criterion, i.e., split on the best splitting attribute at each step. The issues are: how to select the best attribute, how to split the records, how to determine the best split, and when to stop splitting.

How to split the records depends on the attribute type (nominal, ordinal, or continuous) and on the number of ways to split (2-way split or multi-way split).

Splitting based on nominal attributes:
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; the optimal partitioning must be searched for.

Splitting based on ordinal attributes:
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two order-preserving subsets; the optimal partitioning must be searched for.

Splitting based on continuous attributes:
- Discretization to form an ordinal categorical attribute.
- Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. This can be more computation-intensive. A sketch of how candidate splits can be enumerated follows.
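As an added illustration (not part of the original slides), the candidate splits for the nominal and the continuous case can be enumerated as follows; the function names are hypothetical:

import itertools

def candidate_nominal_splits(values):
    """Enumerate binary splits of a nominal attribute's distinct values.

    Each split is a (subset, complement) pair; there are 2^(k-1) - 1
    non-trivial splits for k distinct values.
    """
    values = sorted(set(values))
    for r in range(1, len(values)):
        for subset in itertools.combinations(values, r):
            left = set(subset)
            right = set(values) - left
            # Skip mirror duplicates: keep only splits whose left side
            # contains the smallest value.
            if values[0] in left:
                yield left, right

def candidate_continuous_splits(values):
    """Candidate thresholds v: midpoints between adjacent sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(list(candidate_nominal_splits(["low", "medium", "high"])))
print(candidate_continuous_splits([60, 70, 75, 85, 90]))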
How to Determine the Best Split

Before splitting: 10 records of class C0 and 10 records of class C1.

Greedy approach: nodes with a homogeneous class distribution are preferred.
We therefore need a measure of node impurity (purity).

Measures of Node Impurity
- Entropy
- Gini index
- …

Brief Review of Entropy

For a discrete variable taking m values with probabilities p_1, ..., p_m, the entropy is H = -\sum_{i=1}^{m} p_i \log_2 p_i. For m = 2 it reduces to H = -p \log_2 p - (1-p) \log_2 (1-p), which is 0 for a pure distribution (p = 0 or 1) and maximal (1 bit) at p = 0.5. A small helper for computing it follows.
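As a quick added illustration (a minimal sketch; `entropy_from_counts` is a hypothetical name), the entropy of a node can be computed directly from its class counts:

import math

def entropy_from_counts(counts):
    """Entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

print(entropy_from_counts([10, 10]))  # the 10/10 node above: maximally impure, 1.0 bit
print(entropy_from_counts([20, 0]))   # a pure node: 0.0 bits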
Attribute Selection Measure: Information Gain (ID3/C4.5)

Select the attribute with the highest information gain. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.

Expected information (entropy) needed to classify a tuple in D:

    Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)

Information gain is thus the difference between the entropy of the parent node and the weighted average of the child nodes' entropies. A sketch of this computation, followed by a worked example, is given below.
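To make the definitions concrete, here is a minimal added sketch (`info_gain` and `label_entropy` are hypothetical helper names; the class counts per age group are assumed from the standard 14-tuple buys_computer textbook data):

import math
from collections import Counter

def label_entropy(labels):
    """Info(D) in bits for a sequence of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute A."""
    total = len(labels)
    info_a = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        info_a += len(subset) / total * label_entropy(subset)
    return label_entropy(labels) - info_a

# age<=30: 2 yes / 3 no; 31...40: 4 yes; >40: 3 yes / 2 no.
age = ["<=30"] * 5 + ["31...40"] * 4 + [">40"] * 5
buys = (["yes", "yes", "no", "no", "no"]
        + ["yes"] * 4
        + ["yes", "yes", "yes", "no", "no"])
print(round(info_gain(age, buys), 3))  # ~0.247 bits (0.246 after truncation)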
Attribute Selection: Information Gain (worked example)

Class P: buys_computer = "yes"; class N: buys_computer = "no". In the 14-tuple buys_computer data, Info(D) = I(9,5) = 0.940. "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's; hence that branch contributes \frac{5}{14} I(2,3) to Info_{age}(D). Computing all three age branches gives Info_{age}(D) = \frac{5}{14}I(2,3) + \frac{4}{14}I(4,0) + \frac{5}{14}I(3,2) = 0.694, so Gain(age) = 0.940 - 0.694 = 0.246. The gains of the other attributes are computed similarly.

Computing Information Gain for Continuous-Valued Attributes

Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.

Gain Ratio for Attribute Selection (C4.5)

The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (a normalization of information gain):

    GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex.: gain_ratio(income) = 0.029/1.557 = 0.019. The attribute with the maximum gain ratio is selected as the splitting attribute.

Gini Index (CART, IBM IntelligentMiner)

If a data set D contains examples from n classes, the Gini index gini(D) is defined as

    gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D. If D is split on A into two subsets D1 and D2, the Gini index of the split is defined as

    gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

Reduction in impurity:

    \Delta gini(A) = gini(D) - gini_A(D)

The attribute providing the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (for CART, all possible splitting points of every attribute need to be enumerated).

Computation of Gini Index

Ex.: D has 9 tuples with buys_computer = "yes" and 5 with "no", so gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459. Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}; then gini_{income ∈ {low,medium}}(D) = \frac{10}{14} gini(D_1) + \frac{4}{14} gini(D_2) = 0.443. Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low, medium} (and {high}), since it has the lowest Gini index; the sketch below verifies this.

Comparing Attribute Selection Measures

The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions.
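The Gini computation above can be checked with a short added sketch; the per-income class counts (yes, no) are assumed from the standard buys_computer data:

def gini(counts):
    """Gini index from class counts: 1 - sum(p_j^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(groups):
    """Weighted Gini index of a binary split over its two sides."""
    total = sum(sum(c) for c in groups)
    return sum(sum(c) / total * gini(c) for c in groups)

# (yes, no) counts per income value: low 3/1, medium 4/2, high 2/2.
per_value = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}

def merged(keys):
    return (sum(per_value[k][0] for k in keys),
            sum(per_value[k][1] for k in keys))

print(round(gini([9, 5]), 3))  # gini(D) = 0.459
for left in (("low", "medium"), ("low", "high"), ("medium", "high")):
    right = tuple(k for k in per_value if k not in left)
    print(left, round(gini_split([merged(left), merged(right)]), 3))
    # prints 0.443, 0.458, 0.450 -- {low, medium} wins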
Overfitting and Tree Pruning

Overfitting: an induced tree may overfit the training data; too many branches, some of which may reflect anomalies due to noise or outliers, yield poor accuracy on unseen samples. Two approaches avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees. Use a set of data different from the training data to decide which is the "best pruned tree".

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.

Classification in Large Databases

Classification is a classical problem extensively studied by statisticians and machine learning researchers. Scalability means classifying data sets with millions of examples and hundreds of attributes at reasonable speed. Why is decision tree induction popular?
- Relatively fast learning speed (compared with other classification methods).
- Convertible to simple, easy-to-understand classification rules.
- Can use SQL queries for accessing databases.
- Classification accuracy comparable with other methods.

Decision Tree classifier in sklearn

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
https://www.cnblogs.com/pinard/p/6056319.html

Several of these parameters control pruning directly; a sketch follows.
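As an added sketch of how these ideas map onto the sklearn API: prepruning corresponds to growth-limiting constructor parameters (max_depth, min_samples_leaf, min_impurity_decrease), while postpruning is available in scikit-learn 0.22+ as minimal cost-complexity pruning via ccp_alpha:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early via depth / leaf-size thresholds.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)
print("prepruned accuracy:", pre.score(X_test, y_test))

# Postpruning: grow fully, then prune with cost-complexity parameter alpha.
# cost_complexity_pruning_path yields the sequence of effective alphas,
# i.e., the progressively pruned trees mentioned above.
full = DecisionTreeClassifier(random_state=0)
path = full.cost_complexity_pruning_path(X_train, y_train)
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    print(f"alpha={alpha:.4f} test accuracy={pruned.score(X_test, y_test):.3f}")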
Example: The Iris Dataset

The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today the species would be defined by their genomic signatures, but in the 1930s DNA had not even been identified as the carrier of genetic information. The dataset contains 150 samples in 3 classes, 50 per class, each described by 4 attributes; it is a very commonly used training and test set in data mining and classification.

The three classes are setosa, versicolor, and virginica. The following four attributes of each plant were measured:
- Sepal length
- Sepal width
- Petal length
- Petal width

The first step is visualization:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris

# We load the data with load_iris from sklearn.
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']

# We plot each class on its own to get differently colored markers.
for t, marker, c in zip(range(3), ">ox", "rgb"):
    plt.scatter(features[target == t, 0], features[target == t, 1],
                marker=marker, c=c)
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.show()
sklearn.tree.DecisionTreeClassifier

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])
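One reason decision trees are popular is that the fitted model converts to readable rules; as an added note, scikit-learn can print the tree learned above as nested if/else rules (the feature names here are hypothetical placeholders):

>>> from sklearn.tree import export_text
>>> print(export_text(clf, feature_names=['f0', 'f1']))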
sklearn.tree.DecisionTreeClassifier

(Note: in current scikit-learn, cross_val_score lives in sklearn.model_selection; the old sklearn.cross_validation module has been removed.)

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
...
array([1., 0.93..., 0.86..., 0.93..., 0.93..., 0.93..., 0.93..., 1., 0.93..., 1.])
Example: Internet Advertisements Data Set

https://archive.ics.uci.edu/ml/datasets/Internet+Advertisements

Data Set Characteristics:
- Multivariate
- Number of Instances: 3279
- Area: Computer
- Attribute Characteristics: Categorical, Integer, Real
- Number of Attributes: 1558
- Date Donated: 1998-07-01
- Associated Tasks: Classification
- Missing Values: Yes
- Number of Web Hits: 307075

Decision Tree Regressor

A regression tree is, as the name suggests, a tree model used for regression: every leaf outputs a predicted value. The prediction is usually the mean of the training targets that fall into that leaf, i.e., c_m = ave(y_i | x_i ∈ leaf_m).

CART in classification vs. regression problems:
- Common: in both cases CART is a binary tree; every non-leaf node has exactly two children, and all inputs that fall into the same leaf receive the same output.
- Different: in classification, CART uses the Gini index as the criterion for choosing the feature and the split; in regression, it uses MSE (mean squared error) or MAE (mean absolute error) as the criterion. In classification every leaf represents a class; in regression every leaf represents a predicted value, which is continuous.

Given a data set D = {(x_1, y_1), (x_2, y_2), ..., (x_i, y_i), ..., (x_n, y_n)}, where each x_i is an m-dimensional vector (i.e., x_i has m features), the goal of regression is to construct a function f(x) that fits the elements of D so that the MSE is minimized:

    \min \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2

Suppose a fitted CART regression tree has M leaves. This means CART partitions the input space x into M units R_1, R_2, ..., R_M, and hence outputs at most M distinct predicted values. CART minimizes the MSE as follows:

    \min \sum_{m=1}^{M} \sum_{x_i \in R_m} (y_i - c_m)^2, \quad c_m = ave(y_i \mid x_i \in R_m)

At each split, the splitting variable j and the splitting point s (that is, the feature and the split that cuts the feature space in two) are chosen by solving

    \min_{j,s} \Big[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \Big]

A sketch of a regression tree fit with sklearn follows.
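A minimal added sketch of a CART regression tree in scikit-learn (the noisy-sine data is illustrative): each leaf predicts the mean target of its region, exactly the c_m above.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)        # one feature, 80 samples
y = np.sin(X).ravel() + 0.1 * rng.randn(80)     # noisy sine targets

# criterion='squared_error' is the MSE criterion described above
# (older scikit-learn versions call it 'mse').
reg = DecisionTreeRegressor(criterion='squared_error', max_depth=3)
reg.fit(X, y)

X_test = np.arange(0.0, 5.0, 0.5).reshape(-1, 1)
print(reg.predict(X_test))  # piecewise-constant predictions: leaf means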