
Classification and Prediction
School of Software, Xiamen University

Outline
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification (self-study)
- Classification by backpropagation
- Support Vector Machines (SVM) (self-study)
- Associative classification (self-study)
- Lazy learners (or learning from your neighbors) (self-study)
- Other classification methods (self-study)
- Prediction
- Accuracy and error measures (self-study)
- Ensemble methods (self-study)
- Model selection (self-study)
- Summary

Classification vs. Prediction
- Classification: predicts categorical class labels (discrete or nominal). It builds a model from the training data and the values of the class-label attribute, then uses the model to classify new data.
- Prediction: models continuous-valued functions, e.g., to predict unknown or missing values.
- Typical applications: credit approval, target marketing, medical diagnosis.

Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes.
  - Each tuple/sample is assumed to belong to a predefined class, as determined by an attribute called the class label attribute.
  - The set of tuples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
- Model usage: classifying future or unknown objects.
  - Estimate the accuracy of the model: the known label of each test sample is compared with the model's classification result; accuracy is the percentage of test samples correctly classified by the model.
  - The test set must be independent of the training set; otherwise the learned model tends to over-fit the data.
  - If the accuracy is acceptable, use the model to classify data whose class labels are unknown.

Classification Process (1): Model Construction
Training data plus a classification algorithm yield a classifier (the model), e.g.:
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
The classifier is applied to testing data to estimate accuracy, and then to unseen data, e.g., (Jeff, Professor, 4) -> Tenured?

Decision Tree Algorithm
Basic algorithm (greedy): the tree is constructed in a top-down, recursive, divide-and-conquer manner (see the Python sketch after the example tree below).
- The tree starts as a single node representing all training samples.
- If the samples are all of the same class, the node becomes a leaf labeled with that class.
- Otherwise, select the attribute that best separates the samples into classes (the test attribute; it must be discrete-valued).
- Create one branch for each known value of the test attribute and partition the samples accordingly.
- Recursively build a subtree over each partition.

Conditions for stopping the recursive partitioning (any one of the following):
- All samples at the node belong to the same class.
- No attributes remain for further partitioning; majority voting is employed for classifying the leaf.
- No samples are left.

Example decision tree (buys_computer):
age?
- <=30 -> student? (no -> no; yes -> yes)
- 31..40 -> yes
- >40 -> credit rating? (fair -> yes; excellent -> no)
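As a rough illustration of the greedy procedure, here is a minimal Python sketch of top-down induction; it is not the course's reference implementation. The selection measure is abstracted as a best_attribute callback (information gain, gain ratio, or the Gini index could be plugged in), and tuples are assumed to be dicts with a "class" key.

```python
from collections import Counter

def majority_class(rows):
    # Majority voting: label a leaf with the most common class among rows.
    return Counter(r["class"] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, best_attribute):
    classes = {r["class"] for r in rows}
    if len(classes) == 1:                 # all samples in one class -> leaf
        return classes.pop()
    if not attributes:                    # no attributes left -> majority vote
        return majority_class(rows)
    a = best_attribute(rows, attributes)  # attribute that best separates rows
    rest = [x for x in attributes if x != a]
    branches = {}
    for v in {r[a] for r in rows}:        # one branch per known value of a
        partition = [r for r in rows if r[a] == v]
        branches[v] = build_tree(partition, rest, best_attribute)
    return (a, branches)                  # internal node: (test attribute, branches)
```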

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|.
- Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -Σ_{i=1..m} p_i log2(p_i)
- Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
- Information gained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)

Attribute Selection: Information Gain (example)
- Class P: buys_computer = "yes" (9 tuples); class N: buys_computer = "no" (5 tuples):
  Info(D) = I(9, 5) = 0.940
- "age <= 30" has 5 out of the 14 samples, with 2 yes'es and 3 no's; hence
  Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  Gain(age) = Info(D) - Info_age(D) = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is chosen as the splitting attribute.
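A minimal sketch reproducing the arithmetic above; the class counts (9/5 overall; 2/3, 4/0, 3/2 per age branch) are exactly those listed on the slide.

```python
from math import log2

def info(counts):
    """Expected information (entropy): Info(D) = -sum p_i * log2(p_i)."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

# Class distribution of D: 9 "yes" vs 5 "no" -> Info(D) = I(9, 5)
info_d = info([9, 5])

# Partitioning D on age, as (yes, no) counts per branch: <=30, 31..40, >40
partitions = [(2, 3), (4, 0), (3, 2)]
n = 14
info_age = sum((y + m) / n * info([y, m]) for y, m in partitions)

print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
# 0.94 0.694 0.247  (the slide reports Gain(age) = 0.246, rounding intermediates)
```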

Computing Information Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute. The best split point for A must be determined:
  - Sort the values of A in increasing order.
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
  - The point with the minimum expected information requirement for A is selected as the split point for A (a small enumeration sketch follows the next slide).
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.

Gain Ratio for Attribute Selection (C4.5)
- The information gain measure is biased toward tests with many outcomes.
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
  SplitInfo_A(D) = -Σ_{j=1..v} (|D_j| / |D|) log2(|D_j| / |D|)
  GainRatio(A) = Gain(A) / SplitInfo_A(D)
- Ex.: gain_ratio(income) = 0.029 / 0.926 = 0.031
- The attribute with the maximum gain ratio is selected as the splitting attribute.
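A minimal sketch of the midpoint enumeration described above for a continuous-valued attribute, assuming a small toy sample; it returns the split minimizing the expected information Info_A(D).

```python
from math import log2

def info(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * log2(labels.count(c) / total)
                for c in set(labels))

def best_split_point(values, labels):
    """Enumerate midpoints (a_i + a_{i+1}) / 2 of sorted adjacent values and
    return (split, expected_info) for the split minimizing Info_A(D)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                      # identical values yield no midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [c for v, c in pairs if v <= split]
        right = [c for v, c in pairs if v > split]
        expected = len(left) / n * info(left) + len(right) / n * info(right)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best

# Illustrative ages and class labels, not the slide's table.
print(best_split_point([25, 32, 40, 47, 51], ["no", "yes", "yes", "yes", "no"]))
```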

Gini Index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 - Σ_{j=1..n} p_j^2
  where p_j is the relative frequency of class j in D.
- If data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as
  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
- Reduction in impurity: Δgini(A) = gini(D) - gini_A(D)
- The attribute providing the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (all possible splitting points must be enumerated for each attribute).

Gini Index: Example
- D has 9 tuples in buys_computer = "yes" and 5 in "no":
  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
  gini_{income in {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2)
  but gini_{income in {medium,high}}(D) is 0.30 and thus the best, since it is the lowest.
- All attributes are assumed continuous-valued here; other tools, e.g., clustering, may be needed to get the possible split values. The measure can be modified for categorical attributes.

Comparing Attribute Selection Measures
The three measures, in general, return good results, but each is biased:
- Information gain: biased toward multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.
- Gini index: biased toward multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions.
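A small sketch of the Gini computations above; the 9/5 class distribution is from the slide, while the per-partition class counts in the split example (7/3 and 2/2) are illustrative assumptions, not slide data.

```python
def gini(counts):
    """Gini index: gini(D) = 1 - sum_j p_j^2 over the class distribution."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted gini of a split, given per-partition (class) count tuples."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# D: 9 "yes", 5 "no"  ->  gini(D) = 1 - (9/14)^2 - (5/14)^2 ~= 0.459
print(round(gini([9, 5]), 3))

# income split: D1 = {low, medium} (10 tuples), D2 = {high} (4 tuples);
# assumed class counts 7/3 and 2/2 for illustration only.
print(round(gini_split([(7, 3), (2, 2)]), 3))
```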

Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data.
  - Too many branches, some of which may reflect anomalies due to noise or outliers.
  - Poor accuracy on unseen samples.
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; stop partitioning a node if the split would fall below a predefined threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove subtrees from a "fully grown" tree, replacing each removed subtree with a leaf. Use a data set different from the training data to decide which is the "best pruned tree".

Enhancements to Basic Decision Tree Induction
- Allow continuous-valued attributes: dynamically define new discrete-valued attributes by partitioning the continuous values into intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign the most probable value.
- Attribute construction: create new attributes based on the given ones, changing the original representation of the attributes (a form of data transformation). This helps address the fragmentation, repetition, and replication problems that decision tree induction may face.

Classification in Large Databases
- Classification is an extension of a classical problem extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of samples and hundreds of attributes at acceptable speed.
- Why use decision tree induction in data mining?
  - Relatively fast learning speed compared with other classification methods.
  - Simple, easy-to-understand classification rules.
  - Database access can use SQL queries.
  - Comparable classification accuracy.

Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure.
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier.
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).

Bayesian Classification
- Bayesian classification is a statistical classification method, based on Bayes' theorem.
- The naïve Bayesian classifier assumes that the effect of an attribute value on a given class is independent of the values of the other attributes (class-conditional independence).
- Bayesian belief networks can represent dependencies among subsets of attributes and can also be used for classification.

Bayes' Theorem: Basics
- X: a data sample (data tuple) whose class label is unknown, e.g., a 35-year-old customer with an income of $4000.
- H: a hypothesis that the data sample X belongs to a specified class C, e.g., the hypothesis that the customer will buy a computer.
- P(H|X): the posterior probability of H conditioned on X, e.g., the probability that customer X will buy a computer given the customer's age and income.
- P(H): the prior probability of H, i.e., the initial probability before any samples are observed; it reflects background knowledge, e.g., the probability that an arbitrary customer will buy a computer.
- P(X): the probability that the sample data is observed, e.g., the probability that a customer is 35 years old with an income of $4000.
- P(X|H): the posterior probability of X conditioned on H, e.g., the probability that a customer is 35 years old with an income of $4000, given that the customer buys a computer.
- Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)

Towards the Naïve Bayesian Classifier
- Let D be a training set of tuples and their associated class labels, with each tuple represented by an n-dimensional attribute vector X = (x1, x2, ..., xn).
- Suppose there are m classes C1, C2, ..., Cm.
- Classification is to derive the maximum a posteriori, i.e., the maximal P(Ci|X).
- This can be derived from Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X).
- Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.

Derivation of the Naïve Bayes Classifier
- A simplifying assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
  P(X|Ci) = Π_{k=1..n} P(x_k|Ci)
- This greatly reduces the computation cost: only the class distribution has to be counted.
- If A_k is categorical, P(x_k|Ci) is the number of tuples in Ci having value x_k for A_k, divided by |Ci,D| (the number of tuples of Ci in D).
- If A_k is continuous-valued, P(x_k|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) σ)) e^(-(x - μ)^2 / (2σ^2)),  with  P(x_k|Ci) = g(x_k, μ_Ci, σ_Ci)

Naïve Bayesian Classifier: Training Dataset
- Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'.
- (The 14-tuple training table shown on this slide is omitted here.)
- Data sample to classify: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naïve Bayesian Classifier: An Example
- P(Ci): P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357
- Compute P(X|Ci) for each class (writing "yes"/"no" for buys_computer = "yes"/"no"):
  P(age = "<=30" | yes) = 2/9 = 0.222;  P(age = "<=30" | no) = 3/5 = 0.6
  P(income = "medium" | yes) = 4/9 = 0.444;  P(income = "medium" | no) = 2/5 = 0.4
  P(student = "yes" | yes) = 6/9 = 0.667;  P(student = "yes" | no) = 1/5 = 0.2
  P(credit_rating = "fair" | yes) = 6/9 = 0.667;  P(credit_rating = "fair" | no) = 2/5 = 0.4
- For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X | yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | no) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
- P(X|Ci) × P(Ci):
  P(X | yes) × P(yes) = 0.028
  P(X | no) × P(no) = 0.007
- Therefore, X belongs to the class buys_computer = "yes".
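A compact sketch that reproduces the arithmetic above from raw counts; the count table encodes only the statistics listed on the slide, and the Laplacian correction (next slide) is deliberately omitted.

```python
# Per-class counts from the slide: 9 "yes" and 5 "no" tuples overall,
# plus per-attribute-value counts within each class.
counts = {
    "yes": {"n": 9, "age<=30": 2, "income=medium": 4, "student=yes": 6, "credit=fair": 6},
    "no":  {"n": 5, "age<=30": 3, "income=medium": 2, "student=yes": 1, "credit=fair": 2},
}
N = 14
x = ["age<=30", "income=medium", "student=yes", "credit=fair"]

scores = {}
for c, t in counts.items():
    p = t["n"] / N                  # prior P(Ci)
    for attr in x:
        p *= t[attr] / t["n"]       # class-conditional P(xk|Ci), independence assumed
    scores[c] = p

print(scores)                        # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))   # 'yes' -> buys_computer = "yes"
```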

Avoiding the 0-Probability Problem
- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability of the whole product P(X|Ci) = Π_{k=1..n} P(x_k|Ci) will be zero.
- Ex.: Suppose a data set with 1000 tuples has income = low (0 tuples), income = medium (990), and income = high (10).
- Use the Laplacian correction (Laplacian estimator): add 1 to each case:
  Prob(income = low) = 1/1003, Prob(income = medium) = 991/1003, Prob(income = high) = 11/1003
- The "corrected" probability estimates are close to their "uncorrected" counterparts.

Naïve Bayesian Classifier: Comments
- Advantages: easy to implement; good results obtained in most of the cases.
- Disadvantages: the class-conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables.
- How to deal with these dependencies? Bayesian belief networks.

Bayesian Networks
- A Bayesian belief network allows class-conditional independencies to be defined between subsets of variables.
- It is a graphical model of causal relationships: it represents dependencies among variables and specifies the joint probability distribution.
- Nodes: random variables (discrete or continuous); arcs: probabilistic dependence. The graph is directed and acyclic.
- Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P.

Bayesian Belief Networks: An Example
- Network structure: FamilyHistory and Smoker are parents of LungCancer and Emphysema; LungCancer is a parent of PositiveXRay and Dyspnea.
- For each variable, the belief network has a conditional probability table (CPT), P(Y | Parents(Y)). The CPT for LungCancer has one column per parent combination (FH,S), (FH,~S), (~FH,S), (~FH,~S); for example, P(LC | ~FH, ~S) = 0.1 and P(~LC | ~FH, ~S) = 0.9.

Training Bayesian Belief Networks
Several scenarios are possible:
- Network structure known and all variables observable: only the CPTs need to be computed.
- Network structure known but some variables hidden in the samples: train the belief network by gradient descent, analogous to neural network learning.
- Network structure unknown, all variables observable: search the model space to reconstruct the graph topology.
- Network structure unknown, all variables hidden: no good algorithms are known yet.

Using IF-THEN Rules for Classification
- Knowledge is represented in the form of IF-THEN rules:
  R: IF age = youth AND student = yes THEN buys_computer = yes
  The rule antecedent (precondition) vs. the rule consequent (conclusion).
- Coverage and accuracy of a rule:
  n_covers = number of tuples covered by R (tuples satisfying the antecedent)
  n_correct = number of tuples correctly classified by R
  coverage(R) = n_covers / |D|      /* D: training data set */
  accuracy(R) = n_correct / n_covers
- If a tuple triggers more than one rule, conflict resolution is needed:
  - Size ordering: assign the highest priority to the triggering rule with the "toughest" requirement (i.e., with the most attribute tests).
  - Class-based ordering: classes are sorted in decreasing order of importance.
  - Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts.
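A small sketch of the coverage and accuracy measures just defined, assuming tuples are dicts and a rule is a pair (antecedent predicate, predicted class); the rule and data below are illustrative.

```python
def coverage_and_accuracy(rule, D):
    """coverage(R) = n_covers / |D|;  accuracy(R) = n_correct / n_covers."""
    antecedent, consequent = rule
    covered = [t for t in D if antecedent(t)]      # tuples satisfying the antecedent
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for t in covered if t["class"] == consequent)
    return len(covered) / len(D), correct / len(covered)

# Illustrative training data, not the slide's table.
D = [
    {"age": "youth", "student": "yes", "class": "yes"},
    {"age": "youth", "student": "yes", "class": "no"},
    {"age": "senior", "student": "no", "class": "no"},
]
R = (lambda t: t["age"] == "youth" and t["student"] == "yes", "yes")
print(coverage_and_accuracy(R, D))   # (0.666..., 0.5)
```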
Extracting Rules from a Decision Tree
- Knowledge is represented as IF-THEN rules: one rule is created for each path from the root to a leaf.
- Each attribute-value pair along a given path forms a conjunct of the rule antecedent (the "IF" part); the leaf node holds the class prediction, which forms the rule consequent (the "THEN" part).
- IF-THEN rules are easy for humans to understand.
- Example (matching the buys_computer tree above):
  IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31...40" THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

Rule Extraction from the Training Data
- Sequential covering algorithms extract IF-THEN rules directly from the training data, without having to generate a decision tree first.
- Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER.
- Rules are learned sequentially; ideally, when learning a rule for class C, the rule should cover all (or many) of the training tuples of class C and none (or few) of the tuples of other classes.
- Steps:
  - Learn one rule at a time.
  - Each time a rule is learned, remove the tuples covered by that rule.
  - Repeat the process on the remaining tuples until a termination condition is met (no training samples left, or the quality of a learned rule falls below a preset threshold).

How to Learn One Rule?
- Start with the most general rule possible: condition = empty.
- Add new conjuncts by adopting a greedy depth-first strategy: pick the conjunct that most improves the rule quality.
- Rule-quality measures consider both coverage and accuracy. Foil-gain (in FOIL and RIPPER) assesses the information gained by extending the condition:
  FOIL_Gain = pos' × (log2(pos' / (pos' + neg')) - log2(pos / (pos + neg)))
  It favors rules that have high accuracy and cover many positive tuples.
- Rule pruning is based on an independent set of test tuples:
  FOIL_Prune(R) = (pos - neg) / (pos + neg)
  where pos/neg are the numbers of positive/negative tuples covered by R. If FOIL_Prune is higher for the pruned version of R, prune R.
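A minimal sketch of the FOIL_Gain formula above; the counts passed in are illustrative.

```python
from math import log2

def foil_gain(pos, neg, pos2, neg2):
    """FOIL_Gain = pos' * (log2(pos'/(pos'+neg')) - log2(pos/(pos+neg))).

    (pos, neg): positive/negative tuples covered before extending the rule;
    (pos2, neg2): the same counts after adding the candidate conjunct."""
    return pos2 * (log2(pos2 / (pos2 + neg2)) - log2(pos / (pos + neg)))

# Illustrative: the new conjunct keeps 6 of 8 positives and excludes most
# negatives (10 -> 2), so the gain is positive and the conjunct looks good.
print(round(foil_gain(8, 10, 6, 2), 3))
```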

Classification as a Mathematical Mapping
- Classification predicts categorical class labels.
- Typical applications: {credit history, salary} -> credit approval (yes/no); {temperature, humidity} -> rain (yes/no).
- Mathematically: x ∈ X = R^n, y ∈ Y = {+1, -1}; we want to learn a function f: X -> Y.

Linear Classification
- Binary classification problem: the data above the separating line belong to class 'x', the data below it to class 'o'.
- Examples of such classifiers: SVM, perceptron, probabilistic classifiers.
(The slide's scatter plot of the two classes is omitted here.)

Discriminative Classifiers
- Advantages:
  - Prediction accuracy is generally higher (compared with Bayesian methods).
  - Robust: they work well on data sets containing errors.
  - Fast evaluation of the learned target function (Bayesian networks are normally slow).
- Limitations:
  - Long training time.
  - Many parameters must be set during network construction and training, and they are hard to determine.
  - Poor interpretability.

Classification by Backpropagation
- Backpropagation is a neural network learning algorithm.
- The field was pioneered by psychologists and neurobiologists seeking to develop and test computational analogues of neurons.
- A neural network is a set of connected input/output units in which each connection has an associated weight.
- During the learning phase, the network learns by adjusting the weights so as to predict the correct class label of the input tuples.
- Also referred to as "connectionist learning".

Neural Network as a Classifier
- Weaknesses:
  - Long training time.
  - A number of parameters are typically best determined empirically, e.g., the network topology or "structure".
  - Poor interpretability: it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network.
- Strengths:
  - High tolerance to noisy data, and the ability to classify untrained patterns.
  - Well suited for continuous-valued inputs and outputs.
  - Successful on a wide array of real-world data.
  - Algorithms are inherently parallel.
  - Techniques have recently been developed for extracting rules from trained neural networks.

A Neuron (Perceptron)
The n-dimensional input vector x is mapped into variable y by means of the scalar product with the weight vector w and a nonlinear activation function f, with bias μ_k:
  y = f(Σ_{i=0..n} w_i x_i - μ_k)
(The slide's diagram shows inputs x0..xn, weights w0..wn, a weighted-sum unit, and the activation function producing output y; a numpy sketch follows after the backpropagation steps below.)

A Multi-Layer Feed-Forward Neural Network
(The slide's diagram shows an input vector X feeding the input layer, weights w_ij into a hidden layer, and the hidden layer feeding an output layer that produces the output vector.)

How Does a Multi-Layer Neural Network Work?
- The inputs to the network correspond to the attributes measured for each training tuple.
- Inputs are fed simultaneously into the units making up the input layer; they are then weighted and fed simultaneously to a hidden layer.
- The number of hidden layers is arbitrary, although in practice usually only one is used.
- The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction for the given tuple.
- Feed-forward network: none of the weights cycles back to an input unit or to an output unit of a previous layer.
- From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, a multi-layer feed-forward network can closely approximate any function.

Defining a Network Topology
- First decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer.
- Normalize the input values of each attribute measured in the training tuples to [0.0, 1.0].
- Use one input unit per domain value, each initialized to 0.
- Output: for classification with more than two classes, one output unit per class is used.
- Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights.

Backpropagation
- Iteratively process the training tuples, comparing the network's prediction for each tuple with the actual known target value.
- For each training sample, modify the weights so as to minimize the mean squared error between the network's prediction and the actual target value.
- Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer.
- Steps:
  - Initialize the weights and biases in the network (to small random numbers).
  - Propagate the inputs forward (by applying the activation function).
  - Backpropagate the error (by updating the weights and biases).
  - Check the terminating condition (e.g., when the error is very small).
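A minimal numpy sketch of the neuron model and the forward/backward steps above, assuming a single sigmoid unit trained on one tuple with squared-error loss; it is an illustration under those assumptions, not the course's algorithm listing.

```python
import numpy as np

def sigmoid(z):
    # Nonlinear activation function f.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.2, 0.7, 0.1])        # input vector (attributes normalized to [0, 1])
t = 1.0                               # known target value
w = rng.normal(scale=0.1, size=3)     # initialize weights to small random numbers
b = 0.0                               # bias

for _ in range(100):
    y = sigmoid(w @ x + b)            # propagate inputs forward: weighted sum + f
    err = t - y                       # prediction error vs. the target
    grad = err * y * (1.0 - y)        # error term at the output unit
    w += 0.5 * grad * x               # backpropagate: update weights...
    b += 0.5 * grad                   # ...and bias (learning rate 0.5)

print(round(float(sigmoid(w @ x + b)), 3))   # prediction approaches the target 1.0
```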

SVM: Support Vector Machines
- A relatively new classification method for both linear and nonlinear data.
- It uses a nonlinear mapping to transform the original training data into a higher dimension.
- Within the new dimension, it searches for the linear optimal separating hyperplane (the decision boundary).
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane.
- SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors).

SVM: History and Applications
- Vapnik and colleagues (1992); groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s.
- Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization).
- Used both for classification and for prediction.
- Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests.

SVM: General Philosophy
(The slide contrasts two separating hyperplanes through the same support vectors: one with a small margin, one with a large margin.)

SVM: Margins and Support Vectors
(Illustrative figure only on this slide.)

SVM: When Data Is Linearly Separable
- Let the data D be (X1, y1), ..., (X|D|, y|D|), where Xi is a training tuple associated with the class label yi.
- There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
- SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH).

SVM: Linearly Separable
- A separating hyperplane can be written as W · X + b = 0, where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias).
- For 2-D it can be written as w0 + w1 x1 + w2 x2 = 0.
- The hyperplanes defining the sides of the margin:
  H1: w0 + w1 x1 + w2 x2 >= 1 for yi = +1, and
  H2: w0 + w1 x1 + w2 x2 <= -1 for yi = -1
- Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors.

SVM: Linearly Inseparable
- Transform the original input data into a higher dimensional space.
- Search for a linear separating hyperplane in the new space.

SVM: Kernel Functions
- Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj).
- Typical kernel functions:
  Polynomial kernel of degree h: K(Xi, Xj) = (Xi · Xj + 1)^h
  Gaussian radial basis function kernel: K(Xi, Xj) = e^(-||Xi - Xj||^2 / (2σ^2))
  Sigmoid kernel: K(Xi, Xj) = tanh(κ Xi · Xj - δ)
- SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters).
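Minimal numpy sketches of two of the kernels listed above; the parameter values h and sigma are arbitrary placeholders.

```python
import numpy as np

def polynomial_kernel(xi, xj, h=2):
    """K(Xi, Xj) = (Xi . Xj + 1)^h"""
    return (np.dot(xi, xj) + 1.0) ** h

def rbf_kernel(xi, xj, sigma=1.0):
    """K(Xi, Xj) = exp(-||Xi - Xj||^2 / (2 sigma^2))"""
    d = np.asarray(xi) - np.asarray(xj)
    return np.exp(-np.dot(d, d) / (2.0 * sigma ** 2))

a, b = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(polynomial_kernel(a, b), rbf_kernel(a, b))   # 2.25 and exp(-0.25)
```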

Associative Classification
- Association rule mining and analysis are carried out for use in classification: the search is for strong association rules between frequent patterns and class labels.
- Classification is based on evaluating a set of rules of the form
  p1 ∧ p2 ∧ ... ∧ pl -> "Aclass = C" (conf, sup)
- Why effective?
  - Association rules explore highly confident associations among multiple attributes, which may overcome the limitation of decision tree induction of considering only one attribute at a time.
  - Associative classification has proved more accurate than traditional classification methods in many cases.

Typical Associative Classification Methods
- CBA (Classification By Association: Liu, Hsu & Ma, KDD'98): mines possible association rules of the form cond-set (a set of attribute-value pairs) -> class label; builds a classifier by organizing rules according to decreasing precedence, based on confidence and then support.
- CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01): classification by statistical analysis of multiple rules.
- CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03): generation of predictive rules (FOIL-like analysis); high efficiency, accuracy similar to CMAR.
- RCBT (Mining top-k covering rule groups for gene expression data: Cong et al., SIGMOD'05): explores high-dimensional classification using top-k rule groups; achieves high classification accuracy and high run-time efficiency.

Lazy vs. Eager Learning
- Eager learning (the methods discussed so far): given a training set, construct a classification model before receiving new (e.g., test) data to classify.
- Lazy learning (e.g., instance-based learning): simply store the training data (or do only minor processing) and wait until a test tuple is given.
- Lazy: less time in training, but more time in predicting.

Lazy Learner: Instance-Based Methods
- Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
- Typical approaches:
  - k-nearest-neighbor approach: instances are represented as points in a Euclidean space.
  - Case-based reasoning: uses symbolic representations and knowledge-based inference.

The k-Nearest Neighbor Algorithm
- All instances correspond to points in n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2).
- The target function may be discrete- or real-valued.
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to the query point x_q.
(The slide illustrates a query point x_q among '+' and '-' training points.)
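A minimal k-NN sketch for a discrete-valued target, as described above; the points and labels are illustrative.

```python
from collections import Counter
from math import dist   # Euclidean distance (Python 3.8+)

def knn_classify(xq, points, labels, k=3):
    """Return the most common label among the k training points nearest to xq."""
    nearest = sorted(range(len(points)), key=lambda i: dist(xq, points[i]))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["+", "+", "+", "-", "-", "-"]
print(knn_classify((2, 2), points, labels))   # '+'
```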

Case-Based Reasoning (CBR)
- CBR uses a database of problem solutions to solve new problems.
- It stores symbolic descriptions (tuples or cases), not points in a Euclidean space.
- Applications: customer service (product-related diagnosis), legal rulings.
- Methodology:
  - Instances are represented by rich symbolic descriptions (e.g., function graphs).
  - Search for similar cases; multiple retrieved cases may be combined.
  - Tight coupling between case retrieval, knowledge-based reasoning, and problem solving.
- Challenges:
  - Finding a good similarity metric.
  - Indexing based on syntactic similarity measures and, on failure, backtracking and adapting to additional cases.

Genetic Algorithms (GA)
- Genetic algorithms are based on an analogy to biological evolution.
- An initial population is created, consisting of randomly generated rules. Each rule is represented by a string of bits; e.g., "IF A1 AND NOT A2 THEN C2" can be encoded as 100, and if an attribute has k > 2 values, k bits can be used.
- Based on the notion of survival of the fittest, a new population is formed from the fittest rules and their offspring. The fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.
- The process continues until a population P evolves in which each rule satisfies a prespecified fitness threshold.
- Slow, but easily parallelizable.

Rough Set Approach
- Rough sets are used to approximately or "roughly" define equivalence classes.
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C).
- Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computational intensity.

Fuzzy Set Approaches
- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (e.g., using a fuzzy membership graph).
- Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high}, with a fuzzy membership value calculated for each.
- For a given new sample, more than one fuzzy value may apply.
- Each applicable rule contributes a vote for membership in the categories; typically, the truth values for each predicted category are summed, and these sums are combined.

What Is Prediction?
- (Numerical) prediction is similar to classification: construct a model, then use the model to predict a continuous or ordered value for a given input.
- Prediction is different from classification: classification refers to predicting categorical class labels, while prediction models continuous-valued functions.
- The major method for prediction is regression: regression analysis can be used to model the relationship between one or more independent (predictor) variables and a (continuous-valued) dependent (response) variable.
- Regression methods: linear and multiple regression; nonlinear regression; other methods such as generalized linear models, Poisson regression, log-linear models, and regression trees.

Linear Regression
- Linear regression involves a response variable y and a single predictor variable x, modeled as y = w0 + w1 x, where w0 (the y-intercept) and w1 (the slope) are the regression coefficients.
- The coefficients are commonly estimated from the training data by the method of least squares.
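A minimal least-squares sketch for the single-predictor model y = w0 + w1 x above, using the closed-form estimates; the data points are illustrative.

```python
def linear_regression(xs, ys):
    """Least-squares estimates for y = w0 + w1 * x:
    w1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2);  w0 = y_bar - w1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Illustrative data: years of experience vs. salary (in $1000s).
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]
w0, w1 = linear_regression(xs, ys)
print(round(w0, 1), round(w1, 1))   # fitted intercept and slope
```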
