Chapter 7: Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by neural networks
- Classification by support vector machines (SVM)
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary

Classification:
- predicts categorical class labels (discrete or nominal)
- classifies data: constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

Prediction:
- models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications: credit approval, target marketing, medical diagnosis, treatment-effectiveness analysis.

Classification — A Two-Step Process
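Before the details, here is a minimal sketch of the two-step process: model construction on a training set, then accuracy estimation on an independent test set. The tiny data set and the majority-vote-per-attribute-value "model" are invented for illustration, not part of the original slides.

```python
# Sketch of classification as a two-step process:
# (1) model construction on a training set,
# (2) accuracy estimation on an independent test set.
# The data and the one-attribute majority-vote model are illustrative only.
from collections import Counter

train = [("professor", "yes"), ("professor", "yes"),
         ("assistant", "no"), ("assistant", "yes"), ("assistant", "no")]
test = [("professor", "yes"), ("assistant", "no"), ("assistant", "yes")]

# Step 1: model construction -- for each attribute value, predict the
# majority class label observed in the training set.
counts = {}
for rank, label in train:
    counts.setdefault(rank, Counter())[label] += 1
model = {rank: c.most_common(1)[0][0] for rank, c in counts.items()}

# Step 2: model usage -- classify the independent test set; the accuracy
# rate is the percentage of test samples correctly classified.
correct = sum(1 for rank, label in test if model[rank] == label)
accuracy = correct / len(test)
print(model, accuracy)
```

If the accuracy measured in step 2 is acceptable, the model would then be applied to tuples whose class labels are unknown.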
Model construction: describing a set of predetermined classes.
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, decision trees, or mathematical formulae.

Model usage: classifying future or unknown objects.
- Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples correctly classified by the model. The test set must be independent of the training set, otherwise over-fitting will occur.
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Classification Process (1): Model Construction
(Figure: training data fed to a classification algorithm, producing a classifier/model such as the rule IF rank = "professor" OR years > 6 THEN tenured = "yes".)

Classification Process (2): Use the Model in Prediction
(Figure: the classifier applied to testing data and to unseen data, e.g., (Jeff, Professor, 4) → Tenured?)

Supervised vs. Unsupervised Learning

Supervised learning (classification):
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
- New data are classified based on the training set.

Unsupervised learning
(clustering):
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values.
- Relevance analysis (feature selection): remove irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize data.

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Predictive accuracy.
- Speed and scalability: time to construct the model; time to use the model.
- Robustness: handling noise and missing values.
- Scalability: efficiency on disk-resident databases.
- Interpretability: understanding and insight provided by the model.
- Goodness of rules: decision tree size; compactness of classification rules.

Training Dataset
(This follows an example from Quinlan's ID3; the figure shows a 14-tuple training table with attributes age, income, student, credit_rating and class label buys_computer.)

Output: A Decision Tree for "buys_computer"
(Figure: the root tests age; for age <= 30 the tree tests student (no → no, yes → yes); for age 31..40 the leaf is yes; for age > 40 the tree tests credit_rating (excellent → no, fair → yes).)

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning — majority voting is employed for classifying the leaf.
- There are no samples left.

Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain. Suppose S contains s_i tuples of class C_i for i = 1, …, m.
- Information required to classify an arbitrary tuple: I(s_1, …, s_m) = −Σ_{i=1}^{m} (s_i/s) log_2(s_i/s)
- Entropy of attribute A with values {a_1, a_2, …, a_v}: E(A) = Σ_{j=1}^{v} ((s_{1j} + … + s_{mj})/s) · I(s_{1j}, …, s_{mj})
- Information gained by branching on attribute A: Gain(A) = I(s_1, …, s_m) − E(A)

Attribute Selection by Information Gain Computation
Class P: buys_computer = "yes"; class N: buys_computer = "no".
I(p, n) = I(9, 5) = 0.940
Compute the entropy for age: "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's. Hence
E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694, so Gain(age) = I(9,5) − E(age) = 0.246.
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected first.

Other Attribute Selection Measures
Gini index (CART, IBM IntelligentMiner):
- All attributes are assumed continuous-valued.
- Assume there exist several possible split values for each attribute.
- May need other tools, such as clustering, to get the possible split values.
- Can be modified for categorical attributes.

Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the Gini index gini(T) is defined as
gini(T) = 1 − Σ_{j=1}^{n} p_j²,
where p_j is the relative frequency of class j in T. If T is split into two subsets T_1 and T_2 with sizes N_1 and N_2 respectively, the Gini index of the split data is
gini_split(T) = (N_1/N)·gini(T_1) + (N_2/N)·gini(T_2).
The attribute providing the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules.
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction; the leaf node holds the class prediction.
- Rules are easier for humans to understand.
Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer
= "yes"

Avoid Overfitting in Classification
Overfitting: an induced tree may overfit the training data — too many branches, some of which reflect anomalies due to noise or outliers, giving poor accuracy on unseen samples. Two approaches to avoid overfitting:
- Prepruning: halt tree construction early — do not split a node if this would cause the goodness measure to fall below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree, getting a sequence of progressively pruned trees, and use a set of data different from the training data to decide which is the "best pruned tree".

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets.
- Use cross-validation, e.g., 10-fold cross-validation.
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution.
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized.

Enhancements to Basic Decision Tree Induction
- Allow continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.

Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine-learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed.
- Why decision tree induction in data mining? Relatively fast learning speed (compared with other classification methods); convertible to simple, easy-to-understand classification rules; can use SQL queries for accessing databases; classification accuracy comparable with other methods.

Scalable Decision Tree Induction Methods in Data Mining Studies
- SLIQ (EDBT'96 — Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
- SPRINT (VLDB'96 — J. Shafer et al.): constructs an attribute-list data structure.
- PUBLIC (VLDB'98 — Rastogi & Shim): integrates tree splitting and tree pruning, stopping tree growth earlier.
- RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).

Data Cube-Based Decision-Tree Induction
- Integration of generalization with decision-tree induction (Kamber et al. '97).
- Classification at primitive concept levels (e.g., precise temperature, humidity, outlook) yields low-level concepts, scattered classes, bushy classification trees, and semantic interpretation problems.
- Cube-based multi-level classification: relevance analysis at multiple levels; information-gain analysis with dimension + level.

(Figure slides: Presentation of Classification Results; Visualization of a Decision Tree in SGI/MineSet 3.0; Interactive Visual Mining by Perception-Based Classification (PBC).)

Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Bayes' Theorem: Basics
- Let X be a data sample whose class label is unknown, and let H be the hypothesis that X belongs to class C.
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
- P(H): prior probability of hypothesis H (the initial probability before we observe any data; it reflects background knowledge).
- P(X): the probability that the sample data is observed.
- P(X|H): the probability of observing sample X, given that the hypothesis holds.

Bayes' Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
P(H|X) = P(X|H) · P(H) / P(X)
Informally: posterior = likelihood × prior / evidence.
The MAP (maximum a posteriori) hypothesis is the H that maximizes P(X|H) · P(H).
Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost.

Naive Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class, so the probability of observing, say, two attribute values y_1 and y_2 given class C is the product of the individual probabilities:
P([y_1, y_2] | C) = P(y_1 | C) · P(y_2 | C)
There is no dependence relation between attributes, which greatly reduces the computation cost: only class distributions need to be counted. Once P(X|C_i) is known, assign X to the class with maximum P(X|C_i) · P(C_i).

Training Data Set
Classes: C1: buys_computer = "yes"; C2: buys_computer = "no".
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair).

Naive Bayesian Classifier: Example
Compute P(X|C_i) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|C_i) · P(C_i):
P(X | buys_computer = "yes") · P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") · P(buys_computer = "no") = 0.007
X belongs to class "buys_computer = yes".

Naive Bayesian Classifier: Comments
- Advantages: easy to implement; good results obtained in most cases.
- Disadvantages: the class-conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables. E.g., in hospital data a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.) are interdependent; such dependencies cannot be modeled by a naive Bayesian classifier.
- How to deal with these dependencies? Bayesian belief networks.

Bayesian Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent.
- It is a graphical model of causal relationships: it represents dependencies among the variables and gives a specification of the joint probability distribution.
- Nodes are random variables; links are dependencies. (Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P.) The graph has no loops or cycles.

Bayesian Belief Network: An Example
(Figure: a network over FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea. The conditional probability table (CPT) for the variable LungCancer shows P(LC) and P(~LC) for each possible combination of its parents: (FH,S), (FH,~S), (~FH,S), (~FH,~S).)

Learning Bayesian Networks — several cases:
- Both the network structure and all variables observable: learn only the CPTs.
- Network structure known, some hidden variables: method of gradient descent, analogous to neural-network learning.
- Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology.
- Unknown structure, all hidden variables: no good algorithms are known for this purpose. (See D. Heckerman, "Bayesian networks for data mining".)

Classification Mathematically
Classification predicts categorical class labels. Typical applications: {credit history, salary} → credit approval (yes/no); {temp, humidity} → rain (yes/no).

Linear Classification
- Binary classification problem. (Figure: the data above the red line belong to class 'x'; the data below the red line belong to class 'o'.)
- Examples: SVM, perceptron, probabilistic classifiers.

Discriminative Classifiers
Advantages:
- Prediction accuracy is generally high (compared with Bayesian methods, in general).
- Robust: works when training examples contain errors.
- Fast evaluation of the learned target function (Bayesian networks are normally slow).
Criticism:
- Long training time.
- Difficult to understand the learned function (weights); Bayesian networks can be used easily for pattern discovery.
- Not easy to incorporate domain knowledge (which is easy in the form of priors on the data or distributions).

Neural Networks
- Analogy to biological systems (indeed a great example of a good learning system).
- Massive parallelism, allowing computational efficiency.
- The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce these outputs, using the perceptron learning rule.

A Neuron
The n-dimensional input vector x is mapped into the variable y by means of a scalar product and a nonlinear function mapping: a weighted sum of the inputs, offset by a bias μ_k, is passed through an activation function f:
y = f(Σ_{i=0}^{n} w_i x_i − μ_k)

Multi-Layer Perceptron
(Figure: input nodes feeding hidden nodes feeding output nodes; input vector x_i, weights w_ij, output vector.)

Network Training
The ultimate objective of training is to obtain a set of weights that makes almost all the tuples in the training data classified correctly. Steps:
- Initialize weights with random values.
- Feed the input tuples into the network one by one.
- For each unit: compute the net input to the unit as a linear combination of all its inputs; compute the output value using the activation function; compute the error; update the weights and the bias.

Network Pruning and Rule Extraction
Network pruning:
- A fully connected network is hard to articulate: N input nodes, h hidden nodes and m output nodes lead to h(m + N) weights.
- Pruning: remove some of the links without affecting the classification accuracy of the network.
Extracting rules from a trained network:
- Discretize activation values; replace individual activation values by the cluster average while maintaining the network accuracy.
- Enumerate the outputs from the discretized activation values to find rules between activation values and outputs.
- Find the relationship between the inputs and activation values.
- Combine the above two to obtain rules relating the output to the input.

SVM — Support Vector Machines
(Figure: support vectors; a small-margin versus a large-margin separating hyperplane.)

Linear Support Vector Machine
Given a set of points x_i with labels y_i ∈ {−1, +1}, the SVM finds a hyperplane defined by the pair (w, b), where w is the normal to the plane and b is the bias determining its distance from the origin, such that
y_i (w · x_i + b) ≥ 1 for all i,
with x the feature vector, b the bias, and y the class label; the margin is 2/||w||, so maximizing the margin amounts to minimizing ||w||.

What if the data is not linearly separable?
Project the data to a high-dimensional space where it is linearly separable, and then use a linear SVM (using kernels). (Figure: a 1-D data set with points at −1, 0, +1 that is not linearly separable becomes separable after mapping to a higher-dimensional space.)

Non-Linear SVM
Classification uses an SVM (w, b) as before; in the nonlinear case the kernel can be thought of as computing a dot product in some high-dimensional space.

(Figure slides: Example of Non-linear SVM; Results; SVM Related Links.)

SVM vs. Neural Network
SVM:
- Relatively new concept; nice generalization properties.
- Hard to learn: learned in batch mode using quadratic programming techniques.
- Using kernels, can learn very complex functions.
Neural network:
- Quite old; generalizes well but does not have a strong mathematical foundation.
- Can easily be learned in incremental fashion.
- To learn complex functions, use a multilayer perceptron (not that trivial).

Association-Based Classification
Several methods for association-based classification:
- ARCS (Lent et al. '97): quantitative association mining and clustering of association rules; it beats C4.5 mainly in scalability and also in accuracy.
- Associative classification (Liu et al. '98): mines high-support, high-confidence rules of the form "cond_set => y", where y is a class label.
- CAEP, classification by aggregating emerging patterns (Dong et al. '99): emerging patterns (EPs) are itemsets whose support increases significantly from one class to another; EPs are mined based on minimum support and growth rate.

Other Classification Methods
k-nearest neighbor classifier; case-based reasoning; genetic algorithms; rough set approach; fuzzy set approaches.

Instance-Based Methods
Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified. Typical approaches:
- k-nearest neighbor: instances represented as points in a Euclidean space.
- Locally weighted regression: constructs a local approximation.
- Case-based reasoning: uses symbolic representations and knowledge-based inference.

The k-Nearest Neighbor Algorithm
- All instances correspond to points in n-dimensional space; the nearest neighbors are defined in terms of Euclidean distance.
- The target function can be discrete- or real-valued. For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to the query point x_q.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.

Discussion on the k-NN Algorithm
- For continuous-valued target functions, k-NN calculates the mean value of the k nearest neighbors.
- Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to their distance to the query point x_q, giving greater weight to closer neighbors; similarly for real-valued target functions.
- Robust to noisy data, by averaging over the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.

Case-Based Reasoning
- Also uses lazy evaluation and analysis of similar instances; the difference is that instances are not "points in a Euclidean space".
- Example: the water-faucet problem in CADET (Sycara et al. '92).
- Methodology: instances represented by rich symbolic descriptions (e.g., function graphs); multiple retrieved cases may be combined; tight coupling between case retrieval, knowledge-based reasoning, and problem solving.
- Research issues: indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases.

Remarks on Lazy vs. Eager Learning
- Instance-based learning uses lazy evaluation; decision-tree and Bayesian classification use eager evaluation.
- Key difference: a lazy method may consider the query instance x_q when deciding how to generalize beyond the training data D; an eager method cannot, since it has already chosen its global approximation before seeing the query.
- Efficiency: lazy methods need less time for training but more time for prediction.
- Accuracy: a lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form an implicit global approximation to the target function; an eager method must commit to a single hypothesis that covers the entire instance space.

Genetic Algorithms
- GA: based on an analogy to biological evolution.
- Each rule is represented by a string of bits, and an initial population is created consisting of randomly generated rules; e.g., "IF A1 AND NOT A2 THEN C2" can be encoded as 100.
- Based on the notion of survival of the fittest, a new population is formed from the fittest rules and their offspring; the fitness of a rule is its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.

Rough Set Approach
- Rough sets are used to approximately, or "roughly", define equivalence classes.
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C).
- Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix can be used to reduce the computational intensity.

Fuzzy Set Approaches
- Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (e.g., using a fuzzy membership graph).
- Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high} with calculated fuzzy values.
- For a given new sample, more than one fuzzy value may apply; each applicable rule contributes a vote for membership in the categories, and typically the truth values for each predicted category are summed.

What Is Prediction?
- Prediction is similar to classification: first, construct a model; second, use the model to predict unknown values. The major method for prediction is regression: linear and multiple regression; non-linear regression.
- Prediction differs from classification: classification predicts a categorical class label, while prediction models continuous-valued functions.

Predictive Modeling in Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data; one can only predict value ranges or category distributions.
- Method outline: minimal generalization; attribute relevance analysis; generalized linear model construction; prediction.
- Determine the major factors that influence the prediction: data relevance analysis via uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis.

Regression Analysis and Log-Linear Models in Prediction
- Linear regression: Y = α + βX. The two parameters α and β specify the line and are estimated from the data at hand, applying the least-squares criterion to the known values Y_1, Y_2, …, X_1, X_2, ….
- Multiple regression: Y = b_0 + b_1·X_1 + b_2·X_2. Many nonlinear functions can be transformed into this form.
- Log-linear models: the multi-way table of joint probabilities is approximated by a product of lower-order tables, e.g., p(a, b, c, d) = α_ab · β_ac · χ_ad · δ_bcd.

Locally Weighted Regression
- Construct an explicit approximation to f over a local region surrounding the query instance x_q.
- Locally weighted linear regression: the target…
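The simple linear regression just described, Y = α + βX with α and β fit by the least-squares criterion, can be sketched as follows; the data points are invented for illustration.

```python
# Least-squares fit of Y = alpha + beta * X, as in simple linear
# regression. The sample data points are invented for illustration.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Least-squares slope: sum((x - mx)(y - my)) / sum((x - mx)^2)
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the fitted line passes through the point of means.
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Exactly collinear sample generated from y = 1 + 2x, so the fit
# should recover alpha = 1 and beta = 2.
alpha, beta = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(alpha, beta)  # 1.0 2.0
```

The same closed-form estimates generalize to multiple regression (Y = b_0 + b_1·X_1 + b_2·X_2) by solving the normal equations in matrix form.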