Chapter 20: Data Analysis

- Decision Support Systems
- Data Warehousing
- Data Mining
- Classification
- Association Rules
- Clustering

Decision Support Systems

- Decision-support systems are used to make business decisions, often based on data collected by on-line transaction-processing systems.
- Examples of business decisions:
  - What items to stock?
  - What insurance premium to change?
  - To whom to send advertisements?
- Examples of data used for making decisions:
  - Retail sales transaction details
  - Customer profiles (income, age, gender, etc.)

Decision-Support Systems: Overview

- Data analysis tasks are simplified by specialized tools and SQL extensions
  - Example tasks
    - For each product category and each region, what were the total sales in the last quarter and how do they compare with the same quarter last year
    - As above, for each product category and each customer category
- Statistical analysis packages (e.g., S++) can be interfaced with databases
  - Statistical analysis is a large field, but not covered here
- Data mining seeks to discover knowledge automatically in the form of statistical rules and patterns from large databases.
- A data warehouse archives information gathered from multiple sources, and stores it under a unified schema, at a single site.
  - Important for large businesses that generate data from multiple divisions, possibly at multiple sites
  - Data may also be purchased externally

Data Warehousing

- Data sources often store only current data, not historical data
- Corporate decision making requires a unified view of all organizational data, including historical data
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
  - Greatly simplifies querying, permits study of historical trends
  - Shifts decision support query load away from transaction processing systems

Data Warehousing

[Figure: Data Warehousing (image not preserved)]

Design Issues

- When and how to gather data
  - Source driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)
  - Destination driven architecture: warehouse periodically requests new information from data sources
  - Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive
    - Usually OK to have slightly out-of-date data at warehouse
    - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
- What schema to use
  - Schema integration

More Warehouse Design Issues

- Data cleansing
  - E.g., correct mistakes in addresses (misspellings, zip code errors)
  - Merge address lists from different sources and purge duplicates
- How to propagate updates
  - Warehouse schema may be a (materialized) view of schema from data sources
- What data to summarize
  - Raw data may be too large to store on-line
  - Aggregate values (totals/subtotals) often suffice
  - Queries on raw data can often be transformed by query optimizer to use aggregate values

Warehouse Schemas

- Dimension values are usually encoded using small integers and mapped to full values via dimension tables
- Resultant schema is called a star schema
  - More complicated schema structures
    - Snowflake schema: multiple levels of dimension tables
    - Constellation: multiple fact tables

Data Warehouse Schema

[Figure: Data Warehouse Schema (image not preserved)]

Data Mining

- Data mining is the process of semi-automatically analyzing large databases to find useful patterns
- Prediction based on past history
  - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ..) and past history
  - Predict if a pattern of phone calling card usage is likely to be fraudulent
- Some examples of prediction mechanisms:
  - Classification
    - Given a new item whose class is unknown, predict to which class it belongs
  - Regression formulae
    - Given a set of mappings for an unknown function, predict the function result for a new parameter value

Data Mining (Cont.)

- Descriptive Patterns
  - Associations
    - Find books that are often bought by "similar" customers. If a new such customer buys one such book, suggest the others too.
  - Associations may be used as a first step in detecting causation
    - E.g., association between exposure to chemical X and cancer
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well
    - Detection of clusters remains important in detecting epidemics

Classification Rules

- Classification rules help assign new objects to classes.
  - E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for above example could use a variety of data, such as educational level, salary, age, etc.
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications
- Classification rules can be shown compactly as a decision tree.

Decision Tree

[Figure: Decision Tree (image not preserved)]

Construction of Decision Trees

- Training set: a data sample in which the classification is already known.
- Greedy top down generation of decision trees.
  - Each internal node of the tree partitions the data into groups based on a partitioning attribute, and a partitioning condition for the node
  - Leaf node:
    - all (or most) of the items at the node belong to the same class, or
    - all attributes have been considered, and no further partitioning is possible.

Best Splits

- Pick best attributes and conditions on which to partition
- The purity of a set S of training instances can be measured quantitatively in several ways.
  - Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
- The Gini measure of purity is defined as

  $$\mathrm{Gini}(S) = 1 - \sum_{i=1}^{k} p_i^2$$

  - When all instances are in a single class, the Gini value is 0
  - It reaches its maximum (of 1 - 1/k) if each class has the same number of instances.

Best Splits (Cont.)

- Another measure of purity is the entropy measure, which is defined as

  $$\mathrm{entropy}(S) = -\sum_{i=1}^{k} p_i \log_2 p_i$$

- When a set S is split into multiple sets S_i, i = 1, 2, …, r, we can measure the purity of the resultant set of sets as:

  $$\mathrm{purity}(S_1, S_2, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\, \mathrm{purity}(S_i)$$

- The information gain due to a particular split of S into S_i, i = 1, 2, …, r, is

  $$\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\}) = \mathrm{purity}(S) - \mathrm{purity}(S_1, S_2, \ldots, S_r)$$

Best Splits (Cont.)

- Measure of "cost" of a split:

  $$\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\}) = -\sum_{i=1}^{r} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

- Information-gain ratio:

  $$\text{Information-gain ratio} = \frac{\text{Information-gain}(S, \{S_1, S_2, \ldots, S_r\})}{\text{Information-content}(S, \{S_1, S_2, \ldots, S_r\})}$$

- The best split is the one that gives the maximum information gain ratio
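The measures above can be tried out on a small labelled sample. The following is a minimal Python sketch, not part of the original slides; the function names and the sample labels are hypothetical, and every subset is assumed to be non-empty. Note that for both Gini and entropy a value of 0 means the set holds a single class, so lower values mean purer sets.

```python
from collections import Counter
from math import log2

def class_fractions(labels):
    """Fractions p_i of instances in each class that occurs in labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini(S) = 1 - sum of p_i squared."""
    return 1.0 - sum(p * p for p in class_fractions(labels))

def entropy(labels):
    """entropy(S) = -sum of p_i * log2(p_i)."""
    return -sum(p * log2(p) for p in class_fractions(labels))

def split_purity(subsets, measure):
    """Weighted average of measure(S_i) with weights |S_i| / |S|."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * measure(s) for s in subsets)

def information_gain(labels, subsets, measure=entropy):
    """purity(S) - purity(S_1, ..., S_r) for the chosen measure."""
    return measure(labels) - split_purity(subsets, measure)

def information_content(subsets):
    """Cost of a split: -sum of (|S_i|/|S|) * log2(|S_i|/|S|)."""
    total = sum(len(s) for s in subsets)
    return -sum(len(s) / total * log2(len(s) / total) for s in subsets)

def gain_ratio(labels, subsets, measure=entropy):
    """Information gain divided by information content of the split."""
    return information_gain(labels, subsets, measure) / information_content(subsets)

# Hypothetical sample: 8 applicants, split into two groups of 4.
labels = ["good"] * 4 + ["excellent"] * 4
split = [["good"] * 3 + ["excellent"], ["good"] + ["excellent"] * 3]
print(gini(labels))     # 0.5
print(entropy(labels))  # 1.0
print(gain_ratio(labels, split))
```

Comparing `gain_ratio` across candidate splits and keeping the largest value corresponds to the selection rule stated on the slide.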
Finding Best Splits

- Categorical attributes (with no meaningful order):
  - Multi-way split, one child for each value
  - Binary split: try all possible breakup of values into two sets, and pick the best
- Continuous-valued attributes (can be sorted in a meaningful order)
  - Binary split:
    - Sort values, try each as a split point
      - E.g., if values are 1, 10, 15, 25, split at 1, 10, 15
    - Pick the value that gives best split
  - Multi-way split:
    - A series of binary splits on the same attribute has roughly equivalent effect

Decision-Tree Construction Algorithm
    Procedure GrowTree(S)
        Partition(S);

    Procedure Partition(S)
        if (purity(S) > δ_p or |S| < δ_s) then
            return;
        for each attribute A
            evaluate splits on attribute A;
        Use best split found (across all attributes) to partition
            S into S1, S2, …, Sr;
        for i = 1, 2, …, r
            Partition(Si);

Other Types of Classifiers

- Neural net classifiers are studied in artificial intelligence and are not covered here
- Bayesian classifiers use Bayes theorem, which says

  $$p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$

  where
  - p(c_j | d) = probability of instance d being in class c_j,
  - p(d | c_j) = probability of generating instance d given class c_j,
  - p(c_j) = probability of occurrence of class c_j, and
  - p(d) = probability of instance d occurring
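As a worked illustration of the formula, here is a small Python sketch that is not part of the original slides. The class names and probabilities are made up, and p(d) is obtained by summing p(d | c_j) p(c_j) over all classes, which assumes the classes are mutually exclusive and cover every instance.

```python
def posteriors(priors, likelihoods):
    """Return p(c | d) for every class c using Bayes theorem.

    priors[c]      is p(c)
    likelihoods[c] is p(d | c) for the one observed instance d
    """
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    p_d = sum(joint.values())  # p(d), assuming exclusive and exhaustive classes
    return {c: joint[c] / p_d for c in joint}

# Hypothetical numbers for a single instance d.
priors = {"low_risk": 0.7, "high_risk": 0.3}
likelihoods = {"low_risk": 0.1, "high_risk": 0.4}

result = posteriors(priors, likelihoods)
print(result)
print(max(result, key=result.get))  # high_risk
```

Although low_risk is the more common class here, the instance is four times as likely to be generated by high_risk, which is enough to make high_risk the more probable class.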
Naïve Bayesian Classifiers

- Bayesian classifiers require
  - computation of p(d | c_j)
  - precomputation of p(c_j)
  - p(d) can be ignored since it is the same for all classes
- To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

  $$p(d \mid c_j) = p(d_1 \mid c_j) \cdot p(d_2 \mid c_j) \cdots p(d_n \mid c_j)$$

  - Each of the p(d_i | c_j) can be estimated from a histogram on d_i values for each class c_j
    - the histogram is computed from the training instances
  - Histograms on multiple attributes are more expensive to compute and store

Regression

- Regression deals with the prediction of a value, rather than a class.
  - Given values for a set of variables, X_1, X_2, …, X_n, we wish to predict the value of a variable Y.
- One way is to infer coefficients a_0, a_1, a_2, …, a_n such that

  $$Y = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_n X_n$$

- Finding such a linear polynomial is called linear regression.
  - In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate
  - because of noise in the data, or
  - because the relationship is not exactly a polynomial
- Regression aims to find coefficients that give the best possible fit.

Association Rules

- Retail shops are often interested in associations between different items that people buy.
  - Someone who buys bread is quite likely also to buy milk
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
- Associations information can be used in several ways.
  - E.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules:
  - bread ⇒ milk
  - DB-Concepts, OS-Concepts ⇒ Networks
  - Left hand side: antecedent, right hand side: consequent
  - An association rule must have an associated population; the population consists of a set of instances
    - E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population

Association Rules (Cont.)

- Rules have an associated support, as well as an associated confidence.
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  - E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
- Confidence is a measure of how often the consequent is true when the antecedent is true.
  - E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.

Finding Association Rules

- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater)
- Naïve algorithm
  1. Consider all possible sets of relevant items.
  2. For each set find its support (i.e., count how many transactions purchase all items in the set).
     - Large itemsets: sets with sufficiently high support
  3. Use large itemsets to generate association rules.
     - From itemset A generate the rule A - {b} ⇒ b for each b ∈ A.
       - Support of rule = support(A).
       - Confidence of rule = support(A) / support(A - {b})

Finding Support

- Determine support of itemsets via a single pass on set of transactions
  - Large itemsets: sets with a high count at the end of the pass
- If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass.
- Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
- The a priori technique to find large itemsets:
  - Pass 1: count support of all sets with just 1 item. Eliminate those items with low support
  - Pass i: candidates: every set of i items such that all its i-1 item subsets are large
    - Count support of all candidates
    - Stop if there are no candidates

Other Types of Associations

- Basic association rules have several limitations
- Deviations from the expected probability are more interesting
  - E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both
  - We are interested in positive as well as negative correlations between sets of items
    - Positive correlation: co-occurrence is higher than predicted
    - Negative correlation: co-occurrence is lower than predicted
- Sequence associations/correlations
  - E.g., whenever bonds go up, stock prices go down in 2 days
- Deviations from temporal patterns
  - E.g., deviation from a steady growth
  - E.g., sales of winter wear go down in summer
    - Not surprising, part of a known pattern.
    - Look for deviation from value predicted using past patterns

Clustering

- Clustering: Intuitively, finding clusters of points in the given data such that similar points lie in the same cluster
- Can be formalized using distance metrics in several ways
  - Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
    - Centroid: point defined by taking average of coordinates in each dimension.
  - Another metric: minimize average distance between every pair of points in a cluster
- Has been studied extensively in statistics, but on small data sets
  - Data mining systems aim at clustering techniques that can handle very large data sets
  - E.g., the Birch clustering algorithm (more shortly)
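The first formalization on the Clustering slide (average distance of points from the centroid of their assigned group) can be evaluated directly for any candidate grouping. The Python sketch below is an illustration under assumptions, not part of the original slides: the helper names and the four sample points are hypothetical, distance is taken to be Euclidean, and the code only scores a given grouping rather than searching for the best one.

```python
from math import dist

def centroid(points):
    """Point obtained by averaging the coordinates in each dimension."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def average_distance_to_centroid(groups):
    """Average Euclidean distance of every point from its own group's centroid."""
    total = 0.0
    count = 0
    for group in groups:
        c = centroid(group)
        total += sum(dist(p, c) for p in group)
        count += len(group)
    return total / count

# Two candidate groupings (k = 2) of the same four hypothetical points.
left_right = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
bottom_top = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

print(average_distance_to_centroid(left_right))  # 1.0
print(average_distance_to_centroid(bottom_top))  # 5.0
```

Under this metric the left/right grouping is preferred, since its average distance to the centroid is lower.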
Hierarchical Clustering

- Example from biological classification
  - (the word classification here does not mean a prediction mechanism)

                       chordata
            mammalia              reptilia
      leopards    humans      snakes    crocodiles

- Other examples: Internet directory systems (e.g., Yahoo, more on this later)
- Agglomerative clustering algorithms
  - Build small clusters, then cluster small clusters into bigger clusters, and so on
- Divisive clustering algorithms
  - Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones

Clustering Algorithms

- Clustering algorithms have been designed to handle very large datasets
- E.g., the Birch algorithm
  - Main idea: use an in-memory R-tree to store points that are being clustered
  - Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some distance away
  - If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other
  - At the end of first pass we get a large number of clusters at the leaves of the R-tree
    - Merge clusters to reduce the number of clusters
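The insertion step described above (merge a new point into an existing cluster if it is close enough, otherwise start a new cluster) can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the Birch algorithm itself: a flat list stands in for the in-memory R-tree, the memory-driven merging of leaf clusters is omitted, and the `Cluster` class, function names and `threshold` parameter are all hypothetical.

```python
from math import dist

class Cluster:
    """Summary of a cluster: point count and per-dimension coordinate sums."""

    def __init__(self, point):
        self.count = 1
        self.sums = list(point)

    def centroid(self):
        return tuple(s / self.count for s in self.sums)

    def add(self, point):
        self.count += 1
        self.sums = [s + x for s, x in zip(self.sums, point)]

def insert_point(clusters, point, threshold):
    """Merge point into the nearest cluster if that cluster's centroid is
    closer than threshold; otherwise start a new cluster."""
    nearest = min(clusters, key=lambda c: dist(c.centroid(), point), default=None)
    if nearest is not None and dist(nearest.centroid(), point) < threshold:
        nearest.add(point)
    else:
        clusters.append(Cluster(point))

def cluster_stream(points, threshold):
    """Insert points one at a time and return the resulting cluster summaries."""
    clusters = []
    for p in points:
        insert_point(clusters, p, threshold)
    return clusters

# Hypothetical 2-D points and distance threshold.
points = [(0, 0), (1, 0), (10, 10), (11, 10), (0, 1)]
clusters = cluster_stream(points, threshold=3)
print([(c.count, c.centroid()) for c in clusters])
```

Keeping only a count and coordinate sums per cluster means each point is read once and need not be stored, which is the property that makes a single pass over a very large data set feasible. A linear scan for the nearest cluster would not scale, which is where a tree index comes in.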