Knowledge discovery & data mining
Tools, methods, and experiences

Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
http://www-kdd.di.unipi.it/
A tutorial @ EDBT2000
Konstanz, 27-28.3.2000, EDBT2000 tutorial - Intro

Contributors and acknowledgements
- The people @ Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students
- The many KDD tutorialists and teachers who made their slides available on the web (all of them listed in bibliography) ;-)
- In particular:
  - Jiawei HAN, Simon Fraser University, whose forthcoming book Data mining: concepts and techniques has influenced the whole tutorial
  - Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs
  - Daniel A. KEIM, University of Halle
  - Daniel Silver, CogNova Technologies
- The EDBT2000 board who accepted our tutorial proposal

Tutorial goals
- Introduce you to major aspects of the Knowledge Discovery Process, and theory and applications of Data Mining technology
- Provide a systematization of the many, many concepts around this area, according to the following lines:
  - the process
  - the methods applied to paradigmatic cases
  - the support environment
  - the research challenges
- Important issues that will not be covered in this tutorial:
  - methods: time series, exception detection, neural nets
  - systems: parallel implementations

Tutorial Outline
- Introduction and basic concepts
  - Motivations, applications, the KDD process, the techniques
- Deeper into DM technology
  - Decision Trees and Fraud Detection
  - Association Rules and Market Basket Analysis
  - Clustering and Customer Segmentation
- Trends in technology
- Knowledge Discovery Support Environment
- Tools, Languages and Systems
- Research challenges

Introduction - module outline
- Motivations
- Application Areas
- KDD Decisional Context
- KDD Process
- Architecture of a KDD system
- The KDD steps in short

Evolution of Database Technology: from data management to data analysis
- 1960s: Data collection, database creation, IMS and network DBMS.
- 1970s: Relational data model, relational DBMS implementation.
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
- 1990s: Data mining and data warehousing, multimedia databases, and Web technology.

Motivations
“Necessity is the Mother of Invention”
- Data explosion problem:
  - Automated data collection tools, mature database technology and the internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
- We are drowning in information, but starving for knowledge! (John Naisbitt)
- Data warehousing and data mining:
  - On-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

A rapidly emerging field
- Also referred to as: Data dredging, Data harvesting, Data archeology
- A multidisciplinary field:
  - Database
  - Statistics
  - Artificial intelligence
    - Machine learning, Expert systems and Knowledge Acquisition
  - Visualization methods

Motivations for DM
- Abundance of business and industry data
- Competitive focus - Knowledge Management
- Inexpensive, powerful computing engines
- Strong theoretical/mathematical foundations
  - machine learning & logic
  - statistics
  - database management systems

What is DM useful for?
- Increase knowledge to base decisions upon.
- E.g., impact on marketing
[Figure labels: Marketing, Database Marketing, Data Warehousing, KDD & Data Mining]

The Value Chain
- Data
  - Customer data
  - Store data
  - Demographical Data
  - Geographical data
- Information
  - X lives in Z
  - S is Y years old
  - X and S moved
  - W has money in Z
- Knowledge
  - A quantity Y of product A is used in region Z
  - Customers of class Y use x% of C during period D
- Decision
  - Promote product A in region Z.
  - Mail ads to families of profile P
  - Cross-sell service B to clients C

Application Areas and Opportunities
- Marketing: segmentation, customer targeting, ...
- Finance: investment support, portfolio management
- Banking & Insurance: credit and policy approval
- Security: fraud detection
- Science and medicine: hypothesis discovery, prediction, classification, diagnosis
- Manufacturing: process modeling, quality control, resource allocation
- Engineering: simulation and analysis, pattern recognition, signal processing
- Internet: smart search engines, web marketing

Classes of applications
- Market analysis
  - target marketing, customer relation management, market basket analysis, cross selling, market segmentation.
- Risk analysis
  - Forecasting, customer retention, improved underwriting, quality control, competitive analysis.
- Fraud detection
- Text (newsgroup, email, documents) and Web analysis.

Market Analysis
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies.
- Target marketing
  - Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/co-relations between product sales
  - Prediction based on the association information.

Market Analysis (2): Market Analysis and Management
- Customer profiling
  - data mining can tell you what types of customers buy what products (clustering or classification).
- Identifying customer requirements
  - identifying the best products for different customers
  - use prediction to find what factors will attract new customers
- Provides summary information
  - various multidimensional summary reports;
  - statistical summary information (data central tendency and variation)

Risk Analysis
- Finance planning and asset evaluation:
  - cash flow analysis and prediction
  - contingent claim analysis to evaluate assets
  - cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning:
  - summarize and compare the resources and spending
- Competition:
  - monitor competitors and market directions (CI: competitive intelligence).
  - group customers into classes and class-based pricing procedures
  - set pricing strategy in a highly competitive market

Fraud Detection
- Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
- Examples:
  - auto insurance: detect a group of people who stage accidents to collect on insurance
  - money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - medical insurance: detect professional patients and ring of doctors and ring of references

Fraud Detection (2)
- More examples:
  - Detecting inappropriate medical treatment: Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).
  - Detecting telephone fraud:
    - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
    - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
  - Retail: Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other applications
- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat.
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
- Watch for the PRIVACY pitfall!

What is KDD? A process!
- The selection and processing of data for:
  - the identification of novel, accurate, and useful patterns, and
  - the modeling of real-world phenomena.
- Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.

The KDD process
[Figure: stages of the KDD process, with labels Data Sources, Data Consolidation, Consolidated Data (Warehouse), Selection and Preprocessing, Prepared Data, Data Mining, Patterns & Models (p(x)=0.02), Interpretation and Evaluation, Knowledge]

The KDD Process Core Problems & Approaches
- Problems:
  - identification of relevant data
  - representation of data
  - search for valid pattern or model
- Approaches:
  - top-down deduction by expert
  - interactive visualization of data/models
  - * bottom-up induction from data *
[Figure labels: Data Mining, OLAP]

The steps of the KDD process
- Learning the application domain: relevant prior knowledge and goals of application
- Data consolidation: Creating a target data set
- Selection and Preprocessing
  - Data cleaning: (may take 60% of effort!)
  - Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation.
- Choosing functions of data mining
  - summarization, classification, regression, association, clustering.
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Interpretation and evaluation: analysis of results.
  - visualization, transformation, removing redundant patterns, …
- Use of discovered knowledge

The virtuous cycle
[Figure: cycle with labels Identify Problem or Opportunity, Act on Knowledge, Measure effect of Action, Knowledge, Results, Strategy, Problem]

Applications, operations, techniques

Roles in the KDD process

Data mining and business intelligence
[Figure: layered pyramid; potential to support business decisions increases toward the top. Roles shown alongside: End User, Business Analyst, Data Analyst, DBA]
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Architecture of a KDD system
[Figure: Graphical User Interface on top of Data Sources, Data Consolidation, Warehouse, Selection and Preprocessing, Data Mining, Interpretation and Evaluation, Knowledge]

A business intelligence environment

Data consolidation and preparation
- Garbage in, Garbage out
  - The quality of results relates directly to quality of the data
  - 50%-70% of KDD process effort is spent on data consolidation and preparation
  - Major justification for a corporate data warehouse

Data consolidation
- From data sources to consolidated data repository
[Figure: RDBMS, Legacy DBMS, Flat Files and External sources pass through Data Consolidation and Cleansing into a Warehouse (Object/Relation DBMS, Multidimensional DBMS, Deductive Database, Flat files)]
- Determine preliminary list of attributes
- Consolidate data into working database
  - Internal and External sources
- Eliminate or estimate missing values
- Remove outliers (obvious exceptions)
- Determine prior probabilities of categories and deal with volume bias

Data selection and preprocessing
- Generate a set of examples
  - choose sampling method
  - consider sample complexity
  - deal with volume bias issues
- Reduce attribute dimensionality
  - remove redundant and/or correlating attributes
  - combine attributes (sum, multiply, difference)
- Reduce attribute value ranges
  - group symbolic discrete values
  - quantize continuous numeric values
- Transform data
  - de-correlate and normalize values
  - map time-series data to static representation
- OLAP and visualization tools play key role
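A few of the consolidation and preprocessing steps listed in the slides (estimate missing values, remove outliers, quantize continuous values, normalize) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the slides do not prescribe any estimator or threshold, so the median fill, the median-absolute-deviation outlier test, the equal-width bins, the function names and the sample `ages` column are all hypothetical choices made for the example.

```python
from statistics import mean, median, pstdev

def fill_missing(values):
    """Estimate missing values (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def remove_outliers(values, k=5.0):
    """Drop obvious exceptions: points farther than k median absolute
    deviations (MAD) from the median. k is an arbitrary illustrative cutoff."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)  # no spread to judge against, keep everything
    return [v for v in values if abs(v - med) <= k * mad]

def quantize(values, n_bins):
    """Map continuous values to equal-width bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def normalize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical attribute column: one missing value, one obvious exception.
ages = [23, 35, None, 41, 29, 250, 38]
cleaned = remove_outliers(fill_missing(ages))
age_bins = quantize(cleaned, 3)
age_z = normalize(cleaned)
```

The order matters in this sketch: missing values are filled with a robust statistic before outliers are removed, so that the single extreme value does not distort the estimate.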
Data mining tasks and methods
- Automated Exploration/Discovery
  - e.g. discovering new market segments
  - clustering analysis
- Prediction/Classification
  - e.g. forecasting gross sales given current factors
  - regression, neural networks, genetic algorithms, decision trees
- Explanation/Description
  - e.g. characterizing customers by demographics and purchase history
  - decision trees, association rules
[Figure: scatter plot in the (x1, x2) plane; curve f(x) against x; rule "if age > 35 and income < $35k then ..."]

Automated exploration and discovery
- Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.
- Distance-based numerical clustering
  - metric grouping of examples (K-NN)
  - graphical visualization can be used
- Bayesian clustering
  - search for the number of classes which result in best fit of a probability distribution to the data
  - AutoClass (NASA) one of best examples

Prediction and classification
- Learning a predictive model
- Classification of a new case/sample
- Many methods:
  - Artificial neural networks
  - Inductive decision tree and rule systems
  - Genetic algorithms
  - Nearest neighbor clustering algorithms
  - Statistical (parametric, and non-parametric)

Generalization and regression
- The objective of learning is to achieve good generalization to new unseen cases.
- Generalization can be defined as a mathematical interpolation or regression over a set of training points
- Models can be validated with a previously unseen test set or using cross-validation methods
[Figure: curve f(x) against x]

Classification and prediction
- Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.
- Use obtained model to predict some unknown or missing attribute values based on other information.

Summarizing: inductive modeling = learning
- Objective: Develop a general model or hypothesis from specific examples
  - Function approximation (curve fitting)
  - Classification (concept learning, pattern recognition)
[Figure: classes A and B in the (x1, x2) plane; curve f(x) against x]

Explanation and description
- Learn a generalized hypothesis (model) from selected data
- Description/Interpretation of model provides new knowledge
- Methods:
  - Inductive decision tree and rule systems
  - Association rule systems
  - Link Analysis
  - …

Exception/deviation detection
- Generate a model of normal activity
- Deviation from model causes alert
- Methods:
  - Artificial neural networks
  - Inductive decision tree and rule systems
  - Statistical methods
  - Visualization tools

Outlier and exception data analysis
- Time-series analysis (trend and deviation):
  - Trend and deviation analysis: regression, sequential pattern, similar sequences, trend and deviation, e.g., stock analysis.
  - Similarity-based pattern-directed analysis
  - Full vs. partial periodicity analysis
- Other pattern-directed or statistical analysis

Are all the discovered patterns interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Interestingness measures:
  - easily understood by humans
  - valid on new or test data with some degree of certainty.
  - potentially useful
  - novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on user’s beliefs in the data, e.g., unexpectedness, novelty, etc.

Completeness vs. optimization
- Find all the interesting patterns: Completeness.
  - Can a data mining system find all the interesting patterns?
- Search for only interesting patterns: Optimization.
  - Can a data mining system find only the interesting patterns?
- Approaches
  - First generate all the patterns and then filter out the uninteresting ones.
  - Generate only the interesting patterns - mining query optimization.

Interpretation and evaluation
- Evaluation
  - Statistical validation and significance testing
  - Qualitative review by experts in the field
  - Pilot surveys to evaluate model accuracy
- Interpretation
  - Inductive tree and rule models can be read directly
  - Clustering results can be graphed and tabled
  - Code can be automatically generated by some systems (IDTs, Regression models)

Interpretation and evaluation
- Visualization tools can be very helpful
  - sensitivity analysis (I/O relationship)
  - histograms of value distribution
  - time-series plots and animation
  - requires training and practice
[Figure: Response plotted against Velocity and Temp]

- 1989 IJCAI Workshop on KDD
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, eds., 1991)
- 1991-1994 Workshops on KDD
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusam
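
The slides name support and confidence as objective interestingness measures, and "first generate all the patterns and then filter out the uninteresting ones" as one approach. A minimal Python sketch of that generate-then-filter idea for association rules follows. The basket data, the candidate rules, the thresholds and all function names are hypothetical, chosen only for illustration; the slides do not give an algorithm.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))

def support(transactions, itemset):
    """Fraction of transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also
    contain the consequent."""
    base = count(transactions, antecedent)
    if base == 0:
        return 0.0
    both = count(transactions, set(antecedent) | set(consequent))
    return both / base

def filter_rules(transactions, rules, min_support, min_confidence):
    """Generate-then-filter: keep only candidate rules that meet both thresholds."""
    kept = []
    for antecedent, consequent in rules:
        s = support(transactions, set(antecedent) | set(consequent))
        c = confidence(transactions, antecedent, consequent)
        if s >= min_support and c >= min_confidence:
            kept.append((antecedent, consequent))
    return kept

# Hypothetical market baskets and candidate rules (antecedent, consequent).
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
candidates = [
    ({"bread"}, {"milk"}),
    ({"milk"}, {"bread"}),
    ({"bread"}, {"butter"}),
    ({"eggs"}, {"milk"}),
]
interesting = filter_rules(baskets, candidates, min_support=0.5, min_confidence=0.7)
```

In this toy data the rule eggs -> milk has perfect confidence but is supported by a single basket, so the support threshold removes it; this is the kind of trade-off the objective measures are meant to expose.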