Knowledge discovery & data mining
Tools, methods, and experiences

Fosca Giannotti and Dino Pedreschi
Pisa KDD Lab, CNUCE-CNR & Univ. Pisa
www-kdd.di.unipi.it/
A tutorial at EDBT2000

Contributors and acknowledgements
- The people at Pisa KDD Lab: Francesco BONCHI, Giuseppe MANCO, Mirco NANNI, Chiara RENSO, Salvatore RUGGIERI, Franco TURINI and many students
- The many KDD tutorialists and teachers who made their slides available on the web (all of them listed in the bibliography) ;-)
- In particular:
  - Jiawei HAN, Simon Fraser University, whose forthcoming book "Data mining: concepts and techniques" has influenced the whole tutorial
  - Rajeev RASTOGI and Kyuseok SHIM, Lucent Bell Labs
  - Daniel A. KEIM, University of Halle
  - Daniel Silver, CogNova Technologies
- The EDBT2000 board who accepted our tutorial proposal

Tutorial goals
- Introduce you to major aspects of the Knowledge Discovery Process, and to the theory and applications of Data Mining technology
- Provide a systematization of the many, many concepts around this area, according to the following lines:
  - the process
  - the methods applied to paradigmatic cases
  - the support environment
  - the research challenges
- Important issues that will not be covered in this tutorial:
  - methods: time series, exception detection, neural nets
  - systems: parallel implementations

Tutorial Outline
- Introduction and basic concepts
  - Motivations, applications, the KDD process, the techniques
- Deeper into DM technology
  - Decision Trees and Fraud Detection
  - Association Rules and Market Basket Analysis
  - Clustering and Customer Segmentation
- Trends in technology
  - Knowledge Discovery Support Environment
  - Tools, Languages and Systems
- Research challenges

Introduction - module outline
- Motivations
- Application Areas
- KDD Decisional Context
- KDD Process
- Architecture of a KDD system
- The KDD steps in short

Evolution of Database Technology: from data management to data analysis
- 1960s: Data collection, database creation, IMS and network DBMS.
- 1970s: Relational data model, relational DBMS implementation.
- 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.).
- 1990s: Data mining and data warehousing, multimedia databases, and Web technology.

Motivations: "Necessity is the Mother of Invention"
- Data explosion problem: automated data collection tools, mature database technology and the internet lead to tremendous amounts of data stored in databases, data warehouses and other information repositories.
- "We are drowning in information, but starving for knowledge!" (John Naisbitt)
- Data warehousing and data mining:
  - On-line analytical processing
  - Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

A rapidly emerging field
- Also referred to as: data dredging, data harvesting, data archeology
- A multidisciplinary field:
  - Database
  - Statistics
  - Artificial intelligence: machine learning, expert systems and knowledge acquisition
  - Visualization methods

Motivations for DM
- Abundance of business and industry data
- Competitive focus - Knowledge Management
- Inexpensive, powerful computing engines
- Strong theoretical/mathematical foundations:
  - machine learning & logic
  - statistics
  - database management systems

What is DM useful for?
- Increase the knowledge to base decisions upon.
- E.g., impact on marketing.
[Figure labels: Marketing; Database Marketing; Data Warehousing; KDD & Data Mining]

The Value Chain
- Data: customer data, store data, demographical data, geographical data
- Information: X lives in Z; S is Y years old; X and S moved; W has money in Z
- Knowledge: a quantity Y of product A is used in region Z; customers of class Y use x% of C during period D
- Decision: promote product A in region Z; mail ads to families of profile P; cross-sell service B to clients C

Application Areas and Opportunities
- Marketing: segmentation, customer targeting, ...
- Finance: investment support, portfolio management
- Banking & Insurance: credit and policy approval
- Security: fraud detection
- Science and medicine: hypothesis discovery, prediction, classification, diagnosis
- Manufacturing: process modeling, quality control, resource allocation
- Engineering: simulation and analysis, pattern recognition, signal processing
- Internet: smart search engines, web marketing, ...

Classes of applications
- Market analysis: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
- Risk analysis: forecasting, customer retention, improved underwriting, quality control, competitive analysis
- Fraud detection
- Text (news group, ..., documents) and Web analysis

Market Analysis and Management
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/co-relations between product sales
  - Prediction based on the association information

Market Analysis (2)
- Customer profiling
  - Data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Identifying the best products for different customers
  - Use prediction to find what factors will attract new customers
- Provides summary information
  - Various multidimensional summary reports
  - Statistical summary information (data central tendency and variation)

Risk Analysis
- Finance planning and asset evaluation:
  - cash flow analysis and prediction
  - contingent claim analysis to evaluate assets
  - cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning:
  - summarize and compare the resources and spending
- Competition:
  - monitor competitors and market directions (CI: competitive intelligence)
  - group customers into classes and set class-based pricing procedures
  - set pricing strategy in a highly competitive market

Fraud Detection
- Applications: widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach: use historical data to build models of fraudulent behavior and use data mining to help identify similar instances.
- Examples:
  - auto insurance: detect a group of people who stage accidents to collect on insurance
  - money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - medical insurance: detect professional patients, rings of doctors and rings of references

Fraud Detection (2)
- More examples:
  - Detecting inappropriate medical treatment: the Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
  - Detecting telephone fraud: telephone call model (destination of the call, duration, time of day or week); analyze patterns that deviate from an expected norm. British Telecom identified discrete groups of callers with frequent intra-group calls, especially on mobile phones, and broke a multimillion dollar fraud.
  - Retail: analysts estimate that 38% of retail shrink is due to dishonest employees.

Other applications
- Sports: IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat.
- Astronomy: JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
- Internet Web Surf-Aid: IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
- Watch for the PRIVACY pitfall!

What is KDD? A process!
- The selection and processing of data for:
  - the identification of novel, accurate, and useful patterns, and
  - the modeling of real-world phenomena.
- Data mining is a major component of the KDD process - automated discovery of patterns and the development of predictive and explanatory models.

The KDD process
[Figure: Data Sources -> Data Consolidation -> Warehouse (Consolidated Data) -> Selection and Preprocessing -> Prepared Data -> Data Mining -> Patterns & Models (e.g., p(x)=0.02) -> Interpretation and Evaluation -> Knowledge]

The KDD Process: Core Problems & Approaches
- Problems:
  - identification of relevant data
  - representation of data
  - search for valid pattern or model
- Approaches:
  - top-down deduction by expert
  - interactive visualization of data/models
  - * bottom-up induction from data *
[Figure labels: Data Mining; OLAP]

The steps of the KDD process
- Learning the application domain: relevant prior knowledge and goals of the application
- Data consolidation: creating a target data set
- Selection and preprocessing
  - Data cleaning (may take 60% of effort!)
  - Data reduction and projection: find useful features, dimensionality/variable reduction, invariant representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Interpretation and evaluation: analysis of results (visualization, transformation, removing redundant patterns, ...)
- Use of discovered knowledge

The virtuous cycle
[Figure labels: Identify Problem or Opportunity; Act on Knowledge; Measure Effect of Action; Problem; Knowledge; Strategy; Results]

Applications, operations, techniques
[Figure not recoverable]

Roles in the KDD process
[Figure not recoverable]

Data mining and business intelligence
[Figure: pyramid with increasing potential to support business decisions toward the top. Layers, top to bottom:]
- Making Decisions
- Data Presentation: visualization techniques
- Data Mining: information discovery
- Data Exploration: OLAP, MDA, statistical analysis, querying and reporting
- Data Warehouses / Data Marts
- Data Sources: paper, files, information providers, database systems, OLTP
[Roles shown alongside the pyramid: End User, Business Analyst, Data Analyst, DBA]

Architecture of a KDD system
[Figure labels: Graphical User Interface; Data Sources; Data Consolidation; Warehouse; Selection and Preprocessing; Data Mining; Interpretation and Evaluation; Knowledge]

A business intelligence environment
[Figure not recoverable]

Data consolidation and preparation
- Garbage in, garbage out
- The quality of results relates directly to the quality of the data
- 50%-70% of the KDD process effort is spent on data consolidation and preparation
- Major justification for a corporate data warehouse

Data consolidation
- From data sources to a consolidated data repository
[Figure: RDBMS, Legacy DBMS, Flat Files, External -> Data Consolidation and Cleansing -> Warehouse (Object/Relation DBMS, Multidimensional DBMS, Deductive Database, Flat files)]

Data consolidation (2)
- Determine a preliminary list of attributes
- Consolidate data into a working database
  - Internal and external sources
- Eliminate or estimate missing values
- Remove outliers (obvious exceptions)
- Determine prior probabilities of categories and deal with volume bias

Data selection and preprocessing
- Generate a set of examples
  - choose sampling method
  - consider sample complexity
  - deal with volume bias issues
- Reduce attribute dimensionality
  - remove redundant and/or correlated attributes
  - combine attributes (sum, multiply, difference)
- Reduce attribute value ranges
  - group symbolic discrete values
  - quantize continuous numeric values
- Transform data
  - de-correlate and normalize values
  - map time-series data to a static representation
- OLAP and visualization tools play a key role

Data mining tasks and methods
- Automated Exploration/Discovery
  - e.g., discovering new market segments
  - clustering analysis
- Prediction/Classification
  - e.g., forecasting gross sales given current factors
  - regression, neural networks, genetic algorithms, decision trees
- Explanation/Description
  - e.g., characterizing customers by demographics and purchase history
  - decision trees, association rules
[Figure: scatter plot in (x1, x2); curve f(x) vs. x; rule of the form "if age ... 35 and income ... $35k then ..."]

Automated exploration and discovery
- Clustering: partitioning a set of data into a set of classes, called clusters, whose members share some interesting common properties.
- Distance-based numerical clustering
  - metric grouping of examples (K-NN)
  - graphical visualization can be used
- Bayesian clustering
  - search for the number of classes which results in the best fit of a probability distribution to the data
  - AutoClass (NASA) is one of the best examples

Prediction and classification
- Learning a predictive model
- Classification of a new case/sample
- Many methods:
  - Artificial neural networks
  - Inductive decision tree and rule systems
  - Genetic algorithms
  - Nearest neighbor clustering algorithms
  - Statistical (parametric and non-parametric)

Generalization and regression
- The objective of learning is to achieve good generalization to new, unseen cases.
- Generalization can be defined as a mathematical interpolation or regression over a set of training points.
- Models can be validated with a previously unseen test set or using cross-validation methods.
[Figure: f(x) vs. x]

Classification and prediction
- Classify data based on the values of a target attribute, e.g., classify countries based on climate, or classify cars based on gas mileage.
- Use the obtained model to predict some unknown or missing attribute values based on other information.

Summarizing: inductive modeling = learning
- Objective: develop a general model or hypothesis from specific examples
- Function approximation (curve fitting)
- Classification (concept learning, pattern recognition)
[Figure: classes A and B in the (x1, x2) plane; curve f(x) vs. x]

Explanation and description
- Learn a generalized hypothesis (model) from selected data
- Description/interpretation of the model provides new knowledge
- Methods:
  - Inductive decision tree and rule systems
  - Association rule systems
  - Link analysis

Exception/deviation detection
- Generate a model of normal activity
- Deviation from the model causes an alert
- Methods:
  - Artificial neural networks
  - Inductive decision tree and rule systems
  - Statistical methods
  - Visualization tools

- Outlier and exception data analysis
- Time-series analysis (trend and deviation):
  - Trend and deviation analysis: regression, sequential pattern, similar sequences, trend and deviation, e.g., stock analysis
  - Similarity-based pattern-directed analysis
  - Full vs. partial periodicity analysis
- Other pattern-directed or statistical analysis

Are all the discovered patterns interesting?
- A data mining system/query may generate thousands of patterns; not all of them are interesting.
- Interestingness measures: a pattern is interesting if it is
  - easily understood by humans
  - valid on new or test data with some degree of certainty
  - potentially useful
  - novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's beliefs in the data, e.g., unexpectedness, novelty, etc.

Completeness vs. optimization
- Find all the interesting patterns: completeness.
  - Can a data mining system find all the interesting patterns?
- Search for only interesting patterns: optimization.
  - Can a data mining system find only the interesting patterns?
- Approaches
  - First generate all the patterns and then filter out the uninteresting ones.
  - Generate only the interesting patterns - mining query optimization.

Interpretation and evaluation
- Evaluation
  - Statistical validation and significance testing
  - Qualitative review by experts in the field
  - Pilot surveys to evaluate model accuracy
- Interpretation
  - Inductive tree and rule models can be read directly
  - Clustering results can be graphed and tabled
  - Code can be automatically generated by some systems (IDTs, regression models)

Interpretation and evaluation (2)
- Visualization tools can be very helpful
  - sensitivity analysis (I/O relationship)
  - histograms of value distribution
  - time-series plots and animation
  - requires training and practice
[Figure axes: Response, Velocity, Temp]

- 1989: IJCAI Workshop on KDD
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, eds., 1991)
- 1991-1994: Workshops on KDD
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, eds., 1996)
- 1995-1998: AAAI Int. Conf. on KDD and DM (KDD95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- 1998: ACM SIGKDD
- 1999: SIGKDD99 Conf.
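
The objective interestingness measures named earlier in the tutorial (support and confidence) can be illustrated with a minimal sketch. This is not taken from the tutorial itself: the function names, the rule "bread -> milk" and the four toy transactions are hypothetical, and the code assumes the usual definitions, where support is the fraction of transactions containing an itemset and confidence is support(antecedent and consequent together) divided by support(antecedent).

```python
from typing import FrozenSet, List

# Hypothetical market-basket data, invented purely for illustration.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)  # subset test
    return hits / len(transactions)


def confidence(
    antecedent: FrozenSet[str],
    consequent: FrozenSet[str],
    transactions: List[FrozenSet[str]],
) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if base == 0.0:
        return 0.0  # rule never applies; report 0 rather than divide by zero
    return support(antecedent | consequent, transactions) / base


if __name__ == "__main__":
    bread = frozenset({"bread"})
    milk = frozenset({"milk"})
    print("support(bread, milk) =", support(bread | milk, TRANSACTIONS))
    print("confidence(bread -> milk) =", confidence(bread, milk, TRANSACTIONS))
```

On this toy data, 2 of the 4 transactions contain both items, so the support is 0.5; 3 transactions contain bread, so the confidence of "bread -> milk" is 2/3. A filter on minimum support and confidence is one simple way to realize the "generate all the patterns, then filter out the uninteresting ones" approach.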