![Introduction to Data Mining (courseware), page 1](http://file4.renrendoc.com/view/9e2ae2afa39f25efd68b4cf7a920c05a/9e2ae2afa39f25efd68b4cf7a920c05a1.gif)
![Introduction to Data Mining (courseware), page 2](http://file4.renrendoc.com/view/9e2ae2afa39f25efd68b4cf7a920c05a/9e2ae2afa39f25efd68b4cf7a920c05a2.gif)
![Introduction to Data Mining (courseware), page 3](http://file4.renrendoc.com/view/9e2ae2afa39f25efd68b4cf7a920c05a/9e2ae2afa39f25efd68b4cf7a920c05a3.gif)
![Introduction to Data Mining (courseware), page 4](http://file4.renrendoc.com/view/9e2ae2afa39f25efd68b4cf7a920c05a/9e2ae2afa39f25efd68b4cf7a920c05a4.gif)
![Introduction to Data Mining (courseware), page 5](http://file4.renrendoc.com/view/9e2ae2afa39f25efd68b4cf7a920c05a/9e2ae2afa39f25efd68b4cf7a920c05a5.gif)
What is Data Mining?
Introduction to Data Mining
Pi Dechang (皮德常), Professor and doctoral supervisor, College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics

Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
- Data collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining may help scientists in classifying and segmenting data

Mining Large Data Sets - Motivation
- Data rich but information poor!
- We are drowning in data, but starving for knowledge!
- Wow, so much data! How can we put it to use? Mine it!
- "Necessity is the mother of invention"
- Data mining: automated analysis of massive data sets

Mining Large Data Sets - Motivation
- A famous story: the product most often bought together with diapers is beer!
- The success of Google
  - Search engine: analyzing data on the Internet to find what meets your demand.
  - Larry Page (1973.3.26) & Sergey Brin (1973.8.21): fortunes of US$16.6 billion and US$14.1 billion; they share a Boeing 767

What is Data Mining?
- Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from huge volumes of data.
- U. Fayyad et al.'s definition of KDD at KDD96

What is (not) Data Mining?
- What is Data Mining? Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
- What is not Data Mining? Looking up a phone number in a phone directory

Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
- [Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

Architecture: Typical Data Mining System
- [Figure components: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Knowledge Base; Database or Data Warehouse Server; data cleaning, integration, and selection; sources: DB, DW, WWW, Other Info Repositories]

Data Mining Tasks
- Prediction: Use some variables to predict unknown or future values of other variables.
- Description: Find human-interpretable patterns that describe the data.
- From Fayyad et al., Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...
- Classification
- Clustering
- Association Rule Discovery
- Sequential Pattern Discovery
- Regression
- Deviation Detection

Classification Example
- [Figure: a table with attribute columns labelled categorical, categorical, continuous, and class; Training Set -> Learn Classifier -> Model; the Model is applied to the Test Set]

Classification: Application
Direct Marketing
- Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
- Approach:
  - Use the data for a similar product introduced before.
  - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
  - Collect some related information about the customers: type of business, where they stay, how much they earn, etc.
  - Use this information as input attributes to learn a classifier model.

Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.

Clustering
- Euclidean distance based clustering in 3-D space.
- Intra-cluster distances are minimized
- Inter-cluster distances are maximized

Clustering: Application
Document Clustering
- Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
- Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Illustrating Document Clustering
- Clustering points: 3204 articles of the Los Angeles Times.
- Similarity measure: how many words are common in these documents (after some word filtering).

Association Rule Discovery
- Given a set of records, each of which contains some number of items from a given collection;
- Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
- Rules discovered: {Diaper, Milk} -> {Beer}

Association Rule Discovery: Application 1
Supermarket shelf management
- Goal: To identify items that are bought together by sufficiently many customers.
- Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
- A classic rule: If a customer buys diapers and milk, then he is very likely to buy beer. So, don't be surprised if you find six-packs stacked next to diapers!

Regression
- Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in the statistics and neural network fields.
- Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.

Deviation/Anomaly Detection
- Detect significant deviations from normal behavior
- Applications:
  - Credit Card Fraud Detection
  - Network Intrusion Detection

Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data

My hope
Data mining research has been under way for nearly 15 years. To promote the wide application of this technology:
- 1. Industry has already begun to pay attention to data mining technology. What should research institutions do?
- 2. Research on the technology itself: ease of use, usability
- 3. Integration with application domains: finance, bioinformatics, information retrieval, aircraft fault diagnosis and prediction, reliability

My research in recent years
- 1. Mining Acceleration-like Association Rule
- 2. Interior-oriented Intrusion Detection System Based on Multi-agents
- 3. Fuzzy Clustering Algorithm
- 4. A Fast Trajectory Clustering Algorithm with Sampling
- 5. An improved C
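
The document clustering slides describe building a similarity measure from term frequencies after some word filtering. A minimal Python sketch of one such measure follows. It assumes cosine similarity over raw term counts, which is only one possible choice (the slides do not name a specific formula), and the stop-word list, the sample documents, and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list used for "word filtering".
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "on", "as"}


def term_frequencies(text):
    """Count how often each non-stop-word occurs in a document."""
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up documents: two about finance, one about sports.
finance_1 = term_frequencies("stocks fell as the market reacted to interest rates")
finance_2 = term_frequencies("interest rates rose and the stocks market fell")
sports_1 = term_frequencies("the team won the match in the final minute")

sim_same_topic = cosine_similarity(finance_1, finance_2)
sim_other_topic = cosine_similarity(finance_1, sports_1)
```

A clustering algorithm would then group documents whose pairwise similarity is high. Here the two finance documents score higher with each other than either does with the sports document.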
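
The association rule slides state the rule {Diaper, Milk} -> {Beer} without showing how such a rule is scored. The sketch below computes the two standard measures, support and confidence, for that rule. The five baskets are made-up illustrative data, not data from the lecture, and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical point-of-sale baskets.
baskets = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diaper", "Beer", "Eggs"}),
    frozenset({"Milk", "Diaper", "Beer", "Coke"}),
    frozenset({"Bread", "Milk", "Diaper", "Beer"}),
    frozenset({"Bread", "Milk", "Diaper", "Coke"}),
]

rule_support = support(baskets, {"Diaper", "Milk", "Beer"})
rule_confidence = confidence(baskets, {"Diaper", "Milk"}, {"Beer"})
```

On these baskets the rule holds in 2 of 5 transactions (support 0.4), and 2 of the 3 baskets containing both Diaper and Milk also contain Beer (confidence about 0.67). "Bought together by sufficiently many customers" corresponds to requiring a minimum support.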
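
For the regression slide, the simplest case is a linear model with one input variable, such as predicting sales from advertising expenditure. The following is a sketch of an ordinary least-squares line fit in plain Python. The numbers are invented for illustration and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend and resulting sales (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5.0 + intercept  # forecast for a spend of 5.0
```

Because the sample data lie exactly on a line, the fit recovers slope 2 and intercept 1. Real data would leave residuals, and a nonlinear model of dependency would need a different fitting procedure.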
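
The deviation detection slide lists credit card fraud as an application. One very simple way to flag "significant deviations from normal behavior" is a z-score threshold, sketched below. This is an illustrative baseline only, not a method taken from the lecture. The transaction amounts, the threshold of 2.0, and the function name are assumptions.

```python
import statistics


def find_deviations(values, threshold=2.0):
    """Return the values lying more than threshold population standard
    deviations away from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical card transaction amounts with one unusually large purchase.
amounts = [20, 25, 22, 19, 24, 21, 500]
suspicious = find_deviations(amounts)
```

Only the 500 transaction is flagged here. A single extreme value inflates both the mean and the standard deviation, so practical systems tend to prefer more robust statistics or learned models of normal behavior.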