Introduction to Data Mining (Courseware)

Uploaded: 2022-07-27. Format: PPT. Pages: 30. Size: 4.15 MB.


What is Data Mining?
Introduction to Data Mining
College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics
Pi Dechang, Professor and Doctoral Supervisor

Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused:
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful.
- Competitive pressure is strong: provide better, customized services for an edge (e.g. in Customer Relationship Management).

Why Mine Data? Scientific Viewpoint
- Data is collected and stored at enormous speeds (GB/hour):
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data.
- Data mining may help scientists in classifying and segmenting data.

Mining Large Data Sets - Motivation
- Data rich but information poor!
- We are drowning in data, but starving for knowledge!
- Wow, so much data! How can it be put to use? Mine it!
- "Necessity is the mother of invention."
- Data mining: automated analysis of massive data sets.

Mining Large Data Sets - Motivation
- A famous story: the product most often bought together with diapers is beer!
- The success of the Google search engine: analyzing data on the internet to find what meets your demand.
- Larry Page (born 1973-03-26) and Sergey Brin (born 1973-08-21): fortunes of US$16.6 billion and US$14.1 billion; they share a Boeing 767.

What is Data Mining?
Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from a huge volume of data. (U. Fayyad et al.'s definition of KDD, at KDD'96.)

What is (not) Data Mining?
- What is data mining? Discovering that certain names are more prevalent in certain US locations (O'Brien, O'Rurke, O'Reilly in the Boston area).
- What is not data mining? Looking up a phone number in a phone directory.

Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.
- Traditional techniques may be unsuitable due to:
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Figure: data mining at the intersection of machine learning/pattern recognition, statistics/AI, and database systems.]

Architecture: Typical Data Mining System
- Data cleaning, integration, and selection
- Database or Data Warehouse Server
- Data Mining Engine

- Pattern Evaluation
- Graphical User Interface
- Knowledge Base
- Data sources: DB, DW, WWW, other info repositories

Data Mining Tasks
- Prediction: use some variables to predict unknown or future values of other variables.
- Description: find human-interpretable patterns that describe the data.
(From Fayyad et al., Advances in Knowledge Discovery and Data Mining, 1996.)

Data Mining Tasks
- Classification
- Clustering
- Association Rule Discovery
- Sequential Pattern Discovery
- Regression
- Deviation Detection

Classification Example
[Figure: a training set with two categorical attributes, one continuous attribute, and a class label is used to learn a classifier (the model), which is then applied to a test set.]

Classification: Application
Direct Marketing
- Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
- Approach:
  - Use the data for a similar product introduced before.
  - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
  - Collect some related information about the customers: type of business, where they stay, how much they earn, etc.
  - Use this information as input attributes to learn a classifier model.

Clustering Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- Data points in one cluster are more similar to one another.
- Data points in separate clusters are less similar to one another.

Clustering
[Figure: Euclidean-distance-based clustering in 3-D space; intra-cluster distances are minimized, inter-cluster distances are maximized.]

Clustering: Application
Document Clustering:

- Goal: to find groups of documents that are similar to each other based on the important terms appearing in them.
- Approach: identify frequently occurring terms in each document, form a similarity measure based on the frequencies of different terms, and use it to cluster.
- Gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Illustrating Document Clustering
- Clustering points: 3204 articles of the Los Angeles Times.
- Similarity measure: how many words are common in these documents (after some word filtering).

Association Rule Discovery
- Given a set of records, each of which contains some number of items from a given collection,
- produce dependency rules which will predict the occurrence of an item based on occurrences of other items.
- Rule discovered: {Diaper, Milk} -> {Beer}

Association Rule Discovery: Application 1
Supermarket shelf management
- Goal: to identify items that are bought together by sufficiently many customers.
- Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
- A classic rule: if a customer buys diapers and milk, then he is very likely to buy beer. So, don't be surprised if you find six-packs stacked next to diapers!

Regression
- Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in the statistics and neural network fields.
- Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.

Deviation/Anomaly Detection
- Detect significant deviations from normal behavior.
- Applications:
  - Credit card fraud detection
  - Network intrusion detection

Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy preservation
- Streaming data

My hope
Data mining research has been under way for nearly 15 years. To promote wide application of the technology:
1. Industry has begun to pay attention to data mining. What should research institutions do?
2. Research on the technology itself: ease of use and usability.
3. Integration with application domains: finance, bioinformatics, information retrieval, aircraft fault diagnosis and prediction, reliability,

My research in recent years
1. Mining Acceleration-like Association Rule
2. Interior-oriented Intrusion Detection System Based on Multi-agents
3. Fuzzy Clustering Algorithm
4. A Fast Trajectory Clustering Algorithm with Sampling
5. An improved C
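
Worked example: association rule discovery. The deck's rule {Diaper, Milk} -> {Beer} is described in terms of items "bought together by sufficiently many customers" and a customer being "very likely" to buy beer. One common way to quantify these two ideas is with support and confidence; the deck does not name these measures, so treating them as the intended ones is an assumption. The baskets below are invented purely for illustration.

```python
# Hypothetical point-of-sale baskets, invented for illustration only.
transactions = [
    {"Diaper", "Milk", "Beer"},
    {"Diaper", "Milk", "Beer", "Bread"},
    {"Diaper", "Milk"},
    {"Milk", "Bread"},
    {"Diaper", "Beer"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A).

    Assumes the antecedent occurs in at least one basket.
    """
    joint = support(set(antecedent) | set(consequent), baskets)
    return joint / support(antecedent, baskets)


# "Bought together by sufficiently many customers" -> support of the full itemset.
rule_support = support({"Diaper", "Milk", "Beer"}, transactions)
# "Very likely to buy beer" -> confidence of {Diaper, Milk} -> {Beer}.
rule_confidence = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this toy data, two of the five baskets contain all three items, and two of the three baskets with diapers and milk also contain beer. A real system would enumerate candidate itemsets rather than check one hand-picked rule.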

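Worked example: document clustering. The deck's approach is to count frequently occurring terms in each document, build a similarity measure from those frequencies, and use it to cluster. The sketch below follows those steps under several assumptions that are not in the deck: cosine similarity as the measure, a hand-written stop-word list as the "word filtering", a greedy single-pass grouping with an arbitrary threshold as the clustering step, and an invented four-document corpus.

```python
import math
from collections import Counter

# Hypothetical mini-corpus and stop-word list, invented for illustration.
documents = [
    "the market fell as stock prices dropped",
    "stock market prices rose after the report",
    "the team won the match in the final minute",
    "the match ended after the team scored",
]
STOP_WORDS = {"the", "as", "in", "after", "a", "of"}


def term_frequencies(text):
    """Count the terms that remain after simple stop-word filtering."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def threshold_clusters(vectors, threshold=0.3):
    """Greedy single pass: a document joins the first cluster whose first
    member is similar enough; otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


vectors = [term_frequencies(d) for d in documents]
clusters = threshold_clusters(vectors)
print(clusters)
```

The greedy pass is order-dependent and is used here only because it fits in a few lines; any clustering method that accepts a pairwise similarity could take its place.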