数据挖掘:实用机器学习工具与技术-01数据挖掘概述_第1页
数据挖掘:实用机器学习工具与技术-01数据挖掘概述_第2页
数据挖掘:实用机器学习工具与技术-01数据挖掘概述_第3页
数据挖掘:实用机器学习工具与技术-01数据挖掘概述_第4页
数据挖掘:实用机器学习工具与技术-01数据挖掘概述_第5页
已阅读5页,还剩37页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 1 of Data Mining by I. H. Witten, E. Frank and M. A. Hall,2,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Whats it all about?,Data vs information Data mining and machine learning Structural desc

2、riptions Rules: classification and association Decision trees Datasets Weather, contact lens, CPU performance, labor negotiation data, soybean classification Fielded applications Ranking web pages, loan applications, screening images, load forecasting, machine fault diagnosis, market basket analysis

3、 Generalization as search Data mining and ethics,3,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Data vs. information,Society produces huge amounts of data Sources: business, science, medicine, economics, geography, environment, sports, Potentially valuable resource Raw da

4、ta is useless: need techniques to automatically extract information from it Data: recorded facts Information: patterns underlying the data,4,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Information is crucial,Example 1: in vitro fertilization Given: embryos described by 6

5、0 features Problem: selection of embryos that will survive Data: historical records of embryos and outcome Example 2: cow culling Given: cows described by 700 features Problem: selection of cows that should be culled Data: historical records and farmers decisions,5,Data Mining: Practical Machine Lea

6、rning Tools and Techniques (Chapter 1),Data mining,Extracting implicit, previously unknown, potentially useful information from data Needed: programs that detect patterns and regularities in the data Strong patterns good predictions Problem 1: most patterns are not interesting Problem 2: patterns ma

7、y be inexact (or spurious) Problem 3: data may be garbled or missing,6,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Machine learning techniques,Algorithms for acquiring structural descriptions from examples Structural descriptions represent patterns explicitly Can be used

8、 to predict outcome in new situation Can be used to understand and explain how prediction is derived(may be even more important) Methods originate from artificial intelligence, statistics, and research on databases,7,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Structural

9、 descriptions,Example: if-then rules,8,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Can machines really learn?,Definitions of “learning” from dictionary:,To get knowledge of by study,experience, or being taught To become aware by information orfrom observation To commit t

10、o memory To be informed of, ascertain; to receive instruction,9,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),The weather problem,Conditions for playing a certain game,Yes,False,Normal,Mild,Rainy,Yes,False,High,Hot,Overcast,No,True,High,Hot,Sunny,No,False,High,Hot,Sunny,Pl

11、ay,Windy,Humidity,Temperature,Outlook,10,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Ross Quinlan,Machine learning researcher from 1970s University of Sydney, Australia 1986 “Induction of decision trees” ML Journal 1993 C4.5: Programs for machine learning. Morgan Kaufman

12、n 199? Started,11,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Classification vs. association rules,Classification rule:predicts value of a given attribute (the classification of an example) Association rule:predicts value of arbitrary attribute (or combination),12,Data M

13、ining: Practical Machine Learning Tools and Techniques (Chapter 1),Weather data with mixed attributes,Some attributes have numeric values,13,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),The contact lenses data,14,Data Mining: Practical Machine Learning Tools and Technique

14、s (Chapter 1),A complete and correct rule set,15,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),A decision tree for this problem,16,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Classifying iris flowers,17,Data Mining: Practical Machine Learning T

15、ools and Techniques (Chapter 1),Example: 209 different computer configurations Linear regression function,Predicting CPU performance,18,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Data from labor negotiations,19,Data Mining: Practical Machine Learning Tools and Technique

16、s (Chapter 1),Decision trees for the labor data,20,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Soybean classification,Diaporthe stem canker,19,Diagnosis,Normal,3,Condition,Root,Yes,2,Stem lodging,Abnormal,2,Condition,Stem,?,3,Leaf spot size,Abnormal,2,Condition,Leaf,?,5,

17、Fruit spots,Normal,4,Condition of fruit pods,Fruit,Absent,2,Mold growth,Normal,2,Condition,Seed,Above normal,3,Precipitation,July,7,Time of occurrence,Environment,Sample value,Number of values,Attribute,21,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),The role of domain kn

18、owledge,But in this domain, “leaf condition is normal” implies“leaf malformation is absent”!,22,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Fielded applications,The result of learningor the learning method itselfis deployed in practical applications Processing loan appli

19、cations Screening images for oil slicks Electricity supply forecasting Diagnosis of machine faults Marketing and sales Separating crude oil and natural gas Reducing banding in rotogravure printing Finding appropriate technicians for telephone faults Scientific applications: biology, astronomy, chemi

20、stry Automatic selection of TV programs Monitoring intensive care patients,23,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Processing loan applications (American Express),Given: questionnaire withfinancial and personal information Question: should money be lent? Simple st

21、atistical method covers 90% of cases Borderline cases referred to loan officers But: 50% of accepted borderline cases defaulted! Solution: reject all borderline cases? No! Borderline cases are most active customers,24,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Enter mac

22、hine learning,1000 training examples of borderline cases 20 attributes: age years with current employer years at current address years with the bank other credit cards possessed, Learned rules: correct on 70% of cases human experts only 50% Rules could be used to explain decisions to customers,25,Da

23、ta Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Screening images,Given: radar satellite images of coastal waters Problem: detect oil slicks in those images Oil slicks appear as dark regions with changing size and shape Not easy: lookalike dark regions can be caused by weather

24、conditions (e.g. high wind) Expensive process requiring highly trained personnel,26,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Enter machine learning,Extract dark regions from normalized image Attributes: size of region shape, area intensity sharpness and jaggedness of

25、boundaries proximity of other regions info about background Constraints: Few training examplesoil slicks are rare! Unbalanced data: most dark regions arent slicks Regions from same image form a batch Requirement: adjustable false-alarm rate,27,Data Mining: Practical Machine Learning Tools and Techni

26、ques (Chapter 1),Load forecasting,Electricity supply companiesneed forecast of future demandfor power Forecasts of min/max load for each hour significant savings Given: manually constructed load model that assumes “normal” climatic conditions Problem: adjust for weather conditions Static model consi

27、st of: base load for the year load periodicity over the year effect of holidays,28,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Enter machine learning,Prediction corrected using “most similar” days Attributes: temperature humidity wind speed cloud cover readings plus diff

28、erence between actual load and predicted load Average difference among three “most similar” days added to static model Linear regression coefficients form attribute weights in similarity function,29,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Diagnosis of machine faults,

29、Diagnosis: classical domainof expert systems Given: Fourier analysis of vibrations measured at various points of a devices mounting Question: which fault is present? Preventative maintenance of electromechanical motors and generators Information very noisy So far: diagnosis by expert/hand-crafted ru

30、les,30,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Enter machine learning,Available: 600 faults with experts diagnosis 300 unsatisfactory, rest used for training Attributes augmented by intermediate concepts that embodied causal domain knowledge Expert not satisfied with

31、 initial rules because they did not relate to his domain knowledge Further background knowledge resulted in more complex rules that were satisfactory Learned rules outperformed hand-crafted ones,31,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Marketing and sales I,Compani

32、es precisely record massive amounts of marketing and sales data Applications: Customer loyalty:identifying customers that are likely to defect by detecting changes in their behavior(e.g. banks/phone companies) Special offers:identifying profitable customers(e.g. reliable owners of credit cards that

33、need extra money during the holiday season),32,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Marketing and sales II,Market basket analysis Association techniques findgroups of items that tend tooccur together in atransaction(used to analyze checkout data) Historical analys

34、is of purchasing patterns Identifying prospective customers Focusing promotional mailouts(targeted campaigns are cheaper than mass-marketed ones),33,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Machine learning and statistics,Historical difference (grossly oversimplified)

35、: Statistics: testing hypotheses Machine learning: finding the right hypothesis But: huge overlap Decision trees (C4.5 and CART) Nearest-neighbor methods Today: perspectives have converged Most ML algorithms employ statistical techniques,34,Data Mining: Practical Machine Learning Tools and Technique

36、s (Chapter 1),Statisticians,Sir Ronald Aylmer Fisher Born: 17 Feb 1890 London, EnglandDied: 29 July 1962 Adelaide, Australia Numerous distinguished contributions to developing the theory and application of statistics for making quantitative a vast field of biology,Leo Breiman Developed decision tree

37、s 1984 Classification and Regression Trees. Wadsworth.,35,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Generalization as search,Inductive learning: find a concept description that fits the data Example: rule sets as description language Enormous, but finite, search space

38、Simple solution: enumerate the concept space eliminate descriptions that do not fit examples surviving descriptions contain target concept,36,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Enumerating the concept space,Search space for weather problem 4 x 4 x 3 x 3 x 2 = 28

39、8 possible combinations With 14 rules 2.7x1034 possible rule sets Other practical problems: More than one description may survive No description may survive Language is unable to describe target concept or data contains noise Another view of generalization as search:hill-climbing in description spac

40、e according to pre-specified matching criterion Most practical algorithms use heuristic search that cannot guarantee to find the optimum solution,37,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Bias,Important decisions in learning systems: Concept description language Ord

41、er in which the space is searched Way that overfitting to the particular training data is avoided These form the “bias” of the search: Language bias Search bias Overfitting-avoidance bias,38,Data Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Language bias,Important question: is

42、 language universalor does it restrict what can be learned? Universal language can express arbitrary subsets of examples If language includes logical or (“disjunction”), it is universal Example: rule sets Domain knowledge can be used to exclude some concept descriptions a priori from the search,39,D

43、ata Mining: Practical Machine Learning Tools and Techniques (Chapter 1),Search bias,Search heuristic “Greedy” search: performing the best single step “Beam search”: keeping several alternatives Direction of search General-to-specific E.g. specializing a rule by adding conditions Specific-to-general E.g. g

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论