版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、November 1, 2021Data Mining: Concepts and Techniques1Data Mining: Concepts and Techniques Chapter 5 Jianlin ChengDepartment of Computer Science University of Missouri, ColumbiaAdapted from Slides of the Text Book2006 Jiawei Han and Micheline Kamber, All rights reservedNovember 1, 2021Data Mining: Co
2、ncepts and Techniques2Chapter 5: Mining Frequent Patterns, Association and CorrelationsnBasic concepts nEfficient and scalable frequent itemset mining methodsnMining various kinds of association rulesnFrom association mining to correlation analysisnConstraint-based association miningnSummaryNovember
3、 1, 2021Data Mining: Concepts and Techniques3What Is Frequent Pattern Analysis?nFrequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set nFirst proposed by Agrawal, Imielinski, and Swami AIS93 in the context of frequent itemsets and associa
4、tion rule miningnMotivation: Finding inherent regularities in datanWhat products were often purchased together? Beer and diapers?!nWhat are the subsequent purchases after buying a PC?nWhat kinds of DNA are sensitive to this new drug?nCan we automatically classify web documents?nApplicationsnBasket d
5、ata analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.November 1, 2021Data Mining: Concepts and Techniques4Why Is Freq. Pattern Mining Important?nDiscloses an intrinsic and important property of data setsnForms the foundatio
6、n for many essential data mining tasksnAssociation, correlation, and causality analysisnSequential, structural (e.g., sub-graph) patternsnPattern analysis in spatiotemporal, multimedia, time-series, and stream data nClassification: associative classificationnCluster analysis: frequent pattern-based
7、clusteringnBroad applicationsNovember 1, 2021Data Mining: Concepts and Techniques5Basic Concepts: Frequent Patterns and Association RulesnItemset X = x1, , xknFind all the rules X Y with minimum support and confidencensupport, s, probability that a transaction contains X Ynconfidence, c, conditional
8、 probability that a transaction having X also contains YLet supmin = 50%, confmin = 50%Freq. Pat.: A:3, B:3, D:4, E:3, AD:3Association rules:A D (60%, 100%)D A (60%, 75%)Customerbuys diaperCustomerbuys bothCustomerbuys beerTransaction-idItems bought10A, B, D20A, C, D30A, D, E40B, E, F50B, C, D, E, F
9、November 1, 2021Data Mining: Concepts and Techniques6Closed Patterns and Max-PatternsnA long pattern contains a combinatorial number of sub-patterns, e.g., a1, , a100 contains (1001) + (1002) + + (110000) = 2100 1 = 1.27*1030 sub-patterns!nSolution: Mine closed patterns and max-patterns insteadnAn i
10、temset X is closed if X is frequent and there exists no super-pattern Y ? X, with the same support as X (proposed by Pasquier, et al. ICDT99) nAn itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ? X (proposed by Bayardo SIGMOD98)nClosed pattern is a lossless c
11、ompression of freq. patternsnReducing the # of patterns and rulesNovember 1, 2021Data Mining: Concepts and Techniques7Closed Patterns and Max-PatternsnExercise. DB = , nMin_sup = 1.nWhat is the set of closed itemset?n: 1n: 2nWhat is the set of max-pattern?n: 1nWhat is the set of all patterns?n!Novem
12、ber 1, 2021Data Mining: Concepts and Techniques8Chapter 5: Mining Frequent Patterns, Association and CorrelationsnBasic concepts nEfficient and scalable frequent itemset mining methodsnMining various kinds of association rulesnFrom association mining to correlation analysisnConstraint-based associat
13、ion miningnSummaryNovember 1, 2021Data Mining: Concepts and Techniques9Scalable Methods for Mining Frequent PatternsnThe downward closure property of frequent patternsnAny subset of a frequent itemset must be frequentnIf beer, diaper, nuts is frequent, so is beer, diaperni.e., every transaction havi
14、ng beer, diaper, nuts also contains beer, diaper nScalable mining methods: Three major approachesnApriori (Agrawal & SrikantVLDB94)nFreq. pattern growth (FPgrowthHan, Pei & Yin SIGMOD00)nVertical data format approach (CharmZaki & Hsiao SDM02)November 1, 2021Data Mining: Concepts and Tech
15、niques10Apriori: A Candidate Generation-and-Test ApproachnApriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant VLDB94, Mannila, et al. KDD 94)nMethod: nInitially, scan DB once to get frequent 1-itemsetnGenerate le
16、ngth (k+1) candidate itemsets from length k frequent itemsetsnTest the candidates against DBnTerminate when no frequent or candidate set can be generatedNovember 1, 2021Data Mining: Concepts and Techniques11The Apriori AlgorithmAn Example Database TDB1st scanC1L1L2C2C22nd scanC3L33rd scanTidItems10A
17、, C, D20B, C, E30A, B, C, E40B, EItemsetsupA2B3C3D1E3ItemsetsupA2B3C3E3ItemsetA, BA, CA, EB, CB, EC, EItemsetsupA, B1A, C2A, E1B, C2B, E3C, E2ItemsetsupA, C2B, C2B, E3C, E2ItemsetB, C, EA, C, EA, B, CItemsetsupB, C, E2Supmin = 2November 1, 2021Data Mining: Concepts and Techniques12The Apriori Algori
18、thmnPseudo-code:Ck: Candidate itemset of size kLk : frequent itemset of size kL1 = frequent items;for (k = 1; Lk !=; k+) do begin Ck+1 = candidates generated from Lk; for each transaction t in database do increment the count of all candidates in Ck+1 that are contained in t Lk+1 = candidates in Ck+1
19、 with min_support endreturn k Lk;November 1, 2021Data Mining: Concepts and Techniques13Important Details of ApriorinHow to generate candidates?nStep 1: self-joining LknStep 2: pruningnHow to count supports of candidates?nExample of Candidate-generationnL3=abc, abd, acd, ace, bcdnSelf-joining: L3*L3n
20、abcd from abc and abdnacde from acd and acenPruning:nacde is removed because ade is not in L3nC4=abcdNovember 1, 2021Data Mining: Concepts and Techniques14How to Generate Candidates?nSuppose the items in Lk-1 are listed in an ordernStep 1: self-joining Lk-1 insert into Ckselect p.item1, p.item2, , p
21、.itemk-1, q.itemk-1from Lk-1 p, Lk-1 qwhere p.item1=q.item1, , p.itemk-2=q.itemk-2, p.itemk-1 66.7%.nplay basketball not eat cereal 20%, 33.3% is more accurate, although with lower support and confidencenMeasure of dependent/correlated events: lift89. 05000/3750*5000/30005000/2000),(CBliftBasketball
22、Not basketballSum (row)Cereal200017503750Not cereal10002501250Sum(col.)300020005000( , )( ) ( )P A BliftP A P B33. 15000/1250*5000/30005000/1000),(CBliftNovember 1, 2021Data Mining: Concepts and Techniques52Which Measures Should Be Used?nlift and 2 are not good measures for correlations in large tra
23、nsactional DBsnall-conf or coherence could be good measures (OmiecinskiTKDE03)nBoth all-conf and coherence have the downward closure property nEfficient algorithms can be derived for mining (Lee et al. ICDM03sub)Difference Between Confidence, Lift, All-Confidence and CoherenceNovember 1, 2021Data Mi
24、ning: Concepts and Techniques53P(A)P(B)P(A,B)Lift: P(A, B) / (P(A) * P(B)Confidence: P(A,B) / P(A)All-Conf: P(A, B) / max(P(A), P(B)Coherence: P(A,B) / (P(A)+P(B)-P(A,B)November 1, 2021Data Mining: Concepts and Techniques54Are lift and 2 Good Measures of Correlation?n“Buy walnuts buy milk 1%, 80%” i
25、s misleadingnif 85% of customers buy milknSupport and confidence are not good to represent correlationsnSo many interestingness measures? (Tan, Kumar, Sritastava KDD02)MilkNo MilkSum (row)Coffeem, cm, ccNo Coffeem, cm, ccSum(col.)mm)()()(BPAPBAPliftDBm, cm, cmcmcliftall-confcoh2A1100010010010,0009.2
26、60.910.839055A210010001000100,0008.440.090.05670A3100010010000100,0009.180.090.098172A4100010001000100010.50.330)sup(_max_)sup(_XitemXconfall| )(|)sup(XuniverseXcoh November 1, 2021Data Mining: Concepts and Techniques55Chapter 5: Mining Frequent Patterns, Association and CorrelationsnBasic concepts
27、and a road mapnEfficient and scalable frequent itemset mining methodsnMining various kinds of association rulesnFrom association mining to correlation analysisnConstraint-based association miningnSummaryNovember 1, 2021Data Mining: Concepts and Techniques56Constraint-based (Query-Directed) MiningnFi
28、nding all the patterns in a database autonomously? unrealistic!nThe patterns could be too many but not focused!nData mining should be an interactive process nUser directs what to be mined using a data mining query language (or a graphical user interface)nConstraint-based miningnUser flexibility: pro
29、vides constraints on what to be minednSystem optimization: explores such constraints for efficient miningconstraint-based miningNovember 1, 2021Data Mining: Concepts and Techniques57Constraints in Data MiningnData constraintnfind product pairs sold together in stores in Chicago in Dec.02nDimension/l
30、evel constraintnin relevance to region, price, brand, customer categorynRule (or pattern) constraintnsmall sales (price $200)nInterestingness constraintnstrong rules: min_support 3%, min_confidence 60%November 1, 2021Data Mining: Concepts and Techniques58Constrained Mining vs. Constraint-Based Searc
31、hnConstrained mining vs. constraint-based search/reasoningnBoth are aimed at reducing search spacenFinding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AInConstrained mining vs. query processing in DBMSnDatabase query processing requires to find
32、allnConstrained pattern mining shares a similar philosophy as pushing selections deeply in query processingNovember 1, 2021Data Mining: Concepts and Techniques59The Apriori Algorithm ExampleTID Items100 1 3 4200 2 3 5300 1 2 3 5400 2 5Database Ditemset sup.1223334153itemset sup.12233353Scan DC1L1ite
33、mset1 21 31 52 32 53 5itemset sup1 211 321 512 322 533 52itemset sup1 322 322 533 52L2C2C2Scan DC3L3itemset2 3 5Scan Ditemset sup2 3 52November 1, 2021Data Mining: Concepts and Techniques60Nave Algorithm: Apriori + Constraint TID Items100 1 3 4200 2 3 5300 1 2 3 5400 2 5Database Ditemset sup.1223334
34、153itemset sup.12233353Scan DC1L1itemset1 21 31 52 32 53 5itemset sup1 211 321 512 322 533 52itemset sup1 322 322 533 52L2C2C2Scan DC3L3itemset2 3 5Scan Ditemset sup2 3 52Constraint: SumS.price 5November 1, 2021Data Mining: Concepts and Techniques61The Constrained Apriori Algorithm: Push an Anti-mon
35、otone Constraint Deep TID Items100 1 3 4200 2 3 5300 1 2 3 5400 2 5Database Ditemset sup.1223334153itemset sup.12233353Scan DC1L1itemset1 21 31 52 32 53 5itemset sup1 211 321 512 322 533 52itemset sup1 322 322 533 52L2C2C2Scan DC3L3itemset2 3 5Scan Ditemset sup2 3 52Constraint: SumS.price 5November
36、1, 2021Data Mining: Concepts and Techniques62Chapter 5: Mining Frequent Patterns, Association and CorrelationsnBasic concepts and a road mapnEfficient and scalable frequent itemset mining methodsnMining various kinds of association rulesnFrom association mining to correlation analysisnConstraint-bas
37、ed association miningnSummaryNovember 1, 2021Data Mining: Concepts and Techniques63Frequent-Pattern Mining: SummarynFrequent pattern miningan important task in data miningnScalable frequent pattern mining methodsnApriori (Candidate generation & test)nProjection-based (FPgrowth)Mining a variety o
38、f rules and interesting patterns Constraint-based miningMining sequential and structured patternsnMining truly interesting patternsnSurprising, novel, concise, November 1, 2021Data Mining: Concepts and Techniques64Ref: Basic Concepts of Frequent Pattern Miningn(Association Rules) R. Agrawal, T. Imie
39、linski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD93.n(Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD98. n(Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for ass
40、ociation rules. ICDT99.n(Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE95November 1, 2021Data Mining: Concepts and Techniques65Ref: Apriori and Its ImprovementsnR. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB94.nH. Mannila, H. Toivonen,
41、and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD94.nA. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB95.nJ. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association
42、 rules. SIGMOD95.nH. Toivonen. Sampling large databases for association rules. VLDB96.nS. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD97.nS. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
43、 relational database systems: Alternatives and implications. SIGMOD98.November 1, 2021Data Mining: Concepts and Techniques66Ref: Depth-First, Projection-Based FP MiningnR. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Dist
44、ributed Computing:02.nJ. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD 00. nJ. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD00.nJ. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic
45、 Projection. KDD02. nJ. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM02.nJ. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD03. nG. Liu, H. Lu, W. Lou, J. X. Yu. On Computing, Sto
46、ring and Querying Frequent Patterns. KDD03.November 1, 2021Data Mining: Concepts and Techniques67Ref: Vertical Format and Row Enumeration MethodsnM. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. DAMI:97.nZaki and Hsiao. CHARM: An Efficient A
47、lgorithm for Closed Itemset Mining, SDM02. nC. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. KDD02.nF. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki , CARPENTER: Finding Closed Patterns in Long Biological Datasets. KDD03.November 1,
48、 2021Data Mining: Concepts and Techniques68Ref: Mining Multi-Level and Quantitative RulesnR. Srikant and R. Agrawal. Mining generalized association rules. VLDB95.nJ. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB95.nR. Srikant and R. Agrawal. Mining quantitat
49、ive association rules in large relational tables. SIGMOD96.nT. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD96.nK. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing
50、 optimized rectilinear regions for association rules. KDD97.nR.J. Miller and Y. Yang. Association rules over interval data. SIGMOD97.nY. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules KDD99. November 1, 2021Data Mining: Concepts and Techniques69Ref: Mining Correlation
51、s and Interesting RulesnM. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM94.nS. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD97
52、.nC. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB98.nP.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD02.nE. Omiecinski. Alternative Interest Measures for Mining Associations. T
53、KDE03.nY. K. Lee, W.Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient Mining of Correlated Patterns. ICDM03.November 1, 2021Data Mining: Concepts and Techniques70Ref: Mining Other Kinds of RulesnR. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB96.nB. Lent, A.
54、Swami, and J. Widom. Clustering association rules. ICDE97.nA. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE98.nD. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generaliza
55、tion of association-rule mining. SIGMOD98.nF. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB98.nK. Wang, S. Zhou, J. Han. Profit Mining: From Patterns to Actions. EDBT02.November 1, 2021Data Mining: Concepts and Techniques71Ref
56、: Constraint-Based Pattern MiningnR. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD97.nR. Ng, L.V.S. Lakshmanan, J. Han & A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD98.nM.N. Garofalakis, R. Rastogi, K. Shim:
57、SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB99.nG. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE00.nJ. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. ICDE01.nJ. Pei, J. Han, and W
58、. Wang, Mining Sequential Patterns with Constraints in Large Databases, CIKM02.November 1, 2021Data Mining: Concepts and Techniques72Ref: Mining Sequential and Structured PatternsnR. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT96.nH. Mannila,
59、 H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.nM. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning:01.nJ. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Pre
60、fix-Projected Pattern Growth. ICDE01.nM. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM01.nX. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM03.nX. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. KDD03.November 1, 2021Data Mining: Concepts and
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 果园经营权转让合同模板
- 个人与公司间借款协议书范本2024年
- 婚前财产协议书公证流程
- 展览延期协议书范本
- 自由职业者合作工作室合伙协议
- 房屋中介服务协议书样式
- 设计合同补充协议范本
- 沥青运输合同模板
- 建筑施工合同补充协议模板
- 合作试验协议
- 草原牧歌-鸿雁 课件 2024-2025学年人音版(简谱)(2024)初中音乐七年级上册
- 期中模拟试卷(1-4单元)(试题)-2024-2025学年五年级上册数学人教版
- 期中测试卷(1-3单元)(试题)-2024-2025学年六年级上册数学苏教版
- 小学数学西南师大五年级上册四小数混合运算小数混合运算 PPT
- 学生操行扣分记录统计表
- 服装专业英语综合词汇
- FIT与PPM转换
- 灯饰中英文术语
- 提高出院病案7天回收率PDCA持续改进方案
- 小学数学教学中有效情境的创设与利用案例1
- 《大数据导论通识课版》PPT课件
评论
0/150
提交评论