Data Mining: Concepts and Techniques
Slides for Textbook Chapter 6
Jiawei Han and Micheline Kamber, Department of Computer Science, University of Illinois at Urbana-Champaign

Chapter 6: Mining Association Rules in Large Databases
- Association rule mining
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Mining various kinds of association/correlation rules
- Constraint-based association mining
- Sequential pattern mining
- Applications/extensions of frequent pattern mining
- Summary
What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
- Motivation: finding regularities in data
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?

Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
- Foundation for many essential data mining tasks
  - Association, correlation, causality
  - Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  - Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
- Broad applications
  - Basket data analysis, cross-marketing, catalog design, sale campaign analysis
  - Web log (click stream) analysis, DNA sequence analysis, etc.
Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, ..., xk}
- Find all the rules X => Y with minimum confidence and support
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction having X also contains Y
- Let min_support = 50%, min_conf = 50%:
  - A => C (support 50%, confidence 66.7%)
  - C => A (support 50%, confidence 100%)
(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both)

  Transaction-id | Items bought
  10             | A, B, C
  20             | A, C
  30             | A, D
  40             | B, E, F

Mining Association Rules: an Example
- Min. support 50%, min. confidence 50%
- For rule A => C:
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.7%

  Frequent pattern | Support
  {A}              | 75%
  {B}              | 50%
  {C}              | 50%
  {A, C}           | 50%
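To make the support and confidence arithmetic concrete, here is a minimal Python sketch (illustrative code, not from the slides) that recomputes the numbers above from the four example transactions; all names are chosen only for this example.

```python
# The four example transactions from the slide.
transactions = [
    {"A", "B", "C"},   # tid 10
    {"A", "C"},        # tid 20
    {"A", "D"},        # tid 30
    {"B", "E", "F"},   # tid 40
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Conditional probability that a transaction containing `lhs` also contains `rhs`."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"A", "C"}, transactions))        # 0.5       -> rule A => C has 50% support
print(confidence({"A"}, {"C"}, transactions))   # 0.666...  -> about 66.7% confidence
print(confidence({"C"}, {"A"}, transactions))   # 1.0       -> C => A has 100% confidence
```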
Chapter 6: Mining Association Rules in Large Databases
- Association rule mining
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Mining various kinds of association/correlation rules
- Constraint-based association mining
- Sequential pattern mining
- Applications/extensions of frequent pattern mining
- Summary

Apriori: A Candidate Generation-and-test Approach
- Any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - Every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Apriori pruning principle: if there is any itemset which is infrequent, its superset should not be generated/tested!
- Method:
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
  - Test the candidates against the DB
- Performance studies show its efficiency and scalability
- Agrawal & Srikant 1994; Mannila, et al. 1994
The Apriori Algorithm: An Example (min_support = 2)

  Database TDB
  Tid | Items
  10  | A, C, D
  20  | B, C, E
  30  | A, B, C, E
  40  | B, E

  1st scan -> C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
  L1:             {A}:2, {B}:3, {C}:3, {E}:3
  C2:             {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
  2nd scan -> C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
  L2:             {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
  C3:             {B,C,E}
  3rd scan -> L3: {B,C,E}:2

The Apriori Algorithm (pseudo-code)
- Ck: candidate itemsets of size k; Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;

(A runnable sketch of this level-wise loop follows below.)
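As a concrete companion to the pseudo-code, here is a compact, self-contained Python sketch of the level-wise loop (an illustrative implementation, not the textbook's code); candidate generation is the join-and-prune step detailed on the next slides.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise Apriori; returns {frozenset(itemset): support_count}."""
    transactions = [frozenset(t) for t in transactions]

    # L1: frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current:
        # Generate (k+1)-candidates by joining frequent k-itemsets, then prune
        # any candidate that has an infrequent k-subset (the Apriori property).
        prev = set(current)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(s) in prev for s in combinations(union, k)
                ):
                    candidates.add(union)

        # Count every surviving candidate with one pass over the database.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# The example database TDB with min_support = 2:
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, count in sorted(apriori(tdb, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)
```

Running it on the example TDB reproduces L1, L2, and L3 from the slide above, e.g. ['B', 'C', 'E'] with count 2.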
Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3:
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - C4 = {abcd}

How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1

    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
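The pruning step is just a subset check against Lk-1; a minimal Python illustration (a hypothetical helper written for this example):

```python
from itertools import combinations

def prune(candidates, prev_frequent, k):
    """Keep only candidates whose every (k-1)-subset is frequent (Apriori property)."""
    prev_frequent = set(prev_frequent)
    return [
        c for c in candidates
        if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))
    ]

# The slide's example: L3 = {abc, abd, acd, ace, bcd}; the self-join yields abcd and acde.
L3 = {frozenset(s) for s in ["abc", "abd", "acd", "ace", "bcd"]}
joined = [frozenset("abcd"), frozenset("acde")]
print([sorted(c) for c in prune(joined, L3, 4)])   # only abcd survives; acde fails on ade
```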
How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be very large
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction
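The hash-tree is an efficiency device; the subset function's job can be sketched with a plain dictionary standing in for the tree (a deliberate simplification, not the slides' data structure):

```python
from itertools import combinations

def count_supports(transactions, candidates, k):
    """Increment the count of every k-candidate contained in each transaction."""
    counts = {frozenset(c): 0 for c in candidates}   # stand-in for the hash-tree leaves
    for t in transactions:
        # Enumerate the k-subsets of the transaction and look each one up.
        for subset in combinations(sorted(t), k):
            key = frozenset(subset)
            if key in counts:
                counts[key] += 1
    return counts

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
c2 = [{"A", "B"}, {"A", "C"}, {"A", "E"}, {"B", "C"}, {"B", "E"}, {"C", "E"}]
print(count_supports(tdb, c2, 2))   # e.g. {B, E} -> 3, {A, C} -> 2
```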
Example: Counting Supports of Candidates
(Figure: a hash-tree over the candidate 3-itemsets, with the hash function sending items {1,4,7}, {2,5,8}, {3,6,9} to different branches; the subset function walks transaction 1 2 3 5 6 down the tree by expanding it as 1 + {2 3 5 6}, 1 2 + {3 5 6}, 1 3 + {5 6}, and so on)

Efficient Implementation of Apriori in SQL
- Hard to get good performance out of pure SQL (SQL-92) based approaches alone
- Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
  - Get orders of magnitude improvement
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98
Challenges of Frequent Pattern Mining
- Challenges
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce the number of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates

DIC: Reduce Number of Scans
- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
(Figure: itemset lattice over ABCD, and a timeline contrasting Apriori, which counts 1-itemsets, then 2-itemsets, ... in separate passes, with DIC, which starts counting 2- and 3-itemsets part-way through earlier passes over the transactions)
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97

Partition: Scan Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- Scan 1: partition the database and find local frequent patterns
- Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. In VLDB'95

Sampling for Frequent Patterns
- Select a sample of the original database, mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
  - Example: check abcd instead of ab, ac, ..., etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB'96

DHP: Reduce the Number of Candidates
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae}, {bd, be, de}, ...
  - Frequent 1-itemsets: a, b, d, e
  - ab is not a candidate 2-itemset if the sum of the counts of ab, ad, ae is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95
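A minimal sketch of the DHP idea (the bucket count and hash function below are made up purely for illustration): while counting 1-itemsets, hash every 2-subset of each transaction into a small table; a pair can become a 2-candidate only if its bucket count reaches min_support.

```python
from itertools import combinations

NUM_BUCKETS = 7   # toy table size, for illustration only

def bucket_of(pair):
    """Toy deterministic hash for a sorted pair of item names."""
    return sum(ord(ch) for item in pair for ch in item) % NUM_BUCKETS

def dhp_bucket_counts(transactions):
    """While scanning for 1-itemsets, also hash every 2-subset into a bucket."""
    buckets = [0] * NUM_BUCKETS
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair)] += 1
    return buckets

def may_be_frequent(pair, buckets, min_support):
    """A pair whose bucket count is below min_support cannot be frequent."""
    return buckets[bucket_of(tuple(sorted(pair)))] >= min_support

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
buckets = dhp_bucket_counts(tdb)
print(may_be_frequent(("A", "D"), buckets, min_support=2))
# The answer depends on which other pairs share A-D's bucket; a False here prunes the candidate early.
```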
Eclat/MaxEclat and VIPER: Exploring the Vertical Data Format
- Use the tid-list: the list of transaction-ids containing an itemset
- Compression of tid-lists
  - Itemset A: {t1, t2, t3}, sup(A) = 3
  - Itemset B: {t2, t3, t4}, sup(B) = 3
  - Itemset AB: {t2, t3}, sup(AB) = 2
- Major operation: intersection of tid-lists
- M. Zaki et al. New algorithms for fast discovery of association rules. In KDD'97
- P. Shenoy et al. Turbo-charging vertical mining of large databases. In SIGMOD'00
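In the vertical format, support counting reduces to set intersection; a small illustrative Python sketch:

```python
# Vertical layout: each item maps to the set of transaction ids that contain it.
tid_lists = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
}

def tid_list(itemset, tid_lists):
    """tid-list of an itemset = intersection of its items' tid-lists."""
    ids = None
    for item in itemset:
        ids = tid_lists[item] if ids is None else ids & tid_lists[item]
    return ids

ab = tid_list({"A", "B"}, tid_lists)
print(sorted(ab), len(ab))   # [2, 3] 2  -> sup(AB) = 2, matching the slide
```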
Bottleneck of Frequent-pattern Mining
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1 i2 ... i100:
    - Number of scans: 100
    - Number of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30 !
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?
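The 2^100 - 1 figure is just the number of nonempty subsets of 100 items; a one-line check in Python:

```python
from math import comb

total = sum(comb(100, k) for k in range(1, 101))   # C(100,1) + ... + C(100,100)
print(total == 2**100 - 1, float(total))           # True 1.2676506002282294e+30
```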
Mining Frequent Patterns Without Candidate Generation
- Grow long patterns from short ones using local frequent items
  - "abc" is a frequent pattern
  - Get all transactions having "abc": DB|abc
  - "d" is a local frequent item in DB|abc => abcd is a frequent pattern

Construct FP-tree from a Transaction Database (min_support = 3)

  TID | Items bought            | (ordered) frequent items
  100 | f, a, c, d, g, i, m, p  | f, c, a, m, p
  200 | a, b, c, f, l, m, o     | f, c, a, b, m
  300 | b, f, h, j, o, w        | f, b
  400 | b, c, k, s, p           | c, b, p
  500 | a, f, c, e, l, p, m, n  | f, c, a, m, p

- Scan the DB once, find the frequent 1-itemsets (single-item patterns)
- Sort frequent items in frequency-descending order: the f-list
- Scan the DB again and construct the FP-tree
- F-list = f-c-a-b-m-p; header table: f:4, c:4, a:3, b:3, m:3, p:3
(Figure: the resulting FP-tree, with main branch f:4 - c:3 - a:3 - m:2 - p:2, side branches b:1 under f and b:1 - m:1 under a, a separate branch c:1 - b:1 - p:1 from the root, and header-table node-links connecting the occurrences of each item)
(A small construction sketch follows below.)
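Here is a compact, self-contained Python sketch of the two-scan FP-tree construction (illustrative node layout; the header-table node-links are kept as plain Python lists):

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}          # item -> FPNode

def build_fptree(transactions, min_support):
    # Scan 1: count items, keep the frequent ones in descending frequency (the f-list).
    freq = Counter(item for t in transactions for item in t)
    f_list = [i for i, c in freq.most_common() if c >= min_support]
    rank = {item: r for r, item in enumerate(f_list)}

    root = FPNode(None, None)
    header = {item: [] for item in f_list}   # item -> node-links

    # Scan 2: insert each transaction's frequent items, ordered by the f-list.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            if item not in node.children:
                node.children[item] = FPNode(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header, f_list

tdb = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, header, f_list = build_fptree(tdb, min_support=3)
print(f_list)   # ['f', 'c', 'a', 'm', 'p', 'b'] -- the slide breaks the 3-count tie as a, b, m, p
print([(n.item, n.count, n.parent.item) for n in header["m"]])   # node-links for item m
```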
Benefits of the FP-tree Structure
- Completeness
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant information: infrequent items are gone
  - Items in frequency-descending order: the more frequently occurring, the more likely to be shared
  - Never larger than the original database (not counting node-links and the count fields)
  - For the Connect-4 DB, the compression ratio could be over 100

Partition Patterns and Databases
- Frequent patterns can be partitioned into subsets according to the f-list
  - F-list = f-c-a-b-m-p
  - Patterns containing p
  - Patterns having m but no p
  - Patterns having c but no a nor b, m, p
  - Pattern f
- Completeness and non-redundancy

Find Patterns Having P From P-conditional Database
- Starting at the frequent-item header table in the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

  Item | Conditional pattern base
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree for the frequent items of the pattern base
- m-conditional pattern base: fca:2, fcab:1
- m-conditional FP-tree: the single path f:3 - c:3 - a:3 (b is dropped as locally infrequent)
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree
- m-conditional FP-tree: f:3 - c:3 - a:3
  - Conditional pattern base of "am": (fc:3) -> am-conditional FP-tree: f:3 - c:3
  - Conditional pattern base of "cm": (f:3) -> cm-conditional FP-tree: f:3
  - Conditional pattern base of "cam": (f:3) -> cam-conditional FP-tree: f:3
A Special Case: Single Prefix Path in FP-tree
- Suppose a (conditional) FP-tree T has a shared single prefix path P
- Mining can be decomposed into two parts:
  - Reduction of the single prefix path into one node
  - Concatenation of the mining results of the two parts
(Figure: a tree whose single prefix path a1:n1 - a2:n2 - a3:n3 ends in branches b1:m1 and c1:k1, c2:k2, c3:k3 is split into the prefix path r1 = a1:n1 - a2:n2 - a3:n3 and the branching part rooted at r1)

Mining Frequent Patterns With FP-trees
- Idea: frequent pattern growth
  - Recursively grow frequent patterns by pattern and database partition
- Method
  - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path
    - A single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
(A simplified sketch of this recursion follows below.)
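To make the pattern-growth recursion concrete, here is a simplified, self-contained Python sketch (illustrative, not the textbook's implementation): instead of a pointer-based FP-tree, each (conditional) database is kept as a list of (ordered-item-tuple, count) pairs, which is enough to show the conditional-pattern-base construction and the recursive growth.

```python
from collections import Counter

def fp_growth(patterns, min_support, suffix=()):
    """patterns: list of (items_in_f_list_order, count). Yields (frequent itemset, support)."""
    # Count each item inside this (conditional) database.
    counts = Counter()
    for items, cnt in patterns:
        for item in items:
            counts[item] += cnt

    for item, support in counts.items():
        if support < min_support:
            continue
        grown = suffix + (item,)
        yield frozenset(grown), support

        # Conditional pattern base of `item`: the prefixes that precede it.
        cond_base = []
        for items, cnt in patterns:
            if item in items:
                prefix = items[:items.index(item)]
                if prefix:
                    cond_base.append((prefix, cnt))
        # Recurse on the conditional database (the role of the conditional FP-tree).
        yield from fp_growth(cond_base, min_support, grown)

# The transactions from the FP-tree slide, reduced to ordered frequent items.
ordered = [
    (("f", "c", "a", "m", "p"), 1),
    (("f", "c", "a", "b", "m"), 1),
    (("f", "b"), 1),
    (("c", "b", "p"), 1),
    (("f", "c", "a", "m", "p"), 1),
]
for itemset, support in fp_growth(ordered, min_support=3):
    print(sorted(itemset), support)
```

With min_support = 3 this prints, among others, the eight m-patterns from the earlier slide (m, fm, cm, am, fcm, fam, cam, fcam), each with support 3.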
Scaling FP-growth by DB Projection
- What if the FP-tree cannot fit in memory? Use DB projection
- First partition the database into a set of projected DBs
- Then construct and mine an FP-tree for each projected DB
- Parallel projection vs. partition projection techniques
  - Parallel projection is space costly

Partition-based Projection
- Parallel projection needs a lot of disk space; partition projection saves it
(Figure: the transaction DB {fcamp, fcabm, fb, cbp, fcamp} is projected into p-, m-, b-, a-, c-, and f-projected DBs, e.g. p-proj DB = {fcam, cb, fcam} and m-proj DB = {fcab, fca, fca}; each projected DB is projected further, e.g. am-proj DB = {fc, fc, fc} and cm-proj DB = {f, f, f})

FP-Growth vs. Apriori: Scalability With the Support Threshold
(Figure: run time vs. support threshold on data set T25I20D10K)

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold
(Figure: run time vs. support threshold on data set T25I20D100K)

Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - Decompose both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused search of smaller databases
- Other factors
  - No candidate generation, no candidate test
  - Compressed database: FP-tree structure
  - No repeated scan of the entire database
  - Basic operations: counting local frequent items and building sub FP-trees, no pattern search and matching

Implications of the Methodology
- Mining closed frequent itemsets and max-patterns
  - CLOSET (DMKD'00)
- Mining sequential patterns
  - FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns
  - Convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures
  - H-tree and H-cubing algorithm (SIGMOD'01)
49、 basic opscounting local freq items and building sub FP-tree, no pattern search and matchingAugust 12, 2022Data Mining: Concepts and Techniques37Implications of the Methodology Mining closed frequent itemsets and max-patternsCLOSET (DMKD00)Mining sequential patternsFreeSpan (KDD00), PrefixSpan (ICDE
50、01)Constraint-based mining of frequent patternsConvertible constraints (KDD00, ICDE01)Computing iceberg data cubes with complex measures H-tree and H-cubing algorithm (SIGMOD01)August 12, 2022Data Mining: Concepts and Techniques38Max-patternsFrequent pattern a1, , a100 (1001) + (1002) + + (110000) =
51、 2100-1 = 1.27*1030 frequent sub-patterns!Max-pattern: frequent patterns without proper frequent super patternBCDE, ACD are max-patternsBCD is not a max-patternTidItems10A,B,C,D,E20B,C,D,E,30A,C,D,FMin_sup=2August 12, 2022Data Mining: Concepts and Techniques39MaxMiner: Mining Max-patterns1st scan: f
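Given the full set of frequent itemsets, the max-patterns can be read off with a superset check; a brute-force illustrative sketch, just to pin down the definition:

```python
def max_patterns(frequent_itemsets):
    """Keep the frequent itemsets that have no proper frequent superset."""
    frequent_itemsets = [frozenset(s) for s in frequent_itemsets]
    return [
        s for s in frequent_itemsets
        if not any(s < other for other in frequent_itemsets)   # '<' tests proper subset
    ]

# A few of the frequent itemsets from the example above (min_sup = 2):
freq = [{"B", "C", "D"}, {"B", "C", "D", "E"}, {"A", "C", "D"}, {"C", "D"}]
print([sorted(s) for s in max_patterns(freq)])   # BCDE and ACD survive; BCD and CD do not
```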
MaxMiner: Mining Max-patterns
- 1st scan: find frequent items
  - A, B, C, D, E
- 2nd scan: find support for
  - AB, AC, AD, AE, ABCDE
  - BC, BD, BE, BCDE
  - CD, CE, CDE
  - DE
  (the last itemset of each row, ABCDE, BCDE, CDE, DE, is a potential max-pattern)
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in later scans
- R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98

  Tid | Items (min_sup = 2)
  10  | A, B, C, D, E
  20  | B, C, D, E
  30  | A, C, D, F

Frequent Closed Patterns
- Conf(ac => d) = 100%: record acd only
- For a frequent itemset X, if there exists no item y such that every transaction containing X also contains y, then X is a frequent closed pattern
  - "acd" is a frequent closed pattern
- Concise representation of frequent patterns
  - Reduces the number of patterns and rules
- N. Pasquier et al. In ICDT'99

  TID | Items (min_sup = 2)
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f
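The closed-pattern definition says, equivalently, that no proper superset of X has the same support; a brute-force illustrative sketch over the slide's five transactions:

```python
from itertools import combinations

transactions = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

# Enumerate all frequent itemsets (fine for a toy example), then keep the closed ones.
items = sorted(set().union(*transactions))
frequent = {
    frozenset(c): support(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if support(c) >= 2
}
closed = {
    x: s for x, s in frequent.items()
    if not any(x < y and s == frequent[y] for y in frequent)
}
print(frozenset("acdf") in closed, support("acdf"))   # True 2 -- cf. cfad on the CLOSET slide below
```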
Mining Frequent Closed Patterns: CLOSET
- Flist: list of all frequent items in support-ascending order
  - Flist: d-a-f-e-c
- Divide the search space
  - Patterns having d
  - Patterns having a but no d, etc.
- Find frequent closed patterns recursively
  - Every transaction having d also has c, f, a => cfad is a frequent closed pattern
- J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets, DMKD'00

  TID | Items (min_sup = 2)
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f

Mining Frequent Closed Patterns: CHARM
- Use the vertical data format: t(AB) = {T1, T12, ...}
- Derive closed patterns based on vertical intersections
  - t(X) = t(Y): X and Y always happen together
  - t(X) ⊂ t(Y): a transaction having X always has Y
- Use diffsets to accelerate mining
  - Only keep track of differences of tids
  - t(X) = {T1, T2, T3}, t(Xy) = {T1, T3} => Diffset(Xy, X) = {T2}
- M. Zaki. CHARM: An Efficient Algorithm for Closed Association Rule Mining, CS-TR99-10, Rensselaer Polytechnic Institute
- M. Zaki. Fast Vertical Mining Using Diffsets, TR01-1, Department of Computer Science, Rensselaer Polytechnic Institute
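Diffsets replace each tid-list by its difference from the parent's tid-list, so support can be updated by subtraction; a tiny illustrative sketch:

```python
t_X  = {"T1", "T2", "T3"}    # tid-list of itemset X
t_Xy = {"T1", "T3"}          # tid-list of X extended with item y

diffset = t_X - t_Xy          # Diffset(Xy, X) = {T2}
sup_Xy = len(t_X) - len(diffset)
print(diffset, sup_Xy)        # {'T2'} 2
```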
Visualization of Association Rules: Plane Graph
(Figure: plane-graph visualization of association rules)

Visualization of Association Rules: Rule Graph
(Figure: rule-graph visualization of association rules)

Chapter 6: Mining Association Rules in Large Databases
- Association rule mining
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Mining various kinds of association/correlation rules
- Constraint-based association mining
- Sequential pattern mining
- Applications/extensions of frequent pattern mining
- Summary

Mining Various Kinds of Rules or Regularities
- Multi-level, quantitative association rules, correlation and causality, ratio rules, sequential patterns