Improving Data Mining Algorithms Using Constraints 使用约束改进数据挖掘算法_第1页
Improving Data Mining Algorithms Using Constraints 使用约束改进数据挖掘算法_第2页
Improving Data Mining Algorithms Using Constraints 使用约束改进数据挖掘算法_第3页
Improving Data Mining Algorithms Using Constraints 使用约束改进数据挖掘算法_第4页
Improving Data Mining Algorithms Using Constraints 使用约束改进数据挖掘算法_第5页
已阅读5页,还剩73页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

PAGE

PAGE

30

TheOpenUniversityofIsrael

DepartmentofMathematicsandComputerScience

TheComputerScienceDivision

IMPROVINGDATAMININGALGORITHMSUSINGCONSTRAINTS

By

ShaiShimon

ID028455863

Emailshaishimon@

Cell0526-126207

PreparedunderthesupervisionofProfessorEhudGudes

Feb2012

TableofContents

TOC\o"1-5"\h\z

Listoffigures

3

ListofTables

4

1 ABSTRACTANDINTRODUCTION

5

1.1 ABSTRACT

5

1.2 INTRODUCTION

6

2 basicconcepts

8

2.1 INTRODUCION

12

2.2 APRIORI-FastAlgorithmsforMiningAssociationRules[1]

12

2.3 FP-TREEALGORITHM[8]

19

3 PAPERSsurvey

34

3.1 USINGCONSTRAINTS

34

3.1.1 INTRODUCTIONANDMOTIVATION

34

3.1.2 MININGFREQUENTITEMSETSWITHCONVERTIBLECONSTRAINTS[9]

34

Introduction

34

Convertibleconstraints-motivation

35

3.1.3 MININGASSOCIATIONRULESWITHITEMCONSTRAINTS[10]

37

Abstract

37

Introduction

37

Algorithms

37

Tradeoffs

38

Conclusions

38

3.1.4 EXAMINEROPTIMIZEDLEVEL-WISEFREQUENTPATTERNMININGWITHMONOTONECONSTRAINTSALGORITHM[4]

39

Abstract

39

Introduction

39

Definitions

40

ExAMineralgorithm

41

FlowchartofexaMiner

43

ExaMineralgorithmexample

44

Experiments

53

3.1.5 FP-BONSAIALGORITHM[5]

56

Introduction

56

FP-bonsaialgorithm

57

FP-Bonsaialgorithmexample

58

Disadvantage

63

Experiments

64

Summary

65

3.2 shortpaperssurveys

65

4 IMPLEMENTAION

70

5 summaryandconclusions

74

6 REFERENCES

75

Listoffigures

16

passexecutiontimesofAprioriandAprioriTId

Figure2.1.1-1

17

Executiontimesfordecreasingminimumsupport(maxpotentiallylargeitemsetis2

Figure2.1.1-2

18

Executiontimesfordecreasingminimumsupport(maxpotentiallylargeitemsetis4

Figure2.1.1-3

18

Executiontimesfordecreasingminimumsupport(maxpotentiallylargeitemsetis6

Figure2.1.1-4

26

FPgrowsexample

Figure2.1.1-5

26

FPgrowsexampleforp(1)

Figure2.1.1-6

27

FPgrowsexampleforp(2)

Figure2.1.1-7

27

FPgrowsexampleform(1)

Figure2.1.1-8

28

FPgrowsexampleform(2)

Figure2.1.1-9

28

FPgrowsexampleforam

Figure2.1.1-10

29

FPgrowsexampleforcamandfam

Figure2.1.1-11

29

FPgrowsexampleforcamandcm

Figure2.1.1-12

30

FPgrowsexampleforcamandfm

Figure2.1.1-13

30

FPgrowsexampleresults

Figure2.1.1-14

32

FPtreealgorithmexperiment–Runtime,supportthreshold

Figure2.1.1-15

32

FPgrowsalgorithmexperiment–Transactionsnumberwiththreshold=1.5%

Figure2.1.1-16

43

ExAMiner0

Figure3.1.4-1

44

ExAMiner1&ExAMiner2

Figure3.1.4-2

54

ExaMinerexperimentDataReductionRate(min_sup=1100)

Figure3.1.4-3

54

ExaMinerexperimentDataReductionRate(min_sup=500)

Figure3.1.4-4

55

ExaMinerexperimentRuntimesynthetic(min_sup=1200)

Figure3.1.4-5

56

ExaMinerexperimentRuntimesynthetic(sum(prices)>2800)

Figure3.1.4-6

64

FPBonsaiExaminerexperiment(BMS-POS)(1)

Figure3.1.5-1

65

FPBonsaiExaminerexperiment(BMS-POS)(2

Figure3.1.5-2

72

Applicationwindow–mainpanel

Figure4-1

73

Applicationwindow–FPtreeresultpanel

Figure4-2

ListofTables

8

Market-Baskettransactions

Table2-1

9

Convertibleanti-monotone

Table2-2

10

Convertiblemonotone

Table2-3

11

stronglyconvertibleconstraints

Table2-4

22

FPtreealgorithmexampleTid,item,andfrequency(1)

Table2.1.1-1

22

FPtreealgorithmexampleTid,item,andfrequency(2)

Table2.1.1-2

23

FPtreealgorithmexampleTid,item,andfrequency(3)

Table2.1.1-3

23

FPtreealgorithmexampleTid,item,andHeadertable(1)

Table2.1.1-4

24

FPtreealgorithmexampleTid,item,andHeadertable(2)

Table2.1.1-5

24

FPtreealgorithmexampleTid,item,andHeadertable(3)

Table2.1.1-6

27

FtreealgorithmexampleTid,item,andHeadertable(4)

Table2.1.1-7

27

FPtreealgorithmexampleTid,item,andHeadertable(5)

Table2.1.1-8

31

FPtreealgorithmexperiment–Syntheticdataset

Table2.1.1-9

36

TransactionIdANDtransaction

Table3.1.2-1

36

Frequentitemsetswithsupportthreshold

Table3.1.2-2

43

ExaMiner0

Table3.1.4-1-A

44

ExaMiner1

Table3.1.4-1-B

45

ExaMinerexamplelevelone(1)

Table3.1.4-2

45

ExaMinerexamplelevelone(2)

Table3.1.4-3

46

ExaMinerexamplelevelone(3)

Table3.1.4-4

47

ExaMinerexamplelevelone(4)

Table3.1.4-5

48

ExaMinerexamplelevelone(5)

Table3.1.4-6

49

ExaMinerexamplelevelone(6)

Table3.1.4-7

49

ExaMinerexampleleveltwo(1)

Table3.1.4-8

50

ExaMinerexampleleveltwo(2)

Table3.1.4-9

50

ExaMinerexampleleveltwo(4)

Table3.1.4-10

51

ExaMinerexampleleveltwo(5)

Table3.1.4-11

52

ExaMinerexampleleveltwo(7)

Table3.1.4-12

52

ExaMinerexampleleveltwo(7)

Table3.1.4-13

53

ExaMinerexampleleveltwo(7)

Table3.1.4-14

58

FPBonsaiexample-item,valuetable

Table3.1.5-1

59

FPBonsaiexample-Tid,itemstable

Table3.1.5-2

59

FPBonsaiexample-pruning–(constraintcheck)

Table3.1.5-3

60

FPBonsaiexampleRun-pruning(Supportcheck)

Table3.1.5-4

60

FPBonsaiexample-pruning(1)

Table3.1.5-5

61

FPBonsaiexampleRun-pruning(2)

Table3.1.5-6

61

FPBonsaiexample-pruning(3)

Table3.1.5-7

62

FPBonsaiexampleRun-pruning(4)

Table3.1.5-8

62

FPBonsaiexample-pruning(5)

Table3.1.5-9

62

FPBonsaiexampleresults

Table3.1.5-10

71

Transactionstable

Table4-1

71

Itemsandprices

Table4-2

71,72

Experimentsresults

Table4-3

ABSTRACTANDINTRODUCTION

ABSTRACT

Thepurposeofdataminingistoidentifyandpredictpatterns,trendsandrelationshipsindata.Themainstepsindataminingprocessare:

Definingtheproblem,preparationofinformation,dataanalysis,evaluationoftheresults,displayingtheresults.

InthisworkI'llpresentanumberofdataminingalgorithmsusingassociationrules.FirstI'llpresentthebasicalgorithms(AprioriAlgorithmandFPTree)andthenwe'lldiscussalgorithmswithconstraints.Wewillpresentthealgorithmswithconstraintsindetail,andalsoweshalldiscussthedifferencesbetweenthem.

Infactthisworkwillfocusondataminingalgorithmswithconstraints.Wewillfocusontheimportanceofconstraintsindatamining,ontheiruse,andexploredifferenttypesofconstraintsandeffectivemethodsofdataminingalgorithms.Asit'swellknown,sincethesizeofdataminingresultsmaysometimesbeverylarge,usingconstraintshelptheuserfindthedesiredinformationandimprovesthesystemperformance.Thisworkwillfocusoncertaintypesofconstraints,andalgorithmsthatwerebuiltforthem.Specifically,thealgorithmsthatwewillrevieware:

MININGFREQUENTITEMSETSWITHCONVERTIBLECONSTRAINTS[10]

MININGASSOCIATIONRULESWITHITEMCONSTRAINTS[11]

EXAMINEROPTIMIZEDLEVEL-WISEFREQUENTPATTERNMININGWITHMONOTONECONSTRAINTSALGORITHM[4]

FP-BONSAIALGORITHM[5]

Inadditionwewillreviewbrieflysixotherarticles:FourarticlesonconstraintsandtwoadvancedalgorithmsthanApriori.

Thelastphaseoftheworkisanimplementationoftwoalgorithms:Bonsai-treeandFP-tree.TheimplementationwascodedintheJAVAlanguage.TheDatabaseinputisasyntheticdatabaseanditwasbuiltbyarandomgeneratorthatwasespeciallydevelopedforthispurpose.Theresultsandconclusionsoftheevaluationaresummarizedinthepaper.

INTRODUCTION

BACKGROUND

Overtheyears,massstoragecosthasdecreaseddramatically,anddatabasetechnology,incorporatingtheubiquitousInternet,hasevolvedtobemoreintelligentandpowerful.Wearenowattheequinoxwherewehavetoomuchdatayetsofewcomputerizedtoolstoanalyzeit,letaloneapplytheknowledgeresultedfromtheanalysistoexpediteinformationdissemination,scientificresearch,andindustrialandcommercialdecisionmaking.Weareindeeddatabillionaireslivinginthegutterofknowledge.

Thisiswheredataminingcamein,whichstartedoutasadirectconsequenceofinformationtechnologydevelopment.Followingtheamazingprogressinthefield,dataminingcannowprovidetheoreticalfoundationstoimplementanalyzingsoftwareforvariouskindsofapplications.

Thispaperismainlyfocusonhowtoefficientlygenerateassociationrules.

Theuserisallowedtoexpresshisfocusinmining,bymeansofarichclassofconstraintsthatcaptureapplicationsemantics.Besidesallowinguserexplorationandcontrol,theparadigmallowsmanyoftheseconstraintstobepusheddeepinsidemining(laterdiscussedinbasicconcepts),thuspruningthesearchspaceofpatternstothoseofinteresttotheuser,andachievingsuperiorperformance.

Inthisreportwediscuss2maintopicsalgorithmsforconstraintsandefficiencyimprovements.

PURPOSE

Thispaperisdividedto4maintopics:

BasicconceptsandshortbriefonbothAprioriandFP-Treealgorithms–Inthissectionwe'llfocusonthebasicconceptswhichwillhelpusdealwiththerestofthepaperandwe'lldiscussshortlyabout2algorithms:AprioriandFP-Tree.

Papersurvey–Inthissectionweshowtheadvantageoftheconstraints.Withconstraintsweobtainfewerpatternswhicharemoreinteresting.Indeedconstraintsarethewayweusetodefinewhatis“interesting”.Herewe'llintroduce4articles,whichwilluseconstraintsmethodology:

"MiningFrequentItemsetswithConvertibleConstraints"[9]

"Miningassociationruleswithitemconstraints"[10]

"Examineralgorithm"[4]

"FPBonsai"[4]

Shortpapersurvey–Herewe'lldescribebrieflyfewarticleswhichdealbothimprovingbasicalgorithmsandconstraintsalgorithms.

Applicationimplementation–Afterdescribingalltheabovearticles,we'llshowtheresultsofanapplicationwhichwaswritteninjava.Thisapplicationimplements2algorithms:"FPTree"and"FPBonsai".Theprogramwasrunwithdatathatwasgeneratedsyntactically.Theprogramwillshowthedurationofeachalgorithminadditiontotheresults.

basicconcepts

Associationrulesmining-Givenasetoftransactions,findrulesthatwillpredicttheoccurrenceofanitembasedontheoccurrencesofotheritemsinthetransaction

Market-Baskettransactions

ExampleofAssociationRules

{Diapers}{Beer},

{Milk,Bread}{Eggs,Coke},

{Beer,Bread}{Milk},

Table2-1Market-Baskettransactions

Itemset

Acollectionofoneormoreitems

Example:{Milk,Bread,Diapers}

k-itemset

Anitemsetthatcontainskitems

Supportcount(s-sigma)

Frequencyoccurrenceofanitemset

E.g.s({Milk,Bread,Diapers})=2

Support

Thepercentageofthefractionoftransactionsthatcontainanitemsetrepresentsthesupport.Orinotherwordsanitemsetwhichappearsinxtransactionsofthedatabaseisthesupportofthisitemset.

E.g.s({Milk,Bread,Diapers})appearin2transactionsinthetableabovesothesupportis2/5*100=40%

Confidence

Confidencedenotesthestrengthofimplicationintherule,meansthemoretheconfidencehighertherelationshipbetweenthe2setsisstronger.Thecasuallinkbetweenmilkandbreadisstrongintheexamplebecausetheconfidenceis75%.

Confidence(X=>Y)=Support(XY)/Support(X)

E.g.s({Milk,Bread})=3/5

s({Milk})=4/5

Confidence(MilkBread)=(3/5)/(4/5)=0.75->75%

FrequentItemset

Anitemsetwhosesupportisgreaterthanorequaltoaminsupthreshold

AssociationRule

AnimplicationexpressionoftheformX®Y,whereXandYareitemsets

Example:

{Milk,Diapers}®{Beer}

Constraints

Whatareconstraintsindatamining?Constraintsaretherulesenforcedondatatransactions.

TheIdeainconstraintsistofocusonthespecificandrelevantitemsetswhichwewanttomine.Thefollowingbellowsaresomebasicconceptsregardingconstrains.

Constraintsmining–Aimtoreducesearchspace.Itfindallpatternssatisfyingconstraints

Constraintsbasedsearch-Aimtoreducesearchspaceandfindsonlysome(orone)answer

BothConstraintsminingandConstraintsbasedsearchareaimedatreducingsearchspacebutthefirstfindtheallthepatternsandtheotherfindsomeoroneofthepatterns.Thisofcoursemakesthedifferenceintheruntimeandthememoryusage.

Anti-monotonic-WhenanitemsetSviolatestheconstraint,sodoesanyofitssuperset

Example:C:range(S.profit)£15isanti-monotoneItemsetabviolatesC

range(ab)=40£15

Sodoeseverysupersetofab

Monotonic-WhenanitemsetSsatisfiestheconstraint,sodoesanyofitssuperset

Example:C:range(S.profit)³15ItemsetabsatisfiesC

range(ab)=40³15

Sodoeseverysupersetofab

Thefollowingtablesarefortheexamples

Item

Profit

a

40

b

0

c

-20

d

10

e

-30

f

30

g

20

h

-10

TID

Transaction

10

a,b,c,d,f

20

b,c,d,f,g,h

30

a,c,d,e,f

40

c,e,f,g

Table2-2ConvertibleAntimonotone

Convertibleanti-monotone-AssumethereisanorderR.WheneveranitemsetSsatisfiesC,sodoesanyprefixofS.

Example:C:avg(S)³20w.r.t.itemvaluedescendingorder

Theitemset“abc”satisfiesC

avg(abc)=30³20

andsodoes“ab”avg(ab)=35³20and“a”avg(a)=40³20

Convertiblemonotone-AssumethereisanorderR.WheneveranditemsetSviolatesC,sodoesanyprefixofS.

Example:C:avg(S)£20w.r.t.itemvaluedescendingorderThe

itemset“abc”violatesCavg(abc)=30£20andsodoes“ab”

avg(ab)=35£20and“a”avg(a)=40£20

Thefollowingtablesarefortheexamples

Item

Profit

a

40

b

30

c

20

d

10

e

7

f

5

g

3

h

1

TID

Transaction

10

a,b,c,d,f

20

b,c,d,f,g,h

30

a,c,d,e,f

40

c,e,f,g

Table2-3Convertiblemonotone

Stronglyconvertibleconstraints

WheneverthereexistsanorderRoverthesetofitemssuchthatCisconvertible

anti-monotoneRandconvertiblemonotoneR^-1

ExampleC:avg(X)³25isconvertibleanti-monotonew.r.t.itemvaluedescendingorderR

Theitemset“afg”satisfiesCsodoes“af”and“a”.

avg(X)³25isconvertiblemonotonew.r.t.itemvalueascendingorderR^-1

Theitemset“ech”violatesCsodoes“ec”and“e”.

Tabledescendingorder Tableascendingorder

Item

Value

e

1

c

3

h

5

b

8

d

10

g

20

f

30

a

40

Item

Value

a

40

f

30

g

20

d

10

b

8

h

5

c

3

e

1

Table2-4stronglyconvertibleconstraints

Succinctnessconstraints

GivenA1,thesetofitemssatisfyingasuccinctnessconstraintC,thenanysetSsatisfyingCisbasedonA1,i.e.,ScontainsasubsetbelongingtoA1.min(S.Price)£vissuccinctbecauseeachsubsetwhosatisfytheconstraintisasubsetofA1.,sum(S.Price)³visnotsuccinct.

min(S.Price)£v

A1={20,30,40,8,5,3}

V=70

min(A1)<VsatisfytheconstraintsodoeseachsubsetofA1satisfytheconstraint.

sum(A1)>VsatisfytheconstraintbutnoteachsubsetofA1satisfytheconstraint.Forexamplesum(20,30)<70

INTRODUCION

AprioriisthemostsimpleandmostwidelyknownalgorithmforminingfrequentitemsetscreatedbyR.AgrawalandR.Skrikant.

TheApriorialgorithmworksiteratively.Itfirstfindsthesetoflarge1-itemsets,andthensetof2-itemsets,andsoon.Thenumberofscanoverthetransactiondatabaseisasmanyasthelengthofthemaximalitemset.Aprioriisbasedonthefollowingfact:Thesimplebutpowerfulobservationleadstothegenerationofasmallercandidatesetusingthesetoflargeitemsetsfoundinthepreviousiteration.

Disadvantages

Generationofcandidateitemsetsisexpensive(inbothspaceandtime)

UnlikeAprioriFP-growthusesanextendedprefix-treestructuretostorethedatabaseinacompressedform.ItusesapatternfragmentgrowthmethodtoavoidthecostlyprocessofcandidategenerationandtestingusedbyApriori.

APRIORI-FastAlgorithmsforMiningAssociationRules[1]

Algorithmssummarize

Countitemoccurrences

Generatenewk-itemsetscandidates

Findthesupportofallthecandidates

Takeonlythosewithsupportoverminsup

Apriori,firstscansthetransactiondatabasesDinordertocountthesupportofeachitemiinI,anddeterminesthesetoflarge1-itemsets.Thenoneiterationisperformedforeachofthecomputationofthesetof2-itemsets,3-itemsets,andsoon.Thekthiterationconsistsoftwosteps:

GeneratethecandidatesetCkfromthesetoflarge(k-1)-itemsets,Lk-1.

ScanthedatabaseinordertocomputethesupportofeachcandidateitemsetinCk

Thecandidategenerationalgorithmisgivenasfollows:

Thecandidategenerationprocedurecomputesthesetofpotentiallylargek-itemsetsfromthesetoflarge(k-1)-itemsets.Anewcandidatek-itemsetisgeneratedfromtwolarge(k-1)-itemsetsiftheirfirst(k-2)itemsarethesame.ThecandidatesetCkisasupersetofthelargek-itemsets.Thecandidatesetisguaranteedtoincludeallpossiblelargek-itemsetsbecauseofthefactthatallsubsetsofalargeitemsetarealsolarge.SincealllargeitemsetsinLk-1arecheckedforcontributiontocandidateitemset,thecandidatesetCkiscertainlyasupersetoflargek-itemsets.Afterthecandidatesaregenerated,theircountsmustbecomputedinordertodeterminewhichofthemarelarge.Thiscountingstepisreallyimportantintheefficiencyofthealgorithm,becausethesetofthecandidateitemsetsmaybepossiblylarge.Apriorihandlesthisproblembyemployingahashtreeforstoringthecandidate.Thecandidategenerationalgorithmisusedtofindthecandidateitemsetscontainedinatransactionusingthishashtreestructure.ForeachtransactionTinthetransactiondatabaseD,thecandidatescontainedinTarefoundusingthehashtree,andthentheircountsareincremented.AfterexaminingalltransactioninD,theonesthatarelargeareinsertedintoLk.

Theproblemisthateverypassgoesoverthealldata,andit'snoefficientprocess.

TheanswerforthisproblemisaprioriTid.

Usesthedatabaseonlyonce.

BuildsastoragesetC^k

Membershastheform<TID,{Xk}>

Xkarepotentiallylargek-itemsintransactionTI.

Fork=1,C^1isthedatabase.

UsesC^kinpassk+1.

AlgorithmaprioryTid

Advantage

C^kcouldbesmallerthanthedatabase.

Ifatransactiondoesnotcontaink-itemsetcandidates,thanitwillbeexcludedfromC^k.

Forlargek,eachentrymaybesmallerthanthetransaction

Thetransactionmightcontainonlyfewcandidates.

Disadvantage

Forsmallk,eachentrymaybelargerthanthecorrespondingtransaction.

Anentryincludesallk-itemsetscontainedinthetransaction.

Figure2.1.1-1–PerpassexecutiontimesofAprioriandAprioriTId

WecanseeinthefigureabovethatintheearlierpassesaprioridoesbetterperformancebutinthelaterpassesaprioriTidbeatsApriori.That’sbecauseinthelaterpassesthenumberofcandidateitemsetsreduces.AprioriTiddoesn'tusethedatabaseitusesCKinstead.CKbecomesmallerandthat’swhyinthelaterpassesaprioriTidisbetter.

Sowhoisbetter?

Intheearlierpasses,AprioridoesbetterthanAprioriTid.However,AprioriTidbeatsAprioriinlaterpasses.Weobservedsimilarrelativebehaviorfortheotherdatasets,thereasonforwhichisasfollows.AprioriandAprioriTidusethesamecandidategenerationprocedureandthereforecountthesameitemsets.Inthelaterpasses,thenumberofcandidateitemsetsreduces.However,Aprioristillexamineseverytransactioninthedatabase.Ontheotherhand,ratherthanscanningthedatabase,AprioriTidscansCKforobtainingsupportcounts,andthesizeofCKhasbecomesmallerthanthesizeofthedatabase.WhentheCKsetscanfitinmemory,wedonotevenincurthecostofwritingthemtodisk.

Basedontheseobservations,wecandesignahybridalgorithm,whichwecallAprioriHybridthatusesAprioriintheinitialpassesandswitchestoAprioriTidwhenitexpectsthatthesetCKattheendofthepasswillfitinmemory.WeusethefollowingheuristictoestimateifCKwouldfitinmemoryinthenextpass.Attheendofthecurrentpass,wehavethecountsofthecandidate'siinCK.Fromthis,weestimatewhatthesizeofCKwouldhavebeenifithadbeengenerated.Thissize,inwords,is

IfCKinthispasswassmallenoughtofitinmemory,andtherewerefewerlargecandidatesinthecurrentpassthanthepreviouspass,weswitchtoAprioriTid.

Theswitchtakestime,butitstillworthit.WecanseefromthegraphsbellowtheadvantageofAprioryHybridalgorithm.IttakestheadvantagesofbothalgorithmsAprioriandAprioriTid.

T10.12.D100Kandtheothersrepresenttheparametersettings.

|T|-10–Averagesizeofthetransactions.

|I|-2-Averagesizeofthemaximalpotentiallylargeitemsets.

D–100K–Numberoftransactions.

settings12,14,16aretheaveragesizeofthemaximalpotentiallylargeitemsets.

WecanseeinthegraphsbellowthatApriorihasbetterperformancethanAprioriTid.Thereasonissmallnumberofitemsinallthetransactions.

AprioriTidhasgoodperformancewhenthesizeofthetransactionsisbig.BecauseinthespecificexamplesbellowthesizeissmalltheApriorihasbetterperformance.

Figure2.1.1-2–Executiontimesfordecreasingminimumsupport(maxpotentiallylargeitemsetis2

Figure2.1.1-3–Executiontimesfordecreasingminimumsupport(maxpotentiallylargeitemsetis4

Figure2.1.1-4–Executiontimesfordecreasingminimumsupport(maxpotentiallylargeitemsetis6

InthegraphabovewecanseehowAprioryHybridalgorithmtakestheadvantagesofbothalgorithmsAprioriandAprioriTid.

Note–Wemustrememberthatthefollowingconclusionsandthesummarybellowrefertothealgorithmsonthosetimes

Conclusions

TheApriorialgorithmsarebetterthanthepreviousalgorithms.

Forsmallproblemsbyfactors

Forlargeproblemsbyordersofmagnitudes.

Thealgorithmsarebestcombined.

Thealgorithmshowsgoodresultsinscale-upexperiments

AprioriTidusesC^kinsteadofthedatabase.IfC^kfitsinmemoryAprioriTidisfasterthanApriori

WhenC^kistoobigitcannotsitinmemory,andthecomputationtimeismuchlonger.ThusAprioriisfasterthanAprioriTid.

Summary

Associationrulesareanimportanttoolinanalyzingdatabases.

We’veseenanalgorithmwhichfindsallassociationrulesinadatabase.

Thealgorithmhasbettertimeresultsthenpreviousalgorithms.

Thealgorithmmaintainsitsperformancesforlargedatabases.

FP-TREEALGORITHM[8]

Candidategenerationisbyfarthemosttimeconsumingprocess,soitisdesirabletospeedthisup.FPTreealgorithmdirectlyminesfrequentitemsetswithoutgeneratingcandidates.TheclaimisthatbygatheringsufficientstatisticsintoaspecialstructurewhichcalledFPtree,allofthefrequentpatternscanbegeneratedwithoutgoingbacktothedatabase.Andthisdefinitelywillleadustobetterperformance.

AswelearnbeforeAprioriworkswellexceptwhentheinputis:

Lotsoffrequentpatternswithbigsetsofitemsorwithlowminimumsupportthreshold

Longpatterns

FPtreeavoidcandidatesetexplosionby:

Compacttreedatastructure(ItavoidrepeateddatascansthusitmuchsmallerthanthebasicDatabase).

Restrictedtest-only

Searchdivide-and-conquerbased

Algorithmssummarize

Thealgorithmmadeupfromtwophases:

Phase1-ConstructingFP-tree

ScanDBtofindL

Collectthesetoffrequentitems

SortLandDBindescendingfrequency

ScanDBagain-constructFP-tree

Phase2-ExecutingFP-Growth

MiningfrequentpatternsfromFP-tree

Processingfrequentitems

Onebyone

Bottomup

Eachitem

GeneratingaconditionalFP-tree

Algorithm–Phase1[8]

Algorithm–Phase2[8]

FP-Treealgorithmexample[8]

Herewe'llshowexamplewhichwillsummarizetheAlgorithmability.

Step1

ScanningDBtofindL

Example:Minimumsupport=60%

ScaneachTIDandupdatethefrequencyforeachiteminthenewtable

Table2.1.1-1-FPtreealgorithmexampleTid,item,andfrequency(1)

ScanDBtofindL(Listofallitemswhichmeetthesupport).

Afterscanning–Markingreentheitemswhichmeetthesupport

Table2.1.1-2-FPtreealgorithmexampleTid,item,andfrequency(2)

Step2

SortLindescendingfrequency

L={a:3,b:3,c:4,f:4,m:3,p:3}

L’={f:4,c:4,a:3,b:3,m:3,p:3}

Buildanewtablewhichcontainsonlytheitemswhichmeetthesupportandindescendingorder(seeL').

SortDB

Table2.1.1-3-FPtreealgorithmexampleTid,item,andfrequency(3)

Step3

InthisstepwescantheDBagaintoconstructtheFPtreetuplebytuple.

Westartwiththefirsttupletobuildthetreeinthesameorderoftheitems.Foreachitemwesetanumber,thisnumbernotehowmanytransactionsitbelongsto.

Table2.1.1-4-FPtreealgorithmexampleTid,item,andHeadertable(1)

Step3-Cont

Continuebuildingthetreeusingthesecondtuple.

Wecanseethenumbersineachnode.Itindicatesthenumberoftransactionswhichitbelongs

Table2.1.1-5-FPtreealgorithmexampleTid,item,andHeadertable(2)

Continuebuildingthetreeusingthethirdtuple.

Table2.1.1-6-FPtreealgorithmexampleTid,item,andHeadertable(3)

Continuebuildingthetreeusingtheforthtuple.

Table2.1.1-7-FPtreealgorithmexampleTid,item,andHeadertable(4)

Step3cont

Continuebuildingthetreeusingthelasttuple.

Table2.1.1-8-FPtreealgorithmexampleTid,item,andHeadertable(5)

Nowwe'llshowthesecondalgorithm-FPGrows

Afterthedatabaseiscompressedintoahighlycondensedandmuchsmallerdatastructure,wecontinuetothenextstep

MiningfrequentpatternsfromtheFP-Tree.Processingfrequentitemsonebyonebottomup.EachitemgeneratesaconditionalFP-Tree.

Figure2.1.1-5-FPgrowsexample

Exampleforp

FirstwemarkeachnodewhichisabovePinthesamebranch

Figure2.1.1-6-FPgrowsexampleforp(1)

Theonlyfrequentpatternsfor"P"are{p:3,cp:3}

Figure2.1.1-7-FPgrowsexamplep(2)

Exampleform

Again,firstwemarkeachnodewhichisaboveminthesamebranch

Figure2.1.1-8-FPgrowsexampleform(1)

Thefrequentpatternfor"m"are{m:3,am:3,cm:3,fm:3}

Figure2.1.1-9-FPgrowsexampleform(2)

SowerecursivelyconstructingconditionalFP-treefor:

am,cm,fm.

we'llstartwitham

Theprefixare{f:3,c:3}.Sothelargeitemsetswithamare:

{cam:3,fam:3}}

Figure2.1.1-10-FPgrowsexampleforam

Thefrequentpatternswithamare:

{am,cam,fam}.WerecursivelyconstructconditionalFP-Trees"cam","fam"

FP-Treefor"cam"

Thefrequentpatternswithcamare:{fcam}

Cam{fcam}

FP-Treefor"fam"

Thefrequentpatternswithfamis:{

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

最新文档

评论

0/150

提交评论