Improving Data Mining Algorithms Using Constraints


The Open University of Israel

Department of Mathematics and Computer Science

The Computer Science Division

IMPROVING DATA MINING ALGORITHMS USING CONSTRAINTS

Shai Shimon

ID 028455863

Email shaishimon@

Cell 0526-126207

Prepared under the supervision of Professor Ehud Gudes

Feb 2012

Table of Contents

List of Figures
List of Tables
1 ABSTRACT AND INTRODUCTION
1.1 ABSTRACT
1.2 INTRODUCTION
2 BASIC CONCEPTS
2.1 INTRODUCTION
2.2 APRIORI - Fast Algorithms for Mining Association Rules [1]
2.3 FP-TREE ALGORITHM [8]
3 PAPERS SURVEY
3.1 USING CONSTRAINTS
3.1.1 INTRODUCTION AND MOTIVATION
3.1.2 MINING FREQUENT ITEMSETS WITH CONVERTIBLE CONSTRAINTS [9]
  Introduction
  Convertible constraints - motivation
3.1.3 MINING ASSOCIATION RULES WITH ITEM CONSTRAINTS [10]
  Abstract
  Introduction
  Algorithms
  Tradeoffs
  Conclusions
3.1.4 EXAMINER: OPTIMIZED LEVEL-WISE FREQUENT PATTERN MINING WITH MONOTONE CONSTRAINTS ALGORITHM [4]
  Abstract
  Introduction
  Definitions
  ExAMiner algorithm
  Flowchart of ExAMiner
  ExAMiner algorithm example
  Experiments
3.1.5 FP-BONSAI ALGORITHM [5]
  Introduction
  FP-Bonsai algorithm
  FP-Bonsai algorithm example
  Disadvantage
  Experiments
  Summary
3.2 SHORT PAPERS SURVEYS
4 IMPLEMENTATION
5 SUMMARY AND CONCLUSIONS
6 REFERENCES

List of Figures

Figure 2.1.1-1: Per-pass execution times of Apriori and AprioriTid
Figure 2.1.1-2: Execution times for decreasing minimum support (max potentially large itemset is 2)
Figure 2.1.1-3: Execution times for decreasing minimum support (max potentially large itemset is 4)
Figure 2.1.1-4: Execution times for decreasing minimum support (max potentially large itemset is 6)
Figure 2.1.1-5: FP-growth example
Figure 2.1.1-6: FP-growth example for p (1)
Figure 2.1.1-7: FP-growth example for p (2)
Figure 2.1.1-8: FP-growth example for m (1)
Figure 2.1.1-9: FP-growth example for m (2)
Figure 2.1.1-10: FP-growth example for am
Figure 2.1.1-11: FP-growth example for cam and fam
Figure 2.1.1-12: FP-growth example for cam and cm
Figure 2.1.1-13: FP-growth example for cam and fm
Figure 2.1.1-14: FP-growth example results
Figure 2.1.1-15: FP-tree algorithm experiment – Runtime, support threshold
Figure 2.1.1-16: FP-growth algorithm experiment – Transactions number with threshold = 1.5%
Figure 3.1.4-1: ExAMiner0
Figure 3.1.4-2: ExAMiner1 & ExAMiner2
Figure 3.1.4-3: ExAMiner experiment, data reduction rate (min_sup = 1100)
Figure 3.1.4-4: ExAMiner experiment, data reduction rate (min_sup = 500)
Figure 3.1.4-5: ExAMiner experiment, runtime, synthetic (min_sup = 1200)
Figure 3.1.4-6: ExAMiner experiment, runtime, synthetic (sum(prices) > 2800)
Figure 3.1.5-1: FP-Bonsai and ExAMiner experiment (BMS-POS) (1)
Figure 3.1.5-2: FP-Bonsai and ExAMiner experiment (BMS-POS) (2)
Figure 4-1: Application window – main panel
Figure 4-2: Application window – FP-tree result panel

List of Tables

Table 2-1: Market-Basket transactions
Table 2-2: Convertible anti-monotone
Table 2-3: Convertible monotone
Table 2-4: Strongly convertible constraints
Table 2.1.1-1: FP-tree algorithm example: Tid, item, and frequency (1)
Table 2.1.1-2: FP-tree algorithm example: Tid, item, and frequency (2)
Table 2.1.1-3: FP-tree algorithm example: Tid, item, and frequency (3)
Table 2.1.1-4: FP-tree algorithm example: Tid, item, and header table (1)
Table 2.1.1-5: FP-tree algorithm example: Tid, item, and header table (2)
Table 2.1.1-6: FP-tree algorithm example: Tid, item, and header table (3)
Table 2.1.1-7: FP-tree algorithm example: Tid, item, and header table (4)
Table 2.1.1-8: FP-tree algorithm example: Tid, item, and header table (5)
Table 2.1.1-9: FP-tree algorithm experiment – Synthetic data set
Table 3.1.2-1: Transaction ID and transaction
Table 3.1.2-2: Frequent itemsets with support threshold
Table 3.1.4-1-A: ExAMiner0
Table 3.1.4-1-B: ExAMiner1
Table 3.1.4-2: ExAMiner example, level one (1)
Table 3.1.4-3: ExAMiner example, level one (2)
Table 3.1.4-4: ExAMiner example, level one (3)
Table 3.1.4-5: ExAMiner example, level one (4)
Table 3.1.4-6: ExAMiner example, level one (5)
Table 3.1.4-7: ExAMiner example, level one (6)
Table 3.1.4-8: ExAMiner example, level two (1)
Table 3.1.4-9: ExAMiner example, level two (2)
Table 3.1.4-10: ExAMiner example, level two (4)
Table 3.1.4-11: ExAMiner example, level two (5)
Table 3.1.4-12: ExAMiner example, level two (7)
Table 3.1.4-13: ExAMiner example, level two (7)
Table 3.1.4-14: ExAMiner example, level two (7)
Table 3.1.5-1: FP-Bonsai example: item, value table
Table 3.1.5-2: FP-Bonsai example: Tid, items table
Table 3.1.5-3: FP-Bonsai example: pruning (constraint check)
Table 3.1.5-4: FP-Bonsai example run: pruning (support check)
Table 3.1.5-5: FP-Bonsai example: pruning (1)
Table 3.1.5-6: FP-Bonsai example run: pruning (2)
Table 3.1.5-7: FP-Bonsai example: pruning (3)
Table 3.1.5-8: FP-Bonsai example run: pruning (4)
Table 3.1.5-9: FP-Bonsai example: pruning (5)
Table 3.1.5-10: FP-Bonsai example results
Table 4-1: Transactions table
Table 4-2: Items and prices
Table 4-3: Experiments results

ABSTRACT AND INTRODUCTION

ABSTRACT

The purpose of data mining is to identify and predict patterns, trends and relationships in data. The main steps in the data mining process are:

Defining the problem, preparing the information, analyzing the data, evaluating the results, and displaying the results.

This work presents a number of data mining algorithms that use association rules. It first presents the basic algorithms (the Apriori algorithm and FP-tree) and then discusses algorithms with constraints. The constraint-based algorithms are presented in detail, together with the differences between them.

The work focuses on data mining algorithms with constraints: the importance of constraints in data mining, how they are used, the different types of constraints, and effective mining methods that exploit them. Because the output of data mining can be very large, constraints help the user find the desired information and improve system performance. The work concentrates on certain types of constraints and on the algorithms that were built for them. Specifically, the algorithms reviewed are:

MINING FREQUENT ITEMSETS WITH CONVERTIBLE CONSTRAINTS [9]

MINING ASSOCIATION RULES WITH ITEM CONSTRAINTS [10]

EXAMINER: OPTIMIZED LEVEL-WISE FREQUENT PATTERN MINING WITH MONOTONE CONSTRAINTS ALGORITHM [4]

FP-BONSAI ALGORITHM [5]

In addition, we briefly review six other articles: four on constraints and two on algorithms more advanced than Apriori.

The last phase of the work is an implementation of two algorithms: FP-Bonsai and FP-tree. The implementation was coded in Java. The input database is synthetic and was built by a random generator developed especially for this purpose. The results and conclusions of the evaluation are summarized in the paper.

INTRODUCTION

BACKGROUND

Over the years, the cost of mass storage has decreased dramatically, and database technology, together with the ubiquitous Internet, has become more intelligent and powerful. We are now at a point where we have too much data and too few computerized tools to analyze it, let alone to apply the knowledge resulting from the analysis to expedite information dissemination, scientific research, and industrial and commercial decision making. We are indeed data billionaires living in the gutter of knowledge.

This is where data mining comes in. It started as a direct consequence of the development of information technology. Following the rapid progress in the field, data mining now provides theoretical foundations for implementing analysis software for many kinds of applications.

This paper focuses mainly on how to generate association rules efficiently.

The user is allowed to express the focus of the mining by means of a rich class of constraints that capture application semantics. Besides allowing user exploration and control, this paradigm allows many of these constraints to be pushed deep inside the mining process (discussed later under basic concepts), thus pruning the search space to the patterns of interest to the user and achieving superior performance.

In this report we discuss two main topics: algorithms for constraints, and efficiency improvements.

PURPOSE

This paper is divided into four main topics:

Basic concepts and a short brief on the Apriori and FP-tree algorithms: this section introduces the basic concepts needed for the rest of the paper and briefly discusses two algorithms, Apriori and FP-tree.

Paper survey: this section shows the advantage of constraints. With constraints we obtain fewer patterns, and those patterns are more interesting; indeed, constraints are the way we define what is "interesting". Four articles that use the constraint methodology are introduced:

"Mining Frequent Itemsets with Convertible Constraints" [9]

"Mining Association Rules with Item Constraints" [10]

"ExAMiner algorithm" [4]

"FP-Bonsai" [5]

Short paper survey: a brief description of a few articles that deal with improving the basic algorithms and with constraint-based algorithms.

Application implementation: after describing the articles above, we show the results of an application written in Java. The application implements two algorithms, "FP-tree" and "FP-Bonsai". The program was run on synthetically generated data. It reports the running time of each algorithm in addition to the results.

BASIC CONCEPTS

Association rule mining: given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions

Examples of association rules

{Diapers} → {Beer},

{Milk, Bread} → {Eggs, Coke},

{Beer, Bread} → {Milk},

Table 2-1 Market-Basket transactions

Itemset

A collection of one or more items

Example: {Milk, Bread, Diapers}

k-itemset

An itemset that contains k items

Support count (σ)

The number of transactions in which an itemset occurs

E.g. σ({Milk, Bread, Diapers}) = 2

Support

The fraction (or percentage) of transactions that contain an itemset. In other words, an itemset that appears in x of the N transactions in the database has support x/N.

E.g. {Milk, Bread, Diapers} appears in 2 of the 5 transactions in the table above, so its support is 2/5 * 100 = 40%

Confidence

Confidence denotes the strength of the implication in a rule: the higher the confidence, the stronger the relationship between the two itemsets. In the example, the link between milk and bread is strong because the confidence is 75%.

Confidence(X => Y) = Support(X ∪ Y) / Support(X)

E.g. s({Milk, Bread}) = 3/5

s({Milk}) = 4/5

Confidence(Milk => Bread) = (3/5) / (4/5) = 0.75, i.e. 75%
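
The support and confidence definitions above can be checked with a few lines of Python. This is only a sketch: the five transactions below are an assumed market-basket database, chosen so that it reproduces the numbers quoted in the text (support count 2, support 40%, confidence 75%), and the function names are illustrative.

```python
# Assumed five-transaction database (illustrative; chosen to match the
# support and confidence values quoted in the text).
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]


def support_count(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if set(itemset) <= t)


def support(itemset, db):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset, db) / len(db)


def confidence(x, y, db):
    """Confidence of the rule X => Y: support(X u Y) / support(X)."""
    return support_count(set(x) | set(y), db) / support_count(x, db)


sigma = support_count({"Milk", "Bread", "Diapers"}, TRANSACTIONS)
sup = support({"Milk", "Bread", "Diapers"}, TRANSACTIONS)
conf = confidence({"Milk"}, {"Bread"}, TRANSACTIONS)
```

Confidence is computed from support counts rather than from the two fractions, which gives the same ratio and avoids dividing by the database size twice.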

Frequent itemset

An itemset whose support is greater than or equal to a minsup threshold

Association rule

An implication expression of the form X → Y, where X and Y are itemsets

Example:

{Milk, Diapers} → {Beer}

Constraints

What are constraints in data mining? Constraints are rules enforced on the data transactions.

The idea behind constraints is to focus on the specific, relevant itemsets that we want to mine. Some basic concepts regarding constraints follow.

Constraint mining: aims to reduce the search space and finds all patterns that satisfy the constraints.

Constraint-based search: aims to reduce the search space and finds only some (or one) of the answers.

Both constraint mining and constraint-based search aim to reduce the search space, but the first finds all the patterns while the second finds only some or one of them. This difference affects both run time and memory usage.

Anti-monotonic: when an itemset S violates the constraint, so does every superset of S.

Example: C: range(S.profit) ≤ 15 is anti-monotone. Itemset ab violates C:

range(ab) = 40 > 15

So does every superset of ab.

Monotonic: when an itemset S satisfies the constraint, so does every superset of S.

Example: C: range(S.profit) ≥ 15. Itemset ab satisfies C:

range(ab) = 40 ≥ 15

So does every superset of ab.
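
The two definitions can be illustrated with a brute-force check over all supersets of ab. The profit values below are assumptions made for this sketch; the only figure taken from the text is range(ab) = 40. The helper names are likewise illustrative.

```python
from itertools import combinations

# Hypothetical profit table: only range(ab) = 40 comes from the text.
PROFIT = {"a": 40, "b": 0, "c": -20, "d": 10}


def profit_range(itemset):
    """max(profit) - min(profit) over the items of the itemset."""
    values = [PROFIT[i] for i in itemset]
    return max(values) - min(values)


def supersets(base, universe):
    """Yield every superset of `base` drawn from `universe` (base included)."""
    rest = sorted(set(universe) - set(base))
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            yield set(base) | set(extra)


def violates_upper(itemset, bound=15):
    """True if the itemset violates C: range(S.profit) <= bound."""
    return profit_range(itemset) > bound


def satisfies_lower(itemset, bound=15):
    """True if the itemset satisfies C: range(S.profit) >= bound."""
    return profit_range(itemset) >= bound


# Anti-monotone: ab violates range <= 15, and so does every superset.
all_violate = all(violates_upper(s) for s in supersets({"a", "b"}, PROFIT))
# Monotone: ab satisfies range >= 15, and so does every superset.
all_satisfy = all(satisfies_lower(s) for s in supersets({"a", "b"}, PROFIT))
```

Both checks succeed for any profit table, because adding items can only widen the range. This is what lets a miner discard ab and all of its supersets as soon as ab fails the anti-monotone constraint.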

The following tables are used in the examples.

Item

Profit

-20

-30

-10

TID

Transaction

a,b,c,d,f

b,c,d,f,g,h

a,c,d,e,f

c,e,f,g

Table 2-2 Convertible anti-monotone

Convertible anti-monotone: assume there is an order R over the items. Whenever an itemset S satisfies C, so does every prefix of S.

Example: C: avg(S) ≥ 20 w.r.t. item value descending order

The itemset "abc" satisfies C

avg(abc) = 30 ≥ 20

and so do "ab", avg(ab) = 35 ≥ 20, and "a", avg(a) = 40 ≥ 20

Convertible monotone: assume there is an order R over the items. Whenever an itemset S violates C, so does every prefix of S.

Example: C: avg(S) ≤ 20 w.r.t. item value descending order

The itemset "abc" violates C, avg(abc) = 30 > 20, and so do "ab",

avg(ab) = 35 > 20, and "a", avg(a) = 40 > 20
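
The prefix property can be reproduced numerically. In this sketch the item values a = 40, b = 30, c = 20 are not given directly in the text; they are derived from the averages it quotes (avg(a) = 40, avg(ab) = 35, avg(abc) = 30). The helper names are illustrative.

```python
# Item values implied by the averages quoted in the text:
# avg(a) = 40, avg(ab) = 35, avg(abc) = 30  =>  a = 40, b = 30, c = 20.
VALUE = {"a": 40, "b": 30, "c": 20}


def avg(items):
    """Average value of the given items."""
    return sum(VALUE[i] for i in items) / len(items)


def prefixes(itemset):
    """Non-empty prefixes of the itemset, listed in value-descending order R."""
    ordered = sorted(itemset, key=lambda i: -VALUE[i])
    return [ordered[:k] for k in range(1, len(ordered) + 1)]


prefix_avgs = [avg(p) for p in prefixes("abc")]

# Under the descending order, each longer prefix adds a value no larger than
# those already present, so the prefix averages never increase.
non_increasing = all(x >= y for x, y in zip(prefix_avgs, prefix_avgs[1:]))

# C: avg(S) >= 20 holds for abc and therefore for every prefix.
anti_monotone_holds = all(a >= 20 for a in prefix_avgs)
# C: avg(S) <= 20 is violated by abc and therefore by every prefix.
monotone_violated = all(a > 20 for a in prefix_avgs)
```

The key observation is that the average is neither monotone nor anti-monotone on arbitrary itemsets, but it becomes well behaved once itemsets are grown in a fixed value order.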

The following tables are used in the examples.

Item

Profit

TID

Transaction

a,b,c,d,f

b,c,d,f,g,h

a,c,d,e,f

c,e,f,g

Table 2-3 Convertible monotone

Strongly convertible constraints

A constraint C is strongly convertible whenever there exists an order R over the set of items such that C is convertible anti-monotone w.r.t. R and convertible monotone w.r.t. R^-1.

Example: C: avg(X) ≥ 25 is convertible anti-monotone w.r.t. item value descending order R.

The itemset "afg" satisfies C, and so do "af" and "a".

avg(X) ≥ 25 is convertible monotone w.r.t. item value ascending order R^-1.

The itemset "ech" violates C, and so do "ec" and "e".

Table in descending order | Table in ascending order

Item

Value

Item

Value

Table 2-4 Strongly convertible constraints

Succinctness constraints

Let A1 be the set of items that satisfy a succinct constraint C. Then any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1. The constraint min(S.Price) ≤ v is succinct, because every itemset that satisfies it contains at least one item of A1. The constraint sum(S.Price) ≥ v is not succinct.

Example: min(S.Price) ≤ v

A1 = {20, 30, 40, 8, 5, 3}

v = 70

min(A1) < v, so A1 satisfies min(S.Price) ≤ v, and so does every non-empty subset of A1.

sum(A1) > v, so A1 satisfies sum(S.Price) ≥ v, but not every subset of A1 does. For example, sum(20, 30) = 50 < 70.
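
The contrast between the two constraints can be verified by enumerating the subsets of A1. The prices and the threshold come from the example above; the variable and function names are assumptions made for this sketch.

```python
from itertools import combinations

A1 = [20, 30, 40, 8, 5, 3]   # prices from the example
v = 70                       # threshold from the example


def nonempty_subsets(values):
    """Yield every non-empty subset of `values` as a tuple."""
    for r in range(1, len(values) + 1):
        yield from combinations(values, r)


# min(S.Price) <= v: holds for A1 and for every non-empty subset of A1.
min_ok_everywhere = all(min(s) <= v for s in nonempty_subsets(A1))

# sum(S.Price) >= v: holds for A1 itself but fails for many subsets.
sum_ok_for_a1 = sum(A1) >= v
sum_failures = [s for s in nonempty_subsets(A1) if sum(s) < v]
```

Membership in the answer set of the min constraint can be decided by looking at individual items, while the sum constraint depends on the whole combination. That is the practical difference the example illustrates.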

INTRODUCTION

Apriori, created by R. Agrawal and R. Srikant, is the simplest and most widely known algorithm for mining frequent itemsets.

The Apriori algorithm works iteratively. It first finds the set of large 1-itemsets, then the set of large 2-itemsets, and so on. The number of scans over the transaction database equals the length of the maximal itemset. Apriori is based on the following fact: all subsets of a large itemset are also large. This simple but powerful observation leads to the generation of a smaller candidate set using the set of large itemsets found in the previous iteration.

Disadvantages

Generation of candidate itemsets is expensive (in both space and time).

Unlike Apriori, FP-growth uses an extended prefix-tree structure to store the database in a compressed form. It uses a pattern-fragment growth method to avoid the costly process of candidate generation and testing used by Apriori.

APRIORI - Fast Algorithms for Mining Association Rules [1]

Algorithm summary

Count item occurrences

Generate new k-itemset candidates

Find the support of all the candidates

Keep only those with support over minsup

Apriori first scans the transaction database D in order to count the support of each item i in I and to determine the set of large 1-itemsets. Then one iteration is performed for the computation of each of the sets of 2-itemsets, 3-itemsets, and so on. The kth iteration consists of two steps:

Generate the candidate set Ck from the set of large (k-1)-itemsets, Lk-1.

Scan the database in order to compute the support of each candidate itemset in Ck.

The candidate generation algorithm is given as follows:

The candidate generation procedure computes the set of potentially large k-itemsets from the set of large (k-1)-itemsets. A new candidate k-itemset is generated from two large (k-1)-itemsets if their first (k-2) items are the same. The candidate set Ck is guaranteed to include all possible large k-itemsets because all subsets of a large itemset are also large; since all large itemsets in Lk-1 are checked for their contribution to candidate itemsets, Ck is certainly a superset of the large k-itemsets. After the candidates are generated, their counts must be computed in order to determine which of them are large. This counting step is critical to the efficiency of the algorithm, because the set of candidate itemsets may be large. Apriori handles this problem by storing the candidates in a hash tree, which is used to find the candidate itemsets contained in a transaction. For each transaction T in the transaction database D, the candidates contained in T are found using the hash tree, and their counts are incremented. After all transactions in D have been examined, the candidates that are large are inserted into Lk.
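
A minimal sketch of candidate generation follows. The join step is the one described above (two large (k-1)-itemsets that share their first k-2 items). The prune step, which drops any candidate that has a (k-1)-subset outside Lk-1, is the usual companion of the join in Apriori and follows from the subset property. The function name `apriori_gen`, the tuple representation and the sample L3 are assumptions made for illustration; the hash tree is not modelled.

```python
from itertools import combinations


def apriori_gen(large_prev):
    """Build candidates C_k from L_{k-1}; itemsets are sorted tuples."""
    prev = sorted(large_prev)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join step: the first k-2 items must be the same.
            if p[:-1] != q[:-1]:
                continue
            c = p + (q[-1],)
            # Prune step: every (k-1)-subset of c must itself be large.
            if all(s in prev_set for s in combinations(c, len(c) - 1)):
                candidates.append(c)
    return candidates


# Illustrative input: five large 3-itemsets over the items 1..5.
L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4 = apriori_gen(L3)
```

Here the join produces (1, 2, 3, 4) and (1, 3, 4, 5). The second is pruned because its subset (1, 4, 5) is not in L3, so only one candidate needs to be counted against the database.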

The problem is that every pass goes over all the data, which is not efficient.

The answer to this problem is AprioriTid, which:

Uses the database only once.

Builds a storage set C^k.

Members have the form <TID, {Xk}>.

Xk are the potentially large k-itemsets in the transaction with identifier TID.

For k = 1, C^1 is the database.

Uses C^k in pass k+1.

Algorithm AprioriTid

Advantage

C^k can be smaller than the database.

If a transaction does not contain any k-itemset candidates, then it is excluded from C^k.

For large k, each entry may be smaller than the transaction.

The transaction might contain only a few candidates.

Disadvantage

For small k, each entry may be larger than the corresponding transaction.

An entry includes all candidate k-itemsets contained in the transaction.
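
The way the storage set replaces the database can be sketched as a single pass function. This is an illustration under stated assumptions, not the algorithm listing of the original paper: the function name `aprioritid_pass`, the tuple representation and the four-transaction sample data are all hypothetical, and the candidates are passed in rather than generated.

```python
def aprioritid_pass(entries, candidates):
    """One AprioriTid pass over the storage set instead of the database.

    entries: list of (tid, set of (k-1)-itemsets as sorted tuples)
    candidates: list of candidate k-itemsets as sorted tuples
    Returns (support counts, storage set for the next pass).
    """
    counts = {c: 0 for c in candidates}
    next_entries = []
    for tid, itemsets in entries:
        # A candidate is contained in the transaction if both
        # (k-1)-itemsets that generated it are present in the entry.
        present = {c for c in candidates
                   if c[:-1] in itemsets and c[:-2] + c[-1:] in itemsets}
        for c in present:
            counts[c] += 1
        if present:  # transactions without candidates are dropped
            next_entries.append((tid, present))
    return counts, next_entries


# Storage set for k = 1: the database itself, one 1-itemset per item.
C1_bar = [
    (100, {(1,), (3,), (4,)}),
    (200, {(2,), (3,), (5,)}),
    (300, {(1,), (2,), (3,), (5,)}),
    (400, {(2,), (5,)}),
]
C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
counts2, C2_bar = aprioritid_pass(C1_bar, C2)

# Next pass: a single candidate 3-itemset, counted from C2_bar alone.
counts3, C3_bar = aprioritid_pass(C2_bar, [(2, 3, 5)])
```

In the second call, two of the four transactions drop out of the storage set because they contain no candidate. This is the shrinking effect listed as the advantage above. The entry for transaction 300 in `C2_bar`, on the other hand, holds six itemsets for a four-item transaction, which is the disadvantage for small k.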

Figure 2.1.1-1 – Per-pass execution times of Apriori and AprioriTid

The figure above shows that Apriori performs better in the earlier passes, while AprioriTid beats Apriori in the later passes. This is because the number of candidate itemsets decreases in the later passes. AprioriTid does not scan the database; it uses C^k instead. C^k becomes smaller, which is why AprioriTid is better in the later passes.

So which is better?

In the earlier passes, Apriori does better than AprioriTid. However, AprioriTid beats Apriori in later passes. We observed similar relative behavior for the other datasets, the reason for which is as follows. Apriori and AprioriTid use the same candidate generation procedure and therefore count the same itemsets. In the later passes, the number of candidate itemsets decreases. However, Apriori still examines every transaction in the database. AprioriTid, on the other hand, scans C^k rather than the database to obtain support counts, and the size of C^k has become smaller than the size of the database. When the C^k sets fit in memory, we do not even incur the cost of writing them to disk.

Based on these observations, we can design a hybrid algorithm, which we call AprioriHybrid, that uses Apriori in the initial passes and switches to AprioriTid when it expects that the set C^k at the end of the pass will fit in memory. We use the following heuristic to estimate whether C^k would fit in memory in the next pass. At the end of the current pass, we have the counts of the candidates in Ck. From this, we estimate what the size of C^k would have been if it had been generated. This size, in words, is (the sum of the supports of the candidates in Ck) + (the number of transactions).

If C^k in this pass was small enough to fit in memory, and there were fewer large candidates in the current pass than in the previous pass, we switch to AprioriTid.
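
The switching rule can be sketched in a few lines. The size estimate used here (the sum of the candidate supports plus the number of transactions) follows the heuristic of the original Apriori paper. The function names, the `memory_budget` parameter (measured in the same abstract units as the estimate) and the sample numbers are assumptions made for this illustration.

```python
def estimated_storage_size(candidate_supports, num_transactions):
    """Estimated size of the storage set, had it been generated this pass.

    Assumed estimate: sum of candidate supports + number of transactions.
    """
    return sum(candidate_supports.values()) + num_transactions


def should_switch(candidate_supports, num_transactions, memory_budget,
                  large_now, large_prev):
    """Switch to AprioriTid if the storage set would fit in memory and
    the number of large candidates is shrinking from pass to pass."""
    size = estimated_storage_size(candidate_supports, num_transactions)
    return size <= memory_budget and large_now < large_prev


# Hypothetical counts at the end of a pass over four transactions.
supports = {(1, 2): 1, (1, 3): 2, (1, 5): 1, (2, 3): 2, (2, 5): 3, (3, 5): 2}
size = estimated_storage_size(supports, 4)
switch = should_switch(supports, 4, memory_budget=20, large_now=4, large_prev=6)
```

Requiring that the number of large candidates is falling is meant to avoid switching in a pass after which the storage set would grow again and no longer fit in memory.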

The switch takes time, but it is still worth it. The graphs below show the advantage of the AprioriHybrid algorithm, which combines the advantages of both Apriori and AprioriTid.

T10.I2.D100K and the other names encode the parameter settings:

|T| = 10: average size of the transactions.

|I| = 2: average size of the maximal potentially large itemsets.

|D| = 100K: number of transactions.

The settings I2, I4 and I6 are the average sizes of the maximal potentially large itemsets.

The graphs below show that Apriori performs better than AprioriTid. The reason is the small number of items in the transactions.

AprioriTid performs well when the transactions are large. Because the transactions in the specific examples below are small, Apriori performs better.

Figure 2.1.1-2 – Execution times for decreasing minimum support (max potentially large itemset is 2)

Figure 2.1.1-3 – Execution times for decreasing minimum support (max potentially large itemset is 4)

Figure 2.1.1-4 – Execution times for decreasing minimum support (max potentially large itemset is 6)

The graphs above show how the AprioriHybrid algorithm combines the advantages of both Apriori and AprioriTid.

Note: the conclusions and summary below refer to the algorithms as they stood at the time the original paper was published.

Conclusions

The Apriori algorithms are better than the previous algorithms:

For small problems, by factors.

For large problems, by orders of magnitude.

The algorithms are best combined.

The algorithm shows good results in scale-up experiments.

AprioriTid uses C^k instead of the database. If C^k fits in memory, AprioriTid is faster than Apriori.

When C^k is too big to fit in memory, the computation time is much longer, and Apriori is faster than AprioriTid.

Summary

Association rules are an important tool in analyzing databases.

We have seen an algorithm that finds all association rules in a database.

The algorithm has better running times than previous algorithms.

The algorithm maintains its performance for large databases.

FP-TREE ALGORITHM [8]

Candidate generation is by far the most time-consuming process, so it is desirable to speed it up. The FP-tree algorithm mines frequent itemsets directly, without generating candidates. The claim is that by gathering sufficient statistics into a special structure called an FP-tree, all of the frequent patterns can be generated without going back to the database, which leads to better performance.

As we saw before, Apriori works well except when the input has:

Many frequent patterns, with large itemsets or a low minimum support threshold

Long patterns

FP-tree avoids the candidate set explosion through:

A compact tree data structure (it avoids repeated database scans and is much smaller than the original database).

Restricted test-only operation

Divide-and-conquer based search

Algorithm summary

The algorithm is made up of two phases:

Phase 1 - Constructing the FP-tree

Scan the DB to find L

Collect the set of frequent items

Sort L and the DB in descending frequency order

Scan the DB again and construct the FP-tree

Phase 2 - Executing FP-growth

Mining frequent patterns from the FP-tree

Processing frequent items

One by one

Bottom up

Each item

Generating a conditional FP-tree

Algorithm – Phase 1 [8]

Algorithm – Phase 2 [8]

FP-tree algorithm example [8]

Here we show an example that summarizes how the algorithm works.

Step 1

Scan the DB to find L

Example: minimum support = 60%

Scan each TID and update the frequency of each item in the new table

Table 2.1.1-1 - FP-tree algorithm example: Tid, item, and frequency (1)

Scan the DB to find L (the list of all items that meet the support).

After scanning, mark in green the items that meet the support

Table 2.1.1-2 - FP-tree algorithm example: Tid, item, and frequency (2)

Step 2

Sort L in descending frequency order

L = {a:3, b:3, c:4, f:4, m:3, p:3}

L' = {f:4, c:4, a:3, b:3, m:3, p:3}

Build a new table that contains only the items that meet the support, in descending order (see L').

Sort the DB

Table 2.1.1-3 - FP-tree algorithm example: Tid, item, and frequency (3)
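
Steps 1 and 2 can be sketched as follows. The five transactions are an assumed database, chosen because it reproduces the item counts in L; the transaction tables of the example are not reproduced here, so the data should be read as illustrative. The item order is taken from L', since ties between equal counts can be broken either way.

```python
from collections import Counter

# Assumed five-transaction database; chosen to reproduce the counts in L.
DB = {
    100: "facdgimp",
    200: "abcflmo",
    300: "bfhjo",
    400: "bcksp",
    500: "afcelpmn",
}
MIN_SUPPORT = 0.6  # 60%, as in the example

# Step 1: count in how many transactions each item occurs, keep frequent ones.
counts = Counter(item for items in DB.values() for item in set(items))
L = {i: n for i, n in counts.items() if n / len(DB) >= MIN_SUPPORT}

# Step 2: item order from L' in the text (ties between equal counts are arbitrary).
ORDER = ["f", "c", "a", "b", "m", "p"]

# Drop infrequent items and sort every transaction by that order.
sorted_db = {tid: [i for i in ORDER if i in items] for tid, items in DB.items()}
```

After this step every transaction lists only frequent items, most frequent first, so transactions that share frequent items also share a prefix. This is what allows the tree in the next step to be compact.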

Step 3

In this step we scan the DB again to construct the FP-tree, tuple by tuple.

We start with the first tuple and build the tree in the order of its items. Each node carries a number that records how many transactions pass through it.

Table 2.1.1-4 - FP-tree algorithm example: Tid, item, and header table (1)

Step 3 - cont.

Continue building the tree using the second tuple.

Each node shows a number: the number of transactions that pass through it.

Table 2.1.1-5 - FP-tree algorithm example: Tid, item, and header table (2)

Continue building the tree using the third tuple.

Table 2.1.1-6 - FP-tree algorithm example: Tid, item, and header table (3)

Continue building the tree using the fourth tuple.

Table 2.1.1-7 - FP-tree algorithm example: Tid, item, and header table (4)

Step 3 - cont.

Continue building the tree using the last tuple.

Table 2.1.1-8 - FP-tree algorithm example: Tid, item, and header table (5)
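
The tuple-by-tuple construction of step 3 can be sketched as below. The class and function names are illustrative, and the five sorted transactions are assumed data that is consistent with the counts in L and the order L'. The header table is modelled as a plain dictionary from each item to the list of nodes that carry it.

```python
class Node:
    """One FP-tree node: an item, a transaction count and child links."""

    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(sorted_transactions):
    """Insert each sorted transaction as a path starting at the root."""
    root = Node(None)
    header = {}  # item -> list of nodes carrying that item
    for items in sorted_transactions:
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1  # one more transaction passes through this node
            node = child
    return root, header


# Transactions already restricted to frequent items and sorted by L'
# (assumed data, consistent with the counts in L).
TRANSACTIONS = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
root, header = build_fp_tree(TRANSACTIONS)
```

With this data, four of the five transactions share the node for f directly under the root, and three of those continue through the same c and a nodes. This sharing is the source of the compression.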

We now show the second algorithm, FP-growth.

After the database has been compressed into a highly condensed and much smaller data structure, we continue to the next step:

Mining frequent patterns from the FP-tree. Frequent items are processed one by one, bottom up. Each item generates a conditional FP-tree.

Figure 2.1.1-5 - FP-growth example

Example for p

First we mark each node that lies above p in the same branch

Figure 2.1.1-6 - FP-growth example for p (1)

The only frequent patterns for "p" are {p:3, cp:3}

Figure 2.1.1-7 - FP-growth example for p (2)

Example for m

Again, we first mark each node that lies above m in the same branch

Figure 2.1.1-8 - FP-growth example for m (1)

The frequent patterns for "m" are {m:3, am:3, cm:3, fm:3}
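
The results for p and m can be reproduced with a small sketch of the conditional pattern base. For brevity the prefix paths are read directly from the sorted transactions rather than by walking node links in the tree, which yields the same paths and counts. The transactions are assumed data consistent with L and L', and the function names are illustrative.

```python
from collections import Counter

TRANSACTIONS = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]  # assumed, sorted by L'
MIN_COUNT = 3  # 60% of 5 transactions


def conditional_pattern_base(item, transactions):
    """Prefix paths that lead to `item`, each with its transaction count."""
    base = Counter()
    for t in transactions:
        if item in t:
            base[t[:t.index(item)]] += 1
    return base


def frequent_in_base(base, min_count):
    """Items whose total count across the prefix paths meets min_count."""
    totals = Counter()
    for path, n in base.items():
        for i in path:
            totals[i] += n
    return {i: n for i, n in totals.items() if n >= min_count}


p_base = conditional_pattern_base("p", TRANSACTIONS)
m_base = conditional_pattern_base("m", TRANSACTIONS)
p_frequent = frequent_in_base(p_base, MIN_COUNT)
m_frequent = frequent_in_base(m_base, MIN_COUNT)
```

For p only c survives the threshold, giving the patterns p and cp. For m the items f, c and a survive, giving fm, cm and am as the starting points for the recursive step.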

Figure 2.1.1-9 - FP-growth example for m (2)

So we recursively construct conditional FP-trees for:

am, cm, fm.

We start with am.

The prefix is {f:3, c:3}, so the large itemsets with am are:

{cam:3, fam:3}

Figure 2.1.1-10 - FP-growth example for am

The frequent patterns with am are:

{am, cam, fam}. We recursively construct the conditional FP-trees for "cam" and "fam".

FP-tree for "cam"

The frequent patterns with cam are: {fcam}

cam → {fcam}

FP-tree for "fam"

The frequent patterns with fam are: {
