版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、1 1Data Mining: Concepts and Techniques (3rd ed.) Chapter 4 Jiawei Han, Micheline Kamber, and Jian PeiUniversity of Illinois at Urbana-Champaign &Simon Fraser University2013 Han, Kamber & Pei. All rights reserved.3Chapter 4: Data Warehousing and On-line Analytical ProcessingnData Warehouse: Basic Co
2、nceptsnData Warehouse Modeling: Data Cube and OLAPnData Warehouse Design and UsagenData Warehouse ImplementationnSummary4What is a Data Warehouse?nDefined in many different ways, but not rigorously.nA decision support database that is maintained separately from the organizations operational database
3、nSupport information processing by providing a solid platform of consolidated, historical data for analysis.n“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of managements decision-making process.”W. H. InmonnData warehousing:nThe proc
4、ess of constructing and using data warehouses5Data WarehouseSubject-OrientednOrganized around major subjects, such as customer, product, salesnFocusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processingnProvide a simple and concise view around
5、 particular subject issues by excluding data that are not useful in the decision support process6Data WarehouseIntegratednConstructed by integrating multiple, heterogeneous data sourcesnrelational databases, flat files, on-line transaction recordsnData cleaning and data integration techniques are ap
6、plied.nEnsure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sourcesnE.g., Hotel price: currency, tax, breakfast covered, etc.nWhen data is moved to the warehouse, it is converted. 7Data WarehouseTime VariantnThe time horizon for the data wareho
7、use is significantly longer than that of operational systemsnOperational database: current value datanData warehouse data: provide information from a historical perspective (e.g., past 5-10 years)nEvery key structure in the data warehousenContains an element of time, explicitly or implicitlynBut the
8、 key of operational data may or may not contain “time element”8Data WarehouseNonvolatilenA physically separate store of data transformed from the operational environmentnOperational update of data does not occur in the data warehouse environmentnDoes not require transaction processing, recovery, and
9、 concurrency control mechanismsnRequires only two operations in data accessing: ninitial loading of data and access of data9OLTP vs. OLAP OLTP OLAP users clerk, IT professional knowledge worker function day to day operations decision support DB design application-oriented subject-oriented data curre
10、nt, up-to-date detailed, flat relational isolated historical, summarized, multidimensional integrated, consolidated usage repetitive ad-hoc access read/write index/hash on prim. key lots of scans unit of work short, simple transaction complex query # records accessed tens millions #users thousands h
11、undreds DB size 100MB-GB 100GB-TB metric transaction throughput query throughput, response 10Why a Separate Data Warehouse?nHigh performance for both systemsnDBMS tuned for OLTP: access methods, indexing, concurrency control, recoverynWarehousetuned for OLAP: complex OLAP queries, multidimensional v
12、iew, consolidationnDifferent functions and different data:nmissing data: Decision support requires historical data which operational DBs do not typically maintainndata consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sourcesndata quality: different sou
13、rces typically use inconsistent data representations, codes and formats which have to be reconcilednNote: There are more and more systems which perform OLAP analysis directly on relational databases11DataWarehouseExtractTransformLoadRefreshOLAP EngineAnalysisQueryReportsData miningMonitor&Integrator
14、MetadataData SourcesFront-End ToolsServeData MartsOperational DBsOthersourcesData StorageOLAP Server12Three Data Warehouse ModelsnEnterprise warehousencollects all of the information about subjects spanning the entire organizationnData Martna subset of corporate-wide data that is of value to a speci
15、fic groups of users. Its scope is confined to specific, selected groups, such as marketing data martnIndependent vs. dependent (directly from warehouse) data martnVirtual warehousenA set of views over operational databasesnOnly some of the possible summary views may be materialized13Extraction, Tran
16、sformation, and Loading (ETL)nData extractionnget data from multiple, heterogeneous, and external sourcesnData cleaningndetect errors in the data and rectify them when possiblenData transformationnconvert data from legacy or host format to warehouse formatnLoadnsort, summarize, consolidate, compute
17、views, check integrity, and build indicies and partitionsnRefreshnpropagate the updates from the data sources to the warehouse14Metadata RepositorynMeta data is the data defining warehouse objects. It stores:nDescription of the structure of the data warehousenschema, view, dimensions, hierarchies, d
18、erived data defn, data mart locations and contentsnOperational meta-datandata lineage (history of migrated data and transformation path), currency of data (active, archived, or purged), monitoring information (warehouse usage statistics, error reports, audit trails)nThe algorithms used for summariza
19、tionnThe mapping from operational environment to the data warehousenData related to system performancenwarehouse schema, view and derived data definitionsnBusiness datanbusiness terms and definitions, ownership of data, charging policies15Chapter 4: Data Warehousing and On-line Analytical Processing
20、nData Warehouse: Basic ConceptsnData Warehouse Modeling: Data Cube and OLAPnData Warehouse Design and UsagenData Warehouse ImplementationnSummary16From Tables and Spreadsheets to Data CubesnA data warehouse is based on a multidimensional data model which views data in the form of a data cubenA data
21、cube, such as sales, allows data to be modeled and viewed in multiple dimensionsnDimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) nFact table contains measures (such as dollars_sold) and keys to each of the related dimension tablesnIn data warehousing
22、 literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.17Cube: A Lattice of Cuboidstime,itemtime,item,locationtime, item, location, supplieralltimeitemlocation
23、suppliertime,locationtime,supplieritem,locationitem,supplierlocation,suppliertime,item,suppliertime,location,supplieritem,location,supplier0-D (apex) cuboid1-D cuboids2-D cuboids3-D cuboids4-D (base) cuboid18Conceptual Modeling of Data WarehousesnModeling data warehouses: dimensions & measuresnStar
24、schema: A fact table in the middle connected to a set of dimension tables nSnowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflakenFact constellations: Multiple fact tables share dimensi
25、on tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 19Example of Star Schema time_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcitystate_or_provincecountrylocationSales Fact Table time_key item_key branch_key location_key units_sold dollars_
26、sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typeitembranch_keybranch_namebranch_typebranch20Example of Snowflake Schematime_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcity_keylocationSales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_s
27、alesMeasuresitem_keyitem_namebrandtypesupplier_keyitembranch_keybranch_namebranch_typebranchsupplier_keysupplier_typesuppliercity_keycitystate_or_provincecountrycity21Example of Fact Constellationtime_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcityprovince_or_statecountrylocationSale
28、s Fact Tabletime_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typeitembranch_keybranch_namebranch_typebranchShipping Fact Tabletime_key item_key shipper_key from_location to_location dollars_cost units_shippedshipper_keyshipper_name
29、location_keyshipper_typeshipper22A Concept Hierarchy: Dimension (location)allEuropeNorth_AmericaMexicoCanadaSpainGermanyVancouverM. WindL. Chan.allregionofficecountryTorontoFrankfurtcity23Data Cube Measures: Three CategoriesnDistributive: if the result derived by applying the function to n aggregate
30、 values is the same as that derived by applying the function on all the data without partitioningnE.g., count(), sum(), min(), max()nAlgebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggrega
31、te functionnE.g., avg(), min_N(), standard_deviation()nHolistic: if there is no constant bound on the storage size needed to describe a subaggregate. nE.g., median(), mode(), rank()24View of Warehouses and HierarchiesSpecification of hierarchiesnSchema hierarchyday month quarter; week yearnSet_group
32、ing hierarchy1.10 inexpensive25Multidimensional DatanSales volume as a function of product, month, and regionProductRegionMonthDimensions: Product, Location, TimeHierarchical summarization pathsIndustry Region YearCategory Country QuarterProduct City Month Week Office Day26A Sample Data CubeTotal an
33、nual salesof TVs in U.S.A.DateProductCountrysumsum TVVCRPC1Qtr2Qtr3Qtr4QtrU.S.ACanadaMexicosum27Cuboids Corresponding to the Cubeallproductdatecountryproduct,dateproduct,countrydate, countryproduct, date, country0-D (apex) cuboid1-D cuboids2-D cuboids3-D (base) cuboid28Typical OLAP OperationsnRoll u
34、p (drill-up): summarize datanby climbing up hierarchy or by dimension reductionnDrill down (roll down): reverse of roll-upnfrom higher level summary to lower level summary or detailed data, or introducing new dimensionsnSlice and dice: project and select nPivot (rotate): nreorient the cube, visualiz
35、ation, 3D to series of 2D planesnOther operationsndrill across: involving (across) more than one fact tablendrill through: through the bottom level of the cube to its back-end relational tables (using SQL)29Fig. 3.10 Typical OLAP Operations30A Star-Net Query Model Shipping MethodAIR-EXPRESSTRUCKORDE
36、RCustomer OrdersCONTRACTSCustomerProductPRODUCT GROUPPRODUCT LINEPRODUCT ITEMSALES PERSONDISTRICTDIVISIONOrganizationPromotionCITYCOUNTRYREGIONLocationDAILYQTRLYANNUALYTimeEach circle is called a footprint31Browsing a Data CubenVisualizationnOLAP capabilitiesnInteractive manipulation32Chapter 4: Dat
37、a Warehousing and On-line Analytical ProcessingnData Warehouse: Basic ConceptsnData Warehouse Modeling: Data Cube and OLAPnData Warehouse Design and UsagenData Warehouse ImplementationnSummary33Design of Data Warehouse: A Business Analysis FrameworknFour views regarding the design of a data warehous
38、e nTop-down viewnallows selection of the relevant information necessary for the data warehousenData source viewnexposes the information being captured, stored, and managed by operational systemsnData warehouse viewnconsists of fact tables and dimension tablesnBusiness query view nsees the perspectiv
39、es of data in the warehouse from the view of end-user34Data Warehouse Design Process nTop-down, bottom-up approaches or a combination of bothnTop-down: Starts with overall design and planning (mature)nBottom-up: Starts with experiments and prototypes (rapid)nFrom software engineering point of viewnW
40、aterfall: structured and systematic analysis at each step before proceeding to the nextnSpiral: rapid generation of increasingly functional systems, short turn around time, quick turn aroundnTypical data warehouse design processnChoose a business process to model, e.g., orders, invoices, etc.nChoose
41、 the grain (atomic level of data) of the business processnChoose the dimensions that will apply to each fact table recordnChoose the measure that will populate each fact table record35Data Warehouse Development: A Recommended ApproachDefine a high-level corporate data modelData MartData MartDistribu
42、ted Data MartsMulti-Tier Data WarehouseEnterprise Data WarehouseModel refinementModel refinement36Data Warehouse UsagenThree kinds of data warehouse applicationsnInformation processingnsupports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphsnAnalytical
43、processingnmultidimensional analysis of data warehouse datansupports basic OLAP operations, slice-dice, drilling, pivotingnData miningnknowledge discovery from hidden patterns nsupports associations, constructing analytical models, performing classification and prediction, and presenting the mining
44、results using visualization tools37From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM)nWhy online analytical mining?nHigh quality of data in data warehousesnDW contains integrated, consistent, cleaned datanAvailable information processing structure surrounding data warehous
45、esnODBC, OLEDB, Web accessing, service facilities, reporting and OLAP toolsnOLAP-based exploratory data analysisnMining with drilling, dicing, pivoting, etc.nOn-line selection of data mining functionsnIntegration and swapping of multiple mining functions, algorithms, and tasks38Chapter 4: Data Wareh
46、ousing and On-line Analytical ProcessingnData Warehouse: Basic ConceptsnData Warehouse Modeling: Data Cube and OLAPnData Warehouse Design and UsagenData Warehouse ImplementationnSummary39Efficient Data Cube ComputationnData cube can be viewed as a lattice of cuboids nThe bottom-most cuboid is the ba
47、se cuboidnThe top-most cuboid (apex) contains only one cellnHow many cuboids in an n-dimensional cube with L levels?nMaterialization of data cubenMaterialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)nSelection of which cuboids to materializen
48、Based on size, sharing, access frequency, etc.)11(niiLT40The “Compute Cube” OperatornCube definition and computation in DMQLdefine cube sales item, city, year: sum (sales_in_dollars)compute cube salesnTransform it into a SQL-like language (with a new operator cube by, introduced by Gray et al.96)SEL
49、ECT item, city, year, SUM (amount)FROM SALESCUBE BY item, city, yearnNeed compute the following Group-Bys (date, product, customer),(date,product),(date, customer), (product, customer),(date), (product), (customer)() (item)(city)()(year)(city, item)(city, year)(item, year)(city, item, year)41Indexin
50、g OLAP Data: Bitmap IndexnIndex on a particular columnnEach value in the column has a bit vector: bit-op is fastnThe length of the bit vector: # of records in the base tablenThe i-th bit is set if the i-th row of the base table has the value for the indexed columnnnot suitable for high cardinality d
51、omainsnA recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality domain as well Wu, et al. TODS06Cust Region TypeC1AsiaRetailC2EuropeDealerC3AsiaDealerC4America RetailC5EuropeDealerRecID Retail Dealer110201301410501RecIDAsia Europe America11002010310040015010B
52、ase tableIndex on RegionIndex on Type42Indexing OLAP Data: Join IndicesnJoin index: JI(R-id, S-id) where R (R-id, ) S (S-id, )nTraditional indices map the values to a list of record idsnIt materializes relational join in JI file and speeds up relational join nIn data warehouses, join index relates t
53、he values of the dimensions of a start schema to rows in the fact table.nE.g. fact table: Sales and two dimensions city and productnA join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city nJoin indices can span multiple dimensions43Efficien
54、t Processing OLAP QueriesnDetermine which operations should be performed on the available cuboidsnTransform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projectionnDetermine which materialized cuboid(s) should be selected for OLAP op.nLet the query to be
55、processed be on brand, province_or_state with the condition “year = 2004”, and there are 4 materialized cuboids available:1) year, item_name, city 2) year, brand, country3) year, brand, province_or_state4) item_name, province_or_state where year = 2004Which should be selected to process the query?nE
56、xplore indexing structures and compressed vs. dense array structs in MOLAP44OLAP Server ArchitecturesnRelational OLAP (ROLAP) nUse relational or extended-relational DBMS to store and manage warehouse data and OLAP middle warenInclude optimization of DBMS backend, implementation of aggregation naviga
57、tion logic, and additional tools and servicesnGreater scalabilitynMultidimensional OLAP (MOLAP) nSparse array-based multidimensional storage engine nFast indexing to pre-computed summarized datanHybrid OLAP (HOLAP) (e.g., Microsoft SQLServer)nFlexibility, e.g., low level: relational, high-level: arr
58、aynSpecialized SQL servers (e.g., Redbricks) nSpecialized support for SQL queries over star/snowflake schemas45Chapter 4: Data Warehousing and On-line Analytical ProcessingnData Warehouse: Basic ConceptsnData Warehouse Modeling: Data Cube and OLAPnData Warehouse Design and UsagenData Warehouse Imple
59、mentationnSummary46SummarynData warehousing: A multi-dimensional model of a data warehousenA data cube consists of dimensions & measuresnStar schema, snowflake schema, fact constellationsnOLAP operations: drilling, rolling, slicing, dicing and pivotingnData Warehouse Architecture, Design, and Usagen
60、Multi-tiered architecturenBusiness analysis design frameworknInformation processing, analytical processing, data mining, OLAM (Online Analytical Mining)nImplementation: Efficient computation of data cubesnPartial vs. full vs. no materializationnIndexing OALP data: Bitmap index and join indexnOLAP qu
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
评论
0/150
提交评论