韩家炜数据挖掘讲座_第1页
韩家炜数据挖掘讲座_第2页
韩家炜数据挖掘讲座_第3页
韩家炜数据挖掘讲座_第4页
韩家炜数据挖掘讲座_第5页
已阅读5页,还剩53页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、August 14, 2022Data Mining: Concepts and Techniques1Data Mining: Concepts and Techniques Chapter 3 Jiawei HanDepartment of Computer Science University of Illinois at Urbana-Champaign 2006 Jiawei Han and Micheline Kamber, All rights reservedAugust 14, 2022Data Mining: Concepts and Techniques2August 1

2、4, 2022Data Mining: Concepts and Techniques3Chapter 3: Data Warehousing and OLAP Technology: An OverviewWhat is a data warehouse? A multi-dimensional data modelData warehouse architectureData warehouse implementationFrom data warehousing to data miningAugust 14, 2022Data Mining: Concepts and Techniq

3、ues4What is Data Warehouse?Defined in many different ways, but not rigorously.A decision support database that is maintained separately from the organizations operational databaseSupport information processing by providing a solid platform of consolidated, historical data for analysis.“A data wareho

4、use is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of managements decision-making process.”W. H. InmonData warehousing:The process of constructing and using data warehousesAugust 14, 2022Data Mining: Concepts and Techniques5Data WarehouseSubject-Orient

5、edOrganized around major subjects, such as customer, product, salesFocusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processingProvide a simple and concise view around particular subject issues by excluding data that are not useful in the decis

6、ion support processAugust 14, 2022Data Mining: Concepts and Techniques6Data WarehouseIntegratedConstructed by integrating multiple, heterogeneous data sourcesrelational databases, flat files, on-line transaction recordsData cleaning and data integration techniques are applied.Ensure consistency in n

7、aming conventions, encoding structures, attribute measures, etc. among different data sourcesE.g., Hotel price: currency, tax, breakfast covered, etc.When data is moved to the warehouse, it is converted. August 14, 2022Data Mining: Concepts and Techniques7Data WarehouseTime VariantThe time horizon f

8、or the data warehouse is significantly longer than that of operational systemsOperational database: current value dataData warehouse data: provide information from a historical perspective (e.g., past 5-10 years)Every key structure in the data warehouseContains an element of time, explicitly or impl

9、icitlyBut the key of operational data may or may not contain “time element”August 14, 2022Data Mining: Concepts and Techniques8Data WarehouseNonvolatileA physically separate store of data transformed from the operational environmentOperational update of data does not occur in the data warehouse envi

10、ronmentDoes not require transaction processing, recovery, and concurrency control mechanismsRequires only two operations in data accessing: initial loading of data and access of dataAugust 14, 2022Data Mining: Concepts and Techniques9Data Warehouse vs. Heterogeneous DBMSTraditional heterogeneous DB

11、integration: A query driven approachBuild wrappers/mediators on top of heterogeneous databases When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global a

12、nswer setComplex information filtering, compete for resourcesData warehouse: update-driven, high performanceInformation from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysisAugust 14, 2022Data Mining: Concepts and Techniques10Data Warehouse vs. Op

13、erational DBMSOLTP (on-line transaction processing)Major task of traditional relational DBMSDay-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.OLAP (on-line analytical processing)Major task of data warehouse systemData analysis and decision m

14、akingDistinct features (OLTP vs. OLAP):User and system orientation: customer vs. marketData contents: current, detailed vs. historical, consolidatedDatabase design: ER + application vs. star + subjectView: current, local vs. evolutionary, integratedAccess patterns: update vs. read-only but complex q

15、ueriesAugust 14, 2022Data Mining: Concepts and Techniques11OLTP vs. OLAPAugust 14, 2022Data Mining: Concepts and Techniques12Why Separate Data Warehouse?High performance for both systemsDBMS tuned for OLTP: access methods, indexing, concurrency control, recoveryWarehousetuned for OLAP: complex OLAP

16、queries, multidimensional view, consolidationDifferent functions and different data:missing data: Decision support requires historical data which operational DBs do not typically maintaindata consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sourcesdata

17、 quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciledNote: There are more and more systems which perform OLAP analysis directly on relational databasesAugust 14, 2022Data Mining: Concepts and Techniques13Chapter 3: Data Warehousing

18、 and OLAP Technology: An OverviewWhat is a data warehouse? A multi-dimensional data modelData warehouse architectureData warehouse implementationFrom data warehousing to data miningAugust 14, 2022Data Mining: Concepts and Techniques14From Tables and Spreadsheets to Data CubesA data warehouse is base

19、d on a multidimensional data model which views data in the form of a data cubeA data cube, such as sales, allows data to be modeled and viewed in multiple dimensionsDimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as

20、 dollars_sold) and keys to each of the related dimension tablesIn data warehousing literature, an n-D base cube is called a base cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.August 14, 2022Data Mi

21、ning: Concepts and Techniques15Cube: A Lattice of Cuboidstime,itemtime,item,locationtime, item, location, supplieralltimeitemlocationsuppliertime,locationtime,supplieritem,locationitem,supplierlocation,suppliertime,item,suppliertime,location,supplieritem,location,supplier0-D(apex) cuboid1-D cuboids2

22、-D cuboids3-D cuboids4-D(base) cuboidAugust 14, 2022Data Mining: Concepts and Techniques16Conceptual Modeling of Data WarehousesModeling data warehouses: dimensions & measuresStar schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema

23、where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflakeFact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation August 14, 2022Data Minin

24、g: Concepts and Techniques17Example of Star Schema time_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcitystate_or_provincecountrylocationSales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typeitembranch

25、_keybranch_namebranch_typebranchAugust 14, 2022Data Mining: Concepts and Techniques18Example of Snowflake Schematime_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcity_keylocationSales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_key

26、item_namebrandtypesupplier_keyitembranch_keybranch_namebranch_typebranchsupplier_keysupplier_typesuppliercity_keycitystate_or_provincecountrycityAugust 14, 2022Data Mining: Concepts and Techniques19Example of Fact Constellationtime_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcityprovi

27、nce_or_statecountrylocationSales Fact Tabletime_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typeitembranch_keybranch_namebranch_typebranchShipping Fact Tabletime_key item_key shipper_key from_location to_location dollars_cost units

28、_shippedshipper_keyshipper_namelocation_keyshipper_typeshipperAugust 14, 2022Data Mining: Concepts and Techniques20Cube Definition Syntax (BNF) in DMQLCube Definition (Fact Table)define cube : Dimension Definition (Dimension Table)define dimension as ()Special Case (Shared Dimension Tables)First tim

29、e as “cube definition”define dimension as in cube August 14, 2022Data Mining: Concepts and Techniques21Defining Star Schema in DMQLdefine cube sales_star time, item, branch, location:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)define dimension time a

30、s (time_key, day, day_of_week, month, quarter, year)define dimension item as (item_key, item_name, brand, type, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city, province_or_state, country)August 14, 2022Data Mini

31、ng: Concepts and Techniques22Defining Snowflake Schema in DMQLdefine cube sales_snowflake time, item, branch, location:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)define dimension time as (time_key, day, day_of_week, month, quarter, year)define dimen

32、sion item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city(city_key, province_or_state, country)August 14, 2022Data Mining: Concepts and Techniques23Def

33、ining Fact Constellation in DMQLdefine cube sales time, item, branch, location:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)define dimension time as (time_key, day, day_of_week, month, quarter, year)define dimension item as (item_key, item_name, brand

34、, type, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city, province_or_state, country)define cube shipping time, item, shipper, from_location, to_location:dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)

35、define dimension time as time in cube salesdefine dimension item as item in cube salesdefine dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)define dimension from_location as location in cube salesdefine dimension to_location as location in cube sale

36、sAugust 14, 2022Data Mining: Concepts and Techniques24Measures of Data Cube: Three CategoriesDistributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioningE.g., count(), sum(), min(), max()A

37、lgebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate functionE.g., avg(), min_N(), standard_deviation()Holistic: if there is no constant bound on the storage size needed to describe a

38、subaggregate. E.g., median(), mode(), rank()August 14, 2022Data Mining: Concepts and Techniques25A Concept Hierarchy: Dimension (location)allEuropeNorth_AmericaMexicoCanadaSpainGermanyVancouverM. WindL. Chan.allregionofficecountryTorontoFrankfurtcityAugust 14, 2022Data Mining: Concepts and Technique

39、s26View of Warehouses and HierarchiesSpecification of hierarchiesSchema hierarchyday month quarter; week yearSet_grouping hierarchy1.10 = minsupMotivationOnly a small portion of cube cells may be “above the water in a sparse cubeOnly calculate “interesting” cellsdata above certain thresholdAvoid exp

40、losive growth of the cubeSuppose 100 dimensions, only 1 base cell. How many aggregate cells if count = 1? What about count = 2?August 14, 2022Data Mining: Concepts and Techniques47Indexing OLAP Data: Bitmap IndexIndex on a particular columnEach value in the column has a bit vector: bit-op is fastThe

41、 length of the bit vector: # of records in the base tableThe i-th bit is set if the i-th row of the base table has the value for the indexed columnnot suitable for high cardinality domainsBase tableIndex on RegionIndex on TypeAugust 14, 2022Data Mining: Concepts and Techniques48Indexing OLAP Data: J

42、oin IndicesJoin index: JI(R-id, S-id) where R (R-id, ) S (S-id, )Traditional indices map the values to a list of record idsIt materializes relational join in JI file and speeds up relational join In data warehouses, join index relates the values of the dimensions of a start schema to rows in the fac

43、t table.E.g. fact table: Sales and two dimensions city and productA join index on city maintains for each distinct city a list of R-IDs of the tuples recording the Sales in the city Join indices can span multiple dimensionsAugust 14, 2022Data Mining: Concepts and Techniques49Efficient Processing OLA

44、P QueriesDetermine which operations should be performed on the available cuboidsTransform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g., dice = selection + projectionDetermine which materialized cuboid(s) should be selected for OLAP op.Let the query to be processed be on bran

45、d, province_or_state with the condition “year = 2004”, and there are 4 materialized cuboids available:1) year, item_name, city 2) year, brand, country3) year, brand, province_or_state4) item_name, province_or_state where year = 2004Which should be selected to process the query?Explore indexing struc

46、tures and compressed vs. dense array structs in MOLAPAugust 14, 2022Data Mining: Concepts and Techniques50Chapter 3: Data Warehousing and OLAP Technology: An OverviewWhat is a data warehouse? A multi-dimensional data modelData warehouse architectureData warehouse implementationFrom data warehousing

47、to data miningAugust 14, 2022Data Mining: Concepts and Techniques51Data Warehouse UsageThree kinds of data warehouse applicationsInformation processingsupports querying, basic statistical analysis, and reporting using crosstabs, tables, charts and graphsAnalytical processingmultidimensional analysis

48、 of data warehouse datasupports basic OLAP operations, slice-dice, drilling, pivotingData miningknowledge discovery from hidden patterns supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization toolsAugust

49、14, 2022Data Mining: Concepts and Techniques52From On-Line Analytical Processing (OLAP) to On Line Analytical Mining (OLAM)Why online analytical mining?High quality of data in data warehousesDW contains integrated, consistent, cleaned dataAvailable information processing structure surrounding data w

50、arehousesODBC, OLEDB, Web accessing, service facilities, reporting and OLAP toolsOLAP-based exploratory data analysisMining with drilling, dicing, pivoting, etc.On-line selection of data mining functionsIntegration and swapping of multiple mining functions, algorithms, and tasksAugust 14, 2022Data M

51、ining: Concepts and Techniques53An OLAM System ArchitectureData WarehouseMeta DataMDDBOLAMEngineOLAPEngineUser GUI APIData Cube APIDatabase APIData cleaningData integrationLayer3OLAP/OLAMLayer2MDDBLayer1Data RepositoryLayer4User InterfaceFiltering&IntegrationFilteringDatabasesMining queryMining resu

52、ltAugust 14, 2022Data Mining: Concepts and Techniques54Chapter 3: Data Warehousing and OLAP Technology: An OverviewWhat is a data warehouse? A multi-dimensional data modelData warehouse architectureData warehouse implementationFrom data warehousing to data miningSummaryAugust 14, 2022Data Mining: Co

53、ncepts and Techniques55Summary: Data Warehouse and OLAP TechnologyWhy data warehousing?A multi-dimensional model of a data warehouseStar schema, snowflake schema, fact constellationsA data cube consists of dimensions & measuresOLAP operations: drilling, rolling, slicing, dicing and pivotingData ware

54、house architectureOLAP servers: ROLAP, MOLAP, HOLAPEfficient computation of data cubesPartial vs. full vs. no materializationIndexing OALP data: Bitmap index and join indexOLAP query processing From OLAP to OLAM (on-line analytical mining)August 14, 2022Data Mining: Concepts and Techniques56References (I)S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB96D. Agrawal, A. E. Abbadi, A. Singh, and T. Yurek. Efficient vie

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论