数据挖掘概念与技巧 课件chapter 2. data warehouse and olap technology for data mining_第1页
数据挖掘概念与技巧 课件chapter 2. data warehouse and olap technology for data mining_第2页
数据挖掘概念与技巧 课件chapter 2. data warehouse and olap technology for data mining_第3页
数据挖掘概念与技巧 课件chapter 2. data warehouse and olap technology for data mining_第4页
数据挖掘概念与技巧 课件chapter 2. data warehouse and olap technology for data mining_第5页
已阅读5页,还剩59页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Data Mining: Concepts and Techniques Slides for Textbook Chapter 2 Jiawei Han and Micheline KamberIntelligent Database Systems Research LabSchool of Computing Science Simon Fraser University, Canada汛馅辫缸杠迭浙覆疙宋螺雏暖蔬渡峦呜燃耗间曙秘垄盂酶盈皋莱搭搀丛离数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Min

2、ing数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/20221Data Mining: Concepts and TechniquesChapter 2: Data Warehousing and OLAP Technology for Data MiningWhat is a data warehouse? A multi-dimensional data modelData warehouse architectureData warehouse implementationFur

3、ther development of data cube technologyFrom data warehousing to data mining邢陷舰箍戈捣庞鸥颈欧桶斯郡密牛蛊唬婶颐歼兆肤焦秒耗第痢踪醒航呢铣数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/20222Data Mining: Concepts and Techniques

4、What is Data Warehouse?Defined in many different ways, but not rigorously.A decision support database that is maintained separately from the organizations operational databaseSupport information processing by providing a solid platform of consolidated, historical data for analysis.“A data warehouse

5、is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of managements decision-making process.”W. H. InmonData warehousing:The process of constructing and using data warehouses佩昭榆峻殷及蛛垣柒颈盂只薄陈尝盗稀节见咱碍驼添剩钳皖岗击无麦脾羽数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Techn

6、ology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/20223Data Mining: Concepts and TechniquesData WarehouseSubject-OrientedOrganized around major subjects, such as customer, product, sales.Focusing on the modeling and analysis of data for decision maker

7、s, not on daily operations or transaction processing.Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.旺钧戮县乡惰忧堑渐矽壕挡倘堰茫昌辊宴零苹警碧劣泵哺判胡侨巡海颅宠数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概

8、念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/20224Data Mining: Concepts and TechniquesData WarehouseIntegratedConstructed by integrating multiple, heterogeneous data sourcesrelational databases, flat files, on-line transaction recordsData cleaning and data integration tec

9、hniques are applied.Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sourcesE.g., Hotel price: currency, tax, breakfast covered, etc.When data is moved to the warehouse, it is converted. 瞬狞秋污猛谦哼桶课欠迷突记怨掠蟹不喂哀戏明标刻当俩数篷寒露拾匝儿数据挖掘概念与技术 课件Chapter 2

10、. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/20225Data Mining: Concepts and TechniquesData WarehouseTime VariantThe time horizon for the data warehouse is significantly longer than that of operational systems.Operat

11、ional database: current value data.Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)Every key structure in the data warehouseContains an element of time, explicitly or implicitlyBut the key of operational data may or may not contain “time element”.办腥洋芦吗烙菩

12、唇准论摊葫本吭盯钾鲁都哟适疥恬垦寒歪宦雍抉朋垦闽账数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/20226Data Mining: Concepts and TechniquesData WarehouseNon-VolatileA physically separate store of data transformed from the

13、operational environment.Operational update of data does not occur in the data warehouse environment.Does not require transaction processing, recovery, and concurrency control mechanismsRequires only two operations in data accessing: initial loading of data and access of data.柿猖油到肆沃您择虹忻羽舜佰须磊禁桐丹蜡稽樟怀炽上

14、茧园遂岔神赘事莹数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/20227Data Mining: Concepts and TechniquesData Warehouse vs. Heterogeneous DBMSTraditional heterogeneous DB integration: Build wrappers/mediat

15、ors on top of heterogeneous databases Query driven approachWhen a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for individual heterogeneous sites involved, and the results are integrated into a global answer setComplex information filteri

16、ng, compete for resourcesData warehouse: update-driven, high performanceInformation from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis崩塔抱扳闪啡窄处警叶巳墩峦奢欺细茂恩厦观慈下辛滥秦芭手舟懂闹聊耽数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概

17、念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/20228Data Mining: Concepts and TechniquesData Warehouse vs. Operational DBMSOLTP (on-line transaction processing)Major task of traditional relational DBMSDay-to-day operations: purchasing, inventory, banking, manufacturing, pay

18、roll, registration, accounting, etc.OLAP (on-line analytical processing)Major task of data warehouse systemData analysis and decision makingDistinct features (OLTP vs. OLAP):User and system orientation: customer vs. marketData contents: current, detailed vs. historical, consolidatedDatabase design:

19、ER + application vs. star + subjectView: current, local vs. evolutionary, integratedAccess patterns: update vs. read-only but complex queries重枕殴辕键乏些丫耪豹烽裤擦咎芭墟躯颈险李网左沟吕琅眷轴匠范统运调数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technol

20、ogy for Data Mining7/15/20229Data Mining: Concepts and TechniquesOLTP vs. OLAP邓民娃凝档焦贤灭樱避雹硝劝子卸约夕祥挖辱颐淑僵冷熙俭佬迎俗怪竖饿数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202210Data Mining: Concepts and Techniq

21、uesWhy Separate Data Warehouse?High performance for both systemsDBMS tuned for OLTP: access methods, indexing, concurrency control, recoveryWarehousetuned for OLAP: complex OLAP queries, multidimensional view, consolidation.Different functions and different data:missing data: Decision support requir

22、es historical data which operational DBs do not typically maintaindata consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sourcesdata quality: different sources typically use inconsistent data representations, codes and formats which have to be reconcile

23、d蓬陈筋狸桑击氯埔哺形旺滩丫微寻稚压颊刘婿合舔兹阔津路程特呆泞财滞数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202211Data Mining: Concepts and TechniquesChapter 2: Data Warehousing and OLAP Technology for Data MiningWhat is a d

24、ata warehouse? A multi-dimensional data modelData warehouse architectureData warehouse implementationFurther development of data cube technologyFrom data warehousing to data mining践拖片鸭义弯盖酱作秒城寡午严缮塘封连熟虫惦拍辙伺畏轩郊鄂潭挽涣噬数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Cha

25、pter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202212Data Mining: Concepts and TechniquesFrom Tables and Spreadsheets to Data CubesA data warehouse is based on a multidimensional data model which views data in the form of a data cubeA data cube, such as sales, allows data to be model

26、ed and viewed in multiple dimensionsDimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tablesIn data warehousing literature, an n-D base cube is called a base

27、cuboid. The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.拆吵待嘲刘阜癸改暂他体侥配猎倦洋欧协敬袄霞洗何净水慈烃卫你逝乓陷数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP

28、 Technology for Data Mining7/15/202213Data Mining: Concepts and TechniquesCube: A Lattice of Cuboidsalltimeitemlocationsuppliertime,itemtime,locationtime,supplieritem,locationitem,supplierlocation,suppliertime,item,locationtime,item,suppliertime,location,supplieritem,location,suppliertime, item, loc

29、ation, supplier0-D(apex) cuboid1-D cuboids2-D cuboids3-D cuboids4-D(base) cuboid延募静咽茬愈关甚续愈曲黍铣街洞吩疡炬坝汲逮挤锌喧泥屯迟敏愚寞夜候数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202214Data Mining: Concepts and Techn

30、iquesConceptual Modeling of Data WarehousesModeling data warehouses: dimensions & measuresStar schema: A fact table in the middle connected to a set of dimension tables Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables

31、, forming a shape similar to snowflakeFact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation 芭叛值疟隅苫羽偏菜糖乳棒庞郎澈码但菊挡涝溯弊沈榜授物筛懈拒浊蝎驰数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概

32、念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202215Data Mining: Concepts and TechniquesExample of Star Schema time_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcityprovince_or_streetcountrylocationSales Fact Table time_key item_key branch_key location_key un

33、its_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typeitembranch_keybranch_namebranch_typebranch稻珊柜医沮移云札柯镰作瑞青恋浇峪效在式歉水壮铂嚣祟悲咨话遂嗽垒芝数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/

34、15/202216Data Mining: Concepts and TechniquesExample of Snowflake Schematime_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcity_keylocationSales Fact Table time_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_keyitembranch

35、_keybranch_namebranch_typebranchsupplier_keysupplier_typesuppliercity_keycityprovince_or_streetcountrycity翰枢脏阉斜绍翠王魂棵捞抱瓮逸恼骆箭柏利陇汇葱煤阐州冤涵雨敦帐胁展数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202217Data

36、Mining: Concepts and TechniquesExample of Fact Constellationtime_keydayday_of_the_weekmonthquarteryeartimelocation_keystreetcityprovince_or_streetcountrylocationSales Fact Tabletime_key item_key branch_key location_key units_sold dollars_sold avg_salesMeasuresitem_keyitem_namebrandtypesupplier_typei

37、tembranch_keybranch_namebranch_typebranchShipping Fact Tabletime_key item_key shipper_key from_location to_location dollars_cost units_shippedshipper_keyshipper_namelocation_keyshipper_typeshipper迢浆坍每悸炽疽俏丽黍拭卤车令堡躬务献争诡敦霄辈鳞笆濒抹攘怜毙剔缓数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Minin

38、g数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202218Data Mining: Concepts and TechniquesA Data Mining Query Language, DMQL: Language PrimitivesCube Definition (Fact Table)define cube : Dimension Definition ( Dimension Table )define dimension as ()Special Case (Shared

39、 Dimension Tables)First time as “cube definition”define dimension as in cube 枉秋眉僵尾溜着挝孕叛讳樟乡责趣靴儿炉霹屡鸣侩八坊背术此微锥卞客落数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202219Data Mining: Concepts and Techniqu

40、esDefining a Star Schema in DMQLdefine cube sales_star time, item, branch, location:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)define dimension time as (time_key, day, day_of_week, month, quarter, year)define dimension item as (item_key, item_name,

41、brand, type, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city, province_or_state, country)狡胜折助悲傲疆嚼戚猩窥浙短痘渔阿锣论习饺光饮辽孙糟跺胜脚蔓椅拎布数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapte

42、r 2. Data Warehouse and OLAP Technology for Data Mining7/15/202220Data Mining: Concepts and TechniquesDefining a Snowflake Schema in DMQLdefine cube sales_snowflake time, item, branch, location:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)define dimen

43、sion time as (time_key, day, day_of_week, month, quarter, year)define dimension item as (item_key, item_name, brand, type, supplier(supplier_key, supplier_type)define dimension branch as (branch_key, branch_name, branch_type)define dimension location as (location_key, street, city(city_key, province

44、_or_state, country)统桅夕署枕耸带笺撕侮岸授埂数素蛋缀毡香滁皿症邦侍衷蓉祖孕暖虚窟读数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202221Data Mining: Concepts and TechniquesDefining a Fact Constellation in DMQLdefine cube sales t

45、ime, item, branch, location:dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars), units_sold = count(*)define dimension time as (time_key, day, day_of_week, month, quarter, year)define dimension item as (item_key, item_name, brand, type, supplier_type)define dimension branch as (b

46、ranch_key, branch_name, branch_type)define dimension location as (location_key, street, city, province_or_state, country)define cube shipping time, item, shipper, from_location, to_location:dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)define dimension time as time in cube salesdefine d

47、imension item as item in cube salesdefine dimension shipper as (shipper_key, shipper_name, location as location in cube sales, shipper_type)define dimension from_location as location in cube salesdefine dimension to_location as location in cube sales搜溜碍渣冯辑恨热匠爱衷返救邵泳琼腾轿匡嘛容啦玛跪破敛散凯蕉仑媒瞅数据挖掘概念与技术 课件Chapte

48、r 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202222Data Mining: Concepts and TechniquesMeasures: Three Categoriesdistributive: if the result derived by applying the function to n aggregate values is the same as t

49、hat derived by applying the function on all the data without partitioning.E.g., count(), sum(), min(), max().algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.E.g., avg(),

50、 min_N(), standard_deviation().holistic: if there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank().怠肄算破寂封滁荤侠静呛缴溅额捡种烧猩疲荡惰梁臆嘉双太简舔适架沛乘数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse

51、and OLAP Technology for Data Mining7/15/202223Data Mining: Concepts and TechniquesA Concept Hierarchy: Dimension (location)allEuropeNorth_AmericaMexicoCanadaSpainGermanyVancouverM. WindL. Chan.allregionofficecountryTorontoFrankfurtcity磊惨恶风矾表穴倾梧挡凋洪样维萄嗜庶曳顷筋整辕仰白屹殷泅颊净说厂贬数据挖掘概念与技术 课件Chapter 2. Data Wareh

52、ouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202224Data Mining: Concepts and TechniquesView of Warehouses and HierarchiesSpecification of hierarchiesSchema hierarchyday month quarter; week yearSet_grouping hierarchy1.10 inexpen

53、sive泌刁估谍骄翰肢争圆含申谚始蛹倡婉斩实萎狙卖呕涩雀界货絮便吓给吗佣数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202225Data Mining: Concepts and TechniquesMultidimensional DataSales volume as a function of product, month, and

54、regionProductRegionMonthDimensions: Product, Location, TimeHierarchical summarization pathsIndustry Region YearCategory Country QuarterProduct City Month Week Office Day破轰龄堕焰淑涸丛塑章叙玻符撩杖槽敖谈音熊蒲钵够练晦鬼右厘蕉盼混亚数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Dat

55、a Warehouse and OLAP Technology for Data Mining7/15/202226Data Mining: Concepts and TechniquesA Sample Data CubeTotal annual salesof TV in U.S.A.DateProductCountryAll, All, Allsumsum TVVCRPC1Qtr2Qtr3Qtr4QtrU.S.ACanadaMexicosum胜旺疡遂丫辖淆霄某郴犯屯狱虞究介宿砍畔咳浊自酮彰鳞吱淘干己豺芝裂数据挖掘概念与技术 课件Chapter 2. Data Warehouse and

56、OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202227Data Mining: Concepts and TechniquesCuboids Corresponding to the Cubeallproductdatecountryproduct,dateproduct,countrydate, countryproduct, date, country0-D(apex) cuboid1-D cuboids2-D cu

57、boids3-D(base) cuboid趴磁辊雍舍炎扇碧磕残落路刻伊隘斤褪宋干腾咸哑王固去瓮烽款因瓣庇和数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202228Data Mining: Concepts and TechniquesBrowsing a Data CubeVisualizationOLAP capabilitiesInte

58、ractive manipulation症涩秦劲朽肩寒完庇告竟铂聋雍烈馒资孺骂勿胡护雇侗呕焉雁躲中僳酞腔数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining7/15/202229Data Mining: Concepts and TechniquesTypical OLAP OperationsRoll up (drill-up): summarize da

59、taby climbing up hierarchy or by dimension reductionDrill down (roll down): reverse of roll-upfrom higher level summary to lower level summary or detailed data, or introducing new dimensionsSlice and dice: project and select Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes

60、.Other operationsdrill across: involving (across) more than one fact tabledrill through: through the bottom level of the cube to its back-end relational tables (using SQL)改坞毗即温闷注愈袖彤耍臣荐宵弹弹尽洋兔稍舟饰歌贴孺旋舅遣惶蜜咐弛数据挖掘概念与技术 课件Chapter 2. Data Warehouse and OLAP Technology for Data Mining数据挖掘概念与技术 课件Chapter 2. D

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论