Mastering Data Warehouse Design (精通数据仓库设计)

Part One: Concepts

We have found that understanding why a particular approach is adopted helps us recognize its value and apply it. We therefore begin this part with an introduction to the Corporate Information Factory (CIF), a proven and stable architecture. Within this architecture, business intelligence (BI) relies on two kinds of data stores, each with a specific role in the BI environment. The first is the data warehouse, whose major role is to serve as a data repository that stores data from disparate sources and makes it accessible to the second kind of data store, the data mart. In general, the most effective design approach for the data warehouse is based on the entity-relationship data model and the normalization techniques developed for relational databases by Codd and Date in the 1970s, 1980s, and 1990s.

The major role of the data mart is to give business users easy access to quality, integrated information. Chapter 1 describes several types of data marts. The most common type is built to support online analytical processing (OLAP), and the most effective design approach for it is the dimensional data model.

Chapter 2 continues this conceptual theme: it explains the most important relational modeling techniques, introduces the different types of models that are needed, and provides a process for building a relational model. It also explains the relationships among the business, system, and technology data models used to build a solid foundation for the enterprise, and how these models share or inherit characteristics from one another.

Chapter 1 Introduction

Welcome to the first book that thoroughly describes the data modeling techniques used in constructing a multipurpose, stable, and sustainable data warehouse used to support business intelligence (BI). This chapter introduces the data warehouse by describing the objectives of BI and the data warehouse and by explaining how these fit into the overall Corporate Information Factory (CIF) architecture. It discusses the iterative nature of data warehouse construction and demonstrates the importance of the data warehouse data model and the justification for the type of data model format suggested in this book. We discuss why the format of the model should be based on relational design techniques, illustrating the need to maximize nonredundancy, stability, and maintainability. Another section of the chapter outlines the characteristics of a maintainable data warehouse environment. The chapter ends with a discussion of the impact of this modeling approach on the ultimate delivery of the data marts. This chapter sets up the reader to understand the rationale behind the ensuing chapters, which describe in detail how to create the data warehouse data model.

Overview of Business Intelligence

BI, in the context of the data warehouse, is the ability of an enterprise to study past behaviors and actions in order to understand where the organization has been, determine its current situation, and predict or change what will happen in the future. BI has been maturing for more than 20 years. Let's briefly go over the past decade of this fascinating and innovative history.

You're probably familiar with the technology adoption curve. The first companies to adopt the new technology are called innovators. The next category is known as the early adopters, then there are members of the early majority, members of the late majority, and finally the laggards. The curve is a traditional bell curve, with exponential growth in the beginning and a slowdown in market growth occurring during the late majority period. When new technology is introduced, it is usually hard to get, expensive, and imperfect. Over time, its availability, cost, and features improve to the point where just about anyone can benefit from ownership. Cell phones are a good example of this. Once, only the innovators (doctors and lawyers?) carried them. The phones were big, heavy, and expensive. The service was spotty at best, and you got “dropped” a lot. Now there are deals where you can obtain a cell phone for about $60, the service provider throws in $25 of airtime, there are no monthly fees, and service is quite reliable.

Data warehousing is another good example of the adoption curve. In fact, if you haven't started your first data warehouse project, there has never been a better time. Executives today expect, and often get, most of the good, timely information they need to make informed decisions to lead their companies into the next decade. But this wasn't always the case.

Just a decade ago, these same executives sanctioned the development of executive information systems (EIS) to meet their needs. The concept behind EIS initiatives was sound: to provide executives with easily accessible key performance information in a timely manner. However, many of these systems fell short of their objectives, largely because the underlying architecture could not respond fast enough to the enterprise's changing environment. Another significant shortcoming of the early EIS days was the enormous effort required to provide the executives with the data they desired. Data acquisition, or the extract, transform, and load (ETL) process, is a complex set of activities whose sole purpose is to attain the most accurate and integrated data possible and make it accessible to the enterprise through the data warehouse or operational data store (ODS).

The entire process began as a manually intensive set of activities. Hard-coded “data suckers” were the only means of getting data out of the operational systems for access by business analysts. This is similar to the early days of telephony, when operators on skates had to connect your phone with the one you were calling by racing back and forth and manually plugging in the appropriate cords.

Fortunately, we have come a long way from those days, and the data warehouse industry has developed a plethora of tools and technologies to support the data acquisition process. Now, progress has allowed most of this process to be automated, as it has in today's telephony world. Also, similar to telephony advances, this process remains a difficult, if not temperamental and complicated, one. No two companies will ever have the same data acquisition activities or even the same set of problems. Today, most major corporations with significant data warehousing efforts rely heavily on their ETL tools for design, construction, and maintenance of their BI environments.

Another major change during the last decade is the introduction of tools and modeling techniques that bring the phrase “easy to use” to life. The dimensional modeling concepts developed by Dr. Ralph Kimball and others are largely responsible for the widespread use of multidimensional data marts to support online analytical processing (OLAP).

In addition to multidimensional analyses, other sophisticated technologies have evolved to support data mining, statistical analysis, and exploration needs. Mature BI environments now require much more than star schemas: flat files, statistical subsets of unbiased data, and normalized data structures, in addition to star schemas, are all significant data requirements that must be supported by your data warehouse.

Of course, we shouldn't underestimate the impact of the Internet on data warehousing. The Internet helped remove the mystique of the computer. Executives use the Internet in their daily lives and are no longer wary of touching the keyboard. The end-user tool vendors recognized the impact of the Internet, and most of them seized upon that realization by designing their interfaces to replicate some of the look-and-feel features of the popular Internet browsers and search engines. The sophistication and simplicity of these tools have led to widespread use of BI by business analysts and executives.

Another important event taking place in the last few years is the transformation from technology chasing the business to the business demanding technology. In the early days of BI, the information technology (IT) group recognized its value and tried to sell its merits to the business community. In some unfortunate cases, the IT folks set out to build a data warehouse with the hope that the business community would use it. Today, the value of a sophisticated decision support environment is widely recognized throughout the business. As an example, an effective customer relationship management program could not exist without strategic (data warehouse with associated marts) and tactical (operational data store and oper mart) decision-making capabilities (see Figure 1.1).

BI Architecture

One of the most significant developments during the last 10 years has been the introduction of a widely accepted architecture to support all BI technological demands. This architecture recognized that the EIS approach had several major flaws, the most significant of which was that the EIS data structures were often fed directly from source systems, resulting in a very complex data acquisition environment that required significant human and computer resources to maintain. The Corporate Information Factory (CIF) (see Figure 1.2), the architecture used in most decision support environments today, addressed that deficiency by segregating data into five major databases (operational systems, data warehouse, operational data store, data marts, and oper marts) and incorporating processes to effectively and efficiently move data from the source systems to the business users.
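To make the five-way split concrete, the sketch below models the stores as a small Python lookup. It is only an illustration: the names `DecisionSupport`, `DataStore`, `CIF_STORES`, `stores_for`, and `lineage` are hypothetical, and the feeding relationships and the strategic/tactical labels are taken from how this chapter describes each component, not from any tool or published schema.

```python
from dataclasses import dataclass
from enum import Enum


class DecisionSupport(Enum):
    OPERATIONS = "runs the day-to-day business"
    TACTICAL = "tactical decision support"
    STRATEGIC = "strategic decision support"


@dataclass(frozen=True)
class DataStore:
    name: str
    support: DecisionSupport
    fed_by: tuple  # names of upstream stores (empty for source systems)


# The five major databases of the CIF as named in this chapter.
CIF_STORES = (
    DataStore("operational systems", DecisionSupport.OPERATIONS, ()),
    DataStore("data warehouse", DecisionSupport.STRATEGIC,
              ("operational systems",)),
    DataStore("operational data store", DecisionSupport.TACTICAL,
              ("operational systems",)),
    DataStore("data marts", DecisionSupport.STRATEGIC, ("data warehouse",)),
    DataStore("oper marts", DecisionSupport.TACTICAL,
              ("operational data store",)),
)


def stores_for(support):
    """Return the names of the stores that serve one kind of decision support."""
    return [s.name for s in CIF_STORES if s.support is support]


def lineage(name):
    """Walk upstream from a store back to the source systems."""
    by_name = {s.name: s for s in CIF_STORES}
    path = [name]
    while by_name[path[-1]].fed_by:
        path.append(by_name[path[-1]].fed_by[0])
    return path
```

Calling `lineage("oper marts")` walks from the oper marts through the operational data store back to the operational systems, which is the tactical path; the strategic path runs through the data warehouse instead.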

These components were further separated into two major groupings of components and processes:

Getting data in consists of the processes and databases involved in acquiring data from the operational systems, integrating it, cleaning it up, and putting it into a database for easy usage. The components of the CIF that are found in this function are:

The operational system databases (source systems) contain the data used to run the day-to-day business of the company. These are still the major source of data for the decision support environment.

The data warehouse is a collection or repository of integrated, detailed, historical data to support strategic decision making.

The operational data store is a collection of integrated, detailed, current data to support tactical decision making.

Data acquisition is a set of processes and programs that extracts data for the data warehouse and operational data store from the operational systems. The data acquisition programs perform the cleansing as well as the integration of the data and transformation into an enterprise format. This enterprise format reflects an integrated set of enterprise business rules that usually causes the data acquisition layer to be the most complex component in the CIF. In addition to programs that transform and clean up data, the data acquisition layer also includes audit and control processes and programs to ensure the integrity of the data as it enters the data warehouse or operational data store.

Getting information out consists of the processes and databases involved in delivering BI to the ultimate business consumer or analyst. The components of the CIF that are found in this function are:

The data marts are derivatives of the data warehouse used to provide the business community with access to various types of strategic analysis.

The oper marts are derivatives of the ODS used to provide the business community with dimensional access to current operational data.

Data delivery is the process that moves data from the data warehouse or ODS into the data marts and oper marts. Like the data acquisition layer, it manipulates the data as it moves it. In the case of data delivery, however, the origin is the data warehouse or ODS, which already contains high-quality, integrated data that conforms to the enterprise business rules.
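The two movements described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any ETL tool works: the two source layouts (`BILLING_ROWS`, `ORDER_ROWS`), the enterprise format, and the function names `to_enterprise_format`, `acquire`, and `deliver_customer_mart` are all hypothetical. Data acquisition cleanses and integrates rows from the operational systems and keeps audit counts; data delivery then reshapes the integrated rows for a mart.

```python
from collections import defaultdict

# Hypothetical extracts from two operational systems with different layouts.
BILLING_ROWS = [
    {"cust_no": " 0042", "amt": "19.90", "day": "2003-01-05"},
    {"cust_no": "0007", "amt": "bad", "day": "2003-01-05"},
]
ORDER_ROWS = [
    {"customer": "42", "total_cents": 500, "date": "2003-01-06"},
]


def to_enterprise_format(row, source):
    """Cleanse one source row and map it to a single enterprise layout."""
    if source == "billing":
        customer_id = int(row["cust_no"].strip())
        amount_cents = round(float(row["amt"]) * 100)
        day = row["day"]
    else:  # "orders"
        customer_id = int(row["customer"])
        amount_cents = int(row["total_cents"])
        day = row["date"]
    return {"customer_id": customer_id, "amount_cents": amount_cents,
            "day": day, "source": source}


def acquire(sources):
    """Data acquisition: extract, cleanse, integrate, and audit."""
    warehouse = []
    audit = {"read": 0, "loaded": 0, "rejected": 0}
    for source, rows in sources.items():
        for row in rows:
            audit["read"] += 1
            try:
                warehouse.append(to_enterprise_format(row, source))
                audit["loaded"] += 1
            except (KeyError, ValueError):
                audit["rejected"] += 1
    # Control check: every row read is either loaded or rejected.
    assert audit["read"] == audit["loaded"] + audit["rejected"]
    return warehouse, audit


def deliver_customer_mart(warehouse):
    """Data delivery: summarize integrated warehouse rows for a mart."""
    totals = defaultdict(int)
    for row in warehouse:
        totals[row["customer_id"]] += row["amount_cents"]
    return dict(totals)


warehouse, audit = acquire({"billing": BILLING_ROWS, "orders": ORDER_ROWS})
mart = deliver_customer_mart(warehouse)
```

The cleansing and format mapping live only in the acquisition step, so the delivery step can assume its input already conforms to one layout. That mirrors the point made above: data delivery starts from integrated data, which is why it is the simpler of the two layers.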

The CIF didn't just happen. In the beginning, it consisted of the data warehouse and sets of lightly summarized and highly summarized data, initially a collection of the historical data needed to support strategic decisions. Over time, it spawned the operational data store with a focus on the tactical decision support requirements as well. The lightly and highly summarized sets of data evolved into what we now know as data marts.

Let's look at the CIF in action. Customer Relationship Management (CRM) is a highly popular initiative that needs the components for tactical information (operational systems, operational data store, and oper marts) and for strategic information (data warehouse and various types of data marts). Certainly this technology is necessary for CRM, but CRM requires more than just the technology; it also requires alignment of the business strategy, corporate culture and organization, and customer information to provide long-term value to both the customer and the organization. An architecture such as that provided by the CIF fits very well within the CRM environment, and each component has a specific design and function within this architecture.

We describe each component in more detail later in this chapter. CRM is a popular application of the data warehouse and operational data store, but there are many other applications. For example, the enterprise resource planning (ERP) vendors such as SAP, Oracle, and PeopleSoft have embraced data warehousing and augmented their tool suites to provide the needed capabilities. Many software vendors are now offering various plug-ins containing generic analytical applications such as profitability or key performance indicator (KPI) analyses. We will cover the components of the CIF in far greater detail in the following sections of this chapter.

The evolution of data warehousing has been critical in helping companies better serve their customers and improve their profitability. It took a combination of technological changes and a sustainable architecture. The tools for building this environment have certainly come a long way. They are quite sophisticated and offer great benefit in the design, implementation, maintenance, and access to critical corporate data. The CIF architecture capitalizes on these technology and tool innovations. It creates an environment that segregates data into five distinct stores, each of which has a key role in providing the business community with the right information at the right time, in the right place, and in the right form. So, if you're in the data warehousing late majority or are even a laggard, take heart. It was worth the wait.

What Is a Data Warehouse?

Before we get started with the actual description of the modeling techniques, we need to make sure that all of us share the same definitions: what a data warehouse is, its role and purpose in BI, and the components that support its construction and use.

Role and Purpose of the Data Warehouse

As we saw in the first section of this chapter, the BI architecture has changed considerably over the past decade, from simple reporting and EIS systems to multidimensional analysis, to data mining, and to data exploration. Customizable analytical applications have now been introduced as well. All of these technologies are part of a robust, mature BI environment. Figure 1.3 shows the timeframe in which these technologies developed.

Given these important and markedly different technologies and data format requirements, it is clear that the starting point must be a repository of high-quality, trusted data in a flexible, reusable format, used to support and maintain the BI environment. The data warehouse has been part of the BI architecture from the very beginning. Different methodologies and data warehouse gurus have given this component different names, such as:

Staging area: A variation on the data warehouse is the “back office” staging area, where data from the operational systems is first brought together. It is an informally designed and maintained grouping of data whose only purpose is to feed multidimensional data marts.

Information warehouse: This was IBM's early name for the data warehouse. It was not as clearly defined as the staging area, and its definition encompassed not only the repository of historical data but also the data marts.
