Teradata数据仓库简介_第1页
Teradata数据仓库简介_第2页
Teradata数据仓库简介_第3页
Teradata数据仓库简介_第4页
Teradata数据仓库简介_第5页
已阅读5页,还剩37页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Agendal关于关于TERADATAlTeradata数据库原理数据库原理lTeradata数据库架构数据库架构lTeradata数据库工作原理数据库工作原理lTeradata特性特性lTeradata数据仓库构建数据仓库构建l基本概念基本概念l常用工具介绍常用工具介绍l管理的一些约定管理的一些约定关于关于TERADATA Teradata最初产生于1976年,由加州理工学院和花旗银行的高科技项目-创建一个能够分析10的12次方 bytes 数据的系统。1 Kilobyte = 103 = 1000 bytes1 Megabyte= 106 = 1,000,000 bytes1 Gigaby

2、te= 109 = 1,000,000,000 bytes1 Terabyte= 1012 = 1,000,000,000,000 bytes1 Petabyte= 1015 = 1,000,000,000,000,000 bytes1 Exabyte= 1018 = 1,000,000,000,000,000,000 bytes1 Zetabyte= 1021 = 1,000,000,000,000,000,000,000 bytes1 Yottabyte= 1024 = 1,000,000,000,000,000,000,000,000 bytes关于关于TERADATATeradata

3、是全球最大的专注于数据仓库、咨询服务及企业分析方案的提供商,凭借业界领先的数据库、数据仓库解决方案、性能卓越的可扩展平台以及全球2000多个大型数据仓库项目的客户成功经验,成就了公司在数据仓库领域的创新领导地位。 ability to executecompleteness of visionniche playersvisionariesDATAllegroGreenplumNetezzaMySQLKognitioSand TechnologySybaseMicrosoftIBMTeradataOracleleaderschallengersas of September 200720072

4、008ability to executecompleteness of visionSybaseOracleIBMTeradataMicrosoft-DATAllegroNetezzaGreenplumHPSun Microsystems-MySQLKognitioSand TechnologyVerticaIngresIlluminate Solutions1010dataniche playersvisionariesleaderschallengersas of December 2008Gartner 评选Teradata为数据仓库领导厂商 4Teradata数据库原理数据库原理lT

5、eradata数据库架构数据库架构lTeradata数据库工作原理数据库工作原理lTeradata特性特性 通通 信信 界界 面面 存存 取取 模模 块块 单单 元元 AMP 分分 析析 引引 擎擎PEPE1PE2AMP1AMP2AMP4AMP3 网网 关关 通通 道道 单单 结结 点点 SMPUNIXPDE (并并 行行 数数 据据 库库 扩扩 展展 环环 境境)TCP/IP 封封 闭闭 主主 机机 系系 统统 信息传递层 磁盘阵列磁盘阵列Teradata数据库架构数据库架构BYNET信息信息传递网络传递网络 TERADATA的MPP架构CPUCPUCPUCPUMemoryCPUCPUCPU

6、CPUMemoryMPP系统与系统与Teradata多结点同时工作数据库由各结点共同拥有MPP (Massive Parallel Processing)海量并行处理服务器:由多个SMP服务器通过一定的结点互联网络进行连接,协同工作,完成相同的任务。从用户的角度来看是一个系统从用户的角度来看是一个系统!TERADATA的MPP架构 高速并行处理的BYNET技术 更适合复杂/即席查询 通过硬件技术实现Teradata的多个子任务均衡并行SMP 处理器节点处理器节点光纤光纤I/O连接连接MPP 数据高速交换层数据高速交换层 BYNET 2DA Controllers(w/Cache) BYNET

7、高性能磁盘阵列高性能磁盘阵列DA Controllers(w/Cache)DA Controllers(w/Cache)DA Controllers(w/Cache) Memory CPU CPU CPU CPU Memory Memory CPU CPU CPU CPU Memory Memory CPU CPU CPU CPU Memory Memory CPU CPU CPU CPU MemoryCLIQUE PARSING ENGINE (PE)SQL Parser & OptimizerQuery Step Dispatcher Access Module Processor

8、s (AMP)V-PE BY-NetV-AMPV-AMPV-AMPV-AMPV-PETeradata并行处理架构AMP4的数据AMP3的数据AMP2的数据每每个个并并行行单单元元只只管管理理自自己己的的数数据据Building IndexesBackup & RecoveryRow LockingTransactionJournalizingSortingReading WritingLoadingAggregatingAMPAMPAMPAMP1的数据AMP1BYNETTeradata并行的机制其他关系数据库其他关系数据库“有条件的并行有条件的并行”TeradataTeradata“无

9、条件的并行无条件的并行”时间时间初始查询查询优化查询并行扫描链接聚合排序收敛最终结果集并行处理性能 线性扩展能力 最大化的利用每个节点的资源 可灵活配置VPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsVPROCsAmpsShared Nothing SoftwareEffective CPU Scaling Performance02468101

10、23456789101112Number of CPUsEffective CPU PerformanceTeradata WorldMark (CPU = 1.00, 88% 1-4 CPU SMP scaling, 98% pernode)IBM/SUN Best Case (CPU = 0.78, 91% SMP scaling)IBM/SUN Typical (CPU = 0.67, 88% SMP scaling)MPP小结 Teradata MPP 架构 使用当前最快的CPU 最好的扩展性 使用 shared-nothing MPP架构以达到线性扩展Teradata数据仓库构建数据

11、仓库构建l基本概念基本概念l常用工具介绍常用工具介绍l管理的一些约定管理的一些约定数据处理的演变数据仓库(数据仓库(Data Warehouse,可可简写为简写为DW )数据仓库是决策支持系统(DSS)和联机分析(OLAP)应用数据源的结构化数据环境。数据仓库研究和解决从数据库中获取信息的问题。数据仓库的特征在于面向主题、集成性、稳定性和时变性。Operational DataData WarehouseAccess ToolsEnd UsersATMPeople SoftPOSCognosAccessBizObjectsTeradata RDBMSETLETL是是Extraction-Tra

12、nsformation-Loading的的缩写缩写,负负责责将分布的、异构数据源中的数据如关系数据、平面数据文件将分布的、异构数据源中的数据如关系数据、平面数据文件等抽取到临时中间层后进行清洗、转换、集成等抽取到临时中间层后进行清洗、转换、集成,最后加载到数据最后加载到数据仓库或数据集市中仓库或数据集市中,成为联机分析处理、数据挖掘的基础。成为联机分析处理、数据挖掘的基础。ETLETL是构建数据仓库的重要一环,用户从数据源抽取出所需的数据,经过数据清洗,最终按照预先定义好的数据仓库模型,将数据加载到数据仓库中去。 主索引是表中的一个或多个字段,用于确定数据的物理分布每个表的数据根据 PI (主

13、索引)平均分布在不同的AMP 通过Hash算法实现数据自动分布 无需数据重组、重新分区、数据分布管理 可以是唯一或非唯一 一个表不会有两个主索引主索引的选择,关系到能否很好的发挥Teradata数据库的优势-并行处理。VAMP1 VAMP2 VAMP3 VAMP4 VAMPnTable 1 Table 2 Table 3Primary IndexTeradata Parallel Hash FunctionPDMPDMPDMPDMPDMPDMPDMPDMPDM主索引(Primary Index)PI:cust_idPI:cust_idPI:acc_id主键和主索引Indexes are con

14、ceptually different from keys.A PK is a relational modeling convention which allows each row to be uniquely identified.A PI is a Teradata convention which determines how the row will be stored and accessed.A significant percentage of tables may use the same columns for both the PK and the PI.A well-

15、designed database will use a PI that is different from the PK for some tables.Primary KeyPrimary IndexLogical concept of data modelingPhysical mechanism for access and storageTeradata doesnt need to recognize Each table must have exactly one primary indexNo limit on number of columns 16 column limit

16、 (V2R4.1); 64 column limit (V2R5 )Documented in data model Defined in CREATE TABLE statement(Optional in CREATE TABLE)Must be uniqueMay be unique or non-uniqueIdentifies each rowIdentifies 1 (unique) or multiple rows (non-unique)Values should not change Values may be changed (Delete + Insert)May not

17、 be NULL requires a value May be NULLDoes not imply an access path Defines most efficient access pathChosen for logical correctness Chosen for physical performanceAccessing via a Unique Primary IndexAMP 1AMP 2AMP 3AMP 4CustNamePhoneUPIBase TableCustNamePhoneUPIBase TableAdams 111-222231Smith555-6666

18、74Brown 333-999998Young 777-444412CustNamePhoneUPIBase TableAdams 444-666645Smith111-666649Smith555-777756Marsh 888-222251Rice666-555584Black444-555562Jones 777-666677Peters 555-777795CustNamePhoneUPIBase TableSmith222-333340Adams 666-777772Jones 222-888827White555-444437CREATE TABLE Customer(Cust I

19、NT,Name CHAR(10),Phone CHAR(8) )UNIQUE PRIMARY INDEX (Cust);SELECT * FROM customer WHERE cust = 45;PEUPI = 45HashingAlgorithmBYNETCustNamePhoneCUSTOMER table 37White555-4444 98Brown333-9999 74Smith555-6666 95Peters555-7777 27Jones222-8888 56Smith555-7777 45Adams444-6666 84 Rice 666-5555 49 Smith 111

20、-6666 51 Marsh 888-2222 31 Adams 111-2222 62 Black 444-5555 12 Young 777-4444 77 Jones 777-6666 72 Adams 666-7777 40 Smith 222-3333UPIPKRow Distribution Using a UPIAMP 1AMP 2AMP 3AMP 4OrderNum berCustom erNum berOrderDateOrderStatusPKUPI7325732474157103722573847402718872022311213124/134/134/134/104/

21、154/124/164/134/09OOCOCCCCC720224/09C740234/16C732524/13C722524/15C718814/13C738414/12C732434/13C710314/10COrder741514/13CThe PK column(s) willoften be used as a UPI.PI values for Order_Number are known to be unique (its a PK).Teradata will distribute different index values evenly across AMPs.Result

22、ing row distribution among AMPs is uniform.Row Distribution Using a NUPIOrderNum berCustom erNum berOrderDateOrderStatusPKNUPI7325732474157103722573847402718872022311213124/134/134/134/104/154/124/164/134/09OOCOCCCCCOrder7225 24/15C7325 24/1307415 14/13C7384 14/12C7324 34/1307402 34/16C7103 14/10CAM

23、P 1AMP 2AMP 47202 24/09C7188 14/13CAMP 3Customer_Number may be the referred access column for ORDER table, thus a good index candidate.Values for Customer_Number are non-unique and therefore a NUPI.Rows with the same PI value distribute to the same AMP causing row distribution to be less uniform or

24、skewed.Row Distribution Using a Highly Non-Unique Index OrderNum berCustom erNum berOrderDateOrderStatusPKNUPI7325732474157103722573847402718872022311213124/134/134/134/104/154/124/164/134/09OOCOCCCCCOrderAMP 1AMP 2AMP 47325 24/1307324 34/1307103 14/10O7325 24/1307225 24/15C7415 14/13C7384 14/12C740

25、2 34/16C7202 24/09C7188 14/13CAMP 3Values for Order_Status are highly non-unique.Only two values exist, so only two MPs will be used in this table.This table will not perform well in parallel operations.Highly non-unique columns are poor PI choices.The degree of uniqueness is criticalto efficiency.P

26、artitioned Primary Index 4 AMPs with Orders Table Defined with NPPI4 AMPs with Orders Table Defined with PPI on O_DateSecondary IndexesA secondary index is an alternate path to the rows of a table.A table can have from 0 to 32 secondary indexes.Secondary indexes: Do not affect table distribution. Ad

27、d overhead, both in terms of disk space and maintenance. May be added or dropped dynamically as needed. Are chosen to improve table performance.Full Table ScansSELECT * FROM customer WHERE Cust_Phone LIKE 524-;SELECT * FROM customer WHERE Cust_Name Davis;SELECT * FROM customer WHERE Cust_ID 1000;Exa

28、mples of Full-Table Scans:Every row of the table must be read.All AMPs scan their portion of the table in parallel.Primary Index choice affects FTS performance.Full-table scans typically occur when either: The index columns are not used in the query An index is used in a non-equality test A range of

29、 values is specified for the primary indexCUSTOMERCust_IDCust_NameCust_PhoneUSINUPINUSIQuery Submitting Tools BTEQ Basic Teradata Query utility Report writing and formatting features Interactive and batch queries Import/Export across all platformsFastLoadFast batch mode utility for loading new table

30、s onto the Teradata databaseCan reload previously emptied tablesFull Restart capabilityError Limits and Error Tables, accessible using SQLRestartable INMOD routine capabilityAbility to load data in several stagesTeradata DatabaseHostFastLoadFastLoad CharacteristicsLoad large amounts of data into an

31、empty table at high speed.Execute from Teradata servers, channel, or network-attached hosts.Loads into an empty table with no secondary indexes.Has two phases - creates an error table for each phase.Status of run is displayed.Checkpoints can be taken for restarts.Only load 1 empty table with 1 FastL

32、oad job. The Teradata Database will accommodate up to 15 FL/ML/FE applications at one time.Tables defined with Referential integrity, secondary indexes, Join Indexes, Hash Indexes, or Triggers cannot be loaded with FastLoad.Tables with Soft Referential Integrity (V2R5) can be loaded with FastLoad.Du

33、plicate rows cannot be loaded into a multiset table with FastLoad. If an AMP goes down, FastLoad cannot be restarted until it is back online.A Sample FastLoad ScriptLOGON tdpid/username,password;DROP TABLE Acct;DROP TABLE AcctErr1;DROP TABLE AcctErr2;CREATE TABLE Acct, FALLBACK ( AcctNumINTEGER,Numb

34、erINTEGER,StreetCHAR(25),CityCHAR(25),StateCHAR(2),Zip_CodeINTEGER)UNIQUE PRIMARY INDEX (AcctNum);LOGOFF; LOGON tdpid/username,password;BEGIN LOADING AcctERRORFILES AcctErr1, AcctErr2CHECKPOINT 100000;DEFINE in_AcctNum (INTEGER),in_Zip(INTEGER),in_Nbr(INTEGER),in_Street(CHAR(25),in_State(CHAR(2),in_

35、City(CHAR(25) ;INSERT INTO Acct VALUES ( :in_AcctNum,:in_Nbr,:in_Street,:in_City,:in_State,:in_Zip);END LOADING;LOGOFF; SETUPCreate the table, if it doesnt already exist. Start the utility.Error files must be defined.Checkpoint is optional.DEFINE the input; must agree with host data format.INSERT mu

36、st agree with table definition. Phase 1 begins. Unsorted blocks are written to disk.Phase 2 begins with END LOADING. Sorting and writing blocks to disk.MultiLoadBatch mode utility that runs on the host system. FastLoad-like technology TPump-like functionality.Supports up to five populated tables.Mul

37、tiple operations with one pass of input files.Conditional logic for applying changes.Supports INSERTs, UPDATEs, DELETEs and UPSERTs; typically with batch inputs from a host file.Affected data blocks only written once.Host and LAN support.Full Restart capability.Error reporting via error tables.Suppo

38、rt for INMODs.MULTILOADHOSTTeradata DBUPDATEsINSERTsDELETEsTABLE ATABLE BTABLE CTABLE DTABLE EMultiLoad LimitationsNo data retrieval capability.Concatenation of input data files is not allowed.Host will not process arithmetic functions.Host will not process exponentiation or aggregates. Cannot proce

39、ss tables defined with USIs, Referential Integrity, Join Indexes, Hash Indexes, or Triggers.Import tasks require use of Primary Index.Basic MultiLoad Statements.LOGTABLE restartlog_tablename ;.LOGON tdpid/userid, password ;.BEGIN MLOAD TABLES tablename1, . ;.LAYOUT layout_name ;.FIELD . ;.FILLER . ;

40、.DML LABEL label ;.IMPORT INFILE FROM m FOR n THRU k FORMAT FASTLOAD | BINARY | TEXT | UNFORMAT | VARTEXT c LAYOUT layout_name APPLY label WHERE condition ;.END MLOAD ;.LOGOFF ;.FIELD fieldname startpos datadesc | fieldexp NULLIF nullexpr DROP LEADING / TRAILING BLANKS / NULLS AND TRAILING / LEADING NULLS / BLANKS ;.FILLER fieldname startpos datadesc ;FastExportExports large volumes of formatted data from Teradata to a host user-written application.Takes advantage of multiple sessions.Export from multiple tables.Uses Support Environment.Fully automate

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论