The Little Flying Elephant in the Cloud: Report Series, Part 2
Cloud Group

Hadoop in SIGMOD 2011

Outline
  • Introduction
  • Nova: Continuous Pig/Hadoop Workflows
  • Apache Hadoop Goes Realtime at Facebook
  • Emerging Trends in the Enterprise Data Analytics
  • A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses

Industrial Session in SIGMOD 2011
Industrial session:
  • Data Management for Feeds and Streams (2)
  • Dynamic Optimization and Unstructured Content (4)
  • Business Analytics (2)
  • Support for Business Analytics and Warehousing (4)
  • Applying Hadoop (4)

Nova: Continuous Pig/Hadoop Workflows
By Yahoo!

Nova Overview
  • Scenarios
      • Ingesting and analyzing user behavior logs
      • Building and updating a search index from a stream of crawled web pages
      • Processing semi-structured data
  • Two-layer programming model (Nova over Pig)
      • Continuous processing
      • Independent scheduling
      • Cross-module optimization
      • Manageability features

Workflow Model
  • Workflow
      • Two kinds of vertices: tasks (processing steps) and channels (data containers)
      • Edges connect tasks to channels and channels to tasks
  • Four common patterns of processing
      • Non-incremental (template detection)
      • Stateless incremental (shingling)
      • Stateless incremental with lookup table (template tagging)
      • Stateful incremental (de-duping)

Workflow Model (Cont.)
  • Data and Update Model
      • Blocks: a channel's data is divided into blocks
      • Base block
          • Contains a complete snapshot of data on a channel as of some point in time
          • Base blocks are assigned increasing sequence numbers ($B_0, B_1, B_2, \ldots, B_n$)
      • Delta block
          • Used in conjunction with incremental processing
          • Contains instructions for transforming a base block into a new base block ($\Delta_{i \to j}$)

Workflow Model (Cont.)
  • Task/Data Interface
      • Consumption mode: all or new
      • Production mode: $B$ or $\Delta$

Workflow Model (Cont.)
  • Workflow Programming and Scheduling
      • Data-based trigger
      • Time-based trigger
      • Cascade trigger
  • Data Compaction and Garbage Collection
      • If a channel has blocks $B_0$, $\Delta_{0 \to 1}$, $\Delta_{1 \to 2}$, $\Delta_{2 \to 3}$, the compaction operation computes and adds $B_3$ to the channel
      • After compaction has added $B_3$ to the channel, if the current cursor is at sequence number 2, then $B_0$, $\Delta_{0 \to 1}$, $\Delta_{1 \to 2}$ can be garbage-collected

Nova System Architecture

Apache Hadoop Goes Realtime at Facebook
By Facebook

Workload Types
  • Facebook Messaging
      • High Write Throughput
      • Large Tables
      • Data Migration
  • Facebook Insights
      • Realtime Analytics
      • High Throughput Increments
  • Facebook Metrics System (ODS)
      • Automatic Sharding
      • Fast Reads of Recent Data and Table Scans

Why Hadoop & HBase
  • Elasticity
  • High write throughput
  • Efficient and low-latency strong consistency semantics within a data center
  • Efficient random reads from disk
  • High Availability and Disaster Recovery
  • Fault Isolation
  • Atomic read-modify-write primitives
  • Range Scans
  • Tolerance of network partitions within a single data center
  • Zero Downtime in case of individual data center failure
  • Active-active serving capability across different data centers

Realtime HDFS
  • High Availability - AvatarNode

Realtime HDFS (Cont.)
  • Hadoop RPC compatibility
  • Block Availability: Placement Policy
      • A pluggable block placement policy

Realtime HDFS (Cont.)
  • Performance Improvements for a Realtime Workload
      • RPC Timeout
      • Reads from Local Replicas
  • New Features
      • HDFS sync
      • Concurrent Readers

Production HBase
  • ACID Compliance (RWCC: Read Write Consistency Control)
      • Atomicity (WALEdit)
      • Consistency
  • Availability Improvements
      • HBase Master Rewrite, region assignment in memory - ZooKeeper
      • Online Upgrades
      • Distributed Log Splitting
  • Performance Improvements
      • Compaction (minor and major)
      • Read Optimizations

Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse
By IBM

Motivation
  1. Increasing volumes of data
  2. Hadoop-based solutions in conjunction with data warehouses

A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses
By Teradata

Motivation
  • ETL (Extraction, Transformation, Loading) is a critical part of a data warehouse
  • While data are partitioned and replicated across all nodes in a parallel data warehouse, load utilities reside on a single node (bottleneck)

Why Hadoop for Teradata EDW (Enterprise Data Warehouse)?
  • More disk space can be easily added
  • Use as intermediate storage
  • MapReduce for transformation
  • Load data in parallel

Block Assignment Problem
  • HDFS file $F$ on a cluster of $P$ nodes (each node is uniquely identified with an integer $i$ where $1 \le i \le P$)
  • The problem is defined by: assignment($X$, $Y$, $n$, $m$, $k$, $r$)
      • $X$ is the set of $n$ blocks ($X = \{1, \ldots, n\}$) of $F$
      • $Y$ is the set of $m$ nodes running PDBMS (called PDBMS nodes) ($Y \subseteq \{1, \ldots, P\}$)
      • $k$ copies, $m$ nodes
      • $r$ is the mapping recording the replicated block locations of each block; $r(i)$ returns the set of nodes which have a copy of block $i$

Block Assignment Problem (Cont.)
  • An assignment $g$ from the blocks in $X$ to the nodes in $Y$ is denoted by a mapping from $X = \{1, \ldots, n\}$ to $Y$, where $g(i) = j$ ($i \in X$, $j \in Y$) means that block $i$ is assigned to node $j$
  • An eve
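The block assignment definitions can be made concrete with a small Python sketch. It only encodes the objects named in the slides (the block set X, the PDBMS node set Y, the replica mapping r, and an assignment g) and computes a few descriptive quantities. The function names, the sample cluster, and the idea of counting "local" blocks (blocks assigned to a node that already holds a replica) are illustrative assumptions, not the criterion or algorithm used in the Teradata paper.

```python
from collections import Counter


def is_valid_assignment(g, X, Y):
    """True if g maps every block in X to exactly one node in Y."""
    return set(g) == set(X) and all(node in Y for node in g.values())


def local_block_count(g, r):
    """Number of blocks whose assigned node already holds a replica (g(i) in r(i))."""
    return sum(1 for block, node in g.items() if node in r[block])


def node_loads(g, Y):
    """Number of blocks assigned to each PDBMS node (0 for unused nodes)."""
    counts = Counter(g.values())
    return {node: counts.get(node, 0) for node in sorted(Y)}


# Hypothetical instance: P = 4 cluster nodes, n = 4 blocks, k = 2 copies each,
# and the PDBMS running on m = 2 of the nodes.
X = {1, 2, 3, 4}
Y = {1, 2}
r = {1: {1, 3}, 2: {2, 4}, 3: {1, 2}, 4: {3, 4}}  # r(i): nodes holding block i
g = {1: 1, 2: 2, 3: 1, 4: 2}                      # g(i) = j: block i goes to node j

if __name__ == "__main__":
    print(is_valid_assignment(g, X, Y))
    print(local_block_count(g, r))
    print(node_loads(g, Y))
```

In this made-up instance, block 4 has no replica on any PDBMS node, so whichever node it is assigned to must fetch it remotely; the other three blocks are read from a local replica.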
