




Dumbo in the Cloud report series, part 2
Cloud Group
Hadoop in SIGMOD 2011

Outline
- Introduction
- Nova: Continuous Pig/Hadoop Workflows
- Apache Hadoop Goes Realtime at Facebook
- Emerging Trends in the Enterprise Data Analytics
- A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses

Industrial Session in SIGMOD 2011
- Data Management for Feeds and Streams (2)
- Dynamic Optimization and Unstructured Content (4)
- Business Analytics (2)
- Support for Business Analytics and Warehousing (4)
- Applying Hadoop (4)

Nova: Continuous Pig/Hadoop Workflows
By Yahoo!

Nova Overview
- Scenarios
  - Ingesting and analyzing user behavior logs
  - Building and updating a search index from a stream of crawled web pages
  - Processing semi-structured data
- Two-layer programming model (Nova over Pig)
  - Continuous processing
  - Independent scheduling
  - Cross-module optimization
  - Manageability features

Workflow Model
- Workflow
  - Two kinds of vertices: tasks (processing steps) and channels (data containers)
  - Edges connect tasks to channels and channels to tasks
- Four common patterns of processing
  - Non-incremental (template detection)
  - Stateless incremental (shingling)
  - Stateless incremental with lookup table (template tagging)
  - Stateful incremental (de-duping)

Workflow Model (Cont.)
- Data and Update Model
  - Blocks: a channel's data is divided into blocks
  - Base block
    - Contains a complete snapshot of data on a channel as of some point in time
    - Base blocks are assigned increasing sequence numbers ($B_0, B_1, B_2, \ldots, B_n$)
  - Delta block
    - Used in conjunction with incremental processing
    - Contains instructions for transforming a base block into a new base block ($\Delta_{i \to j}$)

Workflow Model (Cont.)
- Task/Data Interface
  - Consumption mode: all or new
  - Production mode: $B$ or $\Delta$

Workflow Model (Cont.)
- Workflow Programming and Scheduling
  - Data-based trigger
  - Time-based trigger
  - Cascade trigger
- Data Compaction and Garbage Collection
  - If a channel has blocks $B_0$, $\Delta_{0 \to 1}$, $\Delta_{1 \to 2}$, $\Delta_{2 \to 3}$, the compaction operation computes $B_3$ and adds it to the channel
  - After compaction has added $B_3$ to the channel, if the current cursor is at sequence number 2, then $B_0$, $\Delta_{0 \to 1}$, and $\Delta_{1 \to 2}$ can be garbage-collected

Nova System Architecture

Apache Hadoop Goes Realtime at Facebook
By Facebook

Workload Types
- Facebook Messaging
  - High Write Throughput
  - Large Tables
  - Data Migration
- Facebook Insights
  - Realtime Analytics
  - High Throughput Increments
- Facebook Metrics System (ODS)
  - Automatic Sharding
  - Fast Reads of Recent Data and Table Scans

Why Hadoop & HBase
- Elasticity
- High write throughput
- Efficient and low-latency strong consistency semantics within a data center
- Efficient random reads from disk
- High Availability and Disaster Recovery
- Fault Isolation
- Atomic read-modify-write primitives
- Range Scans
- Tolerance of network partitions within a single data center
- Zero Downtime in case of individual data center failure
- Active-active serving capability across different data centers

Realtime HDFS
- High Availability - AvatarNode

Realtime HDFS (Cont.)
- Hadoop RPC compatibility
- Block Availability: Placement Policy
  - A pluggable block placement policy

Realtime HDFS (Cont.)
- Performance Improvements for a Realtime Workload
  - RPC Timeout
  - Reads from Local Replicas
- New Features
  - HDFS sync
  - Concurrent Readers

Production HBase
- ACID Compliance (RWCC: Read Write Consistency Control)
  - Atomicity (WALEdit)
  - Consistency
- Availability Improvements
  - HBase Master Rewrite, Region assignment in memory - ZooKeeper
  - Online Upgrades
  - Distributed Log Splitting
- Performance Improvements
  - Compaction (minor and major)
  - Read Optimizations

Emerging Trends in the Enterprise Data Analytics: Connecting Hadoop and DB2 Warehouse
By IBM

Motivation
- 1. Increasing volumes of data
- 2. Hadoop-based solutions in conjunction with data warehouses

A Hadoop Based Distributed Loading Approach to Parallel Data Warehouses
By Teradata

Motivation
- ETL (Extraction, Transformation, Loading) is a critical part of a data warehouse
- While data are partitioned and replicated across all nodes in a parallel data warehouse, load utilities reside on a single node (bottleneck)

Why Hadoop for Teradata EDW (Enterprise Data Warehouse)?
- More disk space can be easily added
- Use as an intermediate storage
- MapReduce for transformation
- Load data in parallel

Block Assignment Problem
- HDFS file $F$ on a cluster of $P$ nodes (each node is uniquely identified with an integer $i$ where $1 \le i \le P$)
- The problem is defined by: assignment($X, Y, n, m, k, r$)
  - $X$ is the set of $n$ blocks ($X = \{1, \ldots, n\}$) of $F$
  - $Y$ is the set of $m$ nodes running PDBMS (called PDBMS nodes) ($Y \subseteq \{1, \ldots, P\}$)
  - $k$ copies of each block, $m$ PDBMS nodes
  - $r$ is the mapping recording the replicated block locations of each block; $r(i)$ returns the set of nodes which have a copy of block $i$

Block Assignment Problem (Cont.)
- An assignment $g$ from the blocks in $X$ to the nodes in $Y$ is denoted by a mapping from $X = \{1, \ldots, n\}$ to $Y$, where $g(i) = j$ ($i \in X$, $j \in Y$) means that block $i$ is assigned to node $j$.
- An eve...
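
To make the assignment(X, Y, n, m, k, r) formulation more concrete, here is a minimal Python sketch of one possible greedy heuristic. It is not the method from the Teradata paper. The function name `greedy_assignment`, the per-node cap of ceil(n / m) blocks, and the "fewest local candidates first" ordering are all assumptions made for illustration. The sketch builds a mapping g from blocks to PDBMS nodes that prefers nodes already holding a replica, as given by r(i).

```python
from math import ceil


def greedy_assignment(n, nodes, r):
    """Hypothetical greedy heuristic for the block assignment problem.

    n     -- number of blocks; X = {1, ..., n}
    nodes -- the set Y of PDBMS node ids
    r     -- dict mapping block i to the set of nodes storing a copy of it
    Returns (g, local): g[i] is the node chosen for block i, and local is
    the number of blocks placed on a node that already stores a replica.
    """
    nodes = sorted(nodes)
    node_set = set(nodes)
    cap = ceil(n / len(nodes))  # assumed load limit per node, not from the source
    load = {j: 0 for j in nodes}
    g = {}
    local = 0

    # Place the blocks with the fewest local PDBMS candidates first,
    # so the more flexible blocks can fill the remaining capacity.
    order = sorted(range(1, n + 1), key=lambda i: (len(r[i] & node_set), i))
    for i in order:
        candidates = [j for j in nodes if j in r[i] and load[j] < cap]
        if candidates:
            local += 1
        else:
            # No replica holder has room left: fall back to any node with room.
            candidates = [j for j in nodes if load[j] < cap]
        j = min(candidates, key=lambda c: (load[c], c))
        g[i] = j
        load[j] += 1
    return g, local


# Example: P = 3 cluster nodes, PDBMS runs on nodes 1 and 2, k = 2 copies.
replicas = {1: {1, 3}, 2: {1, 2}, 3: {2, 3}, 4: {1, 3}}
g, local = greedy_assignment(4, {1, 2}, replicas)
```

A greedy pass like this does not guarantee the maximum number of local assignments. A formulation based on bipartite matching or maximum flow would be the usual way to obtain that guarantee.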