




Cloud Computing and Cloud Data Management
Advanced Data Management Frontier Workshop

Outline
- Cloud computing overview
- Google cloud computing techniques: GFS, BigTable, and MapReduce
- Yahoo cloud computing techniques and Hadoop
- Challenges of cloud data management

A new course at Renmin University: Distributed Systems and Cloud Computing
- Overview of distributed systems
- Survey of distributed cloud computing techniques
- Distributed cloud computing platforms
- Distributed cloud computing program development

Part 1: Overview of Distributed Systems
- Chapter 1: Introduction to distributed systems
- Chapter 2: Client-server architecture
- Chapter 3: Distributed objects
- Chapter 4: Common Object Request Broker Architecture (CORBA)

Part 2: Survey of Cloud Computing
- Chapter 5: Introduction to cloud computing
- Chapter 6: Cloud services
- Chapter 7: Comparison of cloud-related techniques
  - 7.1 Grid computing and cloud computing
  - 7.2 Utility computing and cloud computing
  - 7.3 Parallel and distributed computing and cloud computing
  - 7.4 Cluster computing and cloud computing

Part 3: Cloud Computing Platforms
- Chapter 8: The three core technologies of the Google cloud platform
- Chapter 9: The Yahoo cloud platform
- Chapter 10: The Aneka cloud platform
- Chapter 11: The Greenplum cloud platform
- Chapter 12: The Amazon Dynamo cloud platform

Part 4: Cloud Platform Development
- Chapter 13: Development on Hadoop
- Chapter 14: Development on HBase
- Chapter 15: Development on Google Apps
- Chapter 16: Development on MS Azure
- Chapter 17: Development on Amazon EC2

Cloud Computing

Why do we use cloud computing?
- Case 1: you write a file and save it; if the computer goes down, the file is lost. Files stored in the cloud are never lost.
- Case 2: to use IE, QQ, or C++, you must download, install, and then use each one. With the cloud, you simply get the service.

What are the cloud and cloud computing?
- Cloud: on-demand resources or services delivered over the Internet, with the scale and reliability of a data center.
- Cloud computing: a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet. Users need not have knowledge of, expertise in, or control over the technology infrastructure in the "cloud" that supports them.

Characteristics of cloud computing
- Virtual: software, databases, web servers, operating systems, storage, and networking run as virtual servers.
- On demand: add and subtract processors, memory, network bandwidth, and storage as needed.

Types of cloud service
- IaaS: Infrastructure as a Service
- PaaS: Platform as a Service
- SaaS: Software as a Service

SaaS
- Software delivery model: no hardware or software to manage.
- Service delivered through a browser; customers use the service on demand.
- Instant scalability.

SaaS examples
- Your current CRM package is not managing the load, or you simply do not want to host it in-house.
  Use a SaaS provider such as s…
- Your email is hosted on an Exchange server in your office and it is very slow. Outsource it using hosted Exchange.

PaaS
- Platform delivery model.
- Platforms are built upon infrastructure, which is expensive.
- Estimating demand is not a science!
- Platform management is not fun!

PaaS examples
- You need to host a large file (5 MB) on your website and make it available to 35,000 users for only two months. Use CloudFront from Amazon.
- You want to offer storage services on your network for a large number of files and you do not have the storage capacity. Use Amazon S3.

IaaS
- Computer infrastructure delivery model: a platform virtualization environment.
- Computing resources, such as storage and processing capacity.
- Virtualization taken a step further.

IaaS examples
- You want to run a batch job but you do not have the infrastructure necessary to run it in a timely manner. Use Amazon EC2.
- You want to host a website, but only for a few days. Use FlexiScale.

Cloud computing and other computing techniques

The 21st-century vision of computing
- Leonard Kleinrock, one of the chief scientists of the original Advanced Research Projects Agency Network (ARPANET) project, which seeded the Internet, said: "As of now, computer networks are still in their infancy, but as they grow up and become sophisticated, we will probably see the spread of computer utilities which, like present electric and telephone utilities, will service individual homes and offices across the country."
- Sun Microsystems co-founder Bill Joy also indicated: "It would take time until these markets mature to generate this kind of value. Predicting now which companies will capture the value is impossible.
  Many of them have not even been created yet."

Definitions
- Utility computing: the packaging of computing resources, such as computation and storage, as a metered service similar to a traditional public utility.
- Computer cluster: a group of linked computers working together so closely that in many respects they form a single computer.
- Grid computing: the application of several computers to a single problem at the same time, usually a scientific or technical problem that requires a great number of processing cycles or access to large amounts of data.
- Cloud computing: a style of computing in which dynamically scalable and often virtualized resources are provided as a service over the Internet.

Grid computing and cloud computing
- They share a lot of commonality: intention, architecture, and technology.
- They differ in programming model, business model, compute model, applications, and virtualization.
- The problems are mostly the same: manage large facilities; define methods by which consumers discover, request, and use resources provided by the central facilities; implement the often highly parallel computations that execute on those resources.
- Virtualization: grids do not rely on virtualization as much as clouds do, since each individual organization maintains full control of its resources; for clouds, virtualization is an indispensable ingredient of almost every deployment.

Any questions or comments?

Outline (recap)
- Cloud computing overview
- Google cloud computing techniques: GFS, BigTable, and MapReduce
- Yahoo cloud computing techniques and Hadoop
- Challenges of cloud data management

Google Cloud Computing Techniques

The Google File System (GFS)
- A scalable distributed file system for large, distributed, data-intensive applications.
- Multiple GFS clusters are currently deployed.
- The largest ones have 1000+ storage nodes and 300+ terabytes of disk storage, heavily accessed by hundreds of clients on distinct machines.

Introduction
- GFS shares many goals with previous distributed file systems: performance, scalability, reliability, etc.
- Its design has been driven by four key observations of Google's application workloads and technological environment.

Observations 1 and 2
1. Component failures are the norm. Constant monitoring, error detection, fault tolerance, and automatic recovery are integral to the system.
2. Files are huge by traditional standards. Multi-GB files are common, so I/O operations and block sizes must be revisited.

Observations 3 and 4
3. Most files are mutated by appending new data. This is the focus of performance optimization and atomicity guarantees.
4. Co-designing the applications and the APIs benefits the overall system by increasing flexibility.

The design
- A cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients.

The master
- Maintains all file system metadata: namespace, access control information, file-to-chunk mappings, chunk (including replica) locations, etc.
- Periodically communicates with chunkservers in heartbeat messages to give instructions and check state.
- Makes sophisticated chunk placement and replication decisions using global knowledge.
- For reading and writing, a client contacts the master to get chunk locations, then deals directly with chunkservers, so the master is not a bottleneck for reads and writes.

Chunkservers
- Files are broken into chunks. Each chunk has an immutable, globally unique 64-bit chunk handle, assigned by the master at chunk creation.
- Chunk size is 64 MB; each chunk is replicated on three servers by default.

Clients
- Linked into applications via the file system API.
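The split between metadata traffic (to the master) and data traffic (to chunkservers) can be sketched as a toy in-memory simulation. This is not GFS code; all class and variable names here are hypothetical, chosen only to illustrate the read path described above.

```python
# Toy sketch of the GFS read path: the client asks the master only for
# chunk locations, then fetches the actual bytes from a chunkserver.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in the design above

class Master:
    """Holds metadata only: file name -> list of (chunk_handle, servers)."""
    def __init__(self):
        self.file_to_chunks = {}

    def lookup(self, filename, chunk_index):
        handle, servers = self.file_to_chunks[filename][chunk_index]
        return handle, servers  # no file data flows through the master

class Chunkserver:
    """Holds chunk data keyed by chunk handle."""
    def __init__(self):
        self.chunks = {}

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def client_read(master, filename, offset, length, chunkservers):
    """Translate a byte offset into a chunk index, then go to a replica."""
    index = offset // CHUNK_SIZE
    handle, servers = master.lookup(filename, index)          # metadata RPC
    replica = chunkservers[servers[0]]                        # pick any replica
    return replica.read(handle, offset % CHUNK_SIZE, length)  # data RPC

# Minimal demo: one file, one chunk, one replica.
m, cs = Master(), {"cs0": Chunkserver()}
cs["cs0"].chunks["h1"] = b"hello, gfs"
m.file_to_chunks["/logs/a"] = [("h1", ["cs0"])]
print(client_read(m, "/logs/a", 7, 3, cs))  # b'gfs'
```

Because the master answers only the small `lookup` call, hundreds of clients can stream large reads without the master becoming a bottleneck, which is the point the slide makes.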
- Communicates with the master and chunkservers for reading and writing: master interactions for metadata only, chunkserver interactions for data only.
- Caches metadata information, but not data: the data is too large to cache.

Chunk locations
- The master does not keep a persistent record of chunk and replica locations. It polls chunkservers for this at startup and whenever chunkservers join or leave.
- It stays up to date by controlling the placement of new chunks and through the heartbeat messages used to monitor chunkservers.

Operation log
- A record of all critical metadata changes, stored on the master and replicated on other machines.
- Defines the order of concurrent operations, and is also used to recover the file system state.

System interactions: leases and mutation order
- Leases maintain a consistent mutation order across all replicas of a chunk.
- The master grants a lease to one replica, called the primary. The primary chooses the serial mutation order, and all replicas follow this order.
- This minimizes management overhead for the master.

Atomic record append
- The client specifies the data to write; GFS chooses the offset at which it writes, and appends the data to each replica at least once.
- Heavily used by Google's distributed applications: because GFS, not the client, chooses the offset, no distributed lock manager is needed.

Atomic record append: how?
- Follows a similar control flow to other mutations: the primary tells the secondary replicas to append at the same offset as the primary.
- If the append fails at any replica, it is retried by the client, so replicas of the same chunk may contain different data, including duplicates, whole or in part, of the same record.
- GFS does not guarantee that all replicas are bitwise identical; it only guarantees that the data is written at least once in an atomic unit.
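The at-least-once semantics above can be made concrete with a small simulation, not taken from GFS itself: a transient failure at one replica forces the client to retry the whole append, which leaves a duplicate record on the replicas where the first attempt already succeeded. Names and the failure-injection mechanism are hypothetical.

```python
# Toy simulation of at-least-once record append with client retry.

def record_append(replicas, record, fail_once_at=None):
    """Append `record` to every replica; retry the operation until all succeed."""
    failed = {fail_once_at} if fail_once_at is not None else set()
    while True:
        ok = True
        for name, log in replicas.items():
            if name in failed:
                failed.discard(name)  # transient failure: skip this attempt once
                ok = False
            else:
                log.append(record)
        if ok:
            return

replicas = {"r1": [], "r2": [], "r3": []}
record_append(replicas, "rec-A", fail_once_at="r2")  # first attempt fails at r2
record_append(replicas, "rec-B")                     # clean append
# Every replica holds rec-A at least once; r1 and r3 hold a duplicate of it.
print(replicas["r1"])  # ['rec-A', 'rec-A', 'rec-B']
print(replicas["r2"])  # ['rec-A', 'rec-B']
```

This is exactly why readers of GFS-style logs are expected to tolerate duplicates (e.g. by using record checksums or unique record IDs), as the slide's "may contain duplicates" caveat implies.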
- Data must be written at the same offset on all chunk replicas for success to be reported.

Detecting stale replicas
- The master keeps a chunk version number to distinguish up-to-date replicas from stale ones, and increases the version whenever it grants a lease.
- If a replica is unavailable during a mutation, its version is not increased. The master detects stale replicas when chunkservers report their chunks and versions, and removes them during garbage collection.

Garbage collection
- When a client deletes a file, the master logs the deletion like any other change and renames the file to a hidden name.
- The master removes files hidden for longer than three days while scanning the file system namespace; their metadata is erased as well.
- In heartbeat messages, each chunkserver reports a subset of its chunks, the master replies with those that no longer have metadata, and the chunkserver removes these chunks on its own.

Fault tolerance: high availability
- Fast recovery: the master and chunkservers can restart in seconds.
- Chunk replication.
- Master replication: "shadow" masters provide read-only access when the primary master is down; mutations are not done until recorded on all master replicas.

Fault tolerance: data integrity
- Chunkservers use checksums to detect corrupt data. Since replicas are not bitwise identical, each chunkserver maintains its own checksums.
- For reads, the chunkserver verifies the checksum before sending the chunk; checksums are updated during writes.

Introduction to MapReduce

MapReduce: the insight
- "Consider the problem of counting the number of occurrences of each word in a large collection of documents." How would you do it in parallel?

MapReduce programming model
- Inspired by the map and reduce operations commonly used in functional programming languages like Lisp.
- Users implement an interface of two primary methods:
  1. map: (key1, val1) -> list of (key2, val2)
  2. reduce: (key2, list of val2) -> val3

Map operation
- map, a pure function written by the user, takes an input key/value pair, e.g. (docID, doc-content), and produces a set of intermediate key/value pairs.
- Drawing an analogy to SQL, map can be visualized as the GROUP BY clause of an aggregate query.

Reduce operation
- On completion of the map phase, all the intermediate values for a given output key are combined into a list and given to a reducer.
- It can be visualized as an aggregate function (e.g., average) computed over all the rows with the same GROUP BY attribute.

Pseudo-code for word count:

  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));

MapReduce: execution overview, example, and the example run in parallel (figures omitted).

MapReduce: fault tolerance
- Handled via re-execution of tasks; task completion is committed through the master.
- What happens if a mapper fails? Re-execute completed and in-progress map tasks.
- What happens if a reducer fails? Re-execute in-progress reduce tasks.
- What happens if the master fails? Potential trouble!

MapReduce: walk-through of one more application

MapReduce: PageRank
- PageRank models the behavior of a "random surfer": PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where C(T) is the out-degree of page T and (1 - d) is a damping factor (random jump).
- The random surfer keeps clicking on successive links at random, not taking content into consideration, and a page distributes its rank equally among all pages it links to.
- The damping factor models the surfer "getting bored" and typing an arbitrary URL.

PageRank: key insights
- The effect of each iteration is local: iteration i+1 depends only on iteration i.
- Within iteration i, the PageRank of individual nodes can be computed independently.

PageRank using MapReduce
- Use a sparse matrix representation M; map each row of M to a list of PageRank "credit" to assign to out-link neighbours.
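This map/reduce pair for PageRank can be sketched in plain Python, using the PR(A) = (1 - d) + d * sum(PR(T)/C(T)) form with d = 0.85. This is a toy single-process sketch with hypothetical names, not Google's implementation; it assumes every page has at least one out-link.

```python
from collections import defaultdict

D = 0.85  # damping factor, as in the original PageRank formulation

def pagerank_map(url, rank, out_links):
    """One row of the sparse matrix: spread this page's rank over its out-links."""
    share = rank / len(out_links)
    return [(dest, D * share) for dest in out_links]

def pagerank_reduce(url, credits):
    """Sum the incoming credit and apply the random-jump term."""
    return (1 - D) + sum(credits)

def one_iteration(graph, ranks):
    """graph: url -> list of out-links; ranks: url -> current PageRank."""
    credits = defaultdict(list)
    for url, out_links in graph.items():             # map phase
        for dest, c in pagerank_map(url, ranks[url], out_links):
            credits[dest].append(c)
    return {url: pagerank_reduce(url, credits[url])  # reduce phase
            for url in graph}

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {u: 1.0 for u in graph}
for _ in range(50):                                  # iterate to convergence
    ranks = one_iteration(graph, ranks)
```

Each `one_iteration` call corresponds to one MapReduce round, which is why the key insight above matters: iteration i+1 reads only the output of iteration i, so the rounds chain naturally.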
- These credit scores are then reduced to a single PageRank value for each page by aggregating over them. (Source of figure: Lin 2008.)

Phase 1: process HTML
- The map task takes (URL, page-content) pairs and maps them to (URL, (PR_init, list-of-urls)), where PR_init is the "seed" PageRank for the URL and list-of-urls contains all pages pointed to by the URL.
- The reduce task is just the identity function.

Phase 2: PageRank distribution
- The reduce task gets (URL, url_list) and many (URL, val) values; it sums the vals and fixes them up with d to get the new PageRank, then emits (URL, (new_rank, url_list)).
- Convergence is checked by a non-parallel component.

MapReduce: some more applications
- Distributed grep
- Count of URL access frequency
- Clustering (k-means)
- Graph algorithms
- Indexing systems

MapReduce programs in the Google source tree (figure omitted)

MapReduce: extensions and similar systems
- Pig (Yahoo), Hadoop (Apache), DryadLINQ (Microsoft)

Large-scale systems architecture using MapReduce

BigTable: A Distributed Storage System for Structured Data

Introduction
- BigTable is a distributed storage system for managing structured data.
- Designed to scale to a very large size: petabytes of data across thousands of servers.
- Used for many Google projects: web indexing, personalized search, Google Earth, Google Analytics, Google Finance, …
- A flexible, high-performance solution for all of Google's products.

Motivation
- Lots of (semi-)structured data at Google: URLs (contents, crawl metadata, links, anchors, PageRank, …); per-user data (preference settings, recent queries/search results, …); geographic locations (physical entities such as shops and restaurants, roads, satellite image data, user annotations, …).
- The scale is large: billions of URLs with many versions per page (about 20 KB per version); hundreds of millions of users and thousands of queries per second; 100+ TB of satellite image data.

Why not just use a commercial DB?
- The scale is too large for most commercial databases, and even if it were not, the cost would be very high.
- Building the system internally means it can be applied across many projects for low incremental cost.
- Low-level storage optimizations help performance significantly, and they are much harder to do when running on top of a database layer.

Goals
- Asynchronous processes should be able to continuously update different pieces of data, with access to the most current data at any time.
- Need to support very high read/write rates (millions of operations per second), efficient scans over all or interesting subsets of the data, and efficient joins of large one-to-one and one-to-many datasets.
- Often want to examine data changes over time, e.g. the contents of a web page over multiple crawls.

BigTable
- A distributed, multi-level map: fault-tolerant and persistent.
- Scalable: thousands of servers, terabytes of in-memory data, petabytes of disk-based data, millions of reads/writes per second, efficient scans.
- Self-managing: servers can be added or removed dynamically, and servers adjust to load imbalance.

Building blocks
- Google File System (GFS): raw storage
- Scheduler: schedules jobs onto machines
- Lock service: distributed lock manager
- MapReduce: simplified large-scale data processing
- BigTable's uses of these building blocks: GFS stores persistent data (the SSTable file format); the scheduler schedules jobs involved in BigTable serving; the lock service handles master election and location bootstrapping; MapReduce is often used to read/write BigTable data.

Basic data model
- A BigTable is a sparse, distributed, persistent, multi-dimensional sorted map: (row, column, timestamp) -> cell contents.
- A good match for most Google applications.

WebTable example
- Want to keep a copy of a large collection of web pages and related information.
- Use URLs as row keys and various aspects of a web page as column names; store the contents of web pages in the contents: column under the timestamps when they were fetched.

Rows
- The row name is an arbitrary string; access to data in a row is atomic; row creation is implicit upon storing data.
- Rows are ordered lexicographically, so rows close together lexicographically usually live on one or a small number of machines.
- Reads of short row ranges are efficient and typically require communication with only a small number of machines.
- One can exploit this property by selecting row keys that give good locality for data access.
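A common way to exploit this, seen in the WebTable-style examples, is to store hostnames domain-reversed so that pages from the same domain sort adjacently. A small sketch (the helper name is mine, not from BigTable):

```python
def reverse_domain(url_key):
    """Turn a key like math.gatech.edu/page into edu.gatech.math/page."""
    host, _, path = url_key.partition("/")
    rev = ".".join(reversed(host.split(".")))
    return rev + ("/" + path if path else "")

keys = ["math.gatech.edu", "math.uga.edu", "phys.gatech.edu", "phys.uga.edu"]
print(sorted(keys))
# ['math.gatech.edu', 'math.uga.edu', 'phys.gatech.edu', 'phys.uga.edu']
#   -> gatech.edu and uga.edu rows interleave
print(sorted(reverse_domain(k) for k in keys))
# ['edu.gatech.math', 'edu.gatech.phys', 'edu.uga.math', 'edu.uga.phys']
#   -> all rows of one domain are adjacent, so a short range scan covers them
```

With reversed keys, a scan over the prefix `edu.gatech.` touches only the machines holding that contiguous row range, which is exactly the locality property described above.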
- Example: math.gatech.edu, math.uga.edu, phys.gatech.edu, phys.uga.edu vs. edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.phys.

Columns
- Columns have a two-level name structure: family:optional_qualifier.
- The column family is the unit of access control and has associated type information.
- The qualifier gives unbounded columns: additional levels of indexing, if desired.

Timestamps
- Used to store different versions of data in a cell. New writes default to the current time, but timestamps for writes can also be set explicitly by clients.
- Lookup options: "return most recent k values"; "return all values in timestamp range (or all values)".
- Column families can be marked with attributes: "only retain most recent k values in a cell"; "keep values until they are older than k seconds".

Implementation: three major components
- A library linked into every client.
- One master server, responsible for assigning tablets to tablet servers …
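The (row, column, timestamp) -> cell model and the "retain most recent k values" attribute can be sketched with an ordinary in-memory map. This toy class (all names hypothetical) illustrates only the data model, not the real distributed implementation:

```python
class ToyBigtable:
    """Sparse map: (row, 'family:qualifier') -> list of (timestamp, value),
    kept newest-first, mimicking (row, column, timestamp) -> cell contents."""

    def __init__(self, retain_k=3):
        self.cells = {}          # only rows/columns actually written exist
        self.retain_k = retain_k

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(reverse=True)   # newest first
        del versions[self.retain_k:]  # "only retain most recent k values"

    def get(self, row, column):
        """Return the most recent value in the cell."""
        return self.cells[(row, column)][0][1]

# WebTable-style usage: page contents under the fetch timestamps.
t = ToyBigtable(retain_k=2)
t.put("com.cnn.www", "contents:", 100, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 200, "<html>v2</html>")
t.put("com.cnn.www", "contents:", 300, "<html>v3</html>")
print(t.get("com.cnn.www", "contents:"))  # <html>v3</html>; v1 was garbage-collected
```

Sparseness falls out naturally: a cell that was never written occupies no storage at all, which is why the model suits tables with billions of rows and many mostly-empty columns.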