夏俊鸾：Spark――基于内存的下一代大数据分析框架

上传人：n*** IP属地：贵州上传时间：2020-06-21 格式：PDF 页数：35 大小：1.61MB 积分：20 举报 版权申诉

已阅读5页，还剩30页未读，继续免费阅读

版权说明：本文档由用户提供并上传，收益归属内容提供方，若内容存在侵权，请进行举报或认领

文档简介

1、Spark: High-Speed Big Data Analysis Framework Intel Andrew Xia Weibo:Andrew-Xia Agenda Intel contributions to Spark Collaboration Real world cases Summary Spark Overview Open source projects initiated by AMPLab in UC Berkeley Apache incubation since June 2013 Intel closely collaborating with AMPLab

2、& the community on open source development UC BERKELEY Contributions by Intel Netty based shuffle for Spark FairScheduler for Spark Spark job log files Metrics system for Spark Spark shell on YARN Spark (standalone mode) integration with security Hadoop Byte code generation for Shark Co-partitioned

3、join in Shark . . . Intel China 3 committers 7 contributors 50+ patches Agenda Intel contributions to Spark Collaboration Real world cases Summary Intel partnering with several big websites Building next-gen big data analytics using the Spark stack E.g., Alibaba , Baidu iQiyi, Youku, etc. Collaborat

4、ion Partners Advertising Operation analysis Effect analysis Directional optimization Analysis Website report Platform report Monitor system Recommendation Ranking list Personal recommendation Hot-click analysis Big Data in Partners FAQ #1: Poor Performance machine learning and graph computation OLAP

5、 for Tabular data, interactive query FAQs Hadoop Data Sharing iter. 1 iter. 2 . . . Input query 1 query 2 query 3 result 1 result 2 result 3 HDFS read Slow due to replication, serialization, and disk IO Spark Data Sharing Input query 1 query 2 query 3 . . . one-time processing iter. 1 iter. 2 . . .

6、Input 10-100 faster than network and disk FAQ #2: Too many big data systems FAQs Big Data Systems Today MapReduce Specialized systems (iterative, interactive and streaming apps) General batch processing Vision of Spark Ecosystem One stack to rule them all! Spark Ecosystem HDFS/Hadoop Storage Mesos Y

7、ARN Tachyon Spark Shark SQL Spark Streaming Graphx Graph- parallel MLBase Machine learning MPI MapReduce FAQ #3: Study cost FAQs Code Size 0 20000 40000 60000 80000 100000 120000 140000 Hadoop MapReduce Storm (Streaming) Impala (SQL) Giraph (Graph) Spark non-test, non-example source lines GraphX Sha

8、rk* * also calls into Hive Streaming FAQ #4: Is Spark Stable? FAQs Spark 0.8 has been released Spark 0.9 will be release in Jan 2013 Spark Status FAQ #5: Not enough memory to cache FAQs Graceful degradation Scheduler takes care of this Other options MEMORY_ONLY MEMORY_ONLY_SER MEMORY_AND_DISK DISK_O

9、NLY Not Enough Memory FAQ #6: How to recover when failure? FAQs How to Failover? Input query 1 query 2 query 3 . . . one-time processing iter. 1 iter. 2 . . . Input Lineage: track the graph of transformations that built RDD Checkpoint: lineage graphs get large How to Failover? FAQ #7: Is Spark compa

10、tible with Hadoop ecosystem? FAQs FAQ #8:Need port to Spark? FAQs FAQ #9: Any cons about Spark? FAQs Agenda Intel contributions to Spark Collaboration Real world cases Summary Logs continuously collected & streamed in Through queuing/messaging systems Incoming logs processed in a (semi) streaming fa

11、shion Aggregations for different time periods, demographics, etc. Join logs and history tables when necessary Aggregation results then consumed in a (semi) streaming fashion Monitoring, alerting, etc. Case1#:Real-Time Log Aggregation Implications Better streaming framework support Complex (e.g., sta

12、tful) analysis, fault-tolerance, etc. Kafka & Spark not collocated DStream retrieves logs in background (over network) and caches blocks in memory Memory tuning to reduce GC is critical spark.cleaner.ttl (throughput * spark.cleaner.ttl spark mem free size) Storage level (MEMORY_ONLY_SER2) Lower late

13、ncy (several seconds) No startup overhead (reusing SparkContext) Real-Time Log Aggregation: Spark Streaming Kafka Cluster Log Collectors Spark Cluster RDBMS Algorithm: complex match operations Mostly matrix based Multiplication, factorization, etc. Sometime graph-based E.g., sparse matrix Iterative

14、computations Matrix (graph) cached in memory across iterations Case #2:Machine Learning & Graph Analysis N-degree association in the graph Computing associations between two vertices that are n-hop away E.g., friends of friend Graph-parallel implementation Bagel (Pregel on Spark) and GraphX Memory o

15、ptimizations for efficient graph caching critical Speedup from 20+ minutes to 2 minutes Graph Analysis: N-Degree Association Graph Analysis: N-Degree Association v w u Statew = list of Weight(x, w) (for current top K weights to vertex w) Statev = list of Weight(x, v) (for current top K weights to vertex v) v w u Messages = D(x, u) = Weight(x, w) * edge(w, u) (for weight(x, w) in Statew) Messages = D(x, u) = Weight(x, v) * edge(w, u) (for weight(

人人文库> 全部分类> 应用文书 > 事务文书

温馨提示

1. 本站所有资源如无特殊说明，都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2. 本站的文档不包含任何第三方提供的附件图纸等，如果需要附件，请联系上传者。文件的所有权益归上传用户所有。
3. 本站RAR压缩包中若带图纸，网页内容里面会有图纸预览，若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 人人文库网仅提供信息存储空间，仅对用户上传内容的表现方式做保护处理，对用户上传分享的文档内容本身不做任何修改或编辑，并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容，请与我们联系，我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

夏俊鸾：Spark――基于内存的下一代大数据分析框架

文档简介

温馨提示

最新文档

评论

夏俊鸾：Spark――基于内存的下一代大数据分析框架

文档简介

温馨提示

最新文档

评论

相关文档