




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Spark: High-Speed Big Data Analysis Framework Intel Andrew Xia Weibo:Andrew-Xia Agenda Intel contributions to Spark Collaboration Real world cases Summary Spark Overview Open source projects initiated by AMPLab in UC Berkeley Apache incubation since June 2013 Intel closely collaborating with AMPLab
2、& the community on open source development UC BERKELEY Contributions by Intel Netty based shuffle for Spark FairScheduler for Spark Spark job log files Metrics system for Spark Spark shell on YARN Spark (standalone mode) integration with security Hadoop Byte code generation for Shark Co-partitioned
3、join in Shark . . . Intel China 3 committers 7 contributors 50+ patches Agenda Intel contributions to Spark Collaboration Real world cases Summary Intel partnering with several big websites Building next-gen big data analytics using the Spark stack E.g., Alibaba , Baidu iQiyi, Youku, etc. Collaborat
4、ion Partners Advertising Operation analysis Effect analysis Directional optimization Analysis Website report Platform report Monitor system Recommendation Ranking list Personal recommendation Hot-click analysis Big Data in Partners FAQ #1: Poor Performance machine learning and graph computation OLAP
5、 for Tabular data, interactive query FAQs Hadoop Data Sharing iter. 1 iter. 2 . . . Input query 1 query 2 query 3 result 1 result 2 result 3 HDFS read Slow due to replication, serialization, and disk IO Spark Data Sharing Input query 1 query 2 query 3 . . . one-time processing iter. 1 iter. 2 . . .
6、Input 10-100 faster than network and disk FAQ #2: Too many big data systems FAQs Big Data Systems Today MapReduce Specialized systems (iterative, interactive and streaming apps) General batch processing Vision of Spark Ecosystem One stack to rule them all! Spark Ecosystem HDFS/Hadoop Storage Mesos Y
7、ARN Tachyon Spark Shark SQL Spark Streaming Graphx Graph- parallel MLBase Machine learning MPI MapReduce FAQ #3: Study cost FAQs Code Size 0 20000 40000 60000 80000 100000 120000 140000 Hadoop MapReduce Storm (Streaming) Impala (SQL) Giraph (Graph) Spark non-test, non-example source lines GraphX Sha
8、rk* * also calls into Hive Streaming FAQ #4: Is Spark Stable? FAQs Spark 0.8 has been released Spark 0.9 will be release in Jan 2013 Spark Status FAQ #5: Not enough memory to cache FAQs Graceful degradation Scheduler takes care of this Other options MEMORY_ONLY MEMORY_ONLY_SER MEMORY_AND_DISK DISK_O
9、NLY Not Enough Memory FAQ #6: How to recover when failure? FAQs How to Failover? Input query 1 query 2 query 3 . . . one-time processing iter. 1 iter. 2 . . . Input Lineage: track the graph of transformations that built RDD Checkpoint: lineage graphs get large How to Failover? FAQ #7: Is Spark compa
10、tible with Hadoop ecosystem? FAQs FAQ #8:Need port to Spark? FAQs FAQ #9: Any cons about Spark? FAQs Agenda Intel contributions to Spark Collaboration Real world cases Summary Logs continuously collected & streamed in Through queuing/messaging systems Incoming logs processed in a (semi) streaming fa
11、shion Aggregations for different time periods, demographics, etc. Join logs and history tables when necessary Aggregation results then consumed in a (semi) streaming fashion Monitoring, alerting, etc. Case1#:Real-Time Log Aggregation Implications Better streaming framework support Complex (e.g., sta
12、tful) analysis, fault-tolerance, etc. Kafka & Spark not collocated DStream retrieves logs in background (over network) and caches blocks in memory Memory tuning to reduce GC is critical spark.cleaner.ttl (throughput * spark.cleaner.ttl spark mem free size) Storage level (MEMORY_ONLY_SER2) Lower late
13、ncy (several seconds) No startup overhead (reusing SparkContext) Real-Time Log Aggregation: Spark Streaming Kafka Cluster Log Collectors Spark Cluster RDBMS Algorithm: complex match operations Mostly matrix based Multiplication, factorization, etc. Sometime graph-based E.g., sparse matrix Iterative
14、computations Matrix (graph) cached in memory across iterations Case #2:Machine Learning & Graph Analysis N-degree association in the graph Computing associations between two vertices that are n-hop away E.g., friends of friend Graph-parallel implementation Bagel (Pregel on Spark) and GraphX Memory o
15、ptimizations for efficient graph caching critical Speedup from 20+ minutes to 2 minutes Graph Analysis: N-Degree Association Graph Analysis: N-Degree Association v w u Statew = list of Weight(x, w) (for current top K weights to vertex w) Statev = list of Weight(x, v) (for current top K weights to vertex v) v w u Messages = D(x, u) = Weight(x, w) * edge(w, u) (for weight(x, w) in Statew) Messages = D(x, u) = Weight(x, v) * edge(w, u) (for weight(
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 专题2.10 函数的综合应用(解析版)-2024年高考数学一轮复习精讲精练宝典(新高考专用)
- 车间地基施工方案
- 景观塔施工方案
- 互联网电商知识培训课件
- 印刷制作设计合同范例
- 吉首售房合同范例
- 2025年英语 英语五官标准课件
- 压手续不押车合同范例
- 脑疝的护理诊断及护理问题
- 丰富多样的幼儿园节日庆典计划
- 高中地理 选择性必修二 纽约的发展 纽约的辐射功能 城市的辐射功能 课件(第2课时)
- 抽油井示功图分析以及应用
- 新药发明简史
- 培优的目的及作用
- 高分子物理化学全套课件
- 【学海导航】2013届高三物理一轮复习 第11章 第3节 电磁振荡与电磁波 电磁波谱课件 新人教版
- 电工plc培训-技工技能类
- 塑胶及喷油件检验标准
- 电力系统碳排放流的计算方法初探_周天睿
- 长阳土家族自治县骨干教师考核评价评分表(试行)
- 雨水泵站工程施工设计方案范文
评论
0/150
提交评论