版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Spark: High-Speed Big Data Analysis Framework Intel Andrew Xia Weibo:Andrew-Xia Agenda Intel contributions to Spark Collaboration Real world cases Summary Spark Overview Open source projects initiated by AMPLab in UC Berkeley Apache incubation since June 2013 Intel closely collaborating with AMPLab
2、& the community on open source development UC BERKELEY Contributions by Intel Netty based shuffle for Spark FairScheduler for Spark Spark job log files Metrics system for Spark Spark shell on YARN Spark (standalone mode) integration with security Hadoop Byte code generation for Shark Co-partitioned
3、join in Shark . . . Intel China 3 committers 7 contributors 50+ patches Agenda Intel contributions to Spark Collaboration Real world cases Summary Intel partnering with several big websites Building next-gen big data analytics using the Spark stack E.g., Alibaba , Baidu iQiyi, Youku, etc. Collaborat
4、ion Partners Advertising Operation analysis Effect analysis Directional optimization Analysis Website report Platform report Monitor system Recommendation Ranking list Personal recommendation Hot-click analysis Big Data in Partners FAQ #1: Poor Performance machine learning and graph computation OLAP
5、 for Tabular data, interactive query FAQs Hadoop Data Sharing iter. 1 iter. 2 . . . Input query 1 query 2 query 3 result 1 result 2 result 3 HDFS read Slow due to replication, serialization, and disk IO Spark Data Sharing Input query 1 query 2 query 3 . . . one-time processing iter. 1 iter. 2 . . .
6、Input 10-100 faster than network and disk FAQ #2: Too many big data systems FAQs Big Data Systems Today MapReduce Specialized systems (iterative, interactive and streaming apps) General batch processing Vision of Spark Ecosystem One stack to rule them all! Spark Ecosystem HDFS/Hadoop Storage Mesos Y
7、ARN Tachyon Spark Shark SQL Spark Streaming Graphx Graph- parallel MLBase Machine learning MPI MapReduce FAQ #3: Study cost FAQs Code Size 0 20000 40000 60000 80000 100000 120000 140000 Hadoop MapReduce Storm (Streaming) Impala (SQL) Giraph (Graph) Spark non-test, non-example source lines GraphX Sha
8、rk* * also calls into Hive Streaming FAQ #4: Is Spark Stable? FAQs Spark 0.8 has been released Spark 0.9 will be release in Jan 2013 Spark Status FAQ #5: Not enough memory to cache FAQs Graceful degradation Scheduler takes care of this Other options MEMORY_ONLY MEMORY_ONLY_SER MEMORY_AND_DISK DISK_O
9、NLY Not Enough Memory FAQ #6: How to recover when failure? FAQs How to Failover? Input query 1 query 2 query 3 . . . one-time processing iter. 1 iter. 2 . . . Input Lineage: track the graph of transformations that built RDD Checkpoint: lineage graphs get large How to Failover? FAQ #7: Is Spark compa
10、tible with Hadoop ecosystem? FAQs FAQ #8:Need port to Spark? FAQs FAQ #9: Any cons about Spark? FAQs Agenda Intel contributions to Spark Collaboration Real world cases Summary Logs continuously collected & streamed in Through queuing/messaging systems Incoming logs processed in a (semi) streaming fa
11、shion Aggregations for different time periods, demographics, etc. Join logs and history tables when necessary Aggregation results then consumed in a (semi) streaming fashion Monitoring, alerting, etc. Case1#:Real-Time Log Aggregation Implications Better streaming framework support Complex (e.g., sta
12、tful) analysis, fault-tolerance, etc. Kafka & Spark not collocated DStream retrieves logs in background (over network) and caches blocks in memory Memory tuning to reduce GC is critical spark.cleaner.ttl (throughput * spark.cleaner.ttl spark mem free size) Storage level (MEMORY_ONLY_SER2) Lower late
13、ncy (several seconds) No startup overhead (reusing SparkContext) Real-Time Log Aggregation: Spark Streaming Kafka Cluster Log Collectors Spark Cluster RDBMS Algorithm: complex match operations Mostly matrix based Multiplication, factorization, etc. Sometime graph-based E.g., sparse matrix Iterative
14、computations Matrix (graph) cached in memory across iterations Case #2:Machine Learning & Graph Analysis N-degree association in the graph Computing associations between two vertices that are n-hop away E.g., friends of friend Graph-parallel implementation Bagel (Pregel on Spark) and GraphX Memory o
15、ptimizations for efficient graph caching critical Speedup from 20+ minutes to 2 minutes Graph Analysis: N-Degree Association Graph Analysis: N-Degree Association v w u Statew = list of Weight(x, w) (for current top K weights to vertex w) Statev = list of Weight(x, v) (for current top K weights to vertex v) v w u Messages = D(x, u) = Weight(x, w) * edge(w, u) (for weight(x, w) in Statew) Messages = D(x, u) = Weight(x, v) * edge(w, u) (for weight(
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2026年烟台市芝罘区社区工作者招聘考试模拟试题及答案解析
- 漳州科技职业学院《法律逻辑学补充》2025-2026学年期末试卷
- 泉州职业技术大学《人类学概论》2025-2026学年期末试卷
- 长春中医药大学《电磁学》2025-2026学年期末试卷
- 长春信息技术职业学院《大学生职业与发展》2025-2026学年期末试卷
- 江西农业大学《病理学与病理生理学》2025-2026学年期末试卷
- 芜湖航空职业学院《工程监理》2025-2026学年期末试卷
- 合肥共达职业技术学院《数字经济学》2025-2026学年期末试卷
- 亳州职业技术学院《知识产权法》2025-2026学年期末试卷
- 泉州轻工职业学院《船舶消防》2025-2026学年期末试卷
- 镇江市2026烟草专卖局招聘考试-行测-专业知识题库(含答案)
- 2026年上海对外经贸大学辅导员招聘笔试模拟试题及答案解析
- 南通市医疗机构主要运行指标定期公布工作实施方案
- 四川三江招商集团有限公司2026年3月公开招聘工作人员考试参考试题及答案解析
- 【励志教育】主题班会:《张雪机车夺冠》从山村少年到世界冠军的缔造者【课件】
- AI赋能地理教学的应用实践研究-初中-地理-论文
- 浙江省杭州山海联盟2024-2025学年度七年级英语下册期中试题卷(含答案)
- 湖北省武汉市2026高三下学期3月调研考试化学试题 含答案
- WS 436-2013医院二次供水运行管理
- 全国高中化学奥林匹克竞赛山东省预赛试题
- 晶闸管及其工作原理-课件
评论
0/150
提交评论