Deep Dive: How Spark Uses Memory

Agenda
- Memory Usage Overview
- Memory Contention
- Tungsten Memory Format
- Cache-aware Computation
- Future Plans

Where Spark Uses Memory
- Storage: memory used to cache data that will be used again later (controlled by the memory manager).
- Execution: memory used for computation in shuffles, joins, sorts and aggregations (controlled by the memory manager).
- Other: user data structures, internal metadata, objects created by UDFs, etc.

Execution Memory
- Example: an iterator over 4, 3, 5, 1, 6, 2 is sorted in execution memory to produce an iterator over 1, 2, 3, 4, 5, 6, from which we take(3).
- What if I want the sorted values again? Without caching, a later take(4) or take(5) has to read the input and sort it all over again, using execution memory each time.

Storage Memory
- Cache the sorted values 1, 2, 3, 4, 5, 6 in storage memory; take(3), take(4), take(5) can then be served from the cached data instead of re-sorting (see the sketch below).
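A minimal RDD sketch of this example (my own illustration; the variable names and local-mode setup are not from the slides): before cache(), every action re-runs the sort in execution memory; after cache(), the sorted blocks sit in storage memory and later actions read them from there.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sort-cache-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val numbers = sc.parallelize(Seq(4, 3, 5, 1, 6, 2))
    val sorted  = numbers.sortBy(identity)   // the sort itself runs in execution memory

    sorted.take(3)                           // sorts, then returns 1, 2, 3
    sorted.take(4)                           // no cache yet: reads and sorts again from scratch

    sorted.cache()                           // keep the sorted partitions in storage memory
    sorted.count()                           // materialize the cache
    sorted.take(5)                           // served from the cached blocks, no re-sort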

Memory Contention
- How to arbitrate memory between execution and storage?
- How to arbitrate memory across tasks running in parallel?
- How to arbitrate memory across operators running within the same task?

Challenge #1: How to arbitrate memory between execution and storage?

Easy, static assignment!
- Split the total available memory into a fixed execution region and a fixed storage region.
- When execution fills its region, it spills to disk; when storage fills its region, it evicts LRU blocks to disk.
- Inefficient memory usage leads to bad performance: execution can only use its fraction of the memory, even when there is no storage at all, so efficient use of memory required user tuning.

Unified Memory Management
- Execution and storage share one region; either side can use memory the other is not using.
- What happens if execution grows and there is already storage? Evict LRU cached blocks to disk to make room.

Design Considerations
- Why evict storage, not execution? Spilled execution data will always be read back from disk, whereas cached data may never be needed again.
- What if the application relies on its cache? Evicting it all would be bad, so allow users to specify a minimum unevictable amount of cached data (a protection inside the shared region, not a reservation!).
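For concreteness (not part of the slides): in Spark 1.6+ the shared region and the protected storage portion are exposed as spark.memory.fraction and spark.memory.storageFraction. A minimal sketch using their default values:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("unified-memory-demo")
      .master("local[*]")
      // fraction of (heap minus reserved memory) shared by execution and storage
      .config("spark.memory.fraction", "0.6")
      // portion of that shared region within which cached blocks are protected from
      // eviction by execution -- the "minimum unevictable amount", not a reservation
      .config("spark.memory.storageFraction", "0.5")
      .getOrCreate()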

Challenge #2: How to arbitrate memory across tasks running in parallel?

Easy, static assignment!
- The worker machine has 4 cores, so each task gets 1/4 of the total memory, whether or not 4 tasks are actually running.

Dynamic Assignment
- The share of each task depends on the number of actively running tasks.
- A single running task can use all of the available memory.
- When another task comes along, the first task has to spill to free up memory; each of the 2 tasks is now assigned 1/2 of the total memory.
- With 4 tasks running, each task is assigned 1/4 of the total memory; when only the last task remains, it gets all the memory again.

Static vs Dynamic Assignment
- Both are fair and starvation free.
- Static assignment is simpler.
- Dynamic assignment handles stragglers better (a simplified sketch of the dynamic policy follows).
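A simplified sketch of such a dynamic policy, written from the description above rather than from Spark's actual implementation: with N active tasks each task may grow to at most 1/N of the pool, and a request that cannot be granted in full is returned short so the caller spills.

    class ExecutionPool(poolSize: Long) {
      // bytes currently held by each actively running task
      private val held = scala.collection.mutable.Map.empty[Long, Long]

      /** Try to acquire `bytes` for `taskId`; returns the bytes actually granted.
        * A grant smaller than the request means the caller should spill. */
      def acquire(taskId: Long, bytes: Long): Long = synchronized {
        held.getOrElseUpdate(taskId, 0L)
        val n        = held.size                 // number of actively running tasks
        val maxShare = poolSize / n              // each task may use at most 1/N of the pool
        val free     = poolSize - held.values.sum
        val grant    = math.max(0L, math.min(bytes, math.min(free, maxShare - held(taskId))))
        held(taskId) += grant
        grant
      }

      /** Called when a task finishes: its share becomes available to the others. */
      def releaseAll(taskId: Long): Unit = synchronized { held.remove(taskId) }
    }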

Challenge #3: How to arbitrate memory across operators running within the same task?

Example query:

    SELECT age, AVG(height) FROM students GROUP BY age ORDER BY AVG(height)

    students.groupBy("age").avg("height").orderBy(avg("height")).collect()

- The physical plan is a single pipeline of operators running in one task: Scan -> Project -> Aggregate -> Sort.
- The task has 6 pages of memory.
- Aggregate builds a map from age to (total, count), e.g. 20 -> (483, 3), 21 -> (935, 5), 22 -> (172, 1).
- All 6 pages were used by Aggregate, leaving no memory for Sort!

Solution #1: reserve a page for each operator
- Starvation free, but still not fair: what if there were more operators?

Solution #2: cooperative spilling
- Sort forces Aggregate to spill a page to free memory.
- Sort needs more memory, so it forces Aggregate to spill another page (and so on).
- Sort finishes with 3 pages; Aggregate does not have to spill its remaining pages. (A sketch of this cooperation follows.)
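A hypothetical sketch of cooperative spilling (illustrative names, not Spark's real memory-consumer API): when one operator in a task cannot get a page, the arbiter asks the task's other operators to spill until the request can be satisfied.

    trait SpillableOperator {
      /** Spill some in-memory state to disk; returns the number of bytes freed. */
      def spill(bytesRequested: Long): Long
    }

    class TaskMemoryArbiter(private var freeBytes: Long) {
      private val operators = scala.collection.mutable.ArrayBuffer.empty[SpillableOperator]

      def register(op: SpillableOperator): Unit = operators += op

      /** Sort calls this when it needs a page; Aggregate may be forced to spill. */
      def acquirePage(requester: SpillableOperator, pageBytes: Long): Boolean = synchronized {
        val others = operators.filter(_ ne requester).iterator
        while (freeBytes < pageBytes && others.hasNext) {
          freeBytes += others.next().spill(pageBytes - freeBytes)   // cooperative spill
        }
        if (freeBytes >= pageBytes) { freeBytes -= pageBytes; true }
        else false                                                  // requester must spill itself
      }
    }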

Recap: three sources of contention
- How to arbitrate memory between execution and storage?
- How to arbitrate memory across tasks running in parallel?
- How to arbitrate memory across operators running within the same task?
- Instead of statically reserving memory in advance, deal with memory contention when it arises by forcing the competing consumers to spill.

How does Spark keep data in memory?
- The naive approach: put data as objects on the heap and operate on those objects; data caching is simply keeping a list of data objects.
- Data objects? No!
  - It is hard to monitor and control memory usage when we have a lot of objects.
  - Garbage collection will be the killer.
  - Java objects have notable space overhead (illustrated below).
  - High serialization cost when transferring data inside the cluster.
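One rough way to see the overhead (my own example; exact numbers depend on the JVM and its settings) is Spark's SizeEstimator utility: a few small boxed values cost several times the bytes of payload they actually carry.

    import org.apache.spark.util.SizeEstimator

    // The row (123, "data", "bricks") represented as Java objects on the heap:
    val asObjects = List(Integer.valueOf(123), "data", "bricks")

    // The same payload as raw bytes: 4 bytes of int plus 4 + 6 bytes of UTF-8 text.
    val asBytes = Array[Byte](0, 0, 0, 123) ++
      "data".getBytes("UTF-8") ++ "bricks".getBytes("UTF-8")

    println(SizeEstimator.estimate(asObjects))  // typically well over 100 bytes
    println(SizeEstimator.estimate(asBytes))    // roughly the 14 payload bytes plus an array header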

Keep data as binary and operate on binary data
- The data source is encoded into a compact binary format before it reaches the CacheManager; operators allocate memory pages from the Memory Manager and work directly on those pages.

Java-objects-based row format
- The row (123, "data", "bricks") becomes a Row object holding an Array that points to a boxed Integer(123), a String("data") and a String("bricks"): 5+ objects in total.
- High space overhead, slow value access, expensive hashCode().

Efficient binary format
- The same row is one contiguous buffer: null tracking bits, a fixed-width slot per field, and a variable-length section; the slots for "data" and "bricks" store the offset and length of their bytes.
- Because a value is just an (offset, length) view into the buffer, operations such as substring can work directly on the bytes, adjusting only the offset and length, without materializing Java objects (a simplified encoding sketch follows).
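A simplified, hypothetical encoding of (123, "data", "bricks") in this spirit; it mirrors the idea of Tungsten's UnsafeRow but is not its exact layout. One word of null bits, one 8-byte slot per field, and a variable-length tail; string slots pack offset and length into a single word.

    import java.nio.{ByteBuffer, ByteOrder}

    // Layout: [null bitmap][slot0][slot1][slot2][variable-length bytes]
    def encodeRow(i: Int, s1: String, s2: String): Array[Byte] = {
      val b1 = s1.getBytes("UTF-8")
      val b2 = s2.getBytes("UTF-8")
      val fixed = 8 + 3 * 8                                 // bitmap word + three 8-byte slots
      val buf = ByteBuffer.allocate(fixed + b1.length + b2.length).order(ByteOrder.LITTLE_ENDIAN)
      buf.putLong(0, 0L)                                    // null bitmap: no fields are null
      buf.putLong(8, i.toLong)                              // slot 0: the int, stored inline
      var off = fixed
      buf.putLong(16, (off.toLong << 32) | b1.length)       // slot 1: offset and length of "data"
      buf.position(off); buf.put(b1); off += b1.length
      buf.putLong(24, (off.toLong << 32) | b2.length)       // slot 2: offset and length of "bricks"
      buf.position(off); buf.put(b2)
      buf.array()
    }

    // Reading a string field only decodes (offset, length); no object until we ask for one.
    def readString(row: Array[Byte], slot: Int): String = {
      val word = ByteBuffer.wrap(row).order(ByteOrder.LITTLE_ENDIAN).getLong(8 + slot * 8)
      val off  = (word >>> 32).toInt
      val len  = (word & 0xFFFFFFFFL).toInt
      new String(row, off, len, "UTF-8")
    }

    // readString(encodeRow(123, "data", "bricks"), 1) == "data"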

How to process binary data more efficiently?

Understanding CPU cache
- Memory is becoming slower and slower relative to the CPU, so we should keep frequently accessed data in the CPU cache.
- The hardware pre-fetches data into the CPU cache along cache-line boundaries, so sequential access is much cheaper than pointer chasing.
- The 2 most important techniques in big data are Sort and Hash!

Naive sort
- The sorter keeps an array of pointers to records and swaps pointers while sorting.
- Each comparison dereferences two pointers into different memory regions, which makes it hard for the CPU cache to pre-fetch data: poor cache locality!

Cache-aware sort
- Store a key-prefix next to each pointer, so the sort array holds contiguous (pointer, key-prefix) pairs.
- Most of the time, the sort just goes through the key-prefixes in a linear fashion and only follows a pointer when two prefixes tie: good cache locality! (A sketch follows.)
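A hedged sketch of the idea (illustrative code, not Spark's sorter): sort an array of (key-prefix, record index) pairs, resolve most comparisons on the 8-byte prefixes, and touch the records only when two prefixes tie.

    final case class PrefixAndPointer(prefix: Long, recordIndex: Int)

    /** Pack the first 8 bytes of a string into a Long so that unsigned Long
      * comparison matches byte-wise lexicographic order. */
    def prefixOf(s: String): Long = {
      val b = s.getBytes("UTF-8")
      var p = 0L
      var i = 0
      while (i < 8) { p = (p << 8) | (if (i < b.length) b(i) & 0xFFL else 0L); i += 1 }
      p
    }

    def cacheAwareSort(records: Array[String]): Array[String] = {
      // The "pointer array": prefixes and pointers sit contiguously, so scanning
      // and comparing them is cache friendly.
      val entries = records.indices.map(i => PrefixAndPointer(prefixOf(records(i)), i)).toArray
      val sorted = entries.sortWith { (a, b) =>
        if (a.prefix != b.prefix) java.lang.Long.compareUnsigned(a.prefix, b.prefix) < 0
        else records(a.recordIndex).compareTo(records(b.recordIndex)) < 0   // rare: touch the records
      }
      sorted.map(e => records(e.recordIndex))
    }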

Naive hash map
- An array of pointers; each pointer leads to a (key, value) record stored elsewhere.
- Lookup: compute hash(key) % size to pick a slot, follow the pointer, and compare the stored key with the probe key; on a mismatch, continue with linear probing and compare again.
- Each lookup needs many pointer dereferences and key comparisons when hash collisions happen, and keeps jumping between two memory regions: bad cache locality!

Cache-aware hash map
- Store the full hash next to each pointer, so the probe array holds contiguous (pointer, full hash) pairs while the (key, value) records stay elsewhere.
- A lookup scans the probe array and compares the stored full hashes first; the pointer is only dereferenced, and the keys only compared, when the hashes already match, so most probes never leave the array: good cache locality. (A sketch follows.)
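A hypothetical sketch of that layout (illustrative; not Spark's BytesToBytesMap): an open-addressing table whose flat probe array stores the full hash and a "pointer" (here an index into a record store). It does no resizing or deletion and assumes the capacity exceeds the number of keys.

    class CacheAwareMap[K, V](capacity: Int) {
      private val hashes   = new Array[Int](capacity)        // full hash per slot (0 marks empty)
      private val pointers = new Array[Int](capacity)        // "pointer": index into the record store
      private val records  = scala.collection.mutable.ArrayBuffer.empty[(K, V)]

      private def slotOf(h: Int): Int = (h & Int.MaxValue) % capacity

      def put(key: K, value: V): Unit = {
        val h = key.hashCode | 1                             // keep 0 reserved as the empty marker
        var s = slotOf(h)
        while (hashes(s) != 0 && !(hashes(s) == h && records(pointers(s))._1 == key))
          s = (s + 1) % capacity                             // linear probing over the flat array
        if (hashes(s) == 0) { records += ((key, value)); pointers(s) = records.size - 1 }
        else records(pointers(s)) = (key, value)             // overwrite an existing key
        hashes(s) = h
      }

      def get(key: K): Option[V] = {
        val h = key.hashCode | 1
        var s = slotOf(h)
        while (hashes(s) != 0) {
          if (hashes(s) == h && records(pointers(s))._1 == key)  // dereference only on a hash match
            return Some(records(pointers(s))._2)
          s = (s + 1) % capacity
        }
        None
      }
    }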
