第6章存储器层次系统l2_第1页
第6章存储器层次系统l2_第2页
第6章存储器层次系统l2_第3页
第6章存储器层次系统l2_第4页
第6章存储器层次系统l2_第5页
已阅读5页,还剩50页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、School of Computer Science and Technology, HIT第六章 Part2 :高速缓存器¢ 教师: 郑贵滨¢ 计算机科学与技术学院¢ 哈尔滨工业大学1Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and CTeacrhnneoglioegMy,eHlloITn主要内容¢ 高速缓存器组织结构和操作¢ 高速缓存对程序性能的影响§§

2、;重新排列循环以提高空间局部性使用分块来提高时间局部性22Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT局部性¢ 局部性原理(Principle of Locality)§ 程序倾向于使用距离最近用过的指令/数据地址相近或相等的指令/数据。¢ 时间局部性(Temporal locality)§ 最近被再次过的信息,很可能在近期还会¢ 空间局部

3、性(Spatial locality)地址接近的数据项,被使用的时间也倾向于接近3Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT器层次结构L0:寄存器CPU 寄存器保存着从L1高速缓存取出的字更小、更快、更贵(每字节)的 设备L1:L1 高速缓存 (SRAM)L2 高速缓存(SRAM)L1 高速缓存保存着从L2高速缓存取出的缓存行L2 高速缓存保存着从L3高速缓存取L2:出的缓存行L3:L3

4、 高速缓存L3 高速缓存保存着从主存取出的缓存行主存保存着从磁盘取出的磁盘块(SRAM) 更大、更慢、更廉价 (每字节)L4:主存(DRAM)本地磁盘保存着从远程服务器磁盘上取出的文件的 设备本地二级(本地磁盘)L5:二级(分布式文件系统、Web 服务器)L6:4Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT高速缓存的基本概念第K层更小、更快、更昂贵的设备缓存着第K+1层块集合的一个子集第K

5、层大小为传输单元第K+1层更大、更慢、更便宜的设备被划分成块第k+1层5Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition6School of Computer Science and Technology, HIT高速缓存的基本概念: 命中请求块 b 中的数据块 b 在缓存中: 命中!请求: 14缓存器6Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Editio/p>

6、31144987School of Computer Science and Technology, HIT高速缓存的基本概念: 不命中请求块 b中的数据块 b 不在缓存中: 不命中!请求: 12缓存从器中取出块 b请求: 12块 b在缓存中器 放置策略(Placementpolicy): 决定块b放在缓存中的位置 替换策略(Replacement policy): 决定该替换存储器中的哪一块7Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition15121187430123141928S

7、chool of Computer Science and Technology, HIT高速缓存基本概念:缓存不命中的种类¢ 冷不命中(或强制性不命中)§当缓存为空时,对任何数据的请求都会不命中不命中大部分缓存将第k+1层的某个块限制在第k层块的一个子集里(有时只是一个块)§ 例如,第k+1层的块i必须放置在第k层的块 (i mod 4) 中¢§§当缓存足够大,但是被的对象都到同一缓存块中,此种不命中称为不命中§ 例如,程序请求块 0, 8, 0, 8, 0, 8,这时每次请求都不命中¢ 容量不命中§

8、当工作集(working set)的大小超过缓存的大小时,会发生容量不命中Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition8School of Computer Science and Technology, HIT高速缓存器¢ 高速缓存器是小型的、快速的基于SRAM的器是在硬件中自动管理的¢ 保持经常主存的块¢ CPU 首先查找缓存中的数据¢ 典型的系统结构: CPU 系统总线内存总线9Bryant and OHallaron, Comput

9、er Systems: A Programmers Perspective, Third Edition主存I/O 桥寄存器文件总线接口高速缓存ALUSchool of Computer Science and Technology, HIT回顾:现代CPU设计寄存器更新OK?10Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition操作结果地址地址.数据数据执行单元数据Cache功能部件加载算术运算算术运算算术运算分支指令控制单元地址指令操作指令译码退役单元寄存器文件指令Cache取指控

10、制School of Computer Science and CTeacrhnneoglioegMy,eHlloITnCPU实例系统总线内存总线台式机PC主存(DRAM)CPU (Intel Core i7)主板Source: DellSource: DellSource: PC MagazineSource:Source: Dell111Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition主存I/O 桥寄存器文件总线接口高速缓存ALUSchool of Computer Scienc

11、e and CTeacrhnneoglioegMy,eHlloITn实例(Cont.)Intel Sandy BridgeProcessor DieL1: 32KB 指令 + 32KB 数据L2: 256KBL3: 320MB1212Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT通用的高速缓存器组织结构(S, E, B)每个高速缓存块有B = 2b 字节1个有效位t个标记位每组E行(E =

12、2e)组0组1S = 2s 组组S-1高速缓存大小 C = S x E x B 数据字节13Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition012B-1有效标记012B-1有效标记012B-1有效标记012B-1有效标记012B-1有效标记012B-1有效标记School of Computer Science and Technology, HIT高速缓存读地址:数据从这个偏移位置开始标记组索引块偏移组组0组114Bryant and OHallaron, Computer Sys

13、tems: A Programmers Perspective, Third Edition012B-1有效标记012B-1有效标记012B-1有效标记012B-1有效标记t位s 位b 位 组 行:检查集合中的任何行是否有匹配的标记是 + 行有效: 命中 数据:从块偏移开始的数据School of Computer Science and Technology, HIT示例:直接高速缓存(E = 1)直接: 每一组只有一行假设: 缓存块大小为8字节Int地址:选择组S = 2s 组15Bryant and OHallaron, Computer Systems: A Programmers P

14、erspective, Third Edition0127有效标记t 位0011000127有效标记0127有效标记0127有效标记0127有效标记0127有效标记School of Computer Science and CTeacrhnneoglioegMy,eHlloITn示例:直接高速缓存(E = 1)直接: 每一组只有一行假设: 缓存块大小为8字节int地址:有效+ 标记匹配 = 命中块偏移16Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition01234有效标记t 位001

15、1005 6 7选择组School of Computer Science and CTeacrhnneoglioegMy,eHlloITn示例:直接高速缓存(E = 1)直接: 每一组只有一行假设: 缓存块大小为8字节int地址:有效+ 标记匹配 = 命中块偏移int (4 字节)在这里如果标记不匹配: 旧的行被、替换17Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition01234有效标记t 位0011005 6 7选择组School of Computer Science and

16、Technology, HIT直接高速缓存模拟M=16 字节(4-位 地址), B=2 字节/块,E=1 块/组地址跟踪(读, 每次读一个字节):S=4 组,t=1s=2b=100002, 不命中00012, 命中01112, 不命中10002, 不命中01780不命中00002块标记v组0 组1 组2 组318Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition0101?M80?-1910?0?010?M?6-7xSchool of Computer Science and Techn

17、ology, HITE-路组相联高速缓存(E = 2)E = 2: 每组两行假设:缓存块大小为8字节short int地址每组两行选择组S 组19Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition有效标记01234567有效标记01234567有效标记01234567有效标记01234567有效标记01234567有效标记01234567t 位001100School of Computer Science and Technology, HITE-路组相联高速缓存(E = 2)E =

18、 2: 每组两行假设:缓存块大小为8字节short int地址比较有效标记匹配= 命+块偏移20Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition有效标记01234有效标记01234t 位001100选中择组567567School of Computer Science and Technology, HITE-路组相联高速缓存(E = 2)E = 2: 每组两行假设:缓存块大小为8字节short int地址比较有效标记匹配= 命+块偏移short int (2 字节) 在这里不匹配

19、: 在组中选择1行用于和替换 替换策略: 随机、最近最少使用(LRU), 21Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition有效标记01234有效标记01234t 位001100选中择组5 6 75 6 7School of Computer Science and Technology, HIT2-路组相联缓存模拟t=2s=1b=1M=16 字节地址, B=2 字节/块, S=2 组,E=2 块/组地址跟踪(读, 每次读一个字节):01780v00002,00012,01112,

20、10002,00002块misshit miss miss hit标记组0组122Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition100?1M? 6-70?101?0M? 8-9100?0M? 0-1xxxxSchool of Computer Science and Technology, HIT关于怎么写?¢ 存在多个数据副本:§L1, L2, L3,主存, 磁盘¢ 在写命中时要做什么?§§直写(Write-through,立即写

21、入器)写回(Write-back,推缓存行要替换时才写入内存)§ 需要一个修改位(标识缓存行与内存是否相同/有修改)¢ 写不命中时要做什么?§写分配(Write-allocate加载到缓存,更新这个缓存行)§ 如后续有较多向该位置的写,优势明显非写分配(No-write-allocate直接写到主存中,不加载到缓存中)§¢ 典型方案§§直写+ 非写分配写回+ 写分配23Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third E

22、ditionSchool of Computer Science and Technology, HITIntel Core i7高速缓存层次结构处理器封装L1 指令高速缓存和数据高速缓存:32 KB, 8-way,时间: 4周期L2 统一的高速缓存:256 KB, 8-way,时间: 10 周期L3 统一的高速缓存:8 MB, 16-way,时间: 40-75 周期块大小:所有缓存都是64 字节Core 0Core 3RegsRegsL1d-cacheL1i-cacheL1d-cacheL1i-cacheL2 unified cacheL2 unified cacheL3 unified c

23、ache(shared by all cores)Main memory24Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT通用的高速缓存器组织结构(S, E, B)每个高速缓存块有B = 2b 字节1个有效位 t个标记位每组E行(E = 2e)组0组1S = 2s 组组S-1高速缓存大小 C = S x E x B 数据字节25Bryant and OHallaron, Computer S

24、ystems: A Programmers Perspective, Third Edition012B-1有效标记012B-1有效标记012B-1有效标记012B-1有效标记012B-1有效标记012B-1有效标记School of Computer Science and Technology, HIT例子: Core i7 L1 数据缓存26Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition栈地址:块偏移:0x? 0x00007f7262a1e010 组索引:0x?标记:0x?块

25、偏移: ?位组索引: ?位标记: ?位32 KB、8路组相联1个有效位 t个标记位缓存块64 字节/块组0每组48/52 位地址范围E行(E= 2e)B =S = 2s 组组1S =s =E =e =C =组S-1高速缓存大小 C = S x E x B 数据字节有效标记012B-1012B-1有效标记有效标记012B-1012B-1有效标记有效标记012B-1012B-1有效标记School of Computer Science and Technology, HIT例子: Core i7 L1 数据缓存27Bryant and OHallaron, Computer Systems: A

26、 Programmers Perspective, Third Edition栈地址:块偏移:0x10 0x00007f7262a1e010 组索引:0x0标记:0x7f7262a1e0000 0001 0000块偏移:6 bits组索引:6 bits标记: 剩余全部bit(36/40 bits)32 KB、8路组相联1个有效位 t个标记位缓存块64 字节/块组0每组48/52 位地址范围E行(E= 2e)B = 64S = 2s 组组1S = 64, s = 6E = 8, e = 3C = 64 x 64 x 8 = 32,768组S-1高速缓存大小 C = S x E x B 数据字节有

27、效标记012B-1012B-1有效标记有效标记012B-1012B-1有效标记有效标记012B-1012B-1有效标记School of Computer Science and Technology, HIT高速缓存性能指标¢ 不§ 一部分内存= 1 在缓存中没有找到(不命中/)§ 典型数值(百分比):¢ § L1:3-10%§ L2:可以相当小(e.g., < 1%) ,依赖于L2大小等因素。¢ 命中时间§ 从高速缓存向处理器一行的时间§ 时间包括判断行是否在缓存中典型数值:§ L1

28、4个时钟周期§ L2 10个时钟周期§¢ 不命中处罚§由于不命中需要额外的时间§ 通常主存需50-200周期(趋势: 增加!)28Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT不VS¢ 命中or 不命中:差距巨大¢ 如果只有L1 和主存,那么可以差100倍¢ 你会相信99%要比97%好两倍?¢ 假设&#

29、162; 缓存命中时间为1个周期¢ 不命中处罚要100个周期¢ 平均¢ 97%¢ 99%时间: 1 周期+ 0.03 x 100 周期= 4 周期: 1 周期+ 0.01 x 100 周期= 2 周期¢ 这就是为什么用“不”而不是“”29Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT编写高速缓存友好的代码¢ 让常见的情况运行得快&#

30、167; 专注在函数和内循环上¢ 使用内层循环的缓存不命中数量降到最低§§变量好(时间局部性)反复步长为1的模式好(空间局部性)¢ 关键思想局部性的定性概念是通过缓冲器的理解而量化30Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT主要内容¢ 高速缓存的组织结构和运算¢ 高速缓存对程序性能的影响§§重新排列以提升空

31、间局部性使用块来提高时间局部性31Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT器山¢ 读吞吐量(读带宽)§ 每秒从系统中的字节数(MB/s)器山:测量的函数吞吐量作为空间和时间局部性¢§紧凑方式去描述系统性能32Bryant and OHallaron, Computer Systems: A Programmers Perspective, Thi

32、rd EditionSchool of Computer Science and CTeacrhnneoglioegMy,eHlloITn器山测试函数用多种elems、stride组合调用test()对于每个elems和stride:1. test()函数开始预热缓存2. Call test() 函数之后测量读吞吐量(MB/s)3332long dataMAXELEMS; /* Global array to traverse */* test - Iterate over first "elems" elements of* array “data” with strid

33、e of "stride", using* using 4x4 loop unrolling.*/int test(int elems, int stride) long i, sx2=stride*2, sx3=stride*3, sx4=stride*4; long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;long length = elems, limit = length - sx4;/* Combine 4 elements at a time */ for (i = 0; i < limit; i += sx4) ac

34、c0 = acc0 + datai;acc1 = acc1 + datai+stride; acc2 = acc2 + datai+sx2; acc3 = acc3 + datai+sx3;/* Finish any remaining elements */for (; i < length; i+) acc0 = acc0 + datai;return (acc0 + acc1) + (acc2 + acc3);Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool

35、of Computer Science and Technology, HITCore i7 Haswell2.1 GHz32 KB L1 高速缓存256 KB L2高速缓存8 MB L3 高速缓存64 B 块大小器山积极预取16000 L1 140001200010000时间局部性山脊8000600040002000空间局部性的斜坡032ks1128ks3 Mem 512ks52ms78mStride (x8 bytes)s9Size (bytes)32ms11128m34Bryant and OHallaron, Computer Systems: A Programmers Perspe

36、ctive, Third EditionRead throughput (MB/s)L3L2School of Computer Science and Technology, HIT主要内容¢ 高速缓存的组织结构和运算¢ 高速缓存对程序性能的影响§§重新排列以提升空间局部性使用块来提高时间局部性35Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT矩阵乘

37、法的例子¢ 描述:§§N x N 矩阵相乘矩阵元素类型是doubles (8 字节) 总共O(N3) 个操作每个元素都要读N次每个目标中都要对N个值求和§ 但也可以保存在寄存器中§§§36Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition/* ijk */变量sum保存for (i=0; i<n; i+) 在寄存器中for (j=0; j<n; j+) sum = 0.0;for (k=0; k<n

38、; k+)sum += aik * bkj; cij = sum;School of Computer Science and Technology, HIT矩阵相乘不¢ 假设:分析§ 块大小= 32B (足够容纳4个double数)§ 矩阵的维数N非常大:1/N大约为 0.0§ 缓存并未大到足够容纳多行¢ 分析方法:§ 看内循环的的模式jjk=xiikCAB37Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool

39、of Computer Science and Technology, HIT内存中C数组的布局(回顾)¢ C 数组分配按行顺序§ 每行在连续的内存位置¢ 按行扫描:§ for (i = 0; i < N; i+)sum += a0i;§连续的元素§ 如果块大小(B) > sizeof(aij)字节, 利用空间局部性不= sizeof(aij) / B¢ 按列扫描:§ for (i = 0; i < n; i+) sum += ai0;§远隔的元素§ 没有空间局部性!= 1 (

40、i.e. 100%)§ 不38Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HIT矩阵乘法(ijk)内层循环:(*,j)(i,*)ABC行顺序列顺序固定每次内层循环迭代的不命中数:A0.25B1.0C0.0块大小 = 32B (4 doubles)39Bryant and OHallaron, Computer Systems: A Programmers Perspective, Th

41、ird Edition(i,j)/* ijk */for (i=0; i<n; i+) for (j=0; j<n; j+) sum = 0.0;for (k=0; k<n; k+)sum += aik * bkj; cij = sum;matmult/mm.cSchool of Computer Science and Technology, HIT矩阵乘法(jik)内层循环:(*,j)(i,*)ABC行顺序列顺序固定每次内层循环迭代的不命中数:A0.25B1.0C0.0块大小 = 32B (4 doubles)40Bryant and OHallaron, Compute

42、r Systems: A Programmers Perspective, Third Edition(i,j)/* ijk */for (j=0; j<n; j+) for (i=0; i<n; i+) sum = 0.0;for (k=0; k<n; k+)sum += aik * bkj; cij = sum;matmult/mm.cSchool of Computer Science and Technology, HIT矩阵乘法(kij)内层循环:(k,*)(i,*)ABC固定行顺序行顺序每次内层循环迭代的不命中数:A0.0B0.25C0.25块大小 = 32B (

43、4 doubles)41Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition(i,k)/* kij */for (k=0; k<n; k+) for (i=0; i<n; i+) r = aik;for (j=0; j<n; j+) cij += r * bkj;matmult/mm.cSchool of Computer Science and Technology, HIT矩阵乘法(ikj)内层循环:(k,*)(i,*)ABC行顺序行顺序固定每次内层循环迭代的不命中

44、数:A0.0B0.25C0.25块大小 = 32B (4 doubles)42Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition(i,k)/* ikj */for (i=0; i<n; i+) for (k=0; k<n; k+) r = aik;for (j=0; j<n; j+) cij += r * bkj;matmult/mm.cSchool of Computer Science and Technology, HIT矩阵乘法(jki)内层循环:(*,k)(

45、*,j)ABC列顺序固定列顺序每次内层循环迭代的不命中数:A1.0B0.0C1.0块大小 = 32B (4 doubles)43Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition(k,j)/* jki */for (j=0; j<n; j+) for (k=0; k<n; k+) r = bkj;for (i=0; i<n; i+) cij += aik * r;matmult/mm.cSchool of Computer Science and Technology

46、, HIT矩阵乘法(kji)内层循环:(*,k)(*,j)ABC列顺序固定列顺序每次内层循环迭代的不命中数:A1.0B0.0C1.0块大小 = 32B (4 doubles)44Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Edition(k,j)/* kji */for (k=0; k<n; k+) for (j=0; j<n; j+) r = bkj;for (i=0; i<n; i+) cij += aik * r;matmult/mm.cSchool of Comput

47、er Science and Technology, HIT矩阵乘法总结for (i=0; i<n; i+) for (j=0; j<n; j+) sum = 0.0;for (k=0; k<n; k+)sum += aik * bkj; cij += sum;ijk(& jik): 2 加载, 0 每次迭代的不1.25=for (k=0; k<n; k+) for (i=0; i<n; i+) kij(& ikj): 2加载, 1 每次迭代的不0.5r = aik;for (j=0; j<n; j+)cij += r * bkj;=for

48、(j=0; j<n; j+) jki(& kji): 2加载, 1 每次迭代的不2.0for (k=0; k<n; k+) r = bkj;for (i=0; i<n; i+) cij += aik * r;=45Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third EditionSchool of Computer Science and Technology, HITCore i7矩阵乘法性能100jkijikkjikijijkikj ijk / jik10 kij / ikj 150100150200250 300 350400 450 500 550 600 650700数组大小(n)46Bryant and OHallaron, Computer Systems: A Programmers Perspective, Third Ed

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论