
Chapter 4  Memory Hierarchy
Computer Architecture (计算机系统结构)

"Ideally one would desire an indefinitely large memory capacity such that any particular word would be immediately available. ... We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."  -- A. W. Burks

Memory Hierarchy
A computer system contains several kinds of storage: registers, stack, cache, main memory, disk, and other storage. We focus on the memory hierarchy of a desktop machine: registers, cache, main memory, and disk storage. Because of locality and the higher speed of smaller memories, a memory hierarchy can improve performance.

Level                      1                      2                      3                  4
Name                       registers              cache                  main memory        disk storage
Typical size               1 KB                   16 MB                  -                  100 GB
Implementation technology  custom memory with     on-chip or off-chip    CMOS DRAM          magnetic disk
                           multiple ports, CMOS   CMOS SRAM
Access time (ns)           0.25 - 0.5             0.5 - 25               80 - 250           5,000,000
Bandwidth (MB/sec)         20,000 - 100,000       5,000 - 10,000         1,000 - 5,000      20 - 150
Managed by                 compiler               hardware               operating system   operating system/operator
Backed by                  cache                  main memory            disk               CD or tape

Moving from level 1 toward level 4, the memories become larger but slower.

Registers and Stack (Built into the CPU)
Example: the code sequence for C = A + B is
    Push A
    Push B
    Add
    Pop C
With A = 2 and B = 3 in memory, the trace is: Push A leaves 2 on the top of the stack (TOS); Push B pushes 3, so TOS = 3 with 2 beneath it; Add pops both operands and leaves TOS = 5; Pop C moves the 5 from the top of the stack to memory location C, so C = A + B = 2 + 3 = 5.
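The push/add/pop trace above can be sketched as a toy stack machine (illustrative only; a dict stands in for main memory):

```python
# Toy stack machine evaluating C = A + B with A = 2, B = 3.
memory = {"A": 2, "B": 3}
stack = []

def push(var):  stack.append(memory[var])          # load variable onto TOS
def add():      stack.append(stack.pop() + stack.pop())  # pop two, push sum
def pop(var):   memory[var] = stack.pop()          # store TOS to memory

push("A")   # TOS = 2
push("B")   # TOS = 3, with 2 beneath it
add()       # TOS = 5
pop("C")    # memory C = 5
print(memory["C"])   # prints 5
```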

Cache: the First Memory After the CPU
A cache is "a safe place for hiding or storing things." It is the first level of the memory hierarchy encountered once the address leaves the CPU, and the term is applied whenever buffering is employed to reuse commonly occurring items. The cache and main memory have the same relationship as the main memory and the disk.

Cache terminology
- Cache hit: the CPU finds a requested data item in the cache.
- Cache miss: the CPU does not find a data item it needs in the cache.
- Latency and bandwidth: the time required to service a cache miss depends on both. Latency determines the time to retrieve the first word of the block; bandwidth determines the time to retrieve the rest of the block.
- Block: a fixed-size collection of data containing the requested word. On a miss, the block is retrieved from main memory and placed into the cache for later reuse.
- Page: a fixed-size block of virtual memory. At any time, each page resides either in main memory or on disk.
- Page fault: the CPU references an item within a page that is not present in the cache or main memory. When a page fault happens, the CPU switches to some other task while the disk access occurs.

Cache performance
We now account for the number of cycles during which the CPU is stalled waiting for a memory access; we call these the memory stall cycles.

The performance is then the product of the clock cycle time and the sum of the CPU clock cycles and the memory stall cycles:

    CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time

Here the CPU clock cycles include the time to handle a cache hit, and the CPU is stalled during a cache miss. The number of memory stall cycles depends on both the number of misses and the cost per miss, which is called the miss penalty:

    Memory stall cycles = Number of misses x Miss penalty
                        = IC x Misses per instruction x Miss penalty

Review of the CPU performance equation
Essentially all computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles, or clock cycles. Computer designers refer to the time of a clock period by its duration (e.g., 1 ns) or by its rate (e.g., 1 GHz). CPU time for a program can then be expressed two ways:

    CPU time = CPU clock cycles x Clock cycle time
    CPU time = CPU clock cycles / Clock rate

In addition to the number of clock cycles needed to execute a program, we can also count the number of instructions executed: the instruction path length or instruction count (IC). If we know the number of clock cycles and the instruction count, we can calculate the average number of clock cycles per instruction (CPI):

    CPI = CPU clock cycles / IC

The clock cycles can then be written as IC x CPI, which gives two equivalent forms of the execution time formula:

    CPU time = IC x CPI x Clock cycle time
    CPU time = IC x CPI / Clock rate
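As a quick numeric check that the two forms agree (the values below are made-up illustrative numbers, not from the slides):

```python
# Verify that IC x CPI x cycle time equals IC x CPI / clock rate.
ic = 1_000_000                       # instruction count (assumed)
cpi = 2.0                            # average clock cycles per instruction (assumed)
clock_rate = 1e9                     # 1 GHz
clock_cycle_time = 1 / clock_rate    # 1 ns

t1 = ic * cpi * clock_cycle_time     # CPU time = IC x CPI x Clock cycle time
t2 = ic * cpi / clock_rate           # CPU time = IC x CPI / Clock rate

print(t1, t2)                        # both equal 0.002 seconds (2 ms)
```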

Cache performance, Example 1
Assume we have a computer where the clocks per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?

Answer for Example 1
First compute the performance for the computer that always hits:

    CPU execution time = (CPU clock cycles + Memory stall cycles) x Clock cycle time
                       = (IC x 1.0 + 0) x Clock cycle time

For the computer with the real cache, first compute the memory stall cycles:

    Memory stall cycles = IC x (1 + 0.5) x 0.02 x 25 = IC x 0.75

where the middle term (1 + 0.5) represents one instruction access plus 0.5 data accesses per instruction. The total performance is thus

    CPU execution time (with cache) = (IC x 1.0 + IC x 0.75) x Clock cycle time
                                    = 1.75 x IC x Clock cycle time

The performance ratio is the ratio of the execution times:

    (1.75 x IC x Clock cycle time) / (1.0 x IC x Clock cycle time) = 1.75

The computer with no cache misses is 1.75 times faster.

Cache performance, Example 2
To show the equivalency between the two miss rate equations, let us redo the example above, this time assuming a miss rate per 1000 instructions of 30. What is the memory stall time in terms of instruction count?

Answer for Example 2
Recomputing the memory stall cycles:

    Memory stall cycles = IC x (30 / 1000) x 25 = IC x 0.75

We get the same answer as in Example 1.
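The two example computations can be checked numerically with the slides' own values:

```python
# Example 1: stalls from the miss rate per memory access.
cpi_hit = 1.0                              # CPI when every access hits
accesses_per_instr = 1 + 0.5               # 1 instruction fetch + 0.5 data accesses
stall_per_instr_1 = accesses_per_instr * 0.02 * 25   # miss rate 2%, penalty 25

# Example 2: stalls from misses per 1000 instructions.
stall_per_instr_2 = (30 / 1000) * 25

speedup = (cpi_hit + stall_per_instr_1) / cpi_hit

print(stall_per_instr_1, stall_per_instr_2, speedup)  # approximately 0.75 0.75 1.75
```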

Four Memory Hierarchy Questions
- Q1, block placement: where can a block be placed in the upper level?
- Q2, block identification: how is a block found if it is in the upper level?
- Q3, block replacement: which block should be replaced on a miss?
- Q4, write strategy: what happens on a write?

Q1: Where Can a Block Be Placed in a Cache?
If each block has only one place it can appear in the cache, the cache is said to be direct mapped. The mapping is usually

    (Block address) MOD (Number of blocks in cache)

If a block can be placed anywhere in the cache, the cache is said to be fully associative.
If a block can be placed in a restricted set of places in the cache, the cache is set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then the block can be placed anywhere within that set.
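These placement rules can be sketched in a few lines (a minimal illustration, using an 8-frame cache and block address 12):

```python
# Placement rules for the three cache organizations.
def direct_mapped_frame(block_addr, n_frames):
    return block_addr % n_frames      # exactly one legal frame

def set_index(block_addr, n_sets):
    return block_addr % n_sets        # any frame within this set is legal

print(direct_mapped_frame(12, 8))     # frame 4 (12 mod 8)
print(set_index(12, 4))               # set 0 (12 mod 4)
# Fully associative: any of the 8 frames is legal; no index computation at all.
```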

The set is usually chosen by bit selection; that is,

    (Block address) MOD (Number of sets in cache)

If there are n blocks in a set, the cache placement is called n-way set associative.

For example, consider block 12 of the lower level and a cache with eight block frames. With full associativity, block 12 can go into any of the eight block frames. With direct mapping, block 12 can be placed only into block frame 4 (12 modulo 8). A set-associative cache, which has some of both features, allows the block to be placed anywhere in set 0 (12 modulo 4); with two blocks per set, this means block 12 can be placed either in block 0 or in block 1 of that set. An organization with four sets and two blocks per set is called two-way set associative. Note that real caches contain thousands of block frames and real memories contain millions of blocks.

Q2: How Is a Block Found If It Is in the Cache?
Caches have an address tag on each block frame that gives the block address. The tag of every cache block that might contain the desired information is checked to see whether it matches the block address from the CPU. As a rule, all possible tags are searched in parallel because speed is critical.
The tag is used to check all the blocks in the set, and the index is used to select the set. The block offset is the address of the desired data within the block. Fully associative caches have no index field.
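The tag/index/offset decomposition can be sketched as follows. The cache geometry is an assumed example, not from the slides: 64-byte blocks (6 offset bits) and 128 sets (7 index bits).

```python
# Split an address into the tag, index, and block-offset fields used in Q2.
BLOCK_BITS = 6    # 64-byte blocks  (assumed)
INDEX_BITS = 7    # 128 sets        (assumed)

def split_address(addr):
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
# Recombining the three fields reproduces the original address.
assert (tag << (BLOCK_BITS + INDEX_BITS)) | (index << BLOCK_BITS) | offset == 0x12345678
```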

Q3: Which Block Should Be Replaced on a Cache Miss?
When a miss occurs, the cache controller must select a block to be replaced with the desired data. A benefit of direct-mapped placement is that hardware decisions are simplified; in fact, they are so simple that there is no choice: only one block frame is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are many blocks to choose from on a miss. There are three primary strategies for selecting which block to replace: random, least recently used (LRU), and first in, first out (FIFO).
There is little difference between LRU and random for the largest cache sizes, with LRU outperforming the others for smaller caches. FIFO generally outperforms random in the smaller cache sizes.

Q4: What Happens on a Write?
Reading the cache is the easy case: the block can be read from the cache at the same time that the tag is read and compared, so the block read begins as soon as the block address is available. If it is a hit, the requested part of the block is passed on to the CPU immediately; if it is a miss, the value read is simply ignored. Writing is different: modifying a block cannot begin until the tag is checked to see whether the address is a hit. Because this tag checking cannot occur in parallel with the write, writes normally take longer than reads.
There are two basic options when writing to the cache:
- Write through: the information is written to both the block in the cache and the block in the lower-level memory.
- Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
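The traffic difference between the two write policies can be sketched with a one-block cache processing a store trace (an illustrative model, not from the slides):

```python
# Count the writes that reach the lower-level memory under each policy,
# assuming a cache holding a single block.
def lower_level_writes(trace, policy):
    cached = None      # block address currently held
    dirty = False
    writes = 0
    for block in trace:
        if block != cached:                    # write to a different block: replace
            if policy == "write_back" and dirty:
                writes += 1                    # evicted dirty block is written back
            cached, dirty = block, False
        if policy == "write_through":
            writes += 1                        # every store also goes to memory
        else:
            dirty = True                       # write back: just mark the block dirty
    if policy == "write_back" and dirty:
        writes += 1                            # final write-back of the dirty block
    return writes

trace = [7] * 10 + [3] * 5                     # 10 stores to block 7, then 5 to block 3
print(lower_level_writes(trace, "write_through"))  # 15 lower-level writes
print(lower_level_writes(trace, "write_back"))     # 2 lower-level writes
```

Multiple writes within a block collapse into a single lower-level write under write back, which is exactly the bandwidth advantage the slides describe.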

To reduce the frequency of writing back blocks on replacement, a dirty bit is commonly used: a status bit indicating whether the block is dirty (modified while in the cache) or clean (not modified). If it is clean, the block is not written back on a miss.

Advantages of write back:
- Writes occur at the speed of the cache memory, and multiple writes within a block require only one write to the lower-level memory.
- Since some writes do not go to memory, write back uses less memory bandwidth.
- Since write back uses the rest of the memory hierarchy and the memory buses less than write through, it also saves power.

Advantages of write through:
- Write through is easier to implement than write back.
- The cache is always clean, so unlike write back, read misses never result in writes to the lower level.
- The next lower level has the most current copy of the data, which simplifies data coherency.

We want write back for processor caches to reduce the memory traffic, and write through to keep the cache consistent with lower levels of the memory hierarchy. When the CPU must wait for writes to complete during write through, the CPU is said to write stall. A write buffer can be used to reduce write stalls; however, write stalls can occur even with a write buffer.

Memory Technology: DRAM (dynamic RAM)
In the internal organization of a 64M-bit DRAM, the address is multiplexed: a row access strobe (RAS) is followed by a column access strobe (CAS).
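The RAS/CAS address split can be sketched as follows, assuming the 64M-bit array is square (8192 x 8192 bits, so 13 row bits and 13 column bits; the square shape is an assumption for illustration):

```python
# Multiplexed addressing for a 64M-bit DRAM: 26 address bits sent as
# a 13-bit row (with RAS) followed by a 13-bit column (with CAS).
ROW_BITS = COL_BITS = 13

def ras_cas(bit_addr):
    row = bit_addr >> COL_BITS               # sent first, with RAS
    col = bit_addr & ((1 << COL_BITS) - 1)   # sent second, with CAS
    return row, col

row, col = ras_cas(0x2ABCDEF)
assert (row << COL_BITS) | col == 0x2ABCDEF  # fields recombine to the address
```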

DRAM refresh: to prevent loss of information, each bit must be refreshed periodically. A refresh typically costs a full memory access (RAS and CAS) for each row of the DRAM, and the memory system is occasionally unavailable because it is sending a signal telling every chip to refresh.
DRAMs are commonly sold on small boards called dual inline memory modules (DIMMs); a DIMM typically contains 4 to 16 DRAM chips.
In the early years (before 1998) DRAMs obeyed Moore's law, bringing out a new chip with four times the capacity every three years. Virtually all desktop and server computers since 1975 have used DRAMs for main memory and SRAMs for cache.

Times of fast and slow DRAMs with each generation:

Year of introduction  Chip size  Slowest DRAM   Fastest DRAM   CAS time  Cycle time
                                 RAS time (ns)  RAS time (ns)  (ns)      (ns)
1980                  64 KB      180            150            75        250
1983                  256 KB     150            120            50        220
1986                  1 MB       120            100            25        190
1989                  4 MB       100            80             20        165
1992                  16 MB      80             60             15        120
1996                  64 MB      70             50             12        110
1998                  128 MB     70             50             10        100
2000                  256 MB     65             45             7         90
2002                  512 MB     60             40             5         80

SRAM Technology (static RAM)
The dynamic nature of the circuits in a DRAM requires the data to be written back after being read; hence the refresh. SRAMs typically use six transistors per bit to prevent the information from being disturbed when read. SRAM needs only minimal power to retain the charge in standby mode, but DRAMs must continue to be refreshed occasionally so as not to lose information. In DRAM designs the emphasis is on cost per bit and capacity, while SRAM designs are concerned with speed and capacity.

ROM and Flash
Embedded computers usually have small memories, and most do not have a disk.
- ROM (read-only memory) is programmed at the time of manufacture, needing only a single transistor per bit to represent a 1 or a 0. Nothing the computer can do can modify the contents of this memory.
- Flash memory allows the memory to be modified after manufacture. Flash allows reading at almost DRAM speeds, but writing flash is 10 to 100 times slower.
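The cost of refresh can be estimated with a back-of-the-envelope calculation. All figures here are assumed for illustration (the slides give no numbers): 8192 rows, every row refreshed within a 64 ms window, and roughly 100 ns per row refresh.

```python
# Fraction of time the DRAM is busy refreshing instead of serving accesses.
rows = 8192                  # rows in the DRAM array (assumed)
refresh_window = 64e-3       # seconds allowed to cycle through all rows (assumed)
row_refresh_time = 100e-9    # one RAS+CAS access per row, ~100 ns (assumed)

overhead = rows * row_refresh_time / refresh_window
print(f"{overhead:.2%}")     # about 1.28% of the time spent refreshing
```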

Virtual Memory
Virtual memory divides physical memory into blocks and allocates them to different processes. It automatically manages the two levels of the memory hierarchy represented by main memory and secondary storage.
With virtual memory, the CPU produces virtual addresses that are translated by a combination of hardware and software into physical addresses, which access main memory. This process is called memory mapping or address translation. Replacement on cache misses is primarily controlled by hardware, while virtual memory replacement is primarily controlled by the operating system.
Virtual memory systems can be categorized into two classes: those with fixed-size blocks, called pages, and those with variable-size blocks, called segments. Pages are typically fixed at 4096 to 65,536 bytes, while segment size varies. The decision to use paged virtual memory versus segmented virtual memory affects the CPU: paged addressing has a single fixed-size address divided into a page number and a page offset.
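Paged address translation can be sketched as follows. This is a minimal model, assuming 4 KB pages; the dict stands in for the page table maintained by the hardware and operating system, and the mapping values are made up.

```python
# Translate a virtual address into a physical address via a page table.
PAGE_SIZE = 4096

page_table = {0: 7, 1: 3, 2: 9}      # virtual page number -> physical frame (assumed)

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # split into page number and page offset
    if vpn not in page_table:
        # A real system would raise a page fault and fetch the page from disk.
        raise LookupError(f"page fault on virtual page {vpn}")
    return page_table[vpn] * PAGE_SIZE + offset

print(hex(translate(0x1ABC)))   # virtual page 1 maps to frame 3: prints 0x3abc
```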
