CS252 Graduate Computer Architecture
Lecture 10: Vector Processing (Continued), Branch Prediction
October 4, 2000
Prof. John Kubiatowicz

Review: Vector Processing
- Vector processing represents an alternative to complicated superscalar processors.
- Primitive operations on large vectors of data.
- Load/store architecture: data is loaded into vector registers; computation is register to register.
- The memory system can take advantage of predictable access patterns: unit stride, non-unit stride, indexed.
- Vector processors exploit large amounts of parallelism without data and control hazards:
  - Every element is handled independently and possibly in parallel.
  - Same effect as a scalar loop, without the control hazards or the complexity of Tomasulo-style hardware.
- Hardware parallelism can be varied across a wide range by changing the number of vector lanes in each vector functional unit.

Review: ILP? Wherefore Art Thou?
- There is a fair amount of ILP available, but branches get in the way.
- Better branch prediction techniques? Probably not much room left: prediction rates are already up at 93% and above.
- Fundamental new programming model? The vector model accommodates long memory latency and doesn't rely on caches the way out-of-order superscalar/VLIW designs do:
  - No branch prediction! Loops are implicit in the model.
  - Much easier for hardware: more powerful instructions, more predictable memory accesses, fewer hazards, fewer branches, fewer mispredicted branches, ...
- But what % of computation is vectorizable? Is vector a good match to new apps such as multimedia and DSP?
- Right answer? Both? Neither? (My favorite.)

Review: "DLXV" Vector Instructions
  Instr.   Operands    Operation                      Comment
  ADDV     V1,V2,V3    V1 = V2 + V3                   vector + vector
  ADDSV    V1,F0,V2    V1 = F0 + V2                   scalar + vector
  MULTV    V1,V2,V3    V1 = V2 x V3                   vector x vector
  MULSV    V1,F0,V2    V1 = F0 x V2                   scalar x vector
  LV       V1,R1       V1 = M[R1 .. R1+63]            load, stride = 1
  LVWS     V1,R1,R2    V1 = M[R1 .. R1+63*R2]         load, stride = R2
  LVI      V1,R1,V2    V1 = M[R1+V2(i)], i = 0..63    indirect (gather)
  CeqV     VM,V1,V2    VMASK(i) = (V1(i) == V2(i))?   compare, set mask
  MOV      VLR,R1      Vec. Len. Reg. = R1            set vector length
  MOV      VM,R1       Vec. Mask = R1                 set vector mask

Vector Example with Dependency
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i = 1; i < m; i++) {
  for (j = 1; j < n; j++) {
    sum = 0;
    for (t = 1; t < k; t++) {
      sum += a[i][t] * b[t][j];
    }
    c[i][j] = sum;
  }
}

Straightforward Solution: Use Scalar Processor
- This type of operation is called a reduction.
- Grab one element at a time from a vector register and send it to the scalar unit?
  - Usually bad, since the path between the scalar processor and the vector processor is not usually optimized all that well.
- Alternative: a special operation in the vector processor.
  - Shift all elements left by the vector length, or collapse all unmasked elements into a compact vector.
  - Supported directly by some vector processors.
  - Usually not as efficient as normal vector operations (number of cycles probably logarithmic in the vector length!).
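
To make the reduction concrete, here is a minimal C sketch (not from the slides) of the two approaches: a scalar loop over the elements of a vector register, and a log-time fold that repeatedly adds the upper half of the register into the lower half, which is what the "shift and add" special operation enables.

/* Two ways to reduce a 64-element vector register to a single sum. */
#include <stdio.h>

#define VL 64  /* assumed hardware vector length */

/* Scalar reduction: one element per iteration through the scalar unit. */
double reduce_scalar(const double v[VL]) {
    double sum = 0.0;
    for (int i = 0; i < VL; i++)
        sum += v[i];
    return sum;
}

/* Log-time reduction: each step adds the upper half onto the lower half,
 * mimicking the "shift left and add" vector operation; log2(VL) steps. */
double reduce_halving(double v[VL]) {
    for (int half = VL / 2; half >= 1; half /= 2)
        for (int i = 0; i < half; i++)
            v[i] += v[i + half];   /* one vector add of length 'half' */
    return v[0];
}

int main(void) {
    double v[VL];
    for (int i = 0; i < VL; i++) v[i] = 1.0;
    printf("%f %f\n", reduce_scalar(v), reduce_halving(v));
    return 0;
}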

Novel Matrix Multiply Solution
- You don't need to do reductions for matrix multiply.
- You can calculate multiple independent sums within one vector register.
- You can vectorize the j loop to perform 32 dot-products at the same time (assume the maximum vector length is 32).
- Shown here in C source code, but you can imagine the assembly vector instructions from it.

Optimized Vector Example
/* Multiply a[m][k] * b[k][n] to get c[m][n] */
for (i = 1; i < m; i++) {
  for (j = 1; j < n; j += 32) {       /* Step j 32 at a time. */
    sum[0:31] = 0;                    /* Init vector reg to zeros. */
    for (t = 1; t < k; t++) {
      a_scalar = a[i][t];             /* Get scalar. */
      b_vector[0:31] = b[t][j:j+31];  /* Get vector. */
      /* Do a vector-scalar multiply. */
      prod[0:31] = b_vector[0:31] * a_scalar;
      /* Vector-vector add into results. */
      sum[0:31] += prod[0:31];
    }
    /* Unit-stride store of vector of results. */
    c[i][j:j+31] = sum[0:31];
  }
}

Convoys for Y = a*X + Y (vector length 64)
- 4 convoys x 64 elements => 4 x 64 = 256 clocks (or 4 clocks per result):
    1: LV    V1,Rx      ; load vector X
    2: MULV  V2,F0,V1   ; vector-scalar mult.
       LV    V3,Ry      ; load vector Y
    3: ADDV  V4,V2,V3   ; add
    4: SV    Ry,V4      ; store the result

Hardware Vector Length
- What to do when the software vector length doesn't exactly match the hardware vector length?
- The vector-length register (VLR) controls the length of any vector operation, including a vector load or store (it cannot be greater than the length of the vector registers).
      do 10 i = 1, n
  10    Y(i) = a * X(i) + Y(i)
- Don't know n until runtime! What if n > Max. Vector Length (MVL)?

Strip Mining
- Suppose Vector Length > Max. Vector Length (MVL)?
- Strip mining: generation of code such that each vector operation is done for a size <= the MVL.
- The 1st loop does the short piece (n mod MVL); the rest use VL = MVL.
      low = 1
      VL = (n mod MVL)            /* find the odd-size piece */
      do 1 j = 0,(n / MVL)        /* outer loop */
        do 10 i = low, low+VL-1   /* runs for length VL */
          Y(i) = a*X(i) + Y(i)    /* main operation */
  10    continue
        low = low + VL            /* start of next vector */
        VL = MVL                  /* reset the length to max */
  1   continue
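
A minimal C sketch (not from the slides) of the same strip-mining idea, where MVL is the hardware maximum vector length and vector_daxpy() stands in for the vector instructions that would execute one strip with VLR = vl:

/* Strip-mined DAXPY in C: process the odd-sized piece first, then
 * full MVL-sized strips. */
#define MVL 64   /* assumed maximum vector length */

/* Stand-in for the work done by one vector "strip" with VLR = vl. */
static void vector_daxpy(int vl, double a, const double *x, double *y) {
    for (int i = 0; i < vl; i++)
        y[i] += a * x[i];
}

void daxpy_strip_mined(int n, double a, const double *x, double *y) {
    int low = 0;
    int vl = n % MVL;                       /* odd-sized first piece (may be 0) */
    for (int j = 0; j <= n / MVL; j++) {    /* outer loop, as in the Fortran code */
        vector_daxpy(vl, a, &x[low], &y[low]);
        low += vl;                          /* start of next strip */
        vl = MVL;                           /* all remaining strips are full length */
    }
}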

DLXV Start-up Time
- Start-up time: pipeline latency (depth of the FU pipeline); another source of overhead.
    Operation            Start-up penalty (from CRAY-1)
    Vector load/store    12
    Vector multiply       7
    Vector add            6
- Assume convoys don't overlap; vector length = n:
    Convoy        Start     1st result   Last result
    1. LV         0         12           11+n (= 12+n-1)
    2. MULV, LV   12+n      12+n+7       18+2n   (multiply start-up)
                  12+n+1    12+n+13      24+2n   (load start-up)
    3. ADDV       25+2n     25+2n+6      30+3n   (wait for convoy 2)
    4. SV         31+3n     31+3n+12     42+4n   (wait for convoy 3)
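
A quick worked check (my arithmetic, not from the slide): the table puts the last result of the final store at cycle 42+4n, so for vector length n = 64

\[
T(n) = 42 + 4n, \qquad T(64) = 42 + 4 \times 64 = 298 \ \text{clocks} \approx 4.7 \ \text{clocks per element},
\]

versus the 4 clocks per result of the no-start-up convoy estimate above, i.e. roughly 16% start-up overhead at this vector length.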

Vector Opt #1: Chaining
- Suppose:
      MULV  V1,V2,V3
      ADDV  V4,V1,V5    ; separate convoy?
- Chaining: the vector register (V1) is treated not as a single entity but as a group of individual registers; then pipeline forwarding can work on individual elements of the vector.
- Flexible chaining: allow a vector to chain to any other active vector operation => more read/write ports.
- As long as there is enough hardware, chaining increases convoy size.
  [Timing figure: unchained, MULTV (7 + 64) followed by ADDV (6 + 64) gives total = 141 cycles; chained, the ADDV starts as soon as the first MULTV result is available, giving total = 7 + 6 + 64 = 77 cycles.]

Example Execution of Vector Code
  [Figure: overlapped execution in the vector memory pipeline, vector multiply pipeline, and vector adder pipeline alongside the scalar unit; 8 lanes, vector length 32, with chaining.]

Vector Stride
- Suppose adjacent elements are not sequential in memory:
      do 10 i = 1,100
        do 10 j = 1,100
          A(i,j) = 0.0
          do 10 k = 1,100
  10      A(i,j) = A(i,j) + B(i,k)*C(k,j)
- Either the B or the C accesses are not adjacent (800 bytes between them).
- Stride: the distance separating elements that are to be merged into a single vector (caches do unit stride) => the LVWS (load vector with stride) instruction.
- Think of addresses per vector element.
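
A small C illustration (not from the slides) of where the 800 bytes come from, assuming 8-byte double-precision elements and Fortran's column-major 100 x 100 layout: consecutive k values of B(i,k) are 100 elements, i.e. 800 bytes, apart.

/* Stride illustration: B is a Fortran-style column-major 100x100 array of
 * 8-byte doubles, so B(i,k) and B(i,k+1) are 100 elements = 800 bytes apart. */
#include <stdio.h>

#define N 100

int main(void) {
    static double b[N * N];                      /* B(i,k) lives at b[(i-1) + (k-1)*N] */
    double *elem1 = &b[(1 - 1) + (1 - 1) * N];   /* B(1,1) */
    double *elem2 = &b[(1 - 1) + (2 - 1) * N];   /* B(1,2) */
    long stride_bytes = (char *)elem2 - (char *)elem1;
    printf("stride between B(i,k) and B(i,k+1): %ld bytes\n", stride_bytes);  /* 800 */
    return 0;
}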

Memory Operations
- Load/store operations move groups of data between registers and memory.
- Three types of addressing:
  - Unit stride: a contiguous block of information in memory; fastest: always possible to optimize this.
  - Non-unit (constant) stride: harder to optimize the memory system for all possible strides; a prime number of data banks makes it easier to support different strides at full bandwidth.
  - Indexed (gather-scatter): the vector equivalent of register indirect; good for sparse arrays of data; increases the number of programs that vectorize.

Interleaved Memory Layout
- Great for unit stride: contiguous elements sit in different DRAMs, and the start-up time for a vector operation is the latency of a single read.
  [Figure: a vector processor feeding 8 unpipelined DRAM banks, with addresses assigned by addr mod 8 = 0, 1, 2, 3, 4, 5, 6, 7.]
- What about non-unit stride?
  - The layout above is good for strides that are relatively prime to 8; bad for 2 and 4.
  - Better: a prime number of banks!

How to Get Full Bandwidth for Unit Stride?
- The memory system must sustain (# lanes x word) / clock.
- Number of memory banks > memory latency, to avoid stalls.
  - M banks => M words per memory latency of L clocks.
  - If M < L, then there is a gap in the memory pipeline:
        clock:  0 ... L    L+1   L+2 ... L+M-1   L+M ... 2L
        word:   - ... 0    1     2   ... M-1     -   ... M
  - May have 1024 banks in SRAM.
- If the desired throughput is greater than one word per cycle:
  - Either more banks (start multiple requests simultaneously) or wider DRAMs.
  - Only good for unit stride or large data types.
- More banks / "weird" (prime) numbers of banks are good for supporting more strides at full bandwidth.
  - We will read a paper on how to do a prime number of banks efficiently.
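
A small C sketch (my illustration, not from the slides) of why a prime number of banks supports more strides: with 8 banks, a stride-4 stream hits only 2 distinct banks, while with 7 banks every stride that is not a multiple of 7 touches all 7 banks.

/* Bank-conflict illustration: count how many distinct banks a strided
 * access stream touches for 8 banks vs. a prime number (7) of banks. */
#include <stdio.h>

static int banks_touched(int nbanks, int stride, int naccesses) {
    int used[64] = {0}, count = 0;
    for (int i = 0; i < naccesses; i++) {
        int bank = (i * stride) % nbanks;   /* word address i*stride maps to this bank */
        if (!used[bank]) { used[bank] = 1; count++; }
    }
    return count;
}

int main(void) {
    for (int stride = 1; stride <= 8; stride++)
        printf("stride %d: %d of 8 banks, %d of 7 banks\n",
               stride,
               banks_touched(8, stride, 64),
               banks_touched(7, stride, 64));
    return 0;
}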

Vector Opt #2: Sparse Matrices
- Suppose:
      do 100 i = 1,n
  100   A(K(i)) = A(K(i)) + C(M(i))
- A gather (LVI) operation takes an index vector and fetches the data from each address in the index vector.
  - This produces a "dense" vector in the vector registers.
- After these elements are operated on in dense form, the sparse vector can be stored in expanded form by a scatter store (SVI), using the same index vector.
- Can't be figured out by the compiler, since it can't know the elements are distinct and therefore free of dependencies.
- Use CVI to create the index vector 0, 1*m, 2*m, ..., 63*m.
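
A minimal C sketch (not from the slides) of the gather/operate/scatter sequence for this loop; k[] and m[] hold the index vectors that LVI/SVI would consume:

/* Gather-operate-scatter in C: the vector hardware would do each of these
 * inner loops as one vector instruction (LVI gather, ADDV, SVI scatter). */
void sparse_update(int n, double *a, const double *c,
                   const int *k, const int *m) {
    double a_dense[64], c_dense[64];        /* "dense" copies in vector registers */
    for (int strip = 0; strip < n; strip += 64) {
        int vl = (n - strip < 64) ? n - strip : 64;
        for (int i = 0; i < vl; i++) {      /* LVI: gather via the index vectors */
            a_dense[i] = a[k[strip + i]];
            c_dense[i] = c[m[strip + i]];
        }
        for (int i = 0; i < vl; i++)        /* ADDV: operate in dense form */
            a_dense[i] += c_dense[i];
        for (int i = 0; i < vl; i++)        /* SVI: scatter back with same indices */
            a[k[strip + i]] = a_dense[i];
    }
}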

Sparse Matrix Example
- Cache (1993) vs. vector (1988):
                                        IBM RS6000    Cray YMP
    Clock                               72 MHz        167 MHz
    Cache                               256 KB        0.25 KB
    Linpack                             140 MFLOPS    160 (1.1)
    Sparse matrix (Cholesky blocked)    17 MFLOPS     125 (7.3)
- Cache: 1 address per cache block (32B to 64B).
- Vector: 1 address per element (4B).

Vector Opt #3: Conditional Execution
- Suppose:
      do 100 i = 1, 64
        if (A(i) .ne. 0) then
          A(i) = A(i) - B(i)
        endif
  100 continue
- Vector-mask control takes a Boolean vector: once the vector-mask register is loaded from a vector test, vector instructions operate only on the vector elements whose corresponding entries in the vector-mask register are 1.
- Still requires a clock even if the result is not stored; if the operation is still performed, what about divide by 0?
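
A minimal C sketch (not from the slides) of the vector-mask idea: a vector compare sets a per-element mask, and the subsequent vector operation writes back only where the mask is 1. Note that every element still occupies a slot in the pipeline.

/* Vector-mask control in C: the mask gates the write-back, not the issue. */
#define VL 64

void masked_subtract(double a[VL], const double b[VL]) {
    int mask[VL];
    for (int i = 0; i < VL; i++)            /* vector compare: mask = (A(i) != 0) */
        mask[i] = (a[i] != 0.0);
    for (int i = 0; i < VL; i++) {          /* masked vector subtract */
        double result = a[i] - b[i];        /* operation happens for every element... */
        if (mask[i])                        /* ...but is stored only where mask = 1 */
            a[i] = result;
    }
}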

CS 252 Administrivia
- Exam: Wednesday 10/18. Location: TBA. Time: 5:30 - 8:30.
  - This info is on the lecture page (has been).
  - Meet at LaVal's afterwards for pizza and beverages.
- Assignment is up now, due on Friday Oct 13th!
  - Done in pairs. Put both names on the papers.
  - Make sure you have partners! Feel free to use the mailing list for this.

Parallelism and Power
- If code is vectorizable, then simple hardware is more energy efficient than out-of-order machines.
- Can decrease power by lowering the frequency so that the voltage can be lowered, then duplicating hardware to make up for the slower clock:
      Power = C x V^2 x f
      Hold performance constant:  Lanes -> n x Lanes,  f -> f/n,  V -> V1 < V  =>  Power decreases.
- Note that V1 can be made as small as permissible within process constraints by simply increasing "n".
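
Filling in the arithmetic behind that claim (my derivation, consistent with the Power = C V^2 f relation on the slide): replicate the datapath n times, cut the clock by n so throughput is unchanged, and lower the supply from V to V1:

\[
P_{\mathrm{new}} \;=\; n \cdot C \, V_1^{2} \, \frac{f}{n} \;=\; C \, V_1^{2} f \;=\; \left(\frac{V_1}{V}\right)^{2} P_{\mathrm{old}},
\]

so power drops by the square of the voltage reduction while performance (n lanes at f/n) stays constant.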

Vector Options
- Use vectors for inner-loop parallelism (no surprise).
  - One dimension of an array: A[0,0], A[0,1], A[0,2], ...
  - Think of the machine as, say, 16 vector registers, each with 32 elements.
  - 1 instruction updates 32 elements of 1 vector register.
- And for outer-loop parallelism!
  - 1 element from each column: A[0,0], A[1,0], A[2,0], ...
  - Think of the machine as 32 "virtual processors" (VPs), each with 16 scalar registers! (like a multithreaded processor)
  - 1 instruction updates 1 scalar register in each of the VPs.
- The hardware is identical; these are just 2 compiler perspectives.
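
A small C sketch (my illustration, not from the slides) of the two compiler perspectives on the same hardware, using the update a[i][j] += 1: vectorize along a row (inner-loop view) or along a column (outer-loop, "virtual processor" view).

/* Same hardware, two views: each inner loop below is what one vector
 * instruction would cover (vector length 32 assumed). */
#define ROWS 32
#define COLS 32

void inner_loop_view(double a[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)   /* one vector op: 32 elements of row i */
            a[i][j] += 1.0;
}

void outer_loop_view(double a[ROWS][COLS]) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)   /* one vector op: element j in each of 32 "VPs" */
            a[i][j] += 1.0;
}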

Virtual Processor Vector Model: Treat Like a SIMD Multiprocessor
- Vector operations are SIMD (single instruction, multiple data) operations.
- Each virtual processor has as many scalar "registers" as there are vector registers.
- There are as many virtual processors as the current vector length.
- Each element is computed by a virtual processor (VP).

Vector Architectural State
  [Figure: $vlr virtual processors (VP0, VP1, ..., VP$vlr-1), each with general-purpose registers vr0-vr31 ($vdw bits each) and flag registers vf0-vf31 (1 bit each), plus 32 control registers vcr0-vcr31 (32 bits each).]

Designing a Vector Processor
- Changes to the scalar core
- How to pick the vector length?
- How to pick the number of vector registers?
- Context switch overhead
- Exception handling
- Masking and flag instructions

Changes to the Scalar Processor to Run Vector Instructions
- Decode vector instructions.
- Send scalar registers to the vector unit (for vector-scalar ops).
- Synchronization for results coming back from the vector register file, including exceptions.
- Things that don't run in vector mode don't have high ILP, so the scalar CPU can be kept simple.

How to Pick the Vector Length?
- Longer is good because:
  1) it hides vector start-up;
  2) it lowers instruction bandwidth;
  3) tiled access to memory reduces scalar processor memory bandwidth needs;
  4) if the maximum vector length of the application is known, ...

Exception Handling: Arithmetic Exceptions
- Requiring precise arithmetic exceptions => large performance loss!
- Alternative model: arithmetic exceptions set vector flag registers, 1 flag bit per element.
- Software inserts trap-barrier instructions to check the flag bits as needed.
- IEEE floating point requires 5 flag bits.
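
A minimal C sketch (my illustration, not from the slides) of that flag-register model: the vector op records per-element exception flags and keeps going, and software checks the flags at a convenient barrier point rather than taking a precise trap per element.

/* Flag-register exception model: record per-element flags, check later. */
#include <math.h>

#define VL 64
#define FLAG_DIVIDE_BY_ZERO 0x1   /* one of the (5) IEEE exception flag bits */

static unsigned vflag[VL];        /* stand-in for the vector flag register */

void vector_divide(double z[VL], const double x[VL], const double y[VL]) {
    for (int i = 0; i < VL; i++) {
        if (y[i] == 0.0) {
            vflag[i] |= FLAG_DIVIDE_BY_ZERO;   /* flag it and keep going */
            z[i] = INFINITY;
        } else {
            z[i] = x[i] / y[i];
        }
    }
}

/* "Trap barrier": software scans the flags wherever it actually cares. */
int check_flags(void) {
    for (int i = 0; i < VL; i++)
        if (vflag[i]) return i;   /* first faulting element, or... */
    return -1;                    /* ...-1 if no exceptions occurred */
}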

Exception Handling: Page Faults
- Page faults must be precise.
- Instruction page faults are not a problem:
  - Could just wait for active instructions to drain.
  - Also, the scalar core runs the page-fault code anyway.
- Data page faults are harder.
- Option 1: save/restore internal vector unit state.
  - Freeze the pipeline and dump the vector state.
  - Perform the needed operations.
  - Restore the state and continue the vector pipeline.

Exception Handling: Page Faults (continued)
- Option 2: expand the memory pipeline to check addresses before sending them to memory, plus a memory buffer between the address check and the registers.
- Multiple queues to transfer from the memory buffer to the registers; check the last address in the queues before loading the 1st element from the buffer.
- A Per-Address Instruction Queue (PAIQ) sends requests to the TLB and memory while, in parallel, instructions go to an Address Check Instruction Queue (ACIQ).
- When an instruction passes the checks, it goes to the Committed Instruction Queue (CIQ) so it is there when the data returns.
- On a page fault, only the instructions in the PAIQ and ACIQ need to be saved.

Masking and Flag Instructions
- Flags have multiple uses (conditionals, arithmetic exceptions).
- The alternative is conditional move/merge.
- Clear that fully masked execution is much more efficient than conditional moves:
  - No extra instructions are performed, and exceptions are avoided.
- Downsides are:
  1) extra bits in the instruction to specify the flag register;
  2) an extra interlock early in the pipeline for RAW hazards on flag registers.

Flag Instruction Ops
- Do them in the scalar processor, or in the vector unit with vector ops?
- Disadvantages of using the scalar processor for flag calculations (as in Cray):
  1) if MVL > word size => multiple instructions; it also limits MVL in the future;
  2) the scalar unit exposes memory latency;
  3) the vector unit produces flag bits at 1 per clock, but the scalar unit consumes them at 64 per clock, so they cannot chain together.
- Proposal: separate vector flag functional units and instructions in the VU.

MIPS R10000 vs. T0
  [Figure comparing the MIPS R10000 and the T0 vector microprocessor.]
  *See /real/spert/t0-intro.html

Vectors Are Inexpensive
- Scalar: N ops per cycle => O(N^2) circuitry.
  - HP PA-8000: 4-way issue; reorder buffer is 850K transistors, incl. 6,720 5-bit register number comparators.
- Vector: N ops per cycle => O(N + εN^2) circuitry.
  - T0 vector micro: 24 ops per cycle, 730K transistors total, only 23 5-bit register number comparators, no floating point.

Vectors Lower Power
- Vector:
  - One instruction fetch, decode, dispatch per vector.
  - Structured register accesses.
  - Smaller code for high performance; less power in instruction cache misses.
  - Bypass the cache.
  - One TLB lookup per group of loads or stores.
  - Move only necessary data across the chip boundary.
- Single-issue scalar:
  - One instruction fetch, decode, dispatch per operation.
  - Arbitrary register accesses add area and power.
  - Loop unrolling and software pipelining for high performance increase the instruction cache footprint.
  - All data passes through the cache; waste power if no temporal locality.
  - One TLB lookup per load or store.
  - Off-chip access in whole cache lines.

Superscalar Energy Efficiency Even Worse
- Vector:
  - Control logic grows linearly with issue width.
  - Vector unit switches off when not in use.
  - Vector instructions expose parallelism without speculation.
  - Software control of speculation when desired: whether to use a vector mask or compress/expand for conditionals.
- Superscalar:
  - Control logic grows quadratically with issue width.
  - Control logic consumes energy regardless of the available parallelism.
  - Speculation to increase visible parallelism wastes energy.

Vector Applications
- Limited to scientific computing?
  - Multimedia processing (compression, graphics, audio synthesis, image processing)
  - Standard benchmark kernels (matrix multiply, FFT, convolution, sort)
  - Lossy compression (JPEG, MPEG video and audio)
  - Lossless compression (zero removal, RLE, differencing, LZW)
  - Cryptography (RSA, DES/IDEA, SHA/MD5)
  - Speech and handwriting recognition
  - Operating systems / networking (memcpy, memset, parity, checksum)
  - Databases (hash/join, data mining, image/video serving)
  - Language run-time support (stdlib, garbage collection)
  - Even SPECint95

"Vector" for Multimedia?
- Intel MMX: 57 new 80x86 instructions (the 1st since the 386).
  - Similar to Intel 860, Motorola 88110, HP PA-7100LC, UltraSPARC.
- 3 data types: 8 8-bit, 4 16-bit, or 2 32-bit elements in 64 bits.
  - Reuses the 8 FP registers (FP and MMX cannot mix).
- Short vector: load, add, store 8 8-bit operands.
- Claim: overall speedup of 1.5 to 2x for 2D/3D graphics, audio, video, speech, comm., ...
- Used in drivers or added to library routines; no compiler support.

MMX Instructions
- Move: 32b, 64b.
- Add, subtract in parallel: 8 8b, 4 16b, or 2 32b; optionally signed/unsigned saturate (set to max) on overflow.
- Shifts (sll, srl, sra), and, and-not, or, xor in parallel: 8 8b, 4 16b, or 2 32b.
- Multiply, multiply-add in parallel: 4 16b.
- Compare =, > in parallel: 8 8b, 4 16b, or 2 32b; sets each field to all 0s (false) or all 1s (true); removes branches.
- Pack/unpack: convert 32b <-> 16b and 16b <-> 8b; pack saturates (sets to max) if the number is too large.
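
A small C sketch (my illustration of the semantics, not of the actual MMX encoding) of what one packed, unsigned-saturating 8-bit add across a 64-bit word does:

/* Packed unsigned-saturating add: eight independent 8-bit lanes in 64 bits. */
#include <stdint.h>
#include <stdio.h>

uint64_t paddusb(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        unsigned x = (a >> (8 * lane)) & 0xFF;
        unsigned y = (b >> (8 * lane)) & 0xFF;
        unsigned sum = x + y;
        if (sum > 0xFF) sum = 0xFF;          /* saturate instead of wrapping */
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}

int main(void) {
    /* 0xF0 + 0x20 saturates to 0xFF; 0x10 + 0x20 is an ordinary 0x30. */
    printf("%016llx\n",
           (unsigned long long)paddusb(0x00000000000010F0ULL,
                                       0x0000000000002020ULL));
    return 0;
}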

New Architecture Directions
- "Media processing will become the dominant force in computer arch. & microprocessor design."
- "... new media-rich applications ... involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, and 32-bit integer and fl. pt."
- Needs include high memory bandwidth, high network bandwidth, continuous-media data types, real-time response, and fine-grain parallelism.
- "How Multimedia Workloads Will Change Processor Design", Diefendorff & Dubey, IEEE Computer (9/97).

Return of Vectors: Tentative VIRAM-1 Floorplan
  [Figure: floorplan with a ring-based switch, CPU + $, and I/O;
   0.18 µm DRAM, 32 MB in 16 banks x 256b, 128 subbanks;
   0.25 µm, 5-metal logic;
   200 MHz MIPS, 16K I$, 16K D$;
   4 200 ...]
