BUPT Computer Architecture: Experiments 3, 4, and 5
Class: 2012211302    Student ID: 2012211144    Name: 袁凯琦

Contents
Experiment 3: DLX Processor Programming
Experiment 4: Code Optimization
Experiment 5: Loop Unrolling (optional)

Experiment 3: DLX Processor Programming

I. Experiment type: Comprehensive

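II. Objective: Learn to program in DLX assembly language and further analyze pipeline hazards.
III. Class hours: 4
IV. Equipment and environment: DLX assembly language environment
V. Principle: Master the vector arithmetic algorithm and how to program it.
VI. Content and requirements: Write a piece of assembly code that adds (or multiplies or divides) two one-dimensional vectors of double-precision floating-point numbers and prints the result. Vector length = 16. Observe the data, control, and structural hazards that occur in the program.
VII. Procedure:
1. Become familiar with DLX assembly language.
(1) When the assembler processes an assembly file, data is placed in the memory region pointed to by the data pointer, and instructions are placed in the region pointed to by the text pointer.
(2) trap 0 tells the WinDLX simulator that the program has ended; trap 5 writes formatted output to standard output.
2. Write a program that adds two double-precision floating-point vectors. The code listing follows:

    .data
    v1:     .double 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20    ; vector v1, length 20
    v2:     .double 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10, 11.9, 12.8, 13.7, 14.6, 15.5, 16.4, 17.3, 18.2, 19.1, 20    ; vector v2, length 20
    a:      .asciiz "result = "
    c:      .asciiz "%f "
            .align 2
    d:      .word c             ; argument structure holding the address of c
    dizhi:  .space 8            ; the sum must be stored in dizhi to be displayed correctly
    .text
    .global main
    main:
            addi r1, r0, a
            sw dizhi, r1        ; store word: save the start address of a
            addi r14, r0, dizhi
            trap 5              ; print the string "result = "
            addi r10, r0, 0     ; r10 = 0
            addi r8, r0, 20     ; r8 = 20, the vector length
    loop:
            ld f2, v1(r10)
            ld f4, v2(r10)
            addd f2, f2, f4     ; add the corresponding elements of v1 and v2, result in f2
            sd dizhi, f2        ; store the double-precision value in f2
            addi r14, r0, d
            trap 5              ; print the result
            addi r10, r10, 8    ; advance to the next element of v1 and v2
            subi r8, r8, 1      ; decrement the loop counter
            bnez r8, loop       ; if r8 != 0, branch back to loop
            trap 0              ; end of program

After the run completes, the result is as follows:

[Screenshot: DLX-Standard-I/O window, "Simulation is running", elapsed time 00:07]

    result = 2.100000 4.200000 6.300000 8.400000 10.500000 12.600000 14.700000 16.800000 18.900000 20.000000 22.900000 24.800000 26.700000 28.600000 30.500000 32.400000 34.300000 36.200000 38.100000 40.000000

3. Observations and analysis
(1) Observe the data, control, and structural hazards that occur in the program.

    Stalls:
    RAW stalls:        183 (40.13% of all cycles)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (4.17% of all cycles)
    Trap stalls:        66 (14.47% of all cycles)
    Total:             268 stall(s) (58.77% of all cycles)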

In total, this run incurred 183 RAW data-hazard stalls, 19 control stalls, and 66 trap stalls, for 268 stall cycles (58.77% of all cycles). The details are as follows.

1) RAW hazards

[Clock cycle diagram: R-Stall entries appear in the following instruction sequence]

    addi r1, r0, 0x1140
    sw dizhi(r0), r1
    addi r10, r0, 0x0
    addi r8, r0, 0x14
    ld f2, $DATA(r10)
    ld f4, v2(r10)
    addd f2, f2, f4
    sd dizhi(r0), f2
    subi r8, r8, 0x1
    bnez r8, loop

2) Trap stalls (T-Stall)

[Clock cycle diagram: each trap 0x5 shows a T-Stall before its ID stage]

    addi r14, r0, 0x1154
    trap 0x5
    addi r10, r0, 0x0
    addi r14, r0, 0x1150
    trap 0x5
    addi r10, r10, 0x8
    trap 0x0
    nop
    nop
    nop

3) Control hazards

[Clock cycle diagram: the instruction fetched after the taken branch is aborted in IF]

    bnez r8, loop
    trap 0x0            (IF aborted)
    ld f2, $DATA(r10)

(2) Examine the effect of adding floating-point units on performance.
Compare the runs with 1 and with 4 floating-point units of each kind.

[Screenshot: Floating Point Stage Configuration dialog, first setting] Addition, multiplication, and division units: count 1, delay 4 each.

[Screenshot: Floating Point Stage Configuration dialog, second setting] Addition, multiplication, and division units: count 4, delay 4 each. The dialog allows 1 <= M <= 8 units in each class and a delay of 1 <= N <= 50 clock cycles, and warns that the processor is reset automatically if the values are changed.

Next, compare the Statistics windows for the two settings.

[Screenshots: two Statistics windows, 1 unit per class and 4 units per class]

Apart from the number of floating-point stages, both windows report the same figures:

    Total: 456 cycle(s) executed. ID executed by 187 instruction(s). 2 instruction(s) currently in pipeline.

    Hardware configuration:
    Memory size: 32768 bytes
    faddEX-stages: 1 (first run) / 4 (second run), required cycles: 4
    fmulEX-stages: 1 (first run) / 4 (second run), required cycles: 4
    fdivEX-stages: 1 (first run) / 4 (second run), required cycles: 4
    Forwarding disabled.

    Stalls:
    RAW stalls:        183 (40.13% of all cycles)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (4.17% of all cycles)
    Trap stalls:        66 (14.47% of all cycles)
    Total:             268 stall(s) (58.77% of all cycles)

    Conditional branches: total 20 (10.70% of all instructions), thereof taken 19 (95.00% of all cond. branches), not taken 1 (5.00% of all cond. branches)
    Load/store instructions: total 61 (32.62% of all instructions), thereof loads 40 (65.57% of load/store instructions), stores 21 (34.43% of load/store instructions)
    Floating point stage instructions: total 20 (10.70% of all instructions), thereof additions 20 (100.00%), multiplications 0 (0.00%), divisions 0 (0.00%)
    Traps: 22 (11.76% of all instructions)
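The stall percentages reported for the baseline run can be cross-checked by dividing each stall count by the 456 executed cycles. The following is a minimal Python sketch of that arithmetic. The names `stalls` and `share` are illustrative and not part of WinDLX; the counts are copied from the Statistics window.

```python
# Stall counts reported by WinDLX for the baseline run (forwarding disabled).
TOTAL_CYCLES = 456
stalls = {"RAW": 183, "WAW": 0, "structural": 0, "control": 19, "trap": 66}


def share(count, cycles=TOTAL_CYCLES):
    """Return count as a percentage of cycles, rounded to two decimals."""
    return round(100.0 * count / cycles, 2)


total_stalls = sum(stalls.values())

for kind, count in stalls.items():
    print(f"{kind:<10} {count:>4}  {share(count):6.2f}%")
print(f"{'total':<10} {total_stalls:>4}  {share(total_stalls):6.2f}%")
```

The five categories sum to 268 stalls, and the rounded shares agree with the reported 40.13%, 4.17%, 14.47%, and 58.77%.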

The two screenshots show that, in this experiment, adding floating-point units has no effect on pipeline performance.

(3) Effect of adding forwarding on performance.

[Screenshots: two Statistics windows, forwarding disabled and forwarding enabled]

Both runs executed 187 instructions with 32768 bytes of memory and 1 stage (4 required cycles) for each of faddEX, fmulEX, and fdivEX.

    Forwarding disabled: total 456 cycle(s) executed
    RAW stalls:        183 (40.13% of all cycles)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (4.17% of all cycles)
    Trap stalls:        66 (14.47% of all cycles)
    Total:             268 stall(s) (58.77% of all cycles)

    Forwarding enabled: total 373 cycle(s) executed
    RAW stalls:        100 (26.81% of all cycles), thereof:
        LD stalls:             20 (20.00% of RAW stalls)
        Branch/jump stalls:    20 (20.00% of RAW stalls)
        Floating point stalls: 60 (60.00% of RAW stalls)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (5.09% of all cycles)
    Trap stalls:        66 (17.69% of all cycles)
    Total:             185 stall(s) (49.60% of all cycles)

    In both runs:
    Conditional branches: total 20 (10.70% of all instructions), thereof taken 19 (95.00%), not taken 1 (5.00%)
    Load/store instructions: total 61 (32.62% of all instructions), thereof loads 40 (65.57%), stores 21 (34.43%)
    Floating point stage instructions: total 20 (10.70% of all instructions), thereof additions 20 (100.00%), multiplications 0 (0.00%), divisions 0 (0.00%)
    Traps: 22 (11.76% of all instructions)

The data show that after enabling forwarding, the cycle count drops from 456 to 373. RAW stalls drop from 183 to 100, that is, from 40.13% to 26.81% of all cycles. The share of control stalls rises from 4.17% to 5.09%, but their number stays at 19. In short, with forwarding the total cycle count falls, data-hazard stalls fall, and pipeline performance improves to some extent.

(4) Observe the pipeline cost of branch instructions when the branch is taken and when it is not taken.

    Conditional branches: total 20 (10.70% of all instructions), thereof taken 19 (95.00% of all cond. branches), not taken 1 (5.00% of all cond. branches)

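As the statistics show, there are 20 conditional branch instructions in total: 19 are taken (95%) and 1 is not taken (5%).

With static instruction scheduling, when data hazards occur, the compiler identifies and separates the dependent instructions in the program, then reorders the instructions and optimizes the code in order to eliminate or reduce pipeline idle cycles. However, static scheduling can only address data hazards, so the conditional-branch results are unchanged from the original program.

If a branch is not taken, pipeline execution is unaffected and neither throughput nor efficiency drops. If a branch is taken, the instructions already fetched must be discarded and fetching restarts at the branch target. In general, each taken conditional branch wastes x-2 slots in an x-stage pipeline, so execution efficiency drops and some performance is lost. In this run, WinDLX reports 19 control stalls for the 19 taken branches, which is one stall cycle per taken branch.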
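As a cross-check of parts (3) and (4), the sketch below recomputes the RAW and control stall shares with and without forwarding, and the average control-stall cost per taken branch. It is only illustrative Python arithmetic on the counts reported in the Statistics windows; the names `runs`, `pct`, and `penalty_per_taken_branch` are assumptions introduced here.

```python
# Counts copied from the two Statistics windows of Experiment 3.
runs = {
    "forwarding disabled": {"cycles": 456, "raw": 183, "control": 19},
    "forwarding enabled": {"cycles": 373, "raw": 100, "control": 19},
}
TAKEN_BRANCHES = 19  # taken conditional branches in both runs


def pct(count, cycles):
    """Return count as a percentage of cycles, rounded to two decimals."""
    return round(100.0 * count / cycles, 2)


for name, run in runs.items():
    raw_share = pct(run["raw"], run["cycles"])
    control_share = pct(run["control"], run["cycles"])
    print(f"{name}: RAW {raw_share}%, control {control_share}%")

# Average number of control stall cycles charged to each taken branch.
penalty_per_taken_branch = runs["forwarding enabled"]["control"] / TAKEN_BRANCHES
print("control stalls per taken branch:", penalty_per_taken_branch)
```

The control-stall share rises only because the cycle count in the denominator shrinks; the count itself stays at 19.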

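VIII. Lessons learned

This experiment deepened my understanding and use of assembly language, especially of trap 5, which writes formatted output to standard output. In the code, note the following:

    c:      .asciiz "%f "
            .align 2
    d:      .word c             ; argument structure holding the address of c
    dizhi:  .space 8            ; the sum must be stored in dizhi to be displayed correctly

Otherwise the result cannot be printed even if the computation is correct.

Experiment 4: Code Optimization

I. Experiment type: Comprehensive
II. Objective: Learn simple compiler optimization methods and observe the performance improvement they bring.
III. Class hours: 4
IV. Equipment and environment: DLX assembly language environment
V. Principle: Use static scheduling to reorder the instruction sequence, reduce hazards, and optimize the program.
VI. Key teaching points and difficulties: static instruction scheduling.
VII. Content and requirements: Optimize the code from Experiment 2 or Experiment 3, give a quantified value for the performance improvement, and give the theoretical basis for the optimizations used.
VIII. Procedure:
1. Code listing and comments for the optimized Experiment 3 program

    .data
    v1:     .double 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
    v2:     .double 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10, 11.9, 12.8, 13.7, 14.6, 15.5, 16.4, 17.3, 18.2, 19.1, 20
    a:      .asciiz "result = "
    c:      .asciiz "%f "
            .align 2
    d:      .word c
    dizhi:  .space 8
    .text
    .global main
    main:
            addi r1, r0, a
            addi r10, r0, 0
            addi r8, r0, 20
            ;addi r1, r0, a     ; has a RAW dependence with sw dizhi, r1, so it was moved earlier
            sw dizhi, r1
            addi r14, r0, dizhi
            trap 5
            ;addi r10, r0, 0    ; has a RAW dependence with ld f2, v1(r10), so it was moved earlier
            ;addi r8, r0, 20
    loop:
            ld f2, v1(r10)
            ld f4, v2(r10)
            ;addd f2, f2, f4    ; has a RAW dependence with both preceding instructions, so it is delayed
            addi r10, r10, 8
            subi r8, r8, 1
            addd f2, f2, f4
            ;sd dizhi, f2       ; has a RAW dependence with addd f2, f2, f4, so it is delayed
            addi r14, r0, d
            sd dizhi, f2
            trap 5
            ;addi r10, r10, 8
            ;subi r8, r8, 1
            bnez r8, loop
            trap 0

After the run completes, open Statistics to view and analyze the results.

[Screenshot: Statistics window for the optimized program]

    Total: 353 cycle(s) executed.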

    ID executed by 187 instruction(s). 2 instruction(s) currently in pipeline.

    Hardware configuration:
    Memory size: 32768 bytes
    faddEX-stages: 1, required cycles: 4
    fmulEX-stages: 1, required cycles: 4
    fdivEX-stages: 1, required cycles: 4
    Forwarding disabled.

    Stalls:
    RAW stalls:         80 (22.66% of all cycles)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (5.38% of all cycles)
    Trap stalls:        66 (18.70% of all cycles)
    Total:             165 stall(s) (46.74% of all cycles)

    Conditional branches: total 20 (10.70% of all instructions), thereof taken 19 (95.00% of all cond. branches), not taken 1 (5.00% of all cond. branches)
    Load/store instructions: total 61 (32.62% of all instructions), thereof loads 40 (65.57% of load/store instructions), stores 21 (34.43% of load/store instructions)
    Floating point stage instructions: total 20 (10.70% of all instructions), thereof additions 20 (100.00%), multiplications 0 (0.00%), divisions 0 (0.00%)
    Traps: 22 (11.76% of all instructions)

2. Hazard analysis results

[Screenshots: Stalls section of the Statistics window, before optimization (left) and after optimization (right)]

    Before optimization:
    RAW stalls:        183 (40.13% of all cycles)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (4.17% of all cycles)
    Trap stalls:        66 (14.47% of all cycles)
    Total:             268 stall(s) (58.77% of all cycles)

    After optimization:
    RAW stalls:         80 (22.66% of all cycles)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (5.38% of all cycles)
    Trap stalls:        66 (18.70% of all cycles)
    Total:             165 stall(s) (46.74% of all cycles)
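The cycle saving from static scheduling can be tied directly to the RAW stalls it removes. Here is a minimal sketch in Python that uses only the counts reported in the Statistics windows before and after optimization; the dictionary and variable names are illustrative.

```python
# Totals reported before and after reordering the instructions.
before = {"cycles": 456, "raw": 183, "control": 19, "trap": 66}
after = {"cycles": 353, "raw": 80, "control": 19, "trap": 66}

cycles_saved = before["cycles"] - after["cycles"]
raw_removed = before["raw"] - after["raw"]

# Stall categories whose counts did not change after scheduling.
unchanged = [kind for kind in ("control", "trap") if before[kind] == after[kind]]

print("cycles saved:      ", cycles_saved)
print("RAW stalls removed:", raw_removed)
print("unchanged:         ", ", ".join(unchanged))
```

Both differences come out to 103, so in this run every saved cycle corresponds to a removed RAW stall, while the control and trap stall counts are unchanged.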

Comparing the two figures:
Data hazards: RAW stalls fall from 40.13% to 22.66% of all cycles, a substantial improvement.
Structural hazards: unchanged.
Control hazards: the share changes from 4.17% to 5.38%, so there is no improvement.
My code optimization therefore has only a moderate effect on overall performance, and it acts mainly on data hazards.

3. Effect of adding floating-point units on performance.
Compare the runs with 1 and with 4 floating-point units of each kind.

[Screenshots: two Floating Point Stage Configuration dialogs] First setting: addition, multiplication, and division units with count 1 and delay 4 each. Second setting: count 4 and delay 4 each. The dialog allows 1 <= M <= 8 units in each class and a delay of 1 <= N <= 50 clock cycles, and warns that the processor is reset automatically if the values are changed.

Next, compare the Statistics windows, shown below.

[Screenshots: two Statistics windows, 1 unit per class and 4 units per class]

    First window: total 353 cycle(s) executed. ID executed by 187 instruction(s). 2 instruction(s) currently in pipeline.
    faddEX-stages: 1, fmulEX-stages: 1, fdivEX-stages: 1, required cycles: 4 each. Memory size: 32768 bytes. Forwarding disabled.
    RAW stalls:         80 (22.66% of all cycles)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (5.38% of all cycles)
    Trap stalls:        66 (18.70% of all cycles)
    Total:             165 stall(s) (46.74% of all cycles)

    Second window: total 456 cycle(s) executed. ID executed by 187 instruction(s). 2 instruction(s) currently in pipeline.
    faddEX-stages: 4, fmulEX-stages: 4, fdivEX-stages: 4, required cycles: 4 each. Memory size: 32768 bytes. Forwarding disabled.
    RAW stalls:        183 (40.13% of all cycles)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (4.17% of all cycles)
    Trap stalls:        66 (14.47% of all cycles)
    Total:             268 stall(s) (58.77% of all cycles)

    Both windows:
    Conditional branches: total 20 (10.70% of all instructions), thereof taken 19 (95.00%), not taken 1 (5.00%)
    Load/store instructions: total 61 (32.62% of all instructions), thereof loads 40 (65.57%), stores 21 (34.43%)
    Floating point stage instructions: total 20 (10.70% of all instructions), thereof additions 20 (100.00%), multiplications 0 (0.00%), divisions 0 (0.00%)
    Traps: 22 (11.76% of all instructions)

The cycle and stall counts in the second window equal those of the unoptimized program in Experiment 3, so that screenshot appears to come from the unoptimized run rather than from the optimized program with 4 units.

From the two screenshots, the report concludes that adding floating-point units has no effect on pipeline performance in this experiment.

4. Effect of adding forwarding on performance.

[Screenshots: two Statistics windows for the optimized program, forwarding disabled and forwarding enabled]

    Forwarding disabled: total 353 cycle(s) executed. ID executed by 187 instruction(s). 2 instruction(s) currently in pipeline.
    Memory size: 32768 bytes. faddEX-stages: 1, fmulEX-stages: 1, fdivEX-stages: 1, required cycles: 4 each.
    RAW stalls:         80 (22.66% of all cycles)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (5.38% of all cycles)
    Trap stalls:        66 (18.70% of all cycles)
    Total:             165 stall(s) (46.74% of all cycles)
    Conditional branches: total 20 (10.70% of all instructions), thereof taken 19 (95.00%), not taken 1 (5.00%)
    Load/store instructions: total 61 (32.62% of all instructions), thereof loads 40 (65.57%), stores 21 (34.43%)
    Floating point stage instructions: total 20 (10.70% of all instructions), thereof additions 20 (100.00%), multiplications 0 (0.00%), divisions 0 (0.00%)
    Traps: 22 (11.76% of all instructions)

    Forwarding enabled: total 313 cycle(s) executed. ID executed by 187 instruction(s). 2 instruction(s) currently in pipeline.
    Memory size: 32768 bytes. faddEX-stages: 4, fmulEX-stages: 4, fdivEX-stages: 4, required cycles: 4 each.
    RAW stalls:         40 (12.78% of all cycles), thereof:
        LD stalls:              0 (0.00% of RAW stalls)
        Branch/jump stalls:     0 (0.00% of RAW stalls)
        Floating point stalls: 40 (100.00% of RAW stalls)
    WAW stalls:          0 (0.00% of all cycles)
    Structural stalls:   0 (0.00% of all cycles)
    Control stalls:     19 (6.07% of all cycles)
    Trap stalls:       106 (33.86% of all cycles)
    Total:             165 stall(s) (52.72% of all cycles)
    Conditional branches: total 20 (10.70% of all instructions), thereof taken 19 (95.00%), not taken 1 (5.00%)
    Load/store instructions: total 61 (32.62% of all instructions), thereof loads 40 (65.57%), stores 21 (34.43%)
    Floating point stage instructions: total 20 (10.70% of all instructions), thereof additions 20 (100.00%)
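To put the runs reported in Experiments 3 and 4 side by side, the following sketch computes cycles per instruction (CPI) and the speedup relative to the unoptimized baseline without forwarding. It is a hedged summary in Python rather than part of the original report: the run labels and function names are illustrative, and the last run was also configured with 4 floating-point units per class.

```python
INSTRUCTIONS = 187  # instructions executed in every reported run

# Total cycles per run, as reported in the Statistics windows.
cycles = {
    "baseline": 456,
    "forwarding": 373,
    "scheduled": 353,
    "scheduled + forwarding": 313,
}


def cpi(total_cycles):
    """Cycles per instruction, rounded to two decimals."""
    return round(total_cycles / INSTRUCTIONS, 2)


def speedup(total_cycles, reference=cycles["baseline"]):
    """Speedup over the reference cycle count, rounded to two decimals."""
    return round(reference / total_cycles, 2)


for label, n in cycles.items():
    print(f"{label:<24} {n:>4} cycles  CPI {cpi(n):.2f}  speedup {speedup(n):.2f}")
```

Because all four runs execute the same 187 instructions, the speedup is simply the ratio of cycle counts.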
