Class: 2012211302    Student ID: 2012211144    Name: 袁凯琦

Contents
    DLX Processor Programming
    Code Optimization
    Loop Unrolling (optional)

DLX Processor Programming

I. Experiment type: comprehensive
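II. Purpose: learn to program in DLX assembly language and further analyze pipeline hazards.
III. Lab hours: 4
IV. Equipment and environment: DLX assembly language environment
V. Key point: master floating-point arithmetic algorithms and programming methods.
VI. Content and requirements: write an assembly program that adds (or multiplies or divides) two one-dimensional vectors of double-precision floating-point numbers and prints the result. Vector length = 16. Observe the data, control, and structural hazards that occur in the program.
VII. Procedure:
1. Become familiar with DLX assembly language.
(1) When the assembler processes an assembly file, data is placed in the memory region pointed to by the data pointer, and instructions are placed in the region pointed to by the text pointer.
(2) trap 0 tells the WinDLX simulator that the program has finished; trap 5 writes formatted output to standard output.
2. Write a program that adds two one-dimensional double-precision floating-point vectors. The code listing is as follows:

            .data
    v1:     .double 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20    ; vector v1, length 20
    v2:     .double 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10, 11.9, 12.8, 13.7, 14.6, 15.5, 16.4, 17.3, 18.2, 19.1, 20    ; vector v2, length 20
    a:      .asciiz "result = "
    c:      .asciiz "%f "
            .align 2
    d:      .word c             ; parameter structure holding the address of c
    dizhi:  .space 8            ; the sum must be stored in dizhi to be displayed correctly
            .text
            .global main
    main:
            addi r1, r0, a
            sw dizhi, r1        ; store word: save the start address of a
            addi r14, r0, dizhi
            trap 5              ; print the string "result = "
            addi r10, r0, 0     ; r10 = 0
            addi r8, r0, 20     ; r8 = 20, the vector length
    loop:
            ld f2, v1(r10)
            ld f4, v2(r10)
            addd f2, f2, f4     ; add the corresponding elements of v1 and v2, result in f2
            sd dizhi, f2        ; store the double-precision value in f2
            addi r14, r0, d
            trap 5              ; print the result
            addi r10, r10, 8    ; advance to the next element of v1 and v2
            subi r8, r8, 1      ; decrement the loop counter
            bnez r8, loop       ; if r8 != 0, go back to loop
            trap 0              ; end

When the run finishes, the DLX-Standard-I/O window shows:

    result = 2.100000 4.200000 6.300000 8.400000 10.500000 12.600000 14.700000 16.800000 18.900000 20.000000 22.900000 24.800000 26.700000 28.600000 30.500000 32.400000 34.300000 36.200000 38.100000 40.000000

3. Observation and analysis
(1) Observe the data, control, and structural hazards that occur in the program.

    Stalls:
        RAW stalls:         183 (40.13% of all cycles)
        WAW stalls:           0 (0.00% of all cycles)
        Structural stalls:    0 (0.00% of all cycles)
        Control stalls:      19 (4.17% of all cycles)
        Trap stalls:         66 (14.47% of all cycles)
        Total:              268 stall(s) (58.77% of all cycles)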
During this run there were 183 RAW data-hazard stalls, 19 control stalls, and 66 trap stalls, for a total of 268 stalls. The details are as follows:

1) RAW hazards
[Clock cycle diagram. Instructions: addi r1,r0,0x1140; sw dizhi(r0),r1; addi r10,r0,0x0; addi r8,r0,0x14; ld f2,$data(r10); ld f4,v2(r10); addd f2,f2,f4; sd dizhi(r0),f2; subi r8,r8,0x1; bnez r8,loop. Stages shown: IF, ID, intEX, MEM, WB, with R-Stall cycles between IF and ID for instructions that wait on a source register, for example sw dizhi(r0),r1 after addi r1,r0,0x1140.]

2) Trap stalls
[Clock cycle diagram. Instructions: addi r14,r0,0x1154; trap 0x5; addi r10,r0,0x0; addi r14,r0,0x1150; trap 0x5; addi r10,r10,0x8; trap 0x0; nop; nop; nop. Each trap shows IF, T-Stall, ID, intEX, MEM, WB.]

3) Control hazards
[Clock cycle diagram. bnez r8,loop: IF, R-Stall, ID, intEX, MEM; trap 0x0: IF, aborted; ld f2,$data(r10): IF, ID.]

(2) Examine the effect of adding floating-point units on performance.
Compare the runs with 1 and with 4 floating-point units of each class. The Floating Point Stage Configuration dialog is set as follows:

                              First run          Second run
                              Count   Delay      Count   Delay
    Addition units            1       4          4       4
    Multiplication units      1       4          4       4
    Division units            1       4          4       4

    Number of units in each class: 1 <= M <= 8; delay (clock cycles): 1 <= N <= 50.
    Warning: if you change the values, the processor will be reset automatically!

Next, compare the two Statistics windows:

                              1 unit per class           4 units per class
    Total cycles executed     456                        456
    ID executed by            187 instruction(s)         187 instruction(s)
    Currently in pipeline     2 instruction(s)           2 instruction(s)
    Memory size               32768 bytes                32768 bytes
    faddEX-stages             1, required cycles: 4      4, required cycles: 4
    fmulEX-stages             1, required cycles: 4      4, required cycles: 4
    fdivEX-stages             1, required cycles: 4      4, required cycles: 4
    Forwarding                disabled                   disabled

    The remaining figures are identical in both windows:
    Stalls:
        RAW stalls:         183 (40.13% of all cycles)
        WAW stalls:           0 (0.00% of all cycles)
        Structural stalls:    0 (0.00% of all cycles)
        Control stalls:      19 (4.17% of all cycles)
        Trap stalls:         66 (14.47% of all cycles)
        Total:              268 stall(s) (58.77% of all cycles)
    Conditional branches:
        Total:               20 (10.70% of all instructions), thereof:
        taken:               19 (95.00% of all cond. branches)
        not taken:            1 (5.00% of all cond. branches)
    Load-/store-instructions:
        Total:               61 (32.62% of all instructions), thereof:
        loads:               40 (65.57% of load-/store-instructions)
        stores:              21 (34.43% of load-/store-instructions)
    Floating point stage instructions:
        Total:               20 (10.70% of all instructions), thereof:
        additions:           20 (100.00% of floating point stage inst.)
        multiplications:      0 (0.00% of floating point stage inst.)
        divisions:            0 (0.00% of floating point stage inst.)
    Traps:
        Traps:               22 (11.76% of all instructions)
From the two windows above, adding floating-point units has no effect on pipeline performance in this experiment.

(3) Effect of adding forwarding on performance.

                              Forwarding disabled        Forwarding enabled
    Total cycles executed     456                        373
    ID executed by            187 instruction(s)         187 instruction(s)
    Currently in pipeline     2 instruction(s)           2 instruction(s)
    Memory size               32768 bytes                32768 bytes
    faddEX-stages             1, required cycles: 4      1, required cycles: 4
    fmulEX-stages             1, required cycles: 4      1, required cycles: 4
    fdivEX-stages             1, required cycles: 4      1, required cycles: 4
    Stalls:
        RAW stalls            183 (40.13% of all cycles) 100 (26.81% of all cycles)
        WAW stalls            0 (0.00%)                  0 (0.00%)
        Structural stalls     0 (0.00%)                  0 (0.00%)
        Control stalls        19 (4.17%)                 19 (5.09%)
        Trap stalls           66 (14.47%)                66 (17.69%)
        Total                 268 (58.77%)               185 (49.60%)

    With forwarding enabled, the window also breaks the 100 RAW stalls down: LD stalls 20 (20.00% of RAW stalls), branch/jump stalls 20 (20.00% of RAW stalls), floating point stalls 60 (60.00% of RAW stalls).

    The remaining figures are identical in both windows:
    Conditional branches: total 20 (10.70% of all instructions), thereof taken 19 (95.00% of all cond. branches), not taken 1 (5.00% of all cond. branches)
    Load-/store-instructions: total 61 (32.62% of all instructions), thereof loads 40 (65.57%), stores 21 (34.43%)
    Floating point stage instructions: total 20 (10.70% of all instructions), thereof additions 20 (100.00%), multiplications 0 (0.00%), divisions 0 (0.00%)
    Traps: 22 (11.76% of all instructions)

The data above show that after forwarding is enabled:
the number of clock cycles drops from 456 to 373;
RAW stalls fall from 40.13% of all clock cycles to 26.81%;
the number of RAW stalls falls from 183 to 100;
the share of control stalls rises, but their number does not.
In short, with forwarding the total cycle count drops, data-hazard stalls decrease, and pipeline performance improves to some extent.

(4) Observe the pipeline overhead of a branch instruction when the branch is taken and when it is not taken.

    Conditional branches:
        Total:               20 (10.70% of all instructions), thereof:
        taken:               19 (95.00% of all cond. branches)
        not taken:            1 (5.00% of all cond. branches)

As shown above, there are 20 branch instructions in total: 19 are taken (95%) and 1 is not taken (5%).
With static instruction scheduling, when data hazards occur the compiler identifies and separates the dependent instructions in the program, then reorders the instructions and optimizes the code in order to remove or reduce pipeline stalls. Static scheduling can only resolve data hazards, however, so the conditional-branch results are unchanged from before. If the branch is not taken, pipeline execution is unaffected and neither throughput nor efficiency drops. If the branch is taken, the instructions fetched in advance must be discarded and fetching restarts at the branch target; for every such conditional branch, an x-stage pipeline wastes x-2 pipeline slots, so execution efficiency drops and some performance is lost.
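The stall figures quoted in this section can be cross-checked with a little arithmetic. The sketch below is only an illustration: the `Run` class and its field names are assumptions introduced here, and the numbers are copied from the Statistics windows reported above. It recomputes the percentages, shows that the 83 cycles saved by forwarding equal the drop in RAW stalls, and suggests that in these runs each taken branch costs about one control stall (19 control stalls for 19 taken branches).

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Figures copied from one WinDLX Statistics window (hypothetical container)."""
    cycles: int
    raw: int
    control: int
    trap: int

    def total_stalls(self) -> int:
        # WAW and structural stalls are 0 in every reported run, so they are omitted.
        return self.raw + self.control + self.trap

    def share(self, stalls: int) -> float:
        """Percentage of all executed cycles spent in the given stalls."""
        return round(100.0 * stalls / self.cycles, 2)


no_fwd = Run(cycles=456, raw=183, control=19, trap=66)  # forwarding disabled
fwd = Run(cycles=373, raw=100, control=19, trap=66)     # forwarding enabled

cycles_saved = no_fwd.cycles - fwd.cycles  # cycles removed by forwarding
raw_saved = no_fwd.raw - fwd.raw           # RAW stalls removed by forwarding
speedup = round(no_fwd.cycles / fwd.cycles, 2)

taken_branches = 19
stalls_per_taken_branch = no_fwd.control / taken_branches

print("RAW share without forwarding:", no_fwd.share(no_fwd.raw), "%")
print("RAW share with forwarding:   ", fwd.share(fwd.raw), "%")
print("total stalls with forwarding:", fwd.total_stalls())
print("cycles saved / RAW stalls saved:", cycles_saved, "/", raw_saved)
print("speedup from forwarding:", speedup)
print("control stalls per taken branch:", stalls_per_taken_branch)
```

Because the control and trap stall counts are the same in both runs, the whole gain from forwarding comes from RAW stalls; the control-stall percentage rises only because the denominator (total cycles) shrinks.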
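VIII. Lab reflections
This lab deepened my understanding and use of assembly language, especially of trap 5, which writes formatted output to standard output. In the code, note the following:

    c:      .asciiz "%f "
            .align 2
    d:      .word c             ; parameter structure holding the address of c
    dizhi:  .space 8            ; the sum must be stored in dizhi to be displayed correctly

Otherwise the result cannot be printed even when the computation is correct.

Code Optimization

I. Experiment type: comprehensive
II. Purpose: learn simple compiler optimization methods and observe the performance improvement they bring.
III. Lab hours: 4
IV. Equipment and environment: DLX assembly language environment
V. Principle: use static scheduling to reorder the instruction sequence, reduce hazards, and optimize the program.
VI. Key points and difficulties: static instruction scheduling.
VII. Content and requirements: optimize the code from Experiment 2 or Experiment 3, give a quantified value for the performance improvement, and state the theoretical basis for the optimization used.
VIII. Procedure:
1. Optimized listing of the Experiment 3 program, with comments

            .data
    v1:     .double 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
    v2:     .double 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9, 10, 11.9, 12.8, 13.7, 14.6, 15.5, 16.4, 17.3, 18.2, 19.1, 20
    a:      .asciiz "result = "
    c:      .asciiz "%f "
            .align 2
    d:      .word c
    dizhi:  .space 8
            .text
            .global main
    main:
            addi r1, r0, a
            addi r10, r0, 0
            addi r8, r0, 20
            ;addi r1, r0, a      this instruction has a RAW dependence with sw dizhi, r1, so it is moved earlier
            sw dizhi, r1
            addi r14, r0, dizhi
            trap 5
            ;addi r10, r0, 0     this instruction has a RAW dependence with ld f2, v1(r10), so it is moved earlier
            ;addi r8, r0, 20
    loop:
            ld f2, v1(r10)
            ld f4, v2(r10)
            ;addd f2, f2, f4     this instruction has a RAW dependence on both preceding instructions, so it is delayed
            addi r10, r10, 8
            subi r8, r8, 1
            addd f2, f2, f4
            ;sd dizhi, f2        this instruction has a RAW dependence with addd f2, f2, f4, so it is delayed
            addi r14, r0, d
            sd dizhi, f2
            trap 5
            ;addi r10, r10, 8
            ;subi r8, r8, 1
            bnez r8, loop
            trap 0

After the run finishes, click Statistics to view the results:

    Total: 353 cycle(s) executed.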
    ID executed by 187 instruction(s).
    2 instruction(s) currently in pipeline.
    Hardware configuration:
        Memory size: 32768 bytes
        faddEX-stages: 1, required cycles: 4
        fmulEX-stages: 1, required cycles: 4
        fdivEX-stages: 1, required cycles: 4
        Forwarding disabled.
    Stalls:
        RAW stalls:          80 (22.66% of all cycles)
        WAW stalls:           0 (0.00% of all cycles)
        Structural stalls:    0 (0.00% of all cycles)
        Control stalls:      19 (5.38% of all cycles)
        Trap stalls:         66 (18.70% of all cycles)
        Total:              165 stall(s) (46.74% of all cycles)
    Conditional branches:
        Total:               20 (10.70% of all instructions), thereof:
        taken:               19 (95.00% of all cond. branches)
        not taken:            1 (5.00% of all cond. branches)
    Load-/store-instructions:
        Total:               61 (32.62% of all instructions), thereof:
        loads:               40 (65.57% of load-/store-instructions)
        stores:              21 (34.43% of load-/store-instructions)
    Floating point stage instructions:
        Total:               20 (10.70% of all instructions), thereof:
        additions:           20 (100.00% of floating point stage inst.)
        multiplications:      0 (0.00% of floating point stage inst.)
        divisions:            0 (0.00% of floating point stage inst.)
    Traps:
        Traps:               22 (11.76% of all instructions)

2. Hazard analysis results
The left column is before optimization, the right column after optimization.

    Stalls                    Before optimization        After optimization
    RAW stalls                183 (40.13% of all cycles) 80 (22.66% of all cycles)
    WAW stalls                0 (0.00%)                  0 (0.00%)
    Structural stalls         0 (0.00%)                  0 (0.00%)
    Control stalls            19 (4.17%)                 19 (5.38%)
    Trap stalls               66 (14.47%)                66 (18.70%)
    Total                     268 (58.77%)               165 (46.74%)

Comparing the two columns:
Data hazards: RAW stalls fall from 40.13% of all cycles before optimization to 22.66%, a considerable improvement.
Structural hazards: unchanged.
Control hazards: the share changes from 4.17% to 5.38%, with no improvement.
So the code optimization I carried out does not improve performance dramatically; its main effect is on data hazards.

3. Effect of adding floating-point units on performance.
Compare the runs with 1 and with 4 floating-point units of each class. The Floating Point Stage Configuration dialog is set as follows:

                              First run          Second run
                              Count   Delay      Count   Delay
    Addition units            1       4          4       4
    Multiplication units      1       4          4       4
    Division units            1       4          4       4

    Number of units in each class: 1 <= M <= 8; delay (clock cycles): 1 <= N <= 50.
    Warning: if you change the values, the processor will be reset automatically!
Next, compare the two Statistics windows:

                              1 unit per class           4 units per class
    Total cycles executed     353                        456
    ID executed by            187 instruction(s)         187 instruction(s)
    Currently in pipeline     2 instruction(s)           2 instruction(s)
    Memory size               32768 bytes                32768 bytes
    faddEX-stages             1, required cycles: 4      4, required cycles: 4
    fmulEX-stages             1, required cycles: 4      4, required cycles: 4
    fdivEX-stages             1, required cycles: 4      4, required cycles: 4
    Forwarding                disabled                   disabled
    Stalls:
        RAW stalls            80 (22.66% of all cycles)  183 (40.13% of all cycles)
        WAW stalls            0 (0.00%)                  0 (0.00%)
        Structural stalls     0 (0.00%)                  0 (0.00%)
        Control stalls        19 (5.38%)                 19 (4.17%)
        Trap stalls           66 (18.70%)                66 (14.47%)
        Total                 165 (46.74%)               268 (58.77%)

    The remaining figures are identical in both windows:
    Conditional branches: total 20 (10.70% of all instructions), thereof taken 19 (95.00% of all cond. branches), not taken 1 (5.00% of all cond. branches)
    Load-/store-instructions: total 61 (32.62% of all instructions), thereof loads 40 (65.57%), stores 21 (34.43%)
    Floating point stage instructions: total 20 (10.70% of all instructions), thereof additions 20 (100.00%), multiplications 0 (0.00%), divisions 0 (0.00%)
    Traps: 22 (11.76% of all instructions)

    Note that the figures in the 4-unit window coincide with those of the unoptimized program in the previous experiment (456 cycles, 183 RAW stalls).

From the two windows above, adding floating-point units has no effect on pipeline performance in this experiment.

4. Effect of adding forwarding on performance.
                              Forwarding disabled        Forwarding enabled
    Total cycles executed     353                        313
    ID executed by            187 instruction(s)         187 instruction(s)
    Currently in pipeline     2 instruction(s)           2 instruction(s)
    Memory size               32768 bytes                32768 bytes
    faddEX-stages             1, required cycles: 4      4, required cycles: 4
    fmulEX-stages             1, required cycles: 4      4, required cycles: 4
    fdivEX-stages             1, required cycles: 4      4, required cycles: 4
    Stalls:
        RAW stalls            80 (22.66% of all cycles)  40 (12.78% of all cycles)
        WAW stalls            0 (0.00%)                  0 (0.00%)
        Structural stalls     0 (0.00%)                  0 (0.00%)
        Control stalls        19 (5.38%)                 19 (6.07%)
        Trap stalls           66 (18.70%)                106 (33.86%)
        Total                 165 (46.74%)               165 (52.72%)
    Conditional branches:
        Total                 20 (10.70% of all inst.)   20 (10.70% of all inst.)
        taken                 19 (95.00%)                19 (95.00%)
        not taken             1 (5.00%)                  1 (5.00%)
    Load-/store-instructions:
        Total                 61 (32.62% of all inst.)   61 (32.62% of all inst.)
        loads                 40 (65.57%)                40 (65.57%)
        stores                21 (34.43%)                21 (34.43%)
    Floating point stage instructions:
        Total                 20 (10.70% of all inst.)   20 (10.70% of all inst.)
        additions             20 (100.00%)               20 (100.00%)
        multiplications       0 (0.00%)                  …
        divisions             0 (0.00%)                  …
    Traps                     22 (11.76% of all inst.)   …

    With forwarding enabled, the window also breaks the 40 RAW stalls down: LD stalls 0 (0.00% of RAW stalls), branch/jump stalls 0 (0.00% of RAW stalls), floating point stalls 40 (100.00% of RAW stalls).
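The assignment asks for a quantified performance improvement, and the cycle counts reported in the Statistics windows are enough to compute one. The sketch below is a minimal illustration rather than part of the original lab work: the run labels and helper names are assumptions introduced here, and the cycle and RAW-stall counts are copied from the windows above (the last run has forwarding enabled and 4 floating-point units per class).

```python
# Cycle and RAW-stall counts copied from the Statistics windows in this report.
# The dictionary keys are illustrative labels, not WinDLX terminology.
runs = {
    "original": {"cycles": 456, "raw": 183},
    "scheduled": {"cycles": 353, "raw": 80},
    "scheduled + forwarding": {"cycles": 313, "raw": 40},
}


def speedup(before: int, after: int) -> float:
    """Speedup as the ratio of cycle counts for the same instruction stream."""
    return before / after


def reduction_pct(before: int, after: int) -> float:
    """Percentage of the original cycles that was removed."""
    return 100.0 * (before - after) / before


baseline = runs["original"]
report = {}
for name, run in runs.items():
    report[name] = {
        "speedup": round(speedup(baseline["cycles"], run["cycles"]), 2),
        "cycles_saved_pct": round(reduction_pct(baseline["cycles"], run["cycles"]), 2),
        "raw_stalls_saved": baseline["raw"] - run["raw"],
    }

for name, row in report.items():
    print(f"{name:24s} speedup {row['speedup']:.2f}  "
          f"cycles saved {row['cycles_saved_pct']:.2f}%  "
          f"RAW stalls saved {row['raw_stalls_saved']}")
```

All three runs execute the same 187 instructions, so the ratio of cycle counts is a fair measure of speedup here; comparing percentages of stalls alone would be misleading because the denominator changes between runs.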