




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Wuhan UniversityProgram OptimizationComputer System10thLecture, May 02, 2017Instructor:Cien Fan1Wuhan UniversityToday¢ Overview¢ Generally Useful Optimizations§§§§Code motion/precomputationStrength reductionSharing of common subexpressionsRemoving unnecessary procedure
2、calls¢ Optimization Blockers§§Procedure callsMemory aliasing¢ Exploiting Instruction-Level Parallelism¢ Dealing with Conditionals2Wuhan UniversityPerformance Realities¢ Theres more to performance than asymptotic complexity¢ Constant factors matter too!§§E
3、asily see 10:1 performance range depending on how code is writtenMust optimize at multiple levels:§ algorithm, data representations, procedures, and loops¢ Must understand system to optimize performance§§§§How programs are compiled and executedHow modern processors + me
4、mory systems operateHow to measure program performance and identify bottlenecksHow to improve performance without destroying code modularity and generality3Wuhan UniversityOptimizing Compilers¢ Provide efficient mapof program to machine§§§§register allocationcode selection a
5、nd ordering (scheduling)dead code elimination eliminating minor inefficiencies¢ Dont (usually) improve asymptotic efficiency§§up to programmer to select best overall algorithmbig-O savings are (often) more important than constant factors§ but constant factors also matter¢ Ha
6、ve difficulty overco“optimization blockers”§§potential memory aliasingpotential procedure side-effects4Wuhan UniversityLimitations of Optimizing Compilers¢ Operate under fundamental constraint§Must not cause any change in program behavior§ Except, possibly when program makin
7、g use of nonstandard languagefeaturesOften prevents it from making optimizations that would only affect behavior under pathological conditions.§¢ Behavior that may be obvious to the programmer can be obfuscated bylanguages and coding styles§e.g., Data ranges may be more limited than v
8、ariable types suggest¢ Most analysis is performed only within procedures§§Whole-program analysis is too expensive in most casesNewer versions of GCC do interprocedural analysis within individual files§ But, not between code in different files¢ Most analysis is based only on
9、static information§Compiler has difficultyipating run-time inputs¢ When in doubt, the compiler must be conservative5Wuhan UniversityGenerally Useful Optimizations¢ Optimizations that you or the compiler should do regardlessof processor / compiler¢ Code Motion§Reduce frequenc
10、y with which computation performed§§If it will always produce same resultEspecially moving code out of loop6long j;int ni = n*i;for (j = 0; j < n; j+) ani+j = bj;void set_row(double *a, double *b,long i, long n)long j;for (j = 0; j < n; j+) an*i+j = bj;Wuhan UniversityCompiler-Genera
11、ted Code Motion (-O1)7set_row:testq%rcx, %rcx# Test njle.L1# If 0, goto doneimulq%rcx, %rdx# ni = n*ileaq(%rdi,%rdx,8), %rdx# rowp = A + ni*8 movl$0, %eax# j = 0.L3:# loop:movsd(%rsi,%rax,8), %xmm0# t = bjmovsd%xmm0, (%rdx,%rax,8)# MA+ni*8 + j*8 = t addq$1, %rax# j+cmpq%rcx, %rax# j:njne.L3# if !=,
12、goto loop.L1:# done:rep ; retlong j;long ni = n*i; double *rowp = a+ni;for (j = 0; j < n; j+)*rowp+ = bj;void set_row(double *a, double *b, long i, long n)long j;for (j = 0; j < n; j+)an*i+j = bj;Wuhan UniversityReduction in Strength§§Replace costly operation with simpler oneShift, a
13、dd instead of multiply or divide16*x->x << 4§ Utility machine dependent§ Depends on cost of multiply or divide instruction On Intel Haswell, integer multiply requires 3 CPU cycles Recognize sequence of products§8for (i = 0; i < n; i+) int ni = n*i;for (j = 0; j < n; j+)
14、 ani + j = bj;int ni = 0;for (i = 0; i < n; i+) for (j = 0; j < n; j+)ani + j = bj;ni += n;Wuhan UniversityShare Common Subexpressions§§Reuse portions of expressionsGCC will do this with O13 multiplications: i*n, (i1)*n, (i+1)*n1 multiplication: i*n9imulq%rcx, %rsi # i*n addq%rdx, %r
15、si # i*n+j movq%rsi, %rax # i*n+j subq%rcx, %rax # i*n+j-nleaq(%rsi,%rcx), %rcx # i*n+j+nleaq1(%rsi), %rax # i+1 leaq-1(%rsi), %r8 # i-1 imulq %rcx, %rsi# i*n imulq %rcx, %rax# (i+1)*n imulq %rcx, %r8# (i-1)*n addq%rdx, %rsi# i*n+jaddq%rdx, %rax# (i+1)*n+j addq%rdx, %r8# (i-1)*n+jlong inj = i*n + j;
16、 up =valinj - n; down = valinj + n; left = valinj - 1; right = valinj + 1;sum = up + down + left + right;/* Sum neighbors of i,j */ up =val(i-1)*n + j ; down = val(i+1)*n + j ; left = vali*n+ j-1; right = vali*n+ j+1;sum = up + down + left + right;Wuhan UniversityOptimization Blocker #1: Procedure C
17、alls¢ Procedure to Convert String to Lower Case10void lower(char *s)size_t i;for (i = 0; i < strlen(s); i+) if (si >= 'A' && si <= 'Z')si -= ('A' - 'a');Wuhan UniversityLower Case Conversion Performance§§Time double when double string l
18、engthQuadratic performance11CPU seconds250200150lower1100500050000100000150000200000250000300000350000400000450000500000String lengthWuhan UniversityConvert Loop To Goto Form§strlen executed every iteration12void lower(char *s)size_t i = 0;if (i >= strlen(s) goto done;loop:if (si >= '
19、A' && si <= 'Z')si -= ('A' - 'a');i+;if (i < strlen(s) goto loop;done:Wuhan UniversityCalling Strlen¢ Strlen performance§Only way to determine length of string is to scan inull character.tire length, looking for¢ Overall performance, string of
20、 length N§§§N calls to strlenRequire times N, N-1, N-2, , 1 Overall O(N2) performance13/* My version of strlen */ size_t strlen(const char *s)size_t length = 0; while (*s != '0') s+;length+;return length;Wuhan UniversityImproving Performance§§§Move call to strle
21、n outside of loopSince resuoes not change from one iteration to anotherForm of code motion14void lower(char *s)size_t i;size_t len = strlen(s); for (i = 0; i < len; i+)if (si >= 'A' && si <= 'Z')si -= ('A' - 'a');Wuhan UniversityLower Case Conversion
22、Performance§§Time doubles when double string lengthLinear performance of lower215CPU seconds250200150lower110050lower20050000100000150000200000250000300000350000400000450000500000String lengthWuhan UniversityOptimization Blocker: Procedure Calls¢ Why couldnt compiler move strlen out o
23、f inner loop?§Procedure may have side effects§ Alters global state each time calledFunction may not return same value for given arguments§ Depends on other parts of global state§ Procedure lower could interact with strlen§¢ Warning:§§Compiler treats procedure
24、call as a black boxWeak optimizations near them¢ Remedies:§Use of inline functions§ GCC does this with O1 Within single file Do your own code motion§16size_t lencnt = 0;size_t strlen(const char *s)size_t length = 0; while (*s != '0') s+; length+;lencnt += length;return le
25、ngth;Wuhan UniversityMemory Matters§§Code updates bi on every iterationWhy couldnt compiler optimize this away?17# sum_rows1 inner loop.L4:movsd(%rsi,%rax,8), %xmm0# FP load addsd(%rdi), %xmm0# FP add movsd%xmm0, (%rsi,%rax,8)# FP store addq$8, %rdicmpq%rcx, %rdijne.L4/* Sum rows is of n X
26、 n matrix a and store in vector b */void sum_rows1(double *a, double *b, long n) long i, j;for (i = 0; i < n; i+) bi = 0;for (j = 0; j < n; j+) bi += ai*n + j;Wuhan UniversityMemory AliasingValue of B:§§Code updates bi on every iterationMust consider possibility that these updates wi
27、ll affect program behavior18i = 2: 3, 22, 224i = 1: 3, 22, 16i = 0: 3, 8, 16init: 4, 8, 16double A9 = 0,1,2,4,8, 16,32, 64, 128;double B3 = A+3; sum_rows1(A, B, 3);/* Sum rows is of n X n matrix a and store in vector b */void sum_rows1(double *a, double *b, long n) long i, j;for (i = 0; i < n; i+
28、) bi = 0;for (j = 0; j < n; j+) bi += ai*n + j;Wuhan UniversityRemoving Aliasing§No need to store intermediate results19# sum_rows2 inner loop.L10:addsd(%rdi), %xmm0# FP load + add addq$8, %rdicmpq%rax, %rdijne.L10/* Sum rows is of n X n matrix a and store in vector b */void sum_rows2(double
29、 *a, double *b, long n) long i, j;for (i = 0; i < n; i+) double val = 0;for (j = 0; j < n; j+) val += ai*n + j;bi = val;Wuhan UniversityOptimization Blocker: Memory Aliasing¢ Aliasing§§Two different memory references specify single locationEasy to have happen in C§ Since al
30、lowed to do address arithmetic§ Direct access to storage structuresGet in habit of introducing local variables§Accumulating within loopsYour way of telling compiler not to check for aliasing§§20Wuhan UniversityExploiting Instruction-Level Parallelism¢ Need general understand
31、ing of modern processor design§Hardware can execute multiple instructions in parallel¢ Performance limited by data dependencies¢ Simple transformations can yield dramatic performance improvement§§Compilers often cannot make these transformationsLack of associativity and dist
32、ributivity in floating-point arithmetic21Wuhan UniversityBenchmark Example: Data Type for Vectors01len-1¢ Data Types§Use different declarationsfor data_tint long floatdouble§§§§22/* retrieve vector element and store at val */int get_vec_element(*vec v, size_t idx, data_
33、t *val)if (idx >= v->len) return 0;*val = v->dataidx; return 1;lendata/* data structure for vectors */ typedef structsize_t len; data_t *data; vec;Wuhan UniversityBenchmark ComputationCompute sum orproduct of vector elements¢ Data Types¢ Operations§§Use different declara
34、tionsfor data_tint long floatdoubleUse different definitions ofOP and IDENT§§§§§§+*01/23void combine1(vec_ptr v, data_t *dest)long int i;*dest = IDENT;for (i = 0; i < vec_length(v); i+) data_t val; get_vec_element(v, i, &val);*dest = *dest OP val;Wuhan University
35、Cycles Per Element (CPE)¢ Convenient way to express performance of program that operates on vectors or lists¢ Length = n¢ In our case: CPE = cycles per OP¢ T = CPE*n + Overhead§CPE is slope of line24Cycles25002000psum11500Slope = 9.01000psum2500Slope = 6.00050100150200Elemen
36、tsWuhan UniversityBenchmark PerformanceCompute sum orproduct of vector elements25MethodIntegerDouble FPOperationAddMultAddMultCombine1 unoptimized22.6820.0219.9820.18Combine1 O110.1210.1210.1711.14void combine1(vec_ptr v, data_t *dest)long int i;*dest = IDENT;for (i = 0; i < vec_length(v); i+) da
37、ta_t val; get_vec_element(v, i, &val);*dest = *dest OP val;Wuhan UniversityBasic Optimizations¢ Move vec_length out of loop¢ Avoid bounds check on each cycle¢ Accumulate in temporary26void combine4(vec_ptr v, data_t *dest)long i;long length = vec_length(v); data_t *d = get_vec_sta
38、rt(v); data_t t = IDENT;for (i = 0; i < length; i+) t = t OP di;*dest = t;Wuhan UniversityEffect of Basic Optimizations¢ Eliminates sources of overhead in loop27MethodIntegerDouble FPOperationAddMultAddMultCombine1 O110.1210.1210.1711.14Combine41.273.013.015.01void combine4(vec_ptr v, data_t
39、 *dest)long i;long length = vec_length(v); data_t *d = get_vec_start(v); data_t t = IDENT;for (i = 0; i < length; i+) t = t OP di;*dest = t;Wuhan UniversityModern CPU DesignRegister UpdatesPrediction OK?28Operation ResultsAddr.Addr.DataDataExecutionData CacheFunctionalUnitsStoreLoadArithArithArit
40、hBranchInstruction ControlAddressInstructions OperationsInstruction DecodeRetirement UnitRegister FileInstruction CacheFetchControlWuhan UniversitySuperscalar Processor¢ Definition: A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved
41、from a sequential instruction stream and are usually scheduled dynamically.¢ Benefit: without programeffort, superscalarprocessor can take advantage of the instruction levelparallelism that most programs have¢ Most modern CPUs are superscalar.¢ Intel: since Pentium (1993)29Wuhan Unive
42、rsityPipelined Functional UnitsStage 1Stage 2Stage 3§§§§Divide computation into stagesPass partial computations from stage to stageStage i can start on new computation once values passed to i+1 E.g., complete 3 multiplications in 7 cycles, even though each requires 3 cycles30Time
43、1234567Stage 1a*ba*cp1*p2Stage 2a*ba*cp1*p2Stage 3a*ba*cp1*p2long mult_eg(long a, long b, long c) long p1 = a*b;long p2 = a*c; long p3 = p1 * p2; return p3;Wuhan UniversityHaswell CPU§8 Total Functional UnitsMultiple instructions can execute in parallel2 load, with address computation 1 store,
44、with address computation 4 integer2 FP multiply1 FP add1 FP divideSome instructions take > 1 cycle, but can be pipelined¢¢InstructionLoad / StoreInteger Multiply Integer/Long Divide Single/Double FP Multiply Single/Double FP Add Single/Double FP DivideLatency433-30533-15Cycles/Issue113-
45、30113-1531Wuhan Universityx86-64 Compilation of Combine4¢ Inner Loop (Case: Integer Multiply)32MethodIntegerDouble FPOperationAddMultAddMultCombine41.273.013.015.01Latency Bound1.003.003.005.00.L519:# Loop:imull (%rax,%rdx,4), %ecx # t = t * di addq $1, %rdx# i+cmpq %rdx, %rbp# Compare length:i
46、jg.L519# If >, goto LoopWuhan UniversityCombine4 = Serial Computation (OP = *)¢ Computation (length=8)(1 * d0) * d1) * d2) * d3)* d4) * d5) * d6) * d7)¢ Sequential dependence1 d0*d1*d2*§Performance: determined by latency of OPd3*d4*d5*d6*d7*33Wuhan UniversityLoop Unrolling (2x1)
47、62; Perform 2x more useful work per iteration34void unroll2a_combine(vec_ptr v, data_t *dest)long length = vec_length(v); long limit = length-1;data_t *d = get_vec_start(v); data_t x = IDENT;long i;/* Combine 2 elements at a time */for (i = 0; i < limit; i+=2) x = (x OP di) OP di+1;/* Finish any
48、remaining elements */ for (; i < length; i+) x = x OP di;*dest = x;Wuhan UniversityEffect of Loop Unrolling¢ Helps integer add§Achieves latency bound¢ Others dont improve. Why?§Still sequential dependency35x = (x OP di) OP di+1;MethodIntegerDouble FPOperationAddMultAddMultComb
49、ine41.273.013.015.01Unroll 2x11.013.013.015.01LatencyBound1.003.003.005.00Wuhan UniversityLoop Unrolling with Reassociation (2x1a)¢ Can this change the result of the computation?¢ Yes, for FP. Why?36void unroll2aa_combine(vec_ptr v, data_t *dest)long length = vec_length(v); long limit = le
50、ngth-1;data_t *d = get_vec_start(v); data_t x = IDENT;long i;/* Combine 2 elements at a time */for (i = 0; i < limit; i+=2) x = x OP (di OP di+1);/* Finish any remaining elements */ for (; i < length; i+) x = x OP di;*dest = x;x = (x OP di) OP di+1;Compare to beforeWuhan UniversityEffect of Re
51、association¢ Nearly 2x speedup for Int *, FP +, FP *2 func. units for FP *2 func. units for load§Reason: Breaks sequential dependency4 func. units for int +2 func. units for load§Why is that? (next slide)37x = x OP (di OP di+1);MethodIntegerDouble FPOperationAddMultAddMultCombine41.27
52、3.013.015.01Unroll 2x11.013.013.015.01Unroll 2x1a1.011.511.512.51Latency Bound1.003.003.005.00ThroughputBound0.501.001.000.50Wuhan UniversityReassociated ComputationWhat changed:§ Ops in the next iteration can be started early (no dependency)¢d0 d1*Overall Performance¢d d1§§
53、23N elements, D cycles latency/op(N/2+1)*D cycles:CPE = D/2*d4 d5*d6 d7*38x = x OP (di OP di+1);Wuhan UniversityLoop Unrolling with Separate Accumulators(2x2)¢ Different form of reassociation39void unroll2a_combine(vec_ptr v, data_t *dest)long length = vec_length(v); long limit = length-1;data_
54、t *d = get_vec_start(v);data_t x0 = IDENT; data_t x1 = IDENT; long i;/* Combine 2 elements at a time */for (i = 0; i < limit; i+=2) x0 = x0 OP di; x1 = x1 OP di+1;/* Finish any remaining elements */ for (; i < length; i+) x0 = x0 OP di;*dest = x0 OP x1;Wuhan UniversityEffect of Separate Accumulators
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 河南省周口市鹿邑县第二高级中学校2024-2025学年高二下学期3月月考语文试题
- 旅游创业企划书
- 女教师新入职发言稿
- 2024年特许金融分析师考试的复习策略试题及答案
- 2024年特许金融分析师考试信心试题及答案
- 特许金融分析师试题及答案精粹
- 2024年特许金融分析师考试考生经验交流会试题及答案
- 金融分析师考试学习资料整合与试题及答案
- 2024年特许金融分析师考试快速提升试题及答案
- 2024年特许金融分析师考试核心知识点试题及答案
- 高速公路工程质量管理体系及保证措施
- 菠菜色素提取和分离
- 中铁工程项目内部控制管理手册(492页)
- 气瓶充装安全及培训课件PPT幻灯片
- 防雷检测专业技术人员能力认定考试题库完整
- 计算机考试Excel操作题原题及操作步骤82435
- (高清版)辐射供暖供冷技术规程JGJ142-2012
- 新教科版六年级下册科学课件3 宇宙 第6课 浩瀚的宇宙
- 智慧城市_城市智慧停车解决方案
- 灭火器操作规程
- 电缆原材料检验规范
评论
0/150
提交评论