深入理解计算机系统10optimization_第1页
深入理解计算机系统10optimization_第2页
深入理解计算机系统10optimization_第3页
深入理解计算机系统10optimization_第4页
深入理解计算机系统10optimization_第5页
已阅读5页,还剩51页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、Wuhan UniversityProgram OptimizationComputer System10thLecture, May 02, 2017Instructor:Cien Fan1Wuhan UniversityToday¢ Overview¢ Generally Useful Optimizations§§§§Code motion/precomputationStrength reductionSharing of common subexpressionsRemoving unnecessary procedure

2、calls¢ Optimization Blockers§§Procedure callsMemory aliasing¢ Exploiting Instruction-Level Parallelism¢ Dealing with Conditionals2Wuhan UniversityPerformance Realities¢ Theres more to performance than asymptotic complexity¢ Constant factors matter too!§§E

3、asily see 10:1 performance range depending on how code is writtenMust optimize at multiple levels:§ algorithm, data representations, procedures, and loops¢ Must understand system to optimize performance§§§§How programs are compiled and executedHow modern processors + me

4、mory systems operateHow to measure program performance and identify bottlenecksHow to improve performance without destroying code modularity and generality3Wuhan UniversityOptimizing Compilers¢ Provide efficient mapof program to machine§§§§register allocationcode selection a

5、nd ordering (scheduling)dead code elimination eliminating minor inefficiencies¢ Dont (usually) improve asymptotic efficiency§§up to programmer to select best overall algorithmbig-O savings are (often) more important than constant factors§ but constant factors also matter¢ Ha

6、ve difficulty overco“optimization blockers”§§potential memory aliasingpotential procedure side-effects4Wuhan UniversityLimitations of Optimizing Compilers¢ Operate under fundamental constraint§Must not cause any change in program behavior§ Except, possibly when program makin

7、g use of nonstandard languagefeaturesOften prevents it from making optimizations that would only affect behavior under pathological conditions.§¢ Behavior that may be obvious to the programmer can be obfuscated bylanguages and coding styles§e.g., Data ranges may be more limited than v

8、ariable types suggest¢ Most analysis is performed only within procedures§§Whole-program analysis is too expensive in most casesNewer versions of GCC do interprocedural analysis within individual files§ But, not between code in different files¢ Most analysis is based only on

9、static information§Compiler has difficultyipating run-time inputs¢ When in doubt, the compiler must be conservative5Wuhan UniversityGenerally Useful Optimizations¢ Optimizations that you or the compiler should do regardlessof processor / compiler¢ Code Motion§Reduce frequenc

10、y with which computation performed§§If it will always produce same resultEspecially moving code out of loop6long j;int ni = n*i;for (j = 0; j < n; j+) ani+j = bj;void set_row(double *a, double *b,long i, long n)long j;for (j = 0; j < n; j+) an*i+j = bj;Wuhan UniversityCompiler-Genera

11、ted Code Motion (-O1)7set_row:testq%rcx, %rcx# Test njle.L1# If 0, goto doneimulq%rcx, %rdx# ni = n*ileaq(%rdi,%rdx,8), %rdx# rowp = A + ni*8 movl$0, %eax# j = 0.L3:# loop:movsd(%rsi,%rax,8), %xmm0# t = bjmovsd%xmm0, (%rdx,%rax,8)# MA+ni*8 + j*8 = t addq$1, %rax# j+cmpq%rcx, %rax# j:njne.L3# if !=,

12、goto loop.L1:# done:rep ; retlong j;long ni = n*i; double *rowp = a+ni;for (j = 0; j < n; j+)*rowp+ = bj;void set_row(double *a, double *b, long i, long n)long j;for (j = 0; j < n; j+)an*i+j = bj;Wuhan UniversityReduction in Strength§§Replace costly operation with simpler oneShift, a

13、dd instead of multiply or divide16*x->x << 4§ Utility machine dependent§ Depends on cost of multiply or divide instruction On Intel Haswell, integer multiply requires 3 CPU cycles Recognize sequence of products§8for (i = 0; i < n; i+) int ni = n*i;for (j = 0; j < n; j+)

14、 ani + j = bj;int ni = 0;for (i = 0; i < n; i+) for (j = 0; j < n; j+)ani + j = bj;ni += n;Wuhan UniversityShare Common Subexpressions§§Reuse portions of expressionsGCC will do this with O13 multiplications: i*n, (i1)*n, (i+1)*n1 multiplication: i*n9imulq%rcx, %rsi # i*n addq%rdx, %r

15、si # i*n+j movq%rsi, %rax # i*n+j subq%rcx, %rax # i*n+j-nleaq(%rsi,%rcx), %rcx # i*n+j+nleaq1(%rsi), %rax # i+1 leaq-1(%rsi), %r8 # i-1 imulq %rcx, %rsi# i*n imulq %rcx, %rax# (i+1)*n imulq %rcx, %r8# (i-1)*n addq%rdx, %rsi# i*n+jaddq%rdx, %rax# (i+1)*n+j addq%rdx, %r8# (i-1)*n+jlong inj = i*n + j;

16、 up =valinj - n; down = valinj + n; left = valinj - 1; right = valinj + 1;sum = up + down + left + right;/* Sum neighbors of i,j */ up =val(i-1)*n + j ; down = val(i+1)*n + j ; left = vali*n+ j-1; right = vali*n+ j+1;sum = up + down + left + right;Wuhan UniversityOptimization Blocker #1: Procedure C

17、alls¢ Procedure to Convert String to Lower Case10void lower(char *s)size_t i;for (i = 0; i < strlen(s); i+) if (si >= 'A' && si <= 'Z')si -= ('A' - 'a');Wuhan UniversityLower Case Conversion Performance§§Time double when double string l

18、engthQuadratic performance11CPU seconds250200150lower1100500050000100000150000200000250000300000350000400000450000500000String lengthWuhan UniversityConvert Loop To Goto Form§strlen executed every iteration12void lower(char *s)size_t i = 0;if (i >= strlen(s) goto done;loop:if (si >= '

19、A' && si <= 'Z')si -= ('A' - 'a');i+;if (i < strlen(s) goto loop;done:Wuhan UniversityCalling Strlen¢ Strlen performance§Only way to determine length of string is to scan inull character.tire length, looking for¢ Overall performance, string of

20、 length N§§§N calls to strlenRequire times N, N-1, N-2, , 1 Overall O(N2) performance13/* My version of strlen */ size_t strlen(const char *s)size_t length = 0; while (*s != '0') s+;length+;return length;Wuhan UniversityImproving Performance§§§Move call to strle

21、n outside of loopSince resuoes not change from one iteration to anotherForm of code motion14void lower(char *s)size_t i;size_t len = strlen(s); for (i = 0; i < len; i+)if (si >= 'A' && si <= 'Z')si -= ('A' - 'a');Wuhan UniversityLower Case Conversion

22、Performance§§Time doubles when double string lengthLinear performance of lower215CPU seconds250200150lower110050lower20050000100000150000200000250000300000350000400000450000500000String lengthWuhan UniversityOptimization Blocker: Procedure Calls¢ Why couldnt compiler move strlen out o

23、f inner loop?§Procedure may have side effects§ Alters global state each time calledFunction may not return same value for given arguments§ Depends on other parts of global state§ Procedure lower could interact with strlen§¢ Warning:§§Compiler treats procedure

24、call as a black boxWeak optimizations near them¢ Remedies:§Use of inline functions§ GCC does this with O1 Within single file Do your own code motion§16size_t lencnt = 0;size_t strlen(const char *s)size_t length = 0; while (*s != '0') s+; length+;lencnt += length;return le

25、ngth;Wuhan UniversityMemory Matters§§Code updates bi on every iterationWhy couldnt compiler optimize this away?17# sum_rows1 inner loop.L4:movsd(%rsi,%rax,8), %xmm0# FP load addsd(%rdi), %xmm0# FP add movsd%xmm0, (%rsi,%rax,8)# FP store addq$8, %rdicmpq%rcx, %rdijne.L4/* Sum rows is of n X

26、 n matrix a and store in vector b */void sum_rows1(double *a, double *b, long n) long i, j;for (i = 0; i < n; i+) bi = 0;for (j = 0; j < n; j+) bi += ai*n + j;Wuhan UniversityMemory AliasingValue of B:§§Code updates bi on every iterationMust consider possibility that these updates wi

27、ll affect program behavior18i = 2: 3, 22, 224i = 1: 3, 22, 16i = 0: 3, 8, 16init: 4, 8, 16double A9 = 0,1,2,4,8, 16,32, 64, 128;double B3 = A+3; sum_rows1(A, B, 3);/* Sum rows is of n X n matrix a and store in vector b */void sum_rows1(double *a, double *b, long n) long i, j;for (i = 0; i < n; i+

28、) bi = 0;for (j = 0; j < n; j+) bi += ai*n + j;Wuhan UniversityRemoving Aliasing§No need to store intermediate results19# sum_rows2 inner loop.L10:addsd(%rdi), %xmm0# FP load + add addq$8, %rdicmpq%rax, %rdijne.L10/* Sum rows is of n X n matrix a and store in vector b */void sum_rows2(double

29、 *a, double *b, long n) long i, j;for (i = 0; i < n; i+) double val = 0;for (j = 0; j < n; j+) val += ai*n + j;bi = val;Wuhan UniversityOptimization Blocker: Memory Aliasing¢ Aliasing§§Two different memory references specify single locationEasy to have happen in C§ Since al

30、lowed to do address arithmetic§ Direct access to storage structuresGet in habit of introducing local variables§Accumulating within loopsYour way of telling compiler not to check for aliasing§§20Wuhan UniversityExploiting Instruction-Level Parallelism¢ Need general understand

31、ing of modern processor design§Hardware can execute multiple instructions in parallel¢ Performance limited by data dependencies¢ Simple transformations can yield dramatic performance improvement§§Compilers often cannot make these transformationsLack of associativity and dist

32、ributivity in floating-point arithmetic21Wuhan UniversityBenchmark Example: Data Type for Vectors01len-1¢ Data Types§Use different declarationsfor data_tint long floatdouble§§§§22/* retrieve vector element and store at val */int get_vec_element(*vec v, size_t idx, data_

33、t *val)if (idx >= v->len) return 0;*val = v->dataidx; return 1;lendata/* data structure for vectors */ typedef structsize_t len; data_t *data; vec;Wuhan UniversityBenchmark ComputationCompute sum orproduct of vector elements¢ Data Types¢ Operations§§Use different declara

34、tionsfor data_tint long floatdoubleUse different definitions ofOP and IDENT§§§§§§+*01/23void combine1(vec_ptr v, data_t *dest)long int i;*dest = IDENT;for (i = 0; i < vec_length(v); i+) data_t val; get_vec_element(v, i, &val);*dest = *dest OP val;Wuhan University

35、Cycles Per Element (CPE)¢ Convenient way to express performance of program that operates on vectors or lists¢ Length = n¢ In our case: CPE = cycles per OP¢ T = CPE*n + Overhead§CPE is slope of line24Cycles25002000psum11500Slope = 9.01000psum2500Slope = 6.00050100150200Elemen

36、tsWuhan UniversityBenchmark PerformanceCompute sum orproduct of vector elements25MethodIntegerDouble FPOperationAddMultAddMultCombine1 unoptimized22.6820.0219.9820.18Combine1 O110.1210.1210.1711.14void combine1(vec_ptr v, data_t *dest)long int i;*dest = IDENT;for (i = 0; i < vec_length(v); i+) da

37、ta_t val; get_vec_element(v, i, &val);*dest = *dest OP val;Wuhan UniversityBasic Optimizations¢ Move vec_length out of loop¢ Avoid bounds check on each cycle¢ Accumulate in temporary26void combine4(vec_ptr v, data_t *dest)long i;long length = vec_length(v); data_t *d = get_vec_sta

38、rt(v); data_t t = IDENT;for (i = 0; i < length; i+) t = t OP di;*dest = t;Wuhan UniversityEffect of Basic Optimizations¢ Eliminates sources of overhead in loop27MethodIntegerDouble FPOperationAddMultAddMultCombine1 O110.1210.1210.1711.14Combine41.273.013.015.01void combine4(vec_ptr v, data_t

39、 *dest)long i;long length = vec_length(v); data_t *d = get_vec_start(v); data_t t = IDENT;for (i = 0; i < length; i+) t = t OP di;*dest = t;Wuhan UniversityModern CPU DesignRegister UpdatesPrediction OK?28Operation ResultsAddr.Addr.DataDataExecutionData CacheFunctionalUnitsStoreLoadArithArithArit

40、hBranchInstruction ControlAddressInstructions OperationsInstruction DecodeRetirement UnitRegister FileInstruction CacheFetchControlWuhan UniversitySuperscalar Processor¢ Definition: A superscalar processor can issue and execute multiple instructions in one cycle. The instructions are retrieved

41、from a sequential instruction stream and are usually scheduled dynamically.¢ Benefit: without programeffort, superscalarprocessor can take advantage of the instruction levelparallelism that most programs have¢ Most modern CPUs are superscalar.¢ Intel: since Pentium (1993)29Wuhan Unive

42、rsityPipelined Functional UnitsStage 1Stage 2Stage 3§§§§Divide computation into stagesPass partial computations from stage to stageStage i can start on new computation once values passed to i+1 E.g., complete 3 multiplications in 7 cycles, even though each requires 3 cycles30Time

43、1234567Stage 1a*ba*cp1*p2Stage 2a*ba*cp1*p2Stage 3a*ba*cp1*p2long mult_eg(long a, long b, long c) long p1 = a*b;long p2 = a*c; long p3 = p1 * p2; return p3;Wuhan UniversityHaswell CPU§8 Total Functional UnitsMultiple instructions can execute in parallel2 load, with address computation 1 store,

44、with address computation 4 integer2 FP multiply1 FP add1 FP divideSome instructions take > 1 cycle, but can be pipelined¢¢InstructionLoad / StoreInteger Multiply Integer/Long Divide Single/Double FP Multiply Single/Double FP Add Single/Double FP DivideLatency433-30533-15Cycles/Issue113-

45、30113-1531Wuhan Universityx86-64 Compilation of Combine4¢ Inner Loop (Case: Integer Multiply)32MethodIntegerDouble FPOperationAddMultAddMultCombine41.273.013.015.01Latency Bound1.003.003.005.00.L519:# Loop:imull (%rax,%rdx,4), %ecx # t = t * di addq $1, %rdx# i+cmpq %rdx, %rbp# Compare length:i

46、jg.L519# If >, goto LoopWuhan UniversityCombine4 = Serial Computation (OP = *)¢ Computation (length=8)(1 * d0) * d1) * d2) * d3)* d4) * d5) * d6) * d7)¢ Sequential dependence1 d0*d1*d2*§Performance: determined by latency of OPd3*d4*d5*d6*d7*33Wuhan UniversityLoop Unrolling (2x1)

47、62; Perform 2x more useful work per iteration34void unroll2a_combine(vec_ptr v, data_t *dest)long length = vec_length(v); long limit = length-1;data_t *d = get_vec_start(v); data_t x = IDENT;long i;/* Combine 2 elements at a time */for (i = 0; i < limit; i+=2) x = (x OP di) OP di+1;/* Finish any

48、remaining elements */ for (; i < length; i+) x = x OP di;*dest = x;Wuhan UniversityEffect of Loop Unrolling¢ Helps integer add§Achieves latency bound¢ Others dont improve. Why?§Still sequential dependency35x = (x OP di) OP di+1;MethodIntegerDouble FPOperationAddMultAddMultComb

49、ine41.273.013.015.01Unroll 2x11.013.013.015.01LatencyBound1.003.003.005.00Wuhan UniversityLoop Unrolling with Reassociation (2x1a)¢ Can this change the result of the computation?¢ Yes, for FP. Why?36void unroll2aa_combine(vec_ptr v, data_t *dest)long length = vec_length(v); long limit = le

50、ngth-1;data_t *d = get_vec_start(v); data_t x = IDENT;long i;/* Combine 2 elements at a time */for (i = 0; i < limit; i+=2) x = x OP (di OP di+1);/* Finish any remaining elements */ for (; i < length; i+) x = x OP di;*dest = x;x = (x OP di) OP di+1;Compare to beforeWuhan UniversityEffect of Re

51、association¢ Nearly 2x speedup for Int *, FP +, FP *2 func. units for FP *2 func. units for load§Reason: Breaks sequential dependency4 func. units for int +2 func. units for load§Why is that? (next slide)37x = x OP (di OP di+1);MethodIntegerDouble FPOperationAddMultAddMultCombine41.27

52、3.013.015.01Unroll 2x11.013.013.015.01Unroll 2x1a1.011.511.512.51Latency Bound1.003.003.005.00ThroughputBound0.501.001.000.50Wuhan UniversityReassociated ComputationWhat changed:§ Ops in the next iteration can be started early (no dependency)¢d0 d1*Overall Performance¢d d1§§

53、23N elements, D cycles latency/op(N/2+1)*D cycles:CPE = D/2*d4 d5*d6 d7*38x = x OP (di OP di+1);Wuhan UniversityLoop Unrolling with Separate Accumulators(2x2)¢ Different form of reassociation39void unroll2a_combine(vec_ptr v, data_t *dest)long length = vec_length(v); long limit = length-1;data_

54、t *d = get_vec_start(v);data_t x0 = IDENT; data_t x1 = IDENT; long i;/* Combine 2 elements at a time */for (i = 0; i < limit; i+=2) x0 = x0 OP di; x1 = x1 OP di+1;/* Finish any remaining elements */ for (; i < length; i+) x0 = x0 OP di;*dest = x0 OP x1;Wuhan UniversityEffect of Separate Accumulators

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论