




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Parallel Programmingin C with MPI and OpenMP,Michael J. Quinn,Chapter 11,Matrix Multiplication,Outline,Sequential algorithms Iterative, row-oriented Recursive, block-oriented Parallel algorithms Rowwise block striped decomposition Cannons algorithm,Iterative, Row-oriented Algorithm,Series of inner p
2、roduct (dot product) operations,Performance as n Increases,Reason:Matrix B Gets Too Big for Cache,Computing a row of C requires accessing every element of B,Block Matrix Multiplication,Replace scalar multiplication with matrix multiplication Replace scalar addition with matrix addition,Recurse Until
3、 B Small Enough,Comparing Sequential Performance,First Parallel Algorithm,Partitioning Divide matrices into rows Each primitive task has corresponding rows of three matrices Communication Each task must eventually see every row of B Organize tasks into a ring,First Parallel Algorithm (cont.),Agglome
4、ration and mapping Fixed number of tasks, each requiring same amount of computation Regular communication among tasks Strategy: Assign each process a contiguous group of rows,Communication of B,A,A,B,C,A,A,B,C,A,A,B,C,A,A,B,C,Communication of B,A,A,B,C,A,A,B,C,A,A,B,C,A,A,B,C,Communication of B,A,A,
5、B,C,A,A,B,C,A,A,B,C,A,A,B,C,Communication of B,A,A,B,C,A,A,B,C,A,A,B,C,A,A,B,C,Complexity Analysis,Algorithm has p iterations During each iteration a process multiplies(n / p) (n / p) block of A by (n / p) n block of B: (n3 / p2) Total computation time: (n3 / p) Each process ends up passing(p-1)n2/p
6、 = (n2) elements of B,Isoefficiency Analysis,Sequential algorithm: (n3) Parallel overhead: (pn2)Isoefficiency relation: n3 Cpn2 n Cp This system does not have good scalability,Weakness of Algorithm 1,Blocks of B being manipulated have p times more columns than rows Each process must access every ele
7、ment of matrix B Ratio of computations per communication is poor: only 2n / p,Parallel Algorithm 2(Cannons Algorithm),Associate a primitive task with each matrix element Agglomerate tasks responsible for a square (or nearly square) block of C Computation-to-communication ratio rises to n / p,Element
8、s of A and B Needed to Compute a Processs Portion of C,Algorithm 1,Cannons Algorithm,Blocks Must Be Aligned,Before,After,Blocks Need to Be Aligned,A00,B00,A01,B01,A02,B02,A03,B03,A10,B10,A11,B11,A12,B12,A13,B13,A20,B20,A21,B21,A22,B22,A23,B23,A30,B30,A31,B31,A32,B32,A33,B33,Each triangle represents
9、a matrix block Only same-color triangles should be multiplied,Rearrange Blocks,A00,B00,A01,B01,A02,B02,A03,B03,A10,B10,A11,B11,A12,B12,A13,B13,A20,B20,A21,B21,A22,B22,A23,B23,A30,B30,A31,B31,A32,B32,A33,B33,Block Aij cycles left i positions Block Bij cycles up j positions,Consider Process P1,2,B02,A
10、10,A11,A12,B12,A13,B22,B32,Step 1,Consider Process P1,2,B12,A11,A12,A13,B22,A10,B32,B02,Step 2,Consider Process P1,2,B22,A12,A13,A10,B32,A11,B02,B12,Step 3,Consider Process P1,2,B32,A13,A10,A11,B02,A12,B12,B22,Step 4,Complexity Analysis,Algorithm has p iterations During each iteration process multip
11、lies two (n / p ) (n / p ) matrices: (n3 / p 3/2) Computational complexity: (n3 / p) During each iteration process sends and receives two blocks of size (n / p ) (n / p ) Communication complexity: (n2/ p),Isoefficiency Analysis,Sequential algorithm: (n3) Parallel overhead: (pn2) Isoefficiency relati
12、on: n3 C pn2 n C p This system is highly scalable,Summary,Considered two sequential algorithms Iterative, row-oriented algorithm Recursive, block-oriented algorithm Second has better cache hit rate as n increases Developed two parallel algorithms First based on rowwise block striped decomposition Se
13、cond based on checkerboard block decomposition Second algorithm is scalable, while first is not,Parallel Programmingin C with MPI and OpenMP,Michael J. Quinn,Chapter 12,Solving Linear Systems,Outline,Terminology Back substitution Gaussian elimination Jacobi method Conjugate gradient method,Terminolo
14、gy,System of linear equations Solve Ax = b for x Special matrices Symmetrically banded Upper triangular Lower triangular Diagonally dominant Symmetric,Symmetrically Banded,4,2,-1,0,0,0,3,-4,5,6,0,0,1,6,3,2,4,0,0,2,-2,0,9,2,0,0,7,3,8,7,0,0,0,4,0,2,Semibandwidth 2,Upper Triangular,4,2,-1,5,9,2,0,-4,5,
15、6,0,-4,0,0,3,2,4,6,0,0,0,0,9,2,0,0,0,0,8,7,0,0,0,0,0,2,Lower Triangular,4,0,0,0,0,0,0,0,0,0,0,0,5,4,3,0,0,0,2,6,2,3,0,0,8,-2,0,1,8,0,-3,5,7,9,5,2,Diagonally Dominant,19,0,2,2,0,6,0,-15,2,0,-3,0,5,4,22,-1,0,4,2,3,2,13,0,-5,5,-2,0,1,16,0,-3,5,5,3,5,-32,Symmetric,3,0,2,2,0,6,0,7,4,3,-3,5,5,4,0,-1,0,4,2
16、,3,-1,9,0,-5,0,-3,0,0,5,5,6,5,4,-5,5,-3,Back Substitution,Used to solve upper triangular systemTx = b for x Methodology: one element of x can be immediately computed Use this value to simplify system, revealing another element that can be immediately computed Repeat,Back Substitution,1x0,+1x1,1x2,+4
17、x3,8,=, 2x1,3x2,+1x3,5,=,2x2, 3x3,0,=,2x3,4,=,Back Substitution,1x0,+1x1,1x2,+4x3,8,=, 2x1,3x2,+1x3,5,=,2x2, 3x3,0,=,2x3,4,=,x3 = 2,Back Substitution,1x0,+1x1,1x2,0,=, 2x1,3x2,3,=,2x2,6,=,2x3,4,=,Back Substitution,1x0,+1x1,1x2,0,=, 2x1,3x2,3,=,2x2,6,=,2x3,4,=,x2 = 3,Back Substitution,1x0,+1x1,3,=, 2
18、x1,12,=,2x2,6,=,2x3,4,=,Back Substitution,1x0,+1x1,3,=, 2x1,12,=,2x2,6,=,2x3,4,=,x1 = 6,Back Substitution,1x0,9,=, 2x1,12,=,2x2,6,=,2x3,4,=,Back Substitution,1x0,9,=, 2x1,12,=,2x2,6,=,2x3,4,=,x0 = 9,Pseudocode,for i n 1 down to 1 do x i b i / a i, i for j 0 to i 1 do b j b j x i a j, i endfor endfor
19、,Time complexity: (n2),Data Dependence Diagram,We cannot execute the outer loop in parallel. We can execute the inner loop in parallel.,Row-oriented Algorithm,Associate primitive task with each row of A and corresponding elements of x and b During iteration i task associated with row j computes new
20、value of bj Task i must compute xi and broadcast its value Agglomerate using rowwise interleaved striped decomposition,Interleaved Decompositions,Rowwise interleaved striped decomposition,Columnwise interleaved striped decomposition,Complexity Analysis,Each process performs about n / (2p) iterations
21、 of loop j in all A total of n -1 iterations in all Computational complexity: (n2/p) One broadcast per iteration Communication complexity: (n log p),Column-oriented Algorithm,Associate one primitive task per column of A and associated element of x Last task starts with vector b During iteration i ta
22、sk i computes xi, updates b, and sends b to task i -1 In other words, no computational concurrency Agglomerate tasks in interleaved fashion,Complexity Analysis,Since b always updated by a single process, computational complexity same as sequential algorithm: (n2) Since elements of b passed from one process to another each iteration, communication complexity is (n2),Comparison,Message- passing time dominates,Computation time dominates,Gaussian Elimination,Used to solve Ax = b when A is dense Reduces Ax = b to upper triangular system Tx = c Back su
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
评论
0/150
提交评论