




High Performance Numerical Linear Algebra Computing (Part I)
Zhaojun Bai
Department of Computer Science & Department of Mathematics
University of California, Davis (.../bai)

Acknowledgements
Lecture notes are in part based on previous notes:
- Jim Demmel, UC Berkeley
- Kathy Yelick, UC Berkeley
- Jack Dongarra, UT Knoxville
- John Gilbert, Xerox PARC / UCSB

Contents of lectures:
I. Introduction: overview of high performance computing, its impact on linear algebra and numerical libraries, principles of parallel computing.
II. Uniprocessor optimization, matrix multiplication, and BLAS: parallelism in a modern uniprocessor, memory hierarchies, cache and its importance in performance, cache optimization of matrix multiply, BLAS.
III. Dense matrix computing: motivation, optimizing Gaussian elimination, block algorithms, LAPACK and ScaLAPACK.
IV. Sparse matrix computing: the landscape of sparse Ax = b solvers, direct and iterative methods, algorithm templates for solving linear systems and eigenvalue problems.

Outline of Introduction
- Large important problems require powerful computers
- Why powerful computers must be parallel processors
- Principles of parallel computing performance
- High performance computing: now and future
- Supplement: floating point arithmetic
- Demos in Matlab
Why We Need Powerful Computers
Traditional scientific and engineering paradigm:
- Do theory or paper design.
- Perform experiments or build a system.
Limitations:
- Too difficult: build large wind tunnels.
- Too expensive: build a throw-away passenger jet.
- Too slow: wait for climate or galactic evolution.
- Too dangerous: weapons, drug design, climate experimentation.
Computational science paradigm:
- Use high performance computer systems to simulate the phenomenon.
- Based on known physical laws and efficient numerical methods.
Simulation: the third pillar of science.

Computational Science
HPC offers a new way to do science: experiment, theory, computation. Computation is used to approximate physical systems. Advantages include:
- Playing with simulation parameters to study emergent trends
- Possible replay of a particular simulation event
- Studying systems where no exact theories exist

Some Particularly Challenging Computations
Science:
- Global climate modeling
- Astrophysical modeling
- Biology: genome analysis; protein folding (drug design)
Engineering:
- Crash simulation
- Semiconductor design
- Earthquake and structural modeling
Business:
- Financial and economic modeling
- Transaction processing, web services and search engines
Defense:
- Nuclear weapons: test by simulations
- Cryptography

Units of Measure in HPC
High Performance Computing (HPC) units are:
- Flop/s: floating point operations per second
- Bytes: size of data
Typical sizes are millions, billions, trillions:
- Mega: Mflop/s = 10^6 flop/sec; Mbyte = 10^6 bytes (2^20 = 1,048,576)
- Giga: Gflop/s = 10^9 flop/sec; Gbyte = 10^9 bytes (2^30 = 1,073,741,824)
- Tera: Tflop/s = 10^12 flop/sec; Tbyte = 10^12 bytes (2^40 = 1,099,511,627,776)
- Peta: Pflop/s = 10^15 flop/sec; Pbyte = 10^15 bytes (2^50 = 1,125,899,906,842,624)
Economic Impact of HPC
Airlines:
- System-wide logistics optimization systems run on parallel systems.
- Savings: approx. $100 million per airline per year.
Automotive design:
- Major automotive companies use large systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity and aerodynamics.
- One company has a 500+ CPU parallel system.
- Savings: approx. $1 billion per company per year.
Semiconductor industry:
- Semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation.
- Savings: approx. $1 billion per company per year.
Securities industry:
- Savings: approx. $15 billion per year for U.S. home mortgages.

High Performance Computers
- 20 years ago: 1x10^6 floating point ops/sec (Mflop/s); scalar based.
- 10 years ago: 1x10^9 floating point ops/sec (Gflop/s); vector & shared memory computing, bandwidth aware; block partitioned, latency tolerant.
- Today: 1x10^12 floating point ops/sec (Tflop/s); highly parallel, distributed processing, message passing, network based; data decomposition, communication/computation overlap.
- 10 years away: 1x10^15 floating point ops/sec (Pflop/s); many more levels of memory hierarchy, combinations of grids & HPC; more adaptive, latency tolerant and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes.

Why Most Powerful Computers Are Parallel

Technology Trends: Microprocessor Capacity
"Moore's Law": 2X transistors per chip every 1.5 years. Microprocessors have become smaller, denser, and more powerful. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. (Slide source: Jack Dongarra)
Processor-Memory Problem
- Processors issue instructions roughly every nanosecond.
- DRAM can be accessed roughly every 100 nanoseconds (!).
- DRAM cannot keep processors busy! And the gap is growing:
  - processors are getting faster by 60% per year
  - DRAM is getting faster by 7% per year (SDRAM and EDO RAM might help, but not enough); the arithmetic sketch below shows how these two rates compound into the gap in the next figure.

Processor-DRAM Gap (latency)
[Figure: processor vs. DRAM performance, 1980-2000, log scale. CPU performance ("Moore's Law") improves about 60% per year, DRAM about 7% per year; the processor-memory performance gap grows roughly 50% per year.]
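The link between the two quoted rates and the "grows 50% per year" figure is simple compounding; the little sketch below is illustrative arithmetic, not taken from the slides.

```python
# Sketch (illustrative arithmetic): compounding the quoted improvement rates
# shows why the processor-memory gap grows by about 50% per year.
proc_rate, dram_rate = 1.60, 1.07   # 60%/yr processor, 7%/yr DRAM improvement
gap_per_year = proc_rate / dram_rate
print(f"gap grows by about {100 * (gap_per_year - 1):.0f}% per year")
print(f"after 10 years the gap is about {gap_per_year ** 10:.0f}x wider")
```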
Why Powerful Computers Are Parallel
- Desire to solve bigger, more realistic applications
- Fundamental limits are being approached
- A more cost effective solution

Trends in Parallel Computing Performance
[Figure: performance of several machines on the Linpack benchmark (dense matrix factorization), 1985-1995, from about 0.1 to 1000 Gflop/s; machines include Xmp, Ymp/832, C90, T932, Cray VPP, iPSC/860, nCUBE/2, CM-2, CM-200, Delta, CM-5, Cray T3D, Paragon XP/S, Paragon XP/S MP (1024 and 6768 processors), and ASCI Red.]
TOP500
- List of the 500 most powerful computers in the world.
- Yardstick: Rmax from the LINPACK MPP benchmark (Ax = b, dense problem).
- Updated twice a year.
- All data are available from the TOP500 project site.

Winner of TOP500, by Year
[Table: TOP500 winner by year, 1995-2002, with columns Year, Machine, Linpack Tflop/s, factor faster than the previous winner, peak Tflop/s, number of processors, and problem size N for Ax = b. Winners: 2002 Earth System Computer (NEC, 35.6 Tflop/s); 2001 and 2000 ASCI White (IBM SP Power 3, 7.2 Tflop/s in 2001, 7424 processors); 1999 ASCI Red (Intel PII Xeon, 9632 processors); 1998 ASCI Blue (IBM SP 604E, 5808 processors); 1997 ASCI Red (Intel Pentium Pro, 200 MHz, 9152 processors); 1996 Hitachi CP-PACS (0.37 Tflop/s, 2048 processors); 1995 Intel Paragon XP/S MP (0.28 Tflop/s, 6768 processors).]

Top 10 Machines (Nov. 2001)
[Table: the top 10 machines from the November 2001 TOP500 list.]

Architectures
[Figure: architecture breakdown of the TOP500 list, June 1993 - November 2001 (counts out of 500): single processor, SMP, MPP, SIMD, constellation, and cluster/NOW systems; representative machines include Y-MP C90, SX3, VP500, CM2, CM5, Paragon, T3D, T3E, SP2, Sun HPC and clusters of Sun HPC, and ASCI Red. A constellation is a system with at least as many processors per node as nodes.]
Principles of Parallel Computing
- Parallelism and Amdahl's Law
- Finding and exploiting granularity
- Preserving data locality
- Load balancing
- Coordination and synchronization
- Performance modeling
All of these things make parallel programming more difficult than sequential programming.

Locality and Parallelism
- Large memories are slow; fast memories are small.
- Storage hierarchies are large and fast on average.
- Parallel processors, collectively, have large, fast memories; the slow accesses to "remote" data are what we call "communication".
- Algorithms should do most of their work on local data.
[Figure: the conventional storage hierarchy replicated per processor (processor, cache, L2 cache, L3 cache, memory), with potential interconnects between the processors' memories.]
Highly Parallel Computing: Where Are We?
Performance:
- Sustained performance has dramatically increased.
- On most applications, sustained performance per dollar now exceeds that of conventional supercomputers.
- But: conventional supercomputer systems are still faster on some applications.
Languages and compilers:
- Standardized, portable, high-level languages and libraries such as HPF, PVM and MPI are available.
- But: initial HPF releases are not very efficient; message passing programming is tedious and hard to debug; programming difficulty remains a major obstacle to usage by mainstream scientists.

Highly Parallel Computing: Where Are We? (continued)
Operating systems:
- Robustness and reliability are improving.
- New system management tools improve system utilization.
- But: reliability is still not as good as on conventional systems.
I/O subsystems:
- New RAID disks, HiPPI interfaces, etc. provide substantially improved I/O performance.
- But: I/O remains a bottleneck on some systems.

The Importance of Standards - Software
- Writing programs for MPPs is hard, but they are no longer one-off efforts if written in a standard language.
- The past lack of parallel programming standards restricted uptake of the technology (to enthusiasts) and reduced portability (over a range of current architectures and between future generations).
- Now standards exist (PVM, MPI & HPF), which allow users and manufacturers to protect their software investment and encourage the growth of a third-party parallel software industry and of parallel versions of widely used codes.

The Importance of Standards - Hardware
Processors:
- commodity RISC processors
Interconnects:
- high bandwidth, low latency communications protocol
- no de-facto standard yet (ATM, Fibre Channel, HiPPI, FDDI)
Growing demand for a total solution:
- robust hardware + usable software
- HPC systems containing all the programming tools / environments / languages / libraries / applications packages found on desktops

The Future of HPC
- The expense of being different is being replaced by the economics of being the same.
- HPC needs to lose its special purpose tag.
- It still has to deliver on the promise of scalable general purpose computing, but it is dangerous to ignore this technology.
- Final success will come when MPP technology is embedded in desktop computing.
- Yesterday's HPC is today's mainframe is tomorrow's workstation.
Outline of Uniprocessor Optimization
- Memory hierarchies
- Cache and its importance in performance
- Optimizing matrix multiply for caches
- BLAS
- Bag of tricks
- Supplement: Strassen's algorithm

Memory Hierarchy
- Most programs have a high degree of locality in their accesses:
  - spatial locality: accessing things nearby previous accesses
  - temporal locality: reusing an item that was previously accessed
- The memory hierarchy tries to exploit locality.
- By taking advantage of the principle of locality:
  - present the user with as much memory as is available in the cheapest technology
  - provide access at the speed offered by the fastest technology

Memory Hierarchy (diagram)
[Figure: the memory hierarchy of a uniprocessor, from processor registers and on-chip cache (about 1-10 ns, hundreds of bytes to KBytes), through a second-level SRAM cache and main DRAM memory (around 100 ns, MBytes to GBytes), to secondary storage on disk (about 10 ms) and tertiary storage on disk/tape (about 10 s, TBytes). Each level is larger and slower than the one above it.]

Levels of the Memory Hierarchy
- CPU registers: 100s of bytes, about 10 ns; data moves to/from cache as instruction operands (1-8 bytes, managed by the program/compiler).
- Cache: KBytes, 10-100 ns, 1-0.1 cents/bit; data moves to/from memory as cache blocks (8-128 bytes, managed by the cache controller).
- Main memory: MBytes, 200-500 ns, 0.0001-0.00001 cents/bit; data moves to/from disk as pages (512 bytes-4 KBytes, managed by the OS).
- Disk: GBytes, about 10 ms (10,000,000 ns), 10^-5 to 10^-6 cents/bit; data moves to/from tape as files (MBytes, managed by the user/operator).
- Tape: effectively infinite capacity, seconds to minutes, about 10^-8 cents/bit.
The upper levels are faster; the lower levels are larger. (In the parallel setting, disk also stands in for distributed memory, and tape for clusters.)
Idealized Uniprocessor Model
- The processor names bytes, words, etc. in its address space.
  - These represent integers, floats, pointers, arrays, etc.
  - They exist in the program stack, static region, or heap.
- Operations include:
  - read and write (given an address/pointer)
  - arithmetic and other logical operations
- Order is specified by the program:
  - a read returns the most recently written data
  - the compiler and architecture translate high level expressions into "obvious" lower level instructions
  - the hardware executes instructions in the order specified by the compiler
- Cost: each operation has roughly the same cost (read, write, add, multiply, etc.).

Uniprocessors in the Real World
Real processors have:
- registers and caches
  - small amounts of fast memory that store values of recently used or nearby data
  - different memory operations can have very different costs
- parallelism
  - multiple "functional units" that can run in parallel
  - different orders and instruction mixes have different costs
- pipelining
  - a form of parallelism, like an assembly line in a factory
Why is this your problem? In theory, compilers understand all of this and can optimize your program; in practice they don't.

Processor-DRAM Gap (latency)
[Figure, repeated: processor vs. DRAM performance, 1980-2000; CPU improves about 60% per year, DRAM about 7% per year, and the processor-memory performance gap grows about 50% per year.]
- Memory hierarchies are getting deeper.
- Processors get faster more quickly than memory.
Matrix Multiply, Optimized Several Ways
[Figure: speed of n-by-n matrix multiply on a Sun Ultra-1/170, peak = 330 Mflop/s, for several levels of optimization.]

Cache and Its Importance in Performance
Motivation:
- Time to run code = clock cycles running code + clock cycles waiting for memory.
- For many years, CPUs have sped up by an average of 50% per year over memory chip speedups.
- Hence, memory access is the bottleneck to computing fast.
Definition of a cache:
- Dictionary: a safe place to hide or store things.
- Computer: a level in a memory hierarchy.

Cache Benefits
Data caches were designed with two key concepts in mind:
- Spatial locality
  - When an element is referenced, its neighbors will be referenced too.
  - Cache lines are fetched together.
  - Work on consecutive data elements in the same cache line.
- Temporal locality
  - When an element is referenced, it might be referenced again soon.
  - Arrange code so that data in cache is reused often.
The sketch below shows the effect of spatial locality in practice.
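The effect can be seen even from high-level code: traversing a row-major array along its rows uses every element of each fetched cache line, while traversing it down its columns strides through memory and wastes most of each line. This is an illustrative sketch only; the array size and timing method are arbitrary choices, not part of the lecture notes.

```python
# Sketch: spatial locality in practice. Row-wise traversal of a row-major
# (C-ordered) array touches consecutive memory locations; column-wise
# traversal jumps n*8 bytes per element and reuses little of each cache line.
import time
import numpy as np

n = 4000
A = np.random.rand(n, n)          # NumPy arrays are row-major (C order) by default

def sum_by_rows(A):
    s = 0.0
    for i in range(A.shape[0]):   # consecutive elements of a row are adjacent in memory
        s += A[i, :].sum()
    return s

def sum_by_cols(A):
    s = 0.0
    for j in range(A.shape[1]):   # elements of a column are far apart in memory
        s += A[:, j].sum()
    return s

for f in (sum_by_rows, sum_by_cols):
    t0 = time.perf_counter()
    f(A)
    print(f.__name__, time.perf_counter() - t0, "seconds")
```

On most machines the column-wise version is noticeably slower, even though both do exactly the same arithmetic.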
Lessons
- The actual performance of a simple program can be a complicated function of the architecture.
  - Slight changes in the architecture or the program can change the performance significantly.
  - To write fast programs, you need to consider the architecture.
  - We would like simple models to help us design efficient algorithms. Is this possible?
- We will illustrate with a common technique for improving cache performance, called blocking or tiling.
  - Idea: use divide-and-conquer to define a problem that fits in registers / L1 cache / L2 cache.

Using a Simple Model of Memory to Optimize
Assume just 2 levels in the hierarchy, fast and slow, with all data initially in slow memory.
- m = number of memory elements (words) moved between fast and slow memory
- tm = time per slow memory operation
- f = number of arithmetic operations
- tf = time per arithmetic operation, with tf << tm
- q = f / m = average number of flops per slow memory access
Minimum possible time = f * tf, when all data is in fast memory.
Actual time = f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q)).
A larger q means the time is closer to the minimum f * tf. Here q is the key to algorithm efficiency, and tm/tf is the key to machine efficiency. The sketch below evaluates this model for a few values of q.
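As a quick illustration of the model, the predicted slowdown over the arithmetic-only minimum is 1 + (tm/tf)/q. The machine balance tm/tf = 50 used below is an assumed, illustrative value, not a figure from the notes.

```python
# Sketch: predicted slowdown of the two-level memory model,
# time = f*tf*(1 + (tm/tf)/q). The balance tm/tf = 50 is assumed for illustration.
tm_over_tf = 50.0
for q in (1, 2, 10, 50, 200):
    slowdown = 1.0 + tm_over_tf / q
    print(f"q = {q:4d}: predicted time = {slowdown:5.1f} x (f * tf)")
```

With q = 2 (matrix-vector multiply, next slide) the code runs far from the arithmetic bound; only when q approaches the machine balance does the time approach f * tf.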
Warm Up: Matrix-Vector Multiplication
Implements y = y + A*x:

    for i = 1:n
      for j = 1:n
        y(i) = y(i) + A(i,j)*x(j)

(Picture: y(i) = y(i) + A(i,:)*x(:).)

With the memory traffic made explicit:

    {read x(1:n) into fast memory}
    {read y(1:n) into fast memory}
    for i = 1:n
      {read row i of A into fast memory}
      for j = 1:n
        y(i) = y(i) + A(i,j)*x(j)
    {write y(1:n) back to slow memory}

- m = number of slow memory references = 3n + n^2
- f = number of arithmetic operations = 2n^2
- q = f / m ~= 2
Matrix-vector multiplication is limited by slow memory speed; a runnable version of this count follows below.
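Here is the warm-up restated as runnable Python. It is a sketch for checking the operation and traffic counts above, not something you would use in practice (NumPy's `A @ x` calls an optimized BLAS instead).

```python
# Sketch: matrix-vector multiply y = y + A*x with explicit flop and
# slow-memory-reference counts, mirroring the analysis in the notes.
import numpy as np

def matvec_counts(n):
    A = np.random.rand(n, n)
    x = np.random.rand(n)
    y = np.zeros(n)
    for i in range(n):
        # row i of A comes from slow memory once; x and y stay in fast memory
        for j in range(n):
            y[i] += A[i, j] * x[j]
    f = 2 * n * n          # one multiply and one add per (i, j)
    m = 3 * n + n * n      # read x, read y, write y (3n) plus every element of A (n^2)
    return f, m, f / m

for n in (100, 1000):
    f, m, q = matvec_counts(n)
    print(f"n = {n}: f = {f}, m = {m}, q = {q:.2f}")
```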
"Naïve" Matrix Multiply
Implements C = C + A*B:

    for i = 1 to n
      for j = 1 to n
        for k = 1 to n
          C(i,j) = C(i,j) + A(i,k)*B(k,j)

(Picture: C(i,j) = C(i,j) + A(i,:)*B(:,j).)

With the memory traffic made explicit:

    for i = 1 to n
      {read row i of A into fast memory}
      for j = 1 to n
        {read C(i,j) into fast memory}
        {read column j of B into fast memory}
        for k = 1 to n
          C(i,j) = C(i,j) + A(i,k)*B(k,j)
        {write C(i,j) back to slow memory}

Number of slow memory references for the unblocked matrix multiply:

    m = n^3   (read each column of B n times)
      + n^2   (read each row of A once)
      + 2n^2  (read and write each element of C once)
      = n^3 + 3n^2

So q = f / m = 2n^3 / (n^3 + 3n^2) ~= 2 for large n: no improvement over matrix-vector multiply. A quick numerical check of this ratio follows below.
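The following illustrative snippet simply evaluates the ratio q = f/m from the count above, showing how quickly it levels off at 2 as n grows.

```python
# Sketch: q = f/m for the unblocked (naive) matrix multiply,
# with f = 2n^3 and m = n^3 + 3n^2 as counted in the notes.
for n in (10, 100, 1000, 10000):
    f = 2 * n**3
    m = n**3 + 3 * n**2
    print(f"n = {n:6d}: q = {f / m:.3f}")
```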
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size.

    for i = 1 to N
      for j = 1 to N
        {read block C(i,j) into fast memory}
        for k = 1 to N
          {read block A(i,k) into fast memory}
          {read block B(k,j) into fast memory}
          C(i,j) = C(i,j) + A(i,k)*B(k,j)   {do a matrix multiply on blocks}
        {write block C(i,j) back to slow memory}

Recall:
- m is the amount of memory traffic between slow and fast memory;
- the matrix has n-by-n elements and N-by-N blocks, each of size b-by-b;
- f is the number of floating point operations, 2n^3 for this problem;
- q = f / m is our measure of algorithm efficiency in the memory system.
The amount of memory traffic is

    m = N*n^2   (read each block of B N^3 times: N^3 * (n/N) * (n/N))
      + N*n^2   (read each block of A N^3 times)
      + 2n^2    (read and write each block of C once)
      = (2N + 2) * n^2

So q = f / m = 2n^3 / ((2N + 2) * n^2) ~= n/N = b for large n.
Hence we can improve performance by increasing the block size b. Blocked matrix multiply can be much faster than matrix-vector multiply (q = 2). A runnable sketch of the blocked loop appears below.
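The following is a minimal Python/NumPy sketch of the blocked loop above. For simplicity it assumes the block size b divides n evenly; it is meant to show the loop structure, not to compete with an optimized BLAS gemm.

```python
# Sketch: blocked (tiled) matrix multiply C = C + A*B with block size b.
# Assumes b divides n evenly, purely to keep the indexing simple.
import numpy as np

def blocked_matmul(A, B, C, b):
    n = A.shape[0]
    assert n % b == 0, "this sketch assumes the block size divides n"
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]            # block C(i,j): a view, kept "in fast memory"
            for k in range(0, n, b):
                # multiply-accumulate one pair of b-by-b blocks
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

n, b = 512, 64
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = np.zeros((n, n))
blocked_matmul(A, B, C, b)
print("max error vs. NumPy:", np.max(np.abs(C - A @ B)))
```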
Limits to Optimizing Matrix Multiply
- The blocked algorithm has ratio q ~= b: the larger the block size, the more efficient the algorithm will be.
- Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make the blocks arbitrarily large: 3b^2 <= M, so q ~= b <= sqrt(M/3) (the sketch below evaluates this bound).
- For an n-by-n matrix-matrix multiply (BLAS3), the data moved is about 4n^2 words while f = O(n^3), so q can possibly be as large as n; BLAS3 is therefore potentially much faster than BLAS2.
- Good algorithms use BLAS3 whenever possible (e.g., LAPACK). See /blas, /lapack.
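To make the 3b^2 <= M constraint concrete, here is a tiny sketch that derives the largest usable block size from a cache size. The 256 KB cache and 8-byte double words are assumed, illustrative numbers.

```python
# Sketch: largest block size b such that three b-by-b double-precision blocks
# fit in a fast memory of M words, from 3*b^2 <= M (cache size is assumed).
import math

cache_bytes = 256 * 1024          # assumed fast-memory (cache) size
M = cache_bytes // 8              # capacity in 8-byte double words
b = int(math.isqrt(M // 3))       # largest b with 3*b^2 <= M
print(f"M = {M} words -> block size b <= {b}, so q ~= {b}")
```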
Level 1, 2 and 3 BLAS
- Level 1 BLAS: vector-vector operations
- Level 2 BLAS: matrix-vector operations
- Level 3 BLAS: matrix-matrix operations

Level 1 BLAS
Operate on vectors or pairs of vectors; perform O(n) operations and return either a vector or a scalar.
- saxpy: y(i) = a*x(i) + y(i), for i = 1 to n. ("s" stands for single precision; daxpy is double precision, caxpy complex, and zaxpy double complex.)
- sscal: y = a*x, for a scalar a and vectors x, y.
- sdot: computes s = sum_{i=1}^{n} x(i)*y(i).

Level 2 BLAS
Operate on a matrix and a vector; return a matrix or a vector; O(n^2) operations.
- sgemv: matrix-vector multiply, y = y + A*x, where A is m-by-n, x is n-by-1, and y is m-by-1.
- sger: rank-one update, A = A + y*x^T, i.e., A(i,j) = A(i,j) + y(i)*x(j), where A is m-by-n, y is m-by-1, and x is n-by-1.
The NumPy sketch below shows the operations these routines compute.
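The snippet below writes out the operations computed by representative Level 1, 2 and 3 routines using NumPy (which itself dispatches to an optimized BLAS for the matrix products). The array sizes are arbitrary, and the routine names in the comments refer to the single-precision BLAS just described.

```python
# Sketch: the operations computed by representative Level 1, 2 and 3 BLAS routines,
# written with NumPy (NumPy calls an optimized BLAS underneath for gemv/gemm).
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3
a = 2.5
u, v = rng.random(n), rng.random(n)
x, y = rng.random(n), rng.random(m)
A = rng.random((m, n))
B = rng.random((n, m))

# Level 1 (O(n) work on O(n) data): axpy, scal, dot
v = a * u + v               # saxpy: v = a*u + v
u_scaled = a * u            # sscal: scale a vector by a
s = np.dot(u, v)            # sdot: s = sum_i u(i)*v(i)

# Level 2 (O(n^2) work on O(n^2) data): gemv, ger
y = y + A @ x               # sgemv: y = y + A*x, with A m-by-n
A_upd = A + np.outer(y, x)  # sger: rank-one update A = A + y*x^T

# Level 3 (O(n^3) work on O(n^2) data): gemm, the operation blocking targets
C = A @ B                   # sgemm: C = A*B
print(s, y.shape, A_upd.shape, C.shape)
```

Moving from Level 1 and 2 to Level 3 raises q from O(1) to O(n), which is exactly why LAPACK is organized around BLAS3 calls.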