版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、High Performance Information Retrieval and MEMS CADon Intel ItaniumPerformance Tuning ParticipantsResearchersMark Adams (SNL), David Bailey (LBL), Parry Husbands (LBL), Xiaoye Li (LBL), Lenny Oliker (LBL)PhD StudentsYozo Hida, Geoff Pike, Rich VuducUndergradsBrian Gaeke , Jen Hsu, Shoaib Kamil, Suh
2、Kang, Hyun Kim, Gina Lee, Jaeseop Lee, Michael de Lorimier, Jin Moon, Randy Shoopman, Brandon Thompson OutlineMotivation, History, Related workTuning Sparse Matrix OperationsResults on Sun Ultra 1/170Preliminary results on P4Preliminary results on ItaniumApplications SUGAR a MEMS CAD systemInformati
3、on RetrievalFuture Work(Whats up with Millennium)(Background on Test Matrices)MotivationHistoryRelated Work Performance Tuning Motivation: performance of many applications dominated by a few kernelsMEMS CAD Nonlinear ODEs Nonlinear equations Linear equations Matrix multiplyMatrix-by-matrix or matrix
4、-by-vectorDense or SparseInformation retrieval by LSI Compress term-document matrix Sparse mat-vec multiplyInformation retrieval by LDA Maximum likelihood estimation Solve linear systemsMany other examples (not all linear algebra)Conventional Performance Tuning Vendor or user hand tunes kernels Draw
5、backs:Very time consuming and tedious workEven with intimate knowledge of architecture and compiler, performance hard to predictGrowing list of kernels to tuneExample: New BLAS StandardMust be redone for every architecture, compilerCompiler technology often lags architectureNot just a compiler probl
6、em: Best algorithm may depend on input, so some tuning must occur at run-time.Not all algorithms semantically or mathematically equivalentAutomatic Performance TuningApproach: for each kernelIdentify and generate a space of algorithmsSearch for the fastest one, by running themWhat is a space of algo
7、rithms?Depending on kernel and input, may varyinstruction mix and ordermemory access patternsdata structures mathematical formulation When do we search?Once per kernel and architecture At compile timeAt run timeAll of the aboveTuning pays off PHIPAC (Bilmes, Asanovic, Vuduc, Demmel)Tuning pays off A
8、TLAS (Dongarra, Whaley)Extends applicability of PHIPACIncorporated in Matlab (with rest of LAPACK)Other Automatic Tuning ProjectsFFTs and Signal ProcessingFFTW ()Given dimension n of FFT, choose best implementation at runtime by assembling prebuilt kernels for small factors of nWidely used, won 1999
9、 Wilkinson Prize for Numerical SoftwareSPIRAL (/spiral)Extensions to other transforms, DSPsUHFFT Extensions to higher dimension, parallelismSpecial session at ICCS 2001Organized by Yelick and Demmelwww.ucalgary.ca/iccsProceedings availablePointers to automatic tuning projects at /yelick/iccs-tuneSea
10、rch for optimal register tile sizes on Sun Ultra 1016 registers, but 2-by-3 tile size fastestSearch for Optimal L0 block size in dense matmul60% of peak on Pentium II-3004% of versions exceedHigh precision dense mat-vec multiply (XBLAS)High Precision Algorithms (XBLAS)Double-double (High precision w
11、ord represented as pair of doubles)Many variations on these algorithms; we currently use BaileysExploiting Extra-wide RegistersSuppose s(1) , , s(n) have f-bit fractions, SUM has Ff bit fractionConsider following algorithm for S = Si=1,n s(i)Sort so that |s(1)| |s(2)| |s(n)|SUM = 0, for i = 1 to n S
12、UM = SUM + s(i), end for, sum = SUMTheorem (D., Hida)If n 2F-f + 1, then sum agrees with S to about 1.5 ulpsIf n = 2F-f + 2, then error can be at least as large as 22f-F 1 ulpsIf n 2F-f + 3, then error can be arbitrary (S 0 but sum = 0 )Exampless(i) double (f=53), SUM double extended (F=64) accurate
13、 if n 211 + 1 = 2049Dot product of single precision x(i) and y(i) s(i) = x(i)*y(i) (f=2*24=48), SUM double extended (F=64) accurate if n 216 + 1 = 65537Tuning Sparse Matrix ComputationsTuning Sparse matrix-vector multiply SparsityOptimizes y = A*x for a particular sparse AIm and YelickAlgorithm spac
14、eDifferent code organization, instruction mixesDifferent register blockings (change data structure and fill of A)Different cache blockingDifferent number of columns of xDifferent matrix orderingsSoftware and papers available/yelick/sparsityHow Sparsity tunes y = A*xRegister Blocking Store matrix as
15、dense r x c blocksPrecompute performance in Mflops of dense A*x for various register block sizes r x cGiven A, sample it to estimate Fill if A blocked for varying r x cChoose r x c to minimize estimated running time Fill/MflopsStore explicit zeros in dense r x c blocks, unrollCache Blocking Useful w
16、hen source vector x enormousStore matrix as sparse 2k x 2l blocksSearch over 2k x 2l cache blocks to find fastest Register-Blocked Performance of SPMV on Dense Matrices (up to 12x12)333 MHz Sun Ultra IIi800 MHz Pentium III1.5 GHz Pentium 4800 MHz Itanium70 Mflops35 Mflops425 Mflops310 Mflops175 Mflo
17、ps105 Mflops250 Mflops110 MflopsWhich other sparse operations can we tune?General matrix-vector multiply A*xPossibly many vectors xSymmetric matrix-vector multiply A*xSolve a triangular system of equations T-1*xy = AT*A*xKernel of Information Retrieval via LSI (SVD)Same number of memory references a
18、s A*xy = Si (A(i,:)T * (A(i,:) * x)Future workA2*x, Ak*xKernel of Information Retrieval used by GoogleIncludes Jacobi, SOR, Changes calling algorithmAT*M*AMatrix triple productUsed in multigrid solverTest MatricesGeneral Sparse MatricesUp to n=76K, nnz = 3.21MFrom many application areas1 Dense2 to 1
19、7 - FEM18 to 39 - assorted41 to 44 linear programming45 - LSISymmetric MatricesSubset of General Matrices1 Dense2 to 8 - FEM9 to 13 - assortedLower Triangular MatricesObtained by running SuperLU on subset of General Sparse Matrices1 Dense2 13 FEMDetails on test matrices at end of talkResults on Sun
20、Ultra 1/170Speedups on SPMV from Sparsity on Sun Ultra 1/170 1 RHSSpeedups on SPMV from Sparsity on Sun Ultra 1/170 9 RHSPreliminary Results on P4 using icc and gccSpeedup of SPMV from Sparsity on P4/icc-5.0.1Performance of SPMV from Sparsity on P4/icc-5.0.1Fill for SPMV from Sparsity on P4/icc-5.0.
21、1Possible ImprovementsDoesnt work as well as on Sun Ultra 1/170; Why?Current heuristic to determine best r x c block biased to diagonal of performance plotDidnt matter on Sun, does on P4 and Itanium since performance so “nondiagonally dominant”Sparsity reg blocking results on P4 for FEM/fluids matri
22、cesMatrix #2 (150 Mflops to 400 Mflops)Matrix #5 (50 Mflops to 350 Mflops)Sparsity cache blocking results on P4 for LSISymmetric Sparse Matrix-Vector Multiply on P4 (vs nave full = 1)Sparse Triangular Solve (Matlabs colmmd ordering) on P4 AT*A on P4 (Accesses A only once)Preliminary Results on Itani
23、um using eccSpeedup of SPMV from Sparsity on Itanium/ecc-5.0.1Raw Performance of SPMV from Sparsity on Itanium Fill for SPMV from Sparsity on Itanium Possible ImprovementsCurrent heuristic to determine best r x c block biased to diagonal of performance plotDidnt matter on Sun, does on P4 and Itanium
24、 since performance so “nondiagonally dominant”Matrix 8: Chose 2x2 (164 Mflops) Better: 3x1 (196 Mflops)Matrix 9:Chose 2x2 (164 Mflops) Better: 3x1 (213 Mflops)Application of Performance TuningSUGAR A CAD Tool for MEMSApplications to SUGAR a tool for MEMS CADDemmel, Bai, Pister, Govindjee, Agogino, G
25、u, Input: description of MicroElectroMechanical System (as netlist)Output:DC, steady state, modal, transient analyses to assess behaviorCIF for fabrication Simulation capabilitiesBeams and plates (linear, nonlinear, prestressed,)Electrostatic forces, circuitsThermal expansion, Couette damping Availa
26、bilityMatlabPublicly /cfm249 registered users, many unregisteredWeb service M & MEMSRuns on Millennium Now in use in EE 245 at UCB96 usersLots of new features being added, including interface to measurementsMicromirror (Last, Pister)Laterally ac
27、tuated torsionally suspended micromirrorOver 10K dof, 100 line netlist (using subnets)DC and frequency analysisAll algorithms reduce to previous kernelsApplication of Performance TuningInformation RetrievalInformation RetrievalJordanCollaboration with Intel team building probabilistic graphical mode
28、lsBetter alternatives to LSI for document modeling and searchLatent Dirichlet Allocation (LDA)Model documents as union of themes, each with own word distributionMaximum likelihood fit to find themes in set of documents, classify themComputational bottleneck is solution of enormous linear systemsOne
29、of largest Millennium usersIdentifying influential documentsGiven hyperlink patterns of documents, which are most influential?Basis of Google (eigenvector of link matrix sparse matrix vector multiply)Applying Markov chain and perturbation theory to assess reliabilityKernel ICAEstimate set of sources
30、 s and mixing matrix A from samples x = A*sNew way to sample such that sources are as independent as possibleAgain reduces to linear algebra kernelsMore on Kernel ICAAlgorithm 1nonlinear eigenvalue problem, reduces to a sequence of manyvery large generalized spd eigenproblems A l B Block structured,
31、 A dense, B block diagonalOnly smallest nonzero eigenvalue neededSparse eigensolver (currently ARPACK/eigs)Use Incomplete Cholesky (IC) to get low rank approximzation to dense subblocks comprising A and BUse Complete (=Diagonal) Pivoting but take only 20 n stepsCost is O(n)Evaluating matrix entries
32、(exponentials) could be bottleneckNeed fast, low precision exponentialAlgorithm 2Like Algorithm 1, but find all eigenvalues/vectors of A l B Use Holy Grail Future WorkExploit Itanium Architecture128 (82-bit) floating point registers9 HW formats: 24/8(v), 24/15, 24/17, 53/11, 53/15, 53/17, 64/15, 64/
33、17Many few load/store instructionsfused multiply-add instructionpredicated instructionsrotating registers for software pipeliningprefetch instructionsthree levels of cacheTune current and wider set of kernelsImprove heuristics, eg choice of r x cIncorporate intoSUGARInformation RetrievalFurther auto
34、mate performance tuningGeneration of algorithm space generatorsPossible collaborations with IntelGetting right toolsGetting faster, less accurate transcendental functionsProvide feedback on toolsProvide tuned kernels, benchmarks, IR appsProvide system for tuning future kernelsTo provide usersTo eval
35、uate architectural designsMillenniumMillenniumCluster of clusters at UC Berkeley309 CPU cluster in Soda HallSmaller clusters across campusMade possible by Intel equipment grantSignificant other supportNSF, Sun, Microsoft, Nortel, campusMillennium TopologyMillennium Usage Oct 1 11, 2001100% utilizati
36、on for last few daysAbout half the jobs are parallelUsage highlightsAMANDAAntarctic Muon And Neutrino Detector A128 scientists from 15 universities and institutes in the U.S. and Europe.TEMPESTEUV lithography simulations via 3D electromagnetic scatteringcuervo./Volcano/study t
37、he defect printability on multilayer masks TitaniumHigh performance Java dialect for scientific computing/projects/titaniumImplementation of shared address space, and use of SSE2Digital Library ProjectLarge database of imageselib./Used to run spectral image segmentation algorithm for clustering, sea
38、rch on imagesUsage highlights (continued)CS 267Graduate class in parallel computing, 33 enrolled/dbindel/cs267taHomeworkDisaster ResponseHelp find people after Sept 11, set up immediately 48K reports in database, linked to other survivor databasesMEMS CAD (MicroElectroM
39、echanical Systems Computer Aided Design)Tool to help design MEMS systemsUsed this semester in EE 245, 93 More later in talkInformation RetrievalDevelopment of faster information retrieval algorithms/jordanMore later in talkMany applications are part of CITRISBackground on
40、 Test MatricesSparse Matrix Benchmark Suite (1/3)#Matrix NameProblem DomainDimensionNo. Non-zeros1denseDense matrix1,0001.00 M2raefsky3Fluid structure interaction21,2001.49 M3inaccuraAccuracy problem16,1461.02 M4bcsstk35*Stiff matrix automobile frame30,2371.45 M5venkat01Flow simulation62,4241.72 M6c
41、rystk02*FEM crystal free-vibration13,965969 k7crystk03*FEM crystal free-vibration24,6961.75 M8nasasrb*Shuttle rocket booster54,8702.68 M93dtube*3-D pressure tube45,3303.21 M10ct20stif*CT20 engine block52,3292.70 M11baiAirfoil eigenvalue calculation23,560484 k12raefsky4Buckling problem19,7791.33 M13e
42、x113-D steady flow problem16,2141.10 M14rdist1Chemical process simulation4,13494.4 k15vavasis32-D PDE problem41,0921.68 MNote: * indicates a symmetric matrix.Sparse Matrix Benchmark Suite (2/3)#Matrix NameProblem DomainDimensionNo. Non-zeros16orani678Economic modeling2,52990.2 k17rimFEM fluid mechan
43、ics problem22,5601.01 M18memplusCircuit simulation17,758126 k19gemat11Power flow4,92933.1 k20lhr10Chemical process simulation10,672233 k21goodwin*Fluid mechanics problem7,320325 k22bayer02Chemical process simulation13,93563.7 k23bayer10Chemical process simulation13,43694.9 k24coater2Simulation of co
44、ating flows9,540207 k25finan512*Financial portfolio optimization74,752597 k26onetone2Harmonic balance method36,057228 k27pwt*Structural engineering36,519326 k28vibrobox*Vibroacoustics12,328343 k29wang4Semiconductor device simulation26,068177 k30lnsp3937Fluid flow modeling3,93725.4 kSparse Matrix Ben
45、chmark Suite (3/3)#Matrix NameProblem DomainDimensionsNo. Non-zeros31lns3937Fluid flow modeling3,93725.4 k32sherman5Oil reservoir modeling3,31220.8 k33sherman3Oil reservoir modeling5,00520.0 k34orsreg1Oil reservoir modeling2,20514.1 k35saylr4Oil reservoir modeling3,56422.3 k36shyy161Viscous flow cal
46、culation76,480330 k37wang3Semiconductor device simulation26,064177 k38mcfeAstrophysics76524.4 k39jpwh991Circuit physics problem9916,02740gupta1*Linear programming31,8022.16 M41lpcrebLinear programming9,648 x 77,137261 k42lpcredLinear programming8,926 x 73,948247 k43lpfit2pLinear programming3,000 x 1
47、3,52550.3 k44lpnug20Linear programming15,240 x 72,600305 k45lsiLatent semantic indexing10 k x 255 k3.7 MMatrix #2 raefsky3 (FEM/Fluids)Matrix #2 (contd) raefsky3 (FEM/Fluids)Matrix #22 - bayer02 (chemical process simulation)Matrix #22 (contd)- bayer02 (chemical process simulation)Matrix #27 - pwt (structural engineering)Matrix #27 (contd)- pwt (structura
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2024年度文化传媒内容制作合同
- 2024年大型活动保障车辆租赁合同
- 2024年上海房屋装修工程分包合同
- 2024年廉洁承诺函:双方诚信自律协议
- 教育工作者主要先进事迹(5篇)
- 中学生读书演讲稿
- 2024年度质量控制合同:MLB棒球帽正品知识分享
- 2024年工程监测与检测合同
- 2024室内外演唱会舞台安全检测合同
- 2024年国际商贸合同的科学与艺术
- 《初中英语写作》课件
- DB37-T 5202-2021 建筑与市政工程基坑支护绿色技术标准
- 牙科手机的清洗消毒、灭菌及保养课件
- 人音版二年级下册音乐《小蜜蜂》课件
- 打印版医师执业注册健康体检表(新版)
- 湘教版八年级美术上册工作计划
- 高渗性非酮症糖尿病昏迷培训课件
- 国开成本会计第15章综合练习试题及答案
- 2022年陕西投资集团有限公司招聘笔试题库及答案解析
- 医院产后出血的应急演练脚本
- 基于实验验证并发现以太
评论
0/150
提交评论