版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、High Performance Information Retrieval and MEMS CADon Intel ItaniumPerformance Tuning ParticipantsResearchersMark Adams (SNL), David Bailey (LBL), Parry Husbands (LBL), Xiaoye Li (LBL), Lenny Oliker (LBL)PhD StudentsYozo Hida, Geoff Pike, Rich VuducUndergradsBrian Gaeke , Jen Hsu, Shoaib Kamil, Suh
2、Kang, Hyun Kim, Gina Lee, Jaeseop Lee, Michael de Lorimier, Jin Moon, Randy Shoopman, Brandon Thompson OutlineMotivation, History, Related workTuning Sparse Matrix OperationsResults on Sun Ultra 1/170Preliminary results on P4Preliminary results on ItaniumApplications SUGAR a MEMS CAD systemInformati
3、on RetrievalFuture Work(Whats up with Millennium)(Background on Test Matrices)MotivationHistoryRelated Work Performance Tuning Motivation: performance of many applications dominated by a few kernelsMEMS CAD Nonlinear ODEs Nonlinear equations Linear equations Matrix multiplyMatrix-by-matrix or matrix
4、-by-vectorDense or SparseInformation retrieval by LSI Compress term-document matrix Sparse mat-vec multiplyInformation retrieval by LDA Maximum likelihood estimation Solve linear systemsMany other examples (not all linear algebra)Conventional Performance Tuning Vendor or user hand tunes kernels Draw
5、backs:Very time consuming and tedious workEven with intimate knowledge of architecture and compiler, performance hard to predictGrowing list of kernels to tuneExample: New BLAS StandardMust be redone for every architecture, compilerCompiler technology often lags architectureNot just a compiler probl
6、em: Best algorithm may depend on input, so some tuning must occur at run-time.Not all algorithms semantically or mathematically equivalentAutomatic Performance TuningApproach: for each kernelIdentify and generate a space of algorithmsSearch for the fastest one, by running themWhat is a space of algo
7、rithms?Depending on kernel and input, may varyinstruction mix and ordermemory access patternsdata structures mathematical formulation When do we search?Once per kernel and architecture At compile timeAt run timeAll of the aboveTuning pays off PHIPAC (Bilmes, Asanovic, Vuduc, Demmel)Tuning pays off A
8、TLAS (Dongarra, Whaley)Extends applicability of PHIPACIncorporated in Matlab (with rest of LAPACK)Other Automatic Tuning ProjectsFFTs and Signal ProcessingFFTW ()Given dimension n of FFT, choose best implementation at runtime by assembling prebuilt kernels for small factors of nWidely used, won 1999
9、 Wilkinson Prize for Numerical SoftwareSPIRAL (/spiral)Extensions to other transforms, DSPsUHFFT Extensions to higher dimension, parallelismSpecial session at ICCS 2001Organized by Yelick and Demmelwww.ucalgary.ca/iccsProceedings availablePointers to automatic tuning projects at /yelick/iccs-tuneSea
10、rch for optimal register tile sizes on Sun Ultra 1016 registers, but 2-by-3 tile size fastestSearch for Optimal L0 block size in dense matmul60% of peak on Pentium II-3004% of versions exceedHigh precision dense mat-vec multiply (XBLAS)High Precision Algorithms (XBLAS)Double-double (High precision w
11、ord represented as pair of doubles)Many variations on these algorithms; we currently use BaileysExploiting Extra-wide RegistersSuppose s(1) , , s(n) have f-bit fractions, SUM has Ff bit fractionConsider following algorithm for S = Si=1,n s(i)Sort so that |s(1)| |s(2)| |s(n)|SUM = 0, for i = 1 to n S
12、UM = SUM + s(i), end for, sum = SUMTheorem (D., Hida)If n 2F-f + 1, then sum agrees with S to about 1.5 ulpsIf n = 2F-f + 2, then error can be at least as large as 22f-F 1 ulpsIf n 2F-f + 3, then error can be arbitrary (S 0 but sum = 0 )Exampless(i) double (f=53), SUM double extended (F=64) accurate
13、 if n 211 + 1 = 2049Dot product of single precision x(i) and y(i) s(i) = x(i)*y(i) (f=2*24=48), SUM double extended (F=64) accurate if n 216 + 1 = 65537Tuning Sparse Matrix ComputationsTuning Sparse matrix-vector multiply SparsityOptimizes y = A*x for a particular sparse AIm and YelickAlgorithm spac
14、eDifferent code organization, instruction mixesDifferent register blockings (change data structure and fill of A)Different cache blockingDifferent number of columns of xDifferent matrix orderingsSoftware and papers available/yelick/sparsityHow Sparsity tunes y = A*xRegister Blocking Store matrix as
15、dense r x c blocksPrecompute performance in Mflops of dense A*x for various register block sizes r x cGiven A, sample it to estimate Fill if A blocked for varying r x cChoose r x c to minimize estimated running time Fill/MflopsStore explicit zeros in dense r x c blocks, unrollCache Blocking Useful w
16、hen source vector x enormousStore matrix as sparse 2k x 2l blocksSearch over 2k x 2l cache blocks to find fastest Register-Blocked Performance of SPMV on Dense Matrices (up to 12x12)333 MHz Sun Ultra IIi800 MHz Pentium III1.5 GHz Pentium 4800 MHz Itanium70 Mflops35 Mflops425 Mflops310 Mflops175 Mflo
17、ps105 Mflops250 Mflops110 MflopsWhich other sparse operations can we tune?General matrix-vector multiply A*xPossibly many vectors xSymmetric matrix-vector multiply A*xSolve a triangular system of equations T-1*xy = AT*A*xKernel of Information Retrieval via LSI (SVD)Same number of memory references a
18、s A*xy = Si (A(i,:)T * (A(i,:) * x)Future workA2*x, Ak*xKernel of Information Retrieval used by GoogleIncludes Jacobi, SOR, Changes calling algorithmAT*M*AMatrix triple productUsed in multigrid solverTest MatricesGeneral Sparse MatricesUp to n=76K, nnz = 3.21MFrom many application areas1 Dense2 to 1
19、7 - FEM18 to 39 - assorted41 to 44 linear programming45 - LSISymmetric MatricesSubset of General Matrices1 Dense2 to 8 - FEM9 to 13 - assortedLower Triangular MatricesObtained by running SuperLU on subset of General Sparse Matrices1 Dense2 13 FEMDetails on test matrices at end of talkResults on Sun
20、Ultra 1/170Speedups on SPMV from Sparsity on Sun Ultra 1/170 1 RHSSpeedups on SPMV from Sparsity on Sun Ultra 1/170 9 RHSPreliminary Results on P4 using icc and gccSpeedup of SPMV from Sparsity on P4/icc-5.0.1Performance of SPMV from Sparsity on P4/icc-5.0.1Fill for SPMV from Sparsity on P4/icc-5.0.
21、1Possible ImprovementsDoesnt work as well as on Sun Ultra 1/170; Why?Current heuristic to determine best r x c block biased to diagonal of performance plotDidnt matter on Sun, does on P4 and Itanium since performance so “nondiagonally dominant”Sparsity reg blocking results on P4 for FEM/fluids matri
22、cesMatrix #2 (150 Mflops to 400 Mflops)Matrix #5 (50 Mflops to 350 Mflops)Sparsity cache blocking results on P4 for LSISymmetric Sparse Matrix-Vector Multiply on P4 (vs nave full = 1)Sparse Triangular Solve (Matlabs colmmd ordering) on P4 AT*A on P4 (Accesses A only once)Preliminary Results on Itani
23、um using eccSpeedup of SPMV from Sparsity on Itanium/ecc-5.0.1Raw Performance of SPMV from Sparsity on Itanium Fill for SPMV from Sparsity on Itanium Possible ImprovementsCurrent heuristic to determine best r x c block biased to diagonal of performance plotDidnt matter on Sun, does on P4 and Itanium
24、 since performance so “nondiagonally dominant”Matrix 8: Chose 2x2 (164 Mflops) Better: 3x1 (196 Mflops)Matrix 9:Chose 2x2 (164 Mflops) Better: 3x1 (213 Mflops)Application of Performance TuningSUGAR A CAD Tool for MEMSApplications to SUGAR a tool for MEMS CADDemmel, Bai, Pister, Govindjee, Agogino, G
25、u, Input: description of MicroElectroMechanical System (as netlist)Output:DC, steady state, modal, transient analyses to assess behaviorCIF for fabrication Simulation capabilitiesBeams and plates (linear, nonlinear, prestressed,)Electrostatic forces, circuitsThermal expansion, Couette damping Availa
26、bilityMatlabPublicly /cfm249 registered users, many unregisteredWeb service M & MEMSRuns on Millennium Now in use in EE 245 at UCB96 usersLots of new features being added, including interface to measurementsMicromirror (Last, Pister)Laterally ac
27、tuated torsionally suspended micromirrorOver 10K dof, 100 line netlist (using subnets)DC and frequency analysisAll algorithms reduce to previous kernelsApplication of Performance TuningInformation RetrievalInformation RetrievalJordanCollaboration with Intel team building probabilistic graphical mode
28、lsBetter alternatives to LSI for document modeling and searchLatent Dirichlet Allocation (LDA)Model documents as union of themes, each with own word distributionMaximum likelihood fit to find themes in set of documents, classify themComputational bottleneck is solution of enormous linear systemsOne
29、of largest Millennium usersIdentifying influential documentsGiven hyperlink patterns of documents, which are most influential?Basis of Google (eigenvector of link matrix sparse matrix vector multiply)Applying Markov chain and perturbation theory to assess reliabilityKernel ICAEstimate set of sources
30、 s and mixing matrix A from samples x = A*sNew way to sample such that sources are as independent as possibleAgain reduces to linear algebra kernelsMore on Kernel ICAAlgorithm 1nonlinear eigenvalue problem, reduces to a sequence of manyvery large generalized spd eigenproblems A l B Block structured,
31、 A dense, B block diagonalOnly smallest nonzero eigenvalue neededSparse eigensolver (currently ARPACK/eigs)Use Incomplete Cholesky (IC) to get low rank approximzation to dense subblocks comprising A and BUse Complete (=Diagonal) Pivoting but take only 20 n stepsCost is O(n)Evaluating matrix entries
32、(exponentials) could be bottleneckNeed fast, low precision exponentialAlgorithm 2Like Algorithm 1, but find all eigenvalues/vectors of A l B Use Holy Grail Future WorkExploit Itanium Architecture128 (82-bit) floating point registers9 HW formats: 24/8(v), 24/15, 24/17, 53/11, 53/15, 53/17, 64/15, 64/
33、17Many few load/store instructionsfused multiply-add instructionpredicated instructionsrotating registers for software pipeliningprefetch instructionsthree levels of cacheTune current and wider set of kernelsImprove heuristics, eg choice of r x cIncorporate intoSUGARInformation RetrievalFurther auto
34、mate performance tuningGeneration of algorithm space generatorsPossible collaborations with IntelGetting right toolsGetting faster, less accurate transcendental functionsProvide feedback on toolsProvide tuned kernels, benchmarks, IR appsProvide system for tuning future kernelsTo provide usersTo eval
35、uate architectural designsMillenniumMillenniumCluster of clusters at UC Berkeley309 CPU cluster in Soda HallSmaller clusters across campusMade possible by Intel equipment grantSignificant other supportNSF, Sun, Microsoft, Nortel, campusMillennium TopologyMillennium Usage Oct 1 11, 2001100% utilizati
36、on for last few daysAbout half the jobs are parallelUsage highlightsAMANDAAntarctic Muon And Neutrino Detector A128 scientists from 15 universities and institutes in the U.S. and Europe.TEMPESTEUV lithography simulations via 3D electromagnetic scatteringcuervo./Volcano/study t
37、he defect printability on multilayer masks TitaniumHigh performance Java dialect for scientific computing/projects/titaniumImplementation of shared address space, and use of SSE2Digital Library ProjectLarge database of imageselib./Used to run spectral image segmentation algorithm for clustering, sea
38、rch on imagesUsage highlights (continued)CS 267Graduate class in parallel computing, 33 enrolled/dbindel/cs267taHomeworkDisaster ResponseHelp find people after Sept 11, set up immediately 48K reports in database, linked to other survivor databasesMEMS CAD (MicroElectroM
39、echanical Systems Computer Aided Design)Tool to help design MEMS systemsUsed this semester in EE 245, 93 More later in talkInformation RetrievalDevelopment of faster information retrieval algorithms/jordanMore later in talkMany applications are part of CITRISBackground on
40、 Test MatricesSparse Matrix Benchmark Suite (1/3)#Matrix NameProblem DomainDimensionNo. Non-zeros1denseDense matrix1,0001.00 M2raefsky3Fluid structure interaction21,2001.49 M3inaccuraAccuracy problem16,1461.02 M4bcsstk35*Stiff matrix automobile frame30,2371.45 M5venkat01Flow simulation62,4241.72 M6c
41、rystk02*FEM crystal free-vibration13,965969 k7crystk03*FEM crystal free-vibration24,6961.75 M8nasasrb*Shuttle rocket booster54,8702.68 M93dtube*3-D pressure tube45,3303.21 M10ct20stif*CT20 engine block52,3292.70 M11baiAirfoil eigenvalue calculation23,560484 k12raefsky4Buckling problem19,7791.33 M13e
42、x113-D steady flow problem16,2141.10 M14rdist1Chemical process simulation4,13494.4 k15vavasis32-D PDE problem41,0921.68 MNote: * indicates a symmetric matrix.Sparse Matrix Benchmark Suite (2/3)#Matrix NameProblem DomainDimensionNo. Non-zeros16orani678Economic modeling2,52990.2 k17rimFEM fluid mechan
43、ics problem22,5601.01 M18memplusCircuit simulation17,758126 k19gemat11Power flow4,92933.1 k20lhr10Chemical process simulation10,672233 k21goodwin*Fluid mechanics problem7,320325 k22bayer02Chemical process simulation13,93563.7 k23bayer10Chemical process simulation13,43694.9 k24coater2Simulation of co
44、ating flows9,540207 k25finan512*Financial portfolio optimization74,752597 k26onetone2Harmonic balance method36,057228 k27pwt*Structural engineering36,519326 k28vibrobox*Vibroacoustics12,328343 k29wang4Semiconductor device simulation26,068177 k30lnsp3937Fluid flow modeling3,93725.4 kSparse Matrix Ben
45、chmark Suite (3/3)#Matrix NameProblem DomainDimensionsNo. Non-zeros31lns3937Fluid flow modeling3,93725.4 k32sherman5Oil reservoir modeling3,31220.8 k33sherman3Oil reservoir modeling5,00520.0 k34orsreg1Oil reservoir modeling2,20514.1 k35saylr4Oil reservoir modeling3,56422.3 k36shyy161Viscous flow cal
46、culation76,480330 k37wang3Semiconductor device simulation26,064177 k38mcfeAstrophysics76524.4 k39jpwh991Circuit physics problem9916,02740gupta1*Linear programming31,8022.16 M41lpcrebLinear programming9,648 x 77,137261 k42lpcredLinear programming8,926 x 73,948247 k43lpfit2pLinear programming3,000 x 1
47、3,52550.3 k44lpnug20Linear programming15,240 x 72,600305 k45lsiLatent semantic indexing10 k x 255 k3.7 MMatrix #2 raefsky3 (FEM/Fluids)Matrix #2 (contd) raefsky3 (FEM/Fluids)Matrix #22 - bayer02 (chemical process simulation)Matrix #22 (contd)- bayer02 (chemical process simulation)Matrix #27 - pwt (structural engineering)Matrix #27 (contd)- pwt (structura
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2025年信阳申信发展投资集团有限公司招聘工作人员18名考前自测高频考点模拟试题附答案
- 2025年四平市教育局直属学校专项招聘高校毕业生笔试备考题库附答案
- 2025年湖南怀化会同县社区专职工作人员招聘10人备考题库附答案
- 2025年黑河漠河市漠河林场公开招聘森林管护员13人(公共基础知识)综合能力测试题附答案
- 2025广东江门开平农商银行校园招聘备考题库附答案
- 2025年甘肃酒泉敦煌市选调事业单位工作人员14人备考题库附答案
- 2025年洛阳职业技术学院招才引智招聘高层次人才12名(公共基础知识)测试题附答案
- 2025广东广州天河区城市管理第三保洁所招聘编外工作人员6人备考题库附答案
- 2025年滁州来安县城市基础设施开发有限公司选聘经理层管理人员1名笔试备考题库附答案
- 吉安武功山旅游发展集团有限公司2026年面向社会公开招聘30名安保人员笔试备考题库及答案解析
- 水利电工程施工地质规程
- JJF 2019-2022 液体恒温试验设备温度性能测试规范
- 耐高温铝电解电容器项目计划书
- DZ∕T 0153-2014 物化探工程测量规范(正式版)
- (高清版)TDT 1013-2013 土地整治项目验收规程
- 国家开放大学电大《计算机应用基础(本) 》 终结性考试试题答案(完整版)
- 《建筑基坑降水工程技术规程》DBT29-229-2014
- 防污闪涂料施工技术措施
- 2023年广东学业水平考试物理常考知识点
- 中外政治思想史-复习资料
- 中国近代史期末复习(上)(第16-20课)【知识建构+备课精研】 高一历史上学期期末 复习 (中外历史纲要上)
评论
0/150
提交评论