Cerebras Wafer Scale Engine (WSE): The Most Powerful Processor for AI
- 400,000 AI-optimized cores
- 46,225 mm2 of silicon
- 1.2 trillion transistors
- 18 gigabytes of on-chip memory
- 9 PByte/s memory bandwidth
- 100 Pbit/s fabric bandwidth
- TSMC 16nm process

Cerebras CS-1: Cluster-Scale DL Performance in a Single System
- Powered by the WSE
- Programming accessibility of a single node, using TensorFlow, PyTorch, and other frameworks
- Deploys into standard datacenter infrastructure
- Multiple units delivered, installed, and in use today across multiple verticals
- Built from the ground up for AI acceleration

Architecture Designed for Deep Learning
Each component is optimized for AI compute:
- Compute: fully programmable cores with ML-optimized extensions; a dataflow architecture for sparse, dynamic workloads
- Memory: distributed, high-performance, on-chip memory
- Communication: a high-bandwidth, low-latency fabric with cluster-scale networking on chip, fully configurable to a user-specified topology
Together, these deliver orders-of-magnitude gains in performance and efficiency: linear cluster-scale performance on a single chip.

SW Co-designed for Scale from the Beginning
This time, we'll cover how the co-designed software leverages the power and flexibility of the WSE for deep learning acceleration, and what's next.
Programming the Wafer-Scale Engine
- Users program the WSE using standard ML frameworks, e.g. TensorFlow and PyTorch (a framework-level sketch follows below)
- The Cerebras Graph Compiler (CGC) automatically compiles the DNN graph: it extracts the graph from the framework, converts it to the Cerebras IR, and matches it to Cerebras kernels
- Place & route then allocates compute and memory and configures the on-chip network
- This enables straightforward programming, flexible execution, and high performance
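As a hedged illustration of the framework side of that flow, the sketch below is nothing more than an ordinary PyTorch model definition of the kind the slides say the Cerebras Graph Compiler consumes. The layer sizes are placeholders chosen to match the MATMUL example further below, and no Cerebras-specific API is shown.

    # Standard PyTorch model definition: the framework-level input that the
    # Cerebras Graph Compiler extracts and lowers to kernels. The 784x256
    # layer mirrors the MATMUL loop bounds used as an example further below;
    # the model itself is a placeholder, not a Cerebras sample.
    import torch
    import torch.nn as nn

    class SmallFFN(nn.Module):
        def __init__(self, d_in=784, d_hidden=256, d_out=10):
            super().__init__()
            self.fc1 = nn.Linear(d_in, d_hidden)
            self.fc2 = nn.Linear(d_hidden, d_out)

        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    model = SmallFFN()
    x = torch.randn(16, 784)   # a batch of 16 samples
    y = model(x)               # ordinary eager execution here; the CGC instead
    print(y.shape)             # captures this graph and maps it onto the wafer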
Matching to Kernel Library
- Graph matching lowers framework ops to kernels
- Primitives are then sized and placed by the rest of the CGC
- Kernels are expressed as nested for-loops for generality
Two kernel types:
1. Auto-generated: general, supporting various operations; polyhedral techniques unroll loop dimensions across the fabric (an illustrative sketch follows below)
2. Hand-optimized: high-performance common kernels with hand-tuned microcode and fabric programming
For example, a MATMUL op lowers to the MATMUL kernel as a loop nest:

    for i = 0..783:
        for j = 0..255:
            out0[j] += lhs[i] * rhs[i][j]
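To make "unrolling loop dimensions across the fabric" concrete, here is a purely illustrative NumPy sketch: the loop nest above, and the same computation with its j dimension split across a hypothetical row of cores, each owning a column slice of rhs. The real CGC applies polyhedral transformations and emits fabric microcode; this only shows the shape of the transformation.

    # Reference loop nest from the MATMUL example (a 1x784 by 784x256 matmul).
    import numpy as np

    lhs = np.random.rand(784)
    rhs = np.random.rand(784, 256)

    def matmul_reference(lhs, rhs):
        out0 = np.zeros(rhs.shape[1])
        for i in range(rhs.shape[0]):
            for j in range(rhs.shape[1]):
                out0[j] += lhs[i] * rhs[i, j]
        return out0

    def matmul_unrolled_over_cores(lhs, rhs, n_cores=8):
        # "Unroll" the j dimension across a row of cores: each core owns a
        # contiguous column slice of rhs and produces its slice of out0.
        # On the WSE the slices would sit in each core's local memory and run
        # concurrently; here they run sequentially, for illustration only.
        col_slices = np.array_split(np.arange(rhs.shape[1]), n_cores)
        partials = [matmul_reference(lhs, rhs[:, cols]) for cols in col_slices]
        return np.concatenate(partials)

    assert np.allclose(matmul_reference(lhs, rhs),
                       matmul_unrolled_over_cores(lhs, rhs))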
Choosing the Optimal Mapping Strategy
- A mapping strategy is chosen for each kernel: the model-parallel size and allocation of the kernel, and its data-parallel replication factor
- The strategy determines the allocation of compute cores to the kernel, the amount of memory given to the kernel, and the optimal communication pattern

Automatically Exploring the Optimization Search Space
[Figure: two alternative allocations of compute, memory, and fabric (0% to 100%) across the neural network's kernels, each yielding a different network performance.]
- The compiler explores grid-shape options per kernel; relative to a 6x6 option (36 cores), 3x3 is 4X slower at 1/4 the area, 3x6 is 2X slower at 1/2 the area, 6x12 is 2X faster at 2X the area, and 12x12 is 4X faster at 4X the area
- This gives a fine-tuned area vs. performance trade-off per kernel (a toy search sketch follows below)

Mapping Compute Kernels on the CS-1
[Figure: compute kernels mapped across the wafer, with input samples ("dog") continuously streaming in and outputs ("cat") continuously streaming out.]
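The per-kernel search can be pictured with a toy cost model. The sketch below enumerates the grid options named above, assumes relative performance scales linearly with core count (as the 2X/4X labels suggest), and picks the fastest option that fits an area budget. Both the budget and the linear-scaling assumption are illustrative; this is not the actual CGC cost model.

    # Toy area-vs-performance trade-off for a single kernel, using the grid
    # options from the slide. Perf is assumed proportional to core count;
    # the 72-core area budget is made up for the example.
    OPTIONS = [(3, 3), (3, 6), (6, 6), (6, 12), (12, 12)]
    BASELINE_CORES = 6 * 6   # the 36-core reference option

    def relative_perf(shape):
        rows, cols = shape
        return (rows * cols) / BASELINE_CORES   # 0.25x, 0.5x, 1x, 2x, 4x

    def pick_shape(area_budget_cores):
        feasible = [s for s in OPTIONS if s[0] * s[1] <= area_budget_cores]
        return max(feasible, key=relative_perf)

    print(pick_shape(72))   # (6, 12): 2X faster than 6x6 and still within budget
    print(pick_shape(36))   # (6, 6)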
Co-Designed for Training Flexibility and Performance
- The compiler stack and the hardware architecture are co-designed
- The result is flexibility and performance: model-parallel and data-parallel execution, sparsity harvesting and inducing, and dynamic neural network execution

1) Flexible Parallelism
- The optimization search enables a spectrum of parallel execution strategies on the WSE
- A single algorithm uses both model and data parallelism in the optimization
- Execution strategies are optimized for different network characteristics
[Figure: strategy space plotted against data parallelism (batch size) and model parallelism, locating WSE layer-pipelined execution, WSE layer-sequential execution, and GPU data parallelism.]
- Data parallel: run multiple samples (sample 1 through sample N) at the same time on different devices
- Model parallel: run multiple parts of the network (model part 1, model part 2) at the same time on different devices
Using Model Parallelism
[Figure: 4-layer BERT performance at fixed batch size 16; relative performance (x factor, 1.0 to 4.0) vs. cores (0k to 400k).]
- Run all layers on their own fabric sections; layers running in parallel means more performance
- Execute the network as a pipeline, enabled by the high-bandwidth interconnect (a schedule sketch follows below)
- Small batch size: no need to replicate the network, no weight-sync overhead, and weights are updated in place
- Result: linear performance scaling with a small batch size
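One hedged way to picture layer-pipelined execution is as a schedule: each layer is pinned to its own fabric section and samples stream through, so after the pipeline fills every section is busy on a different sample and weights never move. The sketch below only simulates that schedule; the layer and sample counts are arbitrary.

    # Simulated layer-pipelined schedule: layer k sits on fabric section k and
    # at step t processes sample (t - k). Once the pipeline is full, every
    # section is busy each step, with no weight replication or weight sync.
    N_LAYERS = 4    # one fabric section per layer
    N_SAMPLES = 6   # continuously streaming input

    for step in range(N_LAYERS + N_SAMPLES - 1):
        busy = []
        for layer in range(N_LAYERS):
            sample = step - layer
            if 0 <= sample < N_SAMPLES:
                busy.append(f"L{layer}<-s{sample}")
        print(f"step {step}: " + "  ".join(busy))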
Using Data Parallelism
[Figure: BERT attention kernel performance; relative performance (x factor, 1 to 8) vs. cores (0k to 400k) for 1, 2, 4, and 8 data-parallel replicas.]
- Run layer replicas on separate fabric sections; replicas running in parallel means more performance
- Applies to smaller layers and networks
- Not forced to a large batch size: each replica runs a small batch, single-sample execution is enabled by the memory performance at low batch, and a larger batch is handled by running multiple samples
- Low weight-sync overhead, enabled by the low-latency, high-bandwidth interconnect (a replica sketch follows below)
- Result: linear performance scaling with a medium batch size
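Data parallelism in this style places replicas of a layer on different fabric sections and gives each a slice of the batch. The NumPy sketch below shows the batch split and an assumed gradient-averaging step standing in for the weight sync that the low-latency fabric makes cheap; it is an illustration, not Cerebras code.

    # Data-parallel replicas of one linear layer: each replica computes the
    # gradient on its batch shard, then the gradients are averaged ("weight
    # sync") and the identical update is applied everywhere. Sizes are
    # placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 32))      # weights, one copy per replica
    batch = rng.standard_normal((8, 64))   # medium batch, split across replicas
    target = rng.standard_normal((8, 32))

    def local_grad(W, x, y):
        # Gradient of 0.5 * ||x @ W - y||^2 with respect to W, on this shard.
        return x.T @ (x @ W - y)

    n_replicas = 4
    x_shards = np.array_split(batch, n_replicas)
    y_shards = np.array_split(target, n_replicas)
    grads = [local_grad(W, x, y) for x, y in zip(x_shards, y_shards)]
    avg_grad = sum(grads) / n_replicas     # cheap on a low-latency fabric
    W -= 0.01 * avg_grad                   # same update on every replica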
2) Translating Sparsity into Training Performance
- Neural networks contain a large number of zeros: non-linearities create activation sparsity, and a multiply-add by zero does not change the result
- Kernels are designed for sparsity: they harvest the natural sparsity in the network and induce sparsity where it does not occur naturally (a dense network becomes a sparse one after the non-linearity)

Core Designed for Sparsity
- Enabled by dataflow scheduling in hardware: arriving fabric data triggers an instruction lookup, and a state machine schedules the datapath cycles
- Intrinsic sparsity harvesting: the sender filters out zero data and the receiver skips the unnecessary processing (a functional sketch follows below)
- Fine-grained execution datapaths: small cores with independent instructions give high utilization on dynamic, non-uniform work
[Figure: core datapath with fabric input and output, tensor control, the instruction "fmac z = z, w, a", the FMAC datapath, and the z, w, and a data and control registers.]
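The sender-filters / receiver-skips idea can be sketched in a few lines of Python: the sender puts only nonzero (index, value) pairs on the fabric, and the receiver's dataflow loop issues an FMAC only when data arrives, so multiplies by zero simply never happen. This is a functional illustration, not the core's microcode.

    # Sender filters zero activations; the receiver fires one FMAC per
    # arriving nonzero (z += w[idx] * a). Zeros cost nothing.
    import numpy as np

    rng = np.random.default_rng(1)
    a = np.maximum(rng.standard_normal(1024), 0.0)   # ReLU output, ~50% zeros here
    w = rng.standard_normal(1024)

    def sender(activations):
        # Put only nonzero (index, value) pairs on the "fabric".
        for idx, val in enumerate(activations):
            if val != 0.0:
                yield idx, val

    def receiver(stream, weights):
        # Dataflow-style: an FMAC fires only when data arrives.
        z = 0.0
        for idx, val in stream:
            z += weights[idx] * val
        return z

    dense = float(w @ a)              # what a dense dot product computes
    sparse = receiver(sender(a), w)   # same result, with no work done on zeros
    assert np.isclose(dense, sparse)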
Natural Sparsity in Transformer
- The Transformer uses ReLU and Dropout non-linearities
- ReLU output is 90% naturally sparse; Dropout output is 30% naturally sparse
- Result: a 1.2x performance gain vs. a dense non-linearity with no dropout (a measurement sketch follows below)
[Figure: the feed-forward block (FC1 -> ReLU -> FC2 -> Dropout), with sparse activations flowing through both the forward and backward pass.]
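The 90% and 30% figures above are properties of a trained Transformer's activations; the PyTorch sketch below only shows how such activation sparsity can be measured. On symmetric random inputs ReLU lands near 50%, not 90%, so the numbers printed here will differ from the slide's.

    # Measure activation sparsity (fraction of exact zeros) after ReLU and
    # Dropout. Dropout with p=0.3 zeros a further 30% of elements.
    import torch

    x = torch.randn(1024, 4096)
    relu_out = torch.relu(x)
    drop_out = torch.nn.functional.dropout(relu_out, p=0.3, training=True)

    def sparsity(t):
        return (t == 0).float().mean().item()

    print(f"ReLU sparsity:         {sparsity(relu_out):.2f}")   # ~0.50 on random data
    print(f"ReLU+Dropout sparsity: {sparsity(drop_out):.2f}")   # ~0.65 = 1 - 0.5*0.7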
Inducing Sparsity
- Sparsity can be induced by adding a sparse non-linearity (e.g. ReLU) and by dropping relatively small values
- Inducing sparsity on a 34-layer dense FC model gives a 1.7x performance gain with ReLU and a 2.4x gain with ReLU plus 50% induced sparsity
[Figure: 34-layer FC performance vs. induced sparsity; relative performance (x factor, 0 to 3) on CS-1 for dense + identity, dense + ReLU, 25% sparse + ReLU, and 50% sparse + ReLU, against a GPU baseline.]

Inducing Sparsity in BERT
[Figure: BERT performance vs. induced sparsity; relative performance (x factor) at induced-sparsity levels from 0% to 75%.]
- BERT has no natural sparsity, but sparsity can be induced on most layers in both the forward and backward pass
- Up to a 1.5x performance gain with 50% induced sparsity and minimal accuracy loss
- The ML user controls this trade-off (an illustrative thresholding sketch follows below)
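Inducing sparsity as described above, adding a sparse non-linearity and dropping relatively small values, can be sketched as a magnitude threshold on the activations. The 50% target mirrors the best-performing point above; the exact thresholding rule here is an assumption for illustration, not necessarily the one Cerebras uses.

    # Induce a target sparsity level by zeroing the smallest-magnitude
    # activations (ReLU has already zeroed the negatives).
    import torch

    def induce_sparsity(x, target=0.50):
        x = torch.relu(x)                              # sparse non-linearity
        k = int(target * x.numel())                    # number of values to zero
        thresh = x.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
        return torch.where(x.abs() <= thresh, torch.zeros_like(x), x)

    acts = torch.randn(256, 1024)
    sparse_acts = induce_sparsity(acts, target=0.50)
    print((sparse_acts == 0).float().mean())           # at least 0.50 by construction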
3) Designed to Unlock Smarter Techniques and Scale
- The WSE has a dataflow architecture: the flexibility to stream token by token, with inherent sparsity harvesting
- The WSE is a MIMD architecture: each core can be programmed independently to perform different operations on different data

Flexibility Enables Dynamic ML Methods
- Fine-grained dynamic execution enables new ML techniques
- Variable sequence length: stop at the end of each sequence, with no padding (a work-count sketch follows below)
- Irregular and NAS-generated models
- High utilization ...
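Variable sequence length without padding is the kind of dynamic behavior this points to: execution can simply stop at each sequence's end. The sketch below only counts the work a padded batch does versus token-by-token streaming that stops at end-of-sequence; the token values are made up.

    # Work comparison: padded batch vs. streaming each sequence token by token
    # and stopping at its end of sequence, as dynamic dataflow execution allows.
    sequences = [
        [101, 7, 42, 102],                 # 4 tokens
        [101, 3, 3, 9, 12, 55, 8, 102],    # 8 tokens
        [101, 5, 102],                     # 3 tokens
    ]

    max_len = max(len(s) for s in sequences)
    padded_work = len(sequences) * max_len           # every sequence padded to 8
    streamed_work = sum(len(s) for s in sequences)   # stop at end of sequence

    print(f"padded tokens processed:   {padded_work}")    # 24
    print(f"streamed tokens processed: {streamed_work}")  # 15, no wasted padding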