Demystifying the Largest Chip in History: the Cerebras WSE, Generation 2
Cerebras Wafer Scale Engine (WSE): The Most Powerful Processor for AI
- 400,000 AI-optimized cores
- 46,225 mm2 of silicon
- 1.2 trillion transistors
- 18 gigabytes of on-chip memory
- 9 PByte/s memory bandwidth
- 100 Pbit/s fabric bandwidth
- TSMC 16 nm process

Cerebras CS-1: Cluster-Scale DL Performance in a Single System
- Powered by the WSE
- Programming accessibility of a single node, using TensorFlow, PyTorch, and other frameworks
- Deploys into standard datacenter infrastructure
- Multiple units delivered, installed, and in use today across multiple verticals
- Built from the ground up for AI acceleration

Architecture Designed for Deep Learning
Each component is optimized for AI compute:
- Compute: fully programmable cores with ML-optimized extensions; a dataflow architecture for sparse, dynamic workloads
- Memory: distributed, high-performance, on-chip memory
- Communication: a high-bandwidth, low-latency fabric; cluster-scale networking on chip, fully configurable to a user-specified topology
Together, these yield orders-of-magnitude gains in performance and efficiency: linear, cluster-scale performance on a single chip.

SW Co-designed for Scale from the Beginning
This time, we'll cover how the co-designed software leverages the power and flexibility of the WSE for deep learning acceleration, and what's next.

Programming the Wafer-Scale Engine
- Users program the WSE using standard ML frameworks, e.g. TensorFlow and PyTorch
- The Cerebras Graph Compiler (CGC) automatically compiles the DNN graph:
  - It extracts the graph from the framework, converts it to the Cerebras IR, and matches it to Cerebras kernels
  - Place & route then allocates compute and memory and configures the on-chip network
- This enables straightforward programming, flexible execution, and high performance
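As a concrete illustration of the first bullet, here is a minimal PyTorch model of the kind a user would write unchanged, with torch.fx standing in for the "extract the graph from the framework" step. The torch.fx trace, the model itself, and its layer sizes are illustrative assumptions; the deck does not describe how the Cerebras Graph Compiler actually obtains the graph, and the IR conversion, kernel matching, and place-and-route stages are not reproduced here.

    import torch
    import torch.nn as nn
    from torch.fx import symbolic_trace

    # A model written against the ordinary framework API; nothing Cerebras-specific appears here.
    class SmallNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(784, 256)
            self.fc2 = nn.Linear(256, 10)

        def forward(self, x):
            return self.fc2(torch.relu(self.fc1(x)))

    # torch.fx is only a stand-in for "extract the graph": it lists the framework-level ops
    # that a graph compiler would then match against its kernel library.
    graph = symbolic_trace(SmallNet()).graph
    for node in graph.nodes:
        print(node.op, node.target)

Running this prints the placeholder, linear, relu, and output nodes of the traced graph, i.e. the framework-level op list a compiler would start from.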

Matching to the Kernel Library
Graph matching from framework ops to kernels:
- Kernels are primitives to be sized and placed by the rest of the CGC
- They are expressed as nested for-loops for generality
Two kernel types:
1. Auto-generated: general, supporting a variety of operations; uses polyhedral techniques; unrolls loop dimensions across the fabric
2. Hand-optimized: high-performance common kernels with hand-tuned microcode and fabric programming
Example: a MATMUL op is matched to the MATMUL kernel, expressed as a loop nest:
    for i = 0..783:
        for j = 0..255:
            out[j] += lhs[i] * rhs[i][j]
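To make "unrolling loop dimensions across the fabric" concrete, here is a NumPy sketch that tiles the slide's MATMUL loop nest over a hypothetical 4 x 4 block of cores: each core multiplies its own tile of rhs by its slice of lhs, and the partial sums along i are then reduced. The grid size, tile shapes, and the final reduction step are assumptions for illustration, not the CGC's actual code generation.

    import numpy as np

    # Dimensions taken from the slide's loop nest: out[j] += lhs[i] * rhs[i][j].
    NI, NJ = 784, 256
    lhs = np.random.rand(NI).astype(np.float32)
    rhs = np.random.rand(NI, NJ).astype(np.float32)

    PI, PJ = 4, 4                        # hypothetical 4 x 4 block of cores
    ti, tj = NI // PI, NJ // PJ          # each core owns a 196 x 64 tile

    partials = np.zeros((PI, PJ, tj), dtype=np.float32)
    for p in range(PI):                  # the i loop dimension unrolled across fabric rows
        for q in range(PJ):              # the j loop dimension unrolled across fabric columns
            li = lhs[p * ti:(p + 1) * ti]
            rt = rhs[p * ti:(p + 1) * ti, q * tj:(q + 1) * tj]
            partials[p, q] = li @ rt     # each core runs the small loop nest on its tile

    # Partial sums along i are reduced over the fabric to form out[j].
    out = partials.sum(axis=0).reshape(NJ)
    assert np.allclose(out, lhs @ rhs, rtol=1e-4)

The assert simply checks the tiled result against the plain matrix product.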

Choosing the Optimal Mapping Strategy
- A mapping strategy is chosen for each kernel:
  - The model-parallel size and allocation of the kernel
  - The data-parallel replication factor
- The strategy determines:
  - The allocation of compute cores to the kernel
  - The amount of memory given to the kernel
  - The optimal communication pattern

Automatically Exploring the Optimization Search Space
[Figure: two alternative allocations of compute, memory, and fabric (0% to 100%) to each neural-network kernel, each with a different resulting network performance]
- The compiler explores many possible allocations of compute, memory, and fabric to each kernel
- The area-vs-performance trade-off is fine-tuned per kernel; for a kernel whose baseline is a 6x6 block (36 cores):
  - Option 3x3: 4X slower, 1/4 the area
  - Option 3x6: 2X slower, 1/2 the area
  - Option 6x12: 2X faster, 2X the area
  - Option 12x12: 4X faster, 4X the area
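A toy sketch of how such a search space could be explored, assuming a small set of per-kernel options shaped like the ladder above (each option trades cores for relative time) and a layer-pipelined execution whose throughput is set by the slowest stage. The option tables, the core budget, the kernel names, and the brute-force strategy are all illustrative assumptions; the deck only says the compiler explores the space automatically, and the real search also weighs memory and fabric allocation.

    from itertools import product

    # Hypothetical per-kernel options, (cores, relative_time), mirroring the slide's ladder:
    # a 6x6 block of 36 cores is the 1.0x baseline, 12x12 is 2x faster, 3x3 is 4x slower, etc.
    OPTIONS = {
        "conv1": [(9, 4.0), (18, 2.0), (36, 1.0), (72, 0.5), (144, 0.25)],
        "fc1":   [(9, 2.0), (36, 1.0), (144, 0.5)],
        "fc2":   [(9, 1.0), (36, 0.5)],
    }
    CORE_BUDGET = 200  # toy fabric size, nothing like the real 400,000 cores

    def best_allocation(options, budget):
        """Brute-force the search space: pick one option per kernel so that the total
        core count fits the budget and the slowest pipeline stage is as fast as possible."""
        kernels = list(options)
        best = None
        for choice in product(*(options[k] for k in kernels)):
            cores = sum(c for c, _ in choice)
            if cores > budget:
                continue
            bottleneck = max(t for _, t in choice)  # layer-pipelined: throughput set by slowest stage
            if best is None or bottleneck < best[0]:
                best = (bottleneck, dict(zip(kernels, choice)))
        return best

    print(best_allocation(OPTIONS, CORE_BUDGET))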

Mapping Compute Kernels on the CS-1
[Figure: compute kernels mapped onto the wafer, with continuously streaming input ("dog") and continuously streaming output ("cat")]

Co-Designed for Training Flexibility and Performance
- The compiler stack and the hardware architecture are co-designed
- The result is flexibility and performance:
  - Model-parallel and data-parallel execution
  - Sparsity harvesting and inducing
  - Dynamic neural network execution

1) Flexible Parallelism
- The optimization search enables a spectrum of parallel execution strategies on the WSE
- A single algorithm exploits both model and data parallelism during the optimization
- Execution strategies are tuned to different network characteristics
[Figure: the strategy space spanned by data parallelism (batch size) and model parallelism, comparing WSE layer-pipelined, WSE layer-sequential, and GPU execution]
- Data parallel: running multiple samples (Sample 1 ... Sample N) at the same time, e.g. on Device 1 and Device 2
- Model parallel: running multiple parts of the network (Model Part 1, Model Part 2) at the same time, e.g. on Device 1 and Device 2
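The following NumPy sketch restates the two definitions above on a toy two-layer network with two simulated "devices": model parallelism places one part of the model per device and streams every sample through both, while data parallelism gives every device the full network and a different slice of the batch. The shapes and the two-device setup are assumptions for illustration; both paths compute the same result, only the way work is split differs.

    import numpy as np

    # Two toy layers of a network; the two "devices" below stand in for fabric sections.
    rng = np.random.default_rng(0)
    W1 = rng.standard_normal((64, 64)).astype(np.float32)
    W2 = rng.standard_normal((64, 64)).astype(np.float32)
    batch = rng.standard_normal((16, 64)).astype(np.float32)

    # Model parallel (layer-pipelined): device 1 holds Model Part 1, device 2 holds Model Part 2;
    # every sample flows through both devices in sequence.
    def model_parallel(x):
        h = np.maximum(x @ W1, 0.0)   # "device 1"
        return h @ W2                 # "device 2"

    # Data parallel: both devices hold the full network; the batch is split so each
    # device runs different samples at the same time.
    def data_parallel(x):
        halves = np.split(x, 2)
        outs = [np.maximum(h @ W1, 0.0) @ W2 for h in halves]  # device 1, device 2
        return np.concatenate(outs)

    assert np.allclose(model_parallel(batch), data_parallel(batch), rtol=1e-4)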

Using Model Parallelism
[Figure: 4-layer BERT performance at a fixed batch size of 16, plotting relative performance (x factor) against cores (0k to 400k)]
- Run all layers on sections of the fabric
- Layers in parallel = more performance
- The network executes as a pipeline, enabled by the high-bandwidth interconnect
- Small batch size:
  - No need to replicate the network
  - No weight-sync overhead; weights are updated in place
- Result: linear performance scaling at small batch size

Using Data Parallelism
[Figure: BERT attention kernel performance, plotting relative performance (x factor) against cores (0k to 400k) for 1, 2, 4, and 8 data-parallel replicas]
- Run replicas of a layer on sections of the fabric
- Replicas in parallel = more performance
- Applies to smaller layers and networks
- Not forced to a large batch size:
  - Small batch size per replica
  - Single-sample execution is enabled by memory performance at low batch size
  - Larger batches by running multiple samples
- Low weight-sync overhead, enabled by the low-latency, high-bandwidth interconnect
- Result: linear performance scaling at medium batch size
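As a sketch of the weight-sync step that the data-parallel bullets refer to, here is a generic gradient-averaging update across replicas for a toy linear layer. The loss, the replica count, and the single-reduction sync are illustrative assumptions; the deck only states that the interconnect keeps this overhead low, not how the synchronization is implemented.

    import numpy as np

    # Toy data-parallel weight sync: each replica computes a gradient for a simple
    # linear least-squares layer on its own micro-batch, then gradients are averaged.
    rng = np.random.default_rng(1)
    W = rng.standard_normal((8, 4)).astype(np.float32)   # weights shared by all replicas
    replicas = 4
    micro_batches = [(rng.standard_normal((2, 8)).astype(np.float32),
                      rng.standard_normal((2, 4)).astype(np.float32)) for _ in range(replicas)]

    def grad(W, x, y):
        # gradient of 0.5 * ||x @ W - y||^2 with respect to W
        return x.T @ (x @ W - y)

    # Each replica works on its own samples at the same time ...
    local_grads = [grad(W, x, y) for x, y in micro_batches]
    # ... and a single reduction over the interconnect averages the gradients (the weight sync).
    avg_grad = sum(local_grads) / replicas
    W -= 0.01 * avg_grad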

2) Translating Sparsity into Training Performance
- Neural networks contain a large number of zeros; non-linears create activation sparsity
- A multiply-accumulate by zero does not change the result
- Kernels are designed for sparsity:
  - Harvest the natural sparsity in the neural network
  - Induce sparsity where it does not occur naturally
[Figure: a dense network compared with the sparse network produced after a non-linear]

Core Designed for Sparsity
- Enabled by dataflow scheduling in hardware:
  - Data arriving from the fabric triggers instruction lookup
  - A state machine schedules the datapath cycles
- Intrinsic sparsity harvesting:
  - The sender filters out sparse zero data
  - The receiver skips unnecessary processing
- Fine-grained execution datapaths:
  - Small cores with independent instructions
  - High utilization on dynamic, non-uniform work
[Figure: core datapath, showing fabric input and output, tensor control, and an FMAC datapath with z and w registers and incoming data a, executing instructions such as fmac z = z, w, a]
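A small Python sketch of the sender-filters/receiver-skips idea, assuming activations are shipped as (index, value) pairs so that only non-zero values trigger an fmac (z += w * a) on the receiving side. The packet format and the single scalar accumulator are assumptions for illustration; the real core does this filtering and triggering in hardware.

    import numpy as np

    rng = np.random.default_rng(2)
    a = np.maximum(rng.standard_normal(1024).astype(np.float32), 0.0)  # ReLU output, roughly half zeros
    w = rng.standard_normal(1024).astype(np.float32)

    # Sender: filter out the zeros, transmit (index, value) pairs only.
    packets = [(i, v) for i, v in enumerate(a) if v != 0.0]

    # Receiver: each arriving value triggers one fmac; zeros never cost a cycle.
    z = np.float32(0.0)
    for i, v in packets:
        z += w[i] * v

    assert np.isclose(z, float(w @ a), rtol=1e-3)
    print(f"sent {len(packets)} of {a.size} values; skipped {a.size - len(packets)} zero fmacs")

The printed line makes the payoff visible: here roughly half of the fmacs never run because the zeros were never sent.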

Natural Sparsity in Transformer
- The Transformer uses ReLU and Dropout non-linears
- ReLU output is 90% naturally sparse
- Dropout output is 30% naturally sparse
- 1.2x performance gain vs. a dense non-linear and no dropout
[Figure: the feed-forward block (FC1 -> ReLU -> FC2 -> Dropout), with sparse activations in both the forward and backward pass]
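To connect the percentages above to something runnable, here is a PyTorch sketch of the FC1 -> ReLU -> FC2 -> Dropout block that simply measures the fraction of exact zeros after each stage. On random inputs ReLU lands near 50% rather than the 90% quoted for trained Transformer activations, and the 30% figure corresponds to a dropout rate of 0.3; the block sizes here are assumptions.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    fc1, fc2 = nn.Linear(256, 1024), nn.Linear(1024, 256)
    drop = nn.Dropout(p=0.3)          # modules default to training mode, so dropout is active

    def sparsity(t: torch.Tensor) -> float:
        # fraction of exact zeros in a tensor
        return (t == 0).float().mean().item()

    x = torch.randn(16, 256)
    h = torch.relu(fc1(x))            # zeros created by the non-linear
    y = drop(fc2(h))                  # zeros created by dropout

    print(f"ReLU output sparsity:    {sparsity(h):.0%}")
    print(f"Dropout output sparsity: {sparsity(y):.0%}")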

Inducing Sparsity
- Sparsity can be induced by:
  - Adding a sparse non-linear (e.g. ReLU)
  - Dropping relatively small values
- Inducing sparsity on a 34-layer dense FC model:
  - 1.7x performance gain with ReLU
  - 2.4x performance gain with ReLU + 50% sparsity
[Figure: 34-layer FC performance vs. induced sparsity, showing relative performance (x factor, 0 to 3) for Dense + Identity, Dense + ReLU, 25% sparse + ReLU, and 50% sparse + ReLU]
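A PyTorch sketch of the two inducing techniques listed above: apply a sparse non-linear and then drop the smallest-magnitude values until at least a target fraction is zero. The per-tensor threshold and the target parameter are assumptions; the deck does not specify how the "relatively small values" are chosen.

    import torch

    def induce_sparsity(x: torch.Tensor, target: float = 0.5) -> torch.Tensor:
        x = torch.relu(x)                                  # sparse non-linear
        k = int(target * x.numel())                        # number of values to drop
        if k > 0:
            thresh = x.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
            x = torch.where(x.abs() <= thresh, torch.zeros_like(x), x)
        return x

    x = torch.randn(16, 1024)
    y = induce_sparsity(x, target=0.5)
    print(f"induced sparsity: {(y == 0).float().mean().item():.0%}")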

Inducing Sparsity in BERT
[Figure: BERT performance vs. induced sparsity, showing relative performance (x factor) for the CS-1 compared with a GPU at 20% to 75% induced sparsity]
- BERT has no natural sparsity
- But sparsity can be induced on most layers, in both the forward and backward pass
- Up to 1.5x performance gain at 50% sparsity with minimal accuracy loss
- The ML user has control

3) Designed to Unlock Smarter Techniques and Scale
- The WSE has a dataflow architecture:
  - Flexibility to stream token by token
  - Inherent sparsity harvesting
- The WSE is a MIMD architecture:
  - Each core can be programmed independently
  - Different operations can be performed on different data

Flexibility Enables Dynamic ML Methods
- Fine-grained dynamic execution enables new ML techniques:
  - Variable sequence length: stop at the end of the sequence, with no padding
  - Irregular/NAS models: high utilization
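A PyTorch sketch of the variable-sequence-length point: each sequence is processed only to its own length, so no padded token steps are executed. The GRU cell is just a stand-in for whatever per-token compute the real model performs, and the sequence lengths are arbitrary; the WSE's token-streaming execution is only being imitated at the Python level here.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    cell = nn.GRUCell(input_size=32, hidden_size=32)
    sequences = [torch.randn(n, 32) for n in (5, 17, 9)]   # three sequences of different lengths

    outputs = []
    for seq in sequences:
        h = torch.zeros(1, 32)
        for token in seq:                 # stop at the end of the sequence: no padded steps
            h = cell(token.unsqueeze(0), h)
        outputs.append(h)

    total_steps = sum(len(s) for s in sequences)
    padded_steps = len(sequences) * max(len(s) for s in sequences)
    print(f"token steps executed: {total_steps} vs. {padded_steps} with padding to the longest sequence")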
