Cerebras Wafer Scale Engine (WSE): The Most Powerful Processor for AI
- 400,000 AI-optimized cores
- 46,225 mm² of silicon
- 1.2 trillion transistors
- 18 gigabytes of on-chip memory
- 9 PByte/s memory bandwidth
- 100 Pbit/s fabric bandwidth
- TSMC 16 nm process

Cerebras CS-1: Cluster-Scale DL Performance in a Single System
- Powered by the WSE
- Programming accessibility of a single node, using TensorFlow, PyTorch, and other frameworks
- Deploys into standard datacenter infrastructure
- Multiple units delivered, installed, and in use today across multiple verticals
- Built from the ground up for AI acceleration

Architecture Designed for Deep Learning
Each component is optimized for AI compute:
- Compute: fully programmable cores with ML-optimized extensions; a dataflow architecture for sparse, dynamic workloads
- Memory: distributed, high-performance, on-chip memory
- Communication: high-bandwidth, low-latency fabric; cluster-scale networking on chip, fully configurable to a user-specified topology
Together, these give orders-of-magnitude performance and efficiency gains: linear, cluster-scale performance on a single chip.

SW Co-designed for Scale from the Beginning
This time, we'll cover how the co-designed software leverages the power and flexibility of the WSE for deep learning acceleration, and what's next.
Programming the Wafer-Scale Engine
- Users program the WSE using standard ML frameworks, e.g. TensorFlow and PyTorch (see the sketch below)
- The Cerebras Graph Compiler (CGC) automatically compiles the DNN graph: it extracts the graph from the framework, converts it to the Cerebras IR, and matches it to Cerebras kernels
- Place & route allocates compute and memory and configures the on-chip network
- Enables straightforward programming, flexible execution, and high performance
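As a concrete picture of the kind of input this flow consumes, the sketch below defines a small feed-forward network in PyTorch. The model, layer sizes, and names are hypothetical examples of mine, not taken from the slides, and the Cerebras-specific compile/run API is deliberately not shown.

```python
# A minimal, hypothetical PyTorch model of the kind a user would hand to the
# compiler stack. Layer sizes and names are illustrative only.
import torch
import torch.nn as nn

class SmallFFN(nn.Module):
    def __init__(self, d_in=784, d_hidden=256, d_out=10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)   # would map to a MATMUL-style kernel
        self.relu = nn.ReLU()                  # sparsity-producing non-linear
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

model = SmallFFN()
x = torch.randn(16, 784)      # a batch of 16 samples
print(model(x).shape)         # torch.Size([16, 10])
```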
Matching to Kernel Library
Graph matching maps framework ops to kernels:
- Primitives to be sized and placed by the rest of the CGC
- Expressed as nested for-loops for generality
Two kernel types:
1. Auto-generated: general, supports a wide range of operations; uses polyhedral techniques; unrolls loop dimensions across the fabric
2. Hand-optimized: high-performance common kernels; hand-tuned microcode and fabric programming

Example: a MATMUL op matched to a MATMUL kernel expressed as nested loops:
    for i = 0..783:
        for j = 0..255:
            out[0][j] += lhs[i] * rhs[i][j]
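To make the nested-loop kernel form above concrete, here is a small, self-contained NumPy sketch showing that the loop nest computes the same result as a library matmul. The shapes follow the loop bounds on the slide (784 and 256); the array names and random data are illustrative.

```python
# Sketch: the MATMUL kernel's nested-loop form, checked against a dense matmul.
# Shapes follow the loop bounds above: lhs is 784-long, rhs is 784x256.
import numpy as np

rng = np.random.default_rng(0)
lhs = rng.standard_normal(784)          # one input sample
rhs = rng.standard_normal((784, 256))   # weight matrix

out = np.zeros((1, 256))
for i in range(784):                    # for i = 0..783
    for j in range(256):                # for j = 0..255
        out[0][j] += lhs[i] * rhs[i][j]

assert np.allclose(out[0], lhs @ rhs)   # matches the dense library matmul
print(out.shape)                        # (1, 256)
```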
Choosing the Optimal Mapping Strategy
Choose a mapping strategy for each kernel:
- Model-parallel size and allocation of each kernel
- Data-parallel replication factor
The strategy determines:
- The allocation of compute cores to the kernel
- The amount of memory given to the kernel
- The optimal communication pattern

Automatically Exploring the Optimization Search Space
[Figure: two different allocations of compute, memory, and fabric across the neural network kernels, each shown as 0-100% per resource, with the resulting network performance for each allocation.]
Per-kernel options trade area against performance; relative to a 6x6 option (36 cores):
- 3x3: 4x slower, a quarter of the area
- 3x6: 2x slower, half the area
- 6x12: 2x faster, 2x the area
- 12x12: 4x faster, 4x the area
This gives a finely tuned area vs. performance trade-off per kernel (see the search sketch below).

Mapping Compute Kernels on the CS-1
[Figure: kernels placed across the wafer, with input continuously streaming in and output continuously streaming out (dog/cat classification example).]
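The per-kernel area vs. performance options above suggest a simple way to picture the compiler's search: enumerate candidate core-grid shapes for each kernel and pick the combination that fits the available cores while minimizing the bottleneck. The sketch below is a toy, brute-force illustration under assumptions of mine (kernel names, a fixed option table, time proportional to 1/cores); it is not the actual CGC algorithm.

```python
# Toy sketch of a mapping-strategy search: for each kernel, choose a core-grid
# option so the whole pipeline fits a core budget while minimizing the slowest
# (bottleneck) kernel. Kernels, options, and costs are made up for illustration.
from itertools import product

OPTIONS = {                      # grid shape -> number of cores
    "3x3": 9, "3x6": 18, "6x6": 36, "6x12": 72, "12x12": 144,
}
KERNELS = {                      # kernel -> relative work (bigger = slower)
    "conv1": 4.0, "conv2": 8.0, "fc": 2.0,
}
CORE_BUDGET = 200

def pipeline_time(assignment):
    # In a layer-pipelined execution the slowest kernel sets the throughput.
    return max(work / OPTIONS[opt]
               for (name, work), opt in zip(KERNELS.items(), assignment))

best = None
for assignment in product(OPTIONS, repeat=len(KERNELS)):
    cores = sum(OPTIONS[o] for o in assignment)
    if cores > CORE_BUDGET:
        continue
    t = pipeline_time(assignment)
    if best is None or t < best[0]:
        best = (t, assignment, cores)

t, assignment, cores = best
print(dict(zip(KERNELS, assignment)), "cores used:", cores, "bottleneck time:", t)
```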
Co-Designed for Training Flexibility and Performance
- Compiler stack and hardware architecture are co-designed
- Result: flexibility and performance
  1) Model-parallel and data-parallel execution
  2) Sparsity harvesting and inducing
  3) Dynamic neural network execution

1) Flexible Parallelism
- The optimization search enables a spectrum of parallel execution strategies on the WSE
- A single algorithm uses both model and data parallelism in the optimization
- Execution strategies are optimized for different network characteristics (the two strategies are contrasted in a sketch after this section)
[Figure: strategy space plotted as data parallelism (batch size) vs. model parallelism, with regions for WSE layer-pipelined, WSE layer-sequential, and GPU data-parallel execution.]
[Figure: data parallel = running multiple samples (sample 1 to sample N) at the same time on devices 1 and 2; model parallel = running multiple parts of the network (model part 1, model part 2) at the same time on devices 1 and 2.]

Using Model Parallelism
- Run all layers on sections of the fabric; layers in parallel = more performance
- Execute the network as a pipeline, enabled by the high-bandwidth interconnect
- Small batch size: no need to replicate the network, no weight-sync overhead, weights updated in place
- Result: linear performance scaling with a small batch size
[Chart: 4-Layer BERT performance at fixed batch size 16, relative perf (x factor, up to ~4x) vs. cores (0k to 400k).]

Using Data Parallelism
- Run layer replicas on sections of the fabric; replicas in parallel = more performance
- Applies to smaller layers and networks
- Not forced to a large batch size: small batch size per replica; single-sample execution enabled by memory performance at low batch; larger batches by running multiple samples
- Low weight-sync overhead, enabled by the low-latency, high-bandwidth interconnect
- Result: linear performance scaling with a medium batch size
[Chart: BERT attention kernel performance, relative perf (x factor, up to ~8x) vs. cores (0k to 400k), for 1, 2, 4, and 8 data-parallel replicas.]
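To contrast the two execution strategies in code, the sketch below maps a toy 4-layer network either layer-pipelined across "fabric sections" (model parallel) or replicated across sections (data parallel), and reports how many sections each mapping uses. It is a conceptual illustration with made-up sizes, not a model of the WSE scheduler.

```python
# Conceptual sketch: model-parallel (layer-pipelined) vs data-parallel mapping
# of a toy 4-layer MLP onto "fabric sections". Sizes are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) * 0.1 for _ in range(4)]
batch = rng.standard_normal((8, 64))

def forward(x, weights):
    for w in weights:
        x = np.maximum(x @ w, 0.0)     # Linear + ReLU
    return x

# Model parallel / layer-pipelined: one layer per section, samples flow through
# the pipeline one after another (small batch, weights stay in place).
def model_parallel(batch):
    sections = len(layers)             # 4 sections, one layer each
    outs = [forward(x[None, :], layers) for x in batch]
    return np.vstack(outs), sections

# Data parallel: the whole network is replicated; each replica takes a slice of
# the batch (weight updates would need syncing across replicas).
def data_parallel(batch, replicas=2):
    sections = replicas * len(layers)  # each replica holds all 4 layers
    shards = np.array_split(batch, replicas)
    outs = [forward(shard, layers) for shard in shards]
    return np.vstack(outs), sections

mp_out, mp_sections = model_parallel(batch)
dp_out, dp_sections = data_parallel(batch)
assert np.allclose(mp_out, dp_out)     # same math, different mapping
print("model-parallel sections:", mp_sections, "data-parallel sections:", dp_sections)
```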
2) Translating Sparsity into Training Performance
- Neural networks contain a large number of zeros; non-linears create activation sparsity
- A multiply-add by zero does not change the result
- Kernels are designed for sparsity: harvest natural sparsity in the network, and induce sparsity when it does not occur naturally
[Figure: a dense network vs. a sparse network after a non-linear.]

Core Designed for Sparsity
- Enabled by dataflow scheduling in hardware: arriving fabric data triggers instruction lookup; a state machine schedules datapath cycles
- Intrinsic sparsity harvesting: the sender filters out sparse zero data, and the receiver skips the unnecessary processing (a software analogue is sketched after this section)
- Fine-grained execution datapaths: small cores with independent instructions; high utilization for dynamic, non-uniform work
[Figure: core datapath, showing fabric input/output, instructions (e.g. fmac z = z, w, a), tensor control, the FMAC datapath, registers z and w, incoming data a, and control.]

Natural Sparsity in Transformer
- Transformer uses ReLU and Dropout non-linears
- ReLU is 90% naturally sparse; Dropout is 30% naturally sparse
- 1.2x perf gain vs. a dense non-linear and no dropout
[Figure: feed-forward block (FC1 -> ReLU -> FC2 -> Dropout), forward and backward passes, with sparse activations after ReLU and Dropout.]
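As a software analogue of the sparsity harvesting described above (a multiply-add by zero changes nothing, so zero operands can be filtered at the sender and skipped at the receiver), the sketch below compares a dense multiply-accumulate with one that only visits non-zero activations. It is a plain NumPy illustration of the arithmetic being skipped, not a model of the hardware dataflow.

```python
# Sketch: skipping zero activations in a multiply-accumulate. After a ReLU,
# many activation values are zero, and every (0 * w) term can be skipped
# without changing the result.
import numpy as np

rng = np.random.default_rng(0)
acts = np.maximum(rng.standard_normal(1024), 0.0)   # ReLU output, roughly half zeros
weights = rng.standard_normal((1024, 32))

dense = acts @ weights                               # visits all 1024 rows

nz = np.flatnonzero(acts)                            # "sender" filters out zeros
sparse = acts[nz] @ weights[nz]                      # "receiver" skips them

assert np.allclose(dense, sparse)
print(f"rows visited: {len(nz)} of {len(acts)} "
      f"({1 - len(nz) / len(acts):.0%} skipped)")
```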
Inducing Sparsity
Sparsity can be induced by (see the sketch below):
- Adding a sparse non-linear (e.g. ReLU)
- Dropping relatively small values
Inducing sparsity on a 34-layer dense FC model:
- 1.7x perf gain with ReLU
- 2.4x perf gain with ReLU + 50% sparsity
[Chart: 34-Layer FC performance vs. induced sparsity, relative perf (x factor, 0 to 3) for Dense + Identity, Dense + ReLU, 25% sparse + ReLU, and 50% sparse + ReLU.]

Inducing Sparsity in BERT
- BERT has no natural sparsity
- But sparsity can be induced on most layers in both the forward and backward pass
- Up to 1.5x perf gain with 50% induced sparsity and minimal accuracy loss
- The ML user has control
[Chart: BERT performance vs. induced sparsity, relative perf (x factor) at 25%, 50%, and 75% induced sparsity, GPU vs. CS-1.]
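The two inducement mechanisms named above, adding a sparse non-linear such as ReLU and dropping relatively small values, are easy to express directly on an activation tensor. The sketch below does both in NumPy; the tensor, the 50% magnitude threshold, and the combined variant are illustrative choices of mine, and any accuracy impact is for the ML user to judge, as the slides note.

```python
# Sketch: two ways to induce activation sparsity, shown separately and combined.
import numpy as np

rng = np.random.default_rng(0)
acts = rng.standard_normal((16, 256))     # a hypothetical activation tensor

# 1) Sparse non-linear: ReLU zeroes all negative values (~50% here).
relu = np.maximum(acts, 0.0)

# 2) Drop relatively small values: zero the 50% smallest-magnitude entries.
thr = np.quantile(np.abs(acts), 0.5)
dropped = np.where(np.abs(acts) > thr, acts, 0.0)

# Combined: after ReLU, also drop the smaller half of the surviving values.
nz = relu[relu > 0.0]
combined = np.where(relu > np.median(nz), relu, 0.0)

for name, t in [("dense", acts), ("ReLU", relu),
                ("50% magnitude drop", dropped), ("ReLU + drop", combined)]:
    print(f"{name:>18}: {np.mean(t == 0.0):.0%} zeros")
```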
3) Designed to Unlock Smarter Techniques and Scale
- The WSE has a dataflow architecture: the flexibility to stream token by token, with inherent sparsity harvesting
- The WSE is a MIMD architecture: each core can be programmed independently, performing different operations on different data

Flexibility Enables Dynamic ML Methods
- Fine-grained dynamic execution enables new ML techniques
- Variable sequence length: stop at the end of each sequence, no padding (see the sketch below)
- Irregular/NAS models
- High utilization
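As a rough illustration of the variable-sequence-length point above, the sketch below counts the work for a batch of sequences when they are padded to the longest length versus processed token by token to each sequence's true end. The sequence lengths and per-token cost are invented numbers; the point is only the padding overhead that token-by-token streaming avoids.

```python
# Sketch: work on a padded batch vs. token-by-token processing that stops at
# each sequence's true end. Lengths and per-token cost are illustrative.
seq_lens = [12, 37, 58, 101, 23, 64, 9, 80]   # true token counts per sequence
cost_per_token = 1.0                          # arbitrary unit of compute

padded_len = max(seq_lens)
padded_work = padded_len * len(seq_lens) * cost_per_token   # every slot computed
dynamic_work = sum(seq_lens) * cost_per_token               # stop at end of sequence

print(f"padded:  {padded_work:.0f} units")
print(f"dynamic: {dynamic_work:.0f} units "
      f"({1 - dynamic_work / padded_work:.0%} of the padded work avoided)")
```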