2024 Microsoft AI Systems: Fundamentals of Deep Neural Network Computing Frameworks

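[Figure: a neural network classifies an image as Cat, Dog, or Raccoon. The loss (error) at the output is propagated backward to obtain the derivative of the error with respect to each weight, d error / d w1 through d error / d w5.]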

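Forward pass and hand-written backward pass in NumPy:

    import numpy as np

    N, D = 3, 4
    x = np.random.randn(N, D)
    y = np.random.randn(N, D)
    z = np.random.randn(N, D)

    # Forward pass
    a = x * y
    b = a + z
    c = np.sum(b)

    # Backward pass: apply the chain rule from c back to the inputs
    grad_c = 1.0
    grad_b = grad_c * np.ones((N, D))
    grad_a = grad_b.copy()
    grad_z = grad_b.copy()
    grad_x = grad_a * y
    grad_y = grad_a * x

[Figure: computation graph with inputs x, y, z and output c; the backward pass yields grad_x, grad_y, and grad_z.]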
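A hand-written backward pass like the one above is easy to get wrong, so it is commonly compared against finite differences. The following is a minimal sketch of such a check for the same computation, c = sum(x * y + z). The helper names (`forward`, `manual_grads`, `numeric_grad`) and the tolerance are assumptions made for illustration and do not come from the slides.

```python
import numpy as np

def forward(x, y, z):
    """c = sum(x * y + z), the same computation as the example above."""
    return np.sum(x * y + z)

def manual_grads(x, y, z):
    """Hand-derived gradients of c with respect to x, y and z."""
    grad_b = np.ones_like(x)      # dc/db: the sum passes 1 to every element
    grad_a = grad_b.copy()        # b = a + z, so the gradient is copied to a
    grad_z = grad_b.copy()        # ... and to z
    return grad_a * y, grad_a * x, grad_z   # a = x * y

def numeric_grad(f, v, eps=1e-6):
    """Central finite differences of the scalar f() with respect to array v."""
    g = np.zeros_like(v)
    for idx in np.ndindex(v.shape):
        old = v[idx]
        v[idx] = old + eps
        f_plus = f()
        v[idx] = old - eps
        f_minus = f()
        v[idx] = old              # restore the original entry
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
N, D = 3, 4
x, y, z = (rng.standard_normal((N, D)) for _ in range(3))

gx, gy, gz = manual_grads(x, y, z)
f = lambda: forward(x, y, z)
max_err = max(
    np.abs(gx - numeric_grad(f, x)).max(),
    np.abs(gy - numeric_grad(f, y)).max(),
    np.abs(gz - numeric_grad(f, z)).max(),
)
print(max_err < 1e-6)
```

The check perturbs one entry at a time, so it costs two forward passes per input element. That is acceptable for a test on small tensors but far too slow for training, which is one reason frameworks derive gradients automatically.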

Flexibility versus efficiency. Programming models range from Python-like code (most flexible), through layer-based frameworks, to a library (most efficient) in which a whole model is a single call:

    import xxlib
    x, y = load_data()
    y = xxlib.resnet152(x)

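Layer-based frameworks implement each layer as a class with forward and backward methods, specialized per device and registered under a name:

    class AttentionLayer<CPU> {
        void forward(inputs..) {}
        void backward(inputs, grad) {}
    };
    class AttentionLayer<GPU> { … };
    REGISTER_LAYER("Attention", AttentionLayer);

Optimizer update rules (see https://ruder.io/optimizing-gradient-descent/), where $\eta$ is the learning rate and $\gamma$ is the momentum coefficient:

  • SGD: $w \leftarrow w - \eta \nabla w$
  • SGD with momentum: $w \leftarrow w - (\gamma \nabla w_{t-1} + \eta \nabla w_t)$

Layers of a deep learning framework, from top to bottom:

  • Front-end programming languages and interfaces: Python, Lua, R, C++
  • Automatic differentiation (Auto Differentiation)
  • Unified model representation: the computation (data-flow) graph, for example y = x * w + b
  • Graph optimization and scheduled execution: Batching, Cache, Overlap
  • Kernel code optimization and compilation: GPU kernel, auto kernel generation
  • Compute hardware: CPU, GPU, RDMA devices

Typical operators (graph nodes): Add, Log, While, Sub, MatMul, Merge, Mul, Conv, BroadCast, Div, BatchNorm, Reduce, Relu, Loss, Map, Tanh, Transpose, Reshape, Exp, Concatenate, Select, Floor, Sigmoid, …

NumPy implementation: the forward and backward code shown earlier for c = np.sum(x * y + z), now with np.random.seed(0) called first so that the run is reproducible.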
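The SGD and momentum update rules listed above can be sketched in a few lines of NumPy. This sketch uses the velocity form given in the linked article (v ← γv + η∇, then w ← w − v). The toy quadratic loss, the function names, and the hyperparameter values are assumptions chosen only to make the example runnable.

```python
import numpy as np

def sgd_step(w, grad, lr):
    """Plain SGD: w <- w - lr * grad."""
    return w - lr * grad

def momentum_step(w, v, grad, lr, gamma):
    """SGD with momentum: v <- gamma * v + lr * grad, then w <- w - v."""
    v = gamma * v + lr * grad
    return w - v, v

def loss_grad(w):
    """Gradient of the toy loss 0.5 * ||w||^2 (hypothetical, for illustration)."""
    return w

w_sgd = np.array([1.0, -2.0])
w_mom = w_sgd.copy()
v = np.zeros_like(w_mom)          # the velocity starts at zero
for _ in range(100):
    w_sgd = sgd_step(w_sgd, loss_grad(w_sgd), lr=0.1)
    w_mom, v = momentum_step(w_mom, v, loss_grad(w_mom), lr=0.1, gamma=0.9)

# Both runs approach the minimum at the origin.
print(np.linalg.norm(w_sgd), np.linalg.norm(w_mom))
```

Momentum needs one extra buffer per parameter (the velocity `v`), which is why framework optimizers keep per-parameter state rather than being pure functions of the current gradient.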

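The hand-written backward pass of the NumPy example can instead be produced mechanically by recording the computation graph during the forward pass and walking it in reverse. The sketch below shows one way to do that for the three operators the example uses (multiply, add, sum). The `Tensor` class is hypothetical, assumes operands of equal shape (no broadcasting), and is not the API of any particular framework.

```python
import numpy as np

class Tensor:
    """A value plus the bookkeeping needed for reverse-mode differentiation."""

    def __init__(self, value, parents=()):
        self.value = np.asarray(value, dtype=float)
        self.grad = np.zeros_like(self.value)
        # parents: (parent_tensor, function mapping this node's grad to the parent's)
        self.parents = parents

    def __mul__(self, other):
        return Tensor(self.value * other.value,
                      [(self, lambda g: g * other.value),
                       (other, lambda g: g * self.value)])

    def __add__(self, other):
        return Tensor(self.value + other.value,
                      [(self, lambda g: g), (other, lambda g: g)])

    def sum(self):
        return Tensor(self.value.sum(),
                      [(self, lambda g: g * np.ones_like(self.value))])

    def backward(self):
        # Topologically order the graph so that a node's gradient is complete
        # before it is propagated to its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

        visit(self)
        self.grad = np.ones_like(self.value)   # d(output)/d(output) = 1
        for node in reversed(order):
            for parent, grad_fn in node.parents:
                parent.grad = parent.grad + grad_fn(node.grad)

rng = np.random.default_rng(0)
x, y, z = (Tensor(rng.standard_normal((3, 4))) for _ in range(3))
c = (x * y + z).sum()
c.backward()

# Matches the manual result: grad_x = y, grad_y = x, grad_z = 1.
print(np.allclose(x.grad, y.value), np.allclose(y.grad, x.value), np.allclose(z.grad, 1.0))
```

Each operator only has to supply its local derivative. The traversal composes them by the chain rule, which is what lets a framework support a long operator list without hand-deriving gradients for every model.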

[Figure: computation graph of the NumPy example. Forward nodes: x and y feed * to give a; a and z feed + to give b; b feeds Σ to give c. The mirrored backward graph propagates the upstream gradient g to produce ∇b, ∇a, ∇z, ∇x, and ∇y.]

Training needs the gradient of the loss with respect to the weights:

$L(w) = \mathrm{Loss}(f(w, x), y) \;\rightarrow\; \dfrac{\partial L(w)}{\partial w}$

Example function for automatic differentiation:

$L(x) = \exp(\exp(x) + \exp(x)^2) + \sin(\exp(x) + \exp(x)^2)$

Graph optimization: batch same-type operators to leverage the GPU's massive parallelism.

[Figure: data-flow graph of a GRU cell with inputs x_t and h_{t-1}, weight matrices W_z, W_o, W_f and R_z, R_f, matrix multiplications (M), additions, σ and tanh, and output h_t. In the batched version the per-gate multiplications are merged into one multiplication by W and one by R.]

Explicit graph partitioning across devices:

  • Partition graph: the graph H = σ(X · W1), Y = H · W2 is split between GPU0 and GPU1.
  • Dispatch partitions: each partition is sent to its device.
  • Tensor transmission mechanism: Send and Recv nodes are inserted where an edge crosses a partition boundary, for example between Server0 and Server1, so that H can be transferred.
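The partition, dispatch, and Send/Recv steps above can be simulated on one machine. In this sketch two Python threads stand in for the two devices and a `queue.Queue` stands in for the interconnect. The function names, tensor shapes, and the use of threads are assumptions for illustration; a real framework would use device runtimes and a network or bus transport.

```python
import queue
import threading
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def run_partition_0(x, w1, channel):
    """Partition on the first device: H = sigmoid(X @ W1), then Send(H)."""
    h = sigmoid(x @ w1)
    channel.put(h)            # Send node: hand the tensor to the other partition

def run_partition_1(w2, channel, out):
    """Partition on the second device: Recv(H), then Y = H @ W2."""
    h = channel.get()         # Recv node: blocks until the tensor arrives
    out["Y"] = h @ w2

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3))
w1 = rng.standard_normal((3, 4))
w2 = rng.standard_normal((4, 5))

channel = queue.Queue(maxsize=1)   # stands in for the device-to-device link
out = {}
workers = [
    threading.Thread(target=run_partition_0, args=(x, w1, channel)),
    threading.Thread(target=run_partition_1, args=(w2, channel, out)),
]
for t in workers:
    t.start()
for t in workers:
    t.join()

expected = sigmoid(x @ w1) @ w2    # the unpartitioned graph
print(np.allclose(out["Y"], expected))
```

Because Recv blocks until the data arrives, the two partitions need no other synchronization: the data dependency carried by the edge is enough to order them.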

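[Figure: the same computation graph (x, y, z, a, b, c with gradients ∇x, ∇y, ∇z, ∇a, ∇b), with operators annotated as CPU code or GPU code.]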
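The batching of same-type operators described earlier in this section can be illustrated with plain matrix products. In the sketch below, three matrix-vector products that share the same input are replaced by one larger product over stacked weights. The names W_z, W_f, and W_o follow the GRU figure labels; the shapes and random values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 4, 3
x_t = rng.standard_normal(inputs)

# Three same-type operators: one matrix-vector product per gate.
W_z, W_f, W_o = (rng.standard_normal((hidden, inputs)) for _ in range(3))
separate = [W_z @ x_t, W_f @ x_t, W_o @ x_t]

# Batched: stack the weights once, then issue a single larger product.
W_all = np.concatenate([W_z, W_f, W_o], axis=0)    # shape (3 * hidden, inputs)
batched = np.split(W_all @ x_t, 3)                  # back to one slice per gate

print(all(np.allclose(s, b) for s, b in zip(separate, batched)))
```

On a GPU the single large product typically keeps more of the device busy and avoids two kernel launches, which is the motivation the slide gives for this optimization.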

Framework styles, ordered from less to more flexible:

  • Layer-based (Caffe): programming with config files; large kernel granularity.
  • Static graph (CNTK, Caffe2): declarative programming; graph optimization.
  • Python-like (Python, SciPy): no programming restrictions; cannot leverage the GPU.
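Declarative programming on a static graph means the model is first described as a graph and only later executed. The sketch below separates those two phases for the graph y = x * w + b. The `Node`, `placeholder`, and `run` names are hypothetical and do not correspond to the CNTK or Caffe2 APIs.

```python
class Node:
    """One vertex of a static computation graph; nothing is computed at build time."""

    def __init__(self, op, inputs=(), name=None):
        self.op, self.inputs, self.name = op, tuple(inputs), name

    def __mul__(self, other):
        return Node("mul", (self, other))

    def __add__(self, other):
        return Node("add", (self, other))

def placeholder(name):
    """An input whose value is supplied only when the graph is run."""
    return Node("placeholder", name=name)

def run(output, feed):
    """Evaluate `output`, computing every node once (shared subgraphs are cached)."""
    cache = {}

    def evaluate(node):
        if id(node) not in cache:
            if node.op == "placeholder":
                cache[id(node)] = feed[node.name]
            else:
                lhs, rhs = (evaluate(i) for i in node.inputs)
                cache[id(node)] = lhs * rhs if node.op == "mul" else lhs + rhs
        return cache[id(node)]

    return evaluate(output)

# Phase 1: declare the graph y = x * w + b. No arithmetic happens here.
x, w, b = placeholder("x"), placeholder("w"), placeholder("b")
y = x * w + b

# Phase 2: run the same graph with different inputs.
print(run(y, {"x": 2.0, "w": 3.0, "b": 1.0}))   # 7.0
print(run(y, {"x": 0.5, "w": 4.0, "b": -1.0}))  # 1.0
```

Because the whole graph exists before anything runs, a framework has the opportunity to rewrite it (fuse, batch, or partition nodes) between the two phases.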

Dynamic graph example in PyTorch:

    import torch

    # The original slide wraps tensors in torch.autograd.Variable, which has been
    # deprecated since PyTorch 0.4; tensors now track gradients directly.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    N, D = 3, 4
    x = torch.randn(N, D, device=device, requires_grad=True)
    y = torch.randn(N, D, device=device, requires_grad=True)
    z = torch.randn(N, D, device=device, requires_grad=True)

    c = 0  # accumulator; must be initialized before the loop
    for i in range(10):
        a = x * y
        b = a + z
        c = c + torch.sum(b)
    c.backward()

Adding the dynamic graph to the comparison of framework styles, between static graph and Python-like:

  • Dynamic graph (DyNet): imperative programming (define-by-run); no graph optimization.

Compilers are used to make general frameworks more efficient while keeping their existing flexibility.

Evolution of the software stack:

  • Custom-purpose machine learning algorithms.
  • Deep learning frameworks (Theano, DistBelief, Caffe), which provide easier ways to leverage various libraries.
  • Machine learning languages and compilers, built on a powerful compiler infrastructure: code optimization, sparsity optimization, hardware targeting.
