2024 Artificial Intelligence (AI) Technology Tutorial (Course Lecture Notes)

Course schedule (lecture name and notes):
1. Course introduction: overview and system/AI basics.
2. Overview of AI systems: a system perspective on System for AI; System for AI, a historic view; fundamentals of neural networks; fundamentals of System for AI.
3. Fundamentals of DNN computation frameworks: backprop and AD, tensors, DAGs, execution graphs. Papers and systems: PyTorch, TensorFlow.
4. Matrix computation and computer architecture: matrix computation, CPU/SIMD, GPGPU, ASIC/TPU. Papers and systems: BLAS, TPU.
5. Distributed training algorithms: data parallelism, model parallelism, distributed SGD.
6. Distributed training systems: MPI, parameter servers, all-reduce, RDMA. Papers and systems: Horovod.
7. Scheduling and resource management for heterogeneous clusters: running DNN jobs on a cluster; containers, resource allocation, scheduling. Papers and systems: KubeFlow, OpenPAI, Gandiva, HiveD.
8. Deep learning inference systems: efficiency, latency, throughput, and deployment.
9. Computation graph compilation and optimization: IR, sub-graph pattern matching, matrix multiplication and memory optimization. Papers and systems: XLA, MLIR, TVM, NNFusion.
10. Model compression and sparsification: efficiency via compression and sparsity; model compression, sparsity, pruning.
11. AutoML systems: hyperparameter tuning, NAS. Papers and systems: Hyperband, SMAC, ENAS, AutoKeras, NNI.
12. Reinforcement learning systems: theory of RL, systems for RL. Papers and systems: A3C, RLlib, AlphaZero.
13. Model security and privacy protection: federated learning, security, privacy. Papers and systems: DeepFake.
14. Optimizing computer systems with AI: AI for traditional systems problems and for system algorithms. Papers and systems: learned indexes, learned query paths.

Labs:
- Lab 1 (weeks 1-2), getting started with frameworks and tools: a simple end-to-end AI example from a system perspective; understand the system from debugger info and system logs.
- Lab 2 (week 3), customize a new tensor operator: design and implement a customized operator (both forward and backward) in Python; a sketch follows this list.
- Lab 3 (week 4), CUDA implementation and optimization: add a CUDA implementation for the customized operator.
- Lab 4 (weeks 5-6), AllReduce implementation and optimization: improve the implementation of one of the AllReduce operators in Horovod.
- Lab 5 (weeks 7-8), containers for training or inference in the cloud: configure containers for customized training and inference.
- Lab 6, learning to use a scheduling and resource management system: get familiar with OpenPAI or KubeFlow.
- Lab 7, distributed training exercise: try different kinds of all-reduce implementations.
- Lab 8, AutoML exercise: search for a new neural network (NN) structure for image/NLP tasks.
- Lab 9, reinforcement learning systems exercise: configure and get familiar with one of the following RL systems: RLlib, …
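For Lab 2, a minimal sketch of a customized operator with hand-written forward and backward passes, assuming PyTorch as the framework; the operator ("scaled linear"), its shapes, and the scaling factor are illustrative only and not the lab's prescribed operator.

    import torch

    # Hypothetical example: a custom operator y = alpha * (x @ w),
    # with the backward pass written by hand instead of relying on built-in autograd.
    class ScaledLinear(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, w, alpha):
            ctx.save_for_backward(x, w)   # stash inputs needed by backward
            ctx.alpha = alpha
            return alpha * (x @ w)

        @staticmethod
        def backward(ctx, grad_out):
            x, w = ctx.saved_tensors
            alpha = ctx.alpha
            grad_x = alpha * (grad_out @ w.t())   # dL/dx
            grad_w = alpha * (x.t() @ grad_out)   # dL/dw
            return grad_x, grad_w, None           # None: no gradient for alpha

    # Usage: gradients flow through the custom operator like any built-in op.
    x = torch.randn(4, 3, requires_grad=True)
    w = torch.randn(3, 2, requires_grad=True)
    y = ScaledLinear.apply(x, w, 2.0)
    y.sum().backward()
    print(x.grad.shape, w.grad.shape)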
Deep Learning

Deep learning is changing the world: self-driving, personal assistants, surveillance detection, translation, medical diagnostics, games, art. It spans image recognition, speech recognition, natural language, generative models, and reinforcement learning.
[Figure: an image classifier (cat / dog / raccoon / honey badger) trained by backpropagating the loss, computing d(error)/dw1 through d(error)/dw5 for weights w1 to w5.]

This progress is driven by compute power, massive labeled data (14M images), advances in deep learning algorithms, and languages and frameworks; advances in deep learning plus systems cover programming languages, optimization, computer architecture, parallel computing, and distributed systems.

E.g., the image classification problem keeps scaling up:
- MNIST: 60K samples, 10 categories
- ImageNet: 16M samples, 1000 categories
- Web images: billions of images, open categories

Test error rate (%) has fallen as models evolved:
- LeNet (convolution, max-pooling, softmax), 1998
- AlexNet (ReLU, dropout), 2012: 16.4%
- Inception (batch normalization), 2015: 6.7%
- ResNet (residual connections), 2015: 3.57%
- EfficientNet (NAS), 2019: 3.1%
The same curve repeats across image recognition, speech recognition, natural language, and reinforcement learning, and it rests on the growth of compute.
[Figure: performance (Op/Sec), 1960-2019. CPUs under Moore's law improved roughly 10^8x, from ENIAC at about 5 Kops to a Xeon E5 at about 500 Gops; dedicated hardware adds roughly another 10^5x with GPUs and TPUs, e.g. TPU v1 about 90 Tops, V100 about 125 Tops, TPU v3 about 360 Tops.]
Deep learning frameworks and the stack beneath them:
- Early custom-purpose machine learning systems (Theano, DisBelief, Caffe) sat directly on algebra and linear libraries, i.e. dense matmul engines for CPU and GPU.
- Deep learning frameworks (MXNet, TensorFlow, CNTK, PyTorch) provide easier ways to leverage those various libraries, and now target GPUs, FPGAs, and special AI accelerators such as TPU, GraphCore, and other ASICs.
- Above them sit language frontends (Swift for TensorFlow) and compiler backends (TVM, TensorFlow XLA).
- The direction is a machine learning language and compiler: a full-featured programming language for ML, expressive and flexible (control flow, recursion, sparsity), with a powerful compiler infrastructure (code optimization, sparsity optimization, hardware targeting) covering SIMD/MIMD hardware, sparsity support, control flow and dynamicity, and associated memory.
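As a small illustration of that point (not from the original slides; it assumes NumPy and PyTorch are installed), the snippet below computes the same product once through a BLAS-backed library call and once through a framework tensor op that can transparently move to an accelerator:

    import numpy as np
    import torch

    a = np.random.rand(256, 512).astype(np.float32)
    b = np.random.rand(512, 128).astype(np.float32)

    # Direct library route: NumPy dispatches this to a BLAS dense matmul engine.
    c_np = a @ b

    # Framework route: same math, but the framework picks the backend
    # (CPU kernels, or cuBLAS if a GPU is available) and can differentiate it.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    c_torch = torch.from_numpy(a).to(device) @ torch.from_numpy(b).to(device)

    print(np.allclose(c_np, c_torch.cpu().numpy(), atol=1e-4))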
The AI system stack itself, from user experience down to architecture on a single node and in the cloud, is covered by classes 3-8:
- Experience: end-to-end AI user experiences; model, algorithm, pipeline, experiment, tool, and life-cycle management.
- Frameworks: programming interfaces; computation graph and (auto) gradient calculation; IR and compiler infrastructure.
- Runtime: the deep learning runtime with its optimizer, planner, and executor.
- Architecture: resource management/scheduler; hardware APIs (GPU, CPU, FPGA, ASIC); a scalable network stack (RDMA, IB, NVLink).

The broader AI system ecosystem extends this core:
- New machine learning paradigms such as RL (class 12), automated machine learning, AutoML (class 11), security and privacy (class 13), and model inference, compression, and optimization (class 10).
- Deep learning algorithms and frameworks: efficient new general-purpose AI algorithms for broad use; support for and evolution of multiple deep learning frameworks; deep neural network compiler architecture and optimization.
- Core system software and hardware: the environment for running and optimizing deep learning jobs; general-purpose resource management and scheduling systems; new hardware and the related high-performance network and compute stacks.
Working with a framework starts with (1) defining the network structure and then (2) starting training. Common structure choices:
- Fully connected layers: typically the last few layers of a classification model.
- Convolutional neural networks: typically for data with strong locality, such as images and speech.
- Recurrent neural networks: typically for sequential and structured data, such as text and knowledge graphs.
- Transformer networks: typically for sequence data, such as text.
A sketch of these layer types follows the list.
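A minimal sketch of step (1), assuming PyTorch; the sizes are arbitrary and only show how each structure type above is declared before training starts:

    import torch.nn as nn

    # One instance of each structure type from the list above (sizes are illustrative).
    fully_connected = nn.Linear(in_features=512, out_features=10)                 # classifier head
    convolutional = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)      # images
    recurrent = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)        # sequences
    transformer = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)

    # Step (2), "start training", then wraps such modules in a model, a loss, and an optimizer.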
Beyond these fixed structures, models can be recursive and data-dependent, e.g. a recursive TreeBank sentiment model written in a dozen lines and annotated as follows:

    # A recursive TreeBank model in a dozen lines of JPL code
    # Walk the tree, accumulating embedding vecs
    # Word embedding model is used at the leaf node to map word
    # index into high-dimensional semantic word representation.
    # Map tree embedding to sentiment
    # Get semantic representations for left and right children.
    # A composition function is used to learn semantic
    # representation for phrase at the internal node.

Such models bring more diverse structures, more powerful modeling capability, more complex dependencies, and finer-grained computation patterns.
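A minimal sketch of what those annotations describe, assuming PyTorch; the TreeNode class, the sizes, and the composition function are invented for illustration and are not the original dozen-line program:

    import torch
    import torch.nn as nn

    class TreeNode:
        # Hypothetical binary parse-tree node: a leaf holds a word index,
        # an internal node holds left/right children.
        def __init__(self, word_idx=None, left=None, right=None):
            self.word_idx, self.left, self.right = word_idx, left, right

    class RecursiveSentiment(nn.Module):
        def __init__(self, vocab_size=10000, dim=64, classes=5):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)   # word embedding at leaves
            self.compose = nn.Linear(2 * dim, dim)       # composition at internal nodes
            self.out = nn.Linear(dim, classes)           # map tree embedding to sentiment

        def encode(self, node):
            # Walk the tree, accumulating embedding vectors.
            if node.word_idx is not None:                # leaf: word index -> embedding
                return self.embed(torch.tensor([node.word_idx])).squeeze(0)
            left = self.encode(node.left)                # semantic repr. of left child
            right = self.encode(node.right)              # semantic repr. of right child
            return torch.tanh(self.compose(torch.cat([left, right])))

        def forward(self, root):
            return self.out(self.encode(root))           # sentiment logits

    # Usage: a tiny parse tree of word indices.
    tree = TreeNode(left=TreeNode(word_idx=1),
                    right=TreeNode(left=TreeNode(word_idx=2), right=TreeNode(word_idx=3)))
    print(RecursiveSentiment()(tree).shape)   # torch.Size([5])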
Inside a framework such as TensorFlow, the model becomes a graph definition (IR), e.g. y = x * w + b:
- Front end: language bindings for Python, Lua, R, and C++.
- Optimization: batching, cache, overlap.
- Execution runtime: CPU, GPU, RDMA devices.

The Data-Flow Graph (DFG) is the intermediate representation. For a graph such as a = x * y, b = a + z, c = Σ(b), gradient backpropagation is added to the DFG as extra nodes that propagate the incoming gradient g and yield ∇x, ∇y, ∇z (and ∇a, ∇b along the way). The augmented graph is then split into operators, each lowered to CPU code or GPU code for execution.
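A small sketch of that flow using PyTorch autograd (an illustrative stand-in; the slide's diagrams use TensorFlow): the framework records the forward data-flow graph for a = x * y, b = a + z, c = Σ(b) and extends it with gradient nodes when backward is called:

    import torch

    x = torch.randn(3, requires_grad=True)
    y = torch.randn(3, requires_grad=True)
    z = torch.randn(3, requires_grad=True)

    a = x * y       # node '*': recorded in the data-flow graph
    b = a + z       # node '+': recorded in the data-flow graph
    c = b.sum()     # node 'Σ': scalar output

    c.backward()    # adds and executes gradient backpropagation over the graph

    # dc/dx = y, dc/dy = x, dc/dz = 1
    print(torch.allclose(x.grad, y), torch.allclose(y.grad, x),
          torch.allclose(z.grad, torch.ones(3)))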
The whole toolchain, viewed through a compiler lens across experience, frameworks, and architecture:
- IDE: programming with VS Code or Jupyter Notebook.
- Language: integrated with a mainstream programming language, i.e. PyTorch and TensorFlow inside Python.
- Compiler (intermediate representation, compilation, optimization):
  - Basic data structure: Tensor (the counterpart of lexical analysis and tokens); user-controlled mini-batching.
  - Basic computation: DAG (the counterpart of parsing and the AST); data parallelism and model parallelism.
  - Advanced features: control flow (the counterpart of semantic analysis), symbolic AD, loop-nest analysis for pipeline parallelism and control flow, general IRs such as MLIR.
  - Code optimization: data-flow analysis (CSP, arithmetic, fusion).
  - Code generation: hardware-dependent optimizations (matrix computation, layout); resource allocation and scheduling (memory, recomputation).
- Runtimes: single node (cuDNN); multi-node (parameter servers, all-reduce); computation-cluster resource management and job scheduling.
- Hardware: hardware accelerators (CPU/GPU/ASIC/FPGA) and network accelerators (RDMA/IB/NVLink).
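One concrete way to see the "DAG as IR" idea above, assuming PyTorch's torch.fx is available (an illustrative choice, not something the slides prescribe): tracing a small module yields a graph IR that optimization and code-generation passes can rewrite:

    import torch
    import torch.nn as nn
    from torch.fx import symbolic_trace

    class Tiny(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(8, 4)

        def forward(self, x):
            return torch.relu(self.fc(x)) + 1.0

    # Parsing/semantic-analysis analogue: trace Python code into a graph IR (a DAG of ops).
    traced = symbolic_trace(Tiny())
    print(traced.graph)   # placeholder -> linear -> relu -> add -> output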
Recapping today's stack: deep learning frameworks (MXNet, TensorFlow, CNTK, PyTorch), language frontends (Swift for TensorFlow), compiler backends (TVM, TensorFlow XLA), and AI frameworks as dense matmul engines over GPU, FPGA, and special AI accelerators (TPU, GraphCore, other ASICs). A glimpse at TensorFlow's framework protobuf imports:

    import "tensorflow/core/framework/to";
    import "tensorflow/core/framework/op_to";
    import "tensorflow/core/framework/tensor_to

The alternative is a machine learning language and compiler: a full-featured programming language for ML, expressive and flexible (control flow, recursion, sparsity), with a powerful compiler infrastructure (code optimization, sparsity optimization, hardware targeting) for SIMD/MIMD, sparsity support, control flow and dynamicity, and associated memory. MLIR is one such infrastructure:

    // Syntactically similar to LLVM:
    func @testFunction(%arg0: i32) {
      %x = call @thingToCall(%arg0) : (i32) -> i32
      br ^bb1
    ^bb1:
      %y = addi %x, %x : i32
      return %y : i32
    }

Deep learning depends heavily on both data scale and model scale: faster training speeds up the development of deep learning models, and deploying them at scale requires faster and more efficient inference.
Inference performance and serving latency matter because models keep growing:
- Image: AlexNet (2012), 8 layers, 1.4 GFLOP, 16% error; ResNet (2015), 152 layers, 22.6 GFLOP, 3.5% error.
- Speech: Deep Speech 1 (2014), 80 GFLOP, 7,000 hrs of data, 8% error; Deep Speech 2 (2015), 465 GFLOP, 12,000 hrs of data, 5% error.

An inference system has to cope with different architectures (CNN, RNN, Transformer, …), high computation resource requirements (model size, …), and different goals (latency, throughput, accuracy, …). It should apply transparently over a heterogeneous hardware environment (scale-out, local efficiency, memory effectiveness) and be transparent to various user requirements. Systems, algorithms, and hardware must be designed together.
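As a small illustration of the latency versus throughput goals above (assuming PyTorch; the model and batch sizes are arbitrary and not from the course), the snippet times one model under different batch sizes; larger batches usually raise throughput but worsen per-request latency:

    import time
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()

    for batch in (1, 8, 64):
        x = torch.randn(batch, 512)
        with torch.no_grad():
            model(x)                      # warm-up run
            start = time.perf_counter()
            for _ in range(50):
                model(x)
            elapsed = time.perf_counter() - start
        latency_ms = elapsed / 50 * 1000
        throughput = batch * 50 / elapsed
        print(f"batch={batch:3d}  latency={latency_ms:6.2f} ms/iter  "
              f"throughput={throughput:8.1f} samples/s")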