2024人工智能AI技术教程_第1页
2024人工智能AI技术教程_第2页
2024人工智能AI技术教程_第3页
2024人工智能AI技术教程_第4页
2024人工智能AI技术教程_第5页
已阅读5页,还剩47页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

2024人工智能AI技术教程

课程

讲义名称

备注

1

课程介绍

Overviewandsystem/AIbasics

2

人工智能系统概述

SystemperspectiveofSystemforAI

SystemforAI:ahistoricview;Fundamentalsofneuralnetworks;FundamentalsofSystemforAI

3

深度神经网络计算框架基础

ComputationframeworksforDNN

BackpropandAD,Tensor,DAG,ExecutiongraphPapersandsystems:PyTorch,TensorFlow

4

矩阵运算与计算机体系结构

ComputerarchitectureforMatrixcomputation

Matrixcomputation,CPU/SIMD,GPGPU,ASIC/TPU

Papersandsystems:Blas,TPU

5

分布式训练算法

Distributedtrainingalgorithms

Dataparallelism,modelparallelism,distributedSGDPapersandsystems:

6

分布式训练系统

Distributedtrainingsystems

MPI,parameterservers,all-reduce,RDMAPapersandsystems:Horovod

7

异构计算集群调度与资源管理系统

Schedulingandresourcemanagementsystem

RunningDNNjoboncluster:container,resourceallocation,schedulingPapersandsystems:KubeFlow,OpenPAI,Gandiva,HiveD

8

深度学习推导系统

Inferencesystems

Efficiency,latency,throughput,anddeployment

课程

讲义名称

备注

9

计算图编译优化

Computationgraphcompilationandoptimization

IR,sub-graphpatternmatch,Matrixmultiplicationandmemoryoptimization

Papersandsystems:XLA,MLIR,TVM,NNFusion

10

模型压缩和稀疏化处理

Efficiencyviacompressionandsparsity

Modelcompression,SparsityPruning

11

自动机器学习系统

AutoMLsystems

Hyperparametertuning,NAS

Papersandsystems:Hyperband,SMAC,ENAS,AutoKeras,NNI

12

强化学习系统

Reinforcementlearningsystems

TheoryofRL,systemsforRL

Papersandsystems:AC3,RLlib,AlphaZero

13

模型安全与隐私保护

SecurityandPrivacy

Federatedlearning,security,privacyPapersandsystems:DeepFake

14

用AI技术优化计算机系统

AIforsystems

AIfortraditionalsystemsproblems,forsystemalgorithms

Papersandsystems:LearnedIndexes,Learnedquerypath

课程

讲义名称

备注

Lab1(forweek1,2)

框架及工具入门示例

Asimplethroughoutend-to-endAIexample,froma

systemperspective

Understandthesystemsfromdebuggerinfoand

systemlogs

Lab2(forweek3)

定制一个新的张量运算

Customizeoperators

Designandimplementacustomizedoperator(bothforwardandbackward):inpython

Lab3(forweek4)

CUDA实现和优化CUDAimplementation

AddaCUDAimplementationforthecustomizedoperator

Lab4(forweek5,6)

AllReduce实现和优化

AllReduce

ImproveoneofAllReduceoperators’

implementationonHorovod

Lab5(forweek7,8)

配置Container来进行云上训练或推理准备

Configurecontainersforcustomizedtrainingandinference

Configurecontainers

Lab6

学习使用调度管理系统

Schedulingandresourcemanagementsystem

GetfamiliarwithOpenPAIorKubeFlow

Lab7

分布式训练任务练习

Distributedtraining

Trydifferentkindsofallreduceimplementations

Lab8

自动机器学习系统练习

AutoML

SearchforanewneuralnetworkNNstructureeforImage/NLPtasks

Lab9

强化学习系统练习

RLSystems

Configureandgetfamiliarwithoneofthefollowing

RLSystems:RLlib,…

Self-driving Surveillancedetection Translation Medicaldiagnostics Game

Personalassistant

DeepLearning

深度学习正在改变世界

Art

Imagerecognition Speechrecognition Naturallanguage Generativemodel Reinforcementlearning

cat

dog

honeybadger

𝑤1 𝑤2 𝑤3 𝑤4 𝑤5

CatDogRaccoon

loss

𝑑error

𝑑𝑤1

𝑑error

𝑑𝑤2

𝑑error

𝑑𝑤3

𝑑error

𝑑𝑤4

𝑑error

𝑑𝑤5

Errors

Dog

RDMA

海量的(标识)数据

深度学习算法的进步 语言、框架

计算能力

14Mimages

深度学习+系统的进步:编程语言、优化、计算机体系结构、并行计算以及分布式系统

E.g.,imageclassificationproblem

MNIST

ImageNet

WebImages

60Ksamples

16Msamples

BillionsofImages

10categories

1000categories

Openedcategories

TESTERRORRATE(%)

12

5

7.7

3.3

1.4

4.7

1.7

0.23

LeNet,convolution,max-pooling,softmax,1998

EfficientNet,

3.1%

NAS2019

AlexNet,16.4%

ReLU,Dropout,

2012

Inception,6.7%Batchnormalization,2015

ResNet,3.57%Residualway,2015

Imagerecognition

Speechrecognition

Naturallanguage

Reinforcementlearning

TPUv3

360Tops

V100

TPUv1

125Tops

90Tops

Performance(Op/Sec)

?

TPU

Dedicated

Hardware

GPU

CPU

Moore’slaw

5Kops

ENIAC

~500Gops

XeonE5

108x

105x

1960

1970 1980 1990 2000 2010

2019

CompilerBackend

TVM

TensorFlowXLA

LanguageFrontend

SwiftforTensorFlow

MxNetTensorFlowCNTK

PyTorch

Custompurposemachinelearningalgorithms

TheanoDisBeliefCaffe

Deeplearningframeworks

Algebra&

linearlibs

CPU

GPU

Densematmulengine

GPU

FPGA

SpecialAIaccelerators

TPU

GraphCore

OtherASICs

Custompurposemachinelearningalgorithms

TheanoDisBeliefCaffe

Deeplearningframeworksprovideeasierwaystoleveragevariouslibraries

MachineLearningLanguageandCompiler

PowerfulCompilerInfrastructure:

Codeoptimization,sparsityoptimization,hardwaretargeting

AFull-FeaturedProgrammingLanguageforML:Expressiveandflexible

Controlflow,recursion,sparsity

Algebra&

linearlibs

CPU

GPU

AIframeworkDensematmulengine

SIMDMIMD

SparsitySupport

ControlFlowandDynamicityAssociatedMemory

End-to-EndAIUserExperiences

Model,Algorithm,Pipeline,Experiment,Tool,LifeCycleManagement

Experience

ProgrammingInterfacesComputationgraph,(auto)Gradientcalculation

IR,Compilerinfrastructure

Frameworks

HardwareAPIs(GPU,CPU,FPGA,ASIC)

ResourceManagement/Scheduler

ScalableNetworkStack(RDMA,IB,NVLink)

DeepLearningRuntime:Optimizer,Planner,Executor

Runtime

Architecture

(singlenodeandCloud)

class3

class4

class5

class6

class7

class8

更广泛的AI系统生态

class12

机器学习新模式

(RL)

深度学习算法和框架

class11

class13

class10

自动机器学习

(AutoML)

安全与隐私

模型推导、压缩与优

广泛用途的高效新型通用AI算法

多种深度学习框架的支持与进化

深度神经网络编译架

构及优化

核心系统软硬件

深度学习任务运行和优 通用资源管理和调度系化环境 统

新型硬件及相关高性能网络和计算栈

(2)开始训练

定义网络结构

Fullyconnected

通常用作分类问题的最后几层

Convolutionalneuralnetwork

通常用作图像、语音等Locality强的数据

Recurrentneuralnetwork

通常用作序列及结构化的数据,比如文本信息、知识图

Transformerneuralnetwork

通常用作序列数据,比如文本信息

#ArecursiveTreeBankmodelinadozenlinesofJPLcode#Walkthetree,accumulatingembeddingvecs

#Wordembeddingmodelisusedattheleafnodetomapword#indexintohigh-dimensionalsemanticwordrepresentation.

#Getsemanticrepresentationsforleftandrightchildren.

#Acompositionfunctionisusedtolearnsemantic#representationforphraseattheinternalnode.

#Maptreeembeddingtosentiment

更多样化的结构

更强大的建模能力

更复杂的依赖关系

更细粒度的计算模式

ExecutionRuntime

CPU,GPU,RDMAdevices

Graphdefinition(IR)

xw

*b

+

y

Front-end

LanguageBinding:Python,Lua,R,C++

Optimization

Batching,Cache,Overlap

x

y

z

*

a

+

b

Σ

c

TensorFlow

Data-FlowGraph(DFG)

asIntermediateRepresentation

x

y

z

𝛻x

𝛻y

*

a

*𝐠

𝛻z

+

b

Σ

c

+𝐠

𝛻a

𝛻b

Σ𝐠

AddgradientbackpropagationtoData-FlowGraph(DFG)

TensorFlow

x y z

𝛻x 𝛻y

CPUcode

GPUcode

* *

a

+ +𝐠

b 𝛻b

Σ Σ𝐠

c

𝛻a

𝛻z

x

y

z

𝛻x

𝛻y

*

a

*𝐠

𝛻z

+

b

Σ

c

+𝐠

𝛻a

𝛻b

Σ𝐠

......

1

Operators

IDE

Programmingwith:VSCode,JupiterNotebook

Language

IntegratedwithmainstreamPL:PyTorchandTensorFlowinsidePython

Compiler

Intermediaterepresentation

Compilation

Optimization

Basicdatastructure:Tensor

Lexicalanalysis:Token

Usercontrolled:mini-batch

Basiccomputation:DAG

Parsing:AST

Dataparallelismandmodelparallelism

Advancefeatures:controlflow

Semanticanalysis:SymbolicAD

Loopnetsanalysis:pipelineparallelism,controlflow

GeneralIRs:MLIR

Codeoptimization

Dataflowanalysis:CSP,Arithmetic,Fusion

Codegeneration

Hardwaredependentoptimizations:matrixcomputation,layout

Resourceallocationandscheduler:memory,recomputation,

Runtimes

Singlenode:CuDNN

Multimode:Parameterservers,Allreducer

Computationclusterresourcemanagementandjobscheduler

Hardware

Hardwareaccelerators:CPU/GPU/ASIC/FPGA

Networkaccelerators:RDMA/IB/NVLink

Experience

Frameworks

Architecture

CompilerBackend

TVM

TensorFlowXLA

LanguageFrontend

SwiftforTensorFlow

MxNetTensorFlowCNTK

PyTorch

Deeplearningframeworks

SpecialAIaccelerators

TPU

GraphCore

OtherASICs

AIFrameworkDense

matmulengine

GPU

FPGA

import"tensorflow/core/framework/to";import"tensorflow/core/framework/op_to";import"tensorflow/core/framework/tensor_to

MachineLearningLanguageandCompiler

PowerfulCompilerInfrastructure:

Codeoptimization,sparsityoptimization,hardwaretargeting

AFull-FeaturedProgrammingLanguageforML:Expressiveandflexible

Controlflow,recursion,sparsity

SIMDMIMD

SparsitySupport

ControlFlowandDynamicityAssociatedMemory

//SyntacticallysimilartoLLVM:

func@testFunction(%arg0:i32){

%x=call@thingToCall(%arg0):(i32)->i32br^bb1

^bb1:

%y=addi%x,%x:i32

return%y:i32}

深度学习高度依赖数据规模和模型规模

8layers

1.4GFLOP

16%Error

2012

AlexNet

Image

152layers

22.6GFLOP

3.5%Error

2015

ResNet

Speech

提高训练速度可以加快深度学习模型的开发速度

大规模部署深度学习模型需要更快和更高效的推演速度

InferenceperformanceServinglatency

80GFLOP

7,000hrsofData

8%Error

2014

DeepSpeech1

465GFLOP

12,000hrsofData

5%Error

2015

DeepSpeech2

Differentarchitectures:CNN,

RNN,Transformer,…

Highcomputationresource

requirements:modelsize,…

Differentgoals:latency,

throughput,accuracy,…

Betransparenttovarioususerrequirements

Transparentlyapplyoverheterogeneoushardwareenvironment

Scale-out LocalEfficiency MemoryEffectiveness

系统、算法和硬件必须相互结合:

算法层面:模型的结构,是否可压缩、可稀疏化,batch的大小、学习算法

系统层面:各个层次的并行化,去重,Overlap,调度与资

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论