GPT Model Inference Acceleration

Agenda
➢ LLM inference challenges
➢ Overall LLM inference approach
➢ GPT model basics
➢ GPT model inference acceleration in practice

LLM Inference Challenges

GPT3-175B needs 5× A800-80G GPUs for inference.
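That GPU count follows directly from the model size: at FP16, each parameter takes 2 bytes, so the weights alone occupy about 350 GB and cannot fit in fewer than five 80 GB cards. A minimal sketch of the arithmetic:

```cuda
#include <cmath>
#include <cstdio>

// Back-of-the-envelope memory estimate for serving GPT3-175B in FP16.
int main() {
    const double params     = 175e9;             // total parameters
    const double weight_gb  = params * 2 / 1e9;  // 2 bytes/param at FP16 -> ~350 GB
    const double gpu_mem_gb = 80.0;              // A800-80G capacity
    printf("weights: %.0f GB -> at least %.0f GPUs (before K/V cache and activations)\n",
           weight_gb, std::ceil(weight_gb / gpu_mem_gb));
    return 0;
}
```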

LLM inference challenges:
• How to reduce the memory requirement?
• How to accelerate computing?
• How to optimize communication?

Overall LLM Inference Approach

Model compression for inference:
• Smaller models -> smaller memory footprint
• Compute acceleration
  • Reduced-precision computing
  • Reduced complexity -> fewer floating-point operations (FLOPs)
• The main techniques: quantization, distillation, and pruning

MGMN (multi-GPU, multi-node) inference:
When the LLM is too large to deploy on a single GPU, and model compression cannot reach acceptable accuracy, the other option is multi-GPU inference (MGMN), implemented with Tensor Parallelism and/or Pipeline Parallelism.

GPT Model Basics

GPT = Generative Pre-trained Transformer. The original Transformer is an encoder-decoder model; GPT keeps only the decoder side, and consists of:
• Embedding layer
• Decoder layer × N
• Decoding

Model configuration of GPT3 175B:
• Number of layers (l): 96
• Sequence length (S): 2048
• Hidden layer size (h): 12288
• Vocabulary size (V): 51200
• Total parameters: 175B
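These numbers are mutually consistent: each decoder layer carries roughly 12·h² weights (4·h² in the attention projections, 8·h² in the FFN), so, ignoring biases and layer norms, a quick check reproduces the headline figure:

```cuda
#include <cstdio>

// Sanity-check GPT3-175B's parameter count from its configuration.
int main() {
    const double l = 96, h = 12288, V = 51200;
    const double attn = 4 * h * h;   // Q, K, V and output projections
    const double ffn  = 8 * h * h;   // h->4h and 4h->h projections
    const double emb  = V * h;       // token embedding table
    printf("~%.1fB parameters\n", (l * (attn + ffn) + emb) / 1e9);  // ~174.6B
    return 0;
}
```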

Embedding layer, part 1: text embedding (example prompt: "This place is …").
Each input token is a one-hot vector [0 0 0 0 0 … 0 1 0 … 0 0] of length vocab_size. Multiplying it by the embedding matrix W (vocab_size × hidden_size) simply selects one row of W, producing a hidden_size activation such as [… 0.1 …]. Hidden_size = 12288 for GPT3.

Embedding layer, part 2: position embedding.
For the token in position id = i, the position embedding is another hidden_size vector, sketched on the slide as [Sin(x_0) … Sin(x_{N-1})] with N = hidden_size, e.g. [0.1 … 0.2]. The position embedding is added elementwise to the text embedding.
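A minimal sketch of the sinusoidal scheme the slide hints at, in the standard Transformer formulation where entry 2j is sin(i / 10000^(2j/h)) and entry 2j+1 is cos of the same argument (GPT-3 itself actually learns its position embeddings, but the add-to-text-embedding structure is identical):

```cuda
#include <cmath>
#include <cstdio>

// Fill one sinusoidal position-embedding vector pe[0..h-1] for position `pos`.
void position_embedding(float* pe, int pos, int h) {
    for (int j = 0; j < h; j += 2) {
        const double freq = std::pow(10000.0, -(double)j / h);
        pe[j] = (float)std::sin(pos * freq);
        if (j + 1 < h) pe[j + 1] = (float)std::cos(pos * freq);
    }
}

int main() {
    const int h = 12288;                   // hidden_size for GPT-3
    static float pe[12288];
    position_embedding(pe, /*pos=*/5, h);  // embedding for the token at position 5
    printf("pe[0]=%f pe[1]=%f\n", pe[0], pe[1]);
    return 0;
}
```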

Decoder layer (× N), each built from:
• Attention: computes attention between the current token and every previous token (causal self-attention)
• Layer normalization
• FFN: expands the hidden_size (h) activation to 4h and projects it back to h
• Layer normalization

Decoding: the final hidden_size vector is projected back onto the vocabulary (the slide writes this as W⁻¹, the reverse of the embedding lookup, E -> Token) to score every candidate token, and the next token is then chosen by:
• Greedy search
• Sampling
• Beam search
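A minimal host-side sketch of the first two strategies over a logits vector (illustrative only; FT implements these as GPU kernels):

```cuda
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Greedy search: simply take the argmax of the logits.
int greedy(const std::vector<float>& logits) {
    return (int)(std::max_element(logits.begin(), logits.end()) - logits.begin());
}

// Sampling: draw the next token from softmax(logits).
// (discrete_distribution normalizes the weights, so exp() alone suffices.)
int sample(const std::vector<float>& logits, std::mt19937& rng) {
    const float m = *std::max_element(logits.begin(), logits.end());
    std::vector<float> w(logits.size());
    for (size_t i = 0; i < logits.size(); ++i)
        w[i] = std::exp(logits[i] - m);  // subtract max for numerical stability
    std::discrete_distribution<int> dist(w.begin(), w.end());
    return dist(rng);
}
```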

GPT Model Inference Acceleration in Practice

FasterTransformer overview
FasterTransformer (FT) is highly optimized for Transformer models:
a) Highly optimized kernels
b) Shared buffers
c) Flexible to add optimizations
d) Supported data types: FP32, FP16, BF16, INT8
e) Supports MGMN inference
(Figure: the 4 flows of FasterTransformer FP16 inference.)

GPT optimizations in FT:
• Decoder layer
  • Attention optimization
  • K/V cache
  • Normalization optimization
  • Activation memory optimization
  • INT8 quantization
• Decoding
  • Beam search
  • Streaming decoding in FT
• MGMN inference
  • TP/PP
  • NCCL allreduce optimization

Decoder: two inference phases.
• In a GPT model, we receive a context as input, and then generate the reply step by step.
• We split the workflow into two phases: the context phase and the generation phase.
(Figure: the context phase consumes the whole input sequence at once; the generation phase then emits one output token per step, N-1 further steps for an output length of N.)

Decoder attention, context phase:
• Like an encoder, it needs to handle multiple tokens at once.
• Using a hand-written CUDA kernel to compute the batched attention GEMMs is inefficient when the sequence-length dimension is large.
• Instead, use unfused multi-head attention, so the GEMMs can leverage tensor cores.
• Save the resulting Key and Value tensors into the cache to avoid recomputing them.

Decoder attention, generation phase (generating tokens step by step):
• Use the fused "QKV masked attention" kernel: with only one token per step the GEMMs are tiny, so fusing the bias add, K/V cache update, dot products, softmax, and weighted sum over V into a single kernel avoids launch and memory-traffic overhead.

Decoder K/V cache, original. In the decoder, multi-head attention computes the relationship between
• the current token (s_t), and
• all tokens generated in previous steps.
Without a cache, the K and V of all previous tokens would have to be recomputed and concatenated at every step.

Decoder K/V cache, optimization. Use a K/V cache to prevent both the recomputation and the concatenation:
• Prepare a large K/V cache buffer up front.
• At each step, compute q, k, v for the current token only.
• Put the new k/v into the cache in place, at the slot for the current step.
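A minimal CUDA sketch of that in-place write (the cache layout and kernel name are illustrative assumptions, not FT's exact code):

```cuda
#include <cuda_fp16.h>

// Append the current step's K and V vectors into preallocated caches, in
// place; no reallocation or concatenation ever happens.
// Assumed cache layout: [batch, heads, max_seq_len, size_per_head].
// Launch as: append_kv_cache<<<dim3(batch, heads), 256>>>(...).
__global__ void append_kv_cache(half* k_cache, half* v_cache,
                                const half* k_step,  // [batch, heads, size_per_head]
                                const half* v_step,
                                int heads, int max_seq_len,
                                int size_per_head, int step) {
    const int b = blockIdx.x;  // batch index
    const int h = blockIdx.y;  // head index
    for (int i = threadIdx.x; i < size_per_head; i += blockDim.x) {
        const size_t src = ((size_t)b * heads + h) * size_per_head + i;
        const size_t dst =
            (((size_t)b * heads + h) * max_seq_len + step) * size_per_head + i;
        k_cache[dst] = k_step[src];  // write directly at slot `step`
        v_cache[dst] = v_step[src];
    }
}
```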

Decoder normalization optimization.
• Original: the layer-norm kernel performs two block-level reductions, one for the mean and one for the variance; each blockReduce is built from warpReduce steps plus shared memory, costing a sync per reduction (sync, sync, sync in the slide's diagram).
• Optimization: from math, Var[x] = E[x^2] - E[x]^2, so the sums of x and x^2 can be accumulated in the same pass and combined in a single blockReduce with a single sync.
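The building blocks follow the standard CUDA reduction pattern; below is a simplified FP32 sketch of the one-pass layer norm (not FT's exact kernel):

```cuda
#include <cuda_runtime.h>

// Standard warp-level reduction: sum a (x, x^2) pair across the 32 lanes.
__inline__ __device__ float2 warpReduceSum2(float2 v) {
    for (int mask = 16; mask > 0; mask >>= 1) {
        v.x += __shfl_xor_sync(0xffffffff, v.x, mask);
        v.y += __shfl_xor_sync(0xffffffff, v.y, mask);
    }
    return v;
}

// Block-level reduction built from warp reductions: one sync for both sums.
__inline__ __device__ float2 blockReduceSum2(float2 v) {
    static __shared__ float2 shared[32];
    const int lane = threadIdx.x & 31, wid = threadIdx.x >> 5;
    v = warpReduceSum2(v);
    if (lane == 0) shared[wid] = v;  // one partial pair per warp
    __syncthreads();
    v = (threadIdx.x < (blockDim.x + 31) / 32) ? shared[lane]
                                               : make_float2(0.f, 0.f);
    if (wid == 0) v = warpReduceSum2(v);
    return v;
}

// One-pass LayerNorm over a row of length n (one block per row):
// accumulate sum(x) and sum(x^2) together, then Var[x] = E[x^2] - E[x]^2.
__global__ void layernorm_onepass(const float* __restrict__ in,
                                  const float* __restrict__ gamma,
                                  const float* __restrict__ beta,
                                  float* out, int n, float eps) {
    __shared__ float s_mean, s_rstd;
    const float* row = in + (size_t)blockIdx.x * n;
    float2 acc = make_float2(0.f, 0.f);
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        const float x = row[i];
        acc.x += x;      // sum(x)
        acc.y += x * x;  // sum(x^2)
    }
    acc = blockReduceSum2(acc);  // single fused reduction
    if (threadIdx.x == 0) {
        s_mean = acc.x / n;
        s_rstd = rsqrtf(acc.y / n - s_mean * s_mean + eps);
    }
    __syncthreads();             // broadcast mean/rstd to the whole block
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        out[(size_t)blockIdx.x * n + i] =
            (row[i] - s_mean) * s_rstd * gamma[i] + beta[i];
}
```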

Decoder activation buffer optimization.
• Original: allocate a buffer for every decoder layer's activations (buffer, buffer, buffer, … in the slide's diagram).
• In FT: allocate a buffer for only one layer's activations and reuse it for all layers; since the layers execute sequentially, a layer's activations are no longer needed once the next layer starts.
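A minimal sketch of the reuse pattern (decoder_layer_forward and the sizing are hypothetical placeholders):

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for one decoder layer's forward pass; it reads and
// writes only the single shared activation buffer.
void decoder_layer_forward(int layer, half* activations) { /* ... */ }

void gpt_forward(int num_layers, size_t max_activation_elems) {
    half* activations = nullptr;
    // One allocation, sized for a single layer's activations...
    cudaMalloc(&activations, max_activation_elems * sizeof(half));
    // ...reused by all layers, since they run strictly one after another.
    for (int l = 0; l < num_layers; ++l)
        decoder_layer_forward(l, activations);
    cudaFree(activations);
}
```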

Decoder quantization. Quantization is used for model-size reduction and inference acceleration beyond plain FP16 inference. There are two common ways to quantize a model:
• Post-training quantization (PTQ): lower cost, lower accuracy.
• Quantization-aware training (QAT): higher cost, higher accuracy.

Weight-only INT8 in FT:
• Only the weights are saved in INT8; activations are kept in FP16.
• In the GEMM, we load the INT8 weights, cast them to FP16, and use the FP16 tensor cores, so the win is halved weight memory and bandwidth rather than faster arithmetic.
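A naive GEMV sketch of the load-then-dequantize idea (FT's real kernels are tensor-core GEMMs; the per-row scale layout here is an assumption):

```cuda
#include <cuda_fp16.h>

// Weight-only-INT8 GEMV sketch: y = W x, with W stored as int8 plus a
// per-row FP16 scale. Weights are dequantized on load; math stays FP16/FP32.
__global__ void gemv_w8a16(const int8_t* __restrict__ W,     // [rows, cols]
                           const half*   __restrict__ scale, // [rows]
                           const half*   __restrict__ x,     // [cols]
                           half* y, int rows, int cols) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;
    float acc = 0.f;
    for (int k = 0; k < cols; ++k) {
        // load INT8 weight, rescale, accumulate in FP32
        const float w = (float)W[(size_t)row * cols + k] * __half2float(scale[row]);
        acc += w * __half2float(x[k]);
    }
    y[row] = __float2half(acc);
}
```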

W8A8 for GPT in FT:
• Quantize both the weights and the activations, and run the GEMM on INT8 tensor cores, so the arithmetic itself speeds up as well.

Decoding: beam search. Instead of committing to the single best token at each step, beam search keeps the top-k partial sequences (beams), extends each with every candidate token, and retains the k highest-scoring continuations.
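A host-side sketch of one such step (illustrative only; FT's beam search runs as GPU kernels):

```cuda
#include <algorithm>
#include <vector>

// One beam-search step: from k beams, score every (beam, token) continuation
// and keep the k best. Scores are cumulative log-probabilities;
// logprobs[b] holds beam b's vocab-sized step scores.
struct Candidate { float score; int beam; int token; };

std::vector<Candidate> beam_step(const std::vector<float>& beam_scores,
                                 const std::vector<std::vector<float>>& logprobs,
                                 int k) {
    std::vector<Candidate> all;
    for (int b = 0; b < (int)beam_scores.size(); ++b)
        for (int t = 0; t < (int)logprobs[b].size(); ++t)
            all.push_back({beam_scores[b] + logprobs[b][t], b, t});
    std::partial_sort(all.begin(), all.begin() + k, all.end(),
                      [](const Candidate& a, const Candidate& c) {
                          return a.score > c.score;
                      });
    all.resize(k);  // the surviving beams for the next step
    return all;
}
```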

Decoding: streaming decoding in FT.
• Original: GPT returns the whole batch's outputs together, so when the batch size is large and output sequence lengths vary a lot, some outputs have to wait for the longest one.
• Optimization: FT supports streaming decoding, returning each sequence's tokens as they are generated.
• Better user experience (lower time to first token).

MGMN: TP/PP (Tensor Parallel / Pipeline Parallel).
Recommendation: use TP intra-node and PP inter-node, for two reasons:
• Communication volume: TP allreduces activations inside every layer, whereas PP only passes activations at stage boundaries, so TP moves far more data.
• Bandwidth: intra-node links (NVLink) are much faster than inter-node networking, so the heavy TP traffic should stay inside a node.

MGMN: allreduce optimization.
• Original: NCCL allreduce across the GPUs (GPU0-GPU7 in the slide's ring diagram); the NCCL allreduce usually takes up ~20% of the end-to-end pipeline.
• Optimization: use an optimized CUDA kernel for the allreduce, which can beat NCCL on the small, latency-bound messages typical of generation steps.
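For reference, the baseline reduction is a plain ncclAllReduce call (the NCCL API is real; the wrapper and names are illustrative):

```cuda
#include <cuda_fp16.h>
#include <nccl.h>

// Baseline: in-place sum of the partial hidden states across the
// tensor-parallel group after each attention/FFN block.
void tp_allreduce(half* hidden, size_t elems, ncclComm_t tp_comm,
                  cudaStream_t stream) {
    ncclAllReduce(hidden, hidden, elems, ncclHalf, ncclSum, tp_comm, stream);
}
```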
