
Intelligent Edge Computing

Computing paradigm shifts: from the mainframe (centralized), to personal computing (distributed), to the intelligent cloud (centralized), and now to intelligent cloud + intelligent edge (distributed).

Distributed devices and data

- Smart home: 50 GB per day
- Smart devices: 20B IoT devices
- Smart city: 250 PB per day
- Stadium: 200 TB per game
- Connected factory: 1 PB per day
- People: 1.5 GB per day
- Smart office: 150 GB per day
- Autonomous vehicle: 5 TB per day

Data explosion from fast-growing edge devices, e.g., smart surveillance cameras and self-driving cars.

The call for intelligence (DL) on the edge

- Strong needs for on-device intelligence: low latency; high availability and reliability; strong privacy protection; low cost.
- Edge devices are becoming increasingly powerful: emerging high-performance, low-power, low-cost AI ASICs.
- Intelligent cloud + intelligent edge.

Empower every app & device with AI/DL

- Affordable AI models tailored for diverse hardware.
- Highly-optimized software stack & efficient hardware for AI (AI chips: Edge TPU, VPU, NPU, KPU, HPU).
- Security & privacy, model protection, explainable AI, debugging.
- On-device, continuous, collaborative learning loop.
- AI-empowered diverse devices and applications everywhere.

Innovations of the on-device DL stack: efficient neural network (NN) design and edge NN frameworks.

The gap between NN design and deployment

- NN design (manual design, NAS, pruning) explores a design space (# of layers, operator structure, # of channels, ...) under proxy constraints (e.g., FLOPs).
- Model deployment then applies framework optimizations, e.g., operator fusion (Conv + BN + ReLU) and re-quantization between quantized operators, on diverse backends (CPU, GPU, DSP, TPU, NPU, ...).
- Current NN design does not consider these platform features: there is a gap between NN design and deployment.

Does less FLOPs mean less latency?

On the Edge TPU:
- MobileNetV3 (209 MFLOPs): latency 4 ms, model accuracy 74.7%.
- MobileNetEdgeTPU (990 MFLOPs): latency 3.6 ms, model accuracy 75.6%.

Fewer FLOPs do not imply lower latency, but chasing fewer FLOPs can harm model accuracy.

Does a fast model run fast on every hardware?

- On a Cortex-A76 CPU, MobileNetV3 is 25% faster than MobileNetV2.
- On a VPU, MobileNetV2 is 71% faster than MobileNetV3.

To Bridge Neural Network Design and Real-World Performance: A Behavior Study for Neural Networks (paper published at MLSys 2021)

Goal: a measurement study to answer three questions:
1. What behavior characteristics show an inconsistent latency response to changes in the OPs and memory accesses of a configuration in the design space?
2. What are the root causes of these unexpected characteristics?
3. What are the implications of these characteristics for efficient-NN design?

Measurement methodology

Profiling on 7 edge AI platforms: Cortex CPU (TFLite), Adreno GPU (TFLite), DSP (SNPE), Edge TPU (TFLite), VPU (OpenVINO), NPU (RKNN), KPU (NNCASE).

Pipeline: generate a single-block model in TF → convert it to the target graph format and precision → profile it on the target device → collect timing results.

Covered design dimensions (the scaling of each NN design dimension):
- Operator/block type (O): normal operators (Conv, FC, ...); elementwise (Add, Pooling, ...); activations (ReLU, Sigmoid, Swish, ...); blocks (MobileNet/ShuffleNet blocks, ...)
- Kernel size (K): {1, 3, 5, 7}
- Stride (S): {1, 2}
- Height (H) / width (W): {3, ..., 224}
- # of Conv channels (Cin/Cout): {3, ..., 1000}
- Precision (P): INT8, FP16, ...

Finding 1: The latency of Conv increases in a step pattern, rather than linearly, with the number of output channels.

[Figure: latency (y-axis) vs. output channel number (x-axis); input feature map 28x28, 320 input channels, 3x3 kernel, stride 1.]

Do more Conv channels increase latency?

Cause: input tensors are padded so that the hardware's data-level parallelism is fully utilized (SIMD units on the CPU, vector units on the DSP, SIMT on the GPU, etc.). In the matrix-multiplication implementation of Conv, the input feature map (HW × K²·Cin), the convolution kernel (K²·Cin × Cout), and the output feature map (HW × Cout) are all padded; with the CPU's [8,1]×[1,8] SIMD basic block, Cout is padded up to the next multiple of 8 (8·n).
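To make the step concrete, here is a back-of-the-envelope sketch (illustrative code, not the paper's measurement harness): with an 8-wide basic block, every Cout within the same step pads to the same effective width, so their Conv latency is identical.

    #include <stdio.h>

    /* With an [8,1]x[1,8] SIMD basic block, the kernel matrix is padded
     * so that Cout becomes a multiple of 8; all Cout values inside one
     * latency step therefore trigger the same amount of work. */
    static int padded_cout(int cout, int simd_width) {
        return (cout + simd_width - 1) / simd_width * simd_width;
    }

    int main(void) {
        for (int cout = 121; cout <= 136; cout++)
            printf("Cout=%3d -> padded to %3d\n", cout, padded_cout(cout, 8));
        return 0; /* 121..128 all pad to 128; 129..136 all pad to 136 */
    }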

Implication: for potentially higher accuracy, keep only the largest channel count in each latency step of the NN design space and skip the others. For example, the previous channel choices {..., 6, 8, 10, 12, 14, 16, 18, 20, ...} reduce to one choice per step; in MetaPruning this shrinks the channel search space from 30^14 to 4^14 (14 layers, 30 channel candidates per layer).
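The arithmetic behind that example, as a quick check (pow() standing in for exact integer powers):

    #include <stdio.h>
    #include <math.h>

    /* Reducing per-layer channel candidates from 30 to 4 shrinks the
     * 14-layer search space by roughly twelve orders of magnitude. */
    int main(void) {
        printf("30^14 = %.3g configurations\n", pow(30, 14)); /* ~4.8e20 */
        printf(" 4^14 = %.3g configurations\n", pow(4, 14));  /* ~2.7e8  */
        return 0;
    }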

Finding 2: The relative latency of a building block varies greatly on different platforms.

[Figure: latency relative to MobileNetV1 vs. FLOPs and data size for DenseBlock, MobileNetV2 block + SE, MobileNetV2 block, and ShuffleNetV2 block, across CPU, GPU, VPU, DSP, TPU, and KPU; the worst case reaches 318.95×.]

Does a building block have similar relative latency on different NN platforms?

Cause: the mismatch between computation throughput and memory bandwidth is severe. On the Snapdragon 855 (Mi 9): memory bandwidth ≈ 23 GFloat/s, CPU ≈ 22.7 GFLOP/s, GPU ≈ 508 GFLOP/s.

Data reuse rate (operations per data element accessed): ShuffleNet block 0.81, MobileNetV2 block 4.73, MobileNetV2 block + SE 7.58, dense block 44.51.
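A roofline-style reading of these numbers (a sketch using the slide's figures; the "ridge point" framing is ours): a block whose data reuse rate falls below peak compute divided by memory bandwidth is memory-bound.

    #include <stdio.h>

    int main(void) {
        const double gpu_gflops  = 508.0;  /* Adreno GPU peak, GFLOP/s   */
        const double mem_gfloats = 23.0;   /* memory bandwidth, GFloat/s */
        const double ridge = gpu_gflops / mem_gfloats;  /* ~22.1 */

        const char  *block[] = {"ShuffleNet", "MobileNetV2", "MobileNetV2+SE", "DenseBlock"};
        const double reuse[] = {0.81, 4.73, 7.58, 44.51};
        for (int i = 0; i < 4; i++)
            printf("%-15s reuse=%5.2f -> %s-bound on the GPU\n", block[i],
                   reuse[i], reuse[i] < ridge ? "memory" : "compute");
        return 0;
    }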

Another cause: support for non-Conv operators is weak on the NN platforms, except on the CPU. In the Squeeze-and-Excitation (SE) block (global pooling → FC → ReLU → FC → sigmoid → multiply, attached after a 3x3 DWConv + BN + ReLU6), pooling takes < 5% of the block's OPs but > 70% of its time: in the MobileNetV2 + SE block on the GPU, global pooling accounts for < 5% of total OPs yet 71.7% of latency. Global pooling is inefficient on the GPU.

Implication: customize the set of candidate blocks in the NN design space for each platform, i.e., a customized search space per backend (CPU, GPU, DSP, ...).

Summary of major findings

- # of channels: the latency of Conv increases in a step pattern with the number of output channels.
- Blocks: the relative latency of an NN block varies greatly across platforms.
- Activation functions: activations can have a big impact on latency, particularly Swish and HardSwish.
- Kernel size: Conv latency increases much less with kernel size on AI accelerators than on the CPU.
- Quantization: INT8 on the NPU achieves > 11× speedup, while the CPU achieves < 3.6×; however, INT8 can dramatically decrease the inference accuracy of various models.
- General: considering general support, accuracy, and latency, the CPU is still a good choice for inference.

How to get a good model? Efficient NN design must consider hardware characteristics.

Efficient NN design for diverse edge hardware: profiling and modeling produce HW-specific predictors of latency and energy for each target (Edge TPU, VPU, HPU, NPU, KPU); NN design (manual design, NAS, pruning) then searches the design space (# of layers, operator structure, # of channels, ...) under constraints (e.g., FLOPs, latency, energy), and the resulting models are deployed to the target hardware.

nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices (paper published at MobiSys 2021, Best Paper Award). Targets include the Cortex CPU, Adreno GPU, and VPU.

Existing work on latency prediction

- FLOPs-based prediction. Pros: very simple. Cons: FLOPs are not a direct metric of inference latency.
- Operator-level prediction. Pros: built on stable primitive operators (conv2d, pooling, activations, ...). Cons: unaware of graph-level optimizations.
- Model-level prediction. Pros: learns graph-level optimizations automatically. Cons: cannot generalize to unseen model structures.

nn-Meter: build an accurate latency predictor that takes graph-level optimizations into consideration and generalizes to unseen models.

predictionBackend-independent

opt.Constant

foldingCommonexpression

elimination...Backend-dependent

opt.Operatorfusion...DesignedmodelBackendindependent

opt.Backenddependent

opt.CPUbackend1(egEigen

lib.)…CPUbackend2(egNNPack

lib.)GPUbackend1(eg

OpenCL)MovidiusbackendChallenge:framework

optimizationsOperatorfusionhasagreatimpactoninference

latencyConvActive_kernelconv_2d_1x1()

{for(i=0;i<out.row;i++)for(j=0;j<out.col;j++)for(cout=0;cout<out.chan;cout++)for(cin=0;cin<in.chan;cin++)out[i][j][cout]+=in[i][j][cin]*filter[cout][cin];

}Conv+Active_kernelconv_2d_1x1_active()

{for(i=0;i<out.row;i++)for(j=0;j<out.col;j++)for(cout=0;cout<out.chan;cout++){for(cin=0;cin<in.chan;cin++)out[i][j][cout]+=in[i][j][cin]*filter[cout][cin];out[i][j][cout]=active(out[i][j][cout]);}

}Model

graphBackend

implementationOperator

fusion_kernelactive(){for(i=0;i<out.row;i++)for(j=0;j<out.col;j++)for(c=0;c<out.chan;c++)out[i][j][c]=active(in[i][j][c]);}MobileNetv2Impactofoperator

nn-Meter: kernel-level latency prediction

Kernel: the basic execution unit on a device; it can be a single operator or a fusion of multiple operators. nn-Meter divides the whole model into kernels and conducts kernel-level prediction; the model latency is the sum over all kernels:

    latency(model) = Σ_{k ∈ kernels} latency(k)

Pipeline: model → kernel detector → kernel latency predictors → sum of kernel latencies. Two problems to solve: how to detect kernels (kernel detection), and how to predict accurately for each kernel (adaptive data sampling).
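In code form, the final prediction step is just a sum over the detected kernels (a minimal sketch; the struct and values are made up for illustration):

    /* Sum of per-kernel predictions = predicted model latency. */
    typedef struct { const char *kernel; double predicted_ms; } KernelLatency;

    static double predict_model_latency(const KernelLatency *k, int n) {
        double total_ms = 0.0;
        for (int i = 0; i < n; i++)
            total_ms += k[i].predicted_ms;  /* e.g. {"conv-bn-relu", 1.2} */
        return total_ms;
    }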

nn-Meter tech #1: automatic kernel detector

Fusion rule detection for black-box devices: for every pair of operators, generate 3 test graphs (op1 alone, op2 alone, and op1 followed by op2) and compare the measured latencies T_op1, T_op2, and T_(op1,op2). Op1 and op2 are considered fusible if

    T_op1 + T_op2 − T_(op1,op2) > α · min(T_op1, T_op2)
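The test itself is a one-liner once the three latencies are measured (a sketch; alpha is the empirical threshold α from the rule above):

    #include <math.h>
    #include <stdbool.h>

    /* op1 and op2 fuse if running them as one graph saves significantly
     * more time than the cheaper of the two operators costs on its own. */
    static bool is_fusible(double t_op1, double t_op2, double t_op1_op2,
                           double alpha) {
        double saved = t_op1 + t_op2 - t_op1_op2;
        return saved > alpha * fmin(t_op1, t_op2);
    }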

Kernel search by the fusion rules: apply the detected fusion rules to find the maximal groups of fused operators in the target model (e.g., walking through a ResNet18 block).

Kernel-latency prediction: challenges

- Large sample space: e.g., even restricted to configurations collected from 24 widely used CNN models in the PyTorch model zoo, Conv has about 1×10^9 configurations to sample.
- Non-linear latency on edge devices: random sampling misses crucial data points.

nn-Meter tech #2: adaptive data sampler

Sample the most beneficial kernel configurations instead of sampling at random:
1. Sample configurations that are likely to be considered in model design, using a prior probability distribution learned from the model zoo.
2. Do fine-grained sampling around data points with inaccurate predictions.

The loop: draw data from the prior distribution and measure its latency → train the regression model → let the fine-grained sampler add (1) configurations considered in model design and (2) data with large prediction errors.
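A toy illustration of the error-driven half of this loop (purely illustrative, not nn-Meter's implementation): the simulated "device" has the step latency of Finding 1, which a naive linear predictor gets wrong near each step edge, so that is exactly where the sampler adds points.

    #include <stdio.h>
    #include <math.h>

    static double device_latency(int cout) { return ceil(cout / 8.0); } /* ground truth */
    static double linear_predict(int cout) { return cout / 8.0; }       /* naive fit    */

    int main(void) {
        int worst = 1;
        double worst_err = 0.0;
        for (int cout = 1; cout <= 256; cout++) {
            double err = fabs(device_latency(cout) - linear_predict(cout));
            if (err > worst_err) { worst_err = err; worst = cout; }
        }
        /* fine-grained sampling: measure densely around the worst config */
        for (int c = (worst > 2 ? worst - 2 : 1); c <= worst + 2; c++)
            printf("sample cout=%d -> latency %.2f\n", c, device_latency(c));
        return 0;
    }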

nn-Meter evaluation

- Prediction accuracy: 99.0% (CPU), 99.1% (Adreno 640 GPU), 99.0% (Adreno 630 GPU), and 83.4% (Intel VPU).
- Generalization to unseen model graphs, against the baselines FLOPs, FLOPs+MAC, and BRP-NAS (GCN): on average, nn-Meter achieves 89.2% prediction accuracy, significantly better than FLOPs (22.1%), FLOPs+MAC (17.1%), and BRP-NAS (8.5%).

With nn-Meter's HW-specific latency and energy predictors plugged into the design loop above, efficient NNs can be designed for diverse edge hardware.

We got a good model. How does it run on real devices?

devices?100%80%60%40%20%0%AverageCPU

usageARMCPUutilization%forCNNBig

coreLittle

core30%90%100%80%60%40%20%0%AdrenoGPUALUutilization%for

CNN84%Lowhardwareutilizationresultsinpoorinference

speed.Arecomputingresourcesfully

AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs (paper published at MobiCom 2021)

Why is utilization low on the CPU? The OS distributes the computation tasks unevenly, both across and within the core clusters (big-core cluster B0-B3 vs. little-core cluster L0-L3).

Why is distribution unbalanced on the CPU? Consider the execution flow of matrix multiplication in a DL framework (a blocked-matmul sketch follows this list):

1. Block partition for parallelism: the M×K parameter matrix is split into mc×kc blocks, and the feature map into kc×nc blocks.
2. Copy blocks into contiguous memory (a redundant data copy of the params).
3. Schedule tasks to thread-pool queues (Q0, Q1, ..., Q#).

This flow ignores hardware asymmetry (in the partition and in scheduling), data locality (the redundant copy), resource constraints, and the interference-prone environment.
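Here is the shape of steps 1 and 3 in code (a simplified sketch; the tile sizes MC/KC/NC are illustrative placeholders, and real frameworks pack tiles into contiguous memory before this loop):

    /* One (i0, k0, j0) tile of C += A * B becomes one task for the
     * thread pool. AsyMo's point: how many such tasks exist, how big
     * they are, and which core cluster runs them must account for
     * big/little asymmetry. */
    enum { MC = 64, KC = 128, NC = 64 };

    static void matmul_tile(const float *A, const float *B, float *C,
                            int M, int K, int N, int i0, int k0, int j0) {
        for (int i = i0; i < i0 + MC && i < M; i++)
            for (int k = k0; k < k0 + KC && k < K; k++)
                for (int j = j0; j < j0 + NC && j < N; j++)
                    C[i * N + j] += A[i * K + k] * B[k * N + j];
    }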

AsyMo: optimize DL inference on big.LITTLE CPUs

Goal: accelerate edge DL inference at lower energy cost.

- One-run initialization (per CNN/RNN model): cost-model-directed block partition; data-reuse-based frequency setting; prearranged memory layout for the params.
- Inference: the partition strategy plus asymmetry-aware scheduling, the prearranged memory handles, the chosen efficient frequency, and an intra-op thread pool with task-to-thread mapping.

Cost-model-based block partition

- Cost of a task: computation cost + memory-access cost (Cost_seq for one sequential unit).
- Cost of the parallel calculation: number of parallel tasks × Cost_seq, divided by the degree of parallelism.
- Other costs: the unparallelized portion, task scheduling, and framework overhead.
- Total cost: roughly the parallel-calculation cost plus these other costs; AsyMo picks the partition that minimizes it (a sketch in code follows).

Optimized execution flow of matrix multiplication

- One-run initialization: the block partition and the params layout are computed once, with separate partitions of the M×K / K×N matrices sized for the big-core cluster and the little-core cluster.
- Inference run: copy features, then schedule and run the tasks, with each thread pinned to a core.
- No work stealing from big to little cores; better data locality.

Total performance and energy improvement

AsyMo vs. TensorFlow on Kirin 970 + Android 9 Pie:
- Both at max CPU frequency: 1.85× performance and 1.33× energy efficiency relative to TF.
- TensorFlow at the OS frequency setting (schedutil) vs. AsyMo at its picked energy-efficient CPU frequency: 1.72× performance and 1.63× energy efficiency.

Pre-copying the params also enables a parallel implementation.

[Figures: per-model performance and energy-efficiency results across 19 models.]

Sparseflow: unleash the full potential of sparsity in deep learning (joint work with Chen Zhang et al.)

Today's DNN models are huge:
- GPT-3: 175B parameters, ~$12M training cost.
- MT-NLG: 530B parameters, trained on 560 DGX A100 servers.

Computation is the engine behind AI's success, and we still need more.

[Figure: performance (op/s), 1960-2019. CPUs gained ~10^8× under Moore's law (ENIAC ~5 Kops → Xeon E5 ~500 Gops); dedicated hardware added another ~10^5× (V100: 125 Tops; TPU v1: 90 Tops; TPU v3: 360 Tops).]

Piling up hardware is not sustainable: the energy-efficiency wall.

[Figure: giga-operations per Joule, 1995-2020; CPUs, GPUs, and TPUs each run into an energy-efficiency wall, despite Moore's law and dedicated hardware.]

Sparsity is the key to the human brain's efficiency: we do not look at everything in our visual scope, and simple geometric shapes are enough for us to recognize a cat.

Weight pruning (Han, Song, et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS'15): prune away small weights, turning dense MxV into SpMxV on unstructured sparse matrices, which is difficult to accelerate.

Accuracy and speedup trade-off:
- Fine-grained / irregular sparsity. Pros: high model accuracy, high compression ratio. Cons: irregular pattern, difficult to accelerate.
- Coarse-grained / regular sparsity. Pros: regular pattern, easy to accelerate. Cons: low model accuracy, low compression ratio.

accelerateFine-grained/IrregularCoarse-grained/ReAccuracyandSpeedupTrade

offModel

accuracyAddfewconstraintsonthesparsitypatternSpeedupMatrixpartitioningforparallel

computingEliminatingirregularcomputationandmemory

accessS.Caoetal.,“EfficientandEffectiveSparseLSTMonFPGAwithBank-BalancedSparsity”,

FPGA’19.HowtoAchieve

Bank-balanced pruning: split each dense matrix row into equal-sized banks, traverse all rows, and apply fine-grained pruning inside each bank, thresholding by percentage so that every bank ends up with an identical sparsity ratio (a pruning sketch follows).

[Figure: a 1×16 dense matrix row split into banks and pruned into a bank-balanced sparse (BBS) row.]
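A minimal sketch of the pruning rule for one row (illustrative; the paper thresholds by percentage, which is equivalent to keeping the top-k magnitudes per bank):

    #include <math.h>

    /* Zero out the smallest-magnitude weights in one bank until only
     * `keep` nonzeros remain. */
    static void prune_bank(float *bank, int bank_size, int keep) {
        for (int removed = 0; removed < bank_size - keep; removed++) {
            int min_i = -1;
            float min_v = INFINITY;
            for (int i = 0; i < bank_size; i++)
                if (bank[i] != 0.0f && fabsf(bank[i]) < min_v) {
                    min_v = fabsf(bank[i]);
                    min_i = i;
                }
            if (min_i < 0) break;   /* bank already sparse enough */
            bank[min_i] = 0.0f;
        }
    }

    /* Every bank of every row keeps exactly `keep` nonzeros, so the
     * sparsity ratio is identical across banks by construction. */
    static void bank_balanced_prune_row(float *row, int n, int num_banks, int keep) {
        int bank_size = n / num_banks;  /* assumes n divisible by num_banks */
        for (int b = 0; b < num_banks; b++)
            prune_bank(row + b * bank_size, bank_size, keep);
    }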

Bank-Balanced Sparsity (BBS)

- Bank partitioning enables parallel computing; fine-grained pruning inside each bank maintains accuracy.
- For SpMxV, the dense vector (V0 ... V11) is split into the same banks (bank 0 ... bank 3), and every matrix row holds the same number of nonzeros per bank (e.g., row 0: A ... H and row 1: I ... P, two nonzeros per bank).
- This yields both inter-row and inter-bank parallelism, load balancing across rows and banks, and conflict-free vector accesses.

Our CSB (Compressed Sparse Banks) format

- Data is rearranged for inter-bank parallelization: the values are regrouped so that the k-th nonzero of every bank sits together (A C E G, then B D F H, ...), each with its bank-internal index.
- The bank-internal indices map directly to physical BRAM addresses.
- CSB is specifically designed for BBS to eliminate decoding overheads. A sketch of the layout follows.
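A sketch of a CSB-like layout and the SpMxV inner loop it enables (simplified from the paper's design; the sizes are fixed constants for clarity):

    enum { NUM_BANKS = 4, NNZ_PER_BANK = 2 };

    /* values[k] holds the k-th nonzero of every bank (A C E G, then
     * B D F H, ...); idx[k][b] is the offset *inside* bank b, which on
     * the FPGA maps directly to a physical BRAM address. */
    typedef struct {
        float         values[NNZ_PER_BANK][NUM_BANKS];
        unsigned char idx[NNZ_PER_BANK][NUM_BANKS];
    } CsbRow;

    /* Dot product of one BBS row with a dense vector split into the
     * same banks: the NUM_BANKS accesses per step touch different
     * vector banks, so in hardware they proceed in parallel without
     * conflicts. */
    static float bbs_row_dot(const CsbRow *row, const float *vec, int bank_size) {
        float sum = 0.0f;
        for (int k = 0; k < NNZ_PER_BANK; k++)
            for (int b = 0; b < NUM_BANKS; b++)
                sum += row->values[k][b] * vec[b * bank_size + row->idx[k][b]];
        return sum;
    }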

Accelerator overview

[Figure: FPGA accelerator: SpMxV PEs (multiplier arrays with adder trees), element-wise and activation units, a controller with an instruction buffer, DMA, private vector buffers, on-chip matrix memory (values and indices) and vector memory, DRAM and PCIe controllers, off-chip DRAM, and the host server.]

Evaluation

- Model accuracy: on speech recognition (TIMIT dataset) and language modeling (PTB dataset), BBS accuracy is very close to the dense baseline.
- Hardware efficiency: roughly 34× and 7× improvements over the compared baselines.

SeerNet: Predicting CNN Feature-Map Sparsity through Low-Bit Quantization (S. Cao et al., CVPR'19)

In a CNN pipeline (Convolution of weights W with feature map F → ReLU or max-pooling → Conv → ... → Softmax), ReLU (y = max(0, x)) zeroes all non-positive outputs, and max-pooling (y = max(x_i), i = 1, ..., n) keeps only one value per window, so much of each output feature map F' is never needed.

[Figure: an example feature map before and after ReLU/max-pooling, showing the induced zeros.]

SeerNet accelerates model inference by exploiting this feature-map sparsity: it predicts the sparsity with a low-bit quantized version of the convolution before running it in full precision.
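A minimal sketch of the idea (simplified to a per-output dot product; not the paper's implementation): a cheap low-bit pass predicts which outputs ReLU will keep, and full-precision compute is spent only on those.

    /* Low-bit prediction pass: only the sign of the result matters,
     * because ReLU zeroes non-positive outputs. */
    static int predict_kept_q8(const signed char *w_q, const signed char *x_q, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++)
            acc += w_q[i] * x_q[i];
        return acc > 0;
    }

    /* Full-precision compute, executed only for outputs predicted to
     * survive the ReLU. */
    static float full_dot(const float *w, const float *x, int n) {
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            acc += w[i] * x[i];
        return acc;
    }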
