Intelligent Edge Computing

Computing paradigm shifts
- Mainframe (centralized) -> personal computing (distributed) -> intelligent cloud (centralized) -> intelligent cloud + intelligent edge (distributed)

Distributed devices and data
- Data explosion from fast-growing edge devices, e.g., smart surveillance cameras and self-driving cars
- [Figure: data generated at the edge -- Smart Home: 50 GB/day; Smart Devices: 20B IoT devices; Smart City: 250 PB/day; Stadium: 200 TB/game; Connected Factory: 1 PB/day; People: 1.5 GB/day each; Smart Office: 150 GB/day; Autonomous Vehicle: 5 TB/day]

Strong needs for on-device intelligence
- Low latency
- High availability and reliability
- Strong privacy protection
- Low cost

Edge devices are becoming increasingly powerful
- Emerging high-performance, low-power, low-cost AI ASICs

The call for intelligence (DL) on the edge
- Affordable AI models tailored for diverse hardware
- Highly optimized software stack & efficient hardware for AI
- Security & privacy, model protection, explainable AI, debugging
- On-device, continuous, collaborative learning loop
- AI-empowered diverse devices and applications everywhere: empower every app & device with AI/DL

Innovations of the on-device DL stack
- AI chips: Edge TPU, VPU, NPU, KPU, HPU
- Efficient neural network (NN) design
- Edge NN frameworks
From NN design to model deployment
- NN design (manual design, NAS, pruning) explores a design space: # of layers, op structure, # of channels, ... under constraints (e.g., FLOPs)
- Model deployment applies framework optimizations -- e.g., op fusion (Conv+BN+ReLU into one kernel) and quantization/dequantization with re-quantize steps -- on CPU, GPU, DSP, TPU, NPU, ...
- Current NN design does not consider platform features: there is a gap between NN design and deployment
Does less FLOPs mean less latency?
- On the Edge TPU, MobileNetV3 (209 MFLOPs) runs in 4 ms at 74.7% accuracy, while MobileNetEdgeTPU (990 MFLOPs) runs in 3.6 ms at 75.6% accuracy
- Fewer FLOPs do not guarantee lower latency, but chasing them can harm model accuracy

Does a fast model run fast on every hardware?
- On a Cortex-A76 CPU, MobileNetV3 is 25% faster than MobileNetV2; on a VPU, MobileNetV2 is instead 71% faster than MobileNetV3
To Bridge Neural Network Design and Real-World Performance: A Behavior Study for Neural Networks (paper published at MLSys 2021)

Goal: a measurement study answering three questions
1. What behavior characteristics show an inconsistent latency response to changes in the OPs and memory accesses of a configuration in the design space?
2. What are the root causes of these unexpected characteristics?
3. What are the implications of these characteristics for efficient-NN design?

Measurement tool: profiling on 7 edge AI platforms
- Cortex CPU (TFLite), Adreno GPU (TFLite), DSP (SNPE), Edge TPU (TFLite), VPU (OpenVINO), NPU (RKNN), KPU (NNCASE)
- Pipeline: generate a single-block model in TF -> convert it to the target graph and precision -> profile on the target device -> collect timing results

Methodology: scale each NN design dimension
- Operator/block type (O): normal operators (Conv, FC, ...), elementwise ops (Add, Pooling, ...), activations (ReLU, Sigmoid, Swish, ...), and blocks (MobileNet/ShuffleNet block, ...)
- Kernel size (K): {1, 3, 5, 7}
- Stride (S): {1, 2}
- Height (H) / width (W): {3, ..., 224}
- # of Conv channels (Cin/Cout): {3, ..., 1000}
- Precision (P): INT8, FP16, ...
Finding 1: The latency of Conv increases in a step pattern, rather than linearly, with the number of output channels
- [Figure: latency vs. output channel number; input feature map 28x28, 320 input channels, 3x3 kernel, stride 1]

Do more Conv channels increase latency?
- Cause: input tensors are padded so the hardware's data-level parallelism is fully utilized -- the SIMD unit on the CPU, the vector unit on the DSP, SIMT on the GPU, etc.
- On the CPU, the matrix-multiplication implementation of Conv uses an [8,1]x[1,8] basic block, so the K^2 x Cin weight matrix and the H x W output feature map are padded to Cout = 8*n
- Implication: for potentially higher accuracy at no latency cost, keep the largest channel number within each latency step of the NN design space and skip the others (a short sketch follows the example below)
- E.g., channel number choices {6, 8, 10, 12, 14, 16, 18, 20, ...} reduce to the step boundaries; in MetaPruning, the channel search space shrinks from 30^14 to 4^14 (14 layers, 30 channel candidates per layer)
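To make the step pattern concrete, here is a minimal sketch assuming an 8-lane SIMD unit (matching the [8,1]x[1,8] basic block above); every channel count inside one step pads to the same effective count, hence the same latency:

    #include <stdio.h>

    /* round a channel count up to the next multiple of the SIMD width */
    static int padded(int c, int lanes) {
        return (c + lanes - 1) / lanes * lanes;
    }

    int main(void) {
        for (int c = 6; c <= 20; c += 2)  /* candidate output channels */
            printf("Cout=%2d -> computed as %2d\n", c, padded(c, 8));
        /* 10, 12, 14, 16 all compute as 16: keep 16, skip the rest */
        return 0;
    }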
Finding 2: The relative latency of a building block varies greatly across platforms
- [Figure: latency relative to the MobileNetV1 block vs. FLOPs and data size, for DenseBlock, MobileNetV2 block (+SE), and ShuffleNetV2 block on CPU, GPU, VPU, DSP, TPU, and KPU; the gap reaches 318.95x]

Does a building block have a similar relative latency on different NN platforms?
- Cause 1: the mismatch between computation and memory bandwidth is severe
- Snapdragon 855 (Mi 9): memory bandwidth 23 GFloat/s, CPU 22.7 GFLOP/s, GPU 508 GFLOP/s
- Data reuse rate (OPs per element of data accessed): ShuffleNet block 0.81, MobileNetV2 block 4.73, MobileNetV2 block + SE 7.58, DenseBlock 44.51
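A roofline-style check with the numbers above shows why the reuse rates matter: on this GPU a block can be compute-bound only if its data reuse rate exceeds

\[ \frac{\text{peak compute}}{\text{memory bandwidth}} \approx \frac{508\ \text{GFLOP/s}}{23\ \text{GFloat/s}} \approx 22 \ \text{ops per element}, \]

so only DenseBlock (44.51) can keep the GPU's ALUs busy; the other blocks are memory-bound there, which is why their latencies do not track their FLOPs across platforms.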
- Cause 2: the support for non-Conv operators is weak on the NN platforms, except on the CPU
- Example: in the MobileNetV2 + SE block on the GPU, Global Pooling takes < 5% of the OPs but 71.7% of the latency -- Global Pooling is inefficient on the GPU
- [Figure: Squeeze-and-Excitation block (Global Pooling -> FC -> ReLU -> FC -> Sigmoid -> Multiply) alongside the 3x3 DWConv + BN + ReLU6 path, with each operator's share of the block's total OPs and latency]
- Implication: customize the set of candidate blocks in the NN design space for each platform
- [Figure: per-platform customized search spaces -- different subsets of candidate modules for CPU, GPU, and DSP]
Summary of major findings
- # of channels: the latency of Conv increases in a step pattern with the number of output channels
- Block: the relative latency of an NN block varies greatly across platforms
- Activation function: activation functions can have a big impact on latency, particularly Swish and HardSwish
- Kernel size: Conv latency increases much less with kernel size on AI accelerators than on the CPU
- Quantization: INT8 achieves > 11x speedup on the NPU, while the CPU achieves < 3.6x; moreover, INT8 can dramatically decrease the inference accuracy of various models
- General: considering general support, accuracy, and latency, the CPU is still a good choice for inference
How to get a good model?
- Efficient NN design must consider hardware characteristics
- [Figure: efficient NN design for diverse edge hardware -- profiling and modeling yield HW-specific predictors of latency and energy for Edge TPU, VPU, HPU, NPU, and KPU; NN design (manual design, NAS, pruning) searches the design space (# of layers, op structure, channels, ...) under constraints (e.g., FLOPs, latency, energy); the resulting models are deployed to the target hardware]
nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices (paper published at MobiSys 2021, Best Paper Award; evaluated on Cortex CPU, Adreno GPU, and VPU)

Existing work on latency prediction
- FLOPs-based prediction. Pros: very simple. Cons: FLOPs is not a direct metric of inference latency
- Operator-level prediction. Pros: built on stable primitive operators (conv2d, pooling, activations, ...). Cons: unaware of graph-level optimizations
- Model-level prediction. Pros: learns graph-level optimizations automatically. Cons: cannot generalize to unseen model structures
nn-Meter: build an accurate latency predictor
- Takes graph-level optimizations into consideration
- Generalizes to unseen models

Challenge: framework optimizations
- Backend-independent optimizations: constant folding, common subexpression elimination, ...
- Backend-dependent optimizations: operator fusion, ...
- [Figure: a designed model passes through backend-independent and then backend-dependent optimizations before reaching a backend, e.g., CPU backend 1 (the Eigen library), CPU backend 2 (the NNPACK library), a GPU backend (OpenCL), or a Movidius backend]

Operator fusion has a great impact on inference latency.
Model graph: Conv -> Activation. The backend can implement this as two kernels or as one fused kernel:

    /* unfused: one kernel per operator */
    _kernel conv_2d_1x1() {
        for (i = 0; i < out.row; i++)
            for (j = 0; j < out.col; j++)
                for (cout = 0; cout < out.chan; cout++)
                    for (cin = 0; cin < in.chan; cin++)
                        out[i][j][cout] += in[i][j][cin] * filter[cout][cin];
    }

    /* the activation kernel then re-reads the whole tensor from memory */
    _kernel active() {
        for (i = 0; i < out.row; i++)
            for (j = 0; j < out.col; j++)
                for (c = 0; c < out.chan; c++)
                    out[i][j][c] = active(in[i][j][c]);
    }

    /* operator fusion: Conv + Activation in one kernel -- each output is
       activated while still hot, saving a full pass over memory */
    _kernel conv_2d_1x1_active() {
        for (i = 0; i < out.row; i++)
            for (j = 0; j < out.col; j++)
                for (cout = 0; cout < out.chan; cout++) {
                    for (cin = 0; cin < in.chan; cin++)
                        out[i][j][cout] += in[i][j][cin] * filter[cout][cin];
                    out[i][j][cout] = active(out[i][j][cout]);
                }
    }

[Figure: impact of operator fusion on MobileNetV2 inference latency]
nn-Meter: kernel-level latency prediction
- Problems: how to detect kernels (kernel detection), and how to predict each kernel's latency accurately (adaptive data sampling)?
- Pipeline: model -> kernel detector -> kernel latency predictors -> sum of kernel latencies
- Kernel: the basic execution unit on a device; it can be a single operator or a fusion of multiple operators
- Divide the whole model into kernels and predict at the kernel level; the model latency is the sum of all kernel latencies (see the sketch below)
nn-Meter tech #1: automatic kernel detector
- Fusion rule detection for black-box devices
- A set of test cases: for every two operators op1 and op2, generate 3 graphs (op1 alone, op2 alone, op1 -> op2) and compare the measured latencies $T_{op1}$, $T_{op2}$, and $T_{op1,op2}$
- op1 and op2 are judged fusible if

  $T_{op1} + T_{op2} - T_{op1,op2} > \alpha \cdot \min(T_{op1}, T_{op2})$
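As a sketch, the test reduces to one comparison per operator pair (the threshold value used here is only an assumption):

    #include <stdbool.h>
    #include <stdio.h>

    /* op1 and op2 are fused by the backend if running them back-to-back is
       clearly cheaper than the sum of their standalone latencies */
    static bool fusible(double t_op1, double t_op2, double t_joint, double alpha) {
        double saved = t_op1 + t_op2 - t_joint;
        double min_t = t_op1 < t_op2 ? t_op1 : t_op2;
        return saved > alpha * min_t;
    }

    int main(void) {
        /* e.g., conv at 2.0 ms, relu at 0.3 ms, conv->relu measured at 2.05 ms */
        printf("fusible: %d\n", fusible(2.0, 0.3, 2.05, 0.5));
        return 0;
    }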
- Kernel search by the fusion rules: apply the detected fusion rules to find the maximal fused operators in a target model (e.g., on a ResNet18 block)
nn-Meter tech #2: adaptive data sampler
- Kernel-latency prediction is challenging: the sample space is large -- collected from 24 widely used CNN models in the PyTorch model zoo, Conv alone has about 1x10^9 configurations to sample
- Latency is non-linear on edge devices, so random sampling misses crucial data points
- Idea: sample the most beneficial data (kernel configurations) instead of sampling at random
- Sample configurations that are likely to be considered in model design, using a prior probability distribution learned from the model zoo
- Then apply fine-grained sampling around data with inaccurate predictions
- Pipeline: prior probability distribution -> regression model -> fine-grained data sampler, iterating over (1) configurations considered in model design and (2) data with large prediction errors (see the sketch below)
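A self-contained toy of the two sampling passes in one dimension (the real sampler works on multi-dimensional kernel configurations and a learned regressor; the functions below are stand-ins):

    #include <stdio.h>
    #include <math.h>

    /* stand-in for on-device profiling: step-shaped latency (cf. Finding 1) */
    static double measure(double c) { return ceil(c / 8.0); }

    /* stand-in for the regression model: a smooth linear fit */
    static double predict(double c) { return c / 8.0; }

    int main(void) {
        /* pass 1: configurations drawn from the prior (common channel counts) */
        double prior[] = {10, 20, 40, 60, 64, 100};
        for (int i = 0; i < 6; i++) {
            double err = fabs(predict(prior[i]) - measure(prior[i]));
            if (err > 0.5) {
                /* pass 2: fine-grained sampling around the inaccurate point */
                for (double c = prior[i] - 4; c <= prior[i] + 4; c += 2)
                    printf("fine-sample c=%3.0f -> %.0f ms\n", c, measure(c));
            }
        }
        return 0;
    }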
nn-Meter evaluation
- Prediction accuracy: 99.0% (CPU), 99.1% (Adreno 640 GPU), 99.0% (Adreno 630 GPU), and 83.4% (Intel VPU)
- Generalization performance on unseen model graphs, compared against FLOPs, FLOPs+MAC, and BRP-NAS (GCN) baselines
- On average, nn-Meter achieves 89.2% accuracy, significantly better than FLOPs (22.1%), FLOPs+MAC (17.1%), and BRP-NAS (8.5%)
We got a good model. How does it run on real devices?

Are computing resources fully utilized?
- [Figure: average CPU usage during CNN inference on an ARM big.LITTLE CPU, big core vs. little core (labels: 90%, 30%); Adreno GPU ALU utilization for CNN (label: 84%)]
- Low hardware utilization results in poor inference speed
AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs (paper published at MobiCom 2021)

Why is utilization low on the CPU?
- Unbalanced task distribution by the OS, both across and within core clusters
- [Figure: computation tasks queued unevenly on the big-core cluster (B0-B3) and the little-core cluster (L0-L3)]
Execution flow of matrix multiplication
1) Block partition for parallelism: the params matrix is split into mc x kc blocks and the feature-map matrix into kc x nc blocks
2) Copy the blocks into contiguous memory (a redundant data copy for the params on every inference)
3) Schedule the tasks to the thread queues Q0, Q1, ..., Q# of a thread pool

Why is distribution unbalanced on the CPU? This flow (sketched below)
- ignores hardware asymmetry between big and little cores
- ignores data locality
- ignores resource constraints
- ignores the interference-prone mobile environment
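A toy version of steps 1) and 3), with illustrative constants: one task per block, dealt round-robin to the queues -- the asymmetry-blind policy criticized above, since big and little cores drain their queues at very different speeds:

    #include <stdio.h>

    #define MC 64   /* block rows  */
    #define KC 64   /* block depth */

    int main(void) {
        int M = 256, K = 128, n_queues = 4, q = 0;
        for (int i = 0; i < M; i += MC)
            for (int k = 0; k < K; k += KC) {
                /* every mc x kc block becomes one task in some thread queue */
                printf("block(%d,%d) -> Q%d\n", i / MC, k / KC, q);
                q = (q + 1) % n_queues;
            }
        return 0;
    }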
AsyMo: optimize DL inference on big.LITTLE CPUs
- Goal: accelerate edge DL inference with lower energy cost
- One-run initialization per CNN/RNN model: cost-model-directed block partition, data-reuse-based frequency setting, and a prearranged memory layout for the params
- At inference time: the partition strategy and memory handles drive an asymmetry-aware scheduler and an intra-op thread pool (tasks carry thread IDs) at the chosen efficient frequency

Cost-model-based block partition
- Cost for a task: computation + memory access; other costs: the unparallelized part, task scheduling, and the framework
- Cost for the parallel calculation: the number of parallel tasks times the cost of a sequential unit, divided by the degree of parallelism; in total, roughly

  $\mathrm{Cost}_{total} = \dfrac{N_{task} \cdot \mathrm{Cost}_{seq}}{DoP} + \mathrm{Cost}_{unparallel} + \mathrm{Cost}_{schedule} + \mathrm{Cost}_{framework}$

  where $\mathrm{Cost}_{seq}$ is the computation plus memory-access cost of one sequential unit (a toy search over this model follows)
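A hedged sketch of picking the block size with such a cost model (the weights are toy constants; AsyMo fits them from profiled hardware parameters and also bounds blocks by cache capacity):

    #include <stdio.h>

    #define T_COMP  0.125   /* cost per multiply-accumulate */
    #define T_MEM   0.5     /* cost per element moved       */
    #define T_SCHED 50.0    /* cost per scheduled task      */

    int main(void) {
        const double M = 1024, K = 512, N = 1024, dop = 4; /* 4 big cores */
        double best = 1e30; int best_mc = 0, best_nc = 0;

        for (int mc = 32; mc <= 256; mc *= 2)
            for (int nc = 32; nc <= 256; nc *= 2) {
                double ntask = (M / mc) * (N / nc);
                double seq = mc * nc * K * T_COMP                 /* computation   */
                           + (mc * K + K * nc + mc * nc) * T_MEM; /* memory access */
                double total = ntask * seq / dop + ntask * T_SCHED;
                if (total < best) { best = total; best_mc = mc; best_nc = nc; }
            }
        printf("chosen block: mc=%d, nc=%d\n", best_mc, best_nc);
        return 0;
    }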
Optimized execution flow of matrix multiplication (AsyMo)
- One-run initialization: cost-model-directed block partition (with separate N_big and N_little column splits for the big- and little-core clusters) and the params layout
- Inference run: copy features, then schedule and run the tasks
- Each thread is pinned to a core; there is no work stealing from big to little cores; data locality is better
AsyMo vs. TensorFlow on a Kirin 970 running Android 9 Pie
- Both at max CPU frequency: 1.85x performance and 1.33x energy efficiency relative to TF
- TensorFlow at the OS (schedutil) frequency setting vs. AsyMo at its picked efficient CPU frequency: 1.72x performance and 1.63x energy efficiency relative to TF
- Pre-copied params enable the parallel implementation; both total performance and energy improve
Sparseflow: unleash the full potential of sparsity in deep learning (joint work with Chen Zhang et al.)

Today's DNN models are huge
- GPT-3: 175B parameters, ~$12M training cost
- MT-NLG: 530B parameters, trained on 560 DGX A100 servers

Computation is the engine behind AI's success -- and we still need more
- [Figure: performance (op/sec) from 1960 to 2019 -- Moore's law gave CPUs ~10^8x (ENIAC at 5 Kops to Xeon E5 at ~500 Gops); dedicated hardware added ~10^5x (V100 at 125 Tops, TPUv1 at 90 Tops, TPUv3 at 360 Tops)]

Piling up hardware is not sustainable: the energy-efficiency wall
- [Figure: giga-operations per Joule, 1995-2020 -- CPUs, GPUs, and TPUs each hit an energy-efficiency wall; neither Moore's law nor dedicated hardware keeps scaling]
Sparsity is the key to the human brain's efficiency
- We do not look at everything in our visual scope
- Simple geometric shapes are enough for us to recognize a cat
Weight pruning (Han, Song, et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS'15)
- Prune away small weights, turning dense MxV into SpMxV over unstructured sparse matrices -- which is difficult to accelerate
- Fine-grained / irregular pruning. Pros: high model accuracy, high compression ratio. Cons: irregular pattern, difficult to accelerate
- Coarse-grained / regular pruning. Pros: regular pattern, easy to accelerate. Cons: low model accuracy, low compression ratio
- In short, an accuracy vs. speedup trade-off
How to achieve both? (S. Cao et al., "Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity", FPGA'19)
- For model accuracy: add few constraints on the sparsity pattern
- For speedup: partition the matrix for parallel computing and eliminate irregular computation and memory accesses
Bank-Balanced Pruning
- Split each dense matrix row into equal-size banks and traverse all rows, applying fine-grained pruning inside each bank
- Use a threshold percentage so as to obtain an identical sparsity ratio among banks
- [Figure: a dense matrix (rows such as 0.8, 1.5, 1.0, -1.4, 2.0, 0.9, -1.3, 2.1) after bank splitting and per-bank pruning]
- Bank partitioning enables parallel computing, while fine-grained pruning inside each bank maintains accuracy: Bank-Balanced Sparsity (BBS); a pruning sketch follows
Sparse matrix-vector multiplication (SpMxV) with BBS
- [Figure: a dense vector V0-V11 split into banks 0-3, and BBS matrix rows (elements A-P) with the same number of non-zeros in every bank]
- Both inter-row and inter-bank parallelism
- Load balancing across rows and banks
- Conflict-free accesses to the dense vector
Our CSB (Compressed Sparse Banks) format
- Data rearrangement for inter-bank parallelization: the non-zero values are stored grouped by bank, together with bank-internal indices that map directly to physical BRAM addresses
- Specifically designed for BBS to eliminate decoding overheads (a sketch follows)
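A hedged sketch of one BBS row times a dense vector (values and indices are illustrative, not the slide's): since every bank holds the same number of non-zeros and indexes only its own slice of the vector, the per-bank products can run in parallel with no access conflicts:

    #include <stdio.h>

    #define BANKS   4
    #define NNZ     2   /* non-zeros per bank (balanced)  */
    #define BANKLEN 4   /* dense-vector elements per bank */

    int main(void) {
        float v[BANKS * BANKLEN] =
            {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
        /* CSB-style storage: values grouped by bank + bank-internal indices */
        float val[BANKS][NNZ] = {{1, 2}, {3, 4}, {5, 6}, {7, 8}};
        int   idx[BANKS][NNZ] = {{0, 2}, {1, 3}, {0, 1}, {2, 3}};

        float y = 0;
        for (int b = 0; b < BANKS; b++)      /* independent lanes on the FPGA */
            for (int k = 0; k < NNZ; k++)
                y += val[b][k] * v[b * BANKLEN + idx[b][k]];
        printf("row dot product = %.1f\n", y);
        return 0;
    }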
Accelerator overview
- [Figure: FPGA SpMxV processing element with multipliers and an adder tree, element-wise op and activation units, a controller with an instruction buffer, DMA, a private vector buffer, vector and matrix memories, DRAM and PCIe controllers, off-chip DRAM, and the host server]
Results
- Speech recognition on the TIMIT dataset and language modeling on the PTB dataset: the pruned models' accuracy stays very close to the baseline
- Hardware efficiency improves by ~34x and ~7x
SeerNet: Predicting CNN Feature-Map Sparsity through Low-Bit Quantization (S. Cao et al., CVPR'19)
- Feature maps are sparsified by ReLU, $y = \max(0, x)$, and by max-pooling, $y = \max(x_i \mid i = 1, \dots, n)$
- [Figure: a CNN pipeline (convolution of W and F -> ReLU or max-pooling -> Conv -> ... -> Softmax over classes cat/dog/pig/cow/boy); e.g., the pre-activation map (1, -1, -5, 2 / -3, 2, -3, -6 / 5, -4, 2, 4 / 7, 6, -1, -2) becomes (1, 0, 0, 2 / 0, 2, 0, 0 / 5, 0, 2, 4 / 7, 6, 0, 0) after ReLU]
- Accelerate model inference by exploiting feature-map sparsity: a cheap low-bit (quantized) pass predicts which outputs survive ReLU or max-pooling, so full-precision computation is spent only there (sketch below)