Intelligent Edge Computing

Computing paradigm shifts: Mainframe (centralized) -> Personal computing (distributed) -> Intelligent cloud (centralized) -> Intelligent cloud + edge (distributed).

Distributed devices and data
[Figure: data produced at the edge. Smart Home: 50 GB/day; Smart Devices: 20B IoT devices; Smart City: 250 PB/day; Stadium: 200 TB/game; Connected Factory: 1 PB/day; People: 1.5 GB/day; Smart Office: 150 GB/day; Autonomous Vehicle: 5 TB/day.]
Data explosion from fast-growing edge devices, e.g., smart surveillance cameras and self-driving cars. Strong needs of on-device intelligence:
- Low latency
- High availability and reliability
- Strong privacy protection
- Low cost
Edge devices are becoming increasingly powerful, with emerging high-performance, low-power, low-cost AI ASICs: the intelligent cloud meets the intelligent edge.
The call for intelligence (DL) on the edge:
- Affordable AI models tailored for diverse hardware
- Highly-optimized software stack & efficient hardware for AI
- Security & privacy, model protection, explainable AI, debugging
- On-device, continuous, collaborative learning loop
- AI-empowered diverse devices and applications everywhere
Goal: empower every app & device with AI/DL. AI chips: Edge TPU, VPU, NPU, KPU, HPU, ...

Innovations of the on-device DL stack: efficient neural network (NN) design and edge NN frameworks.
- NN design (manual design, NAS, pruning). Design space: # of layers, op structure, channels, ... under constraints (e.g., FLOPs).
- Model deployment: framework optimizations, e.g., op fusion (Conv + BN + ReLU) and re-quantization (quantize/dequantize steps between ops), targeting CPU, GPU, DSP, TPU, NPU, ...

The gap between NN design and deployment: current NN design does not consider platform features.
Does less FLOPs mean less latency? On the Edge TPU:
- MobileNetV3: 209 MFLOPs, latency 4 ms, model accuracy 74.7%
- MobileNetEdgeTPU: 990 MFLOPs, latency 3.6 ms, model accuracy 75.6%
Fewer FLOPs do not imply lower latency, but chasing fewer FLOPs can harm model accuracy.

Does a fast model run fast on every hardware? On a Cortex-A76 CPU, MobileNetV3 is 25% faster than MobileNetV2; on a VPU, MobileNetV2 is 71% faster than MobileNetV3.

To Bridge Neural Network Design and Real-World Performance: A Behavior Study for Neural Networks
Paper published at MLSys 2021. Goal: a measurement study to answer three questions:
1. What behavior characteristics show an inconsistent latency response to changes in the OPs and memory accesses of a configuration in the design space?
2. What are the root causes of these unexpected characteristics?
3. What are the implications of these characteristics for efficient-NN design?
Measurement tool: profiling on 7 edge AI platforms: Cortex CPU (TFLite), Adreno GPU (TFLite), DSP (SNPE), Edge TPU (TFLite), VPU (OpenVINO), NPU (RKNN), KPU (NNCASE).
Methodology: generate a single-block model in TF -> convert to the target graph and precision -> profile on the target device -> collect timing results.
Covered design dimensions, scaling each NN design dimension:
- Operator/block type (O): normal operators (Conv, FC, ...); elementwise ops (Add, Pooling, ...); activations (ReLU, Sigmoid, Swish, ...); blocks (MobileNet/ShuffleNet blocks, ...)
- Kernel size (K): {1, 3, 5, 7}
- Stride (S): {1, 2}
- Height (H) / width (W): {3, ..., 224}
- # of Conv channels (C_in/C_out): {3, ..., 1000}
- Precision (P): INT8, FP16, ...

Finding 1: The latency of Conv increases in a step pattern rather than linearly with the number of output channels.
Do more Conv channels increase latency?
[Figure: Conv latency (y-axis) vs. number of output channels (x-axis); input feature map 28x28, input channels 320, kernel 3x3, stride 1. Latency rises in steps.]
Cause: the input tensors are padded to fully utilize the hardware's data-level parallelism (SIMD units on the CPU, vector units on the DSP, SIMT on the GPU, etc.). In the matrix-multiplication implementation of Conv, the (K^2 x C_in) x C_out kernel matrix and the (H x W) x (K^2 x C_in) input matrix are processed in [8,1] x [1,8] basic blocks, so C_out (and H x W) are padded up to the next multiple of 8. Every channel count inside the same padded step therefore costs the same, as the sketch below illustrates.
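A minimal C sketch of this effect, assuming a SIMD width of 8 as in the [8,1] x [1,8] basic block above (the numbers are illustrative):

#include <stdio.h>

/* Round x up to the next multiple of the SIMD width. */
static int pad_to(int x, int width) {
    return (x + width - 1) / width * width;
}

/* MACs the hardware actually executes for one Conv after padding. */
static long padded_macs(int h, int w, int cin, int cout, int k) {
    int cout_hw = pad_to(cout, 8);   /* assumed SIMD width of 8 */
    return (long)h * w * cout_hw * k * k * cin;
}

int main(void) {
    /* The configuration measured above: 28x28 map, 320 input channels,
       3x3 kernel, stride 1. C_out = 121..128 all cost the same. */
    for (int cout = 121; cout <= 128; cout++)
        printf("C_out=%3d -> MACs=%ld\n", cout,
               padded_macs(28, 28, 320, cout, 3));
    return 0;
}

Within a step, extra output channels are free; latency only jumps when C_out crosses a multiple of the SIMD width.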
Implication: for potentially higher accuracy, keep only the largest channel count of each latency step in the NN design space and skip the other choices (e.g., keep 8 and 16 but skip 6, 10, 12, 14, ...). In MetaPruning, with 14 layers and 30 channel candidates per layer, this shrinks the channel search space from 30^14 to 4^14.
Finding 2: The relative latency of a building block varies greatly on different platforms.
[Figure: latency of DenseBlock, MobileNetV2 block + SE, MobileNetV2 block, and ShuffleNetV2 block relative to MobileNetV1, on CPU, GPU, VPU, DSP, TPU, and KPU; the gap reaches 318.95x.]
Does a building block have similar relative latency on different NN platforms?
Cause: the mismatch between computation and memory bandwidth is severe, and support for non-Conv operators is weak on the NN platforms except the CPU. On the Snapdragon 855 (Mi 9): memory bandwidth 23 GFloat/s, CPU 22.7 GFLOP/s, GPU 508 GFLOP/s. Data reuse rate per block:
- ShuffleNet block: 0.81
- MobileNetV2 block: 4.73
- MobileNetV2 block + SE: 7.58
- Dense block: 44.51
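Reading these numbers roofline-style (a rough estimate that ignores caches): to keep the GPU compute-bound, a block needs a data reuse rate of at least $508 / 23 \approx 22$ FLOPs per float loaded. So the ShuffleNet (0.81), MobileNetV2 (4.73), and MobileNetV2+SE (7.58) blocks are all memory-bound on the GPU, and only the dense block (44.51) can keep its ALUs busy; on the CPU (22.7 GFLOP/s), a reuse rate of about 1 already suffices, which is why the CPU is far less sensitive to block choice.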
Cause (continued): pooling takes <5% of OPs but >70% of time. In the MobileNetV2 + SE block (3x3 DWConv, BN, ReLU6, plus a Squeeze-and-Excitation block: global pooling -> FC -> ReLU -> FC -> Sigmoid -> multiply), global pooling accounts for <5% of the block's total OPs but 71.7% of its latency on the GPU; global pooling is inefficient there.
Implication: customize the set of candidate blocks in the NN design space for each platform, i.e., a customized search space of modules per target (CPU, GPU, DSP, ...).
Summary of major findings:
- # of channels: the latency of Conv increases in a step pattern with the number of output channels.
- Block: the relative latency of an NN block varies greatly on different platforms.
- Activation function: activation functions can have a big impact on latency, particularly Swish and HardSwish.
- Kernel size: Conv latency increases much less with kernel size on AI accelerators than on the CPU.
- Quantization: INT8 on the NPU achieves >11x speedup, while the CPU only achieves <3.6x; INT8 can dramatically decrease the inference accuracy of various models.
- General: considering general support, accuracy, and latency, the CPU is still a good choice for inference.

How to get a good model? Efficient NN design must consider hardware characteristics.
[Diagram: efficient NN design for diverse edge hardware. Profiling and modeling each target (Edge TPU, VPU, HPU, NPU, KPU) yields HW-specific predictors of latency and energy; NN design (manual design, NAS, pruning) searches the design space (# of layers, op structure, channels, ...) under constraints (e.g., FLOPs, latency, energy); the resulting models go to model deployment on the target hardware.]

nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices. Paper published at MobiSys 2021 (Best Paper Award). Targets: Cortex CPU, Adreno GPU, VPU.
Existing work on latency prediction:
- FLOPs-based prediction. Pros: very simple. Cons: FLOPs are not a direct metric of inference latency.
- Operator-level prediction. Pros: built on stable primitive operators (conv2d, pooling, activations, ...). Cons: unaware of graph-level optimizations.
- Model-level prediction. Pros: learns graph-level optimizations automatically. Cons: cannot generalize to unseen model structures.
nn-Meter builds an accurate latency predictor that takes graph-level optimizations into consideration and keeps generalization ability.
Challenge: framework optimizations. A designed model first goes through backend-independent optimizations (constant folding, common subexpression elimination, ...), then backend-dependent optimizations (operator fusion, ...), before reaching a concrete backend: CPU backend 1 (e.g., the Eigen library), CPU backend 2 (e.g., the NNPACK library), GPU backend (e.g., OpenCL), Movidius backend, and so on. Operator fusion has a great impact on inference latency.
Model graph: Conv -> Activation. Unfused backend implementation:

_kernel conv_2d_1x1() {
    for (int i = 0; i < out.row; i++)
        for (int j = 0; j < out.col; j++)
            for (int cout = 0; cout < out.chan; cout++)
                for (int cin = 0; cin < in.chan; cin++)
                    out[i][j][cout] += in[i][j][cin] * filter[cout][cin];
}

_kernel active() {
    for (int i = 0; i < out.row; i++)
        for (int j = 0; j < out.col; j++)
            for (int c = 0; c < out.chan; c++)
                out[i][j][c] = active(in[i][j][c]);
}

With operator fusion, the activation is applied inside the Conv loop nest, saving one full pass over the output tensor:

_kernel conv_2d_1x1_active() {
    for (int i = 0; i < out.row; i++)
        for (int j = 0; j < out.col; j++)
            for (int cout = 0; cout < out.chan; cout++) {
                for (int cin = 0; cin < in.chan; cin++)
                    out[i][j][cout] += in[i][j][cin] * filter[cout][cin];
                out[i][j][cout] = active(out[i][j][cout]);
            }
}

[Figure: impact of operator fusion on MobileNetV2 latency.]
nn-Meter: kernel-level latency prediction. Problems: how to detect kernels (kernel detection), and how to predict accurately for each kernel (adaptive data sampling). Pipeline: model -> kernel detector -> kernel latency predictors -> sum of kernel latencies.
A kernel is the basic execution unit on a device; it can be a single operator or a fusion of multiple operators. nn-Meter divides a whole model into kernels and conducts kernel-level prediction: model latency is the sum of all kernel latencies, as sketched below.
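A minimal sketch of this pipeline (the types and the per-kernel regressor are hypothetical stand-ins, not nn-Meter's actual API):

#include <stddef.h>

/* A detected kernel: a single op or a fused group (e.g. conv+bn+relu). */
typedef struct {
    int kind;                          /* kernel type id */
    int h, w, cin, cout, ksize, stride;
} Kernel;

/* Hypothetical per-kernel latency regressor, in milliseconds. */
static double predict_kernel(const Kernel *k) {
    (void)k;
    return 0.0;  /* stand-in for a trained regressor's output */
}

/* Model latency = sum of the predicted latencies of its kernels. */
double predict_model_latency(const Kernel *kernels, size_t n) {
    double total_ms = 0.0;
    for (size_t i = 0; i < n; i++)
        total_ms += predict_kernel(&kernels[i]);
    return total_ms;
}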
nn-Meter tech #1: automatic kernel detector. Fusion rule detection for black-box devices uses a set of test cases: for every two operators op1 and op2, generate 3 graphs (op1 alone, op2 alone, and op1 -> op2), measure their latencies $T_{op1}$, $T_{op2}$, $T_{(op1,op2)}$, and compare the difference. op1 and op2 are judged fusible if

$T_{op1} + T_{op2} - T_{(op1,op2)} > \alpha \cdot \min(T_{op1}, T_{op2})$

A sketch of this test appears below.
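In C, the test is a one-liner (alpha is the empirical threshold from the rule above; its value is a tunable):

#include <stdbool.h>

/* t1, t2: measured latency of op1 and op2 run alone;
   t12:    measured latency of the combined graph op1 -> op2.
   Fusible when running back-to-back saves clearly more than noise. */
bool is_fusible(double t1, double t2, double t12, double alpha) {
    double saved = t1 + t2 - t12;
    double threshold = alpha * (t1 < t2 ? t1 : t2);
    return saved > threshold;
}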
Kernel search by the fusion rules: apply the detected fusion rules to find the maximal sets of fused operators in the target model. [Figure: kernel detection on a ResNet-18 block as an example.]
Kernel-latency prediction challenges:
- Large sample space: across 24 widely used CNN models from the PyTorch model zoo, Conv alone has 1x10^9 configurations to sample.
- Non-linear latency on edge devices: random sampling misses crucial data points.

nn-Meter tech #2: adaptive data sampler. Sample the most beneficial data (kernel configurations) instead of sampling randomly:
1. Sample configurations that are likely to be considered in model design, using a prior probability distribution learned from the model zoo.
2. Fine-grained sampling around data with large prediction errors.
The loop: prior probability distribution -> regression model fit on configurations and measured latencies -> fine-grained data sampler, as sketched below.
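A condensed sketch of that loop; every helper (prior sampler, device measurement, regressor) is a hypothetical prototype, and the real sampler iterates until the regressor's error converges:

#include <math.h>
#include <stddef.h>

typedef struct { int h, w, cin, cout, ksize, stride; } Config;

Config sample_from_prior(void);      /* prior learned from the model zoo */
Config perturb(Config c);            /* nearby config, fine-grained step */
double measure_latency(Config c);    /* run on the target device         */
double predict_latency(Config c);    /* current regression model         */
void   fit_regressor(const Config *x, const double *y, size_t n);

void adaptive_sample(Config *x, double *y, size_t n, int rounds) {
    for (size_t i = 0; i < n; i++) {          /* step 1: likely configs */
        x[i] = sample_from_prior();
        y[i] = measure_latency(x[i]);
    }
    for (int r = 0; r < rounds; r++) {
        fit_regressor(x, y, n);
        for (size_t i = 0; i < n; i++) {      /* step 2: refine where wrong */
            double err = fabs(predict_latency(x[i]) - y[i]) / y[i];
            if (err > 0.1) {                  /* large error: sample nearby */
                x[i] = perturb(x[i]);         /* replacement keeps the sketch
                                                 short; the real sampler grows
                                                 the training set instead */
                y[i] = measure_latency(x[i]);
            }
        }
    }
}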
nn-Meter evaluation. Prediction accuracy (fraction of predictions within a +-10% error bound): 99.0% (CPU), 99.1% (Adreno 640 GPU), 99.0% (Adreno 630 GPU), and 83.4% (Intel VPU). Generalization performance on unseen model graphs, compared against FLOPs, FLOPs+MAC, and BRP-NAS (GCN) baselines: on average, nn-Meter achieves 89.2%, significantly better than FLOPs (22.1%), FLOPs+MAC (17.1%), and BRP-NAS (8.5%).
We got a good model. How does it run on real devices? Are computing resources fully utilized?
[Figure: average CPU usage during CNN inference on an ARM big/little CPU (big-core vs. little-core utilization, 30% vs. 90%), and Adreno GPU ALU utilization for CNN (84%).]
Low hardware utilization results in poor inference speed.
AsyMo: Scalable and Efficient Deep-Learning Inference on Asymmetric Mobile CPUs. Paper published at MobiCom 2021.

Why is utilization low on the CPU? Unbalanced task distribution by the OS, both across and within core clusters: computation tasks are spread over the big-core cluster (B0-B3) and the little-core cluster (L0-L3) without regard to their asymmetric capabilities.
Why is distribution unbalanced on the CPU? Consider the execution flow of matrix multiplication in a typical framework:
1) Block partition for parallelism: the M x K params matrix is split into mc x kc blocks and the K x N feature map into kc x nc blocks.
2) The blocks are copied into contiguous memory.
3) Tasks are scheduled to the per-thread queues (Q0, Q1, ..., Q#) of the thread pool.
Problems: redundant data copies; the partition ignores hardware asymmetry and resource constraints; the scheduling ignores hardware asymmetry, data locality, and the interference-prone environment. A plain-C sketch of step 1) follows.
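The sketch below runs the partition serially; a real framework would push each (i0, j0) block pair into the thread queues as one task:

void matmul_blocked(int M, int N, int K,
                    const float *A,      /* M x K params, row-major      */
                    const float *B,      /* K x N feature map, row-major */
                    float *C,            /* M x N output, pre-zeroed     */
                    int mc, int nc, int kc) {
    for (int i0 = 0; i0 < M; i0 += mc)
        for (int j0 = 0; j0 < N; j0 += nc)       /* one task per (i0, j0) */
            for (int k0 = 0; k0 < K; k0 += kc) {
                int iend = i0 + mc < M ? i0 + mc : M;
                int jend = j0 + nc < N ? j0 + nc : N;
                int kend = k0 + kc < K ? k0 + kc : K;
                for (int i = i0; i < iend; i++)
                    for (int j = j0; j < jend; j++) {
                        float acc = C[i * N + j];
                        for (int k = k0; k < kend; k++)
                            acc += A[i * K + k] * B[k * N + j];
                        C[i * N + j] = acc;
                    }
            }
}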
AsyMo: optimize DL inference on big.LITTLE CPUs, accelerating edge DL inference at lower energy cost. One-run initialization per CNN/RNN model: cost-model directed block partition, data-reuse based frequency setting, and a prearranged memory layout for params. At inference time: the partition strategy plus asymmetry-aware scheduling over an intra-op thread pool (task-to-thread-ID mapping), the prearranged memory handles, and the chosen efficient frequency.
Cost-model-based block partition. The cost for a task is computation + memory access; Cost_seq denotes the computation and memory-access cost of one sequential unit. The cost of the parallel part is the number of parallel tasks x Cost_seq, divided by the degree of parallelism; the other costs are the unparallelized portion, task scheduling, and the framework overhead. The total cost is the sum of both parts, as sketched below.
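A sketch of how such a cost model can drive the partition choice; the cost terms mirror the slide, but all constants and the linear forms are illustrative, not AsyMo's fitted model:

typedef struct {
    double flops_per_sec;    /* cluster compute throughput */
    double bytes_per_sec;    /* cluster memory bandwidth   */
    int    n_cores;          /* degree of parallelism      */
} Cluster;

/* Cost_seq: computation + memory-access cost of one mc x kc x nc task. */
static double cost_seq(int mc, int nc, int kc, const Cluster *cl) {
    double compute = 2.0 * mc * nc * kc / cl->flops_per_sec;
    double memory  = 4.0 * ((double)mc * kc + (double)kc * nc
                            + (double)mc * nc) / cl->bytes_per_sec;
    return compute + memory;
}

/* Total = parallel part (tasks x Cost_seq / degree of parallelism)
   + unparallelized + scheduling + framework overheads. */
double cost_total(int M, int N, int K, int mc, int nc, int kc,
                  const Cluster *cl, double unparallel,
                  double sched_per_task, double framework) {
    double tasks = ((double)M / mc) * ((double)N / nc) * ((double)K / kc);
    double parallel = tasks * cost_seq(mc, nc, kc, cl) / cl->n_cores;
    return parallel + unparallel + tasks * sched_per_task + framework;
}

The partition search then simply keeps the (mc, nc, kc) with the lowest estimated total cost for each cluster.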
Optimized execution flow of matrix multiplication. One-run initialization: block partition (separate partitions sized for the big-core cluster, M x K x N_big, and the little-core cluster, M x K x N_little) and params layout. Inference run: copy features, then task scheduling and execution with each thread pinned to a core. No work stealing from big to little cores; better data locality.
Total performance and energy improvement, AsyMo vs. TensorFlow on a Kirin 970 with Android 9 Pie:
- Both at max CPU frequency: 1.85x performance and 1.33x energy efficiency relative to TF (max freq).
- TensorFlow at the OS (schedutil) frequency setting vs. AsyMo at its picked efficient CPU frequency: 1.72x performance and 1.63x energy efficiency.
Pre-copying params is what enables the parallel implementation.
Sparseflow: unleash the full potential of sparsity in deep learning. Joint work with Chen Zhang et al.

Today's DNN models are huge: GPT-3 has 175B parameters and a $12M training cost; MT-NLG has 530B parameters and was trained on 560 DGX A100 servers.
Computation is the engine behind AI's success, and we still need more.
[Figure: performance (op/sec), 1960-2019. CPUs under Moore's law gained ~10^8x (ENIAC at 5 Kops to Xeon E5 at ~500 Gops); dedicated hardware adds ~10^5x (GPU, TPU: TPUv1 90 Tops, V100 125 Tops, TPUv3 360 Tops; what's next?).]

Piling up hardware is not sustainable: the energy-efficiency wall.
[Figure: giga-operations per Joule, 1995-2020, showing successive CPU, GPU, and TPU energy-efficiency walls relative to Moore's law.]
Sparsity is the key to the human brain's efficiency: we do not look at everything in our visual scope, and simple geometric shapes are enough for us to recognize a cat.
Weight pruning (Han, Song, et al., "Learning both Weights and Connections for Efficient Neural Networks", NIPS'15) prunes away small weights, turning dense MxV into SpMxV over unstructured sparse matrices, which is difficult to accelerate.

Accuracy and speedup trade-off:
- Fine-grained/irregular sparsity. Pros: high model accuracy, high compression ratio. Cons: irregular pattern, difficult to accelerate.
- Coarse-grained/regular sparsity. Pros: regular pattern, easy to accelerate. Cons: low model accuracy, low compression ratio.
How to achieve both? For model accuracy, add few constraints on the sparsity pattern; for speedup, use matrix partitioning for parallel computing and eliminate irregular computation and memory access. (S. Cao et al., "Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity", FPGA'19.)
Bank-balanced pruning: split each dense matrix row into equal-size banks, traverse all rows, and apply fine-grained pruning inside each bank, thresholding by percentage so that every bank reaches an identical sparsity ratio.
[Figure: a 16-element dense matrix row split into banks; pruning inside each bank yields the BBS matrix row.]
Bank-Balanced Sparsity (BBS) thus combines bank partitioning for parallel computing with fine-grained pruning inside each bank for maintaining accuracy. A sketch of per-bank pruning follows.
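A C sketch of the pruning rule for one row; the top-k selection is kept quadratic for brevity, and the stack buffer assumes bank_size <= 64:

#include <math.h>

void prune_row_bank_balanced(float *row, int row_len,
                             int bank_size, int keep_per_bank) {
    for (int b = 0; b + bank_size <= row_len; b += bank_size) {
        float *bank = row + b;
        float mag[64];                        /* magnitude snapshot */
        for (int i = 0; i < bank_size; i++)
            mag[i] = fabsf(bank[i]);
        for (int i = 0; i < bank_size; i++) {
            int rank = 0;                     /* # ranked ahead of i */
            for (int j = 0; j < bank_size; j++)
                if (mag[j] > mag[i] || (mag[j] == mag[i] && j < i))
                    rank++;
            if (rank >= keep_per_bank)        /* not in the bank's top-k */
                bank[i] = 0.0f;
        }
    }
}

Every bank keeps exactly keep_per_bank weights, so the sparsity ratio is identical across banks, which is what the hardware relies on for load balancing.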
Sparse matrix-vector multiplication (SpMxV) with BBS: the dense vector (V0-V11) is split into banks (banks 0-3), and each matrix row holds the same number of non-zeros per bank. This exposes both inter-row and inter-bank parallelism, gives load balancing across rows and banks, and allows conflict-free vector accesses.
[Figure: two BBS matrix rows multiplied against a banked dense vector.]
Our CSB (Compressed Sparse Banks) format rearranges the data for inter-bank parallelization: values are stored together with bank-internal indices that serve as physical BRAM addresses. CSB is specifically designed for BBS to eliminate decoding overheads. A plain-C sketch of the bank-parallel dot product follows.
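This is a software stand-in for the hardware PE; the layout is bank-major per row with bank-internal indices, in the spirit of CSB rather than its exact bit layout:

/* values/idx hold one row's non-zeros, bank-major: bank 0's k entries,
   then bank 1's k entries, ... idx[] are bank-internal offsets, i.e.
   the physical addresses within each private vector bank. */
float spmxv_row_bbs(const float *values, const int *idx,
                    int n_banks, int k, int bank_size, const float *x) {
    float sum = 0.0f;
    for (int b = 0; b < n_banks; b++) {      /* banks run in parallel in HW */
        const float *x_bank = x + b * bank_size;  /* private vector buffer */
        float partial = 0.0f;
        for (int i = 0; i < k; i++)          /* exactly k non-zeros per bank */
            partial += values[b * k + i] * x_bank[idx[b * k + i]];
        sum += partial;                      /* adder tree in the FPGA PE */
    }
    return sum;
}

Because each bank touches only its own slice of the vector, the k accesses issued in a cycle never contend for the same buffer: this is the conflict-free vector access noted above.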
Accelerator overview (FPGA): SpMxV PEs (multiplier arrays feeding adder trees), element-wise operation and activation units, a controller with an instruction buffer, DMA, private vector buffers, and dedicated vector/matrix memories (values and indices), connected via DRAM and PCIe controllers to off-chip DRAM and the host server.
Model accuracy: on speech recognition (TIMIT dataset) and language modeling (PTB dataset), BBS accuracy is very close to unstructured sparsity. Hardware efficiency gains: ~34x and ~7x.
SeerNet: Predicting CNN Feature-Map Sparsity through Low-Bit Quantization. S. Cao et al., "SeerNet: Predicting Convolutional Neural Network Feature-Map Sparsity through Low-Bit Quantization", CVPR'19.
In a CNN (Conv -> ... -> Conv -> Softmax over classes such as cat/dog/pig/cow/boy), a convolution of weights W with feature map F is typically followed by ReLU, y = max(0, x), or max-pooling, y = max(x_i | i = 1, 2, ..., n), both of which discard a large fraction of the computed outputs (ReLU zeroes every negative value).
[Figure: an example feature map before and after ReLU and max-pooling, with most entries becoming zero.]
The idea: accelerate model inference by exploiting feature-map sparsity, as sketched below.
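A sketch of the SeerNet idea under stated assumptions: a hypothetical low-bit convolution (conv_q4_at) predicts the sign of each output, and the expensive full-precision convolution (conv_fp32_at) runs only where ReLU would keep the value. Both routines are illustrative prototypes, not the paper's implementation:

#include <stdint.h>

int8_t conv_q4_at(int i, int j, int c);    /* cheap low-bit conv (predictor) */
float  conv_fp32_at(int i, int j, int c);  /* full-precision conv            */

/* Conv + ReLU with the quantized pass used as a sparsity oracle. */
void seernet_conv_relu(float *out, int H, int W, int C) {
    for (int i = 0; i < H; i++)
        for (int j = 0; j < W; j++)
            for (int c = 0; c < C; c++) {
                float *o = &out[(i * W + j) * C + c];
                if (conv_q4_at(i, j, c) > 0)       /* predicted to survive */
                    *o = conv_fp32_at(i, j, c);
                else
                    *o = 0.0f;                     /* ReLU zeroes it anyway */
            }
}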