
An Automatic RTL Compiler for High-Throughput FPGA Implementation of Diverse Deep Convolutional Neural Networks

Yufei Ma, Yu Cao, Sarma Vrudhula, Jae-sun Seo
School of Electrical, Computer and Energy Engineering
School of Computing, Informatics, Decision Systems Engineering
Arizona State University, Tempe, USA

Abstract: Convolutional neural networks (CNNs) are rapidly evolving and being applied to a broad range of applications. Given a specific application, an increasing challenge is to search for the appropriate CNN algorithm and to map it efficiently onto the target hardware. FPGA-based accelerators have the advantages of reconfigurability and flexibility, and have achieved high performance and low power. Without a general compiler to automate the implementation, however, significant effort and expertise are still required to customize the design for each CNN model. In this work, we present an RTL-level CNN compiler that automatically generates customized FPGA hardware for the inference tasks of various CNNs, in order to enable fast, high-level prototyping of CNNs from software to FPGA while still keeping the benefits of low-level hardware optimization. First, a general-purpose library of RTL modules is developed to model the different operations of each layer, and the implementation of each module is optimized at the RTL level. Given a CNN algorithm, its structure is abstracted to a directed acyclic graph (DAG) and then compiled with the RTL modules in the library. The integration and dataflow of the physical modules are predefined in the top-level system template and reconfigured during compilation. The runtime control of the layer-by-layer sequential computation is managed by the proposed execution schedule, so that even highly irregular and complex network topologies, e.g. ResNet, can be compiled. The proposed methodology is demonstrated with end-to-end FPGA implementations of various CNN algorithms (e.g. NiN, VGG-16, ResNet-50, and ResNet-152) on two standalone Intel FPGAs, Stratix V and Arria 10, and the performance and overhead of the automated compilation are evaluated. The compiled FPGA accelerators exhibit superior performance, by ~2x for various CNNs, compared to state-of-the-art automation-based works.

I. INTRODUCTION

Convolutional neural networks (CNNs) have become the dominant approach in many computer vision applications such as image classification [1]-[8] and object detection [9]. The large number of operations and parameters, as well as the highly varying layer dimensions, have challenged the real-time implementation of CNNs on FPGA-based accelerators. Constrained by limited computing resources and costly external memory (e.g. DRAM) accesses, an FPGA-based CNN accelerator must fully reuse its hardware resources across the different convolution layers and increase data locality to reduce data movement and off-chip communication. Therefore, the CNN acceleration strategy, such as the loop unrolling and tiling techniques, must be optimized to properly manage the parallel computation and the data storage patterns in memory [12][17][19].

To pursue higher classification accuracy and enable various intelligent applications, CNNs are rapidly evolving with new types of layers, increased model depth and more complex structures. For example, the recently emerged deep residual networks (ResNets) [5]-[7] achieve superior accuracy at the cost of extremely deep structures with up to 1,000 convolution layers (Conv) and many different types of layers, whose dimensions and kernel sizes vary significantly. Furthermore, instead of connecting layers in sequence as in AlexNet [2] and VGG [4], the interconnections between layers in more recent CNN algorithms, e.g. ResNets and Google Inception [7][8], take the form of a directed acyclic graph (DAG), as illustrated in Fig. 1, such that there are multiple parallel branches of layer stacks or skip connections between non-adjacent layers.

[Fig. 1. Example of DAG-form layer connections in recent CNN algorithms (Conv1, Pool1, Conv2, Bnorm2, Conv3, Bnorm3, Eltwise4, ReLu5, FC). Conv, Pool and FC are defined as key layers; a layer combo is one key layer plus its affiliated layers, and the combos are computed in the order (1), (2), ..., (5).]

All these trends have made it more difficult to design a general-purpose CNN hardware accelerator that efficiently caters to a diverse range of CNN algorithms. On the contrary, a customized hardware-level design, i.e. RTL, can obtain excellent throughput and energy efficiency for a specific CNN model, but it requires profound knowledge of both the algorithms and the FPGA system architecture. Based on our experience, the development of an RTL FPGA-based accelerator for a specific CNN could take four or more months, spanning the study of the deep learning algorithm, simulation-based functional design, optimization of the synthesizable architecture, compilation of the integrated FPGA system, and timing analysis and functionality verification, with several feedback loops that send the developer back to an earlier design stage to resolve issues encountered at a later stage. Therefore, great effort and expertise are required to customize the design for various CNNs on different FPGAs, which exacerbates the gap between the development of CNN algorithms and their acceleration on embedded hardware.

In this work, a library-based CNN RTL compiler is proposed, as shown in Fig. 2, where the user only needs to input high-level CNN model information and design variables that characterize the hardware usage, without touching the low-level hardware design. It enables fast and automatic mapping of various deep CNN algorithms from software deep learning frameworks, e.g. Caffe [10], onto FPGAs with high efficiency and performance. By this means, we benefit from both the reconfigurability of FPGAs and the finer optimization of an RTL implementation. As CNNs are assembled from highly iterative computing primitives, or layers, scalable RTL building-block modules are designed for the different types of layers and reused across different CNNs. The RTL compiler configures these modules with the CNN parameters, and it also scales the sizes of the processing engines (PEs) and on-chip buffers according to the user-specified hardware design variables. The main contributions of this work are as follows.

1) A user-friendly, high-level CNN RTL compiler is designed to automatically generate FPGA-based accelerators for various large-scale CNN algorithms under user-specified hardware resource constraints, such as computing parallelism and buffer usage, targeting FPGA platforms with different amounts of hardware resources.

2) An RTL module library with hand-coded Verilog templates is developed for the different types of layers. The modules are designed around the optimized acceleration strategy in [12], which defines the parallel computation, data movement and memory accesses. The library can be expanded with new layers or operations for emerging deep learning algorithms.

3) The sequential processing of the different layers and the DAG-form layer connections are managed by the proposed execution schedule. The integration and dataflow of the physical modules are predefined by the reconfigurable top-level accelerator template, so that only the required modules are synthesized.

The flexibility of the proposed compilation methodology is validated by implementing the inference phase of both conventional CNNs, e.g. NiN [3] and VGG-16 [4], and complex DAG-form networks, e.g. ResNets [5] with 50 and 152 convolution layers. The accelerators are demonstrated on two standalone Intel FPGAs, Stratix V and Arria 10, achieving superior performance, by ~2x, over state-of-the-art automation-based deep learning FPGA accelerators [11][14][16].

II. OVERVIEW OF PROPOSED CNN RTL COMPILER

[Fig. 2. The overall compilation flow of the proposed CNN RTL compiler.]

The dimensions and connections of the CNN layers and the pre-trained kernel weights are derived from Caffe [10] as the input to the CNN compiler. Given the CNN parameters, the accelerator design variables, e.g. the loop unrolling and tiling sizes shown in Fig. 3 and described in Section III, can be tuned by the user to balance performance against the required hardware resources. The topology of the CNN structure is transformed into the specified layer-by-layer execution schedule shown in Fig. 4(a)(b), which is used by the global control logic. The execution schedule also determines the read and write orders of the kernel weights and pixels of the different layers stored in external memory, and the associated read and write addresses are generated and sorted to control the transactions between the external and on-chip memories.

The RTL module library consists of multiple hand-coded Verilog templates that describe the computations and dataflow of the different kinds of layers. The templates are built on the optimized CNN acceleration strategy of [12] to minimize memory accesses and data movement while maximizing resource utilization. The Verilog parameters that determine the sizes of the PEs and buffers are compiled from the design variables, while the parameters for runtime control are initialized by the compiler and stored in configuration registers. The intra-tile execution flow of layers, as shown in Fig. 4(c), is predefined in the templates and can be customized by the compiler to enable the execution of selected layers at runtime. The top-level accelerator system template, as shown in Fig. 5, integrates these modules with a reconfigurable dataflow, in which unneeded computing modules are not compiled and the dataflow bypasses them.

By this means, the layer-by-layer execution flow, the hardware computing architecture and the memory transactions can all be customized by the compiler for different CNN algorithms: the specific RTL computing modules are compiled for the different layers, the parameter configuration registers are initialized to control the inter- and intra-layer sequential computations, and the corresponding read/write addresses are generated and sorted to describe the accesses to external memory.
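As a concrete illustration of this flow, the sketch below mimics the compiler's two outputs for Conv layers: Verilog parameters derived from the user design variables, and per-layer configuration-register contents derived from the model dimensions. All names here (ConvLayer, compile_design, NUM_PE, etc.) are hypothetical; the paper's actual compiler and templates are not reproduced in this excerpt.

```python
# Hypothetical sketch of the compilation step: model dimensions in,
# Verilog parameters and configuration-register contents out.
from dataclasses import dataclass

@dataclass
class ConvLayer:
    name: str
    Nif: int; Niy: int; Nix: int   # input feature maps: count, height, width
    Nof: int; Noy: int; Nox: int   # output feature maps: count, height, width
    Nky: int; Nkx: int             # kernel height, width

def compile_design(layers, Pix, Piy, Pof, Toy, Tof):
    """Map user design variables to Verilog parameters (PE array and
    buffer sizes) and each layer's dimensions to config-register words."""
    verilog_params = {
        "NUM_PE": Pix * Piy * Pof,  # loop unrolling fixes the PE array size
        "TOY": Toy, "TOF": Tof,     # loop tiling fixes the buffer depths
    }
    cfg_regs = []                   # loaded layer by layer at runtime
    for l in layers:
        num_tiles = -(-l.Noy // Toy) * -(-l.Nof // Tof)  # ceil divisions
        cfg_regs.append({"layer": l.name,
                         "dims": (l.Nif, l.Nof, l.Noy, l.Nox, l.Nky, l.Nkx),
                         "num_tiles": num_tiles})
    return verilog_params, cfg_regs
```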

[Fig. 3. Convolution dimensions and accelerator design variables: N* denotes the algorithm-defined Conv layer dimensions (input feature maps Nif x Niy x Nix, output feature maps Nof x Noy x Nox, kernels Nky x Nkx), T* the loop tiling variables, and P* the loop unrolling variables, with 1 ≤ P* ≤ T* ≤ N*, e.g. 1 ≤ Pif ≤ Tif ≤ Nif.]

III. ACCELERATION OF CONVOLUTION LOOPS

A. Convolution Loop Optimization and Design Variables

Convolution involves three-dimensional multiply-and-accumulate (MAC) operations over the input feature maps and kernel weights, as illustrated in Fig. 3, where the parameters prefixed with a capital N (N*) denote the algorithm-defined dimensions of the feature and kernel maps of one Conv layer. Since convolution dominates the CNN operations, the acceleration strategy for the convolution loops dramatically impacts both the parallel computation efficiency and the memory access requirements. Therefore, we employ the loop optimization techniques of [12] to customize the convolution computation and communication patterns. The loop unrolling design variables (P*) determine the degree of parallelism of certain convolution loops, and thus the required size and architecture of the PEs. Loop tiling increases data locality by dividing the entire data of one layer into multiple tiles, each of which fits into the on-chip buffers. The loop tiling design variables (T*) determine the minimum required on-chip buffer sizes and affect the number of external memory accesses.
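To make the roles of the N*, P* and T* variables concrete, the sketch below writes one Conv layer as a plain loop nest, with comments marking which loops the accelerator unrolls into parallel PEs (P*) and which data the tiling variables (T*) keep on chip. It is a functional illustration (stride 1, no padding), not the generated RTL.

```python
import numpy as np

def conv_layer(pixels, weights):
    """One Conv layer as a plain loop nest (stride 1, no padding).
    pixels: (Nif, Niy, Nix); weights: (Nof, Nif, Nky, Nkx)."""
    Nof, Nif, Nky, Nkx = weights.shape
    _, Niy, Nix = pixels.shape
    Noy, Nox = Niy - Nky + 1, Nix - Nkx + 1
    out = np.zeros((Nof, Noy, Nox))
    for of in range(Nof):           # across output fmaps: unroll Pof, tile Tof
        for oy in range(Noy):       # scan within one fmap: unrolled by
            for ox in range(Nox):   # Pix*Piy, tiled by Toy*Tox
                acc = 0.0
                for i in range(Nif):           # across input fmaps: Pif = 1
                    for ky in range(Nky):      # within one kernel window:
                        for kx in range(Nkx):  # Pkx = Pky = 1
                            acc += (weights[of, i, ky, kx] *
                                    pixels[i, oy + ky, ox + kx])
                out[of, oy, ox] = acc
    return out
```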

B. Convolution Acceleration Strategy

The RTL module template design adopts the CNN acceleration strategy of [12], which leads to a uniform mapping of PEs and thereby reduces the accelerator architecture complexity. In this work, the acceleration strategy is further generalized to CNN models with varying dimensions and topologies. Only the computation within one input feature map and across multiple output feature maps is unrolled, i.e. parallelized: Pkx = Pky = Pif = 1 and Pix ≥ 1, Piy ≥ 1, Pof ≥ 1. By this means, both pixels and weights are reused by multiple PEs, and a high level of parallelism can be supported because Nix × Niy × Nof is large. The data required to compute one final output pixel are fully buffered to minimize partial-sum storage, i.e. Tkx = Nkx, Tky = Nky, Tif = Nif. We also set Tox = Nox, so that an entire row is buffered, which benefits DRAM transactions with data from contiguous addresses. Furthermore, either all input pixels or all weights of each layer are fully buffered, by tuning Toy and Tof, to minimize the DRAM accesses. Following these settings, the remaining P* and T* design variables can be adjusted by the user to explore the best trade-off between performance and hardware resource usage, e.g. DSP blocks and BRAMs, on the target FPGA platform.
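Under these settings, only Pix, Piy, Pof, Toy and Tof remain free. A minimal sketch of the resulting sizing rule, reusing the hypothetical ConvLayer record from the Section II sketch and ignoring stride and padding in the buffer estimates:

```python
def apply_strategy(l, Pix, Piy, Pof, Toy, Tof):
    """Fix the design variables as above; buffer sizes are in words and
    ignore stride and padding for brevity."""
    P = {"Pkx": 1, "Pky": 1, "Pif": 1,        # serial loops
         "Pix": Pix, "Piy": Piy, "Pof": Pof}  # unrolled loops -> PE array
    T = {"Tkx": l.Nkx, "Tky": l.Nky, "Tif": l.Nif,  # one output pixel's data
         "Tox": l.Nox,                              # full rows for DRAM bursts
         "Toy": Toy, "Tof": Tof}                    # the two free knobs
    num_pe = Pix * Piy * Pof
    input_buf = l.Nif * l.Nix * (Toy + l.Nky - 1)   # rows feeding Toy out-rows
    weight_buf = Tof * l.Nif * l.Nky * l.Nkx
    return P, T, num_pe, input_buf, weight_buf
```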

IV. END-TO-END CNN ACCELERATOR

A. Layer-by-layer Execution Schedule

In conventional CNN algorithms, the layers are connected in sequence, which enables straightforward layer-by-layer serial computation. The DAG-form layer interconnections of recent CNN algorithms bring a new challenge to this serial processing flow. If layers in different branches, e.g. Conv2 and Conv3 in Fig. 1, were computed in parallel, the computing resources would have to be split among the parallel layers, which requires different mappings of PEs. A tougher problem is keeping the latency balanced among the branches when their operation sizes differ. Therefore, we still compute the layers of a DAG-connected CNN sequentially, and a reconfigurable layer-by-layer execution schedule is designed to handle the different combinations of stacked layers and the DAG-form network topology, as shown in Fig. 4.

[Fig. 4. Execution schedule: (a) layer-by-layer, (b) inter-tile, (c) intra-tile. Each tile reads from the weight and input buffers, executes the key layer (Conv, Pool or FC) and the enabled affiliated layers (Bnorm & Scale, ReLU, Eltwise), and writes the results to the output buffers and back to DRAM.]

There are many types of layers in a CNN algorithm, and their combinations and numbers can differ considerably across networks. Based on their operation properties, we categorize them as key layers and affiliated layers. The key layers read their input from DRAM and write their output back to DRAM, and the next key layer can start computing only after the current one has fully completed. The affiliated layers, on the other hand, directly use the key layer outputs as their input without accessing DRAM. By treating a layer as an affiliated layer, its DRAM access delay is eliminated; however, its computing pattern, e.g. the unrolling and tiling variables, must then follow the key layer configuration, which hampers design flexibility. Therefore, Conv, Pool and FC are assigned as key layers, so that the computations of these layers remain relatively independent, and all other layers are affiliated to the key layers. By this means, we define one layer combo as one key layer together with several optional affiliated layers. For example, as shown in Fig. 1, layer combo (3) is composed of one key layer, Conv3, and three affiliated layers. The layer-by-layer serial computation is then essentially the serial execution of layer combos, as illustrated in Fig. 1 and Fig. 4(a). The computing order of the layer combos is set before compilation, and the only rule is to ensure that all of a key layer's predecessors have been processed before the key layer itself.
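The ordering rule above amounts to a topological sort of the combo DAG. A minimal sketch, assuming a hypothetical dict-based representation of the combo graph rather than the compiler's internal data structures:

```python
from collections import deque

def schedule_combos(combos, preds):
    """Serial order for layer combos via Kahn's topological sort.
    combos: list of combo names; preds: name -> set of predecessor combos."""
    indeg = {c: len(preds.get(c, ())) for c in combos}
    succs = {c: [] for c in combos}
    for c in combos:
        for p in preds.get(c, ()):
            succs[p].append(c)
    ready = deque(c for c in combos if indeg[c] == 0)
    order = []
    while ready:
        c = ready.popleft()          # all of c's predecessors are done
        order.append(c)
        for s in succs[c]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    assert len(order) == len(combos), "layer graph must be a DAG"
    return order

# A skip connection (combo1 -> combo3) still schedules correctly:
print(schedule_combos(["combo1", "combo2", "combo3"],
                      {"combo2": {"combo1"}, "combo3": {"combo1", "combo2"}}))
```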

With the loop tiling technique applied, one key layer is divided into multiple tiles that fit into the on-chip buffers. Accordingly, the execution of one layer combo is also divided into multiple sequential tiles, as illustrated in Fig. 4(a)(b). Different layer combos may contain different kinds of layers; for example, layer combo (2) in Fig. 1 has no Eltwise layer. Therefore, a general intra-tile execution schedule is designed, as shown in Fig. 4(c), to control at runtime whether each layer is executed for a specific layer combo. The select signals, e.g. "is Conv?" in Fig. 4(c), are stored in the configuration registers and initialized from the input CNN topology during compilation. If a layer type does not exist in the given CNN, its select signal becomes the constant "No". The schedule is also flexible enough to accommodate new types of layers by adding new select signals.
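The runtime effect of these select signals can be sketched as follows, with one generic tile routine consulting per-combo flags that the compiler initializes. The flag names and the stage order mirror Fig. 4(c) and are illustrative, not the actual register map:

```python
import numpy as np

def run_tile(cfg, key_out, skip=None):
    """Post-process one tile of key-layer output; stages whose select
    flag is off are bypassed, as in Fig. 4(c)."""
    data = key_out
    if cfg["with_bnorm"]:                 # Bnorm & Scale
        data = (data - data.mean()) / (data.std() + 1e-5)
    if cfg["with_relu"]:                  # ReLU
        data = np.maximum(data, 0.0)
    if cfg["with_eltwise"]:               # Eltwise: add the skip branch
        data = data + skip
    return data

# Select flags for a combo with Bnorm and ReLU but no Eltwise; a flag that
# never occurs in the given CNN is compiled to a constant "No" (False).
cfg = {"with_bnorm": True, "with_relu": True, "with_eltwise": False}
tile_out = run_tile(cfg, np.random.randn(8, 16, 16))
```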

Three levels of control logic, namely global, inter-tile and local control logic, are required to govern the layer-by-layer, inter-tile and intra-tile sequential execution (Fig. 4). The parameters of each layer, e.g. kernel sizes, feature map dimensions, unrolling and tiling variables, and iteration counts, are stored in configuration registers. The global control logic keeps track of the number of executed layer combos and loads the current layer's parameters from the configuration registers into the local control logic registers. Each type of layer module has its own local control logic to iterate its computation. By this means, one set of control logic serves layers with varying dimensions, simply by initializing the configuration registers for the different layers during compilation.

Variations in layer sizes and loop design variables are likewise handled by initializing the configuration registers based on the layer properties. If a component inside a module cannot be scaled to the layer changes, specific variants of that module are generated during compilation. The RTL module library is open to being extended with new layers for more CNN algorithms, and the existing modules can be further optimized for performance and efficiency. The detailed design of the computing ...
