




DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator

Chang Gao, Daniel Neil (currently at BenevolentAI), Enea Ceolini, Shih-Chii Liu, and Tobi Delbruck
Institute of Neuroinformatics, University of Zurich and ETH Zurich, Zurich, Switzerland
ABSTRACT

Recurrent Neural Networks (RNNs) are widely used in speech recognition and natural language processing applications because of their capability to process temporal sequences. Because RNNs are fully connected, they require a large number of weight memory accesses, leading to high power consumption. Recent theory has shown that an RNN delta network update approach can reduce memory access and computation with negligible accuracy loss. This paper describes the implementation of this theoretical approach in a hardware accelerator called "DeltaRNN" (DRNN). The DRNN updates the output of a neuron only when the neuron's activation changes by more than a delta threshold. It was implemented on a Xilinx Zynq-7100 FPGA. FPGA measurement results from a single-layer RNN of 256 Gated Recurrent Unit (GRU) neurons show that the DRNN achieves 1.2 TOp/s effective throughput and 164 GOp/s/W power efficiency. The delta update leads to a 5.7x speedup compared to a conventional RNN update because of the sparsity created by the DN algorithm and the zero-skipping ability of the DRNN.
ACM Reference Format:
Chang Gao, Daniel Neil, Enea Ceolini, Shih-Chii Liu, and Tobi Delbruck. 2018. DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator. In FPGA '18: 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 25-27, 2018, Monterey, CA, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3174243.3174261
1 INTRODUCTION

Recurrent Neural Networks (RNNs) are fully-connected single- or multi-layered networks with complex neurons that have multiple memory states; they have enabled state-of-the-art accuracies in tasks involving temporal sequences [27] such as automatic speech recognition [2, 9] and natural language processing [23]. The prediction accuracy of RNNs is further improved by adding gating units to control the data flow in and between neurons. In deep learning, Long Short-Term Memory (LSTM) [16] and Gated Recurrent Unit (GRU) [6] are the two major neuron models used in gated RNNs. The gating units in the LSTM and GRU models help to mitigate the well-known vanishing gradient problem encountered during training.

A challenge in deploying RNNs in mobile or always-on applications is access to hardware that achieves high power efficiency. For mobile applications, edge inference is preferable because of lower latency, reduced network bandwidth requirements, robustness to network failure, and better privacy. Recent neural network applications use GPUs ranging from high-end models such as the NVIDIA Pascal Titan X, to embedded system-on-chip (SoC) GPUs such as the Kepler GPU in the Tegra K1 SoC, to smartphone GPUs such as the Samsung Exynos Mali. The server- or desktop-targeted Titan X consumes about 200 W and achieves a power efficiency of around 5 GOp/s/W for inference of an LSTM layer with 1024 neurons [12]. Embedded CPUs and GPUs consume about 1 W, but their RNN power efficiency is much lower (around 8 MOp/s/W during inference of an LSTM network with 2 layers of 128 LSTM neurons each, as measured in [5]). For mobile GPUs, the low efficiency comes from a poor match between the memory architecture and the GPU, especially for small RNNs.

The energy figures from Horowitz [17] for a 45 nm process show that Dynamic Random Access Memory (DRAM) access consumes several hundred times more energy than arithmetic operations. With these figures, the power consumption of the LSTM RNN in a complete mobile speech recognition engine [22] can be estimated. That RNN is a 5-layer network with 500 units per layer, updated at 100 Hz. The weight matrices are too large to be stored in economical SRAM and so must be fetched from off-chip DRAM. Using the Horowitz numbers results in a power consumption of about 0.2 W, most of it consumed by DRAM access. Thus RNN inference energy is dominated by memory access cost; however, achieving high throughput requires high memory bandwidth for weight fetching in order to keep arithmetic units loaded as fully as possible. The key to improving power efficiency (number of arithmetic operations/W) is to reduce the total memory access, thereby decreasing the energy consumption, while keeping arithmetic units fully utilized to maintain high throughput.
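This roughly 0.2 W figure can be reproduced with a short back-of-the-envelope calculation, sketched below. The layer count, layer width, and update rate come from the description above; the 16-bit weight width and the roughly 640 pJ cost of a 32-bit DRAM access are assumptions added here on top of the Horowitz figures, so the number should be read only as an order-of-magnitude check.

```python
# Back-of-the-envelope DRAM power for weight fetches of a 5-layer,
# 500-unit LSTM updated at 100 Hz. Assumed: 16-bit weights, ~640 pJ per
# 32-bit DRAM access, and an input width equal to the layer width.
layers, units, rate_hz = 5, 500, 100
weights_per_layer = 4 * (units + units) * units   # 4 LSTM gates, input + recurrent weights
total_weights = layers * weights_per_layer        # about 10 M parameters

energy_per_32bit_fetch = 640e-12                  # joules, assumed DRAM access cost
energy_per_weight = energy_per_32bit_fetch / 2    # two 16-bit weights per 32-bit access

power_w = total_weights * rate_hz * energy_per_weight
print(f"{total_weights / 1e6:.0f} M weights -> ~{power_w:.2f} W for DRAM weight fetches")
# ~0.3 W under these assumptions, the same few-hundred-milliwatt order of
# magnitude as the ~0.2 W estimate quoted above.
```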
Sparsity in network activations or input data is a property that can be used to achieve high power efficiency. In a Matrix-Vector Multiplication (MxV) between a neuron activation vector and a weight matrix, zero elements in the vector produce zero partial sums that do not contribute to the final result. The multiplications between zero vector elements and their corresponding weight columns can therefore be skipped, which also saves the memory accesses for those weight columns. Moreover, it is also possible to skip any multiplication between a zero weight element and a non-zero vector element to further reduce operations and memory access, though this feature has not been adopted in this work yet. A common way to create sparsity in neural network parameters is weight compression, as shown in [14, 19]. The Delta Network algorithm [25] instead creates sparsity in input and activation vectors by exploiting their temporal dependency.

Although GPUs offer high peak throughput, they may suffer from irregular execution paths and memory access patterns in RNN inference, caused by the recurrent connections and limited data reuse of RNNs [3]. In this case, it is useful to consider hardware architectures specialized for RNN inference that enhance parallelism by matching memory bandwidth to the available arithmetic units. RNN inference accelerators based on FPGAs have been explored because of their potentially higher power efficiency compared with GPUs. Previous works include LSTM accelerators [4, 5, 11, 21], a GRU accelerator [26], and a Deep Neural Network (DNN) accelerator [10] that is able to run LSTM inference. These implementations do not capitalize on the data sparsity of RNNs. The Efficient Speech Recognition Engine (ESE) proposed by Han et al. [13] employs a load-balance-aware pruning scheme, including both pruning [15] and quantization. This pruning scheme compresses the LSTM model size by 20x, and a scheduler parallelizes operations on the sparse models. They achieved a 6.2x speedup over the dense model by pruning the LSTM model to 10% non-zeros. However, none of these works use the temporal property of RNN data.

In this paper, a GRU-RNN accelerator architecture called the DeltaRNN (DRNN) is proposed. The implementation is based on the Delta Network (DN) algorithm, which skips dispensable computations during network inference by exploiting the temporal dependency in RNN inputs and activations [25] (Sec. 2.2). The system was implemented on a Xilinx Zynq-7100 FPGA controlled by a dual ARM Cortex-A9 CPU. To provide sufficient bandwidth for the arithmetic units and reduce external memory access, DRNN V1.0 stores the network weight matrices in BRAM blocks. It was tested on the TIDIGITS dataset using an RNN model with one GRU layer of 256 neurons.
2 BACKGROUND

2.1 Gated Recurrent Unit

A GRU RNN has similar prediction accuracy to an LSTM RNN but a lower computational complexity. We implemented the GRU because it requires fewer network parameters (and therefore less hardware resource) than the LSTM. Fig. 1 shows the data flow within a GRU neuron.

Figure 1: Data flow of GRU

The GRU neuron model has two gates, a reset gate r and an update gate u, and a candidate hidden state c. The reset gate determines the amount of information from the previous hidden state that will be added to the candidate hidden state. The update gate decides to what extent the activation h should be updated by the candidate hidden state to enable a long-term memory. The GRU formulation used in this paper is shown below:

r(t) = \sigma(W_{xr} x(t) + W_{hr} h(t-1) + b_r)    (1)
u(t) = \sigma(W_{xu} x(t) + W_{hu} h(t-1) + b_u)    (2)
c(t) = \tanh(W_{xc} x(t) + r(t) \odot (W_{hc} h(t-1)) + b_c)    (3)
h(t) = (1 - u(t)) \odot h(t-1) + u(t) \odot c(t)    (4)

where x is the input vector, h the activation vector, W the weight matrices, b the biases, and r, u, c correspond to the reset gate, update gate, and candidate activation respectively. \sigma and \odot denote the logistic sigmoid function and element-wise multiplication respectively.
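For reference, a minimal NumPy sketch of one GRU update following (1)-(4) is shown below; the function name and the dictionary keys are illustrative, and the weights would come from a trained model.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W, b):
    """One GRU timestep following equations (1)-(4).

    W holds the six weight matrices W['xr'], W['hr'], W['xu'], W['hu'],
    W['xc'], W['hc']; b holds the bias vectors b['r'], b['u'], b['c'].
    """
    r = sigmoid(W['xr'] @ x_t + W['hr'] @ h_prev + b['r'])        # (1) reset gate
    u = sigmoid(W['xu'] @ x_t + W['hu'] @ h_prev + b['u'])        # (2) update gate
    c = np.tanh(W['xc'] @ x_t + r * (W['hc'] @ h_prev) + b['c'])  # (3) candidate state
    return (1.0 - u) * h_prev + u * c                             # (4) new activation h(t)
```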
2.2 Delta Network Algorithm

This section explains the principle of the DN algorithm [25] and shows how a conventional GRU-RNN model with full updates can be converted to a delta network.

The DN algorithm reduces both memory access and arithmetic operations by exploiting the temporal stability of RNN inputs and outputs. Computations associated with a neuron activation that changes only slightly from its previous timestep can be skipped. As shown in Fig. 2, the gray circles represent neurons whose corresponding computations are skipped. Previous research on the DN algorithm [25] demonstrated on the TIDIGITS audio digit recognition benchmark that the algorithm can achieve an 8x speedup at 97.5% accuracy when the network is trained as a delta network, without considering sparsity in the weight matrix. Pre-trained networks can also be greatly accelerated as delta networks: on the large Wall Street Journal (WSJ) speech recognition benchmark, a 5.7x speedup was obtained at a word error rate of 10.2%, the same as with a conventional RNN [25].

Figure 2: Comparison between a standard gated RNN network (left) and a sparse delta network (right)

Fig. 3 shows how skipping a single neuron saves the multiplications of an entire column in all related weight matrices, as well as the fetches of the corresponding weight elements. The following equations describe the conversion between an MxV with full updates and one with delta updates:

y(t) = W x(t)    (5)
y(t) = W \Delta x(t) + y(t-1)    (6)

where \Delta x(t) = x(t) - x(t-1) and y(t-1) is the MxV result from the previous timestep. The MxV in equation (6) becomes a sparse MxV if all computations with respect to small \Delta x(t) elements are ignored. As shown in Fig. 3, the sparser the \Delta x(t) vector, the more memory accesses and arithmetic operations are saved.
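A small sketch of this column-skipping behavior is shown below (the names are illustrative): only the weight columns selected by non-zero delta elements are read, and because \Delta x(t) = x(t) - x(t-1) exactly, the result matches the dense MxV of (5).

```python
import numpy as np

def delta_mxv(W, x_t, x_prev, y_prev):
    """Delta MxV of equation (6): y(t) = W * dx(t) + y(t-1).

    Columns of W whose delta element is zero are never read or multiplied,
    which is the zero-skipping behavior illustrated in Fig. 3.
    """
    dx = x_t - x_prev
    nonzero_cols = np.nonzero(dx)[0]
    y = y_prev.copy()
    for j in nonzero_cols:          # skip all columns where dx[j] == 0
        y += W[:, j] * dx[j]
    return y

# Equivalence check against the dense MxV of equation (5).
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
x_prev = rng.standard_normal(8)
x_t = x_prev.copy()
x_t[[1, 5]] += 0.3                  # only two input elements changed
y_prev = W @ x_prev
assert np.allclose(delta_mxv(W, x_t, x_prev, y_prev), W @ x_t)
```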
Figure 3: Skipping neuron updates saves the multiplications between input vectors and the weight columns that correspond to zero \Delta x(t) (this is also the behavior of the Matrix-Vector Multiplication Channel discussed in Section 3.2.2)

Assuming the length of all vectors is n and the dimension of the weight matrix W is n x n, the computation cost of a dense MxV is n^2, while the computation cost of a DN MxV is oc * n^2 + 2n, where oc is the occupancy of the delta vector \Delta x(t) (occupancy is defined as the ratio of non-zero elements to all elements of a vector or a matrix). The term 2n appears because calculating the delta vector \Delta x(t) and adding y(t-1) to W \Delta x(t) each require n operations. As for memory access, a dense MxV has to fetch n^2 weight elements and n vector elements. The DN MxV needs to fetch oc * n^2 weight elements, 2n vector elements for \Delta x(t), and n vector elements for y(t-1), and finally write n vector elements for y(t). Thus, the theoretical computation speedup and memory access reduction both approach 1/oc as n becomes large. A summary of the computation and memory costs is given by:

C_{comp,dense} = n^2    (7)
C_{comp,sparse} = oc \cdot n^2 + 2n    (8)
C_{mem,dense} = n^2 + n    (9)
C_{mem,sparse} = oc \cdot n^2 + 4n    (10)
Speedup = C_{comp,dense} / C_{comp,sparse} \approx 1/oc    (11)
Memory Access Reduction = C_{mem,dense} / C_{mem,sparse} \approx 1/oc    (12)
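As a quick numerical check, the sketch below evaluates (7)-(12) for a 256-wide MxV (the GRU layer size used later in this paper) at an assumed 10% delta-vector occupancy; the predicted gains sit a little below the 1/oc asymptote because of the 2n and 4n overhead terms.

```python
def dn_costs(n, oc):
    """Computation and memory costs plus theoretical gains, equations (7)-(12)."""
    c_comp_dense = n ** 2                        # (7)
    c_comp_sparse = oc * n ** 2 + 2 * n          # (8)
    c_mem_dense = n ** 2 + n                     # (9)
    c_mem_sparse = oc * n ** 2 + 4 * n           # (10)
    speedup = c_comp_dense / c_comp_sparse       # (11), approaches 1/oc for large n
    mem_reduction = c_mem_dense / c_mem_sparse   # (12), approaches 1/oc for large n
    return speedup, mem_reduction

print(dn_costs(n=256, oc=0.10))   # about (9.3, 8.7), both below the 1/oc = 10x limit
```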
To skip the computations related to any small \Delta x(t), a delta threshold \Theta is introduced to decide when a delta vector element can be ignored. The change of a neuron's activation is only memorized when it is larger than \Theta. Furthermore, to prevent the accumulation of error over time, only the last activation value whose change exceeded the delta threshold is memorized. This is defined by the following equation sets:

\hat{x}(t-1) = x(t-1), if |x(t-1) - \hat{x}(t-2)| > \Theta; otherwise \hat{x}(t-2)    (13)
\hat{h}(t-2) = h(t-2), if |h(t-2) - \hat{h}(t-3)| > \Theta; otherwise \hat{h}(t-3)    (14)
\Delta x(t) = x(t) - \hat{x}(t-1), if |x(t) - \hat{x}(t-1)| > \Theta; otherwise 0    (15)
\Delta h(t-1) = h(t-1) - \hat{h}(t-2), if |h(t-1) - \hat{h}(t-2)| > \Theta; otherwise 0    (16)

where the memorized changes \Delta x(t) and \Delta h(t-1) are calculated using the memorized states \hat{x}(t-1) and \hat{h}(t-2).
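The sketch below (illustrative class name) implements this memorization rule, equations (13)-(16): a delta element is emitted only when the input moves more than \Theta away from its last memorized value, and only that memorized value is then updated, so sub-threshold changes do not accumulate into a growing error.

```python
import numpy as np

class DeltaEncoder:
    """Thresholded delta encoding following equations (13)-(16)."""

    def __init__(self, size, theta):
        self.theta = theta
        self.memorized = np.zeros(size)        # \hat{x}: last above-threshold values

    def encode(self, x_t):
        delta = x_t - self.memorized
        mask = np.abs(delta) > self.theta      # which elements changed enough
        delta[~mask] = 0.0                     # below threshold: delta element is 0
        self.memorized[mask] = x_t[mask]       # memorize only above-threshold values
        return delta

enc = DeltaEncoder(size=4, theta=0.1)
print(enc.encode(np.array([0.05, 0.30, 0.00, -0.50])))  # [0, 0.3, 0, -0.5]
print(enc.encode(np.array([0.08, 0.32, 0.00, -0.45])))  # all below threshold -> zeros
```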
Next, using (13), (14), (15) and (16), the conventional GRU equation set can be transformed into its delta network version:

M_r(t) = W_{xr} \Delta x(t) + W_{hr} \Delta h(t-1) + M_r(t-1)    (17)
M_u(t) = W_{xu} \Delta x(t) + W_{hu} \Delta h(t-1) + M_u(t-1)    (18)
M_{cx}(t) = W_{xc} \Delta x(t) + M_{cx}(t-1)    (19)
M_{ch}(t) = W_{hc} \Delta h(t-1) + M_{ch}(t-1)    (20)
r(t) = \sigma(M_r(t))    (21)
u(t) = \sigma(M_u(t))    (22)
c(t) = \tanh(M_{cx}(t) + r(t) \odot M_{ch}(t))    (23)
h(t) = (1 - u(t)) \odot h(t-1) + u(t) \odot c(t)    (24)

where M_r(0) = b_r, M_u(0) = b_u, M_{cx}(0) = b_c, and M_{ch}(0) = 0.
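Putting (13)-(24) together, a behavioral sketch of the delta GRU is given below; it reuses the weight layout of the gru_step() sketch from Sec. 2.1, and the class and variable names are illustrative rather than the accelerator's. Starting from a zero hidden state, setting \Theta = 0 reproduces the conventional GRU of (1)-(4) exactly, which is a convenient sanity check; raising \Theta zeroes out more delta elements and creates the sparsity that the DRNN exploits.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class DeltaGRU:
    """Delta-network GRU following equations (13)-(24)."""

    def __init__(self, W, b, theta, n_in, n_hid):
        self.W, self.theta = W, theta
        self.x_mem = np.zeros(n_in)                           # \hat{x}(t-1)
        self.h_mem = np.zeros(n_hid)                          # \hat{h}(t-2)
        self.h = np.zeros(n_hid)                              # h(t-1)
        self.Mr, self.Mu = b['r'].copy(), b['u'].copy()       # M_r(0)=b_r, M_u(0)=b_u
        self.Mcx, self.Mch = b['c'].copy(), np.zeros(n_hid)   # M_cx(0)=b_c, M_ch(0)=0

    def _delta(self, v, mem):
        d = v - mem
        mask = np.abs(d) > self.theta    # (15)/(16): keep only above-threshold changes
        d[~mask] = 0.0
        mem[mask] = v[mask]              # (13)/(14): update the memorized state
        return d

    def step(self, x_t):
        dx = self._delta(x_t, self.x_mem)
        dh = self._delta(self.h, self.h_mem)
        W = self.W
        self.Mr += W['xr'] @ dx + W['hr'] @ dh       # (17)
        self.Mu += W['xu'] @ dx + W['hu'] @ dh       # (18)
        self.Mcx += W['xc'] @ dx                     # (19)
        self.Mch += W['hc'] @ dh                     # (20)
        r, u = sigmoid(self.Mr), sigmoid(self.Mu)    # (21), (22)
        c = np.tanh(self.Mcx + r * self.Mch)         # (23)
        self.h = (1.0 - u) * self.h + u * c          # (24)
        return self.h
```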
3 IMPLEMENTATION

The DN algorithm can theoretically reduce arithmetic operations and memory access by reducing weight fetches. The main target of the DRNN accelerator is to realize efficient zero-skipping on the sparse and irregular data patterns of \Delta x(t) and \Delta h(t-1).

Fig. 4 shows an overview of the whole system. The Programmable Logic (PL) part is implemented on a Xilinx Zynq-7100 FPGA chip running at 125 MHz. An AXI Direct Memory Access (DMA) module converts between AXI4-Stream (AXIS) and full AXI4 so that data can be transferred between the PL and the Processing System (PS). The PS is the dual ARM Cortex-A9 CPU on the Zynq SoC. The data transfer between the DRNN and the AXI DMA is managed in packets. The Memory Mapped to Stream (MM2S) interrupt and the Stream to Memory Mapped (S2MM) interrupt are used to indicate the end of the corresponding data transfers. Read and write operations on DDR3 memory are managed by the DRAM controller of the PS [28].

Figure 4: Acceleration system overview

The input and output interfaces of the DRNN accelerator use the AXIS protocol to stream data in from the MM2S Data FIFO and out to the S2MM Data FIFO. Both the input and output interfaces are 64 bits wide in order to transfer 4 16-bit values per clock cycle. The weight BRAM block consumes 400 36-Kbit BRAM blocks (1.76 MB) to store all weight matrices and biases on-chip, providing sufficient bandwidth. Input vectors are stored in DRAM and transferred to the DRNN at runtime, and output vectors are written back to DRAM immediately after being produced by the DRNN. Previous research shows that RNNs with quantized fixed-point parameters down to 8 and even 4 bits can still work well [18, 20]. In this work, to demonstrate the benefit of the DRNN architecture on performance gain with minimum RNN accuracy loss in practical applications, we quantize all 32-bit floating-point values used by the DRNN, including GRU input/output vectors, weights and biases, into fixed-point 16-bit Q8.8 integers by Fixed16 = round(256 x Float32).
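The Q8.8 format keeps 8 integer and 8 fractional bits, so the conversion is a scale by 2^8 followed by rounding. A minimal sketch of this quantization, and of the 16-bit multiply with 32-bit accumulation performed by the MAC units described next, is shown below; the helper names and the saturation to the int16 range are assumptions for illustration.

```python
import numpy as np

def to_q8_8(x):
    """Fixed16 = round(256 * Float32), saturated to the int16 range (assumed)."""
    return np.clip(np.round(np.asarray(x) * 256), -32768, 32767).astype(np.int16)

def from_q8_8(q):
    """Q8.8 integer back to float."""
    return np.asarray(q) / 256.0

a, b = to_q8_8(1.5), to_q8_8(-0.25)     # 384 and -64 in Q8.8
print(from_q8_8(a), from_q8_8(b))       # 1.5 -0.25, exact round trip for these values

# A 16-bit x 16-bit product is a Q16.16 value that fits in 32 bits, so the
# MAC units can accumulate such products in 32-bit registers.
acc = np.int32(a) * np.int32(b)
print(acc / 2 ** 16)                    # -0.375 = 1.5 * -0.25
```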
3.1 DRNN Architecture

The top-level block diagram of the DRNN accelerator is shown in Fig. 5. It is composed of three main modules: the Input Encoding Unit (IEU), the MxV Unit, which is controlled by the MxV controller, and the Activation Pipeline (AP).

Figure 5: Top-level block diagram of the DRNN Accelerator

The function of the IEU is to execute the subtractions and comparisons between the input vectors from the current and previous timesteps to generate the delta vectors. The MxV Unit executes Multiply-Accumulate (MAC) operations using the sparse IEU output. It contains 768 MAC units that perform 16-bit multiplications and 32-bit accumulations.

The structure of the IEU is shown in Fig. 6. The IEU has two identical parts that are responsible for generating \Delta x(t) and \Delta h(t-1) respectively. The input widths of the two parts are 64 bits and 512 bits. Since both x(t) and h(t-1) elements are 16-bit Q8.8 integers, the two IEU parts respectively consume 4 elements of x(t) and 32 elements of h(t-1) per clock cycle; both widths should be set as large as possible to enhance performance, but they are limited by the AXI-DMA bandwidth and by timing requirements respectively. The same number of delta vector elements are then calculated by the Delta Encoder, which subtracts the current input from the input of the last timestep stored in the Previous Time (PT) register file and compares the result with the delta threshold to decide if the magnitude of change should be dropped or saved.