

Title:
AN IMPROVED HARDWARE PRIMITIVE FOR IMPLEMENTATIONS OF DEEP NEURAL NETWORKS
Document Type and Number:
WIPO Patent Application WO/2020/215124
Kind Code:
A1
Abstract:
Quantization is a key optimization strategy to improve the performance of floating-point deep neural network (DNN) accelerators. FPGA-based accelerators usually employ fine-grained resources such as lookup tables (LUTs), as the digital signal processing (DSP) blocks available on FPGAs are not efficiently utilized when applied to low-precision computations. This issue is addressed for the most important computations in embedded DNN accelerators, namely the standard, depth-wise, and point-wise convolutional layers through three modifications to Xilinx DSP48E2 DSP-blocks. First, a flexible precision, run-time decomposable multiplier architecture for CNN implementations is provided. Second, a significant upgrade to DSP-DSP interconnect is provided, providing a semi-2D low precision chaining capability which supports our low-precision multiplier. This enables a 1D DSP column to be operated in a Semi-2D mesh arrangement, reducing the data read access energy by avoiding off-DSP interconnections when data streaming. Data reuse via a register file which can also be configured as FIFO is also presented.

Inventors:
RASOULINEZHAD SEYEDRAMIN (AU)
LEONG PHILIP (AU)
Application Number:
PCT/AU2020/050395
Publication Date:
October 29, 2020
Filing Date:
April 24, 2020
Assignee:
UNIV SYDNEY (AU)
International Classes:
G06F9/22; G06N3/06; H03K19/17736
Domestic Patent References:
WO2017003887A1 2017-01-05
Foreign References:
US20130135008A1 2013-05-30
US8468335B2 2013-06-18
US8495122B2 2013-07-23
US8583569B2 2013-11-12
US0949539A 1910-02-15
Other References:
LANGHAMMER, M. ET AL.: "High Density and Performance Multiplication for FPGA", IEEE 25TH SYMPOSIUM ON COMPUTER ARITHMETIC (ARITH), 25 June 2018 (2018-06-25), pages 5 - 12, XP033400124, DOI: 10.1109/ARITH.2018.8464695
Attorney, Agent or Firm:
SHELSTON IP PTY LTD (AU)
Claims:
CLAIMS:

1. A flexible precision, run-time decomposable multiplier system for FPGA or ASIC architectures, which includes run-time precision control.

2. A system as claimed in claim 1 wherein the decomposition is provided by divide and conquer (partitioning) and recursive twin precision techniques.

3. A system as claimed in any previous claim further including a DSP to DSP interconnect, including providing a semi-2D low precision chaining capability which supports the low-precision multiplier.

4. A system as claimed in claim 3 wherein said chaining capability includes allowing a 1D DSP column to be operated in a Semi-2D mesh arrangement, reducing the data read access energy by avoiding off-DSP interconnections when data streaming.

5. A system as claimed in any previous claim wherein data is forwarded from one DSP to at least two DSPs in data chaining.

6. A system as claimed in any previous claim wherein predetermined register files within the architectures are also configured as FIFO data structures.

7. A system as claimed in any previous claim wherein said system implements a series of convolution layers of a Deep Neural Network architecture.

8. A DSP to DSP interconnect, including providing a semi-2D low precision chaining capability which supports the low-precision multiplier.

9. A system as claimed in claim 8 wherein said chaining capability includes allowing a 1D DSP column to be operated in a Semi-2D mesh arrangement, reducing the data read access energy by avoiding off-DSP interconnections when data streaming.

10. In an FPGA architecture, configuring register files within the architecture as dual use FIFO data structures.

Description:
An improved hardware primitive for implementations of Deep Neural Networks

FIELD OF THE INVENTION

[0001] The present invention provides for systems and methods for improvements in deep neural network architectures.

BACKGROUND OF THE INVENTION

[0002] Any discussion of the background art throughout the specification should in no way be considered as an admission that such art is widely known or forms part of common general knowledge in the field.

[0003] Recent progress with deep neural networks (DNNs) has yielded significant improvements over conventional approaches in cognitive applications like image, speech and video recognition [1]. Using massively parallel architectures, DNNs are much more memory- and computationally expensive than previous approaches, and efficient implementations continue to pose a challenge. For instance, AlexNet, proposed in 2012, requires 724M floating-point operations (FLOPs) on 61M parameters in a 5-layer network to achieve a 15.3% top-5 error rate on ImageNet [2]. In contrast, ResNet-152, a state-of-the-art convolutional neural network (CNN), uses 11.3B FLOPs over 152 layers to improve the top-5 error to 3.6% [3].

[0004] Modern accelerators have strived to decrease the memory footprint and computation requirements of CNNs with minimal compromise in accuracy by using low-precision arithmetic operations, particularly for inference [4], [5], [6], [7]. Reference [1] compared the implementation of multiply-accumulate (MAC) units with different wordlengths on Xilinx and Intel FPGAs. It reported that by using fixed-point 8×8-bit operations instead of single-precision floating point, logic resources are reduced by a factor of 10-50. This idea has been taken to its conclusion with ternary and binary operations, which achieve extremely high speed and low energy on FPGA platforms [8], [9].

[0005] Current FPGAs include hard digital signal processing (DSP) blocks to allow efficient implementation of MAC operations. Unfortunately, as for central processing unit (CPU), graphics processing unit (GPU) and application-specific integrated circuit (ASIC) architectures, they are optimized for higher precision (8-18 bits) and do not efficiently support low-precision MAC operations, leading to inefficiencies in resource usage and energy consumption. Inefficient DSP-block usage causes higher LUT utilization and wasted area. In addition, researchers have proposed strategies involving run-time selection of wordlengths, which cannot be efficiently implemented in current FPGA architectures [10].

[0006] Research on computer architectures for DNN accelerators has extensively utilized 2D systolic architectures, which offer higher performance and energy efficiency via data reuse techniques [11], [12]. Current FPGA DSP-block layouts are based on 1D DSP columns. This is a mismatch with 2D systolic architectures, leading to inefficiencies and requiring that general-purpose rather than dedicated routing resources be used.

SUMMARY OF THE INVENTION

[0007] It is an object of the present invention to provide an improved form of DSP block particularly optimised for DNN type operations.

[0008] In accordance with a first aspect of the present invention, there is provided a flexible precision, run-time decomposable multiplier system for FPGA architectures, which preferably can include run-time precision control.

[0009] In some embodiments, the decomposition can be provided by divide and conquer (partitioning) and recursive twin precision techniques. The system can further include a DSP to DSP interconnect, including providing a semi-2D low precision chaining capability which supports the low-precision multiplier.

[0010] The chaining capability preferably can include allowing a 1D DSP column to be operated in a Semi-2D mesh arrangement, reducing the data read access energy by avoiding off-DSP interconnections when data streaming. The data can be forwarded from one DSP to at least two DSPs in data chaining.

[0011] Predetermined register files within the FPGA architectures are preferably also configured as FIFO data structures.

[0012] In one embodiment, the system implements a series of convolution layers of a Deep Neural Network architecture.

[0013] In accordance with a further aspect of the present invention, there is provided a DSP to DSP interconnect, including providing a semi-2D low precision chaining capability which supports the low-precision multiplier.

[0014] The chaining capability preferably can include allowing a 1D DSP column to be operated in a Semi-2D mesh arrangement, reducing the data read access energy by avoiding off-DSP interconnections when data streaming.

[0015] In accordance with a further aspect of the present invention, there is provided in an FPGA architecture, configuring register files within the FPGA architecture as dual use FIFO data structures.

BRIEF DESCRIPTION OF THE DRAWINGS

[0016] Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which:

[0017] Fig. 1 illustrates a Xilinx DSP48E2 schematic;

[0018] Fig. 2 is a schematic block diagram of a modified DSP48E2 device;

[0019] Fig. 3 is a schematic block diagram of the divide and conquer technique used in the embodiments;

[0020] Fig. 4 is a schematic block diagram of the recursive twin precision decomposition of a signed/unsigned 9x9 multiplier for depth factor = 0 (left), 1 (middle), and 2 (right);

[0021] Fig. 5 illustrates conventional and modified processing elements used in the embodiments, showing: a) the 2D processing unit architecture of [11]; b) a 3x3 convolution layer implementation on the 2D architecture; c) our Semi-2D DSP arrangement; and d) a conventional FPGA column-based arrangement;

[0022] Fig. 6 illustrates the proposed implementation for standard and DW convolution layers (left), and PW convolution layers (right); and

[0023] Fig. 7 illustrates an implementation approach for different DW convolution kernel sizes.

DETAILED DESCRIPTION

[0024] The embodiments provide a novel precision, interconnect and reuse optimised DSP block (PIR-DSP), which is optimised for implementing area-efficient DNNs.

[0025] In particular, we make the following contributions:

[0026] · Precision: A MAC (MAC-IP) with run-time precision control using a combination of Divide-and-Conquer and Recursive Twin-Precision techniques.

[0027] · Interconnect: A DSP interconnection scheme which provides support for semi-2D connections and low-precision streaming.

[0028] · Reuse: Inclusion of register files within the DSP to improve data-reuse and reduce energy.

[0029] · Evaluation of performance for implementing machine learning primitives including standard, depth-wise (DW) and point-wise (PW) convolution layers in recent embedded DNNs.

[0030] PIR-DSP is implemented as a parameterized module generator which can target FPGAs or ASICs.

[0031] RELATED WORKS

[0032] A thorough review of DNNs is available in reference [17].

[0033] A. Deep Neural Networks

[0034] The turning point for deep learning (DL) is generally considered to have occurred in 2012 with AlexNet [2], which won the ImageNet Large Scale Visual Recognition Competition (ILSVRC) challenge that year. When applied to the large ImageNet dataset, with 10,000 categories and 10 million images, it achieved a top-5 classification error rate of 15.3%, the next best result being 26.2% from a non-neural-network model (all reported performance metrics are taken from reference [17]). This network had a storage requirement of 62M parameters and used 724M MAC operations per classification.

[0035] There has been considerable recent interest in memory- and computationally efficient CNNs for mobile and embedded applications. Consider a standard convolutional layer which takes a D_F × D_F × M feature map F as input and produces a D_G × D_G × N feature map G as output. The output is generated via a convolution with a D_K × D_K × M × N kernel K as follows:

$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m} \qquad (1)$$

[0036] MobileNet [18], [14] proposed depth-wise separable convolutions, which first factorize Equation 1 into M depth-wise convolutions:

$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m} \qquad (2)$$

[0037] where K̂ is the D_K × D_K × M depth-wise kernel and the m'th filter of K̂ is applied to the m'th channel of F to produce the m'th channel of Ĝ. Linear combinations of the M depth-wise layer outputs are then used to form the N outputs, these being called 1×1 point-wise convolutions. A speedup of ND_K²/(N + D_K²) is achieved, with typical values of 8-9 times (for D_K = 3) and a small reduction in accuracy. A study of the speed/accuracy tradeoffs of convolutional object detectors compared the use of the Inception, MobileNet, ResNet and VGG networks as the feature extractor for object detection, with MobileNet achieving excellent accuracy if low execution time on a GPU is desired [19].
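As a numeric illustration of the factorization above, the following Python sketch (with illustrative layer dimensions, not values taken from this specification) counts the MAC operations of a standard convolutional layer and of its depth-wise separable equivalent, confirming the quoted speedup of roughly 8-9 times for D_K = 3:

```python
# Minimal MAC-count comparison for Equation 1 vs. the depth-wise separable
# factorization (Equation 2 plus 1x1 point-wise). Dimensions are illustrative.

def standard_conv_macs(d_g, d_k, m, n):
    # One MAC per kernel element, input channel, output channel, output pixel.
    return d_g * d_g * d_k * d_k * m * n

def dw_separable_macs(d_g, d_k, m, n):
    depthwise = d_g * d_g * d_k * d_k * m   # one D_K x D_K filter per channel
    pointwise = d_g * d_g * m * n           # 1x1 convolutions combine channels
    return depthwise + pointwise

d_g, d_k, m, n = 56, 3, 32, 192             # a MobileNet-style layer (assumed)
speedup = standard_conv_macs(d_g, d_k, m, n) / dw_separable_macs(d_g, d_k, m, n)
print(f"speedup = {speedup:.2f}")           # N*D_K^2/(N + D_K^2) = 8.60 here
```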

[0038] In order to manage the massive computation and storage complexities of DNNs, efforts at reducing hardware resource usage at all design levels have been undertaken, e.g. efficient computational kernels [20], [21], [16], [22], [23], [24], data pruning [25], [26], memory compression [27], [28] and quantization [29], [30], [31], [32], [33]. Table I provides a summary of the architectures employed in a number of recent state-of-the-art embedded DNNs. From the last row, it can be seen that standard, DW, PW and fully connected (FC) layers account for almost all MACs.

[0039] Table I: SUMMARY OF THE ARCHITECTURES EMPLOYED IN A NUMBER OF RECENT STATE OF THE ART EMBEDDED DNNS

[0040] Modern GPUs are presently the most popular solution for high-performance DNN implementation, and Google's Tensor Processing Unit (TPU) is an application-specific integrated circuit (ASIC) for accelerating DNNs [34]. Unfortunately, neither can efficiently support lower precisions than 8-bit. In contrast, FPGA architectures are more customizable and can support arbitrary precision MAC operations using fine-grained logic resources [8], [9], [35], [36], [37].

[0041] Interest in quantization has dramatically increased since it was shown that binarized and ternary weights with low-precision activations suffer only a small decrease in accuracy compared with floating point [38], [39]. Since FPGAs can implement arbitrary precision datapaths, they have some advantages over the byte-addressable GPUs and CPUs for these applications. Moreover, the highest speed implementations on all platforms use reduced precision for efficiency reasons.

[0042] B. DSP blocks

[0043] CPU architectures work at high clock speeds and are efficient for highly sequential computations, while GPU-based systems have a massive number of parallel processing elements and are favoured for parallel computations. In contrast to CPU and GPU architectures, FPGA systems are able to efficiently implement a range of parallel and sequential computations. They allow the data path to be better customized for an application, enabling designs to be more highly optimized, particularly in inference for processing single input feature maps (to minimize latency) and to support low precision. Datapaths are most efficient when operations can be implemented using hard DSP resources.

[0044] 1) Xilinx DSP48E2: The Xilinx DSP48E2 DSP [40] in the UltraScale architecture can perform 27×18 MAC operations and is illustrated in Fig. 1. It includes a 27-bit pre-adder, 48-bit accumulator, and 48-bit arithmetic logic unit (ALU). Dual SIMD 24-bit or quad 12-bit ADD/SUB operations can be computed in the ALU, and other DSP48E2 features include pattern matching and 1D unidirectional chaining connections. The DSPs can be cascaded to form a higher precision multiplier, and optional pipeline registers are present. In the DSP48E2, the SIMD wordlength can be changed at run-time.

[0045] 2) Intel DSPs: The Intel DSP [41] supports one 27×27 or two 18×18 multiplications. Precision is compile-time rather than run-time configurable and there is no pattern matching unit. A pre-adder is implemented, as well as two read-only register files (RFs) which can be initialized at compile time and jointly operated as a higher precision RF.

[0046] 3) Previous Work in Multi-precision DSPs: Previous research has been conducted in supporting larger numbers of low precision operations using existing DSP blocks. Xilinx has proposed a method to use 8 DSP blocks to perform 7×2 8-bit multiply-add operations, improving performance 1.75 times [42]. Colangelo et al. [43] proposed to use an 18×18 multiplier as four different 2×2 multipliers. Multi-precision FPGA hard blocks have been proposed by Parandeh-Afshar and Ienne [44]. This DSP variant, based on a radix-4 Booth architecture, supports 9/12/18/24/36 multiplier bit-widths and multi-input addition. Boutros et al. [45] proposed a modification of the Arria-10 DSP that can support 4× 9-bit or 8× 4-bit MACs. For the AlexNet, VGG-16, and ResNet-50 DNNs, this architecture improved speed by up to 1.6 times while reducing utilized area by up to 30%. The proposed PIR-DSP differs from previous designs in that it provides a parameterised hard or soft DSP block generator with improved flexibility, considers buffering of data within the DSP, and also considers inter-DSP interconnect. This serves to improve the speed and energy consumption of the standard, DW and PW convolutions of Table I, with FC layer computations unaffected by the changes.

[0047] PIR-DSP

[0048] There will now be described three modifications to the Xilinx DSP48E2 block with reference to Fig. 2.

[0049] A. Precision: Decomposable Multiplier

[0050] The multiplier decomposition strategy is based on two approaches: Divide-and-Conquer and Recursive Twin-Precision.

[0051] 1) Divide and Conquer: A signed 2's complement number can be represented as the sum of one signed (the most significant part) and one unsigned term:

$$A = 2^k A_s + A_u \qquad (3)$$

[0052] where the k-th bit is the dividing point and A_s and A_u are respectively the signed and unsigned portions.

[0053] When applied to signed multiplication, this enables the separation of lower-precision product terms:

$$A \times B = 2^{2k} A_s B_s + 2^k (A_s B_u + A_u B_s) + A_u B_u \qquad (4)$$

[0054] with each input being chopped at the k-th bit.
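Equations 3 and 4 can be checked numerically. The Python sketch below (operand values chosen arbitrarily for illustration) chops two signed operands at bit k and verifies that the four lower-precision product terms reassemble the full product:

```python
# Verify A = 2^k * A_s + A_u and the four-term product expansion of Equation 4.

def chop(x, k):
    """Split a signed value at bit k: (signed upper part, unsigned lower part)."""
    return x >> k, x & ((1 << k) - 1)   # arithmetic shift preserves the sign

a, b, k = -19_043, 27_001, 8            # arbitrary signed operands, chop at bit 8
a_s, a_u = chop(a, k)
b_s, b_u = chop(b, k)

assert a == (a_s << k) + a_u and b == (b_s << k) + b_u          # Equation 3
product = ((a_s * b_s) << 2 * k) + ((a_s * b_u + a_u * b_s) << k) + a_u * b_u
assert product == a * b                                          # Equation 4
print("decomposition verified")
```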

[0055] Consider Equation 4 applied to an N × M-bit multiplier with chopping size C, where N, M, and C are respectively 27, 18, and 9. As shown in Fig. 3(a), standard multiplication is done by adding six partial results with appropriate shifts. Fig. 3(b) shows that by controlling the shift steps for the first, fourth and fifth partial results, the summation can be arranged into two separate columns, where each column calculates a three-term C×C-bit MAC operation with separated carry-in signals.
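The following Python model (an arithmetic illustration only, not a gate-level description) applies this chopping to the 27×18 case with C = 9, forming the six 9×9 partial products and checking that they sum, with the appropriate shifts, to the full-precision result:

```python
# Equation 4 applied with chopping size C = 9: i = 3 pieces of A, j = 2 of B.
C = 9

def pieces(x, n):
    """Chop a signed value into n C-bit pieces; only the top piece is signed."""
    out = []
    for _ in range(n - 1):
        out.append(x & ((1 << C) - 1))   # unsigned lower pieces
        x >>= C                          # arithmetic shift
    out.append(x)                        # signed most-significant piece
    return out                           # least-significant piece first

a, b = -51_234_567, 99_123               # fit in signed 27-bit and 18-bit ranges
a_p, b_p = pieces(a, 3), pieces(b, 2)

partials = [(ap * bp) << (C * (p + q))   # six shifted C x C products
            for p, ap in enumerate(a_p)
            for q, bp in enumerate(b_p)]
assert sum(partials) == a * b
print("six 9x9 partial products reassemble the 27x18 product")
```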

[0056] 2) Recursive Twin-Precision: In the Twin-Precision technique [46], we assume a signed/unsigned N × N multiplier as our baseline. Inputs are 1-bit extended according to the individual sign control signals and their MSBs. The extended inputs are then multiplied using an (N+1) × (N+1) signed multiplier based on the Baugh-Wooley structure [47]. Figure 4(a) shows the baseline multiplier, where A and B are 9-bit numbers and each circle represents a logical function. By modifying the logic circuits of the partial products (PPs) and preventing carry propagation using mode control signals, the multiplier can also work as two half-precision multipliers. The required modifications are depicted in Figure 4(b). By recursively applying the technique to the twin multipliers, the multiplier is able to compute four quarter-precision products (like the circuit in Figure 4(c)) in parallel without significant additional resources, as only small changes to the PP logic and carry propagation paths are required.
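A behavioural sketch of the Twin-Precision idea is given below in Python. It models only the arithmetic semantics (one full product, or two independent half-precision lane products once carry propagation between the halves is blocked), not the Baugh-Wooley gate-level structure; operand packing and values are illustrative:

```python
# Behavioural model of a decomposable 9x9 multiplier: full precision, or two
# independent signed/unsigned half-precision lanes with carries blocked.
N = 9

def sext(x, w):
    """Interpret the low w bits of x as a signed value."""
    x &= (1 << w) - 1
    return x - (1 << w) if x & (1 << (w - 1)) else x

def twin_mult(a, b, twin=False, signed=True):
    if not twin:                                  # ordinary N x N product
        return a * b
    h = N // 2                                    # two (N//2)-bit lanes
    lane = (lambda v: sext(v, h)) if signed else (lambda v: v & ((1 << h) - 1))
    lo = lane(a) * lane(b)                        # low quadrant of the PP array
    hi = lane(a >> h) * lane(b >> h)              # high quadrant, carries blocked
    return lo, hi                                 # packed in disjoint result fields

a = (3 << 4) | 5                                  # lanes: high = 3, low = 5
b = ((-2 & 0xF) << 4) | 6                         # lanes: high = -2, low = 6
print(twin_mult(a, b, twin=True))                 # -> (30, -6)
```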

[0057] The multiplier is parameterized by chopping factors (separately for each of the two inputs) and the depth. For an M × N multiplier, we use the notation MxNCijDk, where i and j are the chopping factors (the numbers of times we chop M and N) and k is the Recursive Twin-Precision depth factor.

[0058] We applied our idea to the Xilinx DSP48E2 27×18 multiplier, which produces two partial results (the following ALU is responsible for adding these two outputs). To create a 27×18C32D2 configuration, we chop A and B into i = 3 and j = 2 9-bit parts. As each smaller multiplication is a signed/unsigned 9-bit multiplication, we then used Recursive Twin-Precision with depth k = 2 to change the 9×9 signed/unsigned multiplier to additionally support two 4×4 and four 2×2 multiplications (Figure 4(c)). Extra bits are included so that this is done without precision loss. Fig. 3(c) and (d) show how the bit-level carry propagation from each column to the next is arranged. Combining the six 9×9 multipliers, we can compute the following multi-precision MAC operations without precision loss (a behavioural sketch of these modes follows the list):

[0059] · One signed/unsigned 27×18

[0060] · Two sets of signed/unsigned (9×9 + 9×9 + 9×9)

[0061] · Four sets of signed/unsigned (4×4 + 4×4 + 4×4)

[0062] · Eight sets of signed/unsigned (2×2 + 2×2 + 2×2)
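The behavioural sketch mentioned above is given here in Python (operand values are illustrative). It models only the lane semantics of these modes: 2, 4 or 8 independent lanes, each accumulating three products, giving 6, 12 or 24 MACs per DSP per cycle:

```python
# Software model of the multi-precision three-MAC lanes of the 27x18C32D2
# MAC-IP. Sign handling and bit packing are abstracted away for clarity.

def mac_lanes(a_ops, b_ops, lanes):
    """Compute `lanes` independent three-term MACs: sum(a_i * b_i) per lane.

    lanes = 2, 4 or 8 corresponds to 9-, 4- or 2-bit operation respectively.
    """
    assert len(a_ops) == len(b_ops) == lanes
    return [sum(x * y for x, y in zip(aa, bb)) for aa, bb in zip(a_ops, b_ops)]

# Two sets of signed 9-bit (9x9 + 9x9 + 9x9), the second mode in the list:
print(mac_lanes([[100, -7, 33], [-255, 12, 1]],
                [[-3, 50, 2], [4, 4, -128]], lanes=2))   # -> [-584, -1100]
```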

[0063] A generator can be developed which uses these techniques to convert any size multiplier to a MAC-IP. A sign-magnitude format is used so each operand can be signed or unsigned, this being controllable at run-time.

[0064] B. Interconnect: Low-precision, Semi-2D DSP-DSP Communication

[0065] In this section, we focus on data movement among processing elements (PEs), which are DSP-blocks in this context. It has been shown that 2D systolic array solutions using data streaming and efficient buffering are required for high-performance, low-energy DNN applications [11], [12].

[0066] Whereas in ASIC designs the PEs can be arranged in a 2D pattern, FPGA DSP-blocks must be arranged in columns. In each column, DSP inputs and outputs can be passed via dedicated chain connections. This single-direction chaining is highly efficient for their intended signal processing applications. Although general routing resources make it possible to configure a 2D mesh network of PEs, this approach introduces significant amounts of additional circuitry and latency compared with direct connections.

[0067] In 2D systolic architectures, PE interconnections must forward input and result data to two different destination PEs, usually in different dimensions. Figure 5(a) shows a 2D PE architecture, proposed in reference [11], which is an N×M mesh network of PEs with unidirectional communications occurring in horizontal, vertical and diagonal directions. In Figure 5(b) a 3×3 convolutional layer is assigned to three rows of the PEs. By rearranging this three-row architecture as shown in Figure 5(c), we organize them as a column. When implementing 2D systolic array solutions on conventional FPGA column-based chains, it is impossible to use both the input and output dedicated chain connections as they have the same source and destination. Figure 5(d) shows a column-based connection which is capable of forwarding the data/result to the next DSP block. This addresses the difficulty of implementing a 2D interconnection on a 1D array by supporting data forwarding to two DSPs instead of a single one. This is particularly effective for the case where one dimension is small (e.g. 3 elements for 3×3 convolutional layers).
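The following Python simulation (a software illustration of Fig. 5(c) with an arbitrary random feature map, not a hardware model) shows the semi-2D idea for a 3×3 kernel: three chained PEs each evaluate a three-MAC over one kernel row, and the cascaded partial sums yield the 2D convolution output from a 1D column:

```python
# Three chained PEs, one kernel row each; partial sums cascade down the column.
import numpy as np

K = np.arange(9).reshape(3, 3)           # 3x3 kernel, row r mapped to PE r
F = np.random.randint(-8, 8, (5, 8))      # random input feature map

def pe_column(out_row):
    """One pass of the 3-PE column producing output row `out_row`."""
    width = F.shape[1] - 2
    acc = np.zeros(width, dtype=int)
    for pe in range(3):                   # PE `pe` consumes input row out_row+pe,
        row = F[out_row + pe]             # forwarded along the chain (NoF = 2)
        for c in range(width):            # one three-MAC per output column
            acc[c] += int(K[pe] @ row[c:c + 3])   # cascaded partial sum
    return acc

ref = np.array([[int((K * F[r:r + 3, c:c + 3]).sum()) for c in range(6)]
                for r in range(3)])       # direct 2D convolution, for checking
assert all(np.array_equal(pe_column(r), ref[r]) for r in range(3))
print("3x3 convolution computed by a 1D column of three-MAC PEs")
```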

[0068] Current DSP columns are capable of streaming high precision data over the chains. We propose low precision streaming for the Xilinx DSP48E2 which is efficiently coupled with our multiplier module. This supports the proposed multipliers for applications such as low precision convolution layers, FIR filters, and matrix multiplication. To stream low precision inputs, we make some minor modifications to the input B register and chaining connections to support both high and low precision data streaming. We also modified both the input A and input B DSP chains to support run-time configurable input data forwarding to up to the next two DSPs. This is done by bypassing the next DSP, enhancing the implementation capabilities for improving data reuse via a small modification to current FPGAs. With these changes, the 18-bit input B can feed both B 27-bit shift registers and their 9-bit MSB portions via both the A and B chains. Furthermore, the design supports run-time configuration (Figure 2). When used to implement convolutional layers, these modifications support one high-precision or two low-precision streams for the Stride = 1 and 2 cases.

[0069] C. Reuse: Flexible FIFO and Register File

[0070] In DNN implementations each input/parameter takes part in many MAC operations, so it is important to cache fetched data. Since data movement contributes more to energy consumption than computation, this leads to higher performance and energy reduction [11], [12]. Unfortunately, Xilinx DSP-blocks do not support caching of data (this is done using the fine-grained resources or hard memory blocks). Intel DSPs do include a small embedded memory for each 18-bit multiplier, but they cannot be configured at run-time and hence can only be used efficiently for fixed coefficients, making them unsuitable for buffering of data for practical sized DNNs.

[0071] A small and flexible first-in-first-out register file (FIFO/RF) can be provided to enhance data reuse. This is a wide shift register which can be loaded sequentially and read by two standard read ports. The two read port address signals can be provided from outside the DSP-block. The first port is used inside the DSP and brings the requested and the next data to the multiplier and multiplexer units (two 27-bit read ports are needed to feed the multiplier). The other port is used to select the data for DSP-DSP chaining connections. As RFs are mostly used to buffer a chunk of data inside the DSP, writes always occur as a burst. Using this approach, we arrange the RF as a flexible FIFO. By adjusting the FIFO length, systolic array implementations with different buffering patterns can be implemented. The schematic of our implemented FIFO/RF is given in Figure 2, and it operates on input A.
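A hedged software sketch of such a dual-use FIFO/RF follows. The class name, depth and data values are illustrative assumptions rather than actual block parameters; it demonstrates only the behaviour described above, namely burst writes, two independent read ports, and FIFO shifting at a configurable length:

```python
# Behavioural model of a register file that doubles as a fixed-length FIFO.
from collections import deque

class FifoRF:
    def __init__(self, depth=8):
        self.regs = deque(maxlen=depth)   # FIFO length fixed at configuration

    def burst_write(self, words):
        for w in words:                   # writes always occur as a burst
            self.regs.append(w)           # oldest entry shifts out when full

    def read(self, addr):
        return self.regs[addr]            # two such ports exist in the block

rf = FifoRF(depth=4)
rf.burst_write([10, 20, 30, 40, 50])      # 10 shifts out: FIFO behaviour
mul_operand = rf.read(0)                  # port 1: feeds the multiplier
chain_word = rf.read(3)                   # port 2: feeds the DSP-DSP chain
print(mul_operand, chain_word)            # -> 20 50
```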

[0072] EXPERIMENTAL STUDY

[0073] A. Precision: As a baseline, we modeled the Xilinx DSP48E2 DSP-block in Verilog and synthesized it in an SMIC 65-nm standard cell technology using Synopsys Design Compiler 2013.12. Post-synthesis reports show that our modeled DSP48E2 timing is consistent with reported speeds for the DSP48E1 in Virtex-5 speed grade -1; in particular, the critical path is 3.85 and 3.94 ns respectively for our modeled DSP48E2 and the Virtex-5 DSP48E1. A comparison with the DSP48E1 rather than the DSP48E2 was made as the former has generally the same DSP architecture and 65 nm process technology [48]. The DSP48E2 is the most recent version, including three major architectural upgrades: a wider multiplier unit (27×18 instead of 25×18), a pre-adder module, and a wide XOR circuit [49].

[0074] The baseline DSP48E2 multiplier produces two temporary results, and these are added using the ALU to produce the final MAC output. As a longer critical path is created by the PIR-DSP partial product summation circuits, we applied parallel computing and carry-lookahead techniques to both the multiplier and the ALU, and also added a new pipeline-register layer to the multiplier unit to prevent the performance drop of using our more complex circuit. Modifications to the ALU also required replacing the DSP48E2 12/24/48-bit SIMD add/sub operations with a 4/8/18/48-bit SIMD, which leads to smaller and width-variant ALUs since they must be aligned with the carry propagation blocking points, as shown in Figure 2.
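The width-variant SIMD behaviour can be modelled as lane-masked addition, as in the Python sketch below (a behavioural model only; the lane widths and packed values are illustrative). Blocking the carry at each lane boundary is what allows one wide ALU to perform several independent narrow additions:

```python
# SIMD add with carries blocked at lane boundaries (unsigned lanes shown).

def simd_add(a, b, total=48, lane=12):
    """Add packed `lane`-bit fields of two `total`-bit words independently."""
    mask = (1 << lane) - 1
    out = 0
    for shift in range(0, total, lane):
        s = ((a >> shift) & mask) + ((b >> shift) & mask)
        out |= (s & mask) << shift        # discard the inter-lane carry
    return out

# Quad 12-bit lanes in a 48-bit word: 0xFFF + 1 wraps within its own lane
# instead of rippling a carry into the neighbouring lane.
a = (0x001 << 36) | (0x800 << 24) | (0xFFF << 12) | 0x00A
b = (0x002 << 36) | (0x800 << 24) | (0x001 << 12) | 0x005
print(hex(simd_add(a, b)))                # -> 0x300000000f
```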

[0075] We applied the proposed multiplier structure in different configurations to the Xilinx 27×18-bit multiplier and ALU unit. By omitting the extra logical circuits in the ALU, the remaining circuit is our proposed MAC-IP. Table II shows post-synthesis area, maximum frequency, and energy per MAC operation results for different configurations using a performance-optimized synthesis strategy.

[0076] Table III shows the area and performance results for different PIR-DSP variations. By upgrading the multiplier to a 27×18C32D2 MAC-IP, improvements in MAC capabilities of 6×, 12× and 24× for 9-, 4- and 2-bit MAC operations respectively are gained, at the cost of a 14% increase in area. Configurations #1 to #3 in Table III show the synthesis results obtained by simply replacing the multiplier and ALU units. Configuration #4 is achieved by modifying the multiplier (in the 27×18C32D2 configuration) and including the interconnect optimization. Configuration #5 is the final implementation of PIR-DSP, which includes all three modifications.

[0077] TABLE II: MAC-IP POST-SYNTHESIS RESULTS (AREA RATIO 1 = 9224 µm²)

[0078] TABLE III: PIR-DSP SYNTHESIS RESULTS.

[0079] B. Interconnect and Reuse

[0080] To evaluate the effectiveness of our proposed data movement modifications for low-precision computations, we focused on the total run-time energy required to implement low-precision versions of some well-cited embedded CNNs.

[0081] We extracted the read and write energy using the Xilinx Power Estimator (XPE) for BRAM and LUT blocks on a Virtex-5 FPGA. E_BRAM,Read and E_BRAM,Write per byte were measured for an 18-bit wide memory configuration (the most efficient way to use BRAMs). To estimate the energy associated with moving data from an off-DSP register file (RF) and shift-register (SR), we configured the LUTs respectively as RAM with Fanout = 4 for broadcasting, and as a shift register with Fanout = 1 for streaming (Table IV). Using results for small register files in [50], [51], [52], we estimated our embedded 4×2 30-bit RF read and write energy to be 1.1 pJ/byte. The RF width and size are selected, respectively, to fully feed the multiplier/pre-adder in high/low-precision and to be similar to the Intel DSP-block read-only RFs, which are configured as two 8×18-bit memories per DSP. To estimate the input B energy, which operates as an SR and a normal register, we used results for high-performance [53] and low-energy [54] flip-flops (FFs) to obtain estimates of 180 fJ and 90 fJ respectively. The energy required to transfer data from DSP to DSP was obtained from reference [55], and scaled to 65 nm technology, to obtain 2 pJ per byte. Using the energy ratios from Table II, the energy consumption for 9/4/2-bit MAC operations is 89/44/22× that of a 9-bit register. Table V summarizes the estimated energy ratios for data movement. We further assume that all elements (except the MAC) scale linearly with word length.

[0082] TABLE IV: ESTIMATION OF BRAM, OFF-DSP RF AND SR READ/WRITE ACCESS ENERGY PER 9-BIT WORD ON A XILINX XC5VLX155T, EXTRACTED FROM THE XPE TOOL (pJ)

[0083] TABLE V: DATA MOVEMENT ENERGY RATIOS IN 65 NM TECHNOLOGY (1× = 90 fJ)

[0084] We now describe implementations of standard and DW convolutional layers, using a 3×3 DW convolution layer as a case study. According to Equation 2, output channels can be computed in parallel. We assume input and weight parameters are located in BRAMs and results are written back to BRAMs. In an implementation on conventional DSPs [56], a weight-stationary data flow was used, with each input feature map element fetched once from BRAMs and then streamed over off-DSP SRs. Weight parameters are fetched once from BRAMs and saved in DSP registers.

[0085] Each filter element and each input element are used F_h × F_w and K_h × K_w times respectively. The average energy per MAC for the described data flow, where E_MAC is the energy consumption of the MAC computation itself, is summarized in Table VI.

[0086] For a PIR-DSP implementation, inspired by the Eyeriss architecture [11], we mapped the computation of multiple rows of output channels to a three-cascaded PIR-DSP (Figure 6). Each PIR-DSP can compute 2/4/8 sets of three-MAC operations for 9/4/2-bit precision. Each three-MAC operation can be used for a row of a 3×3 DW kernel. Cascading three PIR-DSPs, we can sum the partial outputs to produce the final output feature map elements. As illustrated in Figure 6 for 9-bit precision, each PIR-DSP receives two streams of 9-bit data (as each PIR-DSP can compute two parallel three-MAC operations). Using our interconnect scheme, the three-cascaded PIR-DSPs can forward two of their streams to the next three-cascaded PIR-DSP over the DSP-DSP chains, and we can implement K rows of 2/4/8 channels of the output matrix for 9/4/2-bit precision using a column of 3K PIR-DSPs. For this case, E_input becomes:

[0087] where NoF is the number of forwards over the chains for each input stream (2 in our case, as each row of the input stream is involved in three rows of the output feature map). To implement other kernel sizes, we use a kernel tiling approach with tile sizes of 3×3, 2×3, and 1×3, which are respectively the computation capabilities of a three-cascaded, a two-cascaded, and a single PIR-DSP. As depicted in Figure 7, a 5×5 kernel can be implemented with 2× three-cascaded and 2× two-cascaded DSP groups, where NoF is 6.
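The forwarding benefit can be illustrated with a small, heavily hedged calculation. The energy numbers below are placeholders standing in for Tables IV and V (which are not reproduced in this text), and the model simply assumes that one BRAM fetch serves NoF + 1 consumers at the cost of NoF chain hops; it is not the exact energy equation of the specification:

```python
# Toy model of input-access energy per MAC, with and without chain forwarding.

E_BRAM_READ = 10.0    # placeholder energy per 9-bit word (assumed, not Table IV)
E_CHAIN_HOP = 2.0     # placeholder DSP-DSP forwarding energy per word (assumed)

def input_energy_per_mac(macs_per_elem, nof):
    """Average input-access energy per MAC for one streamed input element.

    Baseline (nof = 0): every consumer fetches the element from BRAM itself.
    With forwarding: one fetch plus nof chain hops serve nof + 1 consumers.
    """
    per_elem = (E_BRAM_READ + E_CHAIN_HOP * nof) / (nof + 1)
    return per_elem / macs_per_elem

baseline = input_energy_per_mac(macs_per_elem=3, nof=0)
pir_dsp = input_energy_per_mac(macs_per_elem=3, nof=2)   # 3x3 DW case, NoF = 2
print(f"relative input access energy: {pir_dsp / baseline:.2f}")   # -> 0.47
```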

[0088] For the case of standard convolution, by using our proposed RF to reuse the streamed input for many filter parameters, E_input can be reduced by a factor of the RF size, according to the last line of Table VI. The calculated access energy ratio in the last column indicates that PIR-DSP uses 31% of the data access energy for a middle bottleneck layer of MobileNet-v2 [14], which applies 192 depth-wise 3×3 filters on an input feature map of shape 56² × 192.

[0089] For a PW convolution, each input channel can be streamed into a DSP to be multiplied by the corresponding weight parameter, producing a partial result which is cascaded and summed to produce an entry of the output feature map. In a PIR-DSP implementation, we assign three channels of input and three corresponding channels of 2/4/8 PW kernels to a PIR-DSP, depending on the operation precision. Using 2, 4, or 8 three-MAC operations, the PIR-DSP computes in parallel the partial results of applying each filter to the same input stream (the stream includes one element of three channels of the input feature map in each cycle). By cascading, we change this to 2, 4, or 8 six-MAC operations (computing six elements of the PW kernels). Also, as illustrated in the right hand part of Figure 6 for 9-bit precision, using our interconnect circuit, each two-cascaded PIR-DSP can forward its streams to the next two-cascaded DSP, which leads to the energy reduction summarized in Table VI. Thus, the PIR-DSP uses saved weights and performs a MAC with the 2/4/8 3-channel weight parameters, which are saved in two 27-bit registers. Furthermore, the RF improves input data reuse. Applying the equations to a middle bottleneck layer of MobileNet-v2 (which applies 192 PW 1×1×32 filters on a 56² × 32 input feature map), our proposed optimizations reduce the read access energy to 44% of the original value.

[0090] When a similar analysis was applied to all layers of some common embedded DNN models, the results in Table VII were obtained. For example, when applying all our optimisations to MobileNet-v2 [14], energy is reduced to 31/19/13% of the original value for 9/4/2-bit precision.

[0091] TABLE VII: ENERGY RATIO OF PIR-DSP OPTIMISATIONS FOR 9/4/2-BIT PRECISION (PERCENT)

[0092] TABLE VI: READ ACCESS ENERGY FOR STANDARD/DW/PW CONV. LAYER PER MAC (BASELINE IMPLEMENTATION USES OFF-DSP RESOURCES TO STREAM INPUT OVER SAVED WEIGHTS IN DSP REGISTERS)

[0093] TABLE VIII: COMPARISON WITH PREVIOUS WORK. ADR=AREA DELAY RATIO, MAIN ENTRIES ARE IN (# OF MAC PER CYCLE / MAC PER SECOND PER DSP (GOPS/SEC)) FORMAT.

[0095] C. Comparison with Previous Work

[0096] BitFusion [57] is an ASIC DNN accelerator supporting multi-precision MACs. The reported area is for a computation unit including 16 Bit-bricks and supporting 8×8 multipliers, in 45-nm technology. This unit is similar to our 27×18C32D2 MAC-IP (Table II), although BitFusion is more flexible as it supports more variations, including 2×4, 2×8 and 4×8. Table VIII compares performance per area (PPA). We used the maximum frequency reported for the same implementation, the DSP48E1, in three FPGAs (Virtex-5/6/7), normalized to feature size [58] (area is scaled by 1/0.66/0.3 and maximum frequency by 1/1.1/1.35 respectively for 65/45/28 nm). BitFusion only applies the Divide-and-Conquer technique, leading to a high area overhead. The introduction of Recursive Twin-Precision better supports both low- and high-precision MAC operations.

[0097] Boutros et al. proposed improvements to the Intel DSP-block [45], which is capable of 27×27 and reduced precision MACs down to 4-bit. In comparison, PIR-DSP is a flexible module generator, can support precisions down to 2 bits, and has better performance at 8×8 bits and lower, but has worse PPA at 16×16 and higher. It is not possible to compare energy, but we would expect the design of Boutros et al. to be similar to the Baseline case in Table VI, with PIR-DSP having significant advantages due to the interconnect and reuse optimizations.

[0098] CONCLUSION

[0099] We have proposed the PIR-DSP architecture, which incorporates precision, interconnect and reuse optimizations to better support 2-dimensional low-precision DNN applications. When applied to the implementation of embedded DNNs, for which most of the computation is in the standard, PW and DW convolutions, it was shown that our DSP block architecture can significantly reduce the energy consumption of low-precision implementations, albeit with a 28% area overhead.

[00100] REFERENCES

[00101] [1] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “A survey of FPGA based neural network accelerator,” CoRR, vol. abs/1712.08934, 2017. [Online]. Available: http://arxiv.org/abs/1712.08934

[00102] [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS'12. USA: Curran Associates Inc., 2012, pp. 1097-1105. [Online]. Available: http://dl.acm.org/citation.cfm?id=2999134.2999257

[00103] [3] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2016.90

[00104] [4] Y. Guan, Z. Yuan, G. Sun, and J. Cong, “FPGA-based accelerator for long short-term memory recurrent neural networks,” 2017 22nd Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 629-634, 2017.

[00105] [5] H. Li, X. Fan, L. Jiao, W. Cao, X. Zhou, and L. Wang, “A high performance FPGA-based accelerator for large-scale convolutional neural networks,” in Field Programmable Logic and Applications (FPL), 2016 26th International Conference on. IEEE, 2016, pp. 1-9.

[00106] [6] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y. Tai, “Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs,” in Proceedings of the 54th Annual Design Automation Conference, DAC 2017, Austin, TX, USA, June 18-22, 2017, 2017, pp. 62:1-62:6. [Online]. Available: https://doi.org/10.1145/3061639.3062244

[00107] [7] C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks,” 2016 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 1-8, 2016.

[00108] [8] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. H. W. Leong, M. Jahre, and K. A. Vissers, “FINN: A framework for fast, scalable binarized neural network inference,” CoRR, vol. abs/1612.07119, 2016. [Online]. Available: http://arxiv.org/abs/1612.07119

[00109] [9] A. Prost-Boucle, A. Bourge, F. Pétrot, H. Alemdar, N. Caldwell, and V. Leroy, “Scalable high-performance architecture for convolutional ternary neural networks on FPGA,” in Field Programmable Logic and Applications (FPL), 2017 27th International Conference on, Gent, Belgium, Sep. 2017. [Online]. Available: https://hal.archives-ouvertes.fr/hal-01563763

[00110] [10] L. Shan, M. Zhang, L. Deng, and G. Gong, “A dynamic multiprecision fixed-point data quantization strategy for convolutional neural network,” in Computer Engineering and Technology - 20th CCF Conference, NCCET 2016, Xi'an, China, August 10-12, 2016, Revised Selected Papers, 2016, pp. 102-111. [Online]. Available: https://doi.org/10.1007/978-981-10-3159-5_10

[00111] [11] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127-138, 2017.

[00112] [12] Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss v2: A flexible and high-performance accelerator for emerging deep neural networks,” arXiv preprint arXiv:1807.07928, 2018.

[00113] [13] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures for scalable image recognition,” CoRR, vol. abs/1707.07012, 2017.

[00114] [14] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” CoRR, vol. abs/1801.04381, 2018.

[00115] [15] N. Ma, X. Zhang, H. Zheng, and J. Sun, “Shufflenet V2: Practical guidelines for efficient CNN architecture design,” CoRR, vol. abs/1807.11164, 2018.

[00116] [16] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size,” 2016.

[00117] [17] V. Sze, Y. Chen, T. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, 2017. [Online]. Available: https://doi.org/10.1109/JPROC.2017.2761740

[00118] [18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861

[00119] [19] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy, “Speed/accuracy trade-offs for modern convolutional object detectors,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 3296-3297. [Online]. Available: https://doi.org/10.1109/CVPR.2017.351

[00120] [20] B. Wu, A. Wan, X. Yue, P. H. Jin, S. Zhao, N. Golmant, A. Gholaminejad, J. Gonzalez, and K. Keutzer, “Shift: A zero FLOP, zero parameter alternative to spatial convolutions,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018, pp. 9127-9135. [Online]. Available: http://openaccess.thecvf.com/content_cvpr_2018/html/Wu_Shift_A_Zero_CVPR_2018_paper.html

[00121] [21] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. IEEE Computer Society, 2018, pp. 6848-6856. [Online]. Available: http://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_ShuffleNet_An_Extremely_CVPR_2018_paper.html

[00122] [22] L. Lu, Y. Liang, Q. Xiao, and S. Yan, “Evaluating fast algorithms for convolutional neural networks on FPGAs,” 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 101-108, 2017.

[00123] [23] J. Faraone, G. Gambardella, N. J. Fraser, M. Blott, P. H. W. Leong, and D. Boland, “Customizing low-precision deep neural networks for FPGAs,” in 28th International Conference on Field Programmable Logic and Applications, FPL 2018, Dublin, Ireland, August 27-31, 2018, 2018, pp. 97-100. [Online]. Available: https://doi.org/10.1109/FPL.2018.00025

[00124] [24] C. Zhang and V. K. Prasanna, “Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA 2017, Monterey, CA, USA, February 22-24, 2017, J. W. Greene and J. H. Anderson, Eds. ACM, 2017, pp. 35-44. [Online]. Available: http://dl.acm.org/citation.cfm?id=3021727

[00125] [25] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” CoRR, vol. abs/1510.00149, 2015. [Online]. Available: http://arxiv.org/abs/1510.00149

[00126] [26] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 806-814.

[00127] [27] M. Samragh, M. Ghasemzadeh, and F. Koushanfar, “Customizing neural networks for efficient FPGA implementation,” 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 85-92, 2017.

[00128] [28] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song et al., “Going deeper with embedded FPGA platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2016, pp. 26-35.

[00129] [29] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” Lecture Notes in Computer Science, pp. 525-542, 2016. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46493-0_32

[00130] [30] S. Han, J. Kang, H. Mao, Y. Hu, X. Li, Y. Li, D. Xie, H. Luo, S. Yao, Y. Wang, H. Yang, and W. J. Dally, “ESE: Efficient speech recognition engine with sparse LSTM on FPGA,” in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, ser. FPGA '17. New York, NY, USA: ACM, 2017, pp. 75-84. [Online]. Available: http://doi.acm.org/10.1145/3020078.3021745

[00131] [31] E. Nurvitadhi, D. Sheffield, J. Sim, A. K. Mishra, G. Venkatesh, and D. Marr, “Accelerating binarized neural networks: Comparison of FPGA, CPU, GPU, and ASIC,” 2016 International Conference on Field-Programmable Technology (FPT), pp. 77-84, 2016.

[00132] [32] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou, “Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients,” CoRR, vol. abs/1606.06160, 2016.

[00133] [33] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” CoRR, vol. abs/1612.01064, 2016.

[00134] [34] N. P. Jouppi, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean et al., “In-datacenter performance analysis of a tensor processing unit,” Proceedings of the 44th Annual International Symposium on Computer Architecture - ISCA '17, 2017. [Online]. Available: http://dx.doi.org/10.1145/3079856.3080246

[00135] [35] L. Jiao, C. Luo, W. Cao, X. Zhou, and L. Wang, “Accelerating low bit-width convolutional neural networks with embedded FPGA,” 2017 27th International Conference on Field Programmable Logic and Applications (FPL), pp. 1-4, 2017.

[00136] [36] D. J. M. Moss, E. Nurvitadhi, J. Sim, A. K. Mishra, D. Marr, S. Subhaschandra, and P. H. W. Leong, “High performance binary neural networks on the Xeon+FPGA™ platform,” in 27th International Conference on Field Programmable Logic and Applications, FPL 2017, Ghent, Belgium, September 4-8, 2017, 2017, pp. 1-4. [Online]. Available: https://doi.org/10.23919/FPL.2017.8056823

[00137] [37] H. Nakahara, T. Fujii, and S. Sato, “A fully connected layer elimination for a binarized convolutional neural network on an FPGA,” in 27th International Conference on Field Programmable Logic and Applications, FPL 2017, Ghent, Belgium, September 4-8, 2017, 2017, pp. 1-4. [Online]. Available: https://doi.org/10.23919/FPL.2017.8056771

[00138] [38] J. Faraone, N. Fraser, M. Blott, and P. H. Leong, “SYQ: Learning symmetric quantization for efficient deep neural networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[00139] [39] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.

[00140] [40] Xilinx Inc., “UG579: UltraScale Architecture DSP Slice,” Tech. Rep., 2018.

[00141] [41] Intel Corp., “UG-S10-DSP: Intel Stratix 10 Variable Precision DSP Blocks User Guide,” Tech. Rep., 2018.

[00142] [42] Xilinx Inc., “WP486: Deep Learning with INT8 Optimization on Xilinx Devices,” Tech. Rep., 2017.

[00143] [43] P. Colangelo, N. Nasiri, E. Nurvitadhi, A. Mishra, M. Margala, and K. Nealis, “Exploration of low numeric precision deep learning inference using Intel FPGAs,” Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays - FPGA '18, 2018. [Online]. Available: http://dx.doi.org/10.1145/3174243.3174999

[00144] [44] H. Parandeh-Afshar and P. Ienne, “Highly versatile DSP blocks for improved FPGA arithmetic performance,” in 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2010, Charlotte, North Carolina, USA, 2-4 May 2010, 2010, pp. 229-236. [Online]. Available: https://doi.org/10.1109/FCCM.2010.42

[00145] [45] A. Boutros, S. Yazdanshenas, and V. Betz, “Embracing diversity: Enhanced DSP blocks for low-precision deep learning on FPGAs,” in 28th International Conference on Field Programmable Logic and Applications, FPL 2018, Dublin, Ireland, August 27-31, 2018, 2018, pp. 35-42. [Online]. Available: https://doi.org/10.1109/FPL.2018.00014

[00146] [46] M. Själander and P. Larsson-Edefors, “Multiplication acceleration through twin precision,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 9, pp. 1233-1246, Sept 2009.

[00147] [47] C. R. Baugh and B. A. Wooley, “A two's complement parallel array multiplication algorithm,” IEEE Trans. Comput., vol. 22, no. 12, pp. 1045-1047, Dec. 1973. [Online]. Available: https://doi.org/10.1109/TC.1973.223648

[00148] [48] Xilinx Inc., “Virtex-5 FPGA Data Sheet: DC and Switching Characteristics,” June 2016, v5.5.

[00149] [49] H. Wong, V. Betz, and J. Rose, “Quantifying the gap between FPGA and custom CMOS to aid microarchitectural design,” IEEE Trans. VLSI Syst., vol. 22, no. 10, pp. 2067-2080, 2014. [Online]. Available: https://doi.org/10.1109/TVLSI.2013.2284281

[00150] [50] S. Hsu, A. Agarwal, M. Anders, S. Mathew, R. Krishnamurthy, and S. Borkar, “An 8.8GHz 198mW 16x64b 1R/1W variation-tolerant register file in 65nm CMOS,” in 2006 IEEE International Solid State Circuits Conference Digest of Technical Papers, Feb 2006, pp. 1785-1797.

[00151] [51] K. Sarfraz and M. Chan, “A 65nm 3.2 GHz 44.2 mW low-Vt register file with robust low-capacitance dynamic local bitlines,” in European Solid-State Circuits Conference (ESSCIRC), ESSCIRC 2015 - 41st. IEEE, 2015, pp. 331-334.

[00152] [52] M. Horowitz, “1.1 Computing's energy problem (and what we can do about it),” in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), Feb 2014, pp. 10-14.

[00153] [53] P. Bhattacharjee and A. Majumder, “A variation-aware robust gated flip-flop for power-constrained FSM application,” Journal of Circuits, Systems and Computers, p. 1950108. [Online]. Available: https://doi.org/10.1142/S0218126619501081

[00154] [54] J. Shen, L. Geng, G. Xiang, and J. Liang, “Low-power level converting flip-flop with a conditional clock technique in dual supply systems,” Microelectronics Journal, vol. 45, no. 7, pp. 857-863, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0026269214001372

[00155] [55] S. Das, T. M. Aamodt, and W. J. Dally, “SLIP: Reducing wire energy in the memory hierarchy,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 13-17, 2015, D. T. Marr and D. H. Albonesi, Eds. ACM, 2015, pp. 349-361. [Online]. Available: https://doi.org/10.1145/2749469.2750398

[00157] [56] U. Aydonat, S. O'Connell, D. Capalija, A. C. Ling, and G. R. Chiu, “An OpenCL™ deep learning accelerator on Arria 10,” CoRR, vol. abs/1701.03534, 2017.

[00158] [57] H. Sharma, J. Park, N. Suda, L. Lai, B. Chau, J. K. Kim, V. Chandra, and H. Esmaeilzadeh, “Bit fusion: Bit-level dynamically composable architecture for accelerating deep neural networks,” CoRR, vol. abs/1712.01507, 2017.

[00159] [58] A. Stillmaker and B. M. Baas, “Scaling equations for the accurate prediction of CMOS device performance from 180 nm to 7 nm,” Integration, vol. 58, pp. 74-81, 2017. [Online]. Available: https://doi.org/10.1016/j.vlsi.2017.02.002

Interpretation

[00160] Reference throughout this specification to “one embodiment”, “some embodiments” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, appearances of the phrases “in one embodiment”, “in some embodiments” or “in an embodiment” in various places throughout this specification are not necessarily all referring to the same embodiment, but may. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner, as would be apparent to one of ordinary skill in the art from this disclosure, in one or more embodiments.

[00161] As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

[00162] In the claims below and the description herein, any one of the terms comprising, comprised of or which comprises is an open term that means including at least the elements/features that follow, but not excluding others. Thus, the term comprising, when used in the claims, should not be interpreted as being limitative to the means or elements or steps listed thereafter. For example, the scope of the expression a device comprising A and B should not be limited to devices consisting only of elements A and B. Any one of the terms including or which includes or that includes as used herein is also an open term that also means including at least the elements/features that follow the term, but not excluding others. Thus, including is synonymous with and means comprising.

[00163] As used herein, the term “exemplary” is used in the sense of providing examples, as opposed to indicating quality. That is, an “exemplary embodiment” is an embodiment provided as an example, as opposed to necessarily being an embodiment of exemplary quality.

[00164] It should be appreciated that in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of this invention.

[00165] Furthermore, while some embodiments described herein include some but not other features included in other embodiments, combinations of features of different embodiments are meant to be within the scope of the invention, and form different embodiments, as would be understood by those skilled in the art. For example, in the following claims, any of the claimed embodiments can be used in any combination.

[00166] Furthermore, some of the embodiments are described herein as a method or combination of elements of a method that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or element of a method forms a means for carrying out the method or element of a method. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention.

[00167] In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

[00168] Similarly, it is to be noticed that the term coupled, when used in the claims, should not be interpreted as being limited to direct connections only. The terms "coupled" and "connected," along with their derivatives, may be used. It should be understood that these terms are not intended as synonyms for each other. Thus, the scope of the expression a device A coupled to a device B should not be limited to devices or systems wherein an output of device A is directly connected to an input of device B. It means that there exists a path between an output of A and an input of B which may be a path including other devices or means. "Coupled" may mean that two or more elements are either in direct physical or electrical contact, or that two or more elements are not in direct contact with each other but yet still co-operate or interact with each other.

[00169] Thus, while there has been described what are believed to be the preferred embodiments of the invention, those skilled in the art will recognize that other and further modifications may be made thereto without departing from the spirit of the invention, and it is intended to claim all such changes and modifications as falling within the scope of the invention. For example, any formulas given above are merely representative of procedures that may be used. Functionality may be added or deleted from the block diagrams and operations may be interchanged among functional blocks. Steps may be added or deleted to methods described within the scope of the present invention.