
NP-CGRA: Extending CGRAs for Efficient Processing of Light-weight Deep Neural Networks


In this paper, we propose a set of architectural extensions and a mapping scheme to greatly improve the performance of CGRAs for depthwise separable convolution (DSC) kernels. Hard DPUs such as the TPU [2] may have higher performance and energy efficiency, but they lack flexibility and may struggle to support future application changes. Soft DPUs such as BrainWave [1] are easily upgradeable, but typically have much lower performance in terms of cost and energy efficiency compared with hard DPUs.

Second, we propose a small set of generic architecture extensions and a mapping scheme for DWC and PWC kernels.

Baseline CGRA

CGRA Architecture Exploration

DPU Optimization

CGRA Performance Bottleneck Analysis

Our Proposed Architecture Extension

For example, compiling matrix multiplication for a CGRA would produce a very suboptimal schedule. The most critical architectural change is the use of crossbar-type memory buses, as opposed to parallel buses. This mapping achieves 100% PE utilization, as each PE performs a MUL (multiplication) and an ADD (addition) operation every cycle, given the dual-mode MACs explained next.

Each PE's input MUXes can select among the north, south, east, and west neighbor PE outputs, the PE's own output, registers R0-R3, and a constant. A detailed diagram of the dual-mode MAC is omitted due to page limitations, but its design is straightforward. To facilitate spatial data reuse on CGRAs, we propose an operand reuse network, which allows input-to-input routing as opposed to output-to-input routing.

One way to map this loop to a CGRA is to place the output variables yi on different PEs (i.e., yi on PEi), called output stationary, and to route the inputs and coefficients to the PEs. The operand reuse network thus allows one of the source operands of a PE (i.e., the output of an input MUX) to be passed to neighboring PEs without affecting the computation the PE is performing, as illustrated in the figure. While a weight-stationary scheme can realize spatial data reuse without an operand reuse network, it cannot easily utilize more PEs than the number of weight parameters.
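To make the output-stationary idea concrete, the following is a minimal Python sketch (ours, not the paper's mapping code): each PE(r, c) keeps one output element stationary and performs one multiply-accumulate per cycle, with one operand shared along its row and the other along its column. The grid size, data layout, and bus assignment here are illustrative assumptions.

    # Output-stationary matrix multiplication C = A x B on an Nr x Nc PE grid.
    # PE(r, c) holds acc[r][c]; every PE does one MUL and one ADD per cycle,
    # i.e. 100% utilization with a dual-mode MAC.  (Illustrative sketch only.)
    Nr, Nc, Kdim = 2, 2, 4                       # PE grid and reduction length (assumed)
    A = [[r * Kdim + k for k in range(Kdim)] for r in range(Nr)]
    B = [[k * Nc + c for c in range(Nc)] for k in range(Kdim)]
    acc = [[0] * Nc for _ in range(Nr)]          # one stationary accumulator per PE

    for cycle in range(Kdim):
        for r in range(Nr):
            a = A[r][cycle]                      # operand shared along the row (e.g. H-bus)
            for c in range(Nc):
                b = B[cycle][c]                  # operand shared along the column (e.g. V-bus)
                acc[r][c] += a * b               # one MAC per PE per cycle

    assert acc == [[sum(A[r][k] * B[k][c] for k in range(Kdim))
                    for c in range(Nc)] for r in range(Nr)]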

Figure 1: Mapping PWC (or matrix mult.) to a 2×2 CGRA.

Instruction Format and Global Configuration

We now present our application mapping for DWC kernels (the PWC mapping was already detailed in Section 3.2). We first present a general method that works for any stride, and then an optimized version for S = 1, which is the most common case. In the following, a tile refers to the amount of work (and the corresponding data) processed simultaneously by the CGRA, with its size determined by the CGRA size.

Although in this paper we mainly describe PE scheduling and data routing, which are crucial for maximizing PE utilization and minimizing memory accesses, our implementation and evaluation cover the full mapping, including data layout and the AGU algorithms.

Depthwise Convolution with Arbitrary Stride

The notation used in this section is as follows:

- Nr, Nc: number of rows/columns of the CGRA PE array
- K, S: kernel size and stride
- Nh, Nw: height and width of the OFM (Output Feature Map)
- Br × Bc: number of tiles in the current block (rows × columns)
- AIDr, AIDc: zero-based row (or column) index of an H-AGU (or V-AGU)
- Na: number of bank address bits (H-MEM or V-MEM)
- tidr, tidc: tile row/column coordinate within a block
- tcycle: zero-based cycle number

The input tile, which is the set of IFM data needed to produce the output tile, is the gray-filled rectangle in the figure.

For example, to calculate the top row of the output tile, the top three rows of the input tile are needed; these are delivered sequentially via an H-bus. Note that our scheme achieves maximum data reuse within each row, because the data needed for each row is presented only once. Weight parameters can be provided via V-buses, because each column of PEs uses the same weight parameters every cycle.
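The row-level reuse can be sketched as follows (our illustration; indices are relative to the input tile): output row r of the tile needs input rows r·S through r·S+K−1, so consecutive output rows overlap in max(K−S, 0) input rows, and each input row only has to be streamed over an H-bus once.

    # Input-tile rows needed per output-tile row for a KxK kernel with stride S.
    # Overlapping rows between consecutive output rows are the data reused
    # instead of being re-sent over the H-bus.  (Illustration only.)
    def input_rows(out_row, K, S):
        return list(range(out_row * S, out_row * S + K))

    K, S = 3, 2
    for r in range(2):                            # a 2-row output tile as an example
        print("output row", r, "needs input rows", input_rows(r, K, S))
    # output row 0 needs input rows [0, 1, 2]
    # output row 1 needs input rows [2, 3, 4]    -> row 2 is reused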

Depthwise Convolution with S = 1

For the next K cycles, the PE array processes the first row of the weight matrix, using IFM data that is partially reused from the previous cycle (received from the east-side neighbor PEs) and partially loaded from local memory (for the easternmost column); we call this the EE (Expand East) phase. In the next cycle, the PE array processes W1,2, which requires reusing IFM data from the south-side neighbor PEs while the southernmost row loads new IFM data; we call this the SS (Shift South) phase. In the following K−1 cycles, the PE array processes the remaining elements of the second weight row; this is similar to the EE phase except that we expand to the west, and is thus called the EW (Expand West) phase.
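Read this way, the per-tile schedule for general K is: K EE cycles for the first weight row, then, for each of the remaining K−1 rows, one SS cycle followed by K−1 cycles expanding in the opposite horizontal direction, for K² cycles in total (9 cycles for K = 3). The sketch below enumerates the phases under this reading (ours, not the paper's exact cycle labels).

    # Phase per cycle for the S = 1 DWC mapping as described above (our reading).
    def dwc_s1_phases(K):
        phases = ["EE"] * K                       # first weight row: expand east
        direction = "EW"                          # following rows alternate direction
        for _ in range(K - 1):
            phases.append("SS")                   # shift south, load one new IFM row
            phases.extend([direction] * (K - 1))  # finish the weight row
            direction = "EE" if direction == "EW" else "EW"
        return phases

    print(dwc_s1_phases(3))
    # ['EE', 'EE', 'EE', 'SS', 'EW', 'EW', 'SS', 'EE', 'EE']  -> 9 cycles per tile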

In this scheme, all PEs use the same weight element, which is provided by the GRF and indexed by the CGRA controller. We place the complete IFM data in H-MEM, and the part needed for the SS phases in V-MEM. Only one channel is considered in this mapping, which is repeated over all channels to complete the DWC.

Fig. 7b shows how to achieve this, with details such as which IFM elements (indicated by red boxes) and which weight element are used by the CGRA in each cycle. Moreover, only the gray elements are loaded from memory; the white IFM elements in red boxes are received from neighboring PEs and thus reused, which is essential for achieving an optimal mapping with limited memory bandwidth. Most of the memory accesses can be served by H-buses, with a few exceptions: T = 6 (or T = 9) can be handled in a single cycle using V-buses, and step 1 takes two cycles but can be done as part of initialization and possibly merged with other operations.

Figure 5: Mapping DWC on a 2×2 CGRA (K = 3, S = 2).

Pointwise Convolution

We now present our mapping methods for PWC and DWC kernels in more detail, focusing on data movement within the PE array and between the PEs and the memories.

Figure 8: Mapping DWC with stride 1 on a 2×2 CGRA.

Depthwise Convolution with Arbitrary Stride

Depthwise Convolution with S = 1

Performance Analysis

An excerpt of the AGU address-generation algorithm (line numbers refer to the full listing):

    8:  // generate bank index
    9:  bank_number ← (twrap + AIDr) % Nr
    ...
    13: addr ← tidc·Nc + tidr·Nc·Bc + Nc + cycle − tile_latency + 1 + addrOFM

The number of blocks is ⌈Nw/(Br·Nr)⌉ × ⌈No/(Bc·Nc)⌉, each of which takes Br·Bc·T cycles, which gives the layer latency. When processing one tile, ((Nc−1)·S+K) IFM elements are used, along with 1×K weight elements, repeated K times.
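The latency formula above can be turned into a small analytical model (our sketch; variable names follow the notation table, and the example layer and CGRA sizes are hypothetical).

    # Analytical layer-latency model for the DWC mapping, per the formulas above.
    from math import ceil

    def dwc_layer_cycles(Nw, No, Nr, Nc, Br, Bc, T):
        blocks = ceil(Nw / (Br * Nr)) * ceil(No / (Bc * Nc))
        return blocks * Br * Bc * T               # each block takes Br*Bc*T cycles

    def dwc_tile_data(Nc, K, S):
        ifm = (Nc - 1) * S + K                    # IFM elements per tile (per repetition)
        weights = 1 * K                           # 1xK weights, repeated K times
        return ifm, weights

    # Hypothetical example: 112x112 layer on an 8x8 CGRA, one tile per block, T = 9.
    print(dwc_layer_cycles(Nw=112, No=112, Nr=8, Nc=8, Br=1, Bc=1, T=9))   # 1764 cycles
    print(dwc_tile_data(Nc=8, K=3, S=1))                                   # (10, 3)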

DWC General shares the layer-latency formula with DWC Optimized (i.e., S = 1), although the tile latency is different. Even if PWC mappings are handled efficiently, the im2col overhead can limit the overall processing speed. We can reduce the cost of im2col by processing data in the channel dimension first, instead of running im2col and processing the data within each channel first.
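To illustrate why im2col is costly (a generic sketch, not the paper's implementation): for a K×K kernel, im2col produces one K·K-element row per output pixel, so input data are replicated by up to a factor of K² before the matmul-style PWC mapping can run.

    # Generic single-channel im2col, showing the data expansion it introduces.
    def im2col_1ch(ifm, K, S):
        Hi, Wi = len(ifm), len(ifm[0])
        Nh, Nw = (Hi - K) // S + 1, (Wi - K) // S + 1
        cols = []
        for oy in range(Nh):
            for ox in range(Nw):
                cols.append([ifm[oy * S + ky][ox * S + kx]
                             for ky in range(K) for kx in range(K)])
        return cols

    ifm = [[y * 4 + x for x in range(4)] for y in range(4)]    # 4x4 toy input
    cols = im2col_1ch(ifm, K=3, S=1)
    print(len(cols), "rows of", len(cols[0]), "elements vs", 4 * 4, "original elements")
    # 4 rows of 9 elements vs 16 original elements (expansion grows toward K^2)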

Moreover, our DWC algorithm has a limitation when processing data for multiple channels. We therefore plan to add a method for processing channel data continuously when handling multi-channel data with small-height inputs. In future work we will address these two problems to further reduce network inference time despite the AGU overhead.

Figure 11: IFM data in external memory and in V-MEM for DWC (shown in red is a tile).

Experimental Setup

Depthwise Separable Convolution Results

For this experiment only, the CGRA size was set to 4×4 (for all three cases) due to the CCF tool flow. For CCF, we apply loop pipelining to the loop level with the largest trip count, which is the image height (Nh). The architectural factor is about 2×, as our NP-CGRA has a 2× higher arithmetic and memory operation rate than the baseline CGRA.

A closer look at the generated code revealed that CCF generates an additional 1 MUL and 3 ADD operations for each MAC operation (1 MUL, 1 ADD) in the program, due to address generation for the load/store operations used; in other words, six operations are issued where two would suffice, a 3× inflation in operation count. Overall, the difference in mapping efficiency is about 10× in the case of PWC for this relatively small CGRA size. In total, our NP-CGRA achieves a speedup of more than 20× and an ADP (area-delay product) reduction of almost 18× for PWC compared with the baseline (our architecture has an 18% larger total area including SRAM memory; the synthesis results are discussed in Section 6.3).

For DWC, our NP-CGRA again provides better performance and ADP than the baseline. Although PE utilization in the matmul-based DWC case is approximately 16% (and cannot exceed 25% with just one CGRA column), our DWC mapping delivers approximately 1.75∼3× higher performance and efficiency than the matmul-based mapping. Note that DWC (S = 2) layers are the rarest in MobileNets, while PWC accounts for the majority of MAC operations, which justifies the relatively small optimization effort for the former case.

Hardware Overhead Evaluation

(This is not reported in the paper, and is assumed to be the area of the 4×4 baseline CGRA.) The ARM processor's runtime is included in the latency, but its area is not included in the ADP. The common logic and variables used by the AGUs, such as iterators, are implemented in the controller, which is included in the breakdown shown in the graph.

The area increase in the PE array is modest (the baseline architecture already has a homogeneous operation set, meaning that all PEs support MUL and ADD operations). Not surprisingly, the total area is dominated by the SRAM memories, which puts the total area overhead of NP-CGRA at 22.2%. While we use the same clock frequency for both CGRAs in our ADP evaluation, our dual-mode MAC does increase the critical path delay.

Table 6: Comparison with previous CGRA and DPU implementations

Comparison with Previous Work Using MobileNet

AlexNet Convolution Layer Results

References

[1] Ghandi et al., "A configurable cloud-scale DNN processor for real-time AI," in Proc. 45th ACM/IEEE International Symposium on Computer Architecture (ISCA), 2018.
[2] Borchers et al., "In-datacenter performance analysis of a tensor processing unit," in Proc. 44th ISCA, 2017.
[3] Chen et al., "MobileNetV2: Inverted residuals and linear bottlenecks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[4] Sze et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits.
[5] Sze et al., "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems.
[6] Chaves Filho et al., "MorphoSys: An integrated reconfigurable system for data-parallel and computation-intensive applications," IEEE Transactions on Computers.
[7] Lauwereins et al., "ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix," in Proc. International Conference on Field-Programmable Logic and Applications (FPL).
[8] Kim et al., "Design space exploration and implementation of a high-performance and low-area coarse-grained reconfigurable processor," in Proc. International Conference on Field-Programmable Technology (FPT), 2012.

Table 1: Theoretical minimum latency (ms, sum of 7 DWC layers).
Architecture | Compute time | L1 transfer | Layer latency
CGRA baseline (4×4) | 1.68 | 0.75∼4.10 | 1.68∼4.10

Figure 2: Proposed PE architecture (our extension shown in red).
Figure 3: Instruction definition.
