There is very little evidence in the literature on the application-level performance of CGRAs, as most recent work has focused only on kernel acceleration (see Section 6). In Section 5, we present our experimental results comparing CGRAs with GP-GPUs (among other results), which suggest that CGRAs running stream graphs can often be more energy efficient (from 7 times worse to 50 times better) than state-of-the-art GP-GPUs.
CGRA
Naive Mapping
In the following, we use two parameters: the kernel fraction before acceleration, p, and the kernel invocation count, K. We use all the applications in the benchmark suites except for a few that have problems with our infrastructure.
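As a rough illustration of how these two parameters limit the achievable application-level speedup, the sketch below uses a simple Amdahl-style model with a per-invocation overhead term; the model and all numbers are illustrative assumptions, not measurements from this work.

```python
def app_speedup(p, K, kernel_speedup, overhead_cycles, total_cycles):
    """Estimate application-level speedup from kernel acceleration.

    p               -- fraction of the original runtime spent in kernels (0..1)
    K               -- number of kernel invocations
    kernel_speedup  -- speedup of the kernel portion on the CGRA
    overhead_cycles -- per-invocation overhead (control + DMA), in host cycles
    total_cycles    -- original application runtime in host cycles
    """
    kernel_cycles = p * total_cycles
    other_cycles = (1.0 - p) * total_cycles
    accelerated = other_cycles + kernel_cycles / kernel_speedup + K * overhead_cycles
    return total_cycles / accelerated

# Illustrative numbers only: the K * overhead term caps the benefit of acceleration.
print(app_speedup(p=0.8, K=1_000, kernel_speedup=8.0,
                  overhead_cycles=500, total_cycles=10_000_000))
```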
When defining the kernel, we consider three different scenarios: only the innermost loops (C1), the outer loops included as well (C2), and even loops containing function calls (C3). Most applications (except tde_pp) have memory requirements small enough to fit into the SPM of our target architecture.
SCOPE AND PROBLEMS OF APPLICATION ACCELERATION
Input and Variables
Data flow graph: G = (V, E, W, B), where V = {ni} is the set of nodes and E = {eij} is the set of directed edges, with eij representing a data flow dependence from ni to nj. Runtime difference: pi is the benefit of mapping ni to the CGRA instead of the host processor, in terms of host processor cycles. It can be obtained by mapping a node (which can be a nested loop) to both the CGRA and the host processor and measuring the difference in execution time through simulation.
For the CGRA mapping, we include in the runtime all control overhead cycles spent on CGRA invocations. Without loss of generality, we assume that node ni is the i-th node in the static schedule. Constant arrays: we must also consider the size of constant arrays, which are often global variables in StreamIt programs and are shared by multiple actors.
Such an array needs SPM space only if at least one node that uses it is mapped to the CGRA.
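A minimal sketch of how this ILP input could be collected in one place; the field names and the small example instance below are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class StreamGraphInput:
    """Container for the ILP input described above (hypothetical layout)."""
    nodes: list        # node ids n_i, indexed in static schedule order
    edges: list        # directed edges (i, j) with their buffer sizes
    profit: dict       # p_i: cycles saved by mapping n_i to the CGRA
    array_size: dict   # c_k: size of constant array k (bytes)
    uses_array: set    # pairs (i, k) with a_ik = 1, i.e. n_i reads array k

# Placeholder example: three filters in a pipeline sharing one constant array.
g = StreamGraphInput(
    nodes=[0, 1, 2],
    edges=[(0, 1, 256), (1, 2, 256)],
    profit={0: 12_000, 1: 90_000, 2: 8_000},
    array_size={0: 1024},
    uses_array={(1, 0), (2, 0)},
)
```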
Variables
Schedule order of nodes: we require the schedule order of the nodes to account for the live intervals of the buffers. Array usage: we also require the arrays' usage information, given as a matrix whose entry aik is 1 if ni uses array k (and 0 otherwise), along with the array sizes {ck}. DMA request: dij is a binary variable that is 1 only if the edge connecting ni and nj requires a DMA transfer.
ILP Formulation (S-A case)
Putting it all together, the SPM size requirement at schedule step t, mt, is determined as follows. This formulation can be linearized (see Section 4.1.5); to linearize R2, however, we need to bound yij from above by Ymax, the largest degree difference across any edge in the graph.
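To make the shape of such a formulation concrete, the following is a hypothetical sketch using the PuLP solver; the variable names (x, d, u), the objective, and the crude capacity constraint are illustrative assumptions and do not reproduce the paper's exact constraints (e.g., R2 and the per-step memory terms mt).

```python
import pulp

# Toy instance (all numbers are placeholders): profits p_i, per-edge buffer
# sizes, constant array sizes c_k, usage a_ik, and an SPM capacity.
profit = {0: 12_000, 1: 90_000, 2: 8_000}
edge_buf = {(0, 1): 256, (1, 2): 256}
c = {0: 1024}
a = {(1, 0): 1, (2, 0): 1}
SPM_SIZE = 2048
DMA_COST = 300  # cycles charged for each edge crossing the host/CGRA boundary

prob = pulp.LpProblem("kernel_selection", pulp.LpMaximize)
x = {i: pulp.LpVariable(f"x_{i}", cat="Binary") for i in profit}              # n_i on CGRA?
d = {e: pulp.LpVariable(f"d_{e[0]}_{e[1]}", cat="Binary") for e in edge_buf}  # DMA on edge?
u = {k: pulp.LpVariable(f"u_{k}", cat="Binary") for k in c}                   # array k in SPM?

# Objective: cycles saved by CGRA mapping minus DMA cost on boundary edges.
prob += pulp.lpSum(profit[i] * x[i] for i in profit) - DMA_COST * pulp.lpSum(d.values())

# An edge needs a DMA transfer if exactly one endpoint is mapped to the CGRA.
for (i, j) in edge_buf:
    prob += d[(i, j)] >= x[i] - x[j]
    prob += d[(i, j)] >= x[j] - x[i]

# A constant array occupies SPM space if any node that uses it is on the CGRA.
for (i, k), aik in a.items():
    if aik:
        prob += u[k] >= x[i]

# Crude capacity constraint (ignores liveness over schedule steps): buffers of
# edges consumed on the CGRA plus resident constant arrays must fit in the SPM.
prob += pulp.lpSum(edge_buf[(i, j)] * x[j] for (i, j) in edge_buf) \
      + pulp.lpSum(c[k] * u[k] for k in c) <= SPM_SIZE

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({i: int(x[i].value()) for i in profit})
```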
ILP Formulation (A-A case)
Linearization
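Section 4.1.5 itself is not reproduced here; as an assumption about what such a step typically involves, a product term z_ij = x_i * y_ij with binary x_i and 0 <= y_ij <= Ymax can be replaced by the standard linear constraints below, which is where the bound Ymax mentioned above comes in.

```latex
% Standard linearization of z_{ij} = x_i \cdot y_{ij},
% with x_i binary and 0 \le y_{ij} \le Y_{\max}:
z_{ij} \le Y_{\max}\, x_i, \qquad
z_{ij} \le y_{ij}, \qquad
z_{ij} \ge y_{ij} - Y_{\max}\,(1 - x_i), \qquad
z_{ij} \ge 0.
```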
Increasing Task Granularity
Although this optimization involves a hardware extension at the PE level, it is expected to increase both the kernel fraction (p) and the kernel granularity (thereby reducing K). If there is a feedback loop at the global level, it only sets an upper bound on the value of ω. A critical drawback of the work counter optimization is that it requires ω-times larger buffers between filters.
Due to the limited local memory of a CGRA, its effect is in direct conflict with the objective of the kernel selection optimization. An interesting question, therefore, is whether a combination of work counter optimization and kernel selection can generate better results for a certain SPM size. Our preliminary results indicate that they need not be used simultaneously; that is, kernel selection is necessary only when the SPM size is small, and work counter optimization only when the SPM size is large.
This is because mapping a kernel to the CGRA yields a much larger performance improvement than increasing the work count.
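The trade-off between the two optimizations can be illustrated with a small back-of-envelope helper, under the assumption that a work count of ω multiplies every inter-filter buffer by ω; the function and all numbers are illustrative only.

```python
def max_work_count(buffer_sizes, const_array_bytes, spm_bytes):
    """Largest work count w such that w-times-scaled inter-filter buffers
    plus constant arrays still fit in the SPM (illustrative model only)."""
    fixed = sum(const_array_bytes)
    scalable = sum(buffer_sizes)
    if scalable == 0:
        return None  # no inter-filter buffers to scale
    return max((spm_bytes - fixed) // scalable, 0)

# Placeholder numbers: with a small SPM the affordable work count quickly drops
# to 1, matching the observation that work counter optimization pays off only
# when the SPM is comparatively large.
print(max_work_count([256, 256, 512], [1024], spm_bytes=4096))  # -> 3
print(max_work_count([256, 256, 512], [1024], spm_bytes=2048))  # -> 1
```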
Transparent Configuration Management
To reduce the cost of the operations newly added by aggregation, they define a small set of special operations, such as an iterator generator and an accumulator, that can perform an additional operation such as a periodic reset of an iterator, with an adjustable period. In steady state, the stream graph is executed on the CGRA by running the filters one by one, sequenced by the host processor. Since stream graphs run indefinitely, each filter can be repeated ω times before executing the next filter without changing the correctness of the output.
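A minimal sketch of this steady-state execution pattern; run_steady_state and the placeholder filters are hypothetical, and the point is only that repeating a filter ω times in a row amortizes its per-invocation costs (such as loading its configuration).

```python
def run_steady_state(filters, omega, iterations):
    """Execute one pipeline of filters in steady state.

    filters    -- list of callables, each performing one firing of a filter
    omega      -- work count: consecutive firings of the same filter
    iterations -- number of steady-state iterations to run
    """
    for _ in range(iterations):
        for f in filters:           # host processor sequences the filters
            for _ in range(omega):  # repeat the same filter omega times in a row
                f()

# Placeholder filters that just record their firing order.
trace = []
filters = [lambda: trace.append("A"), lambda: trace.append("B"), lambda: trace.append("C")]
run_steady_state(filters, omega=3, iterations=2)
print("".join(trace))  # AAABBBCCCAAABBBCCC
```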
The former therefore has a higher priority in terms of memory allocation, and the latter should be considered only if all kernels are mapped to the CGRA.
Mapping Flow
The CGRA power includes that of its SPM and configuration cache as well as the PE array, and is assumed to increase by 13% due to the special PEs [9]. This is because the kernel time on the CGRA also includes the prolog/epilog overhead due to software pipelining.
First, from Case A, we notice that the kernel share is about 80.6%, although it varies widely between applications. The application vocoder has the smallest share, which is due to the math functions used in its filters. Overall, due to the large kernel fraction, mapping only the innermost loops already gives a runtime reduction of over 50% (Case C over Case A), or about a 2x speedup over Case A.
Experiments
Comparison with GP-GPUs
For the GPU power numbers (see Table 5.2), we assume the average power dissipation to be 60% of the TDP value published by Nvidia, and divide it by 4 for the multi-GPU platforms (S1070 and S2050), since [12] uses only one of the four GPUs present on each platform. Note that the GPU power does not include the Intel CPU power, while the CGRA power includes that of the main processor (ARM Cortex A8).
Some applications run much faster on the GP-GPU (e.g., filterbank, FM), while others can run on the MP+CGRA platform with less than a 10x slowdown.
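A back-of-envelope sketch of how such an energy comparison can be derived from runtime and power figures, following the 60%-of-TDP assumption above; the TDP, runtimes, and platform power below are placeholders, not the measured values.

```python
def gpu_energy(runtime_s, tdp_w, gpus_on_board=1):
    """Energy estimate following the assumption above: average GPU power is
    60% of TDP, divided by the number of GPUs on the board when only one
    of them is actually used."""
    return runtime_s * (0.6 * tdp_w / gpus_on_board)

def cgra_energy(runtime_s, platform_w):
    """CGRA-side energy, including the host processor (e.g. ARM Cortex A8)."""
    return runtime_s * platform_w

# Placeholder numbers: a CGRA can be more energy efficient even when slower.
e_gpu = gpu_energy(runtime_s=1.0, tdp_w=200.0, gpus_on_board=4)  # 30 J
e_cgra = cgra_energy(runtime_s=5.0, platform_w=0.5)              # 2.5 J
print(f"energy ratio (GPU / CGRA): {e_gpu / e_cgra:.1f}x")        # 12.0x
```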
Application Speedup
For the CGRA results, all optimizations are enabled except work counter optimization, which is enabled only in the last column. Considering the difference in the technology used (see Table 5.2), our results show that CGRAs can be significantly more energy efficient than GP-GPUs for stream graphs. Performance can be further improved by mapping entire loop nests to the CGRA, which can reduce kernel-call overheads such as control cycles and DMA cycles, as seen in the graph. However, the extra code introduced when there are sibling inner loops can effectively increase the number of non-kernel cycles, which is the case with FFT. The impact of prefetching the configuration, as opposed to loading it on demand, is not large for most applications due to the large size of the configuration cache, but for DCT and vocoder it can reduce the application runtime by more than 20% and 10%, respectively.
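The benefit of configuration prefetching can be illustrated with a simple overlap model; this is a hypothetical sketch, not the simulator's timing model, and the cycle counts are placeholders.

```python
def runtime_on_demand(kernels):
    """Each kernel first loads its configuration, then computes."""
    return sum(load + compute for load, compute in kernels)

def runtime_prefetch(kernels):
    """The next kernel's configuration is fetched while the current kernel
    computes, so only the first load and any non-hidden remainder count."""
    total = kernels[0][0]  # the first configuration load cannot be hidden
    for i, (_, compute) in enumerate(kernels):
        next_load = kernels[i + 1][0] if i + 1 < len(kernels) else 0
        total += max(compute, next_load)
    return total

# Placeholder (config_load, compute) cycles per kernel invocation.
ks = [(400, 1000), (600, 500), (300, 2000)]
print(runtime_on_demand(ks), runtime_prefetch(ks))  # 4800 vs 3900 cycles
```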
Effect of Limited Memory Size
The code executable on the CGRA (with respect to our simulator) is generated from the flattened code by a C code regenerator, which takes parameters such as the initiation interval and step counts from the stream application flattened in the first step. We then profile the two codes, the flattened C code and the CGRA-executable code, to obtain the ILP input described in Section 4.1.1. Finally, we generate the scheduled stream application by taking each filter's code from one of the two codes mentioned above.
In group A (see Figure 5.4(a)), the DMA overhead is completely hidden by pipelining the filters, and the A-A case can additionally hide the computation of whichever processor (host processor or CGRA) requires fewer cycles during execution. Since the stream graphs are small (8 to 10 filters), the communication overhead that the S-A and A-A cases can remove is much smaller than the software pipelining overhead, which consists of the additional double-buffering code and the increased buffer sizes. As a result, at the application level, the A-A case improves performance by 17.1% on average over 8 applications compared to the S-S case.
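A minimal sketch of the double-buffering idea behind these cases; the thread pool stands in for a DMA engine and the fetch/compute placeholders are hypothetical, so this only illustrates how transfers can be overlapped with computation.

```python
from concurrent.futures import ThreadPoolExecutor

def process_stream(blocks, fetch, compute):
    """Double-buffered processing: while the current block is being computed
    on, the next block is transferred (a stand-in for DMA into a second SPM
    buffer)."""
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, blocks[0])          # prefetch the first block
        results = []
        for i in range(len(blocks)):
            buf = pending.result()                      # wait for the "DMA" to finish
            if i + 1 < len(blocks):
                pending = dma.submit(fetch, blocks[i + 1])  # start the next transfer
            results.append(compute(buf))                # overlaps with that transfer
        return results

# Placeholder fetch/compute functions standing in for DMA and a CGRA kernel.
print(process_stream([1, 2, 3, 4], fetch=lambda b: b, compute=lambda b: b * b))
```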
For many important target applications of CGRAs, stream graphs can provide very natural and easy-to-use representations. Although efficient mapping of stream graphs to CGRA-accelerated platforms can be extremely useful and can potentially broaden the scope of CGRA applications, it also comes with multidimensional challenges. In this paper, we presented a set of architecture-compiler optimizations tailored for stream graphs that consider all three aspects of the computation.
Conclusion
Figure: Application runtimes, normalized to that of Case A.
Figure: Performance decrease due to limited local memory.
Figure: Framework of the Section 5.5 experiment.
Figure: Application runtimes, normalized to the S-S(A) case.