Evaluator-Executor Transformation for Efficient Conditional Statements on CGRA

Predicated execution is wasteful because it allocates resources to both branches of a conditional even though, in any given iteration, only one of them needs to be executed. Our experiments with MiBench and computer vision benchmarks on a CGRA show that our techniques can improve loop performance over predicated execution by up to 65%, or 38.0% on average, when the hardware extension is enabled. Our software-only version, which requires no hardware modification, can improve performance over predicated execution by up to 64%, or 33.2% on average.

Software pipelining-based mappings typically use predicated execution [23], which converts control flow into data flow by issuing instructions from both branches of a conditional (hence no control dependency) but executing them only if the predicate is true (hence a data dependency).
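
As a rough illustration (my own sketch, not code from the paper), the following C fragment shows what if-conversion does: both branches are issued in every iteration, and the predicate only decides which result is committed. Names such as kernel_predicated and thresh are hypothetical.

```c
/* Original loop: the then-block executes only when t > thresh. */
void kernel_branching(int n, const int *a, const int *c, int *b, int thresh)
{
    for (int i = 0; i < n; i++) {
        int t = a[i] * a[i];
        if (t > thresh)
            b[i] = t + c[i];
    }
}

/* After if-conversion (predicated execution): the then-block is
 * issued in every iteration and occupies resources even when the
 * predicate p is false; p only gates the commit.                  */
void kernel_predicated(int n, const int *a, const int *c, int *b, int thresh)
{
    for (int i = 0; i < n; i++) {
        int t = a[i] * a[i];
        int p = (t > thresh);      /* control dependence -> data dependence */
        int v = t + c[i];          /* computed unconditionally              */
        b[i] = p ? v : b[i];       /* committed only when p is true         */
    }
}
```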

Figure 1.1: Frequency and size of then-blocks in loops from embedded and vision benchmarks.

Basic Idea

Illustrative Example

However, the main weakness of predicated execution is the waste of resources allocated to the predicated block (the second block in our example), which really needs to be executed only when the condition is true. Since the condition value changes from iteration to iteration while the loop schedule must be generated statically, resources for the second block must be allocated regardless of the condition value, resulting in wasted resources. Our transformation instead splits the loop into two: essentially, the evaluator loop contains the code up to the condition evaluation, and the executor loop contains the code below the condition expression.

To preserve the functionality, some operations need to be added, such as saving and recalling the iterator values for the true iterations.
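
A minimal sketch of the split on the same hypothetical kernel (again, not the paper's code): the evaluator saves the iterator value and any needed temporaries for the true iterations, and the executor replays only those iterations. Here the saved values live in plain arrays; the hardware extension described later replaces this with a FIFO and a counter-based address generator.

```c
void kernel_split(int n, const int *a, const int *c, int *b, int thresh)
{
    int saved_i[n];          /* iterator values of the true iterations  */
    int saved_t[n];          /* temporaries passed on to the executor   */
    int true_cnt = 0;

    /* Evaluator loop: only the code up to the condition evaluation. */
    for (int i = 0; i < n; i++) {
        int t = a[i] * a[i];
        if (t > thresh) {
            saved_i[true_cnt] = i;
            saved_t[true_cnt] = t;
            true_cnt++;
        }
    }

    /* Executor loop: trip count equals the number of true iterations;
     * the original iterator value is recalled from memory.            */
    for (int k = 0; k < true_cnt; k++) {
        int i = saved_i[k];
        b[i] = saved_t[k] + c[i];
    }
}
```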

Figure 3.1: Splitting a loop with an if-then statement into evaluator and executor loops

Challenges

Overall Flow

Then the blocks contained in each set are converted into data flow graphs, which are modified by adding the operations necessary for our transformation. The next step is intra-iteration data dependence analysis, which identifies the variables that need to be passed from the evaluator to the executor or trail loop. Such (temporary) variables can be passed through local memory or simply recomputed; we decide by comparing their estimated costs.

The next step, DFG optimization, cleans the DFG of any dead code introduced during the previous steps (e.g., recomputing a temporary variable in the executor loop may render the original definition in the evaluator loop unnecessary). Finally, the DFGs are scheduled and mapped onto the target architecture, and the expected runtime is compared with that of mapping the original loop using only predicated execution; the better of the two is selected.
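
To make the pass-or-recompute choice concrete, here is a hypothetical sketch of the two executor variants for the temporary t from the example above; which one is cheaper depends on the estimated cost of the extra memory traffic versus the recomputation.

```c
/* Alternative 1: pass t through local memory.  The evaluator has
 * already stored saved_t[k] for every true iteration.             */
void executor_pass(int cnt, const int *saved_i, const int *saved_t,
                   const int *c, int *b)
{
    for (int k = 0; k < cnt; k++) {
        int i = saved_i[k];
        b[i] = saved_t[k] + c[i];      /* extra load, no recomputation */
    }
}

/* Alternative 2: recompute t inside the executor from a[i].  If t
 * then becomes unused in the evaluator, DFG optimization removes
 * its original definition there as dead code.                     */
void executor_recompute(int cnt, const int *saved_i, const int *a,
                        const int *c, int *b)
{
    for (int k = 0; k < cnt; k++) {
        int i = saved_i[k];
        int t = a[i] * a[i];
        b[i] = t + c[i];
    }
}
```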

CFG Partitioning

Performance Consideration

Depending on which blocks are included in the executor loop, the expected execution time after splitting the loop can be estimated under the assumption that the execution time of two blocks executed together is the sum of their separate execution times.1 In other words, the best candidate for the executor loop is a block that is large and infrequently executed. We generalize this intuition in our partitioning algorithm, which first identifies all candidate blocks or sets of blocks and then selects the one that maximizes the size metric relative to its execution frequency.

1 This assumption applies to a RISC processor and is also valid when the running time is approximated as the number of operations.
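
The equation itself did not survive extraction. Under the additivity assumption of the footnote, a plausible reconstruction of the model (my notation, not necessarily the paper's) is:

```latex
% A   : blocks kept in the evaluator (always executed)
% B   : candidate block(s) moved into the executor loop
% f_B : execution frequency of B (fraction of true iterations)
T_{\text{orig}}  \approx T_A + T_B,
\qquad
T_{\text{split}} \approx T_A + f_B \, T_B,
\qquad
T_{\text{orig}} - T_{\text{split}} \approx (1 - f_B)\, T_B .
```

The benefit term (1 - f_B) T_B is largest when B is big and rarely executed, which is exactly the intuition the partitioning algorithm generalizes.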

Figure 3.3: Various control flow graphs.

Trail Loop

General Solution

We then note that if B includes a block whose execution condition is x, adding to B all other blocks whose execution conditions contain x can only increase the executor's utility as determined by our performance model. This is because adding them does not increase the execution frequency of B, but only increases its size and execution time. Effectively, this means that in a nested conditional statement, the inner conditionals must be included in their entirety whenever the outer one is included.

The set of blocks included in the trail loop, called C, includes all blocks that (i) appear after the blocks of B in the original code, and (ii) whose execution conditions are not exclusive with B, i.e., y·x ≠ 0, where y is the execution condition of the block and x is the union of all execution conditions of the blocks in B. In a later step (i.e., intra-iteration data dependence analysis), if the operations in C are found to have no data dependence on B at all, C can be merged into A.
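
A hypothetical example of the resulting three-way split (not taken from the paper): C reads a value that B may have written, so its execution condition is not exclusive with B and it cannot simply be merged into the evaluator.

```c
/* Original loop: A (code up to the condition), B (then-block), and
 * C (code after B that reads a value B may have written).          */
void kernel_orig(int n, const int *a, int *b, int *sum, int thresh)
{
    for (int i = 0; i < n; i++) {
        int t = a[i] + i;                 /* A: evaluator part       */
        if (t > thresh)
            b[i] = 2 * t;                 /* B: executor candidate   */
        *sum += b[i];                     /* C: depends on B's output */
    }
}

/* After the split: evaluator (A), executor (B), trail loop (C).
 * The trail loop iterates over all n iterations, like the evaluator. */
void kernel_split3(int n, const int *a, int *b, int *sum, int thresh)
{
    int saved_i[n], saved_t[n], cnt = 0;
    for (int i = 0; i < n; i++) {          /* evaluator loop (A)     */
        int t = a[i] + i;
        if (t > thresh) { saved_i[cnt] = i; saved_t[cnt] = t; cnt++; }
    }
    for (int k = 0; k < cnt; k++)          /* executor loop (B)      */
        b[saved_i[k]] = 2 * saved_t[k];
    for (int i = 0; i < n; i++)            /* trail loop (C)         */
        *sum += b[i];
}
```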

Inter-iteration Data Dependence

This example illustrates that by moving the operation (in the gray box) outside of the conditional, we can break the dependence of the evaluator loop on the executor. If every inter-iteration dependence cycle can be contained within a single fissioned loop, we can proceed to the next step.
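
Since the paper's gray-box figure is not reproduced here, the following hypothetical fragment illustrates the idea: a variable updated inside the then-block feeds the next iteration's condition, so the update is hoisted out of the conditional (as a select), leaving the recurrence entirely inside the evaluator loop.

```c
/* Before: 'last' is updated inside the then-block, so the next
 * iteration's condition in the evaluator depends on the executor.  */
void before(int n, const int *a, int *b, int last)
{
    for (int i = 0; i < n; i++) {
        if (a[i] > last) {
            b[i] = a[i];
            last = a[i];          /* recurrence crosses the split    */
        }
    }
}

/* After: the update of 'last' is moved out of the conditional and
 * expressed as a select, so the recurrence stays in the evaluator.  */
void after(int n, const int *a, int *b, int last)
{
    for (int i = 0; i < n; i++) {
        int taken = (a[i] > last);
        last = taken ? a[i] : last;   /* now unconditional           */
        if (taken)
            b[i] = a[i];              /* executor no longer feeds back */
    }
}
```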

DFG Modification

While the evaluator and trail loops inherit the loop control code (i.e., iterator initialization and increment) from the original loop, the executor loop has trivial loop control code, and within its loop body the real values of the original iterator are retrieved from memory.

Intra-iteration Data Dependence

The CGRA's controller keeps track of the true counter, and its final value, the total number of true iterations, is used as the trip count of the executor loop. Note: f_ex and r_ex represent the frequency and the relative size (compared to the entire loop body) of the executor loop, respectively, and V is the number of iterator/transfer variables. Column 3 counts only the operations included in the DFG of the original loop, excluding the loop-control operations handled by the CGRA controller.

The 4th column shows how often the executor loop is executed on average per iteration (i.e., the execution frequency f_ex), and the 5th column is computed as the ratio of the number of operations in the executor loop to the number of operations in the entire loop body before applying the evaluator-executor transformation (r_ex). The improvement is not exactly proportional to these numbers because the mapping result of our scheduler is not strictly proportional to the number of operations in the DFG (Data Flow Graph). We compare the total energy consumption of the CGRA and its associated memories in two cases: using only predicated execution (P) versus applying our transformation.
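
In symbols, the two columns correspond to:

```latex
f_{\mathrm{ex}} = \frac{\#\{\text{iterations executing the executor block}\}}
                       {\#\{\text{loop iterations}\}},
\qquad
r_{\mathrm{ex}} = \frac{\#\{\text{operations in the executor loop}\}}
                       {\#\{\text{operations in the original loop body}\}} .
```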

Finally, the dynamic energy of the local memory, which constitutes 22.4% of the baseline energy, is increased by 18.2% (to 26.4% of the total baseline energy) due to the additional memory operations generated by our software transformation. Overall, our software technique can reduce the energy consumption of the entire CGRA, including the configuration and data memories, by 22.0% on average.

Fission-Specialized CGRA

Hardware for Save and Restore

  • Push and Pop
  • Optimizing for Performance
  • Queue

Thus, we simplify this by replacing the entire address calculation with a very simple address generator that can be implemented with a counter, and by embedding the store operation in other instructions as a sub-operation controlled by one bit of the configuration. This sub-operation is called push and can therefore be performed by any PE. From the last equation, it is clear that to maximize the difference we need to minimize II1, i.e., the evaluator's initiation interval. Another peculiarity of the push operation is that it needs to be executed only in true iterations, i.e., when the condition expression is true.1
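
A rough software model of the push mechanism as described here (names and sizes are hypothetical): a counter-based address generator supplies the store address, and a one-bit field attached to a regular PE instruction decides whether the push sub-operation fires.

```c
#include <stdint.h>

/* Hypothetical model of the push hardware: a counter generates
 * consecutive store addresses, so no address computation has to be
 * mapped onto the PEs.                                              */
typedef struct {
    uint32_t *buffer;    /* region of local memory receiving pushes  */
    uint32_t  counter;   /* counter-based address generator          */
} push_unit_t;

/* 'push_enable' models the configuration bit attached to the host
 * instruction; 'condition' models the execute (XE) signal, so the
 * push commits only in true iterations.                             */
static inline void pe_push(push_unit_t *u, uint32_t value,
                           int push_enable, int condition)
{
    if (push_enable && condition)
        u->buffer[u->counter++] = value;
}
```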

There is often a time interval between the definition of a variable and the availability of the condition value, the latter usually becoming known only in the last stage. The decision to write or discard the data is given by the global version of the XE signal (see Fig. 4.1). From a worst-case analysis, the queue size should be as large as the number of stages.

Now, this fixed queue size does not mean that loops with more stages than the queue size cannot be mapped; it only means that the mapping must be changed so that either the number of stages is reduced, possibly at the expense of an increased initiation interval, or some variables must wait in PEs during the redundant stages. Per quadrant, we add a pop operation to one of the PEs, a FIFO, and a counter for address generation. 1 Some variables, such as one used in a routing loop, may need to be pushed every iteration, regardless of the condition value (see Push Always in Fig. 4.1).
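
Continuing the same hypothetical model, the per-quadrant queue buffers a value from its definition until the global XE signal is known, and then either writes it to local memory at a counter-generated address or discards it; the depth is set to the number of stages (the constant below is an assumption, not the paper's value).

```c
#include <stdint.h>

#define QUEUE_DEPTH 4     /* assumed equal to the number of stages   */

typedef struct {
    uint32_t entries[QUEUE_DEPTH];
    int      count;
} fifo_t;

/* A value is enqueued at the stage where it is defined, before the
 * condition (and hence XE) is known.                                */
static inline void fifo_enqueue(fifo_t *q, uint32_t value)
{
    q->entries[q->count++] = value;   /* worst case fills the queue  */
}

/* In the last stage the global XE signal resolves: the buffered
 * values are either written to local memory at counter-generated
 * addresses, or discarded.                                          */
static inline void fifo_resolve(fifo_t *q, int xe,
                                uint32_t *local_mem, uint32_t *counter)
{
    if (xe)
        for (int k = 0; k < q->count; k++)
            local_mem[(*counter)++] = q->entries[k];
    q->count = 0;                     /* discard when xe is false    */
}
```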

Experimental Setup and Applications

  • Performance of Our Transformation with Hardware Extension
  • Performance of Software Transformation Only
  • Energy Comparison

While the number of operations is reduced by only 43% (from 21 to 12), the number of required PEs, including those used only for routing, is reduced by 66% (from 38 to 13). This is a direct consequence of the evaluator, in this particular case, taking almost as many cycles as the original loop: they happen to be scheduled with the same initiation interval, with the evaluator having fewer stages. Thus, although a reduced number of operations will most likely lead to a reduced runtime, this may not always be the case, due to the randomness of the scheduling process.

Between the S1 and S2 cases, although the overall difference is small (only 3.8 percentage points in terms of the geometric mean), some applications see large runtime reductions. The PE array power has two components: the power consumed by the individual PEs and that of the rest of the PE array. The leakage power of a PE is assumed to be 50% of its dynamic power when the PE is inactive, whereas the leakage power of the local memory is taken from CACTI.
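
A hedged restatement of the power model described here (the paper's exact formulation may differ):

```latex
% alpha_i : fraction of cycles in which PE i is active
% P_i     : dynamic power of PE i while active
% An inactive PE is assumed to leak 0.5 * P_i; local-memory
% leakage is taken from CACTI instead.
P_{\text{PE array}} = \sum_i \bigl[ \alpha_i P_i + (1-\alpha_i)\,0.5\,P_i \bigr]
                      + P_{\text{rest of array}} .
```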

The dynamic energy of the PE array is only slightly larger than that of the configuration memory and is generally reduced by our scheme, largely as a byproduct of the reduced runtime. Interestingly, larger branches are often executed less frequently, enabling an optimization unavailable to predicated execution, which blindly allocates resources regardless of execution frequency. Without hardware changes, our software-only version can improve performance by up to 64%, or 33.2% on average, while reducing the energy consumption of the entire CGRA by 22.0% on average.


Table V.1: Kernel characteristics

Splitting a loop with an if-then statement into evaluator and executor loops

The overall flow of our evaluator-executor transformation

Various control flow graphs

Determining the set of blocks for the trail loop

Recurrent loops can be fissioned if the recurrence can be isolated into one fissioned loop

Extending each quadrant (2x2 PE array) of the baseline CGRA with a FIFO, a counter-based address generator, and a pop operation in one of the PEs

Comparing runtimes of three cases: P (predicated execution), H (our evaluator-executor transformation with the hardware extension), and S (software-only transformation)

Dijkstra loop example, before and after the transformation
