In this work, we propose a Runtime Binary Translation (RBT) technique that can dynamically identify kernels, translate their IR into a Coarse-Grained Reconfigurable Array (CGRA) accelerator configuration, and offload the execution of those kernels onto the accelerator. To speed up the execution of the kernels in an application, developers today often use some kind of accelerator (e.g., hardware accelerators, reconfigurable arrays, GPGPUs), if one is available in the target machine.
Thesis Statement
Traditionally, developers have the source code of the application, information about the kernels in the application, and knowledge of the accelerator on the target machine. They must then manually split and rewrite the program's source code into kernel code and sequential code before statically compiling them into the executable binary.
Challenging Issues
Contributions of this Thesis
Document Organization
LLVM JIT Compiler
After the compilation of the main function is completed, the context is switched to the Host Execution context to execute the main function. If a stub of a function F is reached during the host execution of the main function, the function F must now actually be executed. The context is switched from Host Execution to Base JIT, and the BaseJIT.A subcomponent compiles the function F.
After F is compiled, the BaseJIT.B subcomponent replaces the F stub in the main function's binary with a call to the native F function that was just produced. The system then returns to the Host Execution context (event (2) in the figure) and jumps to the call to the native function F.
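To make this stub-and-patch flow concrete, the following is a minimal conceptual sketch in C++ that mimics the Base JIT behavior with a plain function-pointer slot instead of LLVM's real stub machinery; the names fStub, fSlot, and compileToNative are illustrative, not part of the actual implementation.

    #include <cstdio>

    using FnPtr = int (*)(int);

    int nativeF(int x) { return x * 2; }  // stands in for the JIT-compiled native F

    FnPtr compileToNative();              // hypothetical: the BaseJIT.A compilation step

    // The slot the caller jumps through; main is initially linked against the stub.
    FnPtr fSlot;

    int fStub(int x) {
      // First call: switch from Host Execution to Base JIT and compile F (BaseJIT.A).
      FnPtr compiled = compileToNative();
      // Patch the slot so later calls go straight to the native F (BaseJIT.B).
      fSlot = compiled;
      // Return to Host Execution and run the freshly compiled F.
      return compiled(x);
    }

    FnPtr compileToNative() { return &nativeF; }

    int main() {
      fSlot = &fStub;                     // main's call to F initially hits the stub
      std::printf("%d\n", fSlot(21));     // first call triggers compilation of F
      std::printf("%d\n", fSlot(21));     // subsequent calls execute native F directly
      return 0;
    }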
Related Work
In the Runtime Binary Translation (RBT) approach, the RBT Virtual Machine (RBTVM) identifies at runtime the kernels that are suitable for running on the accelerator, translates those kernels into accelerator configurations, and recompiles and updates the binary code of the functions containing those kernels. In this chapter we describe the design and implementation of the RBTVM, which is essentially a JIT compiler with some important modifications that enable the runtime features mentioned above.
High-Level Design of RBTVM
These subcomponents are used to compile the IR of a function F into native binary code for the main processor (i.e., the SP of SPIRA), and to replace the stub call at the call site in the caller's native binary code with the address of F's actual native binary code. The monitor callback is removed from the IR by L2JIT.D before L2JIT.E recompiles the IR of the function containing the kernel, and L2JIT.F updates the memory address of that function's native binary code, so that subsequent calls to the function use the new binary code and transfer execution of the kernel to the accelerator. At some point during execution (Ta), F1 has already been compiled to native binary code and is executing, while F2 and F3 are still in IR and have not yet been compiled.
Then, at Td, F3's native binary code is executed and the RBTMonitor callback is called, causing Monitor.A to run and increment the counter for loop L1. These subcomponents of L2JIT will translate L1's body into an accelerator configuration, recompile the F3 IR into native binary code, and update the address of the F3 binary so that the next time F3 is called, the execution of L1 is transferred to the accelerator.
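The following is a minimal sketch of such a profiling counter, assuming a per-(function, loop) counter map and an illustrative Kernel Threshold value; RBTMonitorCallback, kKernelThreshold, and invokeL2JIT are hypothetical names standing in for the actual Monitor.A and L2JIT entry points.

    #include <cstdint>
    #include <map>
    #include <utility>

    static constexpr uint32_t kKernelThreshold = 100;    // assumed threshold value

    using LoopKey = std::pair<uint32_t, uint32_t>;        // (function id, loop id)
    static std::map<LoopKey, uint32_t> loopCounters;

    // Placeholder for the L2JIT work: translate the loop body into an accelerator
    // configuration, recompile the enclosing function, and update its address.
    void invokeL2JIT(uint32_t funcId, uint32_t loopId);

    // Callback reached from the instrumented loop preheader (the Monitor.A role).
    extern "C" void RBTMonitorCallback(uint32_t funcId, uint32_t loopId) {
      uint32_t &count = loopCounters[{funcId, loopId}];
      if (++count == kKernelThreshold)
        invokeL2JIT(funcId, loopId);                      // hand the hot loop to L2JIT
    }

    void invokeL2JIT(uint32_t, uint32_t) { /* performed by L2JIT.D/E/F in the RBTVM */ }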
Implementation of the RBTVM
When the counter for L1 reaches the Kernel Threshold value, the monitor calls the L2JIT callback. The implementation of the subcomponent L2JIT.C actually inserts a function call into the preheader of the loop IR, which is later compiled into a callback to a special function in the RBTVM that performs the accelerator control before the kernel executes. Since it is not easy to update the native binary address of a function at all of its call sites, we instead reuse a mechanism that exists in the LLVM framework: a forwarding instruction (i.e., a branch instruction) to the new function binary overwrites the first instructions of the old function binary.
This means that all calls to the old function's binary, along with the arguments, will be forwarded to the new function's binary.
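As an illustration of the preheader-insertion step, the sketch below inserts such a call using present-day LLVM C++ APIs; the runtime entry point __rbtvm_accel_control and the kernel-id argument are assumed names, and the exact API may differ from the LLVM version used by the RBTVM.

    #include "llvm/Analysis/LoopInfo.h"
    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/Module.h"

    // Insert a call to a (hypothetical) RBTVM accelerator-control routine into the
    // preheader of loop L, so it runs once before every entry into the kernel loop.
    void insertAcceleratorControlCallback(llvm::Loop *L, llvm::Module &M,
                                          unsigned KernelId) {
      llvm::BasicBlock *Preheader = L->getLoopPreheader();
      if (!Preheader)
        return; // loop-simplify form assumed; otherwise a preheader must be created

      llvm::LLVMContext &Ctx = M.getContext();
      // void __rbtvm_accel_control(i32 kernel_id) -- assumed runtime entry point
      llvm::FunctionCallee Callee = M.getOrInsertFunction(
          "__rbtvm_accel_control",
          llvm::FunctionType::get(llvm::Type::getVoidTy(Ctx),
                                  {llvm::Type::getInt32Ty(Ctx)}, /*isVarArg=*/false));

      llvm::IRBuilder<> Builder(Preheader->getTerminator());
      Builder.CreateCall(Callee, {Builder.getInt32(KernelId)});
    }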
Optimizations for the RBTVM
The basic idea is that the RBTVM keeps track of the kernels discovered during the execution of a function and calls L2JIT to do the translation and recompilation only at the end of that function. Since the function must make a call to L2JIT, we support this by adding the new L1JIT.C component to the design in Fig. 3.3, so that a call instruction to L2JIT is inserted at the end of the IR of each function containing suitable loops, as sketched below. The L2JIT.D component has been modified so that, in addition to removing the monitor callbacks of detected kernels, it also removes the L2JIT callback at the end of the function IR once all running kernels have been detected and translated.
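A minimal sketch of this bookkeeping, assuming a simple per-function list of detected kernels; pendingKernels, RBTRecordKernel, and RBTFunctionEndCallback are illustrative names rather than the actual RBTVM symbols.

    #include <cstdint>
    #include <vector>

    struct KernelRef { uint32_t funcId; uint32_t loopId; };

    // Kernels detected while the current function executes; translated only at its end.
    static std::vector<KernelRef> pendingKernels;

    void translateAndRecompile(const KernelRef &k);   // the L2JIT work for one kernel

    // Called by the monitor when a loop crosses the Kernel Threshold: just record it.
    extern "C" void RBTRecordKernel(uint32_t funcId, uint32_t loopId) {
      pendingKernels.push_back({funcId, loopId});
    }

    // Callback inserted by the new L1JIT.C component at the end of the function IR.
    extern "C" void RBTFunctionEndCallback() {
      for (const KernelRef &k : pendingKernels)
        translateAndRecompile(k);                     // one L2JIT pass per detected kernel
      pendingKernels.clear();
    }

    void translateAndRecompile(const KernelRef &) { /* performed by L2JIT.D/E/F */ }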
However, we must be aware of some issues related to nested function calls, which can cause the tracked LoopList to contain kernels from different functions.

In this chapter, we describe the Install-Time Binary Translation (IBT) approach, which is a simpler version of the RBT approach.
High-Level Design of IBT
IBTVM
As shown in Fig. 4.1, at runtime the Bitcode Reader parses the bitcode file produced at install time by the Install-Time Interpreter into two parts: the IR and the accelerator configurations. The accelerator configurations of all suitable loops are stored in memory. This is because the IBTVM does not need to support runtime features such as loop execution profiling, translating loops into accelerator configurations, or recompiling functions.
For each function F, the IBTVM only needs to instrument the suitable loops of F, which were already translated at install time, with the corresponding accelerator control callbacks. This flow guarantees that just before a suitable loop executes, the accelerator control callback is activated and the execution of the loop is offloaded to the accelerator.
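A minimal sketch of this step, assuming the configurations parsed by the Bitcode Reader are kept in an in-memory map keyed by function name; configStore, instrumentLoop, and instrumentSuitableLoops are illustrative names, not the IBTVM's actual interfaces.

    #include <map>
    #include <string>
    #include <vector>

    struct LoopConfig {
      unsigned loopId;                       // loop identifier within the function
      std::vector<unsigned char> cgraConfig; // CGRA configuration produced at install time
    };

    // Filled once at startup from the accelerator-configuration part of the bitcode.
    static std::map<std::string, std::vector<LoopConfig>> configStore;

    // Inserts the accelerator control callback for one suitable loop (see the
    // preheader-insertion sketch in the RBT chapter); defined elsewhere in the IBTVM.
    void instrumentLoop(const std::string &funcName, const LoopConfig &cfg);

    // Called just before the IBTVM compiles function F, mirroring the Base JIT flow.
    void instrumentSuitableLoops(const std::string &funcName) {
      auto it = configStore.find(funcName);
      if (it == configStore.end())
        return;                              // F has no loops translated at install time
      for (const LoopConfig &cfg : it->second)
        instrumentLoop(funcName, cfg);       // one control callback per suitable loop
    }

    void instrumentLoop(const std::string &, const LoopConfig &) { /* placeholder */ }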
Implementation of IBT
Based on the wrapper structure of the LLVM bitcode, we propose an accelerator configuration structure as shown in Fig. The bitcode now contains a header indicating the offset to and size of the original IR. The implementation of the IBTVM is almost the same as that of the base JIT (see Section 2.1).
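A possible layout for such a wrapper is sketched below; the field names and fixed-width types are assumptions for illustration, not the exact on-disk format.

    #include <cstdint>

    struct IBTWrapperHeader {
      uint32_t magic;        // identifies an IBT-wrapped bitcode file
      uint32_t version;      // wrapper format version
      uint32_t irOffset;     // byte offset of the original LLVM IR bitcode
      uint32_t irSize;       // size in bytes of the original LLVM IR bitcode
      uint32_t configOffset; // byte offset of the accelerator-configuration section
      uint32_t configCount;  // number of per-loop accelerator configurations
    };

    // One entry per suitable loop translated at install time.
    struct AccelConfigEntry {
      uint32_t funcId;       // function containing the loop
      uint32_t loopId;       // loop identifier within that function
      uint32_t dataOffset;   // offset of this loop's CGRA configuration data
      uint32_t dataSize;     // size in bytes of the configuration data
    };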
The only thing to note is that the insertion of the per-loop accelerator control callbacks must be done at runtime.
Experimental Setup
In this chapter, we quantify different aspects of the RBT and IBT approaches in order to give an idea of their efficiency as an infrastructure that enables the use of a CGRA accelerator to accelerate kernels at runtime. We evaluate the two approaches against several criteria, such as runtime improvement and overhead, we evaluate the effectiveness of the optimizations we apply to the RBT approach, and finally we compare the two approaches. We set the SP and RA accelerator clock speeds to 400 MHz and 600 MHz, respectively.
First, the actual control and execution of the RA accelerator are not modeled, so we estimate the number of kernel execution cycles on the RA based on the loop mapping results obtained from the EMS algorithm [4]. Benchmarks are compiled into LLVM IR bitcode using Clang, the front end of the LLVM framework.
Experimental Results
Figure: per-benchmark results for the BaseJIT, RBT-l2imm, RBT-l1imm, and RBT-l1lazy configurations, and for BaseJIT versus IBT.
There are three cases in the evaluation: BaseJIT (the baseline, without any accelerator support), RBT (the best-case RBT, i.e., RBT-l1lazy), and IBT. On average, the runtime improvement over BaseJIT for the IBT case is 1.61x, which is 11.33% better than RBT.
Therefore, averaged over all performance measurements and nrun values, the implementation overhead in the RBT case is 61.91% higher than in the IBT case.
Conclusions
"Exploiting Loop-Level Parallelism on Coarse-Grained Reconfigurable Architectures Using Modulo Scheduling," in Proceedings of the Design, Automation and Test in Europe Conference (DATE '03), Volume 1, 2003.
"Iterative Modulo Scheduling: An Algorithm for Software Pipelining Loops," in Proceedings of the 27th Annual International Symposium on Microarchitecture (MICRO-27), 1994.
"A Virtual Machine for Merit-Based Runtime Reconfiguration," in 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), 2005.
"FloRA: A Coarse-Grained Reconfigurable Architecture with Floating-Point Capability," in International Conference on Field-Programmable Technology (FPT), 2009.
"MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems," in Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-30), 1997.