The average compiler consists of three basic components. The front end is responsible for deciphering the original program text and for ensuring that its syntax is correct and in accordance with the language’s specifications. The optimizer improves the program in one way or another, while preserving its original meaning. Finally, the back end is responsible for generating the platform-specific binary from the optimized code emitted by the optimizer. The following sections discuss each of these components in depth.
Front End
The compilation process begins at the compiler’s front end and includes several steps that analyze the high-level language source code. Compilation usually starts with a process called lexical analysis or scanning, in which the compiler goes over the source file and scans the text for individual tokens within it.
Tokens are the textual symbols that make up the code, so that in a line such as:
if (Remainder != 0)
The symbols if, (, Remainder, and != are all tokens. While scanning for tokens, the compiler confirms that the tokens produce legal “sentences” in accordance with the rules of the language (strictly speaking, this grammatical checking is the job of the syntax analyzer, or parser, which works hand in hand with the scanner). For example, it might check that the token if is followed by a (, which is a requirement in some languages. Along with each word, the analyzer stores the word’s meaning within the specific context. This can be thought of as a very simple version of how humans break sentences down in natural languages. A sentence is divided into several logical parts, and words can only take on actual meaning when placed into context. Similarly, this analysis involves confirming the legality of each token within the current context, and marking that context. If a token is found that isn’t expected within the current context, the compiler reports an error.
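To make this more concrete, the following is a minimal scanner sketch in C (a toy illustration, not how any production compiler is written) that breaks the line above into individual tokens:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
    const char *p = "if (Remainder != 0)";
    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {
            p++;                                /* skip whitespace */
        } else if (isalpha((unsigned char)*p)) {
            const char *start = p;              /* keyword or identifier */
            while (isalnum((unsigned char)*p))
                p++;
            printf("WORD:     %.*s\n", (int)(p - start), start);
        } else if (p[0] == '!' && p[1] == '=') {
            printf("OPERATOR: !=\n");           /* two-character operator */
            p += 2;
        } else {
            printf("SYMBOL:   %c\n", *p);       /* punctuation, digits, etc. */
            p++;
        }
    }
    return 0;
}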
A compiler’s front end is probably the one component that is least relevant to reversers, because it is primarily a conversion step that rarely modifies the program’s meaning in any way—it merely verifies that it is valid and converts it to the compiler’s intermediate representation.
Intermediate Representations
When you think about it, compilers are all about representations. A compiler’s main role is to transform code from one representation to another. In the process, a compiler must generate its own representation for the code. This intermediate representation (or internal representation, as it’s sometimes called) is useful for detecting any code errors, improving upon the code, and ultimately for generating the resulting machine code.
Properly choosing the intermediate representation of code in a compiler is one of the compiler designer’s most important design decisions. The layout heavily depends on what kind of source (high-level language) the compiler takes as input, and what kind of object code the compiler spews out. Some intermediate representations can be very close to a high-level language and retain much of the program’s original structure. Such information can be useful if advanced improvements and optimizations are to be performed on the code. Other compilers use intermediate representations that are closer to a low-level assembly language code. Such representations frequently strip much of the high-level structure embedded in the original code, and are suitable for compiler designs that are more focused on the low-level details of the code. Finally, it is not uncommon for compilers to have two or more intermediate representations, one for each stage in the compilation process.
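For example, a compiler whose intermediate representation is a low-level three-address code might lower a C statement such as

x = (a + b) * (a - b);

into something along these lines (the exact notation varies from compiler to compiler):

t1 = a + b
t2 = a - b
t3 = t1 * t2
x  = t3

Each intermediate statement performs a single operation, which makes the code straightforward to analyze and to translate into machine instructions.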
Optimizer
The optimizations a compiler performs are one of the primary reasons reversers should understand compilers (the other being the code-level optimizations performed in the back end). Compiler optimizers employ a wide variety of techniques for improving the efficiency of the code.
An optimizer usually has one of two primary goals: generating the highest-performance code possible or generating the smallest possible program binaries. Most compilers can attempt to combine the two goals as much as possible.
Optimizations that take place in the optimizer are not processor-specific and are generic improvements made to the original program’s code without any relation to the specific platform to which the program is targeted. Regardless of the specific optimizations that take place, optimizers must always preserve the exact meaning of the original program and not change its behavior in any way.
The following sections briefly discuss different areas where optimizers can improve a program. It is important to keep in mind that some of the optimizations that strongly affect a program’s readability might come from the processor-specific work that takes place in the back end, and not only from the optimizer.
Code Structure
Optimizers frequently modify the structure of the code in order to make it more efficient while preserving its meaning. For example, loops can often be partially or fully unrolled. Unrolling a loop means that instead of repeating the same chunk of code using a jump instruction, the code is simply duplicated so that the processor executes it more than once. This makes the resulting binary larger, but has the advantage of completely avoiding the need to manage a counter and invoke conditional branches (which are fairly inefficient; see the section on CPU pipelines later in this chapter). It is also possible to partially unroll a loop so that the number of iterations is reduced by performing more than one iteration in each cycle of the loop.
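As a sketch of what partial unrolling looks like at the source level (using made-up names, and assuming the iteration count is evenly divisible by the unrolling factor), a compiler might transform:

int Array[100];
int i;

for (i = 0; i < 100; i++)
    Array[i] = 0;

into:

for (i = 0; i < 100; i += 4) {      /* one branch per four elements */
    Array[i]     = 0;
    Array[i + 1] = 0;
    Array[i + 2] = 0;
    Array[i + 3] = 0;
}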
When going over switch blocks, compilers can determine the most efficient approach for locating the correct case at runtime. This can be either a direct table, where the individual case blocks are reached by using the switch operand as an index, or one of several tree-based search approaches.
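The direct-table approach can be pictured with the following hypothetical C fragment, in which function pointers stand in for the addresses of the individual case blocks:

#include <stdio.h>

static void CaseZero(void) { printf("case 0\n"); }
static void CaseOne(void)  { printf("case 1\n"); }
static void CaseTwo(void)  { printf("case 2\n"); }

/* The switch operand indexes directly into the table; no
   sequential comparisons are needed. */
static void (* const JumpTable[3])(void) = { CaseZero, CaseOne, CaseTwo };

void DispatchCase(unsigned int Value)
{
    if (Value < 3)          /* bounds check, as a compiler would emit */
        JumpTable[Value]();
}

Tree-based approaches, in contrast, narrow down the correct case through a series of comparisons, which is typically preferable when the case values are sparse.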
Another good example of a code-structuring optimization is the way loops are rearranged to make them more efficient. The most common high-level loop construct is the pretested loop, where the loop’s condition is tested before the loop’s body is executed. The problem with this construct is that it requires an extra unconditional jump at the end of the loop’s body in order to jump back to the beginning of the loop (for comparison, posttested loops only have a single conditional branch instruction at the end of the loop, which makes them more efficient). Because of this, it is common for optimizers to convert pretested loops to posttested loops. In some cases, this requires the insertion of an if statement before the beginning of the loop, so as to make sure the loop is not entered when its condition isn’t satisfied.
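In C terms the conversion looks roughly like this (ProcessItem and Count are hypothetical names, assumed to be defined elsewhere):

/* Pretested loop, as written by the programmer: */
while (Count != 0) {
    ProcessItem();
    Count--;
}

/* Posttested version produced by the optimizer; the guarding if
   statement preserves the case where the body never runs at all: */
if (Count != 0) {
    do {
        ProcessItem();
        Count--;
    } while (Count != 0);
}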
Code structure optimizations are discussed in more detail in Appendix A.
Redundancy Elimination
Redundancy elimination is a significant element in the field of code optimization that is of little interest to reversers. Programmers frequently produce code that includes redundancies such as repeating the same calculation more than once, assigning values to variables without ever using them, and so on. Optimizers have algorithms that search for such redundancies and eliminate them.
For example, programmers routinely leave loop-invariant expressions inside loops: computations that are unaffected by the loop’s progress and therefore need not be repeated on every iteration. A good optimizer identifies such statements and relocates them to an area outside of the loop in order to improve the code’s efficiency.
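For instance (with hypothetical names), an optimizer might rewrite:

for (i = 0; i < 1000; i++)
    Array[i] = x * y + i;       /* x * y never changes inside the loop */

as:

t = x * y;                      /* computed once, outside the loop */
for (i = 0; i < 1000; i++)
    Array[i] = t + i;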
Optimizers can also streamline pointer arithmetic by efficiently calculating the address of an item within an array or data structure and making sure that the result is cached so that the calculation isn’t repeated if that item needs to be accessed again later on in the code.
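For example (using a hypothetical structure), instead of recomputing the address of Records[i] for every access in

Records[i].Total = Records[i].Total + Records[i].Count;

the optimizer effectively computes the address once and reuses it:

struct Record *p = &Records[i];
p->Total = p->Total + p->Count;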
Back End
A compiler’s back end, also sometimes called the code generator, is responsible for generating target-specific code from the intermediate code generated and processed in the earlier phases of the compilation process. This is where the intermediate representation “meets” the target-specific language, which is usually some kind of a low-level assembly language.
Because the code generator is responsible for the actual selection of specific assembly language instructions, it is usually the only component that has enough information to apply any significant platform-specific optimizations.
This is important because many of the transformations that make compiler-generated assembly language code difficult to read take place at this stage.
The following are three of the most important stages (at least from our perspective) that take place during the code generation process:
■■ Instruction selection: This is where the code from the intermediate representation is translated into platform-specific instructions. The selection of each individual instruction is very important to overall program performance and requires that the compiler be aware of the various properties of each instruction.
■■ Register allocation: In many intermediate representations there is an unlimited number of registers available, so that every local variable can be placed in a register. The fact that the target processor has a limited number of registers comes into play during code generation, when the compiler must decide which variable gets placed in which register, and which variable must be placed on the stack.
■■ Instruction scheduling: Because most modern processors can handle multiple instructions at once, data dependencies between individual instructions become an issue. This means that if an instruction performs an operation and stores the result in a register, immediately reading from that register in the following instruction would cause a delay, because the result of the first operation might not be available yet. For this reason the code generator employs platform-specific instruction scheduling algorithms that reorder instructions to try to achieve the highest possible level of parallelism. The end result is interleaved code, where two instruction sequences dealing with two separate things are interleaved to create one sequence of instructions (a simplified example follows this list). We will be seeing such sequences in many of the reversing sessions in this book.
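As a simplified illustration (in x86 assembly, with made-up variable names), consider two independent computations. In the first ordering, each ADD must wait for the load immediately preceding it; in the scheduled ordering, the two sequences are interleaved so that each load has time to complete before its result is needed:

; Naive ordering: each ADD depends on the MOV right before it.
mov   eax, [Value1]
add   eax, 4
mov   ebx, [Value2]
add   ebx, 9

; Scheduled ordering: the independent sequences are interleaved.
mov   eax, [Value1]
mov   ebx, [Value2]
add   eax, 4
add   ebx, 9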
Listing Files
A listing file is a compiler-generated text file that contains the assembly language code produced by the compiler. It is true that this information can be obtained by disassembling the binaries produced by the compiler, but a listing file also conveniently shows how each assembly language line maps to the original source code. Listing files are not strictly a reversing tool but more of a research tool used when trying to study the behavior of a specific compiler by feeding it different code and observing the output through the listing file.
Most compilers support the generation of listing files during the compilation process. For some compilers, such as GCC, this is a standard part of the compilation process because the compiler doesn’t directly generate an object file, but instead generates an assembly language file which is then processed by an assembler. In such compilers, requesting a listing file simply means that the compiler must not delete it after the assembler is done with it. In other compilers (such as the Microsoft or Intel compilers), a listing file is an optional feature that must be enabled through the command line.
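For example, with GCC one can pass the -S switch (as in gcc -S program.c) to stop compilation after the assembly-generation step and keep the resulting program.s file, while with the Microsoft compiler the /FAs switch produces a listing file that interleaves the original source code with the generated assembly (option names may of course vary between compiler versions).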