Having this kind of information completely changes the reversing experience because it brings us much closer to the original high-level representation of the program. In fact, this information allows for the creation of highly effective decompilers that can reconstruct remarkably readable high-level language representations from bytecode executables. This is true for both Java and .NET programs, and it presents a problem to software vendors working on those platforms, who have a hard time protecting their executables from being easily reverse engineered. The solution in most cases is to use obfuscators: programs that try to eliminate as much sensitive information from the executable as possible (while keeping it functional).
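The renaming half of this idea can be sketched in a few lines. The function below is purely illustrative (the symbol names and the trivial a, b, c renaming scheme are invented for this example); real obfuscators rewrite the symbol metadata inside the Java or .NET binary itself and apply many additional transformations.

```python
# Illustrative sketch of a name-stripping obfuscator's core idea:
# replace meaningful identifiers with meaningless ones so that a
# decompiler's output loses its self-documenting quality.
import itertools

def obfuscate_names(symbols):
    """Map each meaningful identifier to a meaningless one (a, b, c, ...)."""
    generated = (chr(c) for c in itertools.count(ord("a")))
    return {name: next(generated) for name in symbols}

table = obfuscate_names(["CalculateInterest", "customerBalance", "rate"])
print(table)  # {'CalculateInterest': 'a', 'customerBalance': 'b', 'rate': 'c'}
```

A decompiler still produces syntactically valid source from such a binary, but variables and methods named a, b, and c convey none of the original design intent.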
Depending on the specific platform and on how aggressively an executable is obfuscated, reversers have two options: they can either use a decompiler to reconstruct a high-level representation of the target program or they can learn the native low-level language in which the program is presented and simply read that code and attempt to determine the program’s design and purpose.
Luckily, these bytecode languages are typically fairly easy to deal with because they are not as low-level as the average native processor assembly language.
Chapter 12 provides an introduction to Microsoft’s .NET platform and to its native language, the Microsoft Intermediate Language (MSIL), and demonstrates how to reverse programs written for the .NET platform.
prevent parallelism. The general strategy employed by modern IA-32 processors for achieving parallelism is to simply execute two or more instructions at the same time. The problems start when one instruction depends on information produced by the other. In such cases the instructions must be executed in their original order to preserve the code's functionality.
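The ordering constraint can be sketched as a tiny scheduling model. Assuming unlimited execution units and a uniform one-cycle latency (both deliberate simplifications), an instruction can issue only once every register it reads has been produced:

```python
# Sketch: why data dependencies force serial execution. Each "instruction"
# is (destination register, list of source registers); an instruction may
# issue only after every source it reads has been produced.

def earliest_cycle(program):
    """Return the earliest cycle in which each instruction can issue,
    assuming unlimited execution units and 1-cycle latency."""
    ready = {}        # register -> cycle in which its value becomes available
    schedule = []
    for dest, srcs in program:
        issue = max((ready.get(r, 0) for r in srcs), default=0)
        ready[dest] = issue + 1
        schedule.append(issue)
    return schedule

# Independent instructions: both can start in cycle 0, in parallel.
print(earliest_cycle([("a", ["b", "c"]), ("d", ["e", "f"])]))   # [0, 0]
# Dependent instructions: the second must wait for "a" to be produced.
print(earliest_cycle([("a", ["b", "c"]), ("d", ["a", "e"])]))   # [0, 1]
```

The second program computes the same number of results but needs twice as many cycles, which is exactly why compilers try to interleave independent operations.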
Because of these restrictions, modern compilers employ a multitude of techniques for generating code that runs as efficiently as possible on modern processors. This naturally has a strong impact on the readability of disassembled code while reversing. Understanding the rationale behind such optimization techniques can help you decipher optimized code.
The following sections discuss the general architecture of modern IA-32 processors and how they achieve parallelism and high instruction throughput.
This subject is optional and is discussed here because it is always best to know why things are as they are. In this case, having a general understanding of why optimized IA-32 code is arranged the way it is can be helpful when trying to decipher its meaning.
64 Chapter 2
IA-32 COMPATIBLE PROCESSORS
Over the years, many companies have attempted to penetrate the lucrative IA-32 processor market (which has been completely dominated by Intel Corporation) by creating IA-32 compatible processors. The strategy has usually been to offer better-priced processors that are 100 percent compatible with Intel’s IA-32 processors and offer equivalent or improved performance. AMD (Advanced Micro Devices) has been the most successful company in this market, with an average market share of over 15 percent in the IA-32 processor market.
While getting to know IA-32 assembly language, there isn't usually a need to worry about other brands because of their excellent compatibility with the Intel implementations. Even code that's specifically optimized for Intel's NetBurst architecture usually runs extremely well on other implementations such as the AMD processors, so compilers rarely have to worry about specific optimizations for non-Intel processors.
One substantial AMD-specific feature is the 3DNow! instruction set. 3DNow! defines a set of SIMD (single instruction, multiple data) instructions that can perform multiple floating-point operations per clock cycle. 3DNow! stands in direct competition with Intel's SSE, SSE2, and SSE3 (Streaming SIMD Extensions).
In addition to supporting their own 3DNow! instruction set, AMD processors also support Intel’s SSE extensions in order to maintain compatibility. Needless to say, Intel processors don’t support 3DNow!.
Intel NetBurst
The Intel NetBurst microarchitecture is the current execution environment for many of Intel's modern IA-32 processors. Understanding the basic architecture of NetBurst is important because it explains the rationale behind the optimization guidelines used by almost every IA-32 code generator out there.
µops (Micro-Ops)
IA-32 processors use microcode for implementing each instruction supported by the processor. Microcode is essentially another layer of programming that lies within the processor. This means that the processor itself contains a much more primitive core, only capable of performing fairly simple operations (though at extremely high speeds). In order to implement the relatively complex IA-32 instructions, the processor has a microcode ROM, which contains the microcode sequences for every instruction in the instruction set.
The process of constantly fetching instruction microcode from ROM can create significant performance bottlenecks, so IA-32 processors employ an execution trace cache that is responsible for caching the microcodes of frequently executed instructions.
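The decomposition idea can be sketched as follows. The instruction names, the µop strings, and the dictionary standing in for the trace cache are all invented for illustration; real microcode and the real trace cache are far more elaborate than this:

```python
# Sketch of the microcode idea: a complex instruction such as an
# add-to-memory is decomposed into primitive micro-operations (µops),
# and a small cache stands in for the execution trace cache so that
# frequently decoded instructions skip the slow ROM lookup.

MICROCODE_ROM = {
    "ADD_MEM_REG": ["LOAD tmp, [mem]", "ADD tmp, reg", "STORE [mem], tmp"],
    "INC_REG":     ["ADD reg, 1"],
}

trace_cache = {}

def decode(instruction):
    if instruction in trace_cache:        # hit: no ROM access needed
        return trace_cache[instruction]
    uops = MICROCODE_ROM[instruction]     # slow path: fetch from the ROM
    trace_cache[instruction] = uops
    return uops

print(decode("ADD_MEM_REG"))   # three µops: load, add, store
print(decode("ADD_MEM_REG"))   # second request is served from the cache
```

Notice how the single IA-32-style instruction expands into a load, an arithmetic step, and a store, which is the primitive vocabulary the core actually executes.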
Pipelines
Basically, a CPU pipeline is like a factory assembly line for decoding and executing program instructions. An instruction enters the pipeline and is broken down into several low-level tasks that must be taken care of by the processor.
In NetBurst processors, the pipeline uses three primary stages:
1. Front end: Responsible for decoding each instruction and producing sequences of µops that represent each instruction. These µops are then fed into the Out of Order Core.
2. Out of Order Core: This component receives sequences of µops from the front end and reorders them based on the availability of the various resources of the processor. The idea is to use the available resources as aggressively as possible to achieve parallelism. The ability to do this depends heavily on the original code fed to the front end. Given the right conditions, the core will actually emit multiple µops per clock cycle.
3. Retirement section: The retirement section is primarily responsible for ensuring that the original order of instructions in the program is preserved when applying the results of the out-of-order execution.
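A rough sketch of the issue-versus-retire distinction follows. This is not the actual hardware algorithm; the µop format, names, and latencies are invented for illustration. A µop issues as soon as its inputs are available, possibly out of program order, but results always retire in the original order:

```python
# Sketch: out-of-order issue with in-order retirement. Each µop is
# (name, destination, sources, latency); sources must be produced by
# earlier µops in the list.

def simulate(uops):
    avail = {}                  # register -> cycle its value is available
    issue_order = []
    pending = list(uops)
    cycle = 0
    while pending:
        for u in list(pending):
            name, dest, srcs, latency = u
            if all(avail.get(s, 10**9) <= cycle for s in srcs):
                issue_order.append(name)        # issue as soon as ready
                avail[dest] = cycle + latency
                pending.remove(u)
        cycle += 1
    retire_order = [u[0] for u in uops]         # always program order
    return issue_order, retire_order

# u1 needs "a", produced by the slow u0; u2 is independent of both.
prog = [("u0", "a", (), 3), ("u1", "b", ("a",), 1), ("u2", "c", (), 1)]
print(simulate(prog))   # (['u0', 'u2', 'u1'], ['u0', 'u1', 'u2'])
```

The independent µop overtakes the stalled one at issue time, yet the retirement order matches the program, which is what keeps out-of-order execution invisible to the running code.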
Low-Level Software 65
In terms of the actual execution of operations, the architecture provides four execution ports (each with its own pipeline) that are responsible for the actual execution of instructions. Each unit has different capabilities, as shown in Figure 2.4.
Figure 2.4 Issue ports and individual execution units in Intel NetBurst processors. The figure shows each issue port and the execution units behind it:

Port 0: Double-speed ALU (ADD/SUB, logic operations, branches, store data operations); Floating-point move unit (floating-point moves, floating-point stores, floating-point exchange, FXCH).

Port 1: Double-speed ALU (ADD/SUB); Floating-point execute unit (floating-point addition, floating-point multiplication, floating-point division, other floating-point operations, MMX operations); Integer unit (shift and rotate operations).

Port 2: Memory loads (all memory reads).

Port 3: Memory writes (address store operations; this component writes the address to be written onto the bus, and does not send the actual data).
Notice how port 0 and port 1 both have double-speed ALUs (arithmetic logical units). This is a significant aspect of IA-32 optimizations because it means that each ALU can actually perform two operations in a single clock cycle. For example, it is possible to perform up to four additions or subtractions during a single clock cycle (two in each double-speed ALU). On the other hand, non-SIMD floating-point operations are pretty much guaranteed to take at least one cycle because there is only one unit that actually performs floating-point operations (and another unit that moves data between memory and the FPU stack).
Figure 2.4 can help shed light on instruction ordering and algorithms used by NetBurst-aware compilers, because it provides a rationale for certain otherwise-obscure phenomena that we'll be seeing later on in compiler-generated code sequences.
Most modern IA-32 compiler back ends can be thought of as NetBurst-aware, in the sense that they take the NetBurst architecture into consideration during the code generation process. This is going to be evident in many of the code samples presented throughout this book.
Branch Prediction
One significant problem with the pipelined approach described earlier has to do with the execution of branches. The problem is that processors that have a deep pipeline must always know which instruction is going to be executed next. Normally, the processor simply fetches the next instruction from memory whenever there is room for it, but what happens when there is a conditional branch in the code?
Conditional branches are a problem because often their outcome is not known at the time the next instruction must be fetched. One option would be to simply stall the pipeline and wait until it becomes known whether the branch is taken. This would have a detrimental impact on performance because the processor only performs at full capacity when the pipeline is full. Refilling the pipeline takes a significant number of clock cycles, depending on the length of the pipeline and on other factors.
The solution to these problems is to try to predict the result of each conditional branch. Based on this prediction, the processor fills the pipeline with instructions that are located either right after the branch instruction (when the branch is not expected to be taken) or at the branch's target address (when the branch is expected to be taken). A missed prediction is usually expensive and requires that the entire pipeline be emptied.
The general prediction strategy is that backward branches that jump to an earlier instruction are always expected to be taken, because those are typically used in loops, where there is a jump for every iteration, and the only time such a branch is not taken is in the very last iteration. Forward branches (typically used in if statements) are assumed not to be taken.
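This static rule can be sketched directly; the instruction addresses below are arbitrary examples chosen for illustration:

```python
# Sketch of static branch prediction: backward branches (loops) are
# predicted taken, forward branches (if statements) predicted not taken.

def static_predict(branch_address, target_address):
    """Return True if the branch is predicted taken."""
    return target_address < branch_address   # backward jump -> taken

# A loop's closing branch jumps backward, so it is predicted taken:
print(static_predict(0x401020, 0x401000))   # True
# An if statement's branch jumps forward, so it is predicted not taken:
print(static_predict(0x401020, 0x401040))   # False
```

This is also why compilers arrange the common case to fall through forward branches: the fall-through path matches the processor's static guess.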
In order to improve the processor's prediction abilities, IA-32 processors employ a branch target buffer (BTB), which records the results of the most recently processed branch instructions. This way, when a branch is encountered, it is looked up in the BTB. If an entry is found, the processor uses that information for predicting the branch.
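A minimal sketch of this kind of dynamic prediction is the classic two-bit saturating counter, one per branch address. This shows only the core idea; a real BTB also stores target addresses and is organized as a hardware cache, and the addresses here are invented examples:

```python
# Sketch: a 2-bit saturating counter per branch address. Counter values
# 2-3 predict taken, 0-1 predict not taken; each actual outcome nudges
# the counter one step, saturating at 0 and 3.

counters = {}

def predict(addr):
    return counters.get(addr, 1) >= 2        # default: weakly not taken

def update(addr, taken):
    c = counters.get(addr, 1)
    counters[addr] = min(c + 1, 3) if taken else max(c - 1, 0)

# A loop branch that is taken 9 times and then falls through once:
hits = 0
for outcome in [True] * 9 + [False]:
    hits += predict(0x401020) == outcome
    update(0x401020, outcome)
print(hits)   # 8 (of 10 branches predicted correctly)
```

The counter mispredicts only twice (once while warming up, once at loop exit), which is why loop-heavy code predicts so well in practice.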
Conclusion
In this chapter, we have introduced the concept of low-level software and gone over some basic materials required for successfully reverse engineering programs. We have covered basic high-level software concepts and how they translate into the low-level world, and introduced assembly language, which is the native language of the reversing world. Additionally, we have covered some more hardcore low-level topics that often affect the reverse-engineering process, such as compilers and execution environments. The next chapter provides an introduction to some additional background materials and focuses on operating system fundamentals.
Operating systems play a key role in reversing. That's because programs are tightly integrated with operating systems, and plenty of information can be gathered by probing this interface. Moreover, the eventual bottom line of every program is in its communication with the outside world (the program receives user input and outputs data on the screen, writes to a file, and so on), which means that identifying and understanding the bridging points between application programs and the operating system is critical.
This chapter introduces the architecture of the latest generations of the Microsoft Windows operating system, which is the operating system used throughout this book. Some of this material is quite basic. If you feel perfectly comfortable with operating systems in general and with the Windows architecture in particular, feel free to skip this chapter.
It is important to realize that this discussion is really a brief overview of information that could fill several thick books. I’ve tried to make it as complete as possible and yet as focused on reversing as possible. If you feel as if you need additional information on certain subjects discussed in this chapter I’ve listed a couple of additional sources at the end of this chapter.