High-level languages were made to allow programmers to create software without having to worry about the specific hardware platform on which their program would run and without having to worry about all kinds of annoying low-level details that just aren’t relevant for most programmers. Assembly lan- guage has its advantages, but it is virtually impossible to create large and com- plex software on assembly language alone. High-level languages were made to isolate programmers from the machine and its tiny details as much as possible.
The problem with high-level languages is that there are different demands from different people and different fields in the industry. The primary tradeoff is between simplicity and flexibility. Simplicity means that you can write a rel- atively short program that does exactly what you need it to, without having to deal with a variety of unrelated machine-level details. Flexibility means that there isn’t anything that you can’t do with the language. High-level languages are usually aimed at finding the right balance that suits most of their users. On one hand, there are certain things that happen at the machine-level that pro- grammers just don’t need to know about. On the other, hiding certain aspects of the system means that you lose the ability to do certain things.
When you reverse a program, you usually have no choice but to get your hands dirty and become aware of many details that happen at the machine level. In most cases, you will be exposed to such obscure aspects of the inner workings of a program that even the programmers that wrote them were unaware of. The challenge is to sift through this information with enough understanding of the high-level language used and to try to reach a close
Low-Level Software 33
06_574817 ch02.qxd 3/16/05 8:35 PM Page 33
approximation of what was in the original source code. How this is done depends heavily on the specific programming language used for developing the program.
From a reversing standpoint, the most important thing about a high-level programming language is how strongly it hides or abstracts the underlying machine. Some languages such as C provide a fairly low-level perspective on the machine and produce code that directly runs on the target processor. Other languages such as Java provide a substantial level of separation between the programmer and the underlying processor.
The following sections briefly discuss today’s most popular programming languages:
C
The C programming language is a relatively low-level language as high-level languages go. C provides direct support for memory pointers and lets you manipulate them as you please. Arrays can be defined in C, but there is no bounds checking whatsoever, so you can access any address in memory that you please. On the other hand, C provides support for the common high-level features found in other, higher-level languages. This includes support for arrays and data structures, the ability to easily implement control flow code such as conditional code and loops, and others.
C is a compiled language, meaning that to run the program you must run the source code through a compiler that generates platform-specific program binaries. These binaries contain machine code in the target processor’s own native language. C also provides limited cross-platform support. To run a pro- gram on more than one platform you must recompile it with a compiler that supports the specific target platform.
Many factors have contributed to C’s success, but perhaps most important is the fact that the language was specifically developed for the purpose of writ- ing the Unix operating system. Modern versions of Unix such as the Linux operating system are still written in C. Also, significant portions of the Microsoft Windows operating system were also written in C (with the rest of the components written in C++).
Another feature of C that greatly affected its commercial success has been its high performance. Because C brings you so close to the machine, the code written by programmers is almost directly translated into machine code by compilers, with very little added overhead. This means that programs written in C tend to have very high runtime performance.
C code is relatively easy to reverse because it is fairly similar to the machine code. When reversing one tries to read the machine code and reconstruct the 34 Chapter 2
original source code as closely as possible (though sometimes simply under- standing the machine code might be enough). Because the C compiler alters so little about the program, relatively speaking, it is fairly easy to reconstruct a good approximation of the C source code from a program’s binaries. Except where noted, the high-level language code samples in this book were all writ- ten in C.
C++
The C++ programming language is an extension of C, and shares C’s basic syn- tax. C++ takes C to the next level in terms of flexibility and sophistication by introducing support for object-oriented programming. The important thing is that C++ doesn’t impose any new limits on programmers. With a few minor exceptions, any program that can be compiled under a C compiler will com- pile under a C++ compiler.
The core feature introduced in C++ is the class. A class is essentially a data structure that can have code members, just like the object constructs described earlier in the section on code constructs. These code members usually manage the data stored within the class. This allows for a greater degree of encapsula- tion, whereby data structures are unified with the code that manages them. C++
also supports inheritance, which is the ability to define a hierarchy of classes that enhance each other’s functionality. Inheritance allows for the creation of base classes that unify a group of functionally related classes. It is then possible to define multiple derived classes that extend the base class’s functionality.
The real beauty of C++ (and other object-oriented languages) is polymor- phism (briefly discussed earlier, in the “Common Code Constructs” section).
Polymorphism allows for derived classes to override members declared in the base class. This means that the program can use an object without knowing its exact data type—it must only be familiar with the base class. This way, when a member function is invoked, the specific derived object’s implementation is called, even though the caller is only aware of the base class.
Reversing code written in C++ is very similar to working with C code, except that emphasis must be placed on deciphering the program’s class hier- archy and on properly identifying class method calls, constructor calls, etc.
Specific techniques for identifying C++ constructs in assembly language code are presented in Appendix C.
In case you’re not familiar with the syntax of C, C++ draws its name from the C syntax, where specifying a variable name followed by ++ incdicates that the variable is to be incremented by 1. C++is the equivalent of C = C + 1.
Low-Level Software 35
06_574817 ch02.qxd 3/16/05 8:35 PM Page 35
Java
Java is an object-oriented, high-level language that is different from other lan- guages such as C and C++ because it is not compiled into any native proces- sor’s assembly language, but into the Java bytecode. Briefly, the Java instruction set and bytecode are like a Java assembly language of sorts, with the difference that this language is not usually interpreted directly by the hardware, but is instead interpreted by software (the Java Virtual Machine).
Java’s primary strength is the ability to allow a program’s binary to run on any platform for which the Java Virtual Machine (JVM) is available.
Because Java programs run on a virtual machine (VM), the process of reversing a Java program is completely different from reversing programs written in compiler-based languages such as C and C++. Java executables don’t use the operating system’s standard executable format (because they are not executed directly on the system’s CPU). Instead they use .class files, which are loaded directly by the virtual machine.
The Java bytecode is far more detailed compared to a native processor machine code such as IA-32, which makes decompilation a far more viable option. Java classes can often be decompiled with a very high level of accuracy, so that the process of reversing Java classes is usually much simpler than with native code because it boils down to reading a source-code-level representa- tion of the program. Sure, it is still challenging to comprehend a program’s undocumented source code, but it is far easier compared to starting with a low-level assembly language representation.
C#
C# was developed by Microsoft as a Java-like object-oriented language that aims to overcome many of the problems inherent in C++. C# was introduced as part of Microsoft’s .NET development platform, and (like Java and quite a few other languages) is based on the concept of using a virtual machine for executing programs.
C# programs are compiled into an intermediate bytecode format (similar to the Java bytecode) called the Microsoft Intermediate Language (MSIL). MSIL programs run on top of the common language runtime (CLR), which is essen- tially the .NET virtual machine. The CLR can be ported into any platform, which means that .NET programs are not bound to Windows—they could be executed on other platforms.
C# has quite a few advanced features such as garbage collection and type safety that are implemented by the CLR. C# also has a special unmanaged mode that enables direct pointer manipulation.
As with Java, reversing C# programs sometimes requires that you learn the native language of the CLR—MSIL. On the other hand, in many cases manu- ally reading MSIL code will be unnecessary because MSIL code contains 36 Chapter 2
highly detailed information regarding the program and the data types it deals with, which makes it possible to produce a reasonably accurate high-level lan- guage representation of the program through decompilation. Because of this level of transparency, developers often obfuscate their code to make it more difficult to comprehend. The process of reversing .NET programs and the effects of the various obfuscation tools are discussed in Chapter 12.