The next logical leap that supersedes procedures is to divide a program into objects. Designing a program using objects is an entirely different process than the process of designing a regular procedure-based program. This process is called object-oriented design (OOD), and is considered by many to be the most popular and effective approach to software design currently available.
OOD methodology defines an object as a program component that has both data and code associated with it. The code can be a set of procedures that is related to the object and can manipulate its data. The data is part of the object and is usually private, meaning that it can only be accessed by object code, but not from the outside world. This simplifies the design processes, because developers are forced to treat objects as completely isolated entities that can only be accessed through their well-defined interfaces. Those interfaces usu- ally consist of a set of procedures that are associated with the object. Those pro- cedures can be defined as publicly accessible procedures, and are invoked primarily by clients of the object. Clients are other components in the program that require the services of the object but are not interested in any of its imple- mentation details. In most programs, clients are themselves objects that simply require the other objects’ services.
Beyond the mere division of a program into objects, most object-oriented programming languages provide an additional feature called inheritance.
Inheritance allows designers to establish a generic object type and implement many specific implementations of that type that offer somewhat different functionality. The idea is that the interface stays the same, so the client using the object doesn’t have to know anything about the specific object type it is dealing with—it only has to know the base type from which that object is derived.
This concept is implemented by declaring a base object, which includes a dec- laration of a generic interface to be used by every object that inherits from that base object. Base objects are usually empty declarations that offer little or no actual functionality. In order to add an actual implementation of the object type, another object is declared, which inherits from the base object and contains the actual implementations of the interface procedures, along with any support code or data structures. The beauty of this system is that for a single base object there can be multiple descendant objects that can implement entirely different functionalities, but export the same interface. Clients can use these objects with- out knowing the specific object type they are dealing with—they are only aware of the base object’s type. This concept is called polymorphism.
managed in the program. This requires two perspectives: the high-level per- spective as viewed by software developers and the low-level perspective that is viewed by reversers.
High-level languages tend to isolate software developers from the details surrounding data management at the system level. Developers are usually only made aware of the simplified data flow described by the high-level language.
Naturally, most reversers are interested in obtaining a view of the program that matches that simplified high-level view as closely as possible. That’s because the high-level perspective is usually far more human-friendly than the machine’s perspective. Unfortunately, most programming languages and soft- ware development platforms strip (or mangle) much of that human-readable information from binaries shipped to end users.
In order to be able to recover some or all of that high-level data flow infor- mation from a program binary, you must understand how programs view and treat data from both the programmer’s high-level perspective and the low- level machine-generated code. The following sections take us through a brief overview of high-level data constructs such as variables and the most common types of data structures.
Variables
For a software developer, the key to managing and storing data is usually named variables. All high-level languages provide developers with the means to declare variables at various scopes and use them to store information.
Programming languages provide several abstractions for these variables.
The level at which variables are defined determines which parts of the pro- gram will be able to access it, and also where it will be physically stored. The names of named variables are usually relevant only during compilation. Many compilers completely strip the names of variables from a program’s binaries and identify them using their address in memory. Whether or not this is done depends on the target platform for which the program is being built.
User-Defined Data Structures
User-defined data structures are simple constructs that represent a group of data fields, each with its own type. The idea is that these fields are all somehow related, which is why the program stores and handles them as a single unit. The data types of the specific fields inside a data structure can either be simple data types such as integers or pointers or they can be other data structures.
While reversing, you’ll be encountering a variety of user-defined data struc- tures. Properly identifying such data structures and deciphering their contents is critical for achieving program comprehension. The key to doing this suc- cessfully is to gradually record every tiny detail discovered about them until 30 Chapter 2
you have a sufficient understanding of the individual fields. This process will be demonstrated in the reversing chapters in the second part of this book.
Lists
Other than user-defined data structures, programs routinely use a variety of generic data structures for organizing their data. Most of these generic data structures represent lists of items (where each item can be of any type, from a simple integer to a complex user-defined data structure). A list is simply a group of data items that share the same data type and that the program views as belonging to the same group. In most cases, individual list entries contain unique information while sharing a common data layout. Examples include lists such as a list of contacts in an organizer program or list of e-mail messages in an e-mail program. Those are the user-visible lists, but most programs will also maintain a variety of user-invisible lists that manage such things as areas in memory currently active, files currently open for access, and the like.
The way in which lists are laid out in memory is a significant design deci- sion for software engineers and usually depends on the contents of the items and what kinds of operations are performed on the list. The expected number of items is also a deciding factor in choosing the list’s format. For example, lists that are expected to have thousands or millions of items might be laid out dif- ferently than lists that can only grow to a couple of dozens of items. Also, in some lists the order of the items is critical, and new items are constantly added and removed from specific locations in the middle of the list. Other lists aren’t sensitive to the specific position of each item. Another criterion is the ability to efficiently search for items and quickly access them. The following is a brief discussion of the common lists found in the average program:
■■ Arrays: Arrays are the most basic and intuitive list layout—items are placed sequentially in memory one after the other. Items are referenced by the code using their index number, which is just the number of items from the beginning of the list to the item in question. There are also multidimensional arrays, which can be visualized as multilevel arrays.
For example, a two-dimensional array can be visualized as a simple table with rows and columns, where each reference to the table requires the use of two position indicators: row and column. The most signifi- cant downside of arrays is the difficulty of adding and removing items in the middle of the list. Doing that requires that the second half of the array (any items that come after the item we’re adding or removing) be copied to make room for the new item or eliminate the empty slot pre- viously occupied by an item. With very large lists, this can be an extremely inefficient operation.
Low-Level Software 31
06_574817 ch02.qxd 3/16/05 8:35 PM Page 31
■■ Linked lists: In a linked list, each item is given its own memory space and can be placed anywhere in memory. Each item stores the memory address of the next item (a link), and sometimes also a link to the previ- ous item. This arrangement has the added flexibility of supporting the quick addition or removal of an item because no memory needs to be copied. To add or remove items in a linked list, the links in the items that surround the item being added or removed must be changed to reflect the new order of items. Linked lists address the weakness of arrays with regard to inefficiencies when adding and removing items by not placing items sequentially in memory. Of course, linked lists also have their weaknesses. Because items are randomly scattered through- out memory, there can be no quick access to individual items based on their index. Also, linked lists are less efficient than arrays with regard to memory utilization, because each list item must have one or two link pointers, which use up precious memory.
■■ Trees: A tree is similar to a linked list in that memory is allocated sepa- rately for each item in the list. The difference is in the logical arrange- ment of the items: In a tree structure, items are arranged hierarchically, which greatly simplifies the process of searching for an item. The root item represents a median point in the list, and contains links to the two halves of the tree (these are essentially branches): one branch links to lower-valued items, while the other branch links to higher-valued items. Like the root item, each item in the lower levels of the hierarchy also has two links to lower nodes (unless it is the lowest item in the hierarchy). This layout greatly simplifies the process of binary searching, where with each iteration one eliminates one-half of the list in which it is known that the item is not present. With a binary search, the number of iterations required is very low because with each iteration the list becomes about 50 percent shorter.