Extracting Architectural Features from Source Code*
2. Architecture Recovery - Framework and Process
Our recovery framework (see Figure 1) spans three levels of software representation:
• a program parsing capability (implemented using Software Refinery (Reasoning Systems, 1990)) with accompanying code level organization views, i.e., abstract syntax trees and a "bird's eye" file overview
• an architectural representation that supports both idealized and as-built architectural representations with a supporting library of architectural styles and constituent architectural features
• a source code recognition engine and a supporting library of recognizers
Figure 1 shows how these three levels interact. The idealized architecture contains the initial intentions of the system designers. Developers encode these intentions in the source code. Within our framework, the legacy source code is parsed into an internal abstract syntax tree representation. We run recognizers over this representation to discover architectural features - the components/connectors associated with architectural styles (selecting a particular style selects a set of constituent features to search for). The set of architectural features discovered in a program forms its as-built architecture, which contains views with respect to many architectural styles. Finally, note that the as-built architecture we have recovered is both less than and more than the original idealized architecture. The as-built is less than the idealized because it may miss some of the designer's original intentions and because it may not be complete. The as-built is also more than the idealized because it is up-to-date and because we now have on-line linkage between architecture features and their implementation in the code. We do not have a definition of a complete architecture for a system.
The notions of code coverage described later in the paper provide a simple metric to use in determining when a full understanding of the system has been obtained.
The framework supports architectural recovery in both a bottom-up and top-down fashion.
In bottom-up recovery, analysts use the bird's eye view to display the overall file structure and file components of the system. The features we display (see Figure 2) include file type (diamond shapes for source files with entry point functions; rectangles for other source files), name, pathname of directory, number of top level forms, and file size (indicated by the size of the diamond or rectangle). Since file structure is a very weak form of architectural organization, only shallow analysis is possible; however, the bird's eye view is a place where our implementation can register results of progress toward recognition of various styles.
112 D.R. HARRIS, H.B. REUBENSTEIN, A.S. YEH
Figure 1. Architectural recovery framework. [The idealized architecture is implemented by the program; the program parses into an abstract syntax tree; the abstract syntax tree provides clues for recognizing architectural features; architectural features combine, using architectural styles, to form views of the as-built architecture.]
In top-down recovery, analysts use architectural styles to guide a mixed-initiative recovery process. From our point of view, an architectural style places an expectation on what recovery tools will find in the software system. That is, the style establishes a set of architectural feature types, which define the component/connector types to be found in the software. Recognizers are used to find the component/connector features. Once the features are discovered, the set of mappings from feature types to their realization in the source code forms the as-built architecture of the system.
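The top-down process described above can be illustrated with a small sketch. This is not the authors' Software Refinery implementation; it is a minimal Python model in which a style selects feature types, each feature type carries a recognizer, and the as-built view is the mapping from feature types to code locations. The names `FeatureType`, `Style`, `recover`, and the toy `program` dictionary (standing in for an abstract syntax tree) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A recognizer takes a parsed program (here a plain dict standing in for an
# abstract syntax tree) and returns the code locations realizing a feature.
Recognizer = Callable[[dict], List[str]]


@dataclass
class FeatureType:
    name: str               # e.g. "task" or "spawns"
    recognizer: Recognizer  # how instances of the feature are found


@dataclass
class Style:
    name: str
    feature_types: List[FeatureType] = field(default_factory=list)


def recover(style: Style, program: dict) -> Dict[str, List[str]]:
    """Run every recognizer the style selects. The resulting mapping from
    feature type to code locations is one view of the as-built architecture."""
    return {ft.name: ft.recognizer(program) for ft in style.feature_types}


# Hypothetical toy program: file name -> names of functions it calls.
program = {"main.c": ["system", "printf"], "worker.c": ["read"]}

task_spawning = Style("task-spawning", [
    # Assumption: every file is treated as a task.
    FeatureType("task", lambda p: sorted(p)),
    # Assumption: a file that calls system() contains a spawn connector.
    FeatureType("spawns",
                lambda p: sorted(f for f, calls in p.items() if "system" in calls)),
])

as_built = recover(task_spawning, program)
print(as_built)  # {'task': ['main.c', 'worker.c'], 'spawns': ['main.c']}
```

Selecting a different style would simply swap in a different list of feature types, which mirrors the statement that a style places an expectation on what the recovery tools will find.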
2.1. Architectural Styles
The research community has provided detailed examples (Garlan and Shaw, 1993, Shaw, 1989, Shaw, 1991, Perry and Wolf, 1992, Hofmeister, Nord, Soni, 1995) of architectural styles, and we have codified many of these in an architecture modeling language.
Our architecture modeling language uses entity/relation taxonomies to capture the component/connector style aspects that are prevalent in the literature (Abowd, Allen, Garlan, 1993, Perry and Wolf, 1992, Tracz, 1994). Entities include clusters, layers, processing elements, repositories, objects, and tasks. Some recognizers discover source code instances of entities where developers have implemented major components - "large" segments of source code (e.g., a layer may be implemented as a set of procedures). Relations such as contains, initiates, spawns, and is-connected-to each describe how entities are linked.
Component participation in a relation follows from the existence of a connector - a specific code fragment (e.g., special operating system invocation) or the infrastructure that processes these fragments. This infrastructure may or may not be part of the body of software under analysis. For example, it may be found in a shared library or it may be part of the implementation language itself.
Figure 2. Bird's Eye Overview
As an illustration, Figure 3 details the task entity and the spawns relation associated with a task spawning style. In a task spawning architectural style, tasks (i.e., executable processing elements) are linked when one task initiates a second task. Task spawning is a style that is recognized by the presence of its connectors (i.e., the task invocations). Its components are tasks, repositories, and task-functions. Its connectors are spawns (invocations from tasks to tasks), spawned-by (the inverse of spawns), uses (relating tasks to any tasks with direct interprocess communications and to any repositories used for interprocess communications), and conducts (relating tasks to functional descriptions of the work performed).
Tasks are a kind of processing element that programmers might implement by files (more generally, by call trees). A default recognizer named executables will extract a collection of tasks. Spawns relates tasks to tasks (i.e., parent and child tasks respectively). Spawns might be implemented by objects of type system-call (e.g., in Unix/C, programmers can use a system, execl, execv, execlp, or execvp call to start a new process via a shell command). Analysts can use the default recognizer, find-executable-links, to retrieve instances of task spawning.
defentity TASK
    :specialization-of processing-element
    :possible-implementation file
    :recognized-by executables
defrel SPAWNS
    :specialization-of initiates
    :possible-implementation system-call
    :recognized-by find-executable-links
    :domain task
    :range task
Figure 3. Elements in an architecture modeling language
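To make the spawns relation concrete, the following Python sketch shows one way a recognizer in the spirit of find-executable-links might work. It is an assumption-laden stand-in: the recognizers described here run over an abstract syntax tree, whereas this sketch uses a regular expression over raw C text, so it would miss spawn calls whose first argument is not a string literal and would be fooled by calls inside comments. The function signature and the sample `sources` are hypothetical; only the list of Unix/C calls comes from the text.

```python
import re
from typing import Dict, List, Tuple

# Unix/C calls that the text lists as ways to start a new process.
SPAWN_CALLS = ("system", "execl", "execv", "execlp", "execvp")

# Match a call such as  execl("/bin/worker", ...)  and capture the call name
# and the first string-literal argument (the command or program started).
SPAWN_RE = re.compile(r'\b(' + "|".join(SPAWN_CALLS) + r')\s*\(\s*"([^"]*)"')


def find_executable_links(sources: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (parent_file, call_name, target) triples, one per spawn site.

    sources maps a file name to its C source text. Each triple is a
    candidate instance of the SPAWNS relation, with the file as parent task.
    """
    links = []
    for filename, text in sorted(sources.items()):
        for call, target in SPAWN_RE.findall(text):
            links.append((filename, call, target))
    return links


# Hypothetical two-file system used only for illustration.
sources = {
    "monitor.c": 'int main(void) { system("logger start"); '
                 'execlp("worker", "worker", (char *)0); return 0; }',
    "worker.c": 'int main(void) { return 0; }',
}

links = find_executable_links(sources)
for link in links:
    print(link)
```

Each returned triple pairs a parent file with the program it starts, which corresponds to the :domain and :range slots of SPAWNS once targets are resolved to task entities.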
Many of the styles we work with have been elaborated by others (e.g., pipe and filter, object-oriented, abstract data type, implicit invocation, layered, repository). In addition we have worked with a few styles that have special descriptive power for the type of programs we have studied. These include application programming interface (API) use, the task spawning associated with real time systems, and a service invocation style. Space limitations do not permit a full description of all styles here. However, we offer two more examples to help the reader understand the scope of our activities.
Layered: In a layered architecture the components (layers) form a partitioning of a subset, possibly the entire system, of the program's procedures and data structures. As mentioned in (Garlan and Shaw, 1993), layering is a hierarchical style: the connectors are the specific references that occur in components in an upper layer and reference components that are defined in a lower layer. One way to think of a layering is that each layer provides a service to the layer(s) above it. A layering can be either opaque (components in one layer cannot reference components more than one layer away) or transparent (components in one layer can reference components more than one layer away).
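The opaque/transparent distinction can be checked mechanically once components have been assigned to layers. The sketch below is an illustration under stated assumptions, not a recognizer from our library: the function name `classify_layering`, the numeric layer encoding (higher number means upper layer), and the decision to treat any upward reference as a violation are all hypothetical choices.

```python
from typing import Dict, List, Tuple


def classify_layering(layer_of: Dict[str, int],
                      references: List[Tuple[str, str]]) -> str:
    """Classify a proposed layering from observed component references.

    layer_of maps a component to its layer number (higher = upper layer).
    references lists (source, target) pairs: source references target.
    """
    skips_a_layer = False
    for src, dst in references:
        distance = layer_of[src] - layer_of[dst]
        if distance < 0:
            # Assumption: a lower layer referencing an upper layer breaks
            # the hierarchy, so the proposal is rejected outright.
            return "not layered"
        if distance > 1:
            # The reference reaches past the adjacent layer.
            skips_a_layer = True
    # No skipping references: consistent with an opaque layering.
    return "transparent" if skips_a_layer else "opaque"


# Hypothetical three-layer system.
layer_of = {"ui": 2, "logic": 1, "storage": 0}

print(classify_layering(layer_of, [("ui", "logic"), ("logic", "storage")]))
print(classify_layering(layer_of, [("ui", "logic"), ("ui", "storage")]))
print(classify_layering(layer_of, [("storage", "ui")]))
```

The three calls print `opaque`, `transparent`, and `not layered` respectively. A result of `opaque` means only that no observed reference skips a layer; it does not prove that skipping is forbidden by design.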
Data Abstractions and Objects: Two related ways to partially organize a system are to identify its abstract data types and its groups of interacting objects (Abelson and Sussman, 1984, Garlan and Shaw, 1993). A data abstraction is one or more related data representations whose internal structure is hidden to all but a small group of procedures, i.e., the procedures that implement that data abstraction. An object is an entity which has some persistent state (only directly accessible to that entity) and a behavior that is governed by that state and by the inputs the object receives. These two organization methods are often used together.
Often, the instances of an abstract data type are objects, or conversely, objects are instances of classes that are described as types of abstract data.
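The definition of a data abstraction given above suggests a simple heuristic: a data representation whose internal structure is touched by only a few procedures is a candidate abstract data type. The following Python sketch illustrates that heuristic only. The function `candidate_abstractions`, the `max_accessors` threshold (one arbitrary reading of "a small group"), and the sample access list are hypothetical and are not taken from our recognizer library.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def candidate_abstractions(accesses: Iterable[Tuple[str, str]],
                           max_accessors: int = 3) -> Dict[str, Set[str]]:
    """Find data types whose internals are used by a small group of procedures.

    accesses holds (procedure, data_type) pairs, one per place where the
    procedure touches the internal structure (e.g. a struct field) of
    data_type. Returns data_type -> set of accessing procedures, keeping
    only types with at most max_accessors distinct accessors.
    """
    accessors: Dict[str, Set[str]] = defaultdict(set)
    for procedure, data_type in accesses:
        accessors[data_type].add(procedure)
    return {t: procs for t, procs in accessors.items()
            if len(procs) <= max_accessors}


# Hypothetical access facts, as a parser might extract them.
accesses = [
    ("stack_push", "Stack"), ("stack_pop", "Stack"), ("stack_push", "Stack"),
    ("main", "Config"), ("parse", "Config"), ("report", "Config"), ("log", "Config"),
]

result = candidate_abstractions(accesses, max_accessors=2)
print(result)
```

Here `Stack` is reported with accessors `stack_push` and `stack_pop`, while `Config` is excluded because four procedures reach into its structure. An analyst would still need to confirm each candidate.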