Extracting Architectural Features from Source Code*
4. Analysis Tools for Supporting Recognition
118 D.R. HARRIS, H.B. REUBENSTEIN, A.S. YEH
The recognizers are designed to recognize typical and possible patterns of architectural feature implementations. The recognizers are not fool-proof. A programmer can always find an obscure way to implement an architectural feature that the recognizers will not detect, and a programmer may write code that accidentally aligns with an architectural feature.
However, the recognizers written so far capture the more common patterns and have worked well on the examples we have seen. As we encounter more examples, we will modify and expand the recognizers as needed.
The more advanced recognizers from the set of recognizers (listed in the appendix) capture task spawnings and service invocations via slice evaluation and searching for special programming patterns. Section 4 highlights this analysis. In addition, (Holtzblatt, Piazza, Reubenstein, Roberts, 1994) describes our related work on CMS2 code. In most of the other cases, the features are not difficult to recognize. Among other things, the recognizers cover a wide spectrum of components and connectors that C/Unix programmers typically use for implementing architectural features.
4.2. Special Patterns
Slicing provides only part of the story for the examples in Figure 6. Programmers use stereotypical code patterns to implement frequently occurring computations. Some of these patterns can be easily recognized in abstract syntax trees. For example, the code in Figure 6 shows two standard ways of invoking an executable (and potentially invoking a task). To uncover this architectural feature, we need to exploit knowledge of two patterns.
The first pattern identifies the position of the key command string that contains the name of the executable: the first argument for system calls, and the last argument before the terminating null pointer for execlp.
The second pattern describes potential ways programmers can encode pathnames in the command strings. In the first example, the function sprintf binds the variable cmd to the string "%s/snoopy", where the %s is replaced by the name of the directory stored in the variable bin_dir. In the second, the movement to the appropriate directory ("cd %s/bin;") is separated from the actual spawning of "snoopy". We designed our approach to catch such dominant patterns and to ferret out the names of files and executable images (possibly tasks) within string arguments.
1.  sprintf(cmd, "%s/snoopy", bin_dir);
    if (debug == 0)
        status = system(cmd);

2.  sprintf(cmd, "cd %s/bin; ./snoopy", top_dir);
    if (fork() == 0) {
        execlp("/bin/sh", "sh", "-c", cmd, (char *)0);
    }
Figure 6. Two approaches for invoking an executable image
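The second pattern, pulling the executable name out of a command string, can be illustrated with a minimal Python sketch. This is not the authors' recognizer: the function name executable_names, the splitting on ";", and the skip list of shell built-ins are all assumptions made for illustration, and the %s placeholder is simply left unexpanded.

```python
import posixpath
import shlex

# Hypothetical skip list: shell built-ins that change state but spawn nothing.
SHELL_BUILTINS = {"cd", "export", "set"}

def executable_names(command):
    """Return base names of programs launched by a shell command string.

    Assumption (not from the paper): simple commands are separated by ';'
    and the first word of each simple command names the program to run.
    """
    names = []
    for simple_command in command.split(";"):
        words = shlex.split(simple_command)
        if not words or words[0] in SHELL_BUILTINS:
            continue  # empty segment or a built-in such as 'cd'
        # Drop any directory prefix, including an unexpanded "%s/".
        names.append(posixpath.basename(words[0]))
    return names
```

Under these assumptions, both command strings of Figure 6 ("%s/snoopy" and "cd %s/bin; ./snoopy") yield the single name snoopy, which is the kind of normalization the special patterns aim for.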
Other examples of patterns for C/Unix systems include the use of socket calls with connect or bind calls for creating client-server architectures, and the declaration of read/write modes in fopen calls. While our approach has been somewhat catch-as-catch-can, we have found that identifying only a few of these patterns goes a long way toward recovering architectural features across many architectural styles.
4.3. Clustering
Clusters are groupings of features of the program - a set of files, a set of procedures, or other informal structures of a program. Some recognizers need to bundle up collections of objects that may be de-localized in the code. Clustering facilities follow some algorithm for gathering elements from the abstract syntax tree. They create clusters (or match new
collections to an old cluster), and, in some cases, conduct an analysis that assigns properties to pairs of clusters based on relationships among constituent parts of the clusters.
For example, our OBject and Abstract Data type (OBAD) recovery sub-tool (Harris, Reubenstein, Yeh: Recovery, 1995) builds clusters whose constituents are collections of procedures, data structures, or global variables. OBAD is an interactive approach to the recovery of implicit abstract data types (ADTs) and object instances from C source code.
This approach includes automatic recognition and semi-automatic techniques that handle potential recognition pitfalls.
OBAD assumes that an ADT is implemented as one or a few data structure types whose internal fields are only referenced by the procedures that are part of the ADT. The basic version of OBAD finds candidate ADTs by examining a graph where the procedures and structure types are the nodes of the graph, and the references by the procedures to the internal fields of the structures are the edges. The set of connected components in this graph forms the set of candidate ADTs. OBAD has automatic and semi-automatic enhancements to handle pitfalls by modifying what is put into the above graphs. Currently, OBAD constructs the graph from the abstract syntax tree. In the future, OBAD will use graphs made from the results returned by more primitive recognizers.
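The basic graph construction can be sketched in a few lines of Python. This is an illustrative sketch rather than OBAD itself: the input format (pairs of procedure name and structure type name, one pair per field reference) and the function name candidate_adts are assumptions, and none of OBAD's pitfall-handling enhancements are modeled.

```python
from collections import defaultdict

def candidate_adts(field_references):
    """Group procedures and structure types into candidate ADTs.

    field_references: iterable of (procedure, struct_type) pairs, one per
    reference from a procedure to an internal field of a structure type
    (a hypothetical input format). Returns a list of connected components,
    each a set of ("proc", name) and ("type", name) nodes.
    """
    adjacency = defaultdict(set)
    for procedure, struct_type in field_references:
        p, t = ("proc", procedure), ("type", struct_type)
        adjacency[p].add(t)
        adjacency[t].add(p)

    seen, components = set(), []
    for start in list(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components
```

For instance, if push and pop both touch fields of a Stack type while enqueue touches only a Queue type, the sketch returns two candidates. A single procedure that reaches into both types would merge them into one component, which is the sort of pitfall the enhancements mentioned above are meant to handle.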
Also, recognizers can use clusters as input and proceed to detect relationships among clusters. For example, a computation of pairwise cluster level dominance looks at the procedures within two clusters. If cluster A contains a reference to an entry point defined in cluster B, while cluster B does not reference cluster A, we say that A is dominant over B. This notion of generalizing properties held by individual elements of groups occurs in several of our recognizers.
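A minimal Python sketch of the pairwise dominance test follows. It assumes, for illustration only, that each cluster is a set of procedure names, that references are given as (caller, callee) pairs, and that every procedure in a cluster counts as an entry point; the function names are hypothetical.

```python
def references_cluster(source, target, references):
    """True if any procedure in `source` references a procedure in `target`."""
    return any(caller in source and callee in target
               for caller, callee in references)

def dominant_cluster(cluster_a, cluster_b, references):
    """Return "A" if A is dominant over B, "B" if B is dominant over A,
    and None when neither or both clusters reference the other."""
    a_to_b = references_cluster(cluster_a, cluster_b, references)
    b_to_a = references_cluster(cluster_b, cluster_a, references)
    if a_to_b and not b_to_a:
        return "A"
    if b_to_a and not a_to_b:
        return "B"
    return None
```

The cluster-level property is derived entirely from references between individual procedures, which is the generalization from elements to groups described above.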
4.4. Language/Operating-System Models
A design goal has been to write recognizers that are LOL-independent, that is, independent of specific patterns due to the source code Language, the Operating system, and any Legacy system features. Our hope is that we will be able to reuse most recognizers across ASTs associated with different LOL combinations. While we have not explored this goal extensively, we have had some success with recognizers that work for both FORTRAN (under the MPX operating system) and C (under Unix). Our approach is two-fold. First, we write recognizers using special accessors and analysis functions that have distinct implementations for each LOL. That is, the special access functions need to be rewritten for each LOL, but the recognizer's logic is reusable across languages. Second, we isolate the names of LOL-specific functions (e.g., operating system calls) in separately loadable libraries of call specifications. Each call specification describes the language, operating system, and sometimes even target system approach for coding LOL-neutral behaviors such as system calls, time and date calls, communication channel creators, data accessing, data transmission, input/output calls, APIs for commercial products, and network calls. For example, Figure 7 is the C/Unix model for system-calls (i.e., calls that run operating system line commands or spawn a task), while Figure 8 shows an analogous FORTRAN/MPX model.
These specifications are also a convenient place for describing attributes of special patterns. In these examples, the key-positions field indicates the argument position of the variable that holds the name of the executable invoked.
defcalls SYSTEM-CALLS :call-desc "System Calls"
:call-type system-call
:call-ref-names "system", "execve",
"execl", "execv",
"execlp", "execvp", "execle"
:key-positions first, next-last, next-last, next-last, next-last, next-last, next-last
Figure 7. A C/Unix Call Specification
defcalls SYSTEM-CALLS :call-desc "System Calls"
:call-type system-call
:call-ref-names "m::rsum", "m::sspnd"
:key-positions first, first
Figure 8. A FORTRAN/MPX Call Specification
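To make the role of the key-positions field concrete, here is a minimal Python sketch of how a recognizer might consult such specifications. The CallSpec class, the key_argument function, and the list-of-strings representation of call arguments are assumptions for illustration; only the function names and position keywords are taken from Figures 7 and 8.

```python
from dataclasses import dataclass, field

@dataclass
class CallSpec:
    """Hypothetical in-memory form of a defcalls specification."""
    description: str
    call_type: str
    key_positions: dict = field(default_factory=dict)  # function name -> position keyword

# Data transcribed from Figure 7 (C/Unix).
C_UNIX_SYSTEM_CALLS = CallSpec("System Calls", "system-call", {
    "system": "first", "execve": "next-last", "execl": "next-last",
    "execv": "next-last", "execlp": "next-last", "execvp": "next-last",
    "execle": "next-last",
})

# Data transcribed from Figure 8 (FORTRAN/MPX).
FORTRAN_MPX_SYSTEM_CALLS = CallSpec("System Calls", "system-call", {
    "m::rsum": "first", "m::sspnd": "first",
})

def key_argument(spec, function_name, arguments):
    """Return the argument holding the executable name, or None if the
    function is not listed in the specification."""
    position = spec.key_positions.get(function_name)
    if position is None or not arguments:
        return None
    if position == "first":
        return arguments[0]
    if position == "next-last":
        return arguments[-2] if len(arguments) >= 2 else None
    raise ValueError("unknown key position: " + position)
```

With this arrangement the lookup logic stays the same for both models and only the loaded specification changes, which is the separation the LOL-independence goal calls for.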
4.5. An Example - Putting it all together
We return to the find-executable-links recognizer described in Section 3.1. When faced with either code fragment of Figure 6, this recognizer will collect the appropriate triple. We explain this activity in terms of the above analysis capabilities. The functions gototop-from-root and invocations-of-type perform their job by traversing the program AST. invocations-of-type accesses the call specification to tell it which functions in the examined program can implement some architectural style feature. For example, in the Unix operating system, the system-call specification names the functions that can spawn a task (i.e., system or members of the execlp family of functions). The function process-invoked uses slice evaluation to find the value(s) of the arguments to the function calls returned by invocations-of-type. process-invoked then uses special patterns to determine the name of the executable image within the command string. In addition, process-invoked consults a map to tell it which source code file has the root for which task. The map is currently hand generated from examining system makefiles. In the file with the root, process-invoked finds the task's root function (in the C language, this is the function named main) and then traverses the program AST to collect the call tree into a cluster starting at that root function. Figure 5 shows how these various actions are put together for the sample recognition described in Section 3.1.
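The last step, collecting a task's call tree into a cluster, amounts to a reachability traversal from the root function. The Python sketch below assumes the call relation has already been extracted into a dictionary from caller name to callee names; that representation and the function name call_tree_cluster are illustrative assumptions, since the tool itself works on the AST.

```python
def call_tree_cluster(call_graph, root="main"):
    """Collect every function reachable from `root` into one cluster.

    call_graph: dict mapping a function name to an iterable of the
    functions it calls (a hypothetical pre-extracted form of the AST).
    """
    cluster, pending = set(), [root]
    while pending:
        function = pending.pop()
        if function in cluster:
            continue  # already visited; also guards against recursion cycles
        cluster.add(function)
        pending.extend(call_graph.get(function, ()))
    return cluster
```

Functions that are defined in the file but never reached from the root stay outside the cluster, so the result approximates the code belonging to one task.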
The database of language and operating system specific functions, the program slicing (and slice evaluation), and the special patterns described in this section are all areas where our architecture recovery tool adds value beyond that of commercially available software reverse engineering tools.