Extracting Architectural Features from Source Code*
5. Recognizers in Practice
122 D.R. HARRIS, H.B. REUBENSTEIN, A.S. YEH
the function named main) and then traverses the program AST to collect the call tree into a cluster starting at that root function. Figure 5 shows how these various actions are put together for the sample recognition described in Section 3.1.
The database of language and operating system specific functions, the program slicing (and slice evaluation), and the special patterns described in this section are all areas where our architecture recovery tool adds value beyond that of commercially available software reverse engineering tools.
1. let (function-names = FUNCTIONS-OF-INTEREST)
     (for-every item in program such-that
        function-call(item) and
        name(item) in function-names do
2. (for-every item in invocations-of-type(reference) do
3. (for-every item in invocations-of-type('system-calls) do
Figure 9. A family of recognizer fragments
pre-defined list of special calls). In addition, more of the processing is explicitly stated, perhaps making the fragment more difficult to understand (i.e., lacking abstractions). In contrast, the third special purpose recognizer does not require any external parameter settings, but would co-exist in a library with many close cousins. The second fragment is a compromise. In general, our set of parameters allows recognizer authors to modulate abstraction versus understandability issues to produce a collection that best suits the needs of their specific user community.
5.2. Operation and Control
Analysts use recognizers in two ways. First, recognizers can be stand-alone analysis methods for answering a specific question about the source code. For example, an analyst might ask for the locations where the source code invokes the sendmail service. Second, within our architecture recovery implementation, recognizers are semi-automatically bundled together to produce a composite view. For example, Section 6 below shows a system's as-built architecture with respect to the task-spawning style. This view was constructed using the set of default recognizers associated with the entities and relations of the task-spawning style.
Three recognizers were employed. The find-executable-links recognizer found instances of the spawns relation (encoded in the system or execlp calls of the program), a second recognizer found instances of file/shared-memory interprocess communication (through fopen and open calls), and a third looked for separate executables (identified by "main" procedures) that may not have been found by the other recognizers. Within our recovery framework, analysts can override the defaults by making selections from the recognition library. Thus, in either stand-alone or as-built architecture recovery mode, recovery is an interactive process, and we need facilities that help analysts make informed selections from the library.
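As a rough illustration of the first two default recognizers, the sketch below scans C source text for the calls named above. It is only an approximation under stated assumptions: the recognizers described in this paper operate on the program AST, whereas this sketch uses a regular expression over raw text, and the names CALLS_BY_RELATION and find_relation_calls are hypothetical.

```python
import re

# Hypothetical table: relation name -> C library calls that signal it.
# The call names come from the text; the relation labels are illustrative.
CALLS_BY_RELATION = {
    "spawns": {"system", "execlp"},
    "file-ipc": {"fopen", "open"},
}

# Matches an identifier immediately followed by an opening parenthesis.
CALL_PATTERN = re.compile(r"\b([A-Za-z_]\w*)\s*\(")

def find_relation_calls(source):
    """Return (relation, callee, line_number) for every call of interest."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in CALL_PATTERN.finditer(line):
            callee = match.group(1)
            for relation, names in CALLS_BY_RELATION.items():
                if callee in names:
                    hits.append((relation, callee, line_number))
    return hits

SAMPLE = '''int main(void) {
    FILE *log = fopen("run.log", "w");
    system("worker &");
    return 0;
}'''

hits = find_relation_calls(SAMPLE)
```

A text scan like this would also match calls inside comments or string literals, which is one reason an AST-based recognizer is preferable in practice.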
5.2.1. Recognizer Retrieval
Since the library is large (60 or more entries), we have provided two indexing schemes that help the analyst find an appropriate recognizer. The first scheme simply uses the text strings in a description attribute associated with each recognizer. The analyst enters a text string and the implementation returns a list of all recognizers whose description contains text that matches the string. The analyst can review the list of returned descriptions and select the recognizer that looks most promising.
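The first indexing scheme amounts to a substring search over description attributes. A minimal sketch follows, assuming a case-insensitive match (the paper does not say how matching is done); the LibraryEntry class, the entry names other than find-executable-links, and the pairing of names with descriptions are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LibraryEntry:
    name: str
    description: str

# Hypothetical library contents; descriptions are modeled on Figure 11.
LIBRARY = [
    LibraryEntry("find-executable-links",
                 "Spawning LINKS between executable modules"),
    LibraryEntry("find-shell-links",
                 "LINKS between procedures and shell commands"),
    LibraryEntry("find-connection-type",
                 "Connection type used in a network exchange"),
]

def search_by_description(library, query):
    """Return every entry whose description contains the query string."""
    needle = query.casefold()
    return [entry for entry in library
            if needle in entry.description.casefold()]

matches = [entry.name for entry in search_by_description(LIBRARY, "links")]
```

The analyst would then review the returned descriptions and pick the most promising entry.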
The second scheme allows an analyst to see and select from all the recognizers that would return some type of information. While analysts may not remember the name of a recognizer, they will probably know the type of information (e.g., file, function-call, procedure) that they are looking for. To support this retrieval, we have attached effect descriptions to each recognizer. Since the result of running a recognizer may be that some part of the source code is annotated with markers, we think of the "effects" of running a recognizer on the AST. For example, the task-spawning recognizer in Figure 4 finds function calls and files (associated with tasks). The format for these effect descriptions is "[<category> <type>]", where <category> is either "know" or "check" and <type> is some entry in the type hierarchy. Such tuples indicate that the recognizer will "know" about fragments of the stated type or "check" to see if fragments are of the stated type.
Figure 10 is the type taxonomy our implementation uses. Uppercase entries are top entries of taxonomies based on the language model (e.g., C, FORTRAN) along with our specializations (e.g., specializations of function call) and clustering extensions. The depth of indentation indicates the depth in a subtree.
When analysts select a type from this list, the system shows them a list of all the recognizers that find items of that type. Figure 11 is an example that shows the restricted menu of recognizers that achieve [know function-call]. In the event that the analyst does not find a relevant recognizer in the list, the system helps by offering to expand the search to find recognizers that know generalizations of the current type. For example, a request [know special-call] would be extended to the request [know function-call] and then to the request [know expression], climbing into the upper domain model for the legacy system's language.
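The effect-based lookup and its generalization step can be sketched as a walk up the type hierarchy. This is a simplified sketch: the real system offers the expansion interactively, while this version climbs automatically until something is found. The PARENT chain follows the example in the text; the EFFECTS table and the recognizer name find-expressions are hypothetical.

```python
# Generalization links, following the example in the text:
# special-call -> function-call -> expression.
PARENT = {
    "special-call": "function-call",
    "function-call": "expression",
}

# Hypothetical effect descriptions, as (category, type) tuples per recognizer.
EFFECTS = {
    "find-executable-links": [("know", "function-call")],
    "find-expressions": [("know", "expression")],
}

def recognizers_with_effect(category, type_name):
    """Names of recognizers that declare exactly this effect."""
    return sorted(name for name, effects in EFFECTS.items()
                  if (category, type_name) in effects)

def retrieve(category, type_name):
    """Return (type actually matched, recognizers), generalizing as needed."""
    current = type_name
    while current is not None:
        found = recognizers_with_effect(category, current)
        if found:
            return current, found
        current = PARENT.get(current)  # climb one generalization link
    return None, []

matched_type, names = retrieve("know", "special-call")
```

Here no recognizer declares [know special-call], so the request is widened once and answered at [know function-call].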
Once a recognizer is selected, the system prompts the analyst for parameters that the recognizer requires. Analysts can set the reference parameter to the result of a previous recognition, thus providing a mechanism for cascading several recognizers together to retrieve a complex pattern. In addition, there is an explicit backtracking scheme encoded for the recognizers. If a recognizer requires other recognizers to have been run (i.e., to populate some information on the AST), its representation lists those recognizers as pre-conditions. The analyst can review the result and select some subset of the returned results for subsequent analysis. Reasons for selecting only a subset could range from abstracting away details (for understanding or analysis) to removing irrelevant details that cannot be detected syntactically (e.g., a module that is only used for testing).
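Cascading through the reference parameter can be pictured as feeding one recognizer's (possibly pruned) result into the next. The sketch below is illustrative only: the toy item dictionaries stand in for annotated AST objects, and the field names and both function names are hypothetical.

```python
# Toy stand-in for AST items; the field names are hypothetical.
PROGRAM = [
    {"kind": "call", "name": "system", "module": "driver", "args": ['"worker &"']},
    {"kind": "call", "name": "execlp", "module": "driver", "args": ["path", "path"]},
    {"kind": "call", "name": "system", "module": "selftest", "args": ['"ls"']},
    {"kind": "call", "name": "printf", "module": "driver", "args": ['"done"']},
]

def calls_named(program, function_names):
    """First recognizer: function calls whose name is in function_names."""
    return [item for item in program
            if item["kind"] == "call" and item["name"] in function_names]

def with_literal_command(reference):
    """Second recognizer: calls in `reference` whose first argument
    is a string literal."""
    return [item for item in reference
            if item["args"] and item["args"][0].startswith('"')]

spawn_calls = calls_named(PROGRAM, {"system", "execlp"})
# The analyst reviews the result and drops a module used only for testing.
selected = [item for item in spawn_calls if item["module"] != "selftest"]
# The pruned result becomes the reference parameter of the next recognizer.
literal_spawns = with_literal_command(selected)
```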
CLUSTER
  Network-exchange
  RPC-exchange
  Port-exchange
  Pipe-exchange
  Unix-pipe
  Code-fragment
  Connector-fragment
  Module
  Service
  Non-source-file
  Shell-script
  Input-file
  Output-file
  Source-file
  Executable-object
FUNCTION-CALL
  Special-call
  Network-call
  System-call
  I/O-call
  Non-POSIX-compliant-call
FUNCTION-DEF
STRUCT-TYPE
Figure 10. A Taxonomy of recognition types
FUNCTION-CALL-ARTIFACT
NETWORK-CALL : implementations of client process
NETWORK-CALL : implementations of server process
SERVICE : LINKS between the program and any network services or remote procedun
NETWORK-CALL : LINKS between procedures and some service
NETWORK-CALL : LINKS between procedures and network services
PROCESS-INVOCATION : LINKS between procedures and shell commands
SPECIAL-CALL : Connection family used in a network exchange
SPECIAL-CALL : Connection type used in a network exchange
PROCESS-INVOCATION : Spawning LINKS between executable modules
PROCESS-INVOCATION : Invocations that activate executables
FUNCTION-DEF : LINKS between local and remote procedures
SPECIAL-CALL : Function calls identified directly or by dereferenced function name
SPECIAL-CALL : Invocations of members of a family of functions
Abort
Figure 11. Recognizers with effect [know function-call]
5.2.2. Recognizer Results
From among several possible representations, we chose to return recognition results as either sets of objects from the AST or sets of tuples of objects from the AST. This choice has been motivated by the multiple purposes we envision for recognizer use. As we have mentioned, recognition results may stand by themselves in answering a question, they may be joined with other results to form a composite picture (this is how style recognition is accomplished), or they may be used as inputs to other recognizers in a more detailed analysis of the code.
Standard output results are needed to support interoperability among recognizers and to provide a uniform API to applications. This notion needs to be balanced with the need to allow analysts to flexibly compose solutions to a wide variety of questions involving multiple aggregation modes. For example, many architectural features (e.g., tasks, functional units) require an analysis of a calling hierarchy. Given a set of procedures (perhaps a functionally cohesive unit), several aggregations are possible. We might be interested in identifying a set of common callers of these procedures, the entire calling hierarchy, a calling hierarchy that is mutually exclusive with some other set of procedures (i.e., a distinct functional unit), or a set of root nodes (i.e., candidates for task entry points). All of these are meaningful for identifying architectural components. Thus our library contains recognizers that return various aggregations within the calling hierarchy. The danger is that if we have too many different output forms, we will drastically limit our ability to compose recognition results.
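Three of the aggregations mentioned above can be sketched over a small call graph. This is one possible reading rather than the library's actual definitions: "common callers" is taken to mean direct callers shared by every procedure in the set, and "entire calling hierarchy" to mean all transitive callers. The graph and all function names are hypothetical.

```python
# Hypothetical call graph: caller -> set of direct callees.
CALLS = {
    "main": {"init", "run"},
    "run": {"read_msg", "write_msg"},
    "init": {"read_msg"},
    "read_msg": set(),
    "write_msg": set(),
}

def callers_of(graph, procedure):
    """Direct callers of one procedure."""
    return {caller for caller, callees in graph.items() if procedure in callees}

def common_callers(graph, procedures):
    """Procedures that directly call every member of the set."""
    caller_sets = [callers_of(graph, p) for p in procedures]
    return set.intersection(*caller_sets) if caller_sets else set()

def calling_hierarchy(graph, procedures):
    """All procedures that directly or transitively call a member of the set."""
    seen = set()
    frontier = list(procedures)
    while frontier:
        current = frontier.pop()
        for caller in callers_of(graph, current):
            if caller not in seen:
                seen.add(caller)
                frontier.append(caller)
    return seen

def root_nodes(graph, procedures):
    """Members of the hierarchy with no callers: candidate task entry points."""
    hierarchy = calling_hierarchy(graph, procedures) | set(procedures)
    return {p for p in hierarchy if not callers_of(graph, p)}

unit = {"read_msg", "write_msg"}
```

All three functions return plain sets of procedure names, which keeps their results easy to compare and combine.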
Our solution deals with this problem in two ways. First, we output results in a manner that reduces the need for repeating computationally expensive analyses in subsequent recognizers of a cascaded chain. Second, we standardize output levels so that results can be compared and bundled together easily.
Avoiding Redundant Computations: One approach to recognition would be to assume that each recognizer always returns a single object and that adjoining architectural structures can be found piecemeal by following the AST (or using some of the analysis tools described above). We have found this approach to be unsatisfactory because many of the recognizers collect objects in the context of some useful larger structure. Instead, it is more useful to return a structure (i.e., the ordered triples described above) that contains contextual information.
For example, a slice evaluation coupled with the use of program patterns (e.g., the slice associated with the code in Figure 6) can be a relatively expensive computation. Once the recognizer completes this examination it caches the result as the third element of a triple (as in Table 1) to avoid re-computations. This format has enabled us to support extensive architecture recovery without excessive duplication of computations.
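The caching idea can be sketched as follows: the expensive analysis runs once per object, and its outcome travels with the result as the third element of the triple, so later recognizers in a chain read it instead of recomputing. Everything here is an illustrative assumption; evaluate_slice is a trivial stand-in for a real slice evaluation, and the call records are made up.

```python
# Counter used only to show how often the expensive step runs.
STATS = {"evaluations": 0}

def evaluate_slice(call):
    """Stand-in for an expensive slice evaluation; here it only
    strips the quotes from a literal argument."""
    STATS["evaluations"] += 1
    return call["argument"].strip('"')

def find_spawn_links(calls):
    """Return triples (object, enclosing procedure, cached analysis result)."""
    return [(call["id"], call["procedure"], evaluate_slice(call))
            for call in calls]

def spawned_targets(triples):
    """A downstream recognizer: reads the cached third element only."""
    return sorted({target for _, _, target in triples})

# Hypothetical call records.
SPAWN_CALLS = [
    {"id": "call-1", "procedure": "start_all", "argument": '"worker"'},
    {"id": "call-2", "procedure": "start_all", "argument": '"logger"'},
]

triples = find_spawn_links(SPAWN_CALLS)
first = spawned_targets(triples)
second = spawned_targets(triples)  # reuses cached results, no re-evaluation
```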
Standard Contexts: Each recognizer has only a local view; it cannot know how some other recognizer will use its results. The critical concern is to identify some standard contexts so that other parts of an analysis process can rely on a uniform type of response. If we do not have some standardization, the enclosing structure part of a recognition could be a procedure, a file, a directory, a task, or something else. This would require each recognizer to carry out a normalization step prior to using the results of another recognizer.
For the current framework, we selected the procedure level as a standard context. That is to say, unless there is reason to report some other structures, triples will be of the form <object, procedure, procedure>. Our justification for this is that, if necessary, coarser-grained structure (e.g., file, directory) can be easily re-derived from the AST, while procedures offer an architecture-level result that embodies the results of expensive lower-level analyses such as slice evaluation.
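To illustrate why the procedure level is a convenient standard context, the sketch below lifts procedure-level triples to file or directory level with a simple lookup. The FILE_OF table stands in for information that would be read from the AST; its contents, the procedure names, and the lift function are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical procedure -> defining file mapping (read from the AST in practice).
FILE_OF = {
    "start_worker": "src/driver/spawn.c",
    "worker_main": "src/worker/worker.c",
}

def enclosing(procedure, level):
    """Map a procedure to its enclosing structure at the requested level."""
    path = PurePosixPath(FILE_OF[procedure])
    if level == "file":
        return str(path)
    if level == "directory":
        return str(path.parent)
    raise ValueError(f"unknown level: {level}")

def lift(triples, level):
    """Rewrite <object, procedure, procedure> triples at a coarser level."""
    return [(obj, enclosing(src, level), enclosing(dst, level))
            for obj, src, dst in triples]

triples = [("call-17", "start_worker", "worker_main")]
by_file = lift(triples, "file")
by_directory = lift(triples, "directory")
```

Going the other way, from files down to procedures, would require redoing the underlying analysis, which is the asymmetry the paragraph above relies on.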
5.3. Recognizer Representation
We can summarize the above issues by displaying the internal representation we use for each recognizer. In our implementation, each recognizer is an object with a set of attributes that the implementation uses for composition and retrieval. The attributes are as follows:
• Name: a unique identifier
• Description: a textual description of what the recognizer finds (used in indexing)
• Effects: the types of source code fragments that the recognizer finds (also used in indexing)
• Pre-condition: other recognizers that must be run before this recognizer runs
• Environment: the set of parameters that analysts must set before invoking the recognizer
• Recognition method: the action part of the recognizer; written in RRL (as illustrated in Section 3.1 above)
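The attribute list above maps naturally onto a record type. The following is a sketch of one possible encoding, not the implementation's actual representation: the field names are adapted from the attribute names, a Python callable stands in for the RRL recognition method, and the example values are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Recognizer:
    name: str                            # unique identifier
    description: str                     # free text, used in indexing
    effects: list[tuple[str, str]]       # (category, type), e.g. ("know", "function-call")
    method: Callable[..., list]          # stand-in for the RRL recognition method
    pre_conditions: list[str] = field(default_factory=list)  # recognizers to run first
    environment: list[str] = field(default_factory=list)     # parameters the analyst must set

# Illustrative entry; only the recognizer name and description text
# appear in the paper, the remaining values are assumptions.
example = Recognizer(
    name="find-executable-links",
    description="Spawning LINKS between executable modules",
    effects=[("know", "function-call")],
    method=lambda program: [item for item in program
                            if item in ("system", "execlp")],
    environment=["program"],
)

found = example.method(["fopen", "system", "printf"])
```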
In summary, recognizer authors build the RRL descriptions using the RRL language constructs and special analysis functions. They set pre-conditions and environment attributes to link the recognizer into the library. At this time they may add the new recognizer's name to default recognizer lists for the style-level entities/relations.
Subsequently, during an investigation, an analyst retrieves the recognizer either by selecting an entity/relation with a default, by recognizer name, by indicating a text fragment of the description, or by indicating the effect desired. The implementation recursively runs recognizers in the pre-condition attribute, asks the analyst to set any of the required parameters, and interprets the RRL code in the recognizer's method.
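The run sequence just described (pre-conditions first, then parameters, then the method) can be sketched as a small recursive driver. This is a simplified sketch under several assumptions: parameters arrive in a settings dictionary instead of through interactive prompts, Python callables replace interpreted RRL code, cyclic pre-conditions are not handled, and all names other than the attribute names are hypothetical.

```python
def run_recognizer(name, library, settings, results=None):
    """Run a recognizer after recursively running its pre-conditions."""
    results = {} if results is None else results
    if name in results:                  # already run earlier in this chain
        return results[name]
    entry = library[name]
    for required in entry["pre_conditions"]:
        run_recognizer(required, library, settings, results)
    missing = [p for p in entry["environment"] if p not in settings]
    if missing:                          # the real system would prompt the analyst
        raise KeyError(f"{name} needs parameters: {missing}")
    results[name] = entry["method"](settings, results)
    return results[name]

# Hypothetical two-entry library.
LIBRARY = {
    "mark-function-calls": {
        "pre_conditions": [],
        "environment": [],
        "method": lambda settings, results: ["system", "fopen", "printf"],
    },
    "find-calls-of-interest": {
        "pre_conditions": ["mark-function-calls"],
        "environment": ["function-names"],
        "method": lambda settings, results: [
            call for call in results["mark-function-calls"]
            if call in settings["function-names"]
        ],
    },
}

found = run_recognizer("find-calls-of-interest", LIBRARY,
                       {"function-names": {"system", "execlp"}})
```

Results are memoized in the shared dictionary, so a pre-condition that several recognizers depend on runs only once per chain.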
If the analyst employed the recognizer in architecture recovery, the results are added to the as-built architecture with respect to some style. We provide additional support via specialization hierarchies among the architectural entities and relations. Upon finding that few examples of an architectural feature are recognized, the analyst has the option of expanding a search by following generalization and specialization links and searching for architecturally related information. This capability complements the recognizer indexing scheme based on code-level relationships.