www.elsevier.nl/locate/isprsjprs
Review Paper
Modelling and representation issues in automated feature extraction from aerial and satellite images
Arcot Sowmya a,*, John Trinder b
a School of Computer Science and Engineering, The University of New South Wales, Sydney NSW 2052, Australia
b School of Geomatic Engineering, The University of New South Wales, Sydney NSW 2052, Australia
Received 6 October 1998; accepted 21 December 1999
Abstract
New digital systems for the processing of photogrammetric and remote sensing images have led to new approaches to information extraction for mapping and Geographic Information System (GIS) applications, with the expectation that data can become more readily available at a lower cost and with greater currency. Demands for mapping and GIS data are increasing as well for environmental assessment and monitoring. Hence, researchers from the fields of photogrammetry and remote sensing, as well as computer vision and artificial intelligence, are bringing together their particular skills for automating these tasks of information extraction. The paper will review some of the approaches used in knowledge representation and modelling for machine vision, and give examples of their applications in research for image understanding of aerial and satellite imagery. © 2000 Elsevier Science B.V. All rights reserved.
Keywords: artificial intelligence; computer vision; knowledge; models; photogrammetry; remote sensing; representation; feature extraction; automation
1. Introduction
Data acquisition by photogrammetry and remote sensing for mapping and GIS has traditionally been characterised by the efficient manual extraction of high precision 3-D data from images ranging in scale from those derived from space systems, to close-range images of stationary objects. These procedures are generally time-consuming and expensive. This has limited the amount and resolution of terrain information that can be extracted on current mapping and GIS budgets. In addition, in many cases, the data has tended to become out-of-date because of poor revision cycles adopted by data acquisition authorities. Data acquisition systems have always been technology-driven, and have been adapted to the latest methods and equipment. This will clearly continue in the future as digital systems become available. Digital systems provide opportunities for new approaches to information extraction for mapping and GIS based on increasing levels of automation. They also accommodate cutting-edge techniques in computer vision that draw upon areas such as artificial intelligence and machine learning. Hence, researchers from the fields of photogrammetry and remote sensing, as well as computer vision and artificial intelligence, are blending their particular skills to attack the specific problems in this application area.

* Corresponding author. Tel.: +61-2-9385-3936; fax: +61-2-9385-1814. E-mail address: [email protected] (A. Sowmya).
Automatic information extraction of the terrain surface in the fields of photogrammetry and remote sensing requires the formulation of procedures and knowledge that encapsulate the content of the images. This is a non-trivial task, because of the complexity of the information stored in the images. Images of the terrain surface used in photogrammetry may have scales varying from 1:3000 to 1:90,000, while in remote sensing the pixel footprints usually vary from 1 to 30 m. The structure of the features in images of the terrain is complex, being a combination of many different intensities that can represent natural features such as vegetation, geomorphological and hydrological features, objects constructed by humans such as buildings and roads, and artifacts caused by variations in illumination of the terrain by the sun, such as shadows and other changes in brightness. In addition, the context in which features occur is considerably more complex than may occur in 'normal' photographic images. These characteristics mean that extraction of information from aerial and satellite images presents major challenges. The research on information extraction must consider primarily the semantic aspects of the data. However, the geometric quality of the extracted information must also be considered, so that it satisfies the relevant specifications of spatial data.
Computer vision is the enterprise of automating and integrating a wide range of processes and representations used for visual perception (Ballard and Brown, 1982) by constructing explicit, meaningful descriptions of objects from images, using a variety of approaches and techniques including digital image processing, pattern recognition, geometric modelling and cognitive processing. Computer vision concerns itself chiefly with the problem of image interpretation and understanding, and attempts to achieve it via object and scene recognition. In this task, it employs the techniques of attribute and relation extraction from an image, shape representation and description, and finally, model-based recognition using the attributes extracted and described.
The application of computer vision approaches to the task of extraction of information from digital remotely sensed images is complicated by the method of acquisition of these images, and the consequent data characteristics need careful consideration when applying or adapting methods developed for 'normal' photographic images (De Gunst, 1996). Many difficult-to-handle features present in computer vision applications also tend to persist in aerial and satellite images. The common problems include noise in the acquisition process, the effects of shading, variations in illumination and geometry due to camera angle and position, and occlusion and partial object problems.
There are differences in the approaches taken to the interpretation of aerial/satellite and other photographic images. Three-dimensional object geometry is an essential element of the processing of aerial images, not only for the purposes of extracting elevations, but also because the third dimension will provide additional information for the interpretation of the image. The issue then arises as to whether 2-D or 3-D information will result in a better feature description. 2-D image data has generally been considered sufficient for most research in the computer science community, though 3-D processing in the form of stereoscopic vision and depth extraction is being addressed by some research groups. Object models in computer vision are dominated by shape- and appearance-oriented descriptions, though there are a few exceptions. For low- and medium-resolution aerial images, the utility of shape-based recognition is questionable. However, context plays a larger role in remotely sensed image interpretation; for example, a bridge may be more easily recognised as part of a road which crosses a river. It has been observed that objects in an aerial image are dense and a composition of many parts; this agrees with recent trends in computer vision, which recognise objects in images by first isolating component parts of the objects and the relationships between them (Grimson, 1990).
This paper will review some of the approaches used in knowledge representation and modelling for machine vision and give examples of their applications in research for image understanding of remotely sensed images. While a number of methods will be covered, it is not possible to discuss all approaches currently being used by researchers in this field. Other surveys may be consulted for coverage of other areas, for example, Crevier and Lepage (1997) for a survey of knowledge-based image understanding, Hancock and Kittler (1990) for one on relaxation labelling, and Srinivasan (1990) for a survey of artificial intelligence techniques in remote sensing. Knowledge is defined in Section 2, as well as approaches to knowledge representation, control issues and approaches to the modelling of features in machine vision. Feature representation and the feature recognition process are covered in Section 3, while examples of the application of the methods of knowledge representation in both photogrammetry and remote sensing are presented in Section 4.
2. Knowledge, representation and models
2.1. Definitions of knowledge
The Merriam Webster Dictionary (1999) defines knowledge as the fact or condition of knowing something with familiarity gained through experience or association, the range of one's information or understanding, the sum of what is known. Representation is the act or action of representing, the state of being represented, the act of delineating; to represent is to serve as a sign or symbol of, serve as the counterpart or image of, to describe as having a specified character or quality, to correspond to in essence. A model is a miniature representation of something, an example for imitation or emulation.
Within computer vision and artificial intelligence, these terms are used loosely in conformance with their dictionary meanings, and technical definitions are hard to come by. Computer vision must produce a "useful" description of a scene depicted in an image, whose initial representation is an array of image intensity values. At the low-level vision stage, the early processing of the image takes place. Domain-independent image processing algorithms extract, characterise and label components at the middle-level vision stage. This stage then delivers more generalised image representations to the higher-level vision stage (Ballard and Brown, 1982; Marr, 1982), which attempts to emulate cognition. To cope with the changes in lighting and viewpoint, the effects of shape and shading, variations in the imaging process such as in camera angle and position, and noise at the lower-level image processing stage, we need knowledge of the world in which the images are recorded and the specific application, via rich representations at the higher level, which in computer vision are usually called models. These models explain, describe, or abstract the image information. The gap between the image and the models is bridged via a range of representations which connect the input image to the output interpretation. The representations are categorised as (Ballard and Brown, 1982):
1. Generalised images, which are iconic and analogue representations of the input data; binary images/silhouettes are examples.
2. Segmented images, consisting of sets of pixels likely to correspond to real objects; for example, the outputs of segmentation algorithms.
3. Geometric representations, which deal with shape information; many object models in computer vision, for example, are shape-based.
4. Relational models, which encode knowledge that is used in high-level reasoning and interpretation; Artificial Intelligence tools are often used for representation and modelling. Many semantic network models in the literature, for example, fall into this category, some of which are mentioned later in the paper.
Each method of representation has limited applications on its own. Hence, all four types of representations are vital in the image interpretation task. We shall concentrate on the fourth category, namely relational models, which bring together knowledge representation and models for the purpose of image understanding.
2.2. Knowledge representation
The objective of knowledge representation is to express knowledge in computer-tractable form (Russell and Norvig, 1995). A good knowledge representation language should be expressive, concise, unambiguous, and context-independent. First Order Logic (FOL) is the basis of many representation schemes in artificial intelligence. Inference procedures for FOL permit one to derive new sentences from old ones. Such a formal inference procedure may be used to automatically derive valid conclusions from known facts. Both logic programming languages and production systems are based on FOL. Logic programming languages such as Prolog permit the representation of knowledge in a restricted form of FOL; they also implement inference procedures and allow the derivation of new information from current knowledge. They usually use backward chaining for control, which applies the logical inference rule backwards: to prove something, they find logical implications in the database that would allow the conclusion of the desired statement. Thus, when a goal is to be established, backward chaining is the preferred mode of inference.
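As a minimal illustration of backward chaining, consider the following Python sketch of a propositional prover; the facts, the single rule and the 'road' predicate are invented here purely to mirror the production example given below.

    # Minimal backward-chaining sketch. Facts are ground atoms; rules are
    # (head, body) pairs. To prove a goal, either find it among the facts
    # or find a rule whose head matches and prove the body recursively.
    facts = {("elongated", "r1"), ("homogeneous", "r1")}
    rules = [(("road", "r1"), [("elongated", "r1"), ("homogeneous", "r1")])]

    def prove(goal):
        if goal in facts:                 # the goal is a known fact
            return True
        return any(head == goal and all(prove(sub) for sub in body)
                   for head, body in rules)   # chain backwards through rules

    print(prove(("road", "r1")))          # True: derived from the two facts

A full Prolog system adds variables and unification, but the control strategy is the same: reasoning starts from the goal and works back towards the facts.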
Production systems consist of a knowledge base of facts and a set of rules or productions, represented using logical implication. The following is an example of a production: IF a region is an elongated and homogeneous object THEN it belongs to a road object.
The production system applies the rules to the knowledge base and obtains new assertions, in an endless cycle called the match–select–act cycle. In the match phase, the system finds all rules whose antecedent is satisfied by current data values. In the select phase, the system decides on one rule to execute, out of the rules matched in the first phase. The selected rule is then executed in the act phase, where the execution of a rule might involve insertions into and deletions from the knowledge base as well as input and output of data values.
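The cycle can be sketched in a few lines of Python; the working memory contents and the single road production are hypothetical, and a realistic system would need a proper conflict-resolution strategy in the select phase.

    # Match-select-act sketch of a production system. Working memory holds
    # region descriptions; the production tests attributes and adds a label.
    working_memory = [{"id": "r1", "elongated": True, "homogeneous": True}]

    productions = [
        ("road-rule",
         lambda r: r.get("elongated") and r.get("homogeneous") and "label" not in r,
         lambda r: r.update(label="road")),
    ]

    changed = True
    while changed:                               # the match-select-act cycle
        changed = False
        matched = [(name, r, act) for name, cond, act in productions
                   for r in working_memory if cond(r)]     # match phase
        if matched:
            name, region, act = matched[0]                 # select phase
            act(region)                                    # act phase
            changed = True

    print(working_memory)    # region r1 now carries the label 'road'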
Frames and semantic networks are popular knowledge representation schemes in artificial intelligence and, recently, in photogrammetry and remote sensing. They use the metaphor that objects are nodes in a graph, that these nodes are organised in a taxonomic structure, and that links between nodes represent binary relations. In frame systems, the binary relations are thought of as slots in one frame that are filled by another frame; in semantic networks, they are thought of as arrows between nodes. The meaning and implementations of these two types of systems can be identical.
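A short sketch, with invented concept names, shows the metaphor: a slot filled by another frame is exactly the labelled arrow of a semantic network, and slot lookup inherits along the taxonomy.

    # Frame sketch: each frame has named slots; unfilled slots are inherited
    # from the parent frame up the taxonomic hierarchy.
    class Frame:
        def __init__(self, name, parent=None, **slots):
            self.name, self.parent, self.slots = name, parent, slots

        def get(self, slot):
            if slot in self.slots:
                return self.slots[slot]
            return self.parent.get(slot) if self.parent else None

    man_made = Frame("man-made-object", material="artificial")
    building = Frame("building", parent=man_made, shape="closed-region")
    print(building.get("shape"), building.get("material"))
    # closed-region artificial: 'material' is inherited from the parent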
Description logic systems evolved from semantic networks; the basic idea is to express and reason with complex definitions of, and relations among, objects and classes. Description logics provide three kinds of reasoning services (Nebel, 1990):
1. Classification of concept descriptions, by automatic arrangement of concepts in a specialisation hierarchy.
2. Classification of individual objects, given a description of their properties.
3. Maintenance of the overall consistency of the knowledge base.
The languages provided by these logics are rather inexpressive, and it is difficult to specify complex constraints. Their advantages are that they have formal semantics on which the reasoning services are based, as well as simple logical operations.
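The first two reasoning services can be sketched with a toy description logic in which a concept is just a set of required primitive properties, so that concept C subsumes concept D whenever C's requirements are a subset of D's; all names here are invented.

    # Toy description logic: subsumption as subset of required properties.
    concepts = {
        "region":      {"has-extent"},
        "elongated":   {"has-extent", "long-and-narrow"},
        "road-region": {"has-extent", "long-and-narrow", "homogeneous"},
    }

    def subsumes(c, d):
        return concepts[c] <= concepts[d]   # every requirement of c holds for d

    def classify(instance_props):
        # Most specific concepts whose requirements the instance satisfies.
        sat = [c for c, req in concepts.items() if req <= instance_props]
        return [c for c in sat
                if not any(subsumes(c, d) and c != d for d in sat)]

    print(subsumes("region", "road-region"))                # True
    print(classify({"has-extent", "long-and-narrow", "homogeneous"}))
    # ['road-region']: the object falls under the most specific concept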
In summary, a logic programming language such as Prolog has an execution model that is simple enough for a programmer to deal with. Recently, the introduction of Prolog compilers has served to boost the desirability of Prolog for prototyping small-scale artificial intelligence projects, in preference to C. Production systems are popular in modelling human reasoning; unlike Prolog, production systems are not query-based, and are good for implementing open-ended non-terminating systems, which are in operation continuously. Semantic networks provide a graphical interface which is easier to comprehend than text-based formalisms. They can be as expressive as FOL, though most are not, since they impose severe constraints on what may be expressed. Their advantages include the ability to express hierarchical connections in a modular fashion, and their relative simplicity. Description logics combine clear semantics with simple logical operations. Therefore, while all these schemes are based on FOL, there are trade-offs to using one or the other.
2.3. Control issues
Whatever the image representation chosen, the processing of the image data and its representations may be either image data-driven, called bottom-up control, or internal model-driven, also called top-down control (Ballard and Brown, 1982; Sonka et al., 1993). In bottom-up control, each processing stage passes its results on to the next. Bottom-up control is useful if domain-independent image processing is cheap, and input data is accurate and reliable. Marr (1982) and Ullman (1984) advocated the bottom-up approach on the basis that bottom-up processing of data occurs invariably in human vision. Marr saw this leading to an intermediate representation called the 2 1/2-D sketch, containing surface orientations and distances in a viewer-centred frame of reference, as well as discontinuities in surface distances and orientations. In addition, Ullman hypothesised higher-level processes called visual routines, which detect features of interest in the intermediate representation.
Top-down model-driven control is driven by expectations and predictions generated in the knowledge base. Thus, model-driven control attempts to perform internal model verification, in a goal-directed manner. A common top-down technique is hypothesise-and-verify, which can normally control low-level processing. There appears to be support for the view that some aspects of human vision are not bottom-up, and the model-driven approach is motivated by this observation, as well as the desire to minimise low-level processing.
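A hypothesise-and-verify loop might look like the following sketch; the linear continuation model, the edge-strength measure and the acceptance threshold are all hypothetical stand-ins.

    # Hypothesise-and-verify sketch: the model predicts the next road segment
    # (top-down) and low-level processing is invoked only to check it.
    def predict_next(hypothesis):
        # Trivial model: continue the last segment in a straight line.
        (x0, y0), (x1, y1) = hypothesis[-1]
        return ((x1, y1), (2 * x1 - x0, 2 * y1 - y0))

    def trace_road(image, seed_segment, edge_strength, threshold=0.5):
        hypothesis = [seed_segment]
        while True:
            predicted = predict_next(hypothesis)        # model proposes
            if edge_strength(image, predicted) < threshold:
                return hypothesis                       # prediction rejected
            hypothesis.append(predicted)                # prediction verified

    # Toy run: the 'image' accepts the first two predictions, then rejects.
    strengths = iter([0.9, 0.8, 0.2])
    print(trace_road(None, ((0, 0), (1, 0)), lambda img, seg: next(strengths)))
    # [((0, 0), (1, 0)), ((1, 0), (2, 0)), ((2, 0), (3, 0))]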
In practice, computer vision systems tend to favour mixed top-down and bottom-up control that focuses attention efficiently and makes the computation practical. Either parallel or serial computation may be performed within any of these schemes.
Both top-down and bottom-up controls imply a hierarchy of processes. In heterarchical control, processes are viewed as a collection of cooperating and competing experts, and at any time, the 'expert' which can 'help the most' is chosen. Blackboard architectures are an example of this approach, in which modular knowledge sources communicate via a common blackboard memory to which they can write and from which they can read.
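The scheduling idea of running whichever expert can help the most is captured by the following sketch, in which the blackboard layout and the single knowledge source are invented.

    # Blackboard sketch: each knowledge source inspects the blackboard and
    # returns (score, action); the scheduler runs the highest-scoring expert.
    blackboard = {"regions": [{"id": "r1", "elongated": True}], "labels": {}}

    def road_ks(bb):
        todo = [r for r in bb["regions"]
                if r.get("elongated") and r["id"] not in bb["labels"]]
        def act():
            for r in todo:
                bb["labels"][r["id"]] = "road"
        return len(todo), act          # score: how much this expert can help

    knowledge_sources = [road_ks]

    while True:
        score, act = max((ks(blackboard) for ks in knowledge_sources),
                         key=lambda pair: pair[0])
        if score == 0:
            break                      # no expert can contribute any more
        act()                          # run the winning knowledge source

    print(blackboard["labels"])        # {'r1': 'road'}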
2.4. Modelling issues
In the model-based approach to computer vision, a priori models of possible objects in a class of images are defined and utilised for object recognition. The models encode external knowledge of the world and the application. Object models may be appearance models, shape models, physical models, etc. Each of these should capture the range of variation in the presentation of the object due to changes in viewpoint, in lighting, and even changes in shape in the case of flexible objects (Pope, 1995). In addition, variations due to the image acquisition process itself, as well as variations among individual members of the object class, should be accounted for.
Objects of interest may be 2-D or 3-D; they may also be rigid, articulated, or flexible. The images themselves may be range images or intensity images. Recognition is accomplished by determining a correspondence between some attributes of the image and comparable attributes of the model in a matching phase. Relevant attributes of a model (image) are represented using one of the representation schemes discussed earlier. Recognising a 3-D object in an intensity image of an unrestricted scene is the most difficult form of the problem, and aerial and space images fall into this category. Loss of depth due to projection, occlusion and cluttering of details are some of the problems occurring; further, image intensity is only indirectly related to object shape.
3. Automated feature extraction
The goal of most image interpretation systems is the extraction/recognition of objects in the scene. In the model-based approach, this is achieved by first extracting object properties and then matching them to a model.
3.1. Feature attribute representation
In photogrammetry and remote sensing, the term 'features' refers to recognisable objects in images. To refer to properties of objects, we shall use the term 'attributes'.
Global attributes of objects summarise information about the entire visible portion of the object, such as area, perimeter, length, etc. Ideally, such global attributes should be scale and translation invariant in order to cope with multiple resolutions and shifts in images; features should be non-overlapping, so that clutter and occlusion may be avoided; further, a separate model is necessary for each view of the object, so as to handle multiple view (multiple-look angle) images. Local attributes in photogrammetry may be, for example, junctions and edge segments, which may be treated as independent attributes of features. However, within computer vision, it is more usual to treat such attributes in relation to each other, or in context. Relational attributes are usually structured into graphs.
A representation scheme for feature attributes is judged on the criteria of scope and sensitivity, stability, efficiency and uniqueness (Marr and Nishihara, 1978; Binford, 1982; Brady, 1983; Haralick et al., 1988; Mokhtarian and Mackworth, 1992). On these criteria, researchers conclude that a good representation for the model-based approach includes a combination of local attributes, each pertaining to a specific region of the image or object (Grimson, 1990; Pope, 1995). This is because local attributes may be computed efficiently based on a limited part of the input data; they are stable, since small changes in appearance affect only some of the features, and partial occlusion of objects will only partly affect local features. Edge junctions are an example of such a local attribute, based on edge analysis. Also, a multiple-scale representation is preferable, as two largely similar objects will then have similar descriptions, even if small-scale details are different. Such multi-scale representations are more readily obtainable for aerial and satellite images, either from image databases or by sub-sampling of high-resolution images. This option is not available for many computer vision applications. The uniqueness criterion for models is not of great importance in feature recognition, since the recognition algorithm could allow for some mismatch due to noise and occlusion.
To specify locations of local attributes is easier for aerial and satellite images, in comparison to images usually considered in computer vision, because in the former, exterior orientation and camera parameters are either known or derivable. For most computer vision studies of aerial imagery, 2-D representations have been found adequate (e.g., Shufelt and McKeown, 1993), but 3-D models and matching are often employed in photogrammetry in applications such as building shape extraction (Henricsson and Baltsavias, 1997). Finally, what attributes are useful for feature extraction/recognition? Attributes should capture all distinctions needed to differentiate features from each other and from other parts of the scene; secondly, they should reflect regularities and structures in the external world. Thus, the choice of attributes is application-dependent. In remote sensing and photogrammetry, the characteristics of spectral images are fairly well known, through radiometric calibration and the spectral characteristics of the objects, as well as ground truthing. Some of the attribute regularities will arise from knowledge of these characteristics; for example, the spectral characteristics of various types of ground cover, such as various types of vegetation, soil, minerals, water, and some man-made structures, have been determined by extensive tests and ground truthing over a number of years. Other attributes will be shape- and appearance-based, just as in computer vision, such as 'roads are long and narrow strips', 'buildings are closed regions' and so on. Yet others will be context-based, such as 'buildings are normally situated beside roads', and 'bridges span rivers'.
Features may be organised into some kind of structure. One way is to arrange them hierarchically into part/whole relations, as in semantic-network-based systems (details later). A second is to arrange them according to adjacency relations. The latter corresponds to spatial nearness, or context (e.g., Strat and Fischler, 1991). Both may be represented as graphs.
3.2. Recognition of features
Object recognition in computer vision corresponds to the term feature extraction in photogrammetry. To recognise a single object in an image, bottom-up data-driven control is usually sufficient, in which attributes are first detected and represented as symbols. New attributes are then identified by grouping the detected attributes; these are then used to select a likely model in a library of object models, a step also called indexing. The best match between image attributes and model attributes is then found. Finally, the match is verified using some decision procedure. The grouping, indexing, and matching steps essentially involve search procedures. Bottom-up control fails, however, in more complex images containing multiple objects with occlusion and overlap, as well as in the case of poor quality images, in which noise creates spurious attributes. This is a very likely scenario for remotely sensed images. In this situation, top-down or hybrid control strategies are more useful. In the top-down approach, the hypothesis phase requires the organisation of models indexed by attributes, so that based on observed attributes, a small set of likely objects can be selected. The selected models are then used to recognise objects in the verification phase (Jain et al., 1995). A disadvantage of this approach is that the model control necessary in some parts of the image is too strong for other parts; for example, symmetry requirements imposed by the model could corrupt borders. In the hybrid approach, the two strategies are combined to improve processing efficiency.
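The bottom-up sequence of detection, grouping, indexing, matching and verification might be sketched as follows; the model library, the attribute sets and the verification threshold are invented for illustration.

    # Pipeline sketch of bottom-up model-based recognition.
    MODEL_LIBRARY = {"road": {"elongated", "parallel-edges"},
                     "building": {"closed-region", "right-angles"}}

    def recognise(attributes, accept=0.5):
        group = set(attributes)           # grouping: pool detected attributes
        # Indexing: keep models sharing at least one attribute with the group.
        candidates = {name: attrs for name, attrs in MODEL_LIBRARY.items()
                      if attrs & group}
        # Matching: order candidates by attribute overlap with the group.
        ranked = sorted(candidates.items(),
                        key=lambda kv: len(kv[1] & group), reverse=True)
        for name, attrs in ranked:
            # Verification: accept only if enough model attributes are seen.
            if len(attrs & group) / len(attrs) >= accept:
                return name
        return None

    print(recognise({"elongated", "parallel-edges"}))   # 'road'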
Attributes are grouped whenever the resulting attribute is more informative than individual attributes. This process is also called perceptual organisation. Lowe (1985, 1990) addressed this grouping question in object recognition and came up with some objective criteria for grouping attributes; he looks for configurations of edge segments that are unlikely to happen by chance and are preserved under projection. Collinear and parallel edges are an example. Zerroug and Nevatia (1993) utilise regularities in the projections of homogeneous generalised cylinders into 2-D. Most other researchers have developed ad hoc criteria for grouping, e.g., Steger et al. (1997) for road extraction, and Henricsson and Baltsavias (1997) for building extraction. It seems obvious that local context will play a large part in attribute grouping, since one would expect a particular arrangement of local attributes in relation to each other to define a local context.
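As a concrete instance of such a criterion, the following sketch groups two edge segments when they are nearly parallel and nearly collinear, a configuration unlikely to arise by chance; the tolerances are illustrative.

    # Grouping test in the spirit of Lowe's criteria: near-parallel,
    # near-collinear edge segments are grouped into a single attribute.
    import math

    def direction(seg):
        (x0, y0), (x1, y1) = seg
        return math.atan2(y1 - y0, x1 - x0)

    def collinear(seg_a, seg_b, angle_tol=0.05, offset_tol=1.0):
        # Parallelism: directions agree up to the 180-degree ambiguity.
        d = abs(direction(seg_a) - direction(seg_b)) % math.pi
        if min(d, math.pi - d) > angle_tol:
            return False
        # Collinearity: seg_b's endpoints lie near the line through seg_a.
        (x0, y0), (x1, y1) = seg_a
        nx, ny = y0 - y1, x1 - x0                 # normal to seg_a
        norm = math.hypot(nx, ny)
        return all(abs(nx * (px - x0) + ny * (py - y0)) / norm <= offset_tol
                   for px, py in seg_b)

    print(collinear(((0, 0), (10, 0)), ((12, 0.2), (20, 0.1))))   # True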
General knowledge about occlusion, perspective, geometry and physical support is also necessary for the recognition task. Brooks (1981) built a geometric reasoning system called ACRONYM for object recognition. The system SIGMA by Matsuyama and Hwang (1985) includes a geometric reasoning expert. McGlone and Shufelt (1994) have incorporated projective geometry into their system for building extraction, while Lang and Förstner (1996) have developed polymorphic features for the development of procedures for building extraction.

Context plays a significant role in image understanding. In particular, relaxation labelling methods use local and global context to perform semantic labelling of regions and objects in an image. After the segmentation phase, scene labelling should correspond with available scene knowledge, and the labelling should be consistent. This problem is usually solved using constraint propagation: local constraints result in local consistencies, and by applying an iterative scheme, the local consistencies adjust to global consistencies in the whole image. A full survey of relaxation labelling is available in Hancock and Kittler (1990). Discrete relaxation methods are oversimplified and cannot cope with incomplete or inaccurate segmentation. Probabilistic relaxation works on the basis that a locally inconsistent but very probable global interpretation may be more valuable than a consistent but unlikely explanation; see Rosenfeld et al. (1976) for an early example of this approach.
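A probabilistic relaxation update in the spirit of Rosenfeld et al. (1976) can be sketched as follows; the two labels, the compatibility coefficients and the normalised multiplicative update are illustrative choices.

    # Probabilistic relaxation sketch: label probabilities at each region are
    # repeatedly reweighted by the support received from neighbouring regions.
    LABELS = ["road", "river"]
    compat = {"road":  {"road": 1.0, "river": 0.2},    # compat[a][b]: label a
              "river": {"road": 0.2, "river": 1.0}}    # next to label b

    def relax(probs, neighbours, iterations=10):
        for _ in range(iterations):
            new = {}
            for i, p in probs.items():
                support = {a: sum(compat[a][b] * probs[j][b]
                                  for j in neighbours[i] for b in LABELS)
                           for a in LABELS}
                raw = {a: p[a] * support[a] for a in LABELS}
                total = sum(raw.values())
                new[i] = {a: raw[a] / total for a in LABELS}
            probs = new                  # synchronous update of all regions
        return probs

    # Two adjacent regions: one confidently 'road', one ambiguous.
    probs = {0: {"road": 0.9, "river": 0.1}, 1: {"road": 0.5, "river": 0.5}}
    print(relax(probs, {0: [1], 1: [0]})[1])  # region 1 drifts towards 'road'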
To handle uncertainty at the matching stage, various evidence-based techniques have been used. Examples include systems which utilise Dempster–Shafer theory (Wesley, 1986; Provan, 1990; Clarkson, 1992), reliability values (Haar, 1982), fuzzy logic (Levine and Nazif, 1985), the principle of least commitment (Jain and Haynes, 1982), confidence values (McKeown and Harvey, 1987), random closed sets (Quinio and Matsuyama, 1991) and Bayesian networks (Rimmey, 1993; von Kaenel et al., 1993; Sarkar and Boyer, 1994).
4. Some examples of applications of modelling and representation
A range of approaches to knowledge representation and modelling has been developed by researchers in machine vision. In the fields of photogrammetry and remote sensing, the approaches adopted have followed those in the field of computer vision, and have been adapted for the types of information being extracted. These applications demonstrate that there is a growing level of expertise in techniques of artificial intelligence amongst the researchers in photogrammetry and remote sensing. The evolution of these methods has been from rule-based systems to semantic networks and frames to description logic. A review of some applications in machine vision, photogrammetry and remote sensing in this section will demonstrate these trends.
4.1. Logic
The first researchers to advocate the use of logic as a representation in computer vision systems were Reiter and Mackworth (1989). In their paper, they proposed a logical framework for depiction and interpretation of image and scene knowledge, as well as a formal mapping between the two. They propose image axioms, scene axioms and depiction axioms, whose logical model forms an interpretation of an image. They illustrate their approach using a simple map understanding system called Mapsee. The application is relatively limited, however, and newer systems have not been reported. One reason could be the computational complexity. While logic provides a consistent formalism to specify constraints, ad hoc search using logic is not efficient. Further, FOL by itself is not good for representing uncertainty or incompleteness in data, which is in the nature of image properties. The correspondence between image elements and scene objects is not usually one-to-one, and additional logical relations are necessary to model these. Matsuyama and Hwang (1990) adopt a logical framework in which new logical constants and axioms are generated dynamically.
4.2. Rule-based and production systems
Brooks (1981) developed ACRONYM, a model-based image understanding system for detecting 3-D objects, and tested it to extract aircraft in aerial images. 3-D models of aircraft are stored using a frame-based representation. Given an image to be analysed, ACRONYM extracts line segments and obtains 2-D generalised cylinders. Rules encoding geometric knowledge as well as knowledge of imaging conditions are used to generate expected 3-D models of the scene, which are then matched against the frames to identify aircraft.
SIGMA (Matsuyama and Hwang, 1985) is an aerial image understanding system that uses frames to represent knowledge, and both top-down and bottom-up control schemes to extract features. It consists of three subsystems: the Geometric Reasoning Expert (GRE), the Model Selection Expert (MSE), and the Low Level Vision Expert (LLVE). Information passes from the GRE to the MSE, which then communicates with the LLVE. The frames in SIGMA use slots storing attributes of an object and its relationships to other objects. Based on the spatial knowledge in the frames, hypotheses are generated for objects and matched against image features. This is done by the MSE reasoning about the most likely appearance of an object and conveying this in image terms to the LLVE. This top-down selection of image attributes helps detect small attributes. The system was tested to extract houses and road segments from aerial images.
McKeown et al. (1985) present a rule-based system for the interpretation of airports in aerial images. It was based on about 450 rules, divided into six classes for: initialisation, region-to-interpretation for interpreting the original image fragments, local evaluation, consistency checks, functional area rules for grouping of image fragments into functional areas, and goal-generation rules for building the airport model.
McKeown and Harvey (1987) present a system for aerial image interpretation, with rules compiled from standard knowledge sets, called schemata. They generated rules automatically from higher level modules, which made for better error-handling and more efficient execution. Their system contained about 100 schemata, each of which generated about five rules.
The knowledge is divided into classes including candidate evaluation and consistency determination. Candidate comparisons are based on the evaluation of likely candidates in the evaluation process, which scores the relative likelihood that a candidate is an instance of a class. The authors state that this division keeps the knowledge to manageable sizes.
Stilla et al. (1996) present a model-based system for automatic extraction of buildings from aerial images, in which objects to be recognised are modelled by production rules and depicted by a production set. The object model is both specific and generic. The specific model describes objects using a fixed topological structure, while the generic models are more general.
These systems illustrate that rule-based systems do not guarantee additivity of knowledge and consistency of reasoning. Breaking up a rule base into multiple rules of varying granularity makes the program less modular and more difficult to modify. Draper et al. (1989) suggest blackboard and schema-based architectures to handle this.
4.3. Blackboard systems
Nagao and Matsuyama (1980) first addressed the problem of scene understanding using the blackboard model, and applied it to aerial images of suburban areas, involving identification of cars, houses and roads. Their system consists of a global database (the blackboard) and a set of knowledge sources. The blackboard records data in a hierarchy consisting of elementary regions, characteristic regions and objects. The blackboard also stores a label picture, which links pixels in the original image to the regions in the database. Elementary regions are the result of an image segmentation process, and are characterised by grey-level, size and location in the image. Characteristic features of the regions are then extracted, resulting in the identification of elementary regions with the following attributes:
1. Large, homogeneous regions, based on region size.
2. Elongated regions, based on shape.
3. Regions in shadow, based on region brightness.
4. Regions capable of causing shadows, based on location of adjoining regions and the position of the sun.
5. Vegetation and water regions, from multispectral information.
6. High contrast texture regions, from textural information.
These properties are stored on the blackboard by separate modules. The knowledge sources then identify a particular object, given the presence or absence of the characteristic features of various regions. Each knowledge source is a single rule, with a condition and a complex action part that performs various picture processing operations to detect the object. For example, the knowledge source to detect a crop field would look like:

if large homogeneous region and vegetation region and not water region and not shadow making region
then perform crop field identification.
Each knowledge source identifies an object independently, and this might lead to conflicting identifications for the same region (for example, crop field and grassland). To solve this, the system automatically calculates a reliability value for each identification, and uses it to discard all but the most reliable.
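The conflict resolution step might be sketched as below; the two detectors and their fixed reliability values are invented, whereas the actual system computes reliabilities from properties of the analysed region.

    # Sketch of conflict resolution by reliability: every knowledge source
    # runs independently, and only the most reliable identification is kept.
    def crop_field_detector(region):
        if (region["large_homogeneous"] and region["vegetation"]
                and not region["water"] and not region["shadow_making"]):
            return "crop field", 0.8
        return None

    def grassland_detector(region):
        if region["vegetation"] and not region["water"]:
            return "grassland", 0.6
        return None

    def identify(region, detectors):
        results = [r for r in (d(region) for d in detectors) if r]
        return max(results, key=lambda r: r[1]) if results else None

    region = {"large_homogeneous": True, "vegetation": True,
              "water": False, "shadow_making": False}
    print(identify(region, [crop_field_detector, grassland_detector]))
    # ('crop field', 0.8): the conflicting grassland label is discarded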
Füger et al. (1994) present a blackboard-based data-driven system for analysis of man-made objects in aerial images. Generic object models are represented symbolically in the blackboard, an individual object being described by several attributes. The models are controlled by numerous parameters, which are determined by a closed-loop system using 'evolution strategies'.
Blackboard systems in general tend to have a centralised control structure, so that efficiency becomes an issue. Also, blackboards assume that knowledge sources will be available when needed and then vanish, whereas in vision applications, they tend to persist as long as the image is being analysed.
4.4. Frames
Hanson and Riseman (1978) used frames as hypothesis generation mechanisms for vision systems. Knowledge about classes of objects was represented as frames, and slots represented binary geometric relations between classes of objects. Slots also contained production rules for instantiating other object frames. Thus, frames are used both for control and representation. Ikeuchi and Kanade (1988) used frames to represent aspects of 3-D objects. When exact object models are available, processing is top-down, but given weaker models and more exact data, processing is bottom-up. However, using frames for both control and representation hides the procedural behaviour of the system and destroys its temporal coordination (Draper et al., 1989). Other systems which use frames include ACRONYM, SIGMA and Nagao and Matsuyama's system, already described above.
4.5. Semantic network
Nicolin and Gabler (1987) describe a system to analyse aerial images, using semantic nets to represent and interpret the image. The system consists of a Short Term Memory (STM), a Methodology Base (MB), and a Long Term Memory (LTM). The STM is conceptually equivalent to a blackboard and stores the partial interpretation of the image. The LTM stores the a priori knowledge of the scene and the domain-specific knowledge (i.e., the knowledge base). The system matches the contents of the STM against those of the LTM to produce an interpretation. This is accomplished using an inference mechanism that calls modules in the MB. The initial contents of the STM are established in a bottom-up way, and a model-driven phase generates and verifies the presence or absence of object attributes stored in the LTM.
Mayer (1994) has developed a semantic-network-based system for knowledge-based extraction of objects from digitised maps. The system is based on a combined semantic network and frames representation, as well as a combination of model-driven and data-driven control. The model is composed of three levels which generally correspond to the respective layers of bottom-up image processing:
1. the image layer, e.g., the digitised map;
2. the image-graph, graphics and text layers;
3. semantic objects.
The semantic network is built up from the concept of 'part/part-of' elements in the graphs layer to the semantic objects, which comprise the 'specialisation/generalisation' relations between the graphics objects and the terrain objects. For example, an elongated area in the graphics objects layer is specialised into 'road-sides', 'pavements', 'road network', etc. Descriptions of other objects are not given, but the tests demonstrated the extraction of parcels and road networks. The frames are designed to analyse the various concepts and their properties. The object extraction is based on both model-driven and data-driven instantiation, with the initial search being based on a goal specified by the user. While the method is based on the extraction of well-defined information on maps, Mayer believes that the process should be useful for the extraction of information from images.
Tönjes (1996) has used semantic networks for modelling landscapes from overlapping aerial images. The output is a 3-D view of the terrain with appropriate representations of the vegetation. Tönjes states that semantic networks are suited to representing knowledge of structural objects. His semantic network is described by frames that include the relationships, attributes, and methods. The semantic net has three layers, the first being the sensor layer, which represents the segmentation layer, based on texture and stripes, as well as the image details. The semantic network is established between components in the three layers. The relationship 'con-of' is the concrete realisation of objects in the image data; 'part-of' describes the decomposition of the objects into parts; while 'is-a' is the specialisation of the object. The object descriptions are tracked through each layer for reconstruction, which is based on both data-driven as well as model-driven processes.
Lang and Förstner (1996) have based their method of extraction of buildings on polymorphic mid-level features. The approach involves semantic modelling using a 'part-of' hierarchical representation. Relations between the parts have not yet been included. The hypothesis generation of the building is based on a combination of a data-driven model for the original generation of the vertices, and subsequent model-driven approaches for hypothesis generation of object interpretation and verification, using four building types as the models: flat roof, non-orthogonal flat roof, gable roof, and hip roof. The approach successfully extracts buildings.
Schilling and Vögtle (1996) have developed a procedure for updating digital map bases, using existing map bases to aid the interpretation. The image is compared with the map to detect changes since the compilation of the map. New features are then analysed by semantic networks. Two networks are created, one for the scene and the other for the image, with the typical relationships established at different levels in the networks.
De Gunst (1996) has developed a combined data-driven and model-driven approach to recognising objects required for updating digital map data. The process is based on object oriented models for road descriptions and a semantic network for the feature recognition, based on frames. The frames define such details as object relations, object definition, alternative object definitions and preprocessing relations. Road details include complex road junctions, which are described by the knowledge base. This is a very detailed study involving several different types of road features. The success of the investigations varied significantly, demonstrating the difficulty in understanding such details.
Quint and Sties (1996) and Quint (1997) present a model-based system called MOSES to analyse aerial images, which uses semantic networks as a modelling tool. Models are automatically refined using knowledge gained from topographical maps or GIS data. The generative model is the most general model, containing common sense knowledge about the environment. Concepts in the generic models in the map and image domain are specialisations of the corresponding concepts in the generative model. A specific model is automatically generated by the system and is specific to the current scene; it is generated by combining the scene description obtained after map analysis with the generic model in the image domain. Initially, digitally available line segments are used for the structural analysis of the map, resulting in a structural description of the map scene. The scene description so obtained is then combined with the generic model in the image domain to yield the specific model, which will be used for image analysis. For structural analysis, image primitives (currently line segments and regions) serve as input. The analysis is model-driven, resulting in recognition of objects (parking places in the project). A merit function is used to guide search in the image analysis process.
To sum up, semantic networks have found wide acceptance and use in the interpretation of aerial images and digital maps.
4.6. Description logics
There are very few photogrammetric applications based on description logics. One such is Lange and Schröder's (1994) description-logic-based approach to the interpretation of changes in aerial images with respect to reference information extracted from a map. Knowledge about types of objects and types of possible changes is represented using a KL-ONE-like description logic (Brachman and Schmolze, 1985). The search for instantiation is goal-directed, and uses a number of heuristics. The examples in the paper, however, seem to be based on artificial images.
5. Conclusions
This paper has given a review of some of the developments in knowledge representation and modelling by researchers in the fields of machine vision, and photogrammetry and remote sensing. The approaches to knowledge representation have evolved from production systems to semantic networks and frames, to description logics. While there are advantages to each approach, in terms of their complexity, modularity and/or clear definition of the operations, examples have been given to demonstrate the application of these systems for particular tasks, with varying degrees of success. These examples demonstrate, however, that methods of knowledge representation and modelling are still under development, in terms of their ability to match the performance of humans in extracting and recognising features in images. Apart from comparisons of the different methods given in the text, it is not possible to make firm conclusions on the most appropriate approach that should be used for the automation of information extraction from aerial and satellite images.
As methods are being developed for the extraction of information from digital images for data acquisition for digital mapping and GIS, researchers in the fields of photogrammetry and remote sensing are collaborating with machine vision experts to gain skills in knowledge representation and modelling, and are adapting the methods to the particular characteristics of their image data. While this work is still under development, significant advances are being made. It is indeed important that researchers in the fields of machine vision, and photogrammetry and remote sensing, continue to collaborate in this work, so that advantages are gained from the combination of the skills of both groups.
Acknowledgements
This research was partially supported by a grant from the Australian Research Council.
References
Ballard, D.H., Brown, C.M., 1982. Computer Vision. Prentice-Hall, Englewood Cliffs, NJ.
Binford, T.O., 1982. Survey of model-based image analysis systems. Int. J. Robotics Res. 1 (1), 18–64.
Brachman, R.J., Schmolze, J.G., 1985. An overview of the KL-ONE knowledge representation system. Cogn. Sci. 9 (2), 171–216.
Brady, M., 1983. Criteria for representations of shape. In: Beck, J., Hope, B., Rosenfeld, A. (Eds.), Human and Machine Vision. Academic Press, New York, pp. 39–84.
Brooks, R.A., 1981. Symbolic reasoning among 3-D models and 2-D images. Artif. Intell. 17, 285–348.
Clarkson, M.E., 1992. Intelligent user interface for the detection of arbitrary shapes by mathematical morphology. Proc. SPIE 1769, 82–93.
Crevier, D., Lepage, R., 1997. Knowledge-based image understanding systems: a survey. Computer Vision and Image Understanding 67 (2), 161–185.
De Gunst, M., 1996. Knowledge-based interpretation of aerial images for updating road maps. Netherlands Geodetic Commission, Publications in Geodesy, No. 44, 184 pp.
Draper, B.A., Collins, R., Brolio, J., Hansen, A., Riseman, E., 1989. The schema system. Int. J. Comput. Vis. 2 (3), 209–250.
Füger, H., Stein, G., Stilla, U., 1994. Multi-population evolution strategies for structural image analysis. Proc. First IEEE Conf. Evolutionary Computation I, 229–234.
Grimson, W.E.L., 1990. Object Recognition by Computer: The Role of Geometric Constraints. MIT Press, Cambridge, MA.
Haar, R.L., 1982. Sketching: estimating object positions from relational descriptions. Computer Graphics and Image Processing 19, 227–247.
Hancock, E.R., Kittler, J., 1990. Discrete relaxation. Pattern Recognit. 23 (7), 711–733.
Hanson, A.R., Riseman, E.M., 1978. VISIONS: a computer vision system for interpreting scenes. In: Hanson, A.R., Riseman, E.M. (Eds.), Computer Vision Systems. Academic Press, New York.
Haralick, R.M., Mackworth, A.K., Tanimoto, S.L., 1988. Computer vision update. In: Barr, A., Cohen, P.R., Feigenbaum, E.A. (Eds.), The Handbook of Artificial Intelligence, Vol. 4. Addison-Wesley, Reading, MA.
Henricsson, O., Baltsavias, E., 1997. 3-D building reconstruction with ARUBA: a quantitative evaluation. In: Gruen, A., Baltsavias, E.P., Henricsson, O. (Eds.), Automatic Extraction of Man-Made Objects from Aerial and Space Images (II). Birkhäuser, Basel.
Ikeuchi, K., Kanade, T., 1988. Automatic generation of object recognition programs. Proc. IEEE 76 (8), 1016–1035.
Jain, R., Haynes, S., 1982. Imprecision in computer vision. Computer 15 (8), 39–48.
Jain, R., Kasturi, R., Schunck, B.G., 1995. Machine Vision. McGraw-Hill, New York.
Lang, F., Förstner, W., 1996. Surface reconstruction of man-made objects using polymorphic mid-level features and generic scene knowledge. Int. Arch. Photogrammetry Remote Sensing 31 (B3), 752–757.
Lange, H., Schröder, C., 1994. Analysis and interpretation of changes in aerial images: knowledge representation and spatial reasoning. Int. Arch. Photogrammetry Remote Sensing 30 (3), 475–482.
Levine, M.D., Nazif, A., 1985. An experimental rule-based system for testing low level segmentation strategies. In: Preston, K., Uhr, L. (Eds.), Multicomputers and Image Processing: Algorithms and Programs. Academic Press, New York, pp. 149–160.
Lowe, D.G., 1985. Perceptual Organisation and Visual Recognition. Kluwer Academic Publishing, Hingham, MA.
Lowe, D.G., 1990. Visual recognition as probabilistic inference from spatial relations. In: Blake, A., Troscianko, T. (Eds.), AI and the Eye. Wiley, Chichester, pp. 261–279.
Marr, D., 1982. Vision. Freeman, New York.
Marr, D., Nishihara, H.K., 1978. Representation and recognition of the spatial organisation of three-dimensional shapes. Proc. R. Soc. London, Ser. B 200, 269–294.
Matsuyama, T., Hwang, V., 1985. SIGMA: a framework for image understanding - integration of bottom-up and top-down analyses. Proc. 9th Int. Joint Conf. Artif. Intell., 908–915.
Matsuyama, T., Hwang, V., 1990. SIGMA: A Knowledge-Based Aerial Image Understanding System. Plenum, New York.
Mayer, H., 1994. Automatic knowledge based extraction of objects of the real world from scanned maps. Int. Arch. Photogrammetry Remote Sensing 30 (3), 547–554.
McGlone, J.C., Shufelt, J.A., 1994. Projective and object space geometry for monocular building extraction. Proc. IEEE Computer Society Conf. Computer Vision and Pattern Recognition, IEEE Computer Society, Los Alamitos, CA, pp. 54–61.
McKeown, D.M., Harvey, W.A., 1987. Automatic knowledge acquisition for aerial image interpretation. In: Proc. SPIE, Image Understanding and the Man–Machine Interface, pp. 144–164.
McKeown, D.M., Harvey, W.A., McDermott, J., 1985. Rule-based interpretation of aerial imagery. IEEE Trans. Pattern Anal. Mach. Intell. 7 (5), 570–585.
Merriam Webster Dictionary, 1999. http://www.m-w.com/netdict.htm, Merriam Webster, accessed Dec. 9, 1999.
Mokhtarian, F., Mackworth, A.K., 1992. A theory of multi-scale, curvature-based shape representation for planar curves. IEEE Trans. Pattern Anal. Mach. Intell. 14 (8), 789–805.
Nagao, M., Matsuyama, T., 1980. A Structural Analysis of Complex Aerial Photographs. Plenum, New York.
Nebel, B., 1990. Reasoning and Revision in Hybrid Representation Systems. Lecture Notes in Artificial Intelligence 422. Springer, Berlin.
Nicolin, B., Gabler, R., 1987. A knowledge based system for the analysis of aerial images. IEEE Trans. Geosci. Remote Sens. GE-25 (3), 317–329.
Pope, A.R., 1995. Learning to recognise objects in images: acquiring and using probabilistic models of appearance. PhD dissertation, Dept. of Comp. Sci., Univ. of British Columbia, Canada.
Provan, G.M., 1990. The application of Dempster–Shafer theory to a logic-based visual recognition system. In: Henrion, M., Shachter, R.D., Kanal, L.N., Lemmer, J.F. (Eds.), Uncertainty in Artificial Intelligence 5. North-Holland, Amsterdam, pp. 389–405.
Quinio, P., Matsuyama, T., 1991. Random closed sets: a unified approach to the representation of imprecision and uncertainty. In: Kruse, R., Siegel, P. (Eds.), Symbolic and Quantitative Approaches to Uncertainty. Springer, Berlin, pp. 282–286.
Quint, F., 1997. Recognition of structured objects in monocular aerial images using contextual information. In: Leberl, F., Kalliany, R., Gruber, M. (Eds.), Methods for Extracting and Mapping Buildings, Roads and Other Man-Made Structures from Images. R. Oldenbourg, Wien, pp. 213–228.
Quint, F., Sties, M., 1996. An evidential merit function to guide search in a semantic network based image analysis system. In: Perner, P., Wang, P., Rosenfeld, A. (Eds.), Advances in Structural and Syntactic Pattern Recognition. Springer, Berlin, pp. 140–149.
Reiter, R., Mackworth, A.K., 1989. A logical framework for depiction and image interpretation. Artif. Intell. 41 (2), 125–155.
Rimmey, R.D., 1993. Control of selective perception using Bayes nets and decision theory. PhD dissertation, Comp. Sci. Dept., Univ. of Rochester, USA.
Rosenfeld, A., Hummel, R.A., Zucker, S.W., 1976. Scene labelling by relaxation operations. IEEE Trans. Syst., Man Cybern. 6 (6), 421–433.
Russell, S., Norvig, P., 1995. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ.
Sarkar, S., Boyer, K.L., 1994. Computing Perceptual Organisation in Computer Vision. World Scientific, Singapore.
Schilling, K., Vögtle, T., 1996. Satellite image analysis using integrated knowledge processing. Int. Arch. Photogrammetry Remote Sensing 31 (B3), 415–420.
Shufelt, J.A., McKeown, D.M., 1993. Fusion of monocular cues to detect man-made structures in aerial imagery. CVGIP: Image Understanding 57 (3), 307–330.
Sonka, M., Hlavac, V., Boyle, R., 1993. Image Processing, Analysis and Machine Vision. Chapman & Hall, London.
Srinivasan, A., 1990. An artificial intelligence approach to the analysis of multiple information sources in remote sensing. PhD dissertation, Dept. of Comp. Sci., Univ. of New South Wales, Sydney, Australia.
Steger, C., Mayer, H., Radig, B., 1997. The role of grouping for road extraction. In: Gruen, A., Baltsavias, E.P., Henricsson, O. (Eds.), Automatic Extraction of Man-Made Objects from Aerial and Space Images (II). Birkhäuser, Basel.
Stilla, U., 1995. Map-aided structural analysis of aerial images. ISPRS J. Photogramm. Remote Sens. 50 (4), 3–10.
Stilla, U., Michaelsen, E., Lütjen, K., 1996. Automatic extraction of buildings from aerial images. In: Leberl, F., Kalliany, R., Gruber, M. (Eds.), Methods for Extracting and Mapping Buildings, Roads and Other Man-Made Structures from Images. R. Oldenbourg, Wien.
Strat, T.M., Fischler, M.A., 1991. Context-based vision: recognising objects using information from both 2-D and 3-D imagery. IEEE Trans. Pattern Anal. Mach. Intell. 13 (10), 1050–1065.
Tönjes, R., 1996. Knowledge based modelling of landscapes. Int. Arch. Photogrammetry Remote Sensing 31 (B3), 868–873.
Ullman, S., 1984. Visual routines. Cognition 18 (1–3), 97–159.
von Kaenel, P.A., Brown, C.M., Rimmey, R.D., 1993. Goal-oriented dynamic vision. Report No. TR-466, Comp. Sci. Dept., Univ. of Rochester, USA.
Wesley, L.P., 1986. Evidential knowledge-based computer vision. Optical Engineering 25 (3), 363–379.