Transportation Research Part B: Methodological, Vol. 34, Issue 1, January 2000


Trip distribution forecasting with multilayer perceptron neural networks: A critical evaluation

M. Mozolin (a), J.-C. Thill (b,*), E. Lynn Usery (c)

(a) ESRI, Inc., Redlands, CA, USA

(b) Department of Geography and National Center for Geographic Information and Analysis, State University of New York at Buffalo, Amherst, NY, USA

(c) Department of Geography, University of Georgia, Athens, GA, USA

Received 17 February 1998; received in revised form 15 April 1999; accepted 19 April 1999

Abstract

This study compares the performance of multilayer perceptron neural networks and maximum-likelihood doubly-constrained models for commuter trip distribution. Our experiments produce overwhelming evidence, at variance with the existing literature, that the predictive accuracy of neural network spatial interaction models is inferior to that of maximum-likelihood doubly-constrained models with an exponential function of distance decay. The study points to several likely causes of neural network underperformance, including model non-transferability, insufficient ability to generalize, reliance on sigmoid activation functions, and their inductive nature. It is concluded that current perceptron neural networks do not provide an appropriate modeling approach to forecasting trip distribution over a planning horizon for which distribution predictors (number of workers, number of residents, commuting distance) are beyond

1. Introduction

A number of modeling approaches have been put forward over the years to distribute trips, freight or information among origins and destinations. One of the more successful ones is the spatial interaction (or gravity) model (Ortuzar and Willumsen, 1994). This model relates the matrix of flows to a matrix of interzonal impedance. Traditionally, the spatial interaction model is calibrated by one of several well known techniques, including regression, maximum likelihood, or numerical heuristics. Several recent studies (Black, 1995; Fischer and Gopal, 1994; Gopal and Fischer, 1996; Openshaw, 1993) have proposed the neural network architecture as a means to

* Corresponding author. Tel.: +716 645 2722; fax: +716 645 2329.


model the distributed complexity of spatial interaction. This line of research has shown that neural networks generally outperform classical calibration and estimation approaches.

At first, this conclusion should not come as much of a surprise to many modelers given the wide success experienced by neural networks in pattern recognition and classification (Bishop, 1995; Smetanin, 1995; Ripley, 1996), as well as in various application fields of transportation engineering and planning (Dougherty, 1995; Himanen et al., 1998; Hua and Faghri, 1994). After all, neural networks impose fewer constraints on the form of the functional relationship between inputs and outputs than conventional fitting techniques. This paper revisits this conclusion by paying attention to several aspects of spatial interaction modeling that have not been addressed so far. Our aim is to compare the performance of a perceptron neural network (NN) spatial interaction model to that of a baseline, conventionally estimated spatial interaction model beyond the comparative work done previously. The comparison is conducted empirically on journey-to-work patterns in the Atlanta metropolitan area. Our approach differs drastically from others in several respects.

Firstly, we evaluate the models in a predictive mode. In other words, calibration is done on observed, base-year data, while testing is conducted on data for the projection year. To the best of our knowledge, all other NN studies of trip distribution have used the same origin-destination matrix for training and testing, thus allowing the network to learn the noise in the training data (Black, 1995). Incidentally, NN applications to traffic data and other transportation problems also use hold-out samples for testing. Secondly, our baseline model is a doubly-constrained model estimated by maximum likelihood. This is a departure from Fischer and Gopal (1994) and Gopal and Fischer (1996), who chose the less accurate unconstrained spatial interaction model as a benchmark and estimated model parameters by ordinary least squares regression, a method considered less precise than maximum likelihood (Fotheringham and O'Kelly, 1989).

Thirdly, we evaluate the models on origin-destination matrices of different sizes (from hundreds of origin/destination zones down to a dozen) to test the sensitivity of our conclusions to the size of the interaction system being modeled. Finally, we apply an adjustment factor to flows predicted by the NN output to satisfy production and attraction constraints, and thus make it possible to unambiguously interpret any discrepancy with flows predicted by the baseline doubly-constrained model in terms of relative performance of the models.

The paper presents a case where a conventional spatial interaction model outperforms a multilayer perceptron NN model of spatial interaction. The predictive mode of the analysis replicates the process by which trip distribution is realized in transportation planning, and thus helps to compare the merits of the conventional and NN approaches for practical applications of spatial interaction modeling.

The remainder of the paper is organized as follows. The next two sections present an overview of the conventional spatial interaction model of journey-to-work, and of the multilayer perceptron NN model. In the following section, we describe the setup of the empirical test of the latter model against the former, as well as the data used in the test. Next, results under different


modeling configurations are detailed. We conclude with a discussion of possible explanations for the better performance of the conventional spatial interaction model.

2. Journey to work problem and its conventional solution

Spatial interaction may be defined in general terms as any flow of commodity, people, capital, or information over space resulting from some explicit or implicit decision process (Fotheringham and O'Kelly, 1989). Journey to work is one kind of spatial interaction. Other kinds of spatial interaction include journey to school, shopping trips, non-home based intraurban trips, intercity population migration, choice of college or university by students, intercity freight movement, telephone calls, Internet access, and many others.

Spatial interaction models are often classified on the basis of the number and character of constraints imposed on the predicted trip matrix. Constraints represent a priori knowledge about the total interaction flows entering and/or exiting a particular zone. For example, if the total number of employed residents in each zone $i$ ($O_i$) is known exogenously, then the sum of predicted flows leaving each zone is equal to $O_i$:

$$\sum_j T_{ij} = O_i, \quad \forall i, \tag{1}$$

where $T_{ij}$ is the flow of commuters from zone $i$ to zone $j$ predicted by the model. Similarly, if one knows total employment in each zone ($D_j$), one can impose that the sum of predicted commuter flows ending in each zone is equal to $D_j$:

$$\sum_i T_{ij} = D_j, \quad \forall j. \tag{2}$$

If Eq. (1) holds for each origin zone then the model is said to be production constrained; if Eq. (2) holds for each destination zone then the model is referred to as attraction constrained. If both Eqs. (1) and (2) hold, the model is doubly constrained, while if neither of the two holds, the model is unconstrained.

Trip distribution may be modeled with any number of constraints. Implementing the additional constraints requires more a priori information. In turn, the reduction of degrees of freedom leads to a more accurately predicted flow matrix. It has been shown empirically that estimating spatial interaction with a doubly-constrained model yields the most accurate results. See, for example, Fotheringham and O'Kelly (1989) on interregional migration among the nine major census divisions in the United States, or Mozolin (1997) on commuting trips within metropolitan Atlanta. Because of its higher accuracy in modeling trip distribution, the doubly-constrained model is a proper baseline against which to evaluate neural networks.

The doubly-constrained spatial interaction model of journey to work can be formulated mathematically as:

$$T_{ij} = A_i O_i B_j D_j f(c_{ij}), \tag{3}$$

where $c_{ij}$ is the travel impedance (distance) from zone $i$ to zone $j$, $f(c_{ij})$ is a distance decay [...] in zone $j$ (attraction), and $A_i$ and $B_j$ are balancing coefficients ensuring that Eqs. (1) and (2) are [...]

Two alternative specifications of the distance decay function $f(c_{ij})$ will be used here: the negative power function $c_{ij}^{-\beta}$ ($\beta \geq 0$), and the negative exponential function $\exp(-\beta c_{ij})$ ($\beta \geq 0$).
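To make the role of the balancing coefficients concrete, the following is a minimal Python sketch; it is not the SIMODEL code used in this study. Substituting Eq. (3) into Eqs. (1) and (2) gives $A_i = 1/\sum_j B_j D_j f(c_{ij})$ and $B_j = 1/\sum_i A_i O_i f(c_{ij})$, which the sketch solves by alternating updates. The function name `doubly_constrained`, the three-zone totals, the cost matrix, and the value of `beta` are illustrative assumptions.

```python
import numpy as np

def doubly_constrained(O, D, cost, beta, decay="exp", tol=1e-10, max_iter=1000):
    """Flows T_ij = A_i O_i B_j D_j f(c_ij) satisfying both sets of constraints.

    Assumes sum(O) == sum(D) and strictly positive costs.
    """
    O = np.asarray(O, dtype=float)
    D = np.asarray(D, dtype=float)
    cost = np.asarray(cost, dtype=float)
    # Distance decay: negative exponential or negative power function.
    f = np.exp(-beta * cost) if decay == "exp" else cost ** (-beta)
    B = np.ones_like(D)
    for _ in range(max_iter):
        A = 1.0 / (f @ (B * D))      # A_i = 1 / sum_j B_j D_j f(c_ij)
        B = 1.0 / (f.T @ (A * O))    # B_j = 1 / sum_i A_i O_i f(c_ij)
        T = (A * O)[:, None] * (B * D)[None, :] * f
        # Column sums match D by construction; stop once row sums match O.
        if np.allclose(T.sum(axis=1), O, rtol=tol, atol=0.0):
            break
    return T

# Made-up three-zone example (not data from the study).
O = np.array([100.0, 200.0, 300.0])
D = np.array([150.0, 250.0, 200.0])
cost = np.array([[1.0, 4.0, 6.0],
                 [4.0, 1.0, 3.0],
                 [6.0, 3.0, 1.0]])
T = doubly_constrained(O, D, cost, beta=0.3)
```

Passing `decay="power"` switches to the negative power function; in both cases the row and column sums of `T` reproduce `O` and `D`.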

Of all the methods suggested to calibrate spatial interaction models (see, for instance, Bacharach (1970), Batty (1976), Evans (1971), Fotheringham and O'Kelly (1989), Ortuzar and Willumsen (1994), Wilson (1970)), we choose to use a maximum-likelihood estimation (MLE) approach. Batty and Mackie (1972) have shown that likelihood maximization boils down to solving a non-linear equation. With a power distance function, this equation is given by

$$\sum_i \sum_j T_{ij} \ln c_{ij} = \sum_i \sum_j T_{ij}^{*} \ln c_{ij}, \tag{6}$$

where $T_{ij}^{*}$ is the actual number of commuters from zone $i$ to zone $j$, and $T_{ij}$ is estimated by Eqs. (3)–(5). The same holds, mutatis mutandis, with an exponential distance function. The SIMODEL computer code (Williams and Fotheringham, 1984) is used to derive parameter estimates.
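As a hedged illustration of such a calibration, the sketch below estimates the decay parameter of an exponential model by bisection. It assumes the standard form of the likelihood condition for the exponential case, namely that the total travel cost of predicted flows equals that of observed flows, and relies on the fact that predicted total cost falls as beta rises. The helper names `balance` and `calibrate_beta`, the search bounds, and the synthetic data are assumptions for illustration; this is not the SIMODEL procedure.

```python
import numpy as np

def balance(O, D, f, n_iter=500):
    """Doubly-constrained flows for a given decay matrix f (alternating updates)."""
    B = np.ones_like(D)
    for _ in range(n_iter):
        A = 1.0 / (f @ (B * D))
        B = 1.0 / (f.T @ (A * O))
    return (A * O)[:, None] * (B * D)[None, :] * f

def calibrate_beta(T_obs, cost, lo=1e-6, hi=5.0, tol=1e-8):
    """Bisection on beta so that predicted and observed total travel cost agree."""
    T_obs = np.asarray(T_obs, dtype=float)
    cost = np.asarray(cost, dtype=float)
    O, D = T_obs.sum(axis=1), T_obs.sum(axis=0)
    target = (T_obs * cost).sum()

    def gap(beta):
        T = balance(O, D, np.exp(-beta * cost))
        return (T * cost).sum() - target

    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:      # predicted trips too long: beta must increase
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Synthetic check: generate flows with a known beta, then recover it.
O = np.array([100.0, 200.0, 300.0])
D = np.array([150.0, 250.0, 200.0])
cost = np.array([[1.0, 4.0, 6.0],
                 [4.0, 1.0, 3.0],
                 [6.0, 3.0, 1.0]])
beta_true = 0.4
T_obs = balance(O, D, np.exp(-beta_true * cost))
beta_hat = calibrate_beta(T_obs, cost)
```

For a power function the same search would apply with `np.log(cost)` in place of `cost` in the matching condition and `cost ** (-beta)` as the decay matrix.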

3. Multilayer perceptron neural networks applied to the journey to work problem

Background on multilayer perceptron neural networks is presented below before we proceed with their application to the journey to work problem. The multilayer perceptron neural network is one of a variety of parallel computing techniques that conceptually mimic structures and functions of human central neural systems. The model used in this study is a three-layer fully-connected feedforward NN which consists of input nodes representing independent variables (the productions, the attractions, and the travel impedances), hidden nodes, and one output node for the dependent variable, namely the flow $T_{ij}$ from zone $i$ to zone $j$. See Fig. 1 for the architecture of a NN with four hidden nodes. Each input node corresponds to an independent variable in the flow model while the dependent variable $T_{ij}$ is the output node. The network output (activation) $z$ is obtained by a double logistic transform of the weighted sum of inputs. The reader is referred to Haykin (1998), Rojas (1996), or Smith (1993) for an in-depth coverage of the NN methodology. The most valuable property of multilayer feedforward NNs is their ability to approximate a desired function from training examples. In fact, a three-layer fully connected feedforward NN with $n$ input nodes, a sufficient number of hidden nodes, and one output node can be trained to approximate an $n$-to-1 mapping function of arbitrary complexity (Kreinovich and Sirisaengtaskin, 1993). Learning of network weights often proceeds by backpropagation of errors (Rumelhart et al., 1986) so as to minimize the total error for all examples in a training set.
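The forward pass of such a network can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the NevProp implementation: the bias terms, the variable names, and the example input values are hypothetical, and the weights are simply drawn at random from a small symmetric range.

```python
import numpy as np

def logistic(x):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Three-layer network: inputs -> logistic hidden layer -> logistic output."""
    h = logistic(W_hidden @ x + b_hidden)   # first logistic transform
    return logistic(w_out @ h + b_out)      # second logistic transform

rng = np.random.default_rng(0)
n_inputs, n_hidden = 3, 4                   # production, attraction, impedance
W_hidden = rng.uniform(-0.01, 0.01, size=(n_hidden, n_inputs))
b_hidden = rng.uniform(-0.01, 0.01, size=n_hidden)
w_out = rng.uniform(-0.01, 0.01, size=n_hidden)
b_out = rng.uniform(-0.01, 0.01)

# Hypothetical scaled inputs: O_i, D_j, c_ij, each already divided by its maximum.
x = np.array([0.4, 0.7, 0.2])
z = forward(x, W_hidden, b_hidden, w_out, b_out)
```

Because the output passes through a logistic function, `z` always lies strictly between 0 and 1, which is why the observed flows must be rescaled before training.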


[...] to reduce total error. In this research, we use off-line, or epoch-based, learning: network weights are adjusted only after all examples in the training set have been processed. Several non-linear optimization methods are available to find a set of weights that minimizes the error on all examples in the training set. This study uses the Quickprop algorithm developed by Fahlman (1989). Though it does not guarantee a global optimum, its quick convergence dramatically increases the speed of NN training. The gradient descent method (Rumelhart et al., 1986) is applied in those rare instances where Quickprop cannot be used.
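The core idea behind Quickprop can be sketched as follows: each weight is moved to the minimum of a parabola fitted through its current and previous error gradients, with a plain gradient-descent step as fallback. The sketch is deliberately simplified and omits several safeguards of the published algorithm; the function name `quickprop_step`, the growth limit `mu = 1.75`, and the quadratic toy error surface are assumptions made for illustration.

```python
import numpy as np

def quickprop_step(grad, prev_grad, prev_step, lr=0.1, mu=1.75):
    """Simplified per-weight Quickprop update.

    Uses the secant step prev_step * g / (g_prev - g) where it is defined,
    limits its size to mu times the previous step, and otherwise falls back
    to a gradient-descent step -lr * g.
    """
    gd_step = -lr * grad
    denom = prev_grad - grad
    usable = (prev_step != 0) & (denom != 0)
    safe_denom = np.where(usable, denom, 1.0)        # avoid division by zero
    quick = np.where(usable, prev_step * grad / safe_denom, 0.0)
    limit = mu * np.abs(prev_step)
    quick = np.clip(quick, -limit, limit)
    return np.where(usable, quick, gd_step)

# Toy error surface E(w) = 0.5 * sum(a * (w - m)**2), minimized at w = m.
a = np.array([1.0, 3.0])
m = np.array([2.0, -1.0])
w = np.zeros(2)
prev_grad = np.zeros(2)
prev_step = np.zeros(2)
for epoch in range(20):
    grad = a * (w - m)                  # gradient over the whole "epoch"
    step = quickprop_step(grad, prev_grad, prev_step)
    w = w + step
    prev_grad, prev_step = grad, step
```

On a quadratic surface the secant step lands on the minimum as soon as the growth limit allows it, which is the source of the speed-up over fixed-rate gradient descent.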

Backpropagation neural networks easily fit into the framework of the doubly-constrained spatial interaction model. The network learns the mapping function that best fits the relationship between the independent variables (production, attraction, and travel impedance) and the dependent variable (flows). Interestingly, the mapping function is no longer restricted to either power or exponential functional form as in the conventional models. Nor is it explicitly specified as a linear or non-linear regression model. The major advantage of the NN approach is that it is flexible enough to model non-linear relationships of arbitrary complexity in an automated fashion.

As noted by Black (1995), Fischer and Gopal (1994), and Gopal and Fischer (1996), a NN may perform well enough to estimate actual spatial interaction flows, but small deviations are bound to remain. Furthermore, the network itself does not contain any mechanism to enforce the origin and destination constraints. Consequently, the origin and destination totals derived by summing the flows predicted by the model are usually not equal to the actual origin and destination totals.


We use a standard iterative proportional fitting procedure (Slater, 1976) to enforce these constraints. After this post-processing, NN flows give predictions comparable to those of a doubly-constrained model.
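Iterative proportional fitting alternately rescales the rows and columns of a flow matrix until its margins match the known totals. The following Python sketch shows the idea; the function name `ipf`, the tolerance, and the example matrix standing in for raw network output are illustrative assumptions, and the sketch assumes strictly positive flows and equal grand totals for productions and attractions.

```python
import numpy as np

def ipf(T, O, D, tol=1e-9, max_iter=1000):
    """Rescale T so that row sums equal O and column sums equal D."""
    T = np.asarray(T, dtype=float).copy()
    O = np.asarray(O, dtype=float)
    D = np.asarray(D, dtype=float)
    for _ in range(max_iter):
        T *= (O / T.sum(axis=1))[:, None]    # enforce production totals
        T *= (D / T.sum(axis=0))[None, :]    # enforce attraction totals
        if np.allclose(T.sum(axis=1), O, rtol=tol, atol=0.0):
            break
    return T

# Hypothetical unconstrained predictions (stand-in for rescaled NN output).
T_raw = np.array([[50.0, 30.0, 10.0],
                  [20.0, 80.0, 40.0],
                  [15.0, 25.0, 90.0]])
O = np.array([100.0, 150.0, 110.0])   # known productions
D = np.array([90.0, 140.0, 130.0])    # known attractions
T_adj = ipf(T_raw, O, D)
```

The adjustment preserves the cross-product ratios of the raw predictions, so the interaction pattern learned by the network is kept while the margins are corrected.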

Network training is realized with NevProp 1.16 (Goodman, 1996). The Quickprop algorithm is embedded in NevProp, but pre- and post-processing (including scaling and enforcing production and attraction constraints) are part of separate applications written by the authors.

4. Empirical analysis

4.1. Study area and data sources

We use 1980 and 1990 journey-to-work commuter flows in the Atlanta Metropolitan Area. Commuter flows among the 15 counties of Atlanta SMSA for 1980 are available from the 1980 U.S. Census (Bureau of the Census, 1983). Commuter flows among the 20 counties of the Atlanta MSA (Fig. 2) for 1990 are available in the Census Transportation Planning Package (CTPP) (Bureau of Transportation Statistics, 1993). Data sets on commuting flows between census tracts in 1980 and 1990 were kindly made available to the authors by the Atlanta Regional Commission. There were 345 census tracts in 1980, and 507 census tracts in 1990 in the study area.


The logistics of spatial interaction modeling requires a clearly defined region with no, or small, flows across its border. In the case of Atlanta, this assumption is not grossly violated. In 1980, slightly more than 90% of working residents of the Atlanta SMSA also worked inside the SMSA. In 1990, 95.3% of working residents of the Atlanta MSA worked inside the MSA. Also, 92.9% of those employed in the Atlanta MSA lived within the region in 1990.

Spatial separation between commuting zones (counties or census tracts) is measured by the straight-line (Euclidean) distance between zone centroids in the metropolitan area. Setting intrazone distance to zero is known to generate systematic measurement errors. It is common practice to correct for this error by defining the distance from a zone to itself as a quarter of the distance from the zone centroid to the centroid of its nearest neighbor (Thomas and Hugget, 1980). The Euclidean distance is only an approximation of the perceived impedance between home and work locations. We recognize that road distance, travel time, or a generalized travel cost function may differently affect the predictive accuracy of the MLE and NN models. However, since the major thrust of this study is to compare and evaluate two forecasting techniques, the relative accuracy of their estimates matters more than their absolute accuracy and the approximation given by Euclidean distance is acceptable.
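The distance convention described above can be written compactly. In this Python sketch the function name `distance_matrix` and the three centroid coordinates are made up for illustration; only the rule itself (Euclidean distance between centroids, intrazonal distance equal to a quarter of the distance to the nearest neighboring centroid) comes from the text.

```python
import numpy as np

def distance_matrix(centroids):
    """Euclidean distances between zone centroids with a non-zero diagonal."""
    xy = np.asarray(centroids, dtype=float)
    diff = xy[:, None, :] - xy[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=2))
    # Mask the diagonal so each zone's nearest *other* centroid is found.
    masked = d + np.diag(np.full(len(xy), np.inf))
    nearest = masked.min(axis=1)
    d[np.diag_indices_from(d)] = 0.25 * nearest
    return d

# Hypothetical centroids forming a 3-4-5 triangle.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
dist = distance_matrix(centroids)
```

A strictly positive diagonal also keeps the negative power function finite for intrazonal flows, since a zero distance would make $c_{ij}^{-\beta}$ undefined.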

4.2. Implementation issues

The journey-to-work analysis is structured as follows. First, two doubly-constrained spatial interaction models are calibrated by MLE on 1980 travel data. One model uses county-level data, the second uses census tract-level data. Each model is then employed to forecast interzonal commuter flows (intercounty or intertract) for year 1990 with the calibrated distance decay parameter $\beta$ and 1990 working population and employment data at the corresponding geographic resolution ($O_i$ and $D_j$ marginal totals) for production and attraction, respectively. Forecasted flows are compared to actual 1990 trip data using four goodness-of-fit measures: the absolute error (AE), the standardized root mean square error (SRMSE), Kullback's w statistic, and the R-square. See Fotheringham and Knudsen (1987), Weiss (1995), and others for a description of the statistics. In most cases, these measures are highly consistent. Hence, only the AE and SRMSE measures are reported for the tested models hereunder.
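For concreteness, the two reported measures can be computed as in the Python sketch below. It assumes commonly used definitions: AE as the sum of absolute deviations expressed as a percentage of total observed flow, and SRMSE as the root mean square error divided by the mean observed flow. The exact formulas are those of the cited references and may differ in detail; the function names and the small example matrices are hypothetical.

```python
import numpy as np

def absolute_error_pct(T_obs, T_pred):
    """Sum of absolute deviations as a percentage of total observed flow."""
    T_obs = np.asarray(T_obs, dtype=float)
    T_pred = np.asarray(T_pred, dtype=float)
    return 100.0 * np.abs(T_pred - T_obs).sum() / T_obs.sum()

def srmse(T_obs, T_pred):
    """Root mean square error divided by the mean observed flow."""
    T_obs = np.asarray(T_obs, dtype=float)
    T_pred = np.asarray(T_pred, dtype=float)
    rmse = np.sqrt(((T_pred - T_obs) ** 2).mean())
    return rmse / T_obs.mean()

# Made-up 2 x 2 example.
T_obs = np.array([[10.0, 20.0], [30.0, 40.0]])
T_pred = np.array([[12.0, 18.0], [33.0, 37.0]])
ae = absolute_error_pct(T_obs, T_pred)
s = srmse(T_obs, T_pred)
```

Both measures are zero for a perfect forecast and grow with the size of the misallocation; under these definitions AE can exceed 100% when deviations are larger than the flows themselves.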

In parallel, two sets of NN spatial interaction models are trained and validated on the same 1980 travel data: one on county-level flows, the other on census tract flows. With the network weights for which the validation error is minimum and 1990 population and employment data, the NNs predict 1990 interzonal commuter flows at the county and census tract levels respectively (test sets). Goodness-of-fit of these forecasts to actual flows is once again measured. Finally, the relative performance of MLE and NN spatial interaction models in predictive mode is assessed by comparing goodness-of-fit measures.


Selecting a NN configuration and parameters suitable for a certain problem is often a challenging task. The first stage of backpropagation feedforward NN model design entails setting the topology of the model. A natural topology for a doubly-constrained spatial interaction problem involves three inputs (the number of resident workers in the zone of origin, employment in the destination zone, and the spatial impedance between the zones) and one output (the number of commuters). It is common practice to proceed by trial and error to select the number of hidden nodes, and to test networks with hidden layers of varying size. Networks with 5, 20, and 50 hidden nodes are tested in this study. Networks of larger sizes are impractical due to the excessive computational requirement of their training. Each network configuration is processed five times, each run starting with a random set of initial weights and a training set drawn randomly from the full data set. We report results of experiments with different partitions of the full data set into training and validation sets.

The NN model is further specified as follows. Since, in most instances, weights are changed according to the Quickprop rule, no momentum term is needed, and the learning rate must be specified only for use with the gradient descent method. A 0.1 learning rate is used throughout the analysis. Experiments with different rates lead to remarkably similar weight estimates and learning speeds. Initial weights are randomly drawn from a uniform distribution within the range [−0.01, +0.01].

All three network inputs are scaled by dividing the value observed for each example by the input's maximum value in the set. Whereas input scaling is optional, scaling of the output is required for successful learning. Scaling to fit the output within the [0.1, 0.9] range is usually used. However, because the networks are tested on data other than those used for training and validation, and because total flows have increased between base and prediction years, the interval is scaled to 0.75. Therefore, the output (the number of commuters) is scaled using a linear transformation to have 1980 fit the [0.25, 0.75] range. This transformation is given by

$$T_{ij}^{\mathrm{network}} = 0.25 + 0.5\,\frac{T_{ij}}{T_{\max}^{1980}}, \tag{7}$$

where $T_{ij}^{\mathrm{network}}$ is the output as seen by the network, and $T_{\max}^{1980}$ is the maximum commuter flow in 1980. At the testing stage, Eq. (7) is used in reverse.
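The scaling of Eq. (7) and its inverse are simple enough to state in code. The Python sketch below uses hypothetical function names and a made-up maximum flow; it also makes visible how much headroom the scaling leaves for growth between base and prediction years.

```python
import numpy as np

def scale_output(T, T_max_1980):
    """Eq. (7): map flows in [0, T_max_1980] onto [0.25, 0.75]."""
    return 0.25 + 0.5 * np.asarray(T, dtype=float) / T_max_1980

def unscale_output(z, T_max_1980):
    """Eq. (7) in reverse: recover flows from network outputs."""
    return (np.asarray(z, dtype=float) - 0.25) * T_max_1980 / 0.5

T_max_1980 = 8000.0                      # hypothetical largest 1980 flow
flows = np.array([0.0, 2000.0, 8000.0])
z = scale_output(flows, T_max_1980)      # 0.25, 0.375, 0.75
recovered = unscale_output(z, T_max_1980)

# A logistic output is always below 1, so the largest flow the network can
# represent after unscaling stays below 1.5 times the 1980 maximum.
ceiling = unscale_output(1.0, T_max_1980)
```

Any predicted flow is therefore bounded by `ceiling`, which is a point to keep in mind when interpreting forecasts for zones whose flows grew strongly after the base year.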

All networks are trained for a maximum of 100 000 iterations. Many neural network practitioners allow for an early stopping of the feedforward backpropagation algorithm (Sarle, 1996) in order to prevent overfitting. It is critical to realize that the error on the validation set is not a good estimate of network generalization. A stopped network is tested on an independent test set that has never been used for training to give an unbiased estimate of the network performance. We train and validate all networks on the 1980 data, while the testing is accomplished on the 1990 data.
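Early stopping amounts to remembering the weights at which validation error was lowest while training continues up to the epoch limit. The Python sketch below expresses this generically; the function `train_with_early_stopping`, its callable arguments, and the one-parameter toy problem are assumptions for illustration and do not reproduce the NevProp internals.

```python
import copy

def train_with_early_stopping(weights, update, val_error, max_epochs=100_000):
    """Run `update` for up to max_epochs and keep the best-validating weights.

    update(weights)    -> weights after one training epoch
    val_error(weights) -> error on the validation set
    """
    best_weights = copy.deepcopy(weights)
    best_error = val_error(weights)
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        weights = update(weights)
        error = val_error(weights)
        if error < best_error:
            best_weights = copy.deepcopy(weights)
            best_error = error
            best_epoch = epoch
    return best_weights, best_error, best_epoch

# Toy problem: a single weight drifts upward by 0.1 per epoch, while the
# validation error is smallest at w = 1, so training "overshoots" afterwards.
best_w, best_err, best_epoch = train_with_early_stopping(
    0.0,
    update=lambda w: w + 0.1,
    val_error=lambda w: (w - 1.0) ** 2,
    max_epochs=30,
)
```

The retained weights are then applied once to the independent test set, so the test error plays no role in choosing when to stop.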

At the county level, a total of 225 data vectors are available. For each network processed, the training set is formed by randomly selecting 112 vectors without replacement, while the remaining 113 vectors are used for validation. In one experiment, the full set of vectors is used both for


training and validation. The network weights that minimize validation error serve to test the model on the 400 interactions from the 1990 trip matrix.

At the census-tract level, the training set is selected by simple random sampling without replacement of 200 examples from the 121 104 origin-destination pairs in the 1980 tract-to-tract trip matrix.⁴ Similarly for the validation set. The optimal set of network weights is then tested on all 257 049 vectors from the 1990 tract-level trip matrix.
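The sampling scheme can be illustrated with a short Python sketch. The set sizes come from the text; the function name `split_train_validation`, the random seeds, and the use of index arrays to stand for data vectors are illustrative assumptions.

```python
import numpy as np

def split_train_validation(n_vectors, n_train, seed=0):
    """Random partition of vector indices, drawn without replacement."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_vectors)
    return order[:n_train], order[n_train:]

# County level: 225 vectors split into 112 for training, 113 for validation.
train_idx, val_idx = split_train_validation(225, 112)

# Tract level: 200 training examples drawn from 121 104 origin-destination pairs.
rng = np.random.default_rng(1)
tract_train_idx = rng.choice(121_104, size=200, replace=False)
```

Re-running with a different seed gives a different partition, which mirrors the practice of processing each network configuration several times.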

5. Results of performance comparison

5.1. Baseline spatial interaction models

The results of calibrating and testing maximum likelihood doubly-constrained models of journey to work at the county and census-tract levels are presented in Table 1. The overall performance of these models with an exponential function of spatial deterrence is sufficient to provide a benchmark against which to evaluate the performance of NN models.

Of the two distance decay functions, better performance is exhibited by the negative exponential function in terms of all four goodness-of-fit measures, and at both aggregation levels (county and census tract). This is consistent with the widespread consensus that the exponential function is more appropriate for analyzing short distance interactions, such as those that take place within an urban area, while the power function is more appropriate for analyzing longer distance interactions such as interstate migration flows (Fotheringham and O'Kelly, 1989).

Table 1 clearly shows that county-level models are more accurate than models applied at the census-tract level. Lower model performance or fit at a more detailed geographic scale is not unusual. This phenomenon is known in the spatial-analytic literature as the modifiable areal unit problem (Openshaw, 1984). Treatment of the issue in the context of spatial interaction modeling can be found in Amrhein and Flowerdew (1989), Batty and Sikdar (1982), and others. In substance, if a simple functional relationship with a single parameter, like the doubly-constrained spatial interaction model, presents some difficulties in accounting for all subtleties of 400 interactions in the 20-county trip matrix, these difficulties are exacerbated with the 507-tract trip matrix. Consequently, if NNs truly outperform MLE spatial interaction models, one would anticipate the advantage to be much larger with census tracts than with county data. Along the same line, in case NNs were not to perform as well as MLE models with the latter data, the reverse statement could reasonably be expected for commuter flows between census tracts.

5.2. Neural network models

The results of the NN training and testing on county-level data are presented in Table 2. All five sets of networks exhibit good to very good ability to predict 1990 commuter flows in Atlanta.

⁴ Tests with training sets of 1000 cases do not show better predictive accuracy than with 200 cases, while computational time becomes prohibitively high.


Comparison of the first four sets of results reported in Table 2 indicates that, except for a slight tendency for performance to drop as more hidden nodes are used, there is no significant impact of the number of hidden nodes in a network on goodness-of-fit. It is also noteworthy that all better performing networks were stopped after less than 10 000 epochs.

Table 2
Neural network models trained using 1980 county-to-county commuter flows, and tested on 1990 commuter flows

Instance    Absolute error (%)    SRMSE    Epoch network was stopped

(a) Five-node networks

(d) Twenty-node networks trained on full 1980 data
1           30.9                  0.920    100 000

Table 1
Maximum likelihood doubly-constrained models calibrated using 1980 flows, and tested on 1990 commuter flows

                                          Distance decay parameter
Exponential function of distance decay    −8.43 × 10^−5    24.7    0.866

Census-tract level


The bottom part of Table 2 displays the results of testing five 20-node NNs trained on the entire 1980 data set (225 training cases). As expected, they perform rather consistently and somewhat better than networks trained on half the available interaction pairs. Training a neural network on the full set of cases is usually not a recommended practice because it promotes overfitting, a point evidenced here by the failure of all five networks to converge before 100 000 training epochs. On the other hand, training data is less sparse than with a sample of cases, and the network's power to generalize input–output relationships is enhanced.

Neural networks trained and tested on tract-level flows perform significantly worse than those trained on county-level data (Table 3). The best NN model produces errors that are 83.6% of 1990 commuter flows, while the worst model hits a whopping 119.1% absolute error. The faint inverse relationship between model performance and number of hidden nodes detected above is now clearly marked. Average goodness-of-fit deteriorates as the number of hidden nodes increases from 5 to 50.

5.3. Neural network versus MLE models

None of the NN models tested on county commuter flows outperforms the corresponding MLE doubly-constrained model. The only partial exception to this rule is the case of one 50-node NN model (#3 in Table 2(c)) which shows a better performance as measured by SRMSE and

Table 3
Neural network models trained using 1980 tract-to-tract commuter flows, and tested on 1990 commuter flows with training and validation sets selected randomly


R-square, but not according to the other two statistics. Even more remarkable, the five NNs trained on the entire 1980 data set (Table 2(d)) still fail to surpass the MLE model in any run, though they come closer to challenging its superiority.

At the census-tract level, the comparison is even more favorable for the MLE model. All runs of the NN models (Table 3) trail far behind the MLE model (Table 1). The best of twelve NN models misallocates 83.6% of all commuter flows, while the conventional doubly-constrained model with the negative exponential function of distance misallocates "only" 68.7% of flows.

It is appropriate to stress here again that model performance is evaluated in a predictive mode, that is, by the capacity of a model to predict interaction flows for a horizon other than the base year used in training and validation. In fact, performance measured on base-year data would lead to opposite conclusions, thus supporting the existing literature in the matter. For information purposes, performances of MLE and NN models trained and tested on the 1990 county-to-county flows are reproduced in Appendices A and B, respectively.

By all accounts, the evidence reviewed above that neural networks show inferior predictive performance to conventional statistical models is quite puzzling and unexpected. Neural networks are indeed regarded as good approximators (Kreinovich and Sirisaengtaskin, 1993). The data analysis calls for further research to pinpoint the causes of their underperformance. In order to trace potential patterns of consistent underprediction or overprediction by NN models, we use three-dimensional plots of observed and predicted data. Each plot displays flows originating from a given county against distance and number of workers at destinations. Such plots for a sample of four counties, namely Clayton, Cobb, DeKalb, and Fulton counties, are depicted in Fig. 3 for 1990. Corresponding flow surfaces generated by the five tested instances of 20-node neural networks (see Table 2(b)) are given in Fig. 4.

On examination, the predicted surfaces in Fig. 4 reveal unsuspected structures dominated by a wavy pattern of troughs and ridges. These structures are particularly pronounced in instance three (Fig. 4(c)), which also happens to be among the instances that predict 1990 flows with the least overall accuracy. This pattern is often symptomatic of overfitting due to excessive training of the network. That this network was trained longer than any other 20-node network suggests that it learned the noise in the training set in addition to the underlying function we want it to find. As a result, its ability to generalize is rather poor and its prediction accuracy is low, particularly where training data are sparse (interpolation problem). Another feature common to several underperforming network instances in Fig. 4 is the consistent underestimation of the largest 1990 expected flows (Figs. 3 and 4). Networks fail to extrapolate around and beyond the limits of the training sample. Possible explanations for interpolation and extrapolation errors are now pursued.

The spurious ridges and troughs that show throughout predicted surfaces suggest that overfitting may be occurring, in spite of the early stopping mechanism put in place to prevent it. In our commuter trip distribution problem, we can postulate that the occurrence of overfitting is tied to an excessive number of hidden nodes in the network. Our problem may in fact be simple enough to require fewer than five hidden nodes. After all, conventional spatial interaction models perform


well with a single parameter. Such a neural network could be devoid of spurious ridges and troughs, and generalize just right.

To test the proposition that the predictive performance of NN models is improved by reducing the number of hidden nodes, various neural networks with one to three hidden nodes are trained on 1980 county-level commuter flows and tested on 1990 flows. Results are summarized in Table 4. Networks with fewer hidden nodes suffer less from spurious troughs and ridges on their prediction plots, and therefore are less prone to overfitting. In fact, surfaces generated by one-hidden-node networks do not exhibit any spurious feature (Fig. 5). Such networks no longer model the noise in the training data because they are unable to produce complex surfaces. This does not translate, however, into unequivocally better goodness-of-fit on the 1990 data for sparser networks. Furthermore, none of the networks tested with a reduced number of hidden


nodes (Table 4) succeeds in outperforming the MLE doubly-constrained model with exponential function of distance (Table 1). A straightforward consequence is that the lower performance of neural networks cannot be imputed to overfitting alone and cannot be remedied easily by modifying the topology of the networks.

The fact remains that neural networks have a limited ability to interpolate spatial interaction data in a predictive mode. Paradoxically, the cause of this weakness may also be the essence of its strength in validation on contemporary data, namely the inherent flexibility to approximate complex data structures with great accuracy. In short, the poor fit of neural networks on prediction-year data (1990) can be blamed on their unrivaled fit to base-year data (1980). According to this view, neural networks are such good approximators that they model not only interaction data structures, but also the context of the transportation systems within which commuter patterns take place. By design, spatial interaction neural networks are context-dependent models whose parameters do not transfer well to other contexts. The extent of NN context sensitivity remains a subject for future study. A solution to this problem may come from the explicit incorporation of context dependencies in the network representation. Evidence in Table 4 suggests that model transferability is problematic even for sparse model topologies.

It is our contention that the sigmoid form of network output limits the ability of neural networks to extrapolate interaction data in a meaningful way. Sigmoid output nodes tend indeed to generate S-shaped predicted surfaces that are ill-suited to model spatial interaction behavior. For

Table 4

Neural network models with few hidden nodes. Trained using 1980 county-to-county commuter ¯ows, and tested using 1990 commuter ¯ows

(16)

illustration purposes, let us compare how ¯ows predicted by the NN and conventional gravity models respond to distance as the other two input variables are held constant. Most NN ¯ow surfaces (Figs. 4 and 5) have in common anS-shaped pro®le of dependence between ¯ow volume and distance. This pro®le implies that, all other things being equal, the marginal ¯ow increase with respect to distance is small and declining, sometimes even negative. On the contrary, observed patterns (Fig. 3) show no tapering in the relationship between ¯ow volume and distance. Con-sequently, ¯ow extrapolation on theS-shaped pro®le is highly inaccurate. Because enough of the 1990 ¯ow data fall outside of the range of the 1980 training data, the overall performance of the network is generally poor. A signi®cant implication is that conventional feedforward backprop-agation NNs may not exhibit the right properties for use in the application domain of trip dis-tribution. Other NN models that do not assume a sigmoidal activation function ± such as the Gaussian Radial Basis Function model (Verleysen and Hlavackova, 1994) ± may prove better suited for spatial interaction problems.
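The saturation argument can be illustrated with a minimal sketch. It is not the network estimated in this study: the weights, the scaling constant, the gravity constant k and the decay parameter below are hypothetical values chosen only to show that a sigmoid output rescaled to the training range cannot exceed the largest training flow, whereas a gravity-type term keeps growing with its inputs.

```python
import math

def sigmoid(z):
    """Logistic activation; its value always lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, chosen only for illustration.
MAX_TRAIN_FLOW = 50_000.0    # largest flow assumed to be present in the base-year data
WEIGHT, BIAS = 4.0e-5, -2.0  # made-up weights of a single sigmoid output node

def sigmoid_flow(workers):
    """Sigmoid output rescaled to the training range; it never exceeds MAX_TRAIN_FLOW."""
    return MAX_TRAIN_FLOW * sigmoid(WEIGHT * workers + BIAS)

def gravity_flow(workers, residents, distance, k=1.0e-6, beta=-1.0e-4):
    """Unconstrained gravity term k * O * D * exp(beta * d), with beta < 0 (assumed form)."""
    return k * workers * residents * math.exp(beta * distance)

if __name__ == "__main__":
    # Doubling the number of workers: the sigmoid response flattens, the gravity term doubles.
    for workers in (50_000, 100_000, 200_000, 400_000):
        print(workers,
              round(sigmoid_flow(workers)),
              round(gravity_flow(workers, 80_000, 10_000)))
```

Under these assumed values, each doubling of the origin mass doubles the gravity-type flow, while the increments of the sigmoid response shrink toward zero once the input leaves the range over which the node was fitted.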

In contrast to neural networks, the smooth surface generated by the MLE model with a negative exponential function of distance decay provides a better fit to the empirical data. A good fit is achieved not only for the data on which the model was calibrated, but also for the unseen data beyond the training range. This indicates that the maximum likelihood model is a better extrapolator than the neural network, and a better tool for urban and regional planning. A fundamental reason for better performance of the maximum-likelihood model is that, being a one-parameter model, it generalizes more than neural networks and, consequently, is more context independent. Also contributing towards better performance is its derivation from first principles, whereas the NN approach is purely data-driven. Wilson (1970) showed in his seminal work how the exponential distance decay function is derived from the entropy principle, by finding the most likely trip matrix given the origin and destination totals and the total distance traveled in the system. The principle of maximum likelihood applies to all trip matrices, regardless of their use for model calibration or model testing; hence the better extrapolation capability of the maximum-likelihood model.
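For readers who wish to experiment, the structure of a doubly-constrained model with an exponential deterrence function can be sketched as follows. This is a minimal illustration and not the calibration procedure used in this study: the function name, the toy totals and distances, and the value of beta are hypothetical, and the balancing factors are obtained by simple alternating updates for a given beta rather than by maximum-likelihood estimation of beta itself.

```python
import math

def doubly_constrained(origins, destinations, dist, beta, iters=200, tol=1e-12):
    """Return flows T[i][j] = A_i * O_i * B_j * D_j * exp(beta * d_ij).

    The balancing factors A_i and B_j are updated alternately so that rows sum
    to the origin totals and columns sum to the destination totals.
    Assumes sum(origins) == sum(destinations) and beta < 0.
    """
    n, m = len(origins), len(destinations)
    f = [[math.exp(beta * dist[i][j]) for j in range(m)] for i in range(n)]
    a = [1.0] * n
    b = [1.0] * m
    for _ in range(iters):
        a_new = [1.0 / sum(b[j] * destinations[j] * f[i][j] for j in range(m))
                 for i in range(n)]
        b_new = [1.0 / sum(a_new[i] * origins[i] * f[i][j] for i in range(n))
                 for j in range(m)]
        change = max(abs(x - y) for x, y in zip(a + b, a_new + b_new))
        a, b = a_new, b_new
        if change < tol:
            break
    return [[a[i] * origins[i] * b[j] * destinations[j] * f[i][j]
             for j in range(m)] for i in range(n)]

if __name__ == "__main__":
    # Toy three-zone system with hypothetical totals and distances.
    O = [100.0, 200.0, 300.0]
    D = [150.0, 250.0, 200.0]
    d = [[1000.0, 5000.0, 9000.0],
         [5000.0, 1000.0, 4000.0],
         [9000.0, 4000.0, 1000.0]]
    for row in doubly_constrained(O, D, d, beta=-1.0e-4):
        print([round(t, 1) for t in row])
```

Whatever the totals supplied, the predicted matrix reproduces them, and the single parameter beta is the only quantity carried over from the base year. This is one way to see why such a model is less tied to the context of the calibration data.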

5.4. Geographic scale problem

The dramatically lower performance of neural networks on tract-level data suggests that additional factors are at work at this scale. The vast majority of commuter flows in tract-level trip matrices are zero (82.9% of all flows in the 1980 matrix, and 82.5% in the 1990 matrix), while most non-zero flows are fairly small. With only a small fraction of flows significantly larger than the rest, small random samples of training examples have little chance to include large flows. As a result, networks trained on random samples of examples primarily learn how to predict zero and very small flows. Since we established earlier that neural networks are rather poor extrapolators of spatial interaction flows, their predictions of larger flows are highly inaccurate. Hence the low overall performance of neural networks on small analysis zones.

Resorting to larger samples (say, more than 1000 cases), or even to the entire population of samples, is not a practical solution because it leads to unacceptably long training. An appealing alternative consists in using stratified random sampling instead of uniform random sampling in order to represent flows of all sizes in the training set. The effectiveness of this strategy is now assessed with two distinct stratified sampling schemes.

Table 5

Neural network models trained using 1980 tract-to-tract commuter flows, and tested using 1990 commuter flows with training and validation sets selected using stratified random sampling

Instance Absolute error (%) SRMSE Epoch network

In strategy I, 20 examples of zero flows and 180 examples of non-zero flows are selected randomly without replacement from 121,104 interactions in 1980. In strategy II, we select 10 examples of zero flows, and 10 randomly-selected examples from each bin of origin-destination pairs defined by 10-unit increments on the flows. In both strategies, validation sets are selected similarly. The testing results for a 5-node network are presented in Table 5.
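The two sampling schemes can be sketched as follows. This is an illustrative reading of the description above and not the code used in the study. The data layout (a dictionary mapping origin-destination pairs to integer flows), the function names, the bin boundaries (1 to 10, 11 to 20, and so on) and the handling of bins with fewer than 10 pairs (all of them are kept) are assumptions.

```python
import random
from collections import defaultdict

def strategy_one(flows, n_zero=20, n_nonzero=180, seed=0):
    """Strategy I: sample zero and non-zero flows separately, without replacement."""
    rng = random.Random(seed)
    zero = [pair for pair, flow in flows.items() if flow == 0]
    nonzero = [pair for pair, flow in flows.items() if flow > 0]
    # random.sample raises ValueError if a stratum holds fewer pairs than requested.
    return rng.sample(zero, n_zero) + rng.sample(nonzero, n_nonzero)

def strategy_two(flows, n_zero=10, per_bin=10, width=10, seed=0):
    """Strategy II: sample zero flows, then up to per_bin pairs from each flow-size bin."""
    rng = random.Random(seed)
    zero = [pair for pair, flow in flows.items() if flow == 0]
    bins = defaultdict(list)
    for pair, flow in flows.items():
        if flow > 0:
            # Assumed binning: flows 1..10 fall in bin 0, 11..20 in bin 1, and so on.
            bins[(flow - 1) // width].append(pair)
    picked = rng.sample(zero, n_zero)
    for index in sorted(bins):
        members = bins[index]
        # Assumption: a bin with fewer than per_bin pairs contributes all of them.
        picked += rng.sample(members, min(per_bin, len(members)))
    return picked
```

Because large flows are rare, the upper bins hold few pairs each, so under this reading strategy II gives large flows far more weight in the training set than their share of the trip matrix.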

Comparison of these goodness-of-fit results to those of the five-node network trained on a simple random sample (Table 3) reveals no significant improvement. The five-node networks with training and validation sets selected using stratified random sampling have an average absolute error of 92.3% for sampling strategy I, and 91.8% for sampling strategy II, against 93.6% with uniform random sampling. This piece of evidence suggests that using stratified random sampling instead of uniform random sampling to select the training set does not improve the accuracy of NN spatial interaction models. More complex stratification strategies may produce better results, but we leave this investigation for the future.
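For reference, the two goodness-of-fit statistics compared above can be computed along the following lines. The formulas are assumptions based on common usage: the absolute error is taken here as the sum of absolute deviations expressed as a percentage of total observed flow, and SRMSE as the root mean square error divided by the mean observed flow. The definitions given earlier in the paper take precedence if they differ.

```python
import math

def absolute_error_pct(observed, predicted):
    """Assumed definition: 100 * sum|T - T_hat| / sum(T)."""
    total = sum(observed)
    return 100.0 * sum(abs(t - p) for t, p in zip(observed, predicted)) / total

def srmse(observed, predicted):
    """Assumed definition: root mean square error divided by the mean observed flow."""
    n = len(observed)
    mse = sum((t - p) ** 2 for t, p in zip(observed, predicted)) / n
    return math.sqrt(mse) / (sum(observed) / n)

if __name__ == "__main__":
    # Toy flows, flattened from a trip matrix; the values are hypothetical.
    obs = [10, 0, 5, 5]
    pred = [8, 1, 5, 6]
    print(absolute_error_pct(obs, pred), srmse(obs, pred))
```

Both statistics are scaled by the observed flows, so under these definitions a model that predicts mostly zeros on a sparse trip matrix can still show an absolute error close to 100%.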

6. Conclusions

This study compared the performance of multilayer perceptron neural networks and maximum-likelihood doubly-constrained models for commuter trip distribution. Our experiments produced overwhelming evidence that NN models may fit data better but their predictive accuracy is poor in comparison to that of maximum-likelihood doubly-constrained models. What our thorough study failed to identify are perceptron model configurations that consistently exhibit a predictive performance that surpasses that of maximum-likelihood doubly-constrained models. It points to several likely causes of neural network underperformance, including model non-transferability, insufficient ability to generalize, reliance on sigmoid activation functions, and their essence as data-driven techniques. An agenda for future research is also proposed to explore the potential for other perceptron formulations (i.e., spatial structure as NN input) and other neural networks (RBF, for instance) to predict spatial interaction flows with greater accuracy.

This conclusion is at variance with the existing literature, which has been overly optimistic about the advantages of modeling trip distribution by spatial interaction with backpropagation neural networks. While neural networks may perform better than conventional models in modeling spatial interaction for the base year, they fail to outperform the MLE doubly-constrained model for forecasting purposes, which is the motivation behind these modeling efforts in the first place. Therefore, current perceptron neural networks do not provide an appropriate modeling approach to forecasting trip distribution over a planning horizon for which distribution predictors (number of workers, number of residents, commuting distance) are well beyond their base-year domain of definition.

Acknowledgements

The authors are grateful to Dr. Frank Koppelman. His insightful comments on an earlier version of the manuscript were instrumental in enhancing its quality.

Appendix A

| | Distance decay parameter (b) | Absolute error (AE) (%) | SRMSE |
|---|---|---|---|
| Exponential function of distance decay | −7.64 × 10⁻⁵ | 24.0 | 0.728 |

Appendix B

Neural network models trained and tested using 1990 county-to-county commuter flows in Atlanta

| Instance | Absolute error (%) | SRMSE | Epoch network was stopped |
|---|---|---|---|
| Five-node networks | | | |
| 1 | 27.1 | 0.723 | 100 000 |
| 2 | 18.2 | 0.470 | 100 000 |
| 3 | 23.3 | 0.634 | 100 000 |
| 4 | 18.3 | 0.463 | 100 000 |
| 5 | 21.1 | 0.520 | 100 000 |
| Average | 21.6 | 0.562 | |
| Twenty-node networks | | | |
| 1 | 24.3 | 0.585 | 100 000 |
| 2 | 15.2 | 0.379 | 100 000 |
| 3 | 21.4 | 0.554 | 100 000 |
| 4 | 27.3 | 0.637 | 100 000 |
| 5 | 24.1 | 0.636 | 100 000 |
| Average | 22.5 | 0.558 | |
| Fifty-node networks | | | |
| 1 | 8.6 | 0.169 | 100 000 |
| 2 | 8.4 | 0.168 | 100 000 |
| 3 | 8.6 | 0.166 | 100 000 |
| 4 | 8.7 | 0.168 | 100 000 |
| 5 | 10.7 | 0.212 | 100 000 |

References

Amrhein, C.G., Flowerdew, R., 1989. The effect of data aggregation on a Poisson regression model of Canadian migration. In: Goodchild, M., Gopal, S. (Eds.), The Accuracy of Spatial Databases. Taylor and Francis, London, pp. 229–238.

Bacharach, M., 1970. Biproportional Matrices and Input-Output Change. Cambridge University Press, Cambridge.

Batty, M., 1976. Urban Modeling: Algorithms, Calibrations, Predictions. Cambridge University Press, Cambridge.

Batty, M., Mackie, S., 1972. The calibration of gravity, entropy, and related models of spatial interaction. Environment and Planning A 4, 205–233.

Batty, M., Sikdar, P.K., 1982. Spatial aggregation in gravity models: 1. An information-theoretic framework. Environment and Planning A 14, 377–405.

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Black, W.R., 1995. Spatial interaction modeling using artificial neural networks. J. Transport Geography 3 (3), 159–166.

Bureau of the Census, 1983. 1980 Census of Population and Housing, Census Tracts, Atlanta, GA. PHC80-2-77. US Department of Commerce, Bureau of the Census, Washington.

Bureau of Transportation Statistics, 1993. 1990 Census Transportation Planning Package. US Department of Transportation, Bureau of Transportation Statistics. CD-Rom, Washington.

Dougherty, M., 1995. A review of neural networks applied to transport. Transportation Research C 3, 247–260.

Evans, A.W., 1971. The calibration of trip distribution models with exponential or similar cost functions. Transportation Research 5, 15–38.

Fahlman, S.E., 1989. Faster-learning variations on back-propagation: An empirical study. In: Touretzky, D., Hinton, G., Sejnowski, T. (Eds.), Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann, San Mateo, pp. 38–51.

Fischer, M.M., Gopal, S., 1994. Artificial neural networks: A new approach to modeling interregional telecommunication flows. J. Regional Science 34, 503–527.

Fotheringham, A.S., Knudsen, D.C., 1987. Goodness-of-fit Statistics. CATMOG series. Geo Abstracts, Norwich.

Fotheringham, A.S., O'Kelly, M.E., 1989. Spatial Interaction Models: Formulations and Applications. Kluwer, London.

Goodman, P.H., 1996. NevProp software, ver. 3. Reno, NV: University of Nevada, URL: http://www.scs.unr.edu/nevprop/.

Gopal, S., Fischer, M.M., 1996. Learning in single hidden-layer feedforward network: Backpropagation in a spatial interaction modeling context. Geographical Analysis 28, 38–55.

Haykin, S.S., 1998. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River.

Himanen, V., Nijkamp, P., Reggiani, A., 1998. Neural Networks in Transport Applications. Ashgate, Brookfield.

Hua, J., Faghri, A., 1994. Applications of artificial neural networks to intelligent vehicle-highway systems. Transportation Research Record 1453, 83–90.

Kikushi, S., Nanda, R., Perincherry, V., 1993. A method to estimate trip-O-D patterns using a neural network approach. Transportation Planning and Technology 17, 51–65.

Kreinovich, V., Sirisaengtaskin, O., 1993. Universal approximators for functions and for control strategies. Neural, Parallel, and Scientific Computations 1, 325–346.

Mozolin, M.V., 1997. Spatial interaction modeling with an artificial neural network. Discussion Paper Series 97-1, Department of Geography, University of Georgia, Athens, GA.

Openshaw, S., 1984. The Modifiable Areal Unit Problem. CATMOG 38. Geo Abstracts, Norwich.

Openshaw, S., 1993. Modeling spatial interaction using a neural net. In: Fischer, M.M., Nijkamp, P. (Eds.), Geographic Information Systems, Spatial Modeling and Policy Evaluation. Springer, Berlin, pp. 147–164.

Ortuzar, J. de Dios, Willumsen, L.G., 1994. Modelling Transport. Wiley, Chichester.

Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge.

Rojas, P., 1996. Neural Networks: A Systematic Introduction. Springer, New York.

Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back-propagating errors. Nature 323 (9 October), 533–536.

Sarle, W.S. (Ed.), 1996. A list of frequently asked questions (FAQ), USENET: comp.ai.neural-nets. Available via anonymous FTP from ftp.sas.com/pub/neural/FAQ.html.

Slater, P.B., 1976. Hierarchical internal migration regions of France. IEEE Transactions on Systems, Man, and Cybernetics 6 (4), 321–324.

Smetanin, Y.G., 1995. Neural networks as systems for pattern recognition: A review. Pattern Recognition and Image Analysis 5, 254–293.

Thomas, R.W., Hugget, R.J., 1980. Modeling in Geography: A Mathematical Approach. Barnes and Noble, Totowa.

Verleysen, M., Hlavackova, K., 1994. An optimized RBF network for approximation of functions. Proceedings of the European Symposium on Artificial Neural Networks, ESANN'94.

Weiss, N.A., 1995. Introductory Statistics. Fourth Edition. Addison-Wesley, Reading.

Williams, P.A., Fotheringham, A.S., 1984. The Calibration of Spatial Interaction Models by Maximum Likelihood Estimation with Program SIMODEL. Geographic Monograph Series, vol. 7, Department of Geography, Indiana University, Bloomington, IN.