Directory UMM :Data Elmu:jurnal:J-a:Journal of Econometrics:Vol98.Issue2.Oct2000:

(1)

*Corresponding author. Tel.:#612-9351-2787; fax:#612-9351-6409. E-mail address:[email protected] (M. Smith).

Nonparametric seemingly unrelated regression

Michael Smith

!,

*, Robert Kohn

"

!Econometrics and Business Statistics, University of Sydney, Sydney, Australia

"Australian Graduate School of Management, University of New South Wales, Kensington, New South Wales, Australia

Received 19 March 1998; received in revised form 9 February 2000; accepted 13 March 2000

Abstract

A method is presented for simultaneously estimating a system of nonparametric regressions which may seem unrelated, but where the errors are potentially correlated between equations. We show that the advantage of estimating such a&seemingly unre-lated'system of nonparametric regressions is that less observations can be required to obtain reliable function estimates than if each of the regression equations is estimated separately and the correlation ignored. This increase in e$ciency is investigated empiric-ally using both simulated and real data. The method uses a Bayesian hierarchical framework where each regression function is represented as a linear combination of a large number of basis terms. All the regression coe$cients, and the variance matrix of the errors, are estimated simultaneously by their posterior means. The computation is carried out using a Markov chain Monte Carlo sampling scheme that employs a&focused sampling'step to combat the high-dimensional representation of the unknown regression functions. The methodology extends easily to other nonparametric multivariate regres-sion models. ( 2000 Elsevier Science S.A. All rights reserved.

JEL classixcation: C11; C14; C15; C31

Keywords: Nonparametric multivariate regression; Bayesian hierarchical SUR model; Multivariate subset selection; Markov Chain Monte Carlo

(2)

1. Introduction

The aim of nonparametric regression is to estimate regression functions without assuming a priori knowledge of their functional forms. The price for this

#exibility is that appreciably larger sample sizes are required to obtain reliable nonparametric estimators than for parametric estimators. In this paper, we consider a system of regression equations that can seem unrelated, but actually are because their errors are correlated. Such a system of equations is called a set of&seemingly unrelated'regressions, or a SUR model (Zellner, 1962). This paper provides a Bayesian framework for reliably estimating the regression functions in a nonparametric manner, even for moderate sample sizes, by taking advant-age of the correlation structure in the errors. The most important consequence of this work is to show that if the errors are correlated, better nonparametric estimators are obtained by taking advantage of this correlation structure com-pared to ignoring the correlation and estimating the equations one at a time.

Speci"cally, we consider the system ofmregression equations

yi"fi(xi)#ei, for i"1, 2,2,m. (1.1)

Here, the superscript denotes that this is theith ofmpossible regressions,yiis the dependent variable,xiis a vector of independent variables andf1,2,fmare

functions that require estimating in a nonparametric manner. As in the linear Gaussian SUR model, the regressions are related through the correlation structure of the Gaussian errorsei. That is,

e&N(0,R?I

n), (1.2)

wheree@"(e1@, e2@,2,em@),eiis the vector of errors for thenobservations of the ith regression and R is a positive-de"nite (m]m) matrix that also requires estimation. This paper provides a data-driven procedure for estimating the unknown functionsfi(fori"1,2,m) and covariance matrixRin this model.

(3)

along with a temperature e!ect. In both examples, signi"cant nonlinear relation-ships are identi"ed that are di$cult to discern using a parametric SUR ap-proach. In addition, substantial correlation is estimated between the regressions and the function estimates di!er substantially from those obtained by estimating each of the nonparametric regressions separately and ignoring the correlation between the equations.

Our approach for estimating the system of equations de"ned at (1.1) and (1.2) models each of the functions fi as a linear combination of basis terms. We develop a Bayesian hierarchical model to explicitly parameterize the possibility that these terms may be super#uous and have corresponding coe$cient values that are exactly zero. A wide variety of bases can be used, including many with a desired structure, such as periodicity or additivity, a point which is demon-strated in the empirical examples. The unknown regression functions are esti-mated by their posterior means which attach the proper posterior probability to each subset of the basis elements, providing a nonparametric estimate that is both#exible and smooth. We develop a Markov chain Monte Carlo (MCMC) sampling scheme to calculate the posterior means because direct evaluation is intractable. This sampling scheme is a correction of the &focused sampler'

discussed in Wong et al. (1997) and our empirical work shows it to be reliable and much more e$cient than the Gibbs sampling alternative. We prove that the iterates of the focused sampler converge to the correct posterior distribution. The performance of the new estimator is investigated empirically with a set of simulation experiments that cover a range of potential regression curves. These demonstrate the improvement that can be obtained by exploiting the correla-tion structure in a system of regressions. We note that the solucorrela-tion to the nonparametric SUR model discussed in this paper is easily extended to other nonparametric multivariate (or vector) regression models.

(4)

independent variables, but not the dependent variable. Such an approach is an unsatisfactory way of estimating the smoothing parameters because it does not take into account the curvature exhibited by the dependent variable. Nor is it fully automatic because the e!ective degrees of freedom is required as an input from the user.

The paper is organized as follows. Section 2.1 discusses how to model the unknown functions and why they are estimated using a hierarchical model. The rest of Section 2 introduces the Bayesian hierarchical SUR model and develops an e$cient MCMC sampling scheme to enable its estimation. Section 3 uses the methodology to"t the print advertising and electricity load datasets. Section 4 contains simulation examples which investigate the improvements that can be made using this estimation procedure over a series of separate nonparametric regressions. Appendix A provides the conditional posterior distributions em-ployed in the sampling scheme, while Appendix B proves that the focused sampling step provides an iterate from the correct invariant distribution.

2. Methodology

2.1. Basis representation of functions

Each regression function is modeled as a linear combination of basis func-tions, so that for a functionf,

f(x)"₊p

A large number of authors use such linear decompositions with a variety of univariate and higher dimensional bases in the single equation case. For example, Friedman (1991), Smith and Kohn (1996) and Denison et al. (1998) use regression splines, Luo and Wahba (1997) use several reproducing kernel bases and Wahba (1990) uses natural splines. In particular, orthonormal bases, such as wavelet (Donoho and Johnstone, 1994) or Fourier bases have been used. However, the computational advantage provided by such orthonormal bases does not easily extend to the case where the errors are correlated, such as in the SUR model. In the case of multiple regressors in an equation, additive models of univariate bases or radial bases (Powell, 1987; Holmes and Mallick, 1998) can be used.

Given a choice of a particular basis for the approximation at (2.1), the ith regression at (1.1) can be written as the linear model

(5)

Here,yiis the vector of thenobservations of the dependent variable, the design matrix Xi"[b

1Db2D2Dbpi],b_j is a vector of the values of the basis function

b

j evaluated at the nobservations and b are the regression coe$cients. The

errorseiare correlated with those from the other regressions, as speci"ed in (1.2), and we denote the number of basis terms in theith equation as pi. Note that many basis expansions employ pi*n basis terms and it is inappropriate to estimate the regression coe$cients using existing SUR methodology because the function estimatesfKi, i"1,2,m, would interpolate the data (rather than

pro-duce smooth estimates that account for the existence of noise in the regression). Therefore, we estimate the regression parameters using a Bayesian hierarchical SUR model described below that explicitly accounts for the possibility that many of these terms may be redundant.

2.2. A Bayesian hierarchical SUR model

Consider theith regression of a linear SUR model given at Eq. (2.2), where the design matrixXiis (n]pi) and the coe$cient vectorbiis of lengthpi. To account explicitly for the notion that variables in this regression can be redundant, we introduce a vector of binary indicator variables ci"(ci₁, ci₂,2, cipi)@. Here,

c_ikcorresponds to thekth element of the coe$cient vector of theith regression, sayb_ik, withc_ik"0 ifb_ik"0 andc_ik"1 ifb_ikO0. By dropping the redundant terms with zero coe$cients, theith regression can be rewritten, conditional onci, as

y_i"Xi_cibi_ci#ei. (2.3)

Ifq_i

c"+pj/1i cij, then the design matrixXiciis of size (n]qi_c) andbi_ci is a vector of

q_i

c_{By stacking the linear models for the}elements. _m_{regressions, the SUR model can be}

written, conditional onc@"(c1@, c2@,2, cm@), as

complete this Bayesian hierarchical model, we introduce the following priors on the parameters.

(i) Following O'Hagan (1995) we construct a conditional prior forb_cby setting

p(b_cDR,c)Jp(yDb_c,c,R)1@n

so that b_cDR, c&N(l(

c,n(X@cAXc)~1), where A"R~1?In and

l(

c"(X@cAXc)~1X@cAy. This data-based fractional prior contains much less

(6)

does not depend on the data. However, we prefer the data-based prior because the conditional posterior mean ofb_c is unbiased.

(ii) The prior forR~1is taken as independent ofcand is the commonly used non-informative prior discussed in Zellner (1971), where

p(R~1Dc)JDR~1D~(m`1)@2.

(iii) The indicator variables c_ik, k"1,2,pi,i"1,2,m, are taken a priori

independently distributed, withp(c_ik"1Dui)"ui.

(iv) The hyperparametersui, i"1,2,m, are taken as independent and given

a non-informative uniform prior on (0, 1).

We integrate the hyperparametersx"(u1,2, um) out of our analysis, so that suggested in Smith and Kohn (1996). Note that the model here is a hierarchical SUR model as, conditional onc, it is simply a linear SUR model.

2.3. Markovchain Monte Carlo sampling

To estimate this model we use the following Markov chain Monte Carlo sampling scheme with the hyperparameterxintegrated out.

(1) Generate fromb_cDR~1,c, y

(2) Generate fromR~1Db,c,y"R~1_Db_c, c, y

(3) Fori"1, 2,2,m

Forj"1, 2,2,pi

Generate fromci_jDR~1, cCci_j,yusing the sampling step described below. In this sampling schemeb_cis generated from a multivariate normal. Generation of the matrixR~1directly from the posterior at step (2) is di$cult because the fractional prior b

cDR~1, cis centered at l(c, which is a function ofR.

Conse-quently, we use a Metropolis}Hastings step where the proposal Wishart density is the posterior under a #at conditional prior for b_c. This works well with between 60% and 90% of those iterates that are generated being accepted. Details of how to generate from the distributions at steps (1) and (2) are given in Appendix A. It is important to note that care has been taken to generate

c_ijwithout conditioning onb_ij at step (3), otherwise the sampling scheme would be reducible becausec_ij is known exactly givenb_ij.

Step (3) generates an iterate ofcone element at a time. As discussed in Wong et al. (1997), using a Gibbs sampler is computationally demanding becausechas

(7)

algorithm. Letn(c_ij)"p(c_ijDR~1,cCc_ij, y) be the conditional posterior density of

c_ijands(c_ij)"p(c_ijDcCc_ij) be its conditional prior density, withxintegrated out in both cases. Then, if c0-$ is the previous value ofc_ij, a new value c/%8 can be generated as follows.

Sampling Step

(a) Proposal

Ifc0-$"1 then generatec/%8 from the proposal density

Q(c0-$"1P_c/%8"0)"s(c_ij"0)min

A

1, n(cij"0)

s(c_ij"0)

B

Ifc0-$"0 then generatec/%8 from the proposal density

Q(c0-$"0Pc/%8"1)"s(c_ij"1)min

A

1, n(cij"1)

s(c_ij"1)

B

(b) Metropolis}Hastings acceptance probabilities

If c0-$"0 and c/%8"1, then accept c/%8 with probability

a₀₁"minM1,s(ci_j"0)/n(ci_j"0)N; otherwise setc/%8"0.

If c0-$"1 and c/%8"0, then accept c/%8 with probability

a₁₀"minM1,s(ci_j"1)/n(ci_j"1)N; otherwise setc/%8"1.

Appendix A calculates the posterior n and prior s, with the latter being a trivial calculation. Generating from the proposal in part (a) can be undertaken e$ciently as follows.

First generateufrom a uniform distribution on (0, 1), then

(i) ifc0-$"1 andu(s(c_ij"1), setc/%8"1.

(ii) if c0-$"1 and u's(c_ij"1), generate c/%8 from the density

p(c/%8"0)"min(1,n(c_ij"0)/s(c_ij"0)). (iii) ifc0-$"0 andu(s(c_ij"0), then setc/%8"0.

(iv) if c0-$"0 and u's(c_ij"0), generate c/%8 from the density

p(c/%8"1)"min(1,n(c_ij"1)/s(c_ij"1)).

(8)

Appendix B proves that the sampling method described above is a correct application of the Metropolis}Hastings method and therefore that

n(c_ij)"p(c_ijDR~1, cCc_ij, y) is the invariant distribution of each step. The appendix also includes a lemma demonstrating that the Metropolis}Hastings ratios at part (b) of the sampling step will usually be one, or close to one. Because steps (1)}(3) of the sampling scheme either generate directly from the conditional posterior distributions, or use a Metropolis}Hastings step, the scheme con-verges to its invariant distribution, which is the posterior distribution

R~1, c,bDy(Tierney, 1994).

The sampling scheme is much faster because step (3) involves a much smaller number of complex calculations than the full Gibbs sampler. Moreover, focus-ing the generations is especially important in nonparametric SUR models compared to a single equation regression, because there are more basis terms. We have found this sampler to have strong empirical convergence properties, a point that is demonstrated in the examples in Sections 3 and 4. These sections also compare the sampler to the Gibbs alternative and demonstrate that it is more computationally e$cient.

A sampler that generates solely from the parameter space of c is not con-sidered as it is di$cult to calculate the posterior distributionc_ijDcCc_ij, y. Similarly, samplers that generate from either the parameter space of (c, R~1) or (c,b) are not considered because it is di$cult to recognize the conditional posterior distributionR~1Dc, y, or calculatec_ijDcCc_ij,bc, y.

2.4. Parameter estimation

Given an initial state for the Markov chain and a &warmup period', after which the sampler is assumed to have converged to the joint posterior distribu-tion, we can collect iterates (R~1*1+, c*1+,b*1+),2, (R~1*J+, c*J+, b*J+) which form

a Monte Carlo sample from the joint posterior distribution. It is this sample that is used for inference.

The posterior mean E[RDy] is estimated by the histogram estimate RK"

((1/J)+_Jj_/1R~1*j+)~1. We do not use a mixture estimate because the distribution of R~1Dbc, c, yis di$cult to identify (which is also the reason a Metropolis}

Hastings step is used at step (2) of the sampler).

The posterior mean of the regression parameters, E[bD y], is estimated using the mixture estimate Each of the conditional expectations in the sum is simple to calculate because E[bcDc, R~1, y]"l(c, while elements ofbthat are not common tobcare set to

(9)

The posterior means E[fi(z)Dy] of the functions at Eq. (1.1) at any pointzin the domain ofxiis estimated using the mixture estimate

fKi(z)"1 estimate of a curve, while for higher dimensions it is a surface. For additive nonparametric models the component function estimates can easily be cal-culated separately by identifying the basis terms and regression coe$cient estimates that correspond to each function and forming the inner product of these.

2.5. Standardized residuals

A common diagnostic in parameteric models are estimates of the standard-ized errorsr"(R?I

n)e, withR@R"R~1, which can be shown to be distributed

N(0,I

nm). We estimate the posterior mean E[rDy] of these using the histogram

estimate

e*j+"y!Xb*j+. After calculation, these standardized residuals can be separated into m sets, one for each regression, so that r("(r(1@,2,r(m@). Such residuals

should be approximately N(0,I

nm).

3. Real data examples

In this section we apply the approach to two systems where the errors are likely to be correlated across equations and the functional forms of the relation-shipsfiare unknown a priori and require estimation. In such cases, we show that the function estimates di!er substantially from those obtained using the equation-by-equation equivalent estimator. We also show that the sampling scheme results in a major decrease in the computational burden of generating

(10)

3.1. Print advertising data

We demonstrate our procedure usingn"457 observations of data from six issues of an Australian monthly women's magazine collected by Starch INRA Hooper. Each observation corresponds to an advertisement placed in the magazine and the following three advertisement exposure scores, which are recorded from an experimental audience, are used as measures of the various levels of e!ectiveness of the print advertisement.

y1(noted score): Proportion of respondents who claim to recognize the ad as

having been seen by them in that issue.

y2(associated score): Proportion of the respondents who claim to have

noticed the advertiser's brand or company name or logo.

y3(read-most score): Proportion of respondents who claim to have read

half or more of the copy.

These scores are considered to measure advertisement exposure at increasing levels of depth. It has long been thought that the positioning of an advertisement within an issue a!ects its exposure to an audience (Hanssens and Weitz, 1980). To quantify this we constructed the variable P"(page number)/(number of pages in issue) to represent the position in the issue in which each advertisement appeared.

To estimate how the exposure of a print advertisement is a!ected by its position in the magazine, we considered the three nonparametric regressions

yi"fi(P)#ei for i"1, 2, 3. (3.1)

To model each unknown function we use a thin plate spline basis (Powell, 1987), with basis terms b

j(x)"Dx!kjD2log(Dx!kjD), j"1,2,n, wherek1,2,kn are

the so-called&knots'which we set equal to thenobservations of the independent variable. Following previous authors (Wahba, 1990) we also augment these basis terms with an intercept and linear term, so that p1"p2"p3"459. Expected features in the functionsfiinclude high casual attention to advertise-ments placed in the front (and to a lesser extent back) of the magazine, while the pre-editorial slots (whereP is about 0.7) are thought to attract more indepth attention. The three scoresy1,y2andy3are highly positively correlated and it is likely that the errors are correlated, so a SUR model appears appropriate.

(11)

Fig. 1. (a)}(c) Function estimates off1,f2andf3, respectively. Bold lines are the nonparametric SUR estimates, while the dashed lines are the single equation estimates. Panels (d)}(f) contain scatter plots ofPversus the standardized uncorrelated residuals resulting from the nonparametric SUR"t.

The equations at (3.1) were estimated both as a SUR system and individually with the analogous single equation estimator that ignores any correlation between the equations. The resulting function estimates are plotted in Figs. 1(a)}(c) and it can be seen that the SUR and equation-by-equation estimates di!er substantially. The estimate RK from the SUR model is given below (correlations in italics)

RK"

C

2.166]10~2 2.103]10~2 1.205]10~2

0.956 2.237]10~2 1.275]10~2

0.822 0.856 0.992]10~2

D

(12)

Table 1

Ratio of variances for the proposed sampler over the Gibbs sampler for the function estimates off1,

f2andf3calculated at four points on the domain ofP

Eq. (1) Eq. (2) Eq. (3)

P"0.2 2.028 3.027 2.195

P"0.4 6.564 4.684 6.484

P"0.6 2.546 3.407 7.102

P"0.8 4.174 7.634 6.660

(NSUR) estimates correctly capture the nonlinear relationships between

y1,y2,y3 and P, we calculated the standardized uncorrelated residuals

r(i, i"1, 2, 3. Figs. 1(d)}(f) plot these residuals againstP, and these con"rm that there is no signi"cant nonlinear relationship between the residuals andP.

We also estimated the NSUR model using the sampling scheme outlined in Section 2.3, but using a Gibbs sampler where the elementsc_ij were generated at step (3) directly from their posteriors. With both sampling schemes, we used a warmup period of 2000 iterations and a further 1000 iterates for inference, resulting in 459]3]3000 generations from the conditional posterior at (A.1). In comparison, the proposed sampler took only 91,511 computationally equivalent generations, 2.2% of the number required by the Gibbs sampler. Because the Gibbs sampler involves more generations, it would be expected to result in more random iterates than the proposed sampler. To measure this, we calculated the variances of the function estimates. Table 1 provides the ratio of the variance for the proposed sampler over that of the Gibbs sampler. It does so for function estimates calculated atP"0.2, 0.4, 0.6, 0.8 for all three regression equations. It reveals that the proposed sampler needs to be run for two to seven times as many iterations as the Gibbs sampler to obtain function estimates with the same variances. However, this increased number of iterations still takes only 4}14% of the time required by the Gibbs sampler and is therefore substantially more e$cient overall. We note that function estimates are averages of terms in a stationary sequence whose variances are computed as in Box et al. (1994, p. 34).

3.2. Intra-day electricity load model

(13)

a model for three weeks of half-hourly electricity load data from the adjacent Australian states of New South Wales (NSW) and Victoria (VIC). Let¸1and

¸2be the half-hourly average electricity load of NSW and VIC, respectively,

¹ODand ¹O=_{be the time of day and week of the half-hourly observations}

normalized to [0, 1) and ¹ be the temperature in the NSW state capital of Sydney. We estimate the additive nonparametric SUR:

¸1"f₁1(¹OD)#f₂1(¹O=₎#f₃1(¹)#e1,

¸₂"f₁2(¹OD)#f₂2(¹O=₎#e2, (3.2)

where the periodic pro"le of electricity load has been decomposed into periodic daily (f₁i) and weekly (f₂i) e!ects for both states. An additive temperature e!ect is considered for NSW, but we do not include one for VIC because half-hourly temperature data were not available to us for this state.

We model bothf₁i andf₂i with a periodic quadratic reproducing basis (Luo and Wahba, 1997) whereb

i(x)"((Dx!kiD!12)2!1/12)/2. We place the knots

k

i at all the possible observations, so that for the time of day e!ect

k

i"i/48,i"1,2, 48, and for the weekly e!ectki"i/336,i"1,2, 336. This

basis was chosen because it ensures that the functionsf₁1, f₂1,f₁2andf₂2are all periodic on [0, 1). The temperature e!ect was modeled using a thin plate spline with knots at all observations.

The estimated covariance is (correlation in italics)

RK"

C

29088.13 26093.55

0.5208 86299.88

D

,

where var(e1)(var(e2) because we control for the variation in temperature in the NSW equation. The positive correlation between the states is likely to be due, in part, to common weather e!ects and television programming. Figs. 2(a) and (b) plot the sum of the estimated daily and weekly e!ects for both states against¹O=_{, while the estimated additive Sydney temperature e}_!_{ect is given in}

(14)

Fig. 2. Sum of the estimated daily and weekly periodic e!ects for VIC (panel (a)) and NSW (panel (b)). The dashed curves are those resulting from estimation using the proposed sampler and the bold curves are those resulting from estimation using a full Gibbs sampler. Both estimates are plotted together for comparison, although they are almost identical and are hard to visually distinguish.

of the di!erence is about 250 and 400 megawatts for VIC and NSW, respectively, which is substantial compared to the estimates of Jvar(e1) and Jvar(e2). Because the load pro"les devolve slowly during the year and are e!ectively only static over periods of around two to three weeks, larger sample sizes cannot be used and it is useful to exploit the correlation between equations.

(15)

Fig. 3. Same as in Fig. 2, but for the estimated Sydney additive temperature e!ectfK₃1.

full Gibbs sampler was run using the same number of iterations for the warmup and sampling periods as the proposed sampler. Figs. 2 and 3 compare the resulting curve estimates for the Gibbs sampler (bold lines) and proposed sampler (dashed lines). The curve estimates are virtually identical and sometimes the two curve estimates are so similar that the dashed lines cannot be made out. It is di$cult to see how this could occur if the proposed sampler was not converging to the same distribution as the Gibbs sampler and the sampling scheme was not mixing well with respect to the indicator variablesc. The results are identical regardless of initial starting state for either sampler.

4. Simulation experiments

4.1. Positively correlated univariate regressions

(16)

Fig. 4. The di!erence between the SUR and single equation estimates of the sum of the daily and weekly periodic e!ects. Panel (a) is for the Victorian equation and panel (b) is for the New South Wales equation.

the estimates that can be obtained when such correlation is modeled rather than ignored. There are m"4 univariate regressions, with the covariance matrix speci"ed below at (4.1). Note that the standard deviation of the errors (var(ei))1@2"1 is high compared to the range of the functions.

Four true functions were carefully chosen to represent a wide variety of possible relationships. These are f1(x)"sin(8px) (which is highly oscillatory),

f2(x)"(/(x, 0.2, 0.25)#/(x, 0.6, 0.2))/4, with/(x, a, b) being a normal density of meanaand standard deviationb, (which requires a locally adaptive estimator as there are di!erent degrees of smoothness on the left and right of the function),

f3(x)"1.5x (which was chosen as many relationships are often thought to be linear) and f4(x)"cos(2px) (which is a smooth nonlinear function). The independent variables for the four univariate regressions were x1&U(0, 1),

x2&U(0, 1) and

A

x3

x4

B

&N

AA

0.5 0.5

B

, 0.3

C

(17)

We generatedn"100 data points from this true SUR model and applied the nonparametric SUR estimator to this data. We use a quartic reproducing kernel basis (Luo and Wahba, 1997) with basis terms

b

augmented with an intercept and linear term.

To assess the quality of the resulting estimates of the four functions, we calculated the mean squared di!erence between the function estimates and the true functions. This measure of distance between the two is de"ned as

MSD

the domain ofxi. For the same data we also"t an analogous single equation univariate nonparametric estimator to the four regressions. The mean squared di!erence was also calculated for each of these four function estimates.

The entire process was repeated one hundred times. Figs. 5(a)}(d) give box-plots of the one hundred resulting values of log(MSD

i) for each of the four

functions (i"1, 2, 3, 4) and for both the nonparametric SUR estimator (NSUR) and individual nonparametric estimators (NR). Fig. 5 shows that taking into account the correlation between the errors has substantially and consistently improved the resulting estimates of all the regression functions.

To examine the qualitative improvement that occurs, we focus on the single data set corresponding to the 50th sorted value of +4

i/1MSDi for the

non-parametric SUR estimator. This data set can be regarded as providing a&typical'

example of the procedure and is plotted as four scatter plots in Figs. 5(e)}(h) and again in Figs. 5(i)}(l). The nonparametric SUR estimates of the four functions for this data appear in Figs. 5(e)}(h) and the estimates for the separate nonparamet-ric regressions appear in Figs. 5(i)}(l). These"gures show that the nonparametric SUR estimator signi"cantly outperforms the separate nonparametric estimators which ignore the correlation between the separate regressions. The varianceRof the errors and its estimateRK for this data set are given below.

(18)

Fig. 5. (a)}(d) Boxplots of the log(MSD

i) fori"1, 2, 3 and 4, respectively. The left-hand boxplot is

(19)

Table 2

Time taken by the NSUR procedure to complete a"t to data generated form the model in Section 4.1 for four di!erent sample sizes. In the sampling procedure 2000 iterates were used for the warmup and the Monte Carlo sample consisted of a subsequent 1000 iterates. The machine used was a low end workstation

Sample size n"100 n"200 n"400 n"800

Time (s) 43 58 213 3850

The estimate compares favorably to the&best possible'estimateRK _"%45that arises from the sample variance of the true errors themselves, which are known because this is a simulated example.

To demonstrate that the NSUR estimator is practical to implement, we report the time required to"t this four equation system for a variety of sample sizes in Table 2. The computer used was a low-end modern workstation while the code was written e$ciently in FORTRAN. Although these timings are implementa-tion dependent, they do indicate that this computaimplementa-tionally intensive Markov chain Monte Carlo procedure runs in a reasonable time.

4.2. Unrelated regressions

The previous simulation considers a related set of regressions and demon-strates the improvements in the regression function estimates that can occur when correlation between the regressions is modeled and estimated, rather than ignored. However, is there a risk of degrading the function estimates by modeling correlation that does not exist?

To investigate this case, we repeat the simulation undertaken above, except that the regressions are unrelated, withR"0.5I

4. Fig. 6 provides the equivalent

output for this case as was produced in Fig. 5. It can be seen from the boxplots in Figs. 6(a)}(d) that, in general, there is a slight deterioration in the log(MSD

i) for

(20)

(21)

Nevertheless, in examples whenmis large, reliably estimating them(m#1)/2 free parameters ofRin addition to the unknown functions may prove di$cult. In such cases, a more parsimonious representaton for R can be used and estimated in combination with the unknown functions using MCMC. For example, Shephard and Pitt (1998) discuss the MCMC estimation of factor decompositions for R, while Smith and Kohn (1999) examine parsimonious representations forRbased on the Cholesky decomposition ofR~1.

Acknowledgements

The authors would like to thank Denzil Fiebig, Paul Kofman, Steve Marron, Gael Martin and Tom Smith for useful comments. They would also like to thank two anonymous referees for improving the clarity of the presentation and for pointing out an error in a previous implementation of the sampling scheme. Both Michael Smith and Robert Kohn are grateful for the support of Australian Research Council grants.

Appendix A

A.1. Generating fromb_cDR~1,c,y

This conditional distribution can be calculated exactly, as

p(b_cDR~1, c, y)Jp(yDb_c, R~1, c)p(b_cDR~1, c)

Jexp

G

!1

2

A

n#1

n

B

(bc!l(c)@X@cAXc(bc!l(c)

H

,

so that

b

cDR~1, c, y&N

A

l(c, n

n#1(X@cAXc)~1

B

.

Here,l(

c andAare de"ned in Section 2.2.

A.2. Generating fromR~1Db_c,c,y

(22)

to this method. The proposal density from which we generate a candidate iterate is given by

q(R~1)Jp(yDb_c, R~1,c)p(R~1Dc)

J_DR~1_D(n~m~1)@2exp

G

!1

2tr(XR~1)

H

which is a Wishart(X~1,n,m) density. Here, X is an (m]m) matrix with ijth elementu_ij"(yi!Xi_cibi_ci)@(yj!Xj_cjbj_cj). A newly generated iterateR~1_/%8 is

ac-cepted over the old valueR~1_0-$ with probability

a"min

A

p(R~1/%8Dbc, c, y)q(R~10-$)

p(R~1_0-$Db_c, c,y)q(R~1_/%8), 1

B

"min

A

p(b_cDR~1_/%8,c)

p(b_cDR~1_0-$, c), 1

B

.

High acceptance rates of 60}90% are obtained because the proposal densityq())

is equal to the correct conditional density except for the factorp(b_cDR~1, c).

A.3. Calculatingc_ijDR~1, cCc_ij,y

This density requires calculation to enable generation ofc_ij in the sampling scheme.

p(ci_jDR~1, cCci_j, y)J

P

p(yDc,R~1, b_c)p(b_cDc, R~1) db_cp(c)

J(n#1)~qc_@2exp

G

!1

2S(c,R~1)

H

p(cijDcCcij) (A.1) where S(c,R~1)"y@Ay!y@AX

c(X@cAXc)~1X@cAyand in Eq. (A.1) the

regres-sion coe$cient is integrated out using

b_c&N

A

l(_c, n

n#1(X@cAXc)~1

B

.

The conditional posterior can then be calculated by evaluating (A.1) forc_ij"1 andc_ij"0 and normalizing.

A.4. Calculatingp(c_ijDcCc_ij)

The conditional prior can be calculated by integrating out the hyperpriorui, with p(c_ijDcCc_ij)J_:1

(23)

p(c_ij"1DcCc_ij)"1/(1#h), where h"(pi!ai)/(ai#1) and ai"₊

kEjcik is the

number of elements ofciCc_ij that are one.

Appendix B

This appendix provides a proof that the invariant distribution of the proposed sampling step in Section 2.3 isn(c_ij)"p(c_ijDR~1, cCc_ij,y).

Proof is as follows. Asc/%8is a binary variable, the proof is undertaken by calculating the Metropolis}Hastings ratio for the two transitions (c0-$"0Pc/%8"1) and (c0-$"1Pc/%8"0) and showing that it is applied correctly in step (b) of the sampling step. Let Q(c0-$Pc/%8) be the proposal density, then

Q(1P0)"s(0)minM1,n(0)/s(0)N and Q(0P1)"s(1)minM1,n(1)/s(1)N

(B.1)

The Metropolis}Hastings ratios are

a₀₁"min

G

1,n(1)Q(1P0)

n(0)Q(0P1)

H

, a10"min

G

1,

n(0)Q(0P1)

n(1)Q(1P0)

H

.

Because n(1)"1!_n(0) and s(1)"1!s(0), n(1)/s(1)'1 i! n(0)/s(0)(1 and

n(1)/s(1)(1 i!n(0)/s(0)'1. Using these inequalities and substituting the pro-posal densities at (B.1) into the ratios, it can be seen thata₀₁"min(1,s(0)/n(0)) anda₁₀"min(1,s(1)/n(1)). These correspond to the ratios applied in part (b) of the sampling step. Hence the sampling step is reversible and its invariant distribution isn(ci_j).

Lemma.Let¸(0)"p(yDci

j"0,cCcij,R)and¸(1)"p(yDcij"1,cCcij,R)be the

likeli-hoods forci_j"0andci_j"1.Then,

a₀₁"s(0)#s(1)min(1,¸(1)/¸(0)) and a₁₀"s(1)#s(0)min(1,¸(0)/¸(1)) (B.2)

ProofNote that

n(1)" s(1)¸(1)

s(1)¸(1)#s(0)¸(0) and n(0)"

s(0)¸(0)

s(1)¸(1)#s(0)¸(0).

(24)

The importance of this lemma is it reveals that the Metropolis}Hastings ratios will usually be one, or close to one. Because s(1)#s(0)"1, a₀₁"1 if

¸(1)/¸(0)*1, whilea₁₀"1 if¸(0)/¸(1)*1. Also, even if¸(1)/¸(0)(1, because in the nonparametric regression problem s(0) is usually very close to 1,

a₀₁'s(0) is close to 1.

References

Bartels, R., Fiebig, D., Plumb, M., 1996. Gas or Electricity, which is cheaper?: an econometric approach with an application to Australian expendicture data. Energy Journal 17, 33}58. Box, G., Jenkins, G., Reinsel, G., 1994. Time Series Analysis, 3rd Edition. Prentice-Hall, New Jersey. Chib, S., Greenberg, E., 1995a. Hierarchical analysis of SUR models with extensions to correlated

serial errors and time varying parameter models. Journal of Econometrics 68, 339}360. Chib, S., Greenberg, E., 1995b. Understanding the Metropolis}Hastings algorithm. The American

Statisitician 49, 327}335.

Denison, D., Mallick, B., Smith, A., 1998. Automatic bayesian curve"tting. Journal of the Royal Statistical Society, Series B 60, 333}350.

Donoho, D., Johnstone, I., 1994. Ideal spatial adaptation by wavelet shrinkage. Biometrika 81, 425}455.

Friedman, J., 1991. Multivariate adaptive regression splines. The Annals of Statistics (with dis-cussion) 19, 1}141.

Hanssens, D., Weitz, B., 1980. The e!ectiveness of industrial print advertisements across product categories. Journal of Marketing Research 17, 294}306.

Harvey, A., Koopman, S., 1993. Forecasting hourly electricity demand using time-varying splines. Journal of the American Statistical Association 88 (424), 1228}1253.

O'Hagan, A., 1995. Fractional Bayes factors for model comparison (with discussion). Journal of Royal Statistical Society Series B 57, 99}138.

Holmes, C., Mallick, B., 1998. Bayesian radial basis functions of variable dimension. Neural Computation 10, 1217}1233.

Luo, Z., Wahba, G., 1997. Hybrid adaptive splines. Journal of the American Statistical Association 92, 107}116.

Mandy, D., Martins-Filho, C., 1993. Seemingly unrelated regressions under additive heteroskedas-ticity: theory and share equation applications. Journal of Econometrics 58, 315}346. Min, C., Zellner, A., 1993. Bayesian and non-Bayesian methods for combining models and forecasts

with applications to forecasting international growth rates. Journal of Econometrics 56, 89}118. Powell, M., 1987. Radial basis functions for multivariate interpolation: a review. In: Mason, J., Cox,

M. (Eds.), Algorithms for Approximation.

Shephard, N., Pitt, M., 1998. Analysis of time varying covariances: a factor stochastic volatility approach. Preprint.

Smith, M., 2000. Modeling and short term forecasting of new south Wales electricity system load. Journal of Business and Economic Statistics, to appear.

Smith, M., Kohn, R., 1996. Nonparametric regression via Bayesian variable selection. Journal of Econometrics 75 (2), 317}344.

Smith, M., Kohn, R., 1999. Bayesian parsimonious covariance matrix estimation. Preprint. Srivastava, V., Giles, D., 1987. Seemingly Unrelated Regression Equations Models. Marcel Dekker,

New York.

(25)

Wahba, G., 1990. Spline Models for Observational Data. SIAM, Philadelphia.

Wong, F., Hansen, M., Kohn, R., Smith, M., 1997. Focused sampling and its application to nonparametric and robust regression. Preprint.

Yee, T., Wild, C., 1996. Vector generalised additive models. Journal of the Royal Statistical Society, Series B 58, 481}493.

Zellner, A., 1962. An e$cient method of estimating seemingly unrelated regression equations and tests for aggregation bias. Journal of the American Statistical Association 57, 500}509. Zellner, A., 1963. Estimators for seemingly unrelated regression equations: some exact"nite sample

results. Journal of the American Statistical Association 58, 977}992.