
A Note on Rubin’s Statistical Matching Using File Concatenation With Adjusted Weights and Multiple Imputations

Journal of Business & Economic Statistics, January 2003, Vol. 21, No. 1, 65–73. DOI 10.1198/073500102288618766. ©2003 American Statistical Association.

Chris Moriarity
U.S. General Accounting Office (chrismor@cpcug.org)

Fritz Scheuren
NORC, University of Chicago (scheuren@aol.com)

Statistical matching has been used for more than 30 years to combine information contained in two sample survey files. Rubin (1986) outlined an imputation procedure for statistical matching that is different from almost all other work on this topic. Here we evaluate and extend Rubin’s procedure.

KEY WORDS: Complex survey design; Multivariate normal; Predictive mean matching; Resampling; Robustness; Variance-covariance structures.

1. INTRODUCTION

Statistical matching began to be widely practiced with the availability of public use files in the 1960s (Citro and Hanushek 1991). Arguably, the desire to use this technique was even an impetus for the release of several early public use files, including those involving U.S. tax and census data (e.g., Okner 1972).

Statistical matching continues to be widely used by economists for policy microsimulation modeling in government, its original home (as the references herein attest). It also has begun to play a role in many business settings as well, especially as a way, in an era of data warehousing and data mining, to bridge across information silos in large organizations. These applications have not yet reached refereed journals, however, partly, we believe, because they are being treated as proprietary.

Despite their widespread and, in some areas, growing use, statistical matching techniques seem to have received insufficient attention regarding their theoretical properties. Statistical matching always has had an ad hoc flavor (Scheuren 1989), although parts of the subject have been examined with care (e.g., Cohen 1991; Rodgers 1984; Sims 1972). In this article we return to one of the important attempts to underpin practice with theory, the work of Don Rubin (1986), now more than 15 years old.

We begin by describing Rubin’s contribution. Needed refinements are then made that go hand-in-hand with advances in computing in the time since Rubin’s article was written. To frame the results presented, after this introduction (Sec. 1) we include a section (Sec. 2) titled “What is Statistical Matching?” This is followed by a restatement of Rubin’s original results, along with a detailed examination of the method’s properties (Sec. 3). Then in Section 4 we present improvements of Rubin’s procedure and develop their properties. In Section 5 we provide some simulation results that illustrate the new approaches, and in Section 6 discuss applications and generalizations of the approaches. Finally, in Section 7 we give a summary and draw some conclusions.

2. WHAT IS STATISTICAL MATCHING?

Perhaps the best description to date of statistical matching has been given by Rodgers (1984); other good descriptions have been given by Cohen (1991) and Radner, Allen, Gonzalez, Jabine, and Muller (1980). A brief summary of the method is provided here.

Suppose that there are two sample files, A and B, taken from two different surveys. Suppose further that file A contains potentially vector-valued variables (X, Y), whereas file B contains potentially vector-valued variables (X, Z). The objective of statistical matching is to combine these two files to obtain at least one file containing (X, Y, Z). In contrast to record linkage, or exact matching (e.g., Fellegi and Sunter 1969; Scheuren and Winkler 1993, 1997), the two files to be combined are not assumed to have records for the same entities. In statistical matching, the files are assumed to have little or no overlap, and hence records for similar entities are combined, rather than records for the same entities.

All statistical matches described in the literature have used the X variables in the two files as a bridge to create synthetic records containing (X, Y, Z). To illustrate, suppose that file A consisted in part of records

(X_1, Y_1),

(X_2, Y_2),

and

(X_3, Y_3),

whereas file B had records of the form

(X_1, Z_1),

(X_3, Z_3),


and

(X_4, Z_4).

The matching methodologies used almost always have made the assumption that (Y, Z) are conditionally independent given X, as pointed out initially by Sims (1972). From this assumption, it would be immediate that one could create

(X_1, Y_1, Z_1)

and

(X_3, Y_3, Z_3).

Notice that matching on X_1 and X_3 in no way implies that [(X_1, Y_1) and (X_1, Z_1)], or [(X_3, Y_3) and (X_3, Z_3)], are taken from the same entities.

What to do with the remaining records is less clear, and techniques vary. Broadly, the various strategies used for statistical matching can be grouped into two general categories, “constrained” and “unconstrained.” Constrained statistical matching requires the use of all records in the two files and basically preserves the marginal Y and Z distributions (e.g., Barr, Stewart, and Turner 1982). In the foregoing (simplistic) example, for a constrained match one would have to end up with a combined file that also had records

(X_2, Y_2, Z_?)

and

(X_4, Y_?, Z_4).

Unconstrained matching does not have this requirement, and one might stop after creating (X_2, Y_2, Z_?). How the statistical matching procedure defined records to be similar would determine the values of the variables without specific subscripts.

A number of practical issues, not part of our present scope, need to be addressed during a statistical matching process. Among these issues are alignment of universes (i.e., agreement of the weighted sums of the data files) and alignment of units of analysis (i.e., individual records representing the same units). Usually, too, the bridging X variables can have different measurement or nonsampling properties in the two files (see Cohen 1991; Ingram, O’Hare, Scheuren, and Turek 2000 for further details).

Statistical matching is by no means the only way to combine information from two files. Sims (1978), for instance, described alternative methodologies to statistical matching that could be used under conditional independence. Other authors (e.g., Singh, Mantel, Kinack, and Rowe 1993; Paass 1986, 1989) have described methodologies for statistical matching if auxiliary information about the (Y, Z) relationship is available. Although an important special case, this option is seldom available (Ingram et al. 2000). (See also National Research Council 1992, where the subject of combining information has been taken up quite generally.)

Rodgers (1984) included a more detailed example of combining two files using both constrained and unconstrained matching than the example that we provide here. We encourage the interested reader to consult that reference for an illustration of how sample weights are used in the matching process.

3. RUBIN’S PROCEDURE FOR STATISTICAL MATCHING

In the framework described earlier, Rubin (1986) outlined a methodology for what he termed the “concatenation” of two sample files. Assuming a trivariate outcome file (X, Y, Z), where X could be vector-valued, Rubin suggested a methodology for multiple imputation (Rubin 1987, 1996) of Z in file A and Y in file B, based on several specified values of the partial correlation of Y and Z, given X. We first describe Rubin’s procedure, and then examine aspects of the procedure in more detail.

3.1 Description of the Procedure

Broadly, Rubin proposes starting his procedure by using regression. This in turn creates the predictions of the variables that he wants to use for the statistical matching. Finally, he advocates concatenating the resulting files.

Each of these steps is described briefly in this section, followed by a discussion in Section 3.2 of the procedure’s theoretical properties. As will be seen, many of these theoretical results are new and in places corrective. In fact, we would not advocate using Rubin’s approach without major modifications, as we specify in Sections 4 and 6.

3.1.1. Regression Step. Rubin’s procedure begins by postulating a value for the partial correlation of (Y, Z) given X, and calculating the regressions of Y on X in file A and Z on X in file B. The regression coefficients, the variances of the residuals from the two regressions, and the assumed value of the partial correlation of (Y, Z) given X then are used to construct the matrix

$$
\begin{pmatrix}
0 & R_{Y\,\mathrm{on}\,X} & R_{Z\,\mathrm{on}\,X} \\
-(R_{Y\,\mathrm{on}\,X})' & \mathrm{pvar}_{Y|X} & \mathrm{pcov}_{Y,Z|X} \\
-(R_{Z\,\mathrm{on}\,X})' & \mathrm{pcov}_{Y,Z|X} & \mathrm{pvar}_{Z|X}
\end{pmatrix}.
\tag{1}
$$

Here R_{Y on X} and R_{Z on X} are the column vectors of the regression coefficients of Y on X and Z on X, respectively, and −(R_{Y on X})' and −(R_{Z on X})' are the respective negative transposes. R_{Y on X} and R_{Z on X} are of dimension (m + 1) by 1, where the dimension of X is m. pcov_{Y,Z|X} is the partial covariance of (Y, Z) given X; pvar_{Y|X} is the partial variance of Y given X, and pvar_{Z|X} is the partial variance of Z given X. pvar_{Y|X} and pvar_{Z|X} are estimated using the variances of the residuals of the corresponding regressions, and pcov_{Y,Z|X} is estimated using the assumed value of the partial correlation of (Y, Z) given X, multiplied by the square root of (pvar_{Y|X} · pvar_{Z|X}).

The “sweep” matrix operator (Goodnight 1979; Seber 1977) is then applied to (1) to obtain estimates of R_{Y on X,Z} and R_{Z on X,Y}. Beginning with (1) and then “sweeping on Y” gives the matrix

$$
\begin{pmatrix}
\# & \# & (R_{Z\,\mathrm{on}\,X,Y})_{1,2,\ldots,n-1} \\
\# & \# & (R_{Z\,\mathrm{on}\,X,Y})_{n} \\
-(R_{Z\,\mathrm{on}\,X,Y})'_{1,2,\ldots,n-1} & -(R_{Z\,\mathrm{on}\,X,Y})'_{n} & \mathrm{pvar}_{Z|X,Y}
\end{pmatrix},
$$


and beginning with (1) and “sweeping on Z” gives the analogous matrix containing R_{Y on X,Z} and pvar_{Y|X,Z}. (The entries marked # are not needed in what follows.) Here R_{A on B,C} denotes the full set of regression coefficients of A on B and C.

R_{Y on X,Z} is used to obtain what we call the “primary” estimates of Y in file B, and R_{Z on X,Y} is used to obtain primary estimates of Z in file A. These primary predicted values then are used to produce what we call the “secondary” predicted values. R_{Y on X,Z} is used, along with observed X and predicted Z, to obtain secondary estimates of Y in file A, and R_{Z on X,Y} is used, along with observed X and predicted Y, to obtain secondary estimates of Z in file B.

At the completion of the estimation step, file A consists of observed X, observed Y, primary predicted Z, and secondary predicted Y. File B consists of observed X, observed Z, primary predicted Y, and secondary predicted Z.

Note that if the partial correlation of Y and Z, given X, is assumed to be 0, then the foregoing procedure simplifies. In this case, R_{Y on X,Z} equals R_{Y on X} and R_{Z on X,Y} equals R_{Z on X}.

We note in passing that the regression coefficients obtained by use of the sweep operator as described by Rubin also can be obtained by the regression method described by Kadane (1978). (See Goodnight 1979 or Seber 1977, chap. 12 for more details.) The methods are identical if applied to just one dataset, and differences in regression coefficients obtained from the two methods in the framework discussed here can be expected to be small if the datasets are large.
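To make the sweep-based regression step concrete, the following is a minimal sketch, not code from the article: it applies Goodnight's SWEEP operator to an ordinary covariance (here correlation) matrix of univariate (X, Y, Z), with the Y-Z entry implied by an assumed partial correlation, and reads off the regression coefficients of Z on (X, Y) and pvar_{Z|X,Y}. Rubin's matrix (1) is arranged differently but is swept in the same way; all names and numeric values below are our own.

```python
import numpy as np

def sweep(A, k):
    """Goodnight's (1979) SWEEP operator applied to symmetric matrix A, pivoting on index k."""
    A = A.astype(float).copy()
    d = A[k, k]
    A[k, :] = A[k, :] / d
    for i in range(A.shape[0]):
        if i != k:
            h = A[i, k]
            A[i, :] = A[i, :] - h * A[k, :]
            A[i, k] = -h / d
    A[k, k] = 1.0 / d
    return A

# Correlation matrix of (X, Y, Z) with zero means and unit variances, as in the
# simulations of Section 5.  The Y-Z entry is implied by an assumed partial
# correlation of (Y, Z) given X (the values 0.6, 0.5, 0.2 are arbitrary).
r_xy, r_xz, rho_yz_x = 0.6, 0.5, 0.2
pvar_y, pvar_z = 1 - r_xy**2, 1 - r_xz**2                  # partial variances given X
r_yz = r_xy * r_xz + rho_yz_x * np.sqrt(pvar_y * pvar_z)   # implied Corr(Y, Z)

S = np.array([[1.0,  r_xy, r_xz],
              [r_xy, 1.0,  r_yz],
              [r_xz, r_yz, 1.0]])

# Sweeping on the X and Y positions leaves the regression coefficients of Z on
# (X, Y) in the Z column and pvar_{Z|X,Y} in the (Z, Z) cell.
G = sweep(sweep(S, 0), 1)
R_Z_on_XY = G[:2, 2]         # slopes of Z on X and on Y (variables are centered)
pvar_Z_given_XY = G[2, 2]
```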

To replicate the regression coefficients shown in Rubin’s table 5, we need to use the sweep operator, because the datasets are very small (eight records in file A and six records in file B). However, even in this case, Kadane’s method produces regression coefficients of Y on X and Z in good agreement with Rubin’s method. (Note, however, that the agreement is not too good for the regression coefficients of Z on X and Y.)

3.1.2. Matching Step. The matching step in Rubin’s approach involves using unconstrained matches. The final value of Z assigned to the jth record in file A is obtained by doing an unconstrained match between the (primary) predicted value of Z for the jth record and the (secondary) predicted values of Z in file B. (The matching criterion is a minimum distance or nearest-neighbor approach.) The observed Z value in the matched record in file B is then assigned to the jth record in file A. Similarly, the final value of Y assigned to the ith record in file B is obtained by doing an unconstrained match between the (primary) predicted value of Y for the ith record and the (secondary) predicted values of Y in file A. The matching steps are done separately for Y and Z.
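As a small illustration of this step (a sketch under our own naming, not code from the article), an unconstrained nearest-neighbor match of file A's primary predicted Z against file B's secondary predicted Z might look like this:

```python
import numpy as np

def unconstrained_match(pred_A, pred_B, observed_B):
    """For each file A record, find the nearest file B record in predicted value
    and take its observed value; a file B record may serve as donor many times."""
    donor = np.abs(pred_A[:, None] - pred_B[None, :]).argmin(axis=1)
    return observed_B[donor]

# Hypothetical usage (the arrays would come from the regression step):
# z_final_A = unconstrained_match(z_primary_A, z_secondary_B, z_observed_B)
# y_final_B = unconstrained_match(y_primary_B, y_secondary_A, y_observed_A)
```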

Note that a careful reading of Rubin’s article is necessary to discern his methodology for the matching step. The statistical matching literature contains references that incorrectly state that Rubin’s procedure is to do an unconstrained match between the (primary) predicted value of Z for the jth record in file A and the observed values of Z in file B, and an unconstrained match between the (primary) predicted value of Y for the ith record in file B and the observed values of Y in file A.

3.1.3. Concatenation Step. Rubin then suggests concatenating the resulting statistically matched files and assigning the weight (w_A^{-1} + w_B^{-1})^{-1} to each record, where w_A is the weight corresponding to the file A portion of the record and w_B is the weight corresponding to the file B portion of the record.
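Numerically, the combined weight is a harmonic-mean-type quantity; a minimal sketch with made-up weights (array names are ours):

```python
import numpy as np

w_A = np.array([150.0, 200.0, 80.0])    # hypothetical file A weights
w_B = np.array([120.0, 300.0, 90.0])    # hypothetical file B weights
w_AB = 1.0 / (1.0 / w_A + 1.0 / w_B)    # e.g., 150 and 120 combine to about 66.7
```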

Rubin’s method is one of only two procedures described in the statistical matching literature for assessing the effect of alternative assumptions of the inestimable value cov(Y, Z). (See Kadane 1978, supplemented by Moriarity 2001, for the other procedure.)

Rubin suggests that his basic procedure be repeated for several assumed values of the partial correlation of (Y, Z) given X, as an operation akin to multiple imputation (Rubin 1987, 1996). Note that Kadane (1978) had already emphasized the necessity of repeating the matching procedure for a range of corr(Y, Z) values, thus anticipating Rubin’s multiple imputation concept as applied in this arena.

3.2 Further Aspects of Rubin’s Method

Here we discuss some aspects of Rubin’s method separately for the regression step, matching step, and concatenation step.

3.2.1. Regression Step. In this section we discuss Rubin’s method within the framework of (X, Y, Z) having a nonsingular multivariate normal distribution with mean (μ_X, μ_Y, μ_Z) and covariance matrix

$$
\Sigma =
\begin{pmatrix}
\Sigma_{XX} & \Sigma_{XY} & \Sigma_{XZ} \\
\Sigma_{YX} & \Sigma_{YY} & \Sigma_{YZ} \\
\Sigma_{ZX} & \Sigma_{ZY} & \Sigma_{ZZ}
\end{pmatrix}.
$$

All elements of Σ can be estimated from file A or file B except for Σ_YZ and its transpose, Σ_ZY. Specification of the partial correlation of (Y, Z) given X, as Rubin does, can be considered equivalent to specification of Σ_YZ in this framework.
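One way to see this equivalence (our own gloss, using the standard conditional-covariance identity for the multivariate normal distribution, in the notation above):

$$
\mathrm{pcov}_{Y,Z|X} \;=\; \Sigma_{YZ} - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XZ},
\qquad
\mathrm{pcov}_{Y,Z|X} \;=\; \rho_{YZ|X}\,\sqrt{\mathrm{pvar}_{Y|X}\,\mathrm{pvar}_{Z|X}},
$$

so an assumed partial correlation ρ_{YZ|X} determines Σ_YZ = pcov_{Y,Z|X} + Σ_YX Σ_XX^{-1} Σ_XZ, because every other term is estimable from file A or file B.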

It can be shown (Moriarity 2001) that for file A, after the “primary” prediction of Z, the joint distribution of (X_j, Y_j, Ẑ_j) is normal with mean (μ_X, μ_Y, μ_Z) and a covariance matrix that we denote Σ_A1. We use the “A1” subscript to emphasize that this is the covariance matrix for file A, after the primary prediction of Z. The block Σ_{ẐẐ} of this matrix can be shown to equal the covariance matrix of the regression predictions of Z from (X, Y); the explicit expression is given by Moriarity (2001). Similarly, it can be shown that for file B, after the primary prediction of Y, the joint distribution of (X_i, Ŷ_i, Z_i) in file B is normal, with the analogous mean and covariance matrix.


Hence, although the variances of the variables created by prediction are smaller than the variances of the variables being predicted (see Sec. 4 for more discussion of this assertion), other relationships are preserved. In particular, the postulated value of Σ_YZ is preserved.

In file A, after the “secondary” prediction of Y, it can be shown (Moriarity 2001) that the joint distribution of (X_j, Ỹ_j, Ẑ_j) is normal with mean (μ_X, μ_Y, μ_Z) and a singular covariance matrix. Similarly, in file B, after the secondary prediction of Z, it can be shown that the joint distribution of (X_i, Ŷ_i, Z̃_i) is normal, again with mean (μ_X, μ_Y, μ_Z) and a singular covariance matrix.

3.2.2. Matching Step. Rubin’s procedure uses unconstrained matching. This allows for the possibility that some records in each file act as a “donor” more than once, whereas other records are not used. Unlike constrained matching, which forces the use of all records, unconstrained matching can lead to distortions in means, variances, and other parameters (see, e.g., Rodgers 1984).

3.2.3. Concatenation Step. As mentioned earlier, Rubin suggests concatenating the files and assigning the weight (w_A^{-1} + w_B^{-1})^{-1} to each record. If a given record consists mostly of data from file A, then w_A is the weight corresponding to the file A portion of the record, and w_B is the weight corresponding to the file B portion of the record under the assumption that it was sampled according to the protocol used to sample file A. If the given record consists mostly of data from file B, then w_B is the weight corresponding to the file B portion of the record, and w_A is the weight corresponding to the file A portion of the record under the assumption that it was sampled according to the protocol used to sample file B.

This suggested weight is intuitively reasonable and feasible to compute in simple cases such as iid sampling. For iid sampling, it can be shown (as Rubin does) that estimates are unbiased. However, for more complex sampling designs, it may not be feasible to compute the suggested weight. The needed sample design information may not be available, and/or it could be difficult or impossible to compute selection probabilities for records in one file under the assumption of sampling according to the protocol used to sample the other file.

A serious problem with concatenation after the use of unconstrained matching is that estimates may not be unbiased, given the use of unconstrained matching, for non-iid sampling. Furthermore, concatenation can give the illusion of the creation of additional information. If file A and file B have 200 records each, it seems apparent that it is not possible to form more than 200 matched Y-Z pairs; however, concatenation can give the illusion of up to 400 Y-Z pairs. This problem of the illusion of having more information is a criticism that can, of course, in one way or another be leveled at all statistical matching methods.

4. LOSS OF SPECIFIED VALUE OF Σ_YZ IN RUBIN’S PROCEDURE: ALTERNATIVES CONSIDERED

In the ideal situation of statistically matching two data files, each having many observations with variables that are multivariate normally distributed, the preferred outcome of a procedure such as Rubin’s would be that a specified value of Σ_YZ is reproduced accurately in the final product of the statistical matching procedure. Other characteristics, such as means, variances, and covariances that are observable in files A and B, should also be preserved.

For Rubin’s procedure, the results presented in Section 3.2.1 show that the specified value of Σ_YZ is preserved during the primary estimation of Y and Z in the multivariate normal framework. However, the specified value is not guaranteed to be preserved during the secondary estimation of Y and Z. In fact, significant deviation is possible.

Our extensive simulation of Rubin’s suggested matching procedure (see Sec. 5 for details of the simulation methodology) showed that the specified value of Σ_YZ is not always preserved in the final files. Not surprisingly, there was considerable correlation between the divergence of the estimated value of Σ_YZ from the specified value of Σ_YZ and the divergence of Σ_{ŶZ̃} from Σ_YZ. Furthermore, the estimates of Σ_YZ computed from the two files often were far apart, a troubling inconsistency. Hence, Rubin’s methodology needed to be revised to address these anomalies.

Table 1. Summary of Simulation Results for Rubin’s Method and Related Methods

| Matching procedure | Corr(Y,Z) values near conditional independence value (1,049 simulations) | Corr(Y,Z) values away from conditional independence value (824 simulations) | Performance reproducing variances, means, etc. | Performance reproducing Corr(X,Z) in file A and Corr(X,Y) in file B |
|---|---|---|---|---|
| (1) Rubin’s method | | | Bad for variances; OK for means | Bad |
| File A | .04 | .14 | | |
| File B | .10 | .25 | | |
| File A/B difference | .10 | .13 | | |
| (2) Same as (1), except no secondary prediction | | | Bad for variances; OK for means | Good |
| File A | .03 | .03 | | |
| File B | .02 | .02 | | |
| File A/B difference | .02 | .01 | | |
| (3) Same as (2), except add residuals to primary predictions | | | Good | Good |
| File A | .03 | .03 | | |
| File B | .05 | .03 | | |
| File A/B difference | .03 | .02 | | |
| (4) Same as (3), except use univariate constrained matching on Y and on Z | | | Good (as is expected from using constrained matching) | Good |
| File A | .03 | .03 | | |
| File B | .05 | .03 | | |
| File A/B difference | .03 | .01 | | |
| (5) Same as (4), except use multivariate constrained matching on (Y,Z) | | | Good (as is expected from using constrained matching) | Good |
| File A | .01 | .01 | | |
| File B | .01 | .01 | | |
| File A/B difference | .00 | .00 | | |

NOTE: The file A and file B rows show the average absolute differences between estimated Corr(Y,Z) and specified Corr(Y,Z). The file A/B difference row shows the average absolute difference between estimates of Corr(Y,Z) in the two files.

One alternative possibility is to carry out a modification of Rubin’s procedure in which the secondary estimation step is omitted, to eliminate the distortions introduced by that step. That is, an alternative procedure is to use actual values, rather than secondary estimates, during the matching step. As shown in Table 1, our simulations provided strong evidence that this modification of Rubin’s procedure was far more successful on average in retaining the specified value of Σ_YZ. This modification also attained much better consistency of the estimated values of Σ_YZ from the two files compared with Rubin’s procedure.

A second alternative possibility to consider is to impute residuals to the primary predicted values before the matching step. Note that Σ_ZZ − Σ_{ẐẐ} and Σ_YY − Σ_{ŶŶ} can be identified as variances of random variables with certain conditional distributions (e.g., Anderson 1984, p. 37). (This also shows that the variances of the predicted variable values are smaller than the variances of the variables themselves.) Hence the covariance matrices can be made equal by imputing independently drawn normally distributed random residuals with mean 0 and variance as specified and adding these residuals to Ẑ_j and Ŷ_i.

As shown in Table 1, simulations suggest that this methodology is comparable in most ways to Rubin’s procedure without the secondary estimation step. However, this methodology provided improved performance in reproducing variances.
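A minimal sketch of this residual-imputation step, with names of our own choosing (the arrays would come from the regression step):

```python
import numpy as np

def add_residuals(predictions, residual_variance, rng):
    """Add independent N(0, residual_variance) draws so that the imputed values
    have the same variance as the variable they stand in for."""
    return predictions + rng.normal(0.0, np.sqrt(residual_variance), size=len(predictions))

# Hypothetical usage, where resid_var_Z estimates Sigma_ZZ minus the variance of
# the primary predicted Z (and similarly for Y):
# rng = np.random.default_rng(0)
# z_primary_A = add_residuals(z_primary_A, resid_var_Z, rng)
# y_primary_B = add_residuals(y_primary_B, resid_var_Y, rng)
```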

A third alternative methodology is to proceed as in the aforementioned alternative, except to replace unconstrained matching on Y and on Z with univariate constrained matching on Y and on Z (two separate matches). As shown in Table 1, the results of this alternative are comparable to those of the two previous alternatives. However, this procedure has the additional benefit of guaranteeing the elimination of distortion in variances that can occur when unconstrained matching is used.

A fourth alternative to consider is to proceed as in the foregoing alternative, but replace two univariate constrained matches on Y and on Z with a single multivariate constrained match on (Y, Z). This alternative, which consists of the primary estimation step, skipping the secondary estimation step, and multivariate constrained matching on (Y, Z) after imputation of residuals, has been discussed at length by Moriarity (2001) and Moriarity and Scheuren (2001). The multivariate constrained matching step uses a Mahalanobis distance computed on (Y, Z) and was implemented in our simulations using the RELAX-IV public domain software (Bertsekas 1991; Bertsekas and Tseng 1994). If the matching step links the jth record in file A to the ith record of file B, then the final value of Z assigned to the jth record in file A comes from the ith record of file B, and the final value of Y assigned to the ith record of file B comes from the jth record of file A.

This fourth alternative has the benefits of univariate constrained matching, and it also eliminates any inconsistency between (Y, Z) estimates using file A and (Y, Z) estimates using file B. As shown in Table 1, this alternative gave the best performance of the methods considered.
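For equal-size files with equal weights, the multivariate constrained match reduces to an assignment problem. The sketch below is our own illustration, not the article's implementation (which used the RELAX-IV network-flow code); it uses SciPy's assignment solver with a Mahalanobis cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def constrained_match(yz_A, yz_B, cov_yz):
    """yz_A, yz_B: n-by-2 arrays of (residual-augmented) predicted (Y, Z) values.
    Returns, for each file A record, the index of its matched file B record,
    minimizing total Mahalanobis distance with every record used exactly once."""
    cost = cdist(yz_A, yz_B, metric="mahalanobis", VI=np.linalg.inv(cov_yz))
    _, cols = linear_sum_assignment(cost)   # row i of file A is matched to cols[i] of file B
    return cols
```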

5. DESCRIPTION AND RESULTS OF THE SIMULATION METHODOLOGY

To assess the performance of Rubin’s procedure, and variants of that procedure, we carried out simulations. In these simulations we used univariate X, Y, and Z, with (X, Y, Z) assumed to have a trivariate normal distribution. Without loss of generality, (X, Y, Z) were assumed to have zero means and unit variances. In this simple framework, Σ_YZ is equal to Corr(Y, Z), and so on.

Corr(X, Y) was allowed to vary from 0 to .95 in increments of .05. For a given value of Corr(X, Y), Corr(X, Z) was allowed to vary from Corr(X, Y) to .95 in increments of .05. For given values of Corr(X, Y) and Corr(X, Z), Corr(Y, Z) was allowed to take a range of 4–10 different values within the bounds specified by

$$
\mathrm{Corr}(X,Y)\cdot\mathrm{Corr}(X,Z)\;\pm\;\sqrt{\,\bigl[1-\mathrm{Corr}(X,Y)^2\bigr]\cdot\bigl[1-\mathrm{Corr}(X,Z)^2\bigr]\,},
$$

which can be estimated using file A and file B. These bounds have been variously derived (Rodgers and DeVol 1982; Moriarity 2001). All start with the requirement that the correlation matrix of (X, Y, Z) must be positive definite.

Note that the number of values of Corr(Y, Z) depended on the length of the interval of admissible values of Corr(Y, Z). For given values of Corr(X, Y), Corr(X, Z), and Corr(Y, Z), we drew two independent samples of size 1,000 from the specified multivariate normal distribution. We felt that using a sample size of 1,000 was a reasonable compromise to simulate a dataset of realistic size with minimal sampling variability, while avoiding excessive computational burden.
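A minimal sketch of one simulation cell under these assumptions (the particular correlation values, step sizes, and names below are our own):

```python
import numpy as np

r_xy, r_xz, n = 0.40, 0.60, 1000
half_width = np.sqrt((1 - r_xy**2) * (1 - r_xz**2))
lo = r_xy * r_xz - half_width            # admissible range of Corr(Y, Z)
hi = r_xy * r_xz + half_width

rng = np.random.default_rng(1)
for r_yz in np.linspace(lo + 0.05, hi - 0.05, 5):     # a few values inside the bounds
    R = np.array([[1.0,  r_xy, r_xz],
                  [r_xy, 1.0,  r_yz],
                  [r_xz, r_yz, 1.0]])
    # Two independent samples; file A keeps (X, Y), file B keeps (X, Z).
    file_A = rng.multivariate_normal(np.zeros(3), R, size=n)[:, [0, 1]]
    file_B = rng.multivariate_normal(np.zeros(3), R, size=n)[:, [0, 2]]
```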

We carried out the regression steps as previously described in Section 3.1.1. Note that although it is possible to pool the X values from both files to estimate var(X), this can and does lead to occasional problems of nonpositive definite covariance matrices (Moriarity 2001). To avoid such problems, we used X values only from file A to estimate var(X) when predicting Z from X and Y for file A, and used X values only from file B to estimate var(X) when predicting Y from X and Z for file B.

All of the simulation work, except for the fourth alternative discussed in the previous section, was carried out on a Pentium II PC with a 400 MHz processor and 128 MB of RAM. A set of about 2,000 simulations typically took several hours of continuous computer processing to complete. The simulation work for the fourth alternative was done separately (as discussed in Moriarity and Scheuren 2001), and required more computer processing (days, as compared to hours) than the other alternatives.

5.1 Results and Innovations

Table 1 summarizes the results of applying:

1. Rubin’s originally proposed procedure and then the following modifications, as described in Section 4:

2. Skipping the use of secondary predicted values

3. Skipping the use of secondary predicted values and adding residuals to the primary predicted values before matching

4. Same as 3, except using univariate constrained matching on Y and on Z (two separate matches) instead of unconstrained matching on Y and on Z

5. Same as 4, except using multivariate constrained matching on (Y, Z).

Comparing the first two columns within a given row reveals the relative performance of a matching procedure for values near conditional independence versus values far from conditional independence. For Rubin’s original method, performance was worse for values far from conditional independence. All of the modified procedures had robust performance relative to conditional independence.

Comparing the rows within one of the first two columns illustrates the relative ability of different procedures to reproduce the specified value of Corr(Y, Z) after the matching step. It can be seen that Rubin’s method had the poorest performance of all methods considered, due to the distortion introduced by the secondary estimation step. The best-performing procedure was the one that added residuals to primary predictions and then performed multivariate constrained matching using (Y, Z).

As shown in Table 1, we performed a broader evaluation of the various procedures. We examined each procedure’s ability to reproduce other estimates, such as E(Z), var(Z), and Corr(X, Z) in file A and E(Y), var(Y), and Corr(X, Y) in file B. Overall, Rubin’s method again had the poorest comparative performance for reproducing variances and covariances. We think that this is because of the secondary estimation step, because methods that used unconstrained matching but not the secondary estimation step generally performed better. Overall, multivariate constrained matching had the best comparative performance for reproducing variances and covariances.

5.2 Simulation Summary

To summarize, we found that it was very important to eschew the secondary estimation step. Several procedures that omitted the secondary estimation step gave comparable results, with multivariate constrained matching performing the best.

6. APPLICATIONS AND GENERALIZATIONS

In this article we have made the assumption that the variables to be statistically matched come from multivariate normal distributions. This does not really fit most situations in practice, where the variables come from complex survey designs and do not have a standard theoretical distribution, let alone a normal one.

In addition to problems that arise with Rubin’s method from carrying out secondary estimation, another limitation of the method is that it assumes bivariate (Y, Z). Generalizations to higher dimensions are not immediate. It may be possible to apply techniques akin to multivariate predictive mean matching, as discussed by Little (1988). An alternative procedure that appears to generalize in a straightforward way to higher dimensions is discussed by Moriarity and Scheuren (2001).

Although developing a complete paradigm is beyond our scope here, we can make some suggestions:

1. Constrained matching is a good starting point. It is expensive but now affordable with recent advances in computing. The use of unconstrained matching by Rubin does not seem to be essential to his ideas; indeed, it may have been advocated just for the sake of specificity.

2. Applications that match files as large as 1,000 (the sample size that we simulated) would be unusual. Even in large-scale projects, like matching the full Current Population Survey with the Survey of Income and Program Participation, the matching generally would be done separately in modest-sized demographic subsets defined by categorical variables such as gender or race (e.g., as described in Ingram et al. 2000).

3. We believe that the general robustness of normal methods can be appealed to, even when the individual observations are not normal. Although not necessarily optimal, the statistics calculated from the resulting combined file will be approximately normal because of the central limit theorem.

4. Resampling of the original sample before using the techniques presented in this article could help expose potential lack of robustness to failures in the iid assumption. One such technique was discussed by Hinkins, Oh, and Scheuren (1997). This can be computationally expensive, depending on the sample designs of the two files being matched.

5. Often researchers who do statistical matching do not bring survey designers into the matching process. This is needed. The use of sample replication, even if only approximately, is one way that designers can help matchers. Without help from samplers, it generally will not be possible to create credible sample variance estimates for statistics created from the matched files. This is formidable in any case but could be especially so with a concatenated file.

6. Deep subject matter knowledge is needed to deal with differences in the measurement error and other nonsampling concerns (e.g., edit and imputation issues) that arise and that may even be a dominant limitation to statistical matching.

7. In all applications, no matter what the matcher’s experience level, caution would recommend that with a new problem, simulations always be done and a small prototype involving real data be conducted before beginning on a large scale. No decision on how (or even whether) to do a statistical match should be made until these steps have been taken.

7. SUMMARY AND CONCLUSIONS

Rubin’s 1986 article in the Journal of Business and Economic Statistics presented innovative ideas in the area of statistical matching that, although important, have not until now been followed up and developed. This is unfortunate, because the approach advocated by Rubin, once modified, has value to practitioners.

7.1 Secondary Estimation

We have shown that Rubin’s procedure is sound during the primary estimation step, but not necessarily during the secondary estimation step. In fact, we strongly advise against the routine use of Rubin’s secondary estimation step.

Because of the loss of the specified value of Σ_YZ that can occur during the secondary estimation step, Rubin’s procedure is not feasible as originally described. However, innovations such as avoiding the secondary estimation step, adding residuals to the primary estimates before matching, and using constrained matching, particularly multivariate constrained matching on (Y, Z), appear to make the procedure feasible.

The end result is a collection of datasets formed from various assumed values of Σ_YZ, where analyses can be repeated over the collection and the results can be averaged or summarized in some other meaningful way. (For a recent reference on ways to average the resulting values, see Hoeting, Madigan, Raftery, and Volinsky 1999.) The methods described by Rubin (1987, 1996) also might be used; however, we consider this an area in which additional research is warranted.

7.2 Unconstrained Matching

Legitimate concerns also arise when unconstrained matching is used. The suggestion of using unconstrained matching was appealing when made originally by Rubin, because it requires much less computational effort than multivariate constrained matching. However, it can be shown that unconstrained matching leads to distortion of means and variances, as discussed by Rodgers (1984).

A simple form of univariate constrained matching (Goel and Ramalingam 1989) that matches ranked values is comparable in computational effort to unconstrained matching. This method appears to work acceptably when used in tandem with avoiding secondary estimation and adding residuals to the primary estimates. Ingram et al. (2000) evaluated this procedure, but without residuals added. Multivariate constrained matching requires more computational effort, but advances in computer hardware and software (e.g., Bertsekas 1991; Bertsekas and Tseng 1994) have made multivariate constrained matching feasible.
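The rank-matching idea is simple enough to sketch in a few lines; this is our own illustration for equal-size, equally weighted files:

```python
import numpy as np

def rank_match(values_A, values_B):
    """Pair the k-th smallest value in file A with the k-th smallest in file B;
    returns, for each file A record, the index of its file B donor."""
    donor = np.empty(len(values_A), dtype=int)
    donor[np.argsort(values_A)] = np.argsort(values_B)
    return donor
```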

7.3 Concatenation

The notion of file concatenation is appealing. However, on close examination, it seems to have limited applicability. It is not clear that the suggested weights can always be computed for non-iid sampling in complex survey designs. Moreover, if the two sample files are simply brought together, then there is danger of giving the illusion of creation of information by repetition of observations in the concatenated file. This incorrect conclusion is less likely if no concatenation is done.

Nonetheless, in many past applications of statistical matching, the matching of the two files was done in only one direction, thus wasting information. Rubin’s essential idea of matching in both directions is important and should become standard practice. Use of this approach would imply that two estimates would be available for inference, a helpful way of displaying both sampling and nonsampling error.


To calculate the set of w_ABi = (w_A^{-1} + w_B^{-1})^{-1}, the weight suggested by Rubin for the ith observation on the concatenated file, one needs to calculate the weight that file A sample observations would have had on file B had they been selected into the file B sample. Similarly, one needs to be able to calculate for a file B case what its probability is of being selected into file A.

Now, in general, for a given sample design there is an index set I of labels needed to assign a probability of selection. In order for the type of concatenation that Rubin advocated to be feasible, these files would need to be of the form [X_A, Y_A, Weight_A, I_A, I_B] and [X_B, Z_B, Weight_B, I_A, I_B] before the matching step.

Now, except in very special cases (e.g., simple random sampling (SRS) on both files or stratified SRS with a subset of the X variables being the stratifiers), I_A and I_B would not be known for all observations. This would almost certainly be true when public use files are being statistically matched. Indeed, for public use files that generally contain limited information because of confidentiality concerns, w_ABi cannot be calculated unless I_A and I_B are effectively captured by the X variables and the sample weights. (Rubin makes a related point in sec. 3.3 of his article.)

Ignore for a moment that in most settings, fully weighting a concatenated file of the form that Rubin describes may be impossible. Suppose instead that it were possible. Usually, there are known quantities about the population in either the file A or the file B samples that are conditioned on either during selection or after the fact. Although only sketching this in his article, Rubin seems to be leading us to this point with his comment on adjusted ratio weights. To illustrate, suppose one matched a stratified sample (file A) with a simple random sample (file B). Suppose further that the stratifying variables were among the common X variables. Then after constructing the concatenated weights one would be led, almost naturally, to condition them on stratum totals, if known. If both files were stratified simple random samples, with the stratifiers among the common X variables, then jointly conditioning on both sets of strata totals might be done, possibly using a raking estimator (Deming and Stephan 1940).
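As an aside, a raking (iterative proportional fitting) adjustment of the concatenated weights to two sets of known stratum totals could look roughly like the sketch below; this is our own simplified illustration, with names, data structures, and the fixed iteration count chosen for exposition only.

```python
import numpy as np

def rake(weights, strata_1, totals_1, strata_2, totals_2, iters=50):
    """Iteratively scale weights so that weighted totals match two known margins.
    strata_1, strata_2: integer stratum labels per record; totals_*: dicts
    mapping stratum label to its known population total."""
    w = weights.astype(float).copy()
    for _ in range(iters):
        for strata, totals in ((strata_1, totals_1), (strata_2, totals_2)):
            for s, total in totals.items():
                mask = strata == s
                w[mask] *= total / w[mask].sum()
    return w
```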

We would argue that Rubin’s adjusted ratio version of concatenated weighting, as illustrated here, moves practitioners back toward constrained matching, our preferred approach. Now what does Rubin say about constrained matching?

“I regard the automatic matching of margins to the original files as a relatively minor benefit of the constrained approach in most circumstances, especially considering that the real payoff in matching census margins arises when samples are not large and census data on the margins exist; and then this constrained approach, that matches census margins, is not as appropriate as methods designed to match population margins such as ratio and regression adjustment, which can be applied after the matched file is created.”

The phrase “in most circumstances” may be where the problem lies.

We agree with Rubin about the benefit (or lack thereof) when the samples are very large. But applications usually are to small subpopulations for which census data on the margins may exist or where the practitioner believes one file’s estimates to be superior to the other, perhaps due to higher response rates or better measurement properties.

Now an important problem with the usual application of constrained matching, and one on which we wholly agree with Rubin, is that we are treating the constrained totals as fixed and without error. Replication approaches can address this but usually are not applied (see Sec. 6).

7.4 Final Observations

We conclude with some general comments about statistical matching and the procedures discussed herein. We take a pragmatic view about statistical matching; it has been used for many years and will continue to be used in the future. Most statistical matching procedures that have been implemented have assumed, implicitly or explicitly, that (Y, Z) are conditionally independent given X. Clearly, this is a plausible assumption that provides a consistent set of relationships between X, Y, and Z, but it is not the only possible plausible assumption. For example, as discussed in Section 5 for the case of univariate X, Y, and Z with unit variances, the “conditional independence value” Corr(X, Y) · Corr(X, Z) is the midpoint of a range of plausible values for Corr(Y, Z), and in general, the range of plausible values for Corr(Y, Z) in this case is wide (Rodgers and DeVol 1982). A similar situation exists in higher dimensions. Thus any synthetic data file produced by statistical matching procedures that do not exhibit the effect of plausible alternatives to conditional independence for the (Y, Z) relationship has serious limitations, unless the conditional independence assumption happens to be more or less correct.

The techniques described here, which are extensions of work of Kadane (1978) and Rubin (1986), provide a means for exhibiting the effect of alternative plausible assumptions about the unknowable relationship between Y and Z, and we advocate the creation of a reasonable number of datasets to display the effect of a wide range of plausible values. Procedures for generating a wide range of plausible values for the (Y, Z) relationship have been outlined by Moriarity and Scheuren (2001).

Also, if sufficient resources are available, then we recommend, for a given plausible value for the (Y, Z) relationship, imputation of several different sets of residuals to display the variability introduced by that procedure.

Although the techniques described herein can help exhibit variability due to various plausible assumptions about the (Y, Z) relationship, only auxiliary information can provide an accurate idea of the (Y, Z) relationship. If auxiliary information about the (Y, Z) relationship is available, then multivariate matching after imputing residuals can produce a dataset that accurately reproduces that relationship while preserving other observed relationships in the multivariate normal case (Moriarity and Scheuren 2001). Further research is needed to determine whether our technique works well in other cases.

ACKNOWLEDGMENTS

This article is based in large part on the unpublished doctoral dissertation cited as Moriarity (2001). Results sketched here are fully developed in that reference. The authors thank Tapan Nayak, Reza Modarres, and Hubert Lilliefors of The George Washington University Department of Statistics for useful discussions. The authors also thank the referees for their constructive suggestions, which improved the clarity of this article. The views expressed are ours and do not necessarily reflect the views or positions of the U.S. General Accounting Office.

[Received February 2002. Revised April 2002.]

REFERENCES

Anderson, T. W. (1984), An Introduction to Multivariate Statistical Analysis (2nd ed.), New York: Wiley.

Barr, R. S., Stewart, W. H., and Turner, J. S. (1982), “An Empirical Evaluation of Statistical Matching Strategies,” unpublished technical report, Southern Methodist University, Edwin L. Cox School of Business.

Bertsekas, D. P. (1991), Linear Network Optimization: Algorithms and Codes, Cambridge, MA: MIT Press.

Bertsekas, D. P., and Tseng, P. (1994), “RELAX-IV: A Faster Version of the RELAX Code for Solving Minimum Cost Flow Problems,” unpublished technical report, available at http://web.mit.edu/dimitrib/www/home.html.

Citro, C. F., and Hanushek, E. A. (eds.) (1991), Improving Information for Social Policy Decisions: The Uses of Microsimulation Modeling, Volume I: Review and Recommendations, Washington, DC: National Academy Press.

Cohen, M. L. (1991), “Statistical Matching and Microsimulation Models,” in Improving Information for Social Policy Decisions: The Uses of Microsimulation Modeling, Vol. II: Technical Papers, eds. C. F. Citro and E. A. Hanushek, Washington, DC: National Academy Press, pp. 62–88.

Deming, W. E., and Stephan, F. F. (1940), “On a Least Squares Adjustment of a Sampled Frequency Table when the Expected Marginal Totals are Known,” Annals of Mathematical Statistics, 11, 427–444.

Fellegi, I. P., and Sunter, A. B. (1969), “A Theory for Record Linkage,” Journal of the American Statistical Association, 64, 1183–1210.

Goel, P. K., and Ramalingam, T. (1989), The Matching Methodology: Some Statistical Properties, Lecture Notes in Statistics Vol. 52, New York: Springer-Verlag.

Goodnight, J. H. (1979), “A Tutorial on the SWEEP Operator,” The American Statistician, 33, 149–158.

Hoeting, J. A., Madigan, D., Raftery, A., and Volinsky, C. T. (1999), “Bayesian Model Averaging: A Tutorial,” Statistical Science, 14, 382–417.

Hinkins, S., Oh, H. L., and Scheuren, F. (1997), “Inverse Sampling Design Algorithms,” Survey Methodology, 23, 11–21.

Ingram, D. D., O’Hare, J., Scheuren, F., and Turek, J. (2000), “Statistical Matching: A New Validation Case Study,” in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 746–751.

Kadane, J. B. (1978), “Some Statistical Problems in Merging Data Files,” in 1978 Compendium of Tax Research, Washington, DC: U.S. Department of the Treasury, pp. 159–171. (Reprinted in Journal of Official Statistics, 17, 423–433.)

Little, R. J. A. (1988), “Missing-Data Adjustments in Large Surveys,” Journal of Business and Economic Statistics, 6, 287–301.

Moriarity, C. (2001), “Statistical Properties of Statistical Matching,” unpublished Ph.D. dissertation, The George Washington University, Dept. of Statistics.

Moriarity, C., and Scheuren, F. (2001), “Statistical Matching: A Paradigm for Assessing the Uncertainty in the Procedure,” Journal of Official Statistics, 17, 407–422.

National Research Council (1992), Combining Information: Statistical Issues and Opportunities for Research, Washington, DC: National Academy Press.

Okner, B. A. (1972), “Constructing a New Data Base From Existing Microdata Sets: The 1966 Merge File,” Annals of Economic and Social Measurement, 1, 325–342.

Paass, G. (1986), “Statistical Match: Evaluation of Existing Procedures and Improvements by Using Additional Information,” in Microanalytic Simulation Models to Support Social and Fiscal Policy, eds. G. H. Orcutt, J. Merz, and H. Quinke, Amsterdam: North-Holland, pp. 401–420.

——— (1989), “Stochastic Generation of a Synthetic Sample From Marginal Information,” in Fifth Annual Research Conference Proceedings, Washington, DC: U.S. Bureau of the Census, pp. 431–445.

Radner, D. B., Allen, R., Gonzalez, M. E., Jabine, T. B., and Muller, H. J. (1980), “Report on Exact and Statistical Matching Techniques,” Statistical Policy Working Paper 5, U.S. Government Printing Office.

Rodgers, W. L. (1984), “An Evaluation of Statistical Matching,” Journal of Business and Economic Statistics, 2, 91–102.

Rodgers, W. L., and DeVol, E. B. (1982), “An Evaluation of Statistical Matching,” unpublished report submitted to Income Survey Development Program, Dept. of Health and Human Services, by the Institute for Social Research, University of Michigan.

Rubin, D. B. (1986), “Statistical Matching Using File Concatenation With Adjusted Weights and Multiple Imputations,” Journal of Business and Economic Statistics, 4, 87–94.

——— (1987), Multiple Imputation for Nonresponse in Surveys, New York: Wiley.

——— (1996), “Multiple Imputation After 18+ Years,” Journal of the American Statistical Association, 91, 473–489.

Scheuren, F. (1989), Comment on “The Social Policy Simulation Database and Model: An Example of Survey and Administrative Data Integration,” by M. Wolfson, S. Gribble, M. Bordt, B. Murphy, and G. Rowe, Survey of Current Business, 69, 40–41.

Scheuren, F., and Winkler, W. E. (1993), “Regression Analysis of Data Files That are Computer Matched,” Survey Methodology, 19, 39–58.

——— (1997), “Regression Analysis of Data Files That are Computer Matched, Part II,” Survey Methodology, 23, 157–165.

Seber, G. A. F. (1977), Linear Regression Analysis, New York: Wiley.

Sims, C. A. (1972), Comment on “Constructing a New Data Base from Existing Microdata Sets: The 1966 Merge File,” by B. A. Okner, Annals of Economic and Social Measurement, 1, 343–345.

——— (1978), Comment on “Some Statistical Problems in Merging Data Files,” by J. B. Kadane, in 1978 Compendium of Tax Research, Washington, DC: U.S. Department of the Treasury, pp. 172–177.

Singh, A. C., Mantel, H. J., Kinack, M. D., and Rowe, G. (1993), “Statistical Matching: Use of Auxiliary Information as an Alternative to the Conditional Independence Assumption,” Survey Methodology, 19, 59–79.
