
Representing uncertainty about how records should be linked introduces additional complications, because this implies uncertainty about the number of objects in the world. However, Pasula, Marthi, Milch, Russell, and Shpitser (2002) recently described a Markov chain Monte Carlo (MCMC) technique that allows sampling from a posterior over RPN structures associated with different numbers of objects. The method that they describe has not been applied to databases containing more than a few hundred objects; however, in earlier work, Cohen, Kautz, and McAllester (2000) described an O(N log N) search technique for finding a "clean" database that is locally optimal with respect to posterior probability, and the data structures that they described could be adapted to MCMC sampling as well. In fact, the Abowd–Vilhuber approach of providing a single clean database might be viewed as a maximum a posteriori (MAP) approximation to the full Bayesian approach. What we might want to do is have multiple draws from the posterior distribution, in the spirit of "multiple imputation," which again relates to issues concerning protection against statistical disclosure (cf. Raghunathan, Reiter, and Rubin 2003).
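As a crude illustration of the gap between these two approaches, the following sketch contrasts a MAP-style single clean database with multiple posterior draws. It assumes the linkage model has already produced an independent posterior match probability for every candidate link; the dependence among overlapping candidates, which is what makes real MCMC over link structures necessary, is deliberately ignored here.

```python
import numpy as np

def map_links(posterior_probs, threshold=0.5):
    """MAP-style "single clean database": accept each candidate link iff its
    posterior match probability exceeds the threshold."""
    return posterior_probs > threshold

def sample_link_configurations(posterior_probs, n_draws=10, seed=0):
    """Multiple-imputation-style alternative: draw several complete link
    configurations from the (assumed independent) posterior link probabilities."""
    rng = np.random.default_rng(seed)
    return [rng.random(len(posterior_probs)) < posterior_probs
            for _ in range(n_draws)]
```

Each drawn configuration would then be analyzed separately and the results combined in the multiple-imputation spirit described above.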

To be practical for large-scale problems, these sorts of representations of data cleaning uncertainty will have to be combined with conventional schemes for storing data whose attributes and identity are not in question. We note that in Abowd and Vilhuber's dataset, the vast majority of the longitudinal links are based on apparently uncorrupted SSNs, and only 800,000 of 96,000,000 records (under 1%) are modified by their data cleaning method.

This Bayesian framework for representing the output of data cleaning clearly requires refinement and the imposition of many approximations to make it practical for large datasets. Because of this, it is still essential to obtain a good understanding of which data cleaning decisions most affect the analyses that will be performed on the data and the effect of those decisions on the bias and accuracy of derived statistics. Abowd and Vilhuber's article is an excellent step toward achieving this understanding.

ADDITIONAL REFERENCES

Abowd, J. M., and Lane, J. (2004), "New Approaches to Confidentiality Protection: Synthetic Data, Remote Access and Research Data Centers," in Privacy in Statistical Databases, eds. J. Domingo-Ferrer and V. Torra, New York: Springer-Verlag, pp. 282–289.

Abowd, J. M., and Woodcock, S. D. (2001), "Disclosure Limitation in Longitudinal Linked Data," in Confidentiality, Disclosure, and Data Access: Theory and Practical Applications for Statistical Agencies, eds. P. Doyle, J. Lane, L. Zayatz, and J. Theeuwes, Amsterdam: North-Holland, pp. 215–277.

——— (2004), "Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data," in Privacy in Statistical Databases, eds. J. Domingo-Ferrer and V. Torra, New York: Springer-Verlag, pp. 290–297.

Buntine, W. (1994), "Operations for Learning With Graphical Models," Journal of Artificial Intelligence Research, 2, 159–225.

Cohen, W. W., Kautz, H., and McAllester, D. (2000), "Hardening Soft Information Sources," in Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, pp. 255–259.

Domingos, P., and Richardson, M. (2004), "Markov Logic: A Unifying Framework for Statistical Relational Learning," in Proceedings of SRL2004: Statistical Relational Learning and Its Connections to Other Fields, http://www.cs.umd.edu/projects/srl2004/.

Fellegi, I., and Sunter, A. (1969), "A Theory for Record Linkage," Journal of the American Statistical Association, 64, 1183–1210.

Fienberg, S. E., Makov, U. E., and Sanil, A. P. (1998), "A Bayesian Approach to Data Disclosure: Optimal Intruder Behavior for Continuous Data," Journal of Official Statistics, 13, 75–89.

Friedman, N., Getoor, L., Koller, D., and Pfeffer, A. (1999), "Learning Probabilistic Relational Models," in Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden, pp. 1300–1309.

Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., and Kadie, C. (2000), "Dependency Networks for Inference, Collaborative Filtering, and Data Visualization," Journal of Machine Learning Research, 1, 49–75.

Lambert, D. (1993), "Measures of Disclosure Risk and Harm," Journal of Official Statistics, 9, 313–331.

Pasula, H., Marthi, B., Milch, B., Russell, S., and Shpitser, I. (2002), "Identity Uncertainty and Citation Matching," in Advances in Neural Information Processing Systems 15, Vancouver, British Columbia: MIT Press, pp. 1425–1432.

Raghunathan, T. E., Reiter, J., and Rubin, D. B. (2003), "Multiple Imputation for Statistical Disclosure Limitation," Journal of Official Statistics, 19, 1–16.

Winkler, W. E. (2002), "Methods for Record Linkage and Bayesian Networks," research report, Statistical Research Division, U.S. Bureau of the Census.

Winkler, W., and Scheuren, F. (1996), "Recursive Analysis of Linked Data Files," in Proceedings of the 1996 Census Bureau Annual Research Conference, pp. 920–935.

Rejoinder

John M. ABOWD
Edmund Ezra Day Professor of Industrial and Labor Relations, School of Industrial and Labor Relations, Cornell University, Ithaca, NY 14850 (john.abowd@cornell.edu)

Lars VILHUBER
Senior Research Associate, CISER (Cornell Institute for Social and Economic Research), Cornell University, Ithaca, NY 14850 (lars.vilhuber@cornell.edu)

© 2005 American Statistical Association, Journal of Business & Economic Statistics, April 2005, Vol. 23, No. 2, DOI 10.1198/073500104000000712

We appreciate the time and effort that the discussants have spent providing us with concise and useful comments and suggestions. We are also grateful to both Alastair Hall and Torben Andersen, successive JBES editors, for inviting us to this forum and for supporting us in our endeavor. The discussants each came to the table with a different background, and we appreciate the wide range of concerns that they raised. The comments contain many specific questions regarding the data-correction procedures and analyses presented in our article. All commentators offer suggestions concerning the scope, usefulness, and direction that research on the statistical use of administrative data should address. We would like to use this rejoinder to reply to these questions and to elaborate our own vision of useful directions for this research.

1. CURRENT STATUS OF IMPLEMENTATION AND RESEARCH

To begin, we want to point out that the methods proposed in our article have already been implemented within the U.S. Census Bureau as part of the production system for the Quarterly Workforce Indicators (QWIs) (see http://lehd.dsd.census.gov). Since the methods were prototyped and analyzed with California data, a number of states have been processed in much the same way. Each state's historical data provided a new challenge, because each had a different way of capturing the original unemployment insurance (UI) wage record data in the first place, of archiving and extracting it, and of postprocessing. The actual employment-generating process in each state also differs, as does the end result. Although we have not repeated the quite extensive and costly analyses at all of the levels of aggregation that we discussed in our article, we have observed widely varying data characteristics. Some states historically captured only a small portion of the names on the wage records, whereas others have specific ways of dealing with apparently "invalid" name–Social Security Number (SSN) combinations. All of these factors affect the match rates, and thus affect the number of holes plugged, which vary widely. One key statistic, the reduction in single-quarter job spell interruptions, varies between 2% and 15%. California, with 11%, is above average in this respect.
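As a rough illustration (this is not the QWI production code), the key statistic mentioned above can be computed from a table of person–employer–quarter wage records; the record layout and the definition of a "hole" as employment at the same employer in quarters t−1 and t+1 but not in t are assumptions consistent with the description in the article.

```python
from collections import defaultdict

def count_single_quarter_holes(records):
    """Count single-quarter job-spell interruptions ("holes"): quarters t where
    a (person, employer) pair has wage records in t-1 and t+1 but not in t.
    `records` is an iterable of (person_id, employer_id, quarter) tuples,
    with quarters encoded as consecutive integers."""
    quarters = defaultdict(set)
    for person, employer, q in records:
        quarters[(person, employer)].add(q)
    holes = 0
    for qs in quarters.values():
        # a hole at q+1 is counted via its preceding observed quarter q
        holes += sum(1 for q in qs if (q + 2) in qs and (q + 1) not in qs)
    return holes

def hole_reduction(raw_records, edited_records):
    """Percent reduction in single-quarter interruptions after the edit."""
    before = count_single_quarter_holes(raw_records)
    after = count_single_quarter_holes(edited_records)
    return 100.0 * (before - after) / before if before else 0.0
```

On this reading, the 2%–15% range quoted above is simply hole_reduction(raw, edited) computed state by state.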

The great variety in match rates brings us directly to a comment by van der Klauw. He points out that states differ in fundamental employment characteristics. Our editing process highlighted some of these characteristics, and van der Klauw cites some other items, such as the actual definition of an employer within the administrative data. Our ongoing application of the methods described in our article to other states has clearly shown this to be a legitimate worry.

2. ELIMINATING BIAS, ADDING BIAS, OR CHANGING BIAS

van der Klauw is also worried that our procedure potentially introduces new biases. His example is the differential impact of the procedure on wage earners with high and low earnings volatility, as well as the differences in data transmission methods used by firms. For instance, smaller firms are more likely to have submitted paper forms, and thus are more likely to be subject to a data-keying entry error than larger firms. Our correction procedure would find a higher match rate among smaller firms than among large firms, and as a result, the relationship between exit rates in smaller firms relative to exit rates in larger firms will have been altered. When expanding the analysis to multiple states, this differential data error is likely to be exacerbated. Not only will the conditional ratios within any single state be changed, but also the change will differ by state, depending on the mix of small and large firms, the (unobserved) mix of submission methods, and other factors.

We should not forget that the original incidence of SSN coding errors is correlated with the same factors that influence match rates. In the foregoing example, we find more matches among smaller firms, leading to a larger adjustment of exit rates observed in smaller firms, because there are more coding errors in smaller firms to begin with. Nevertheless, because we lack a true validation dataset, it is very difficult to assess how much bias due to miscoding error is still present in the data.

3. EARNINGS MODEL AND THE ADMINISTRATIVE DATA GENERATING PROCESS

Several commentators discussed the earnings model that we used to augment the matching procedure with economic pattern data. We model the earnings pattern generated by two alternate processes: continuous employment of a single person by the same employer in the presence of an identifier coding error, and interrupted employment of a person with the same employer combined with the short-term employment of a different person at the same employer. We derive the earnings patterns associated with each process and use the difference in earnings patterns as an additional element in the matching process.

Cohen, Fienberg, and Ravikumar (henceforth CFR) point out that a third earnings pattern–generating process could confound this binary choice. This third pattern would be the result of a model of "costly vacancies," where employers faced with the absence of a single long-term employee for a prolonged period immediately hire a replacement worker for the duration of the absence. As we explain shortly, the model that CFR propose is accounted for by our scheme, but their proposition highlights the need to distinguish the true earnings and employment-generating process from the data-generating process of the administrative agency assembling the data. The latter in effect filters and coarsens the true employment patterns.

To illustrate, consider a simplified version of the "costly vacancy" model, in which no lag arises between the separation of the long-term employee i and the hire of the replacement worker j (i.e., seamless employment of some worker filling the job). To facilitate the exposition, assume that separations occur on the last day of a month and that hiring occurs on the first day. Next, generate employment patterns such that the long-term worker i is not observed working in the second quarter of the year, but is observed working in the first and third quarters of the year, that is, a "hole" in quarter II, with e(I) > 0, e(II) = 0, and e(III) > 0 in the notation of section 4.2. Nine possible employment patterns can be generated by processes that fit this pattern: worker i could have left on the last day of January, February, or March and returned on any one of July 1, August 1, and September 1. Note, however, that each of the nine true employment patterns has a different earnings pattern, which also differs slightly from our section 4.2, because the separation process is now assumed to be discrete rather than continuous. In the notation of section 4.2,

E[e(I)] = 2/3,   e(II) = 0,   E[e(III)] = 2/3.   (42)

In the "costly vacancy" model, each of the nine true employment patterns corresponds to an employment pattern of a replacement worker. If worker i had quit in January and returned in August, then worker j would have worked from February to July. But these are the true employment patterns, not the data representation of those employment patterns. Because of the way in which the data are collected—the data-generating process—only one of the nine possible replacement worker employment histories generates a single-quarter data record. For instance, if worker j worked from February to July, then he would have generated a three-quarter sequence of wage records in the data. Only if he worked from April to June would he have generated a potential "plug."
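A short enumeration makes the counting in this stylized example explicit. The sketch below only encodes the assumptions stated above (separations on the last day of a month, hires on the first, seamless replacement), with earnings expressed as fractions of a full quarter; it is an illustration, not part of the matching system.

```python
from itertools import product

MONTHS_Q1, MONTHS_Q2, MONTHS_Q3 = (1, 2, 3), (4, 5, 6), (7, 8, 9)

def quarter_fraction(worked_months, quarter_months):
    """Fraction of the quarter actually worked (each month counts 1/3)."""
    return sum(1 for m in quarter_months if m in worked_months) / 3.0

# Worker i: leaves after the last day of month `last` (Jan, Feb, or Mar)
# and returns on the first day of month `back` (Jul, Aug, or Sep).
patterns_i = list(product((1, 2, 3), (7, 8, 9)))  # the nine true patterns

e1 = [quarter_fraction(range(1, last + 1), MONTHS_Q1) for last, _ in patterns_i]
e3 = [quarter_fraction(range(back, 10), MONTHS_Q3) for _, back in patterns_i]
print(sum(e1) / 9, sum(e3) / 9)   # both 2/3, as in equation (42)

# Replacement worker j works seamlessly from month last+1 through back-1.
# Count how many of the nine replacement histories appear in the data as a
# single-quarter (Q2-only) wage record, i.e. as a potential "plug".
def observed_quarters(months):
    return {q for q, qm in enumerate((MONTHS_Q1, MONTHS_Q2, MONTHS_Q3), 1)
            if any(m in months for m in qm)}

plugs = sum(1 for last, back in patterns_i
            if observed_quarters(range(last + 1, back)) == {2})
print(plugs)  # 1 -- only the April-to-June replacement history
```

Running it reproduces both facts used in the text: the expected quarterly fractions of 2/3 in equation (42), and that exactly one of the nine replacement histories survives the data-generating process as a candidate plug.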

To elaborate further, recall that the expected earnings for "plugs" under H1 are computed not over the population of replacement workers, but rather over the population of work histories that generate one-quarter sequences of wage records at the same employer. Thus, although the replacement worker described earlier will be among the candidates, so too will the worker k who worked only the month of July. With the foregoing assumptions concerning hiring and separations, seven possible true employment patterns arise, and the expected earnings of all plugs, including the replacement worker's history, are

E[e(II) | e(I) = 0 and e(III) = 0] = 4/7,   (43)

which is smaller than the expected earnings of the "hole."

This simple example serves to illustrate several points. First, if one is willing to make certain assumptions about the employment process, then implications that are different from the ones that we derive in section 4.2 may arise. However, most alternate models will still generate the basic implication that on average plugs will have lower earnings than holes. Second, and more important, the data-generating process that filters the true employment patterns has implications about the candidate records. In the foregoing simple example, all but one-ninth of the replacement workers are filtered out and simply are not eligible, because their employment patterns do not generate the same data pattern that a true miscoding would generate. Finally, in the actual matching we do not use the exact (predicted) earnings; rather, we use the population decile into which earnings from hole and plug fall. We allow hole and plug earnings to be in adjacent deciles, although if earnings fall into the same decile, then the match weight is higher. Interestingly, whereas Winkler suggests increasing the precision of the earnings comparison (i.e., moving away from using deciles) to increase the match rate, van der Klauw suggests the opposite, fearing that a comparison that allows for earnings only within a narrow band would increase the nonmatch rate by penalizing workers with volatile earnings patterns. We will continue to investigate both of these avenues.
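To make the decile comparison concrete, the following fragment sketches how a hole/plug pair could be scored on binned earnings. The weight values and the helper that assigns deciles are placeholder assumptions for illustration, not the parameters of the production matcher.

```python
import bisect

def earnings_decile(earnings, decile_cutoffs):
    """Map an earnings amount to a decile 1..10 given the nine interior
    cutoffs of the relevant earnings distribution."""
    return bisect.bisect_right(decile_cutoffs, earnings) + 1

def earnings_agreement_weight(hole_earnings, plug_earnings, decile_cutoffs,
                              same_decile_weight=2.0, adjacent_decile_weight=1.0):
    """Contribution of the earnings comparison to the match weight: same decile
    scores highest, adjacent deciles score lower, anything further apart
    contributes nothing (illustrative values only)."""
    gap = abs(earnings_decile(hole_earnings, decile_cutoffs)
              - earnings_decile(plug_earnings, decile_cutoffs))
    if gap == 0:
        return same_decile_weight
    if gap == 1:
        return adjacent_decile_weight
    return 0.0
```

In these terms, Winkler's suggestion amounts to shrinking the bins (or comparing raw earnings), whereas van der Klauw's concern corresponds to widening the band of acceptable decile gaps.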

4. IMPROVING KNOWLEDGE ON MATCHING METHODS

The application of the methods proposed in our article is not unique to administrative records data. In fact, we used a commercial record-linking product that is widely used in business environments. Although we could easily publish the parameters used for the California data (and we will gladly provide these to readers with an interest in using them), our own experience in expanding the methods to other states has shown that weights and probability cutoffs need to be adjusted to the characteristics of the specific data to obtain satisfactory results. The parameters used for California do not yield satisfactory results in other states. Methods do exist that use computing power rather than operator experience to find both appropriate weights and cutoffs, in particular using the EM algorithm. A number of our discussants are active in those fields (Winkler 2000a,b; Yancey 2000, 2004).
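For readers unfamiliar with the EM route, a compact textbook-style sketch for the standard Fellegi–Sunter setup with conditionally independent agreement fields is given below. It is a generic illustration, not the implementation used for the QWIs; the starting values and the fields behind the agreement matrix are assumptions.

```python
import numpy as np

def em_fellegi_sunter(gamma, n_iter=100, p=0.1, m=None, u=None):
    """Estimate match (m) and nonmatch (u) agreement probabilities for a binary
    agreement matrix `gamma` (pairs x fields) under the usual two-class
    conditional-independence mixture, via EM. Returns (p, m, u, weights)."""
    n_pairs, n_fields = gamma.shape
    m = np.full(n_fields, 0.9) if m is None else m   # P(field agrees | match)
    u = np.full(n_fields, 0.1) if u is None else u   # P(field agrees | nonmatch)
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a true match
        lm = gamma @ np.log(m) + (1 - gamma) @ np.log(1 - m)
        lu = gamma @ np.log(u) + (1 - gamma) @ np.log(1 - u)
        g = p * np.exp(lm) / (p * np.exp(lm) + (1 - p) * np.exp(lu))
        # M-step: update mixing proportion and per-field probabilities
        p = g.mean()
        m = np.clip((g @ gamma) / g.sum(), 1e-6, 1 - 1e-6)
        u = np.clip(((1 - g) @ gamma) / (1 - g).sum(), 1e-6, 1 - 1e-6)
    # Fellegi-Sunter log2 agreement weights for each candidate pair
    weights = gamma @ np.log2(m / u) + (1 - gamma) @ np.log2((1 - m) / (1 - u))
    return p, m, u, weights
```

Upper and lower cutoffs applied to the resulting weights then take the place of hand-tuned thresholds; the Winkler and Yancey reports cited above develop this machinery in far greater depth.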

A second factor in improving our knowledge of matching methods is the choice of ancillary information to use in the linking exercise. As we noted in our article, legal constraints prevented expanding the information set used in the edit beyond the full history of the businesses and individuals present in the California UI wage record data. In particular, the statute under which the data were received and edited (Title 15 U.S. Code) prohibited commingling these data with other Title 13 data at the Census Bureau. Such a commingling would have prevented returning the edited file to its original custodian, the California employment security office. There would have been substantial additional information had we used these off-limits Title 13 data. In particular, additional name and geography information could have been used both to improve the match rates and to assess the false-match rate. Some research along these lines is clearly warranted; however, for statutory reasons, it must proceed outside the context of the QWI production system.

5. MOVING AWAY FROM ONE–TO–ONE MATCHES

Statistical methods that can take into account knowledge about matching errors, surveyed in the forthcoming Handbook of Econometrics chapter (Moffitt and Ridder 2005) suggested by van der Klauw, will become increasingly important. How to actually transmit the detailed knowledge (e.g., from the matching procedures regularly used within the Census Bureau and the BLS) is a different problem—one that is still in search of a solution.

CFR suggest an interesting extension or modification, namely moving away from one-to-one matching toward explicit Bayes matching. The typical approach in matching applications, which we have adopted here as well, is to use one-to-one matching. CFR propose a modification in the same spirit that motivated improved missing-data imputation models in the last decades—assigning multiple probable links to a single candidate employment history. In the case of missing-data imputation, multiple imputation is achieved (in parametric models) by successive draws from the posterior predictive distribution of the missing values. In the case of record linking, the actual application is nontrivial, because the posterior distribution not only must cover a draw from a distribution of candidate pairs, but also must account for the changing likelihood of no match being admissible. At present, we are not aware of an actual implementation of a multiple-match or Bayesian matching algorithm. Some authors, as pointed out to us by William Winkler in private correspondence, have relaxed some aspects of the one-to-one match decision rule by tracking some indicator of match quality from the matching process to the regression analysis (Lahiri and Larsen 2004; Scheuren and Winkler 1993, 1997). The effect of different possible matches is considered when the variable z_i that enters the regression analysis together with x_i, and that stems from a matching procedure, is defined as (Lahiri and Larsen 2004)

z_i = y_i with probability q_ii,
      y_j with probability q_ij.

The true but unobserved match is z_i = y_i, but the record linkage provides, for each possible y_j, j = 1, . . . , n, a corresponding probability q_ij, with Σ_j q_ij = 1. In this analysis, the one-to-one match is a special case, with q_ii = 1 for all i. As a result, the effect of mismatches—particularly, the potential bias of using only one-to-one matches—can be evaluated in a regression framework.
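The kind of evaluation described here can be illustrated with a small Monte Carlo exercise. The sketch below is a toy version, not the Lahiri–Larsen estimator: it corrupts a fraction of one-to-one links at random and reports the naive OLS slope, whose attenuation reflects the mismatch bias.

```python
import numpy as np

def mismatch_bias(n=10_000, beta=1.0, q_correct=0.9, n_sims=200, seed=0):
    """Monte Carlo illustration of the attenuation caused by one-to-one links
    that are wrong with probability 1 - q_correct: the y attached to x_i is the
    true y_i with probability q_correct and a randomly chosen y_j otherwise."""
    rng = np.random.default_rng(seed)
    slopes = []
    for _ in range(n_sims):
        x = rng.normal(size=n)
        y = beta * x + rng.normal(size=n)
        wrong = rng.random(n) > q_correct
        linked_y = y.copy()
        linked_y[wrong] = y[rng.integers(0, n, wrong.sum())]  # mismatched links
        slopes.append(np.polyfit(x, linked_y, 1)[0])           # naive OLS slope
    return np.mean(slopes)

for q in (1.0, 0.95, 0.9, 0.8):
    print(q, round(mismatch_bias(q_correct=q), 3))
# the naive slope shrinks roughly in proportion to q_correct
```

In the notation above this corresponds to q_ii = q_correct with the remaining probability spread uniformly over the other candidates; a fuller treatment would carry the pair-specific q_ij produced by the matcher into the estimation.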

In the Scheuren–Winkler framework, all possible matches for a given set of matching parameters are used in the analysis. This may not always be feasible if the set of possible matches is large. A variation on this theme could be the introduction of a step into the matching procedure that perturbs the weights on which the one-to-one rule is applied. N repetitions of this step would result in as many sets of possibly different one-to-one matches, which could be analyzed using standard multiple-imputation techniques (Rubin 1987). Such a method still remains speculative at this time, but we find this a most promising path for future research.
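A minimal sketch of the combining step for such a procedure, assuming each of the N perturbed matching runs yields a point estimate and a sampling variance for the same scalar quantity; the combining rules are Rubin's (1987) standard ones, and the perturbation step itself is left abstract.

```python
import numpy as np

def combine_multiple_matchings(estimates, variances):
    """Rubin's (1987) combining rules applied to N analyses, each run on a
    different one-to-one matching obtained from perturbed match weights.
    `estimates` and `variances` are length-N arrays holding each completed-data
    point estimate and its sampling variance."""
    estimates, variances = np.asarray(estimates), np.asarray(variances)
    n = len(estimates)
    q_bar = estimates.mean()                 # combined point estimate
    within = variances.mean()                # average within-matching variance
    between = estimates.var(ddof=1)          # variance across matchings
    total = within + (1 + 1 / n) * between   # total variance
    return q_bar, total
```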

The processing associated with a multiply linked analysis file remains challenging. When the resulting data meet Rubin's conditions, the Bayesian methods for any regular estimand are known. These techniques permit direct implementation of CFR's suggestion and assessment of the variability due to the resolution of the missing data via probabilistic record linking. Complicated statistical analyses, such as the ones performed at the Census Bureau to form the QWIs, can be programmed to recognize multiply imputed missing data. The QWIs, for example, recognize 10 implicates for every data item that can be imputed. However, even the simplest of states takes 24 hours of parallel computing time (10 thread maximum) to complete with the assumption that the input employment histories (the items we edit in our article) are fixed. States like California take several weeks. Feasible multiple imputation of the record linkage in the wage record edit would have to be combined with the other missing-data models in the QWI so that the processing of all implicates could proceed in parallel.

6. ACCESS TO THE DATA

One issue that multiple commentators have mentioned is access to the data. The data that we used for this article are not public use. In particular, our proposed correction methods require access to highly sensitive confidential information (SSN-laden employment histories). At the Census Bureau, the edited data files (and all other files used by the LEHD program in producing the QWIs) are purged of all name and SSN information before being processed further within the statistical system, where they are still considered confidential. Only a few analysts have access to the SSN and name-laden files, even at the Census Bureau. Nevertheless, our access is not exclusive. Access can be granted based on approved research projects. The secondary analysis proposed by most commentators, studying the effect of the correction procedure on microeconomic models of behavior and economic activity, can be done using the anonymized but still confidential microdata, because processing information from the correction is carried forward with the data. We refer interested researchers to the U.S. Census Research Data Centers and the Center for Economic Studies website (www.ces.census.gov), which contains information on applying to use confidential Census Bureau data.

7. FINAL REMARKS

We agree with all of the discussants that efforts to improve the quality of administrative record data for use as an input to general statistical analyses must go hand-in-hand with efforts to elaborate conventional analysis models. In particular, we think that the several discussions of the differences between the statistical processes generating the actual missing data and the assumptions of classical missing-data models are instructive. We welcome the efforts to incorporate more appropriate underlying statistical assumptions into mainstream analyses. These efforts are clearly in their infancy and will become widespread as the benefits of exponentially improving computer speeds continue to manifest themselves. We are particularly optimistic about the integration of large-scale database management techniques with the statistical modeling of record linkage and linked data files. Once the software to facilitate using these data structures as inputs to statistical models is widely disseminated, analyses like the one we have proposed here should become commonplace.

ADDITIONAL REFERENCES

Lahiri, P., and Larsen, M. D. (2004), “Regression Analysis With Linked Data,” Preprint 04-9, Iowa State University, Dept. of Statistics.

Moffitt, R., and Ridder, G. (2005), "The Econometrics of Data Combination," in Handbook of Econometrics, Vol. 6, eds. J. J. Heckman and E. E. Leamer, Amsterdam: North-Holland.

Rubin, D. B. (1987), Multiple Imputation for Nonresponse in Surveys, New York: Wiley.

Scheuren, F., and Winkler, W. E. (1993), "Regression Analysis of Data Files That Are Computer Matched," Survey Methodology, 19, 39–58.

——— (1997), "Regression Analysis of Data Files That Are Computer Matched, Part II," Survey Methodology, 23, 157–165.

Winkler, W. E. (2000a), "Frequency-Based Matching in the Fellegi–Sunter Model of Record Linkage," Research Report Series RR00/06, U.S. Census Bureau.

——— (2000b), "Using the EM Algorithm for Weight Computation in the Fellegi–Sunter Model of Record Linkage," Research Report Series RR00/05, U.S. Census Bureau.

Yancey, W. E. (2000), "Frequency-Dependent Probability Measures for Record Linkage," Research Report Series RR00/07, U.S. Census Bureau.

——— (2004), "Improving EM Algorithm Estimates for Record Linkage Parameters," Research Report Series RRS2004/01, U.S. Census Bureau.
