
4 Empirical Data Evaluation

4.3 Disclosure Risk Evaluation

Since I only synthesized one variable in each of the simulations, it is difficult to come up with a realistic disclosure scenario. Releasing the dataset with just one variable synthesized would definitely not be an option. However, the main aim of this paper is to compare the two methods, so it should be sufficient to compare the risks on a relative scale. I use two simple diagnostics to evaluate the risk from the two approaches. The first represents the percentage of records for which the mode across the synthetic responses equals the true response:

$$DR_1 = \frac{1}{N} \sum_{i=1}^{N} I\left(\operatorname{mode}\left(y_{syn}^{(i,j)}\right) = y_{org}^{(i)}\right), \quad i = 1, \ldots, N, \; j = 1, \ldots, m, \qquad (6)$$

where $I(\cdot)$ is the indicator function, $m$ is the number of imputations, and $N = 15{,}644$ is the number of records in the dataset. The second measure reports the average number of times the true response is imputed across the 10 imputations for the records identified by the first measure:

$$DR_2 = \frac{1}{N_R} \sum_{i \in R} \#\left(y_{syn}^{(i,j)} = y_{org}^{(i)}\right), \quad i = 1, \ldots, N, \; j = 1, \ldots, m, \qquad (7)$$

where $R$ is the set of records for which the mode of the synthetic records equals the true value and $N_R$ is the total number of records in $R$. Note that $DR_1$ is bounded between 0 and 1, whereas $DR_2$ is bounded between 0 and the number of imputations (10 in this case).
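As a concrete reading of (6) and (7), the following is a minimal sketch in R, assuming `y_org` is the vector of true responses (length N) and `y_syn` is an N x m matrix holding the m synthetic draws per record (both names are hypothetical):

```r
# Most frequent value in a vector (ties broken by first occurrence).
stat_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

disclosure_risk <- function(y_org, y_syn) {
  modes <- apply(y_syn, 1, stat_mode)
  hit   <- modes == y_org                      # records whose mode reveals the truth
  DR1   <- mean(hit)                           # eq. (6): share of such records
  DR2   <- mean(rowSums(y_syn == y_org)[hit])  # eq. (7): avg. matches among those records
  c(DR1 = DR1, DR2 = DR2)
}
```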

The results for the two regressions are presented in Table 2.

Table 2. Disclosure risk evaluations

                                           DR1    DR2
  regression 1  logit imputation          0.905  9.362
                SVM imputation            1.000  9.997
  regression 2  multinomial logit imp.    0.681  7.423
                SVM imputation            0.990  9.103

The relative "risks" for the SVM imputation are substantially larger than those for the parametric imputation, especially for the second simulation. For the SVM approach, this simple disclosure strategy would obviously reveal the true reported value for almost all records in the dataset. On the other hand, knowing the reported value for a single binary variable (or a variable with three categories) will hardly identify a single respondent in the dataset. To evaluate the real risks, more variables would have to be synthesized to achieve a realistic data dissemination scenario. It would then be possible to evaluate the risk of correctly identifying an individual record based on assumptions about the external knowledge an intruder might use for re-identification purposes ([10]). These risks should be considerably lower. Nevertheless, the results clearly indicate the increased relative risk of the SVM approach compared to the parametric approach. Arguably, the price in terms of increased risk is higher than the potential gains in data utility, at least for this simulation.

The high risk for the SVM approach is probably a direct result of the way support vector machines search for optimal solutions. High priority is given to sparse solutions, i.e., solutions for which only a small number of support vectors is necessary to classify the data. Support vectors are the data points that drive the decisions of the classifier; they are the points closest to the margin. All other points have no influence on the classification. But this also means that the posterior probabilities derived for data points that are not support vectors will always be close to one for the assigned category.

As a result, drawing from these posterior probabilities will likely produce the same imputed value for most of the draws. The risk might be further increased by the tuning parameter selection approach defined in (3). This tuning approach favors tuning parameter combinations that lead to posterior probabilities $P(y_i|X)$ with high probabilities for one of the classes of $y_i$. This in turn increases the probability that the same value is imputed in every imputation round and thus leads to an increase in $DR_1$ and $DR_2$.
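A toy illustration of this effect, with assumed posterior probabilities that are not taken from the paper: a near-degenerate posterior yields essentially identical draws across the imputations, while a flatter parametric posterior does not.

```r
set.seed(1)
p_svm   <- c(0.99, 0.01)  # assumed posterior for a non-support vector
p_logit <- c(0.70, 0.30)  # assumed flatter posterior from a parametric model
draws_svm   <- sample(c("A", "B"), size = 10, replace = TRUE, prob = p_svm)
draws_logit <- sample(c("A", "B"), size = 10, replace = TRUE, prob = p_logit)
table(draws_svm)    # almost certainly 10 x "A": the mode reveals the true value
table(draws_logit)  # a mix of categories: the mode is a much weaker signal
```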

5 Conclusions

Finding useful parametric imputation models for large-scale surveys can be a challenging and time-consuming task. One possible way of facilitating the search for useful models is to rely on machine learning approaches, which are based on the common idea of finding inherent structures in the data in a more or less automatic fashion by letting the data speak for themselves. Research on CART models ([27]) and random forests ([4]) has already shown some promising results in this direction. In this paper, I investigated whether the ideas behind support vector machines could be useful for generating synthetic datasets.

The findings indicate that although some improvements in data utility might be possible with the approach, they might come at the price of an increased disclosure risk, although the presented disclosure risk evaluations might be too simplified to allow a final statement. Clearly, more research is needed in this area. The potentially increased risk is probably a direct result of the fact that support vector machines aim to use a limited number of observations for the actual classification to avoid overfitting. For this reason, many records will be assigned the same class across all imputations, leading either to bad data quality if the classification does not provide consistent results or to a very high disclosure risk if the classification is correct. Both are undesirable results. As I pointed out earlier, overfitting is not as problematic in the context of synthetic data as it is in the context of forecasting. For that reason, the results from the SVM approach might be improved if the search for sparse solutions could be relaxed.

Along these lines, it is reasonable to investigate approaches for turning the SVM results into posterior probabilities other than the one I used in this paper. The approach I used explicitly tries to maintain the sparsity of the solution, and I chose it mainly for convenience because it was readily available in R. Other approaches simply use some form of penalized regression to arrive at the posterior probabilities, and this might actually be preferable in the synthetic data context. Given the promising results in terms of data utility, it would be interesting to see whether alternative approaches to obtaining the posterior probabilities from SVMs could reduce the risk of disclosure while maintaining the high data utility.
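As a rough, hypothetical sketch of this idea, raw SVM decision values could be mapped to posterior probabilities with a simple logistic (Platt-style) fit; the data below are simulated purely for illustration, and this is not the procedure used in the paper:

```r
set.seed(2)
n   <- 200
dec <- rnorm(n)                      # stand-in for decision values of a fitted SVM
y   <- rbinom(n, 1, plogis(3 * dec)) # simulated 0/1 labels consistent with dec
platt_fit <- glm(y ~ dec, family = binomial)         # logistic map: dec -> probability
p_hat <- predict(platt_fit, type = "response")        # estimated P(y = 1 | dec)
y_syn_draw <- rbinom(n, 1, p_hat)                     # draw synthetic values from p_hat
```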

This paper can be seen as an initial investigation of the applicability of support vector machines for generating synthetic datasets. Besides the necessary extensions for continuous data, an important next step would be to compare this method to other machine learning approaches, such as CART or random forests, that have already been demonstrated to work well as nonparametric synthesizing tools.


Acknowledgments. This research was supported by grants from the German Research Foundation and the German Federal Ministry of Education and Research. I want to thank the two anonymous referees of this paper for their valuable comments, which helped to improve the paper.

References

1. Bartlett, P.L., Jordan, M.I., McAuliffe, J.D.: Comment on: Moguerza, J.M. and Muñoz, A.: Support Vector Machines with Applications. Statistical Science 21, 341–345 (2006)

2. Berk, R.: Statistical Learning from a Regression Perspective. Springer, New York (2008)

3. Boser, B.E., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth ACM Workshop on Computational Learning Theory (COLT), pp. 144–152. ACM Press, New York (1992)

4. Caiola, G., Reiter, J.P.: Random Forests for Generating Partially Synthetic, Categorical Data. Transactions on Data Privacy 3, 27–42 (2010)

5. Cristianini, N., Shawe-Taylor, J.: Support Vector Machines and other kernel-based learning methods. Cambridge University Press, Cambridge (2000)

6. Drechsler, J.: Synthetic Datasets for the German IAB Establishment Panel. Working paper for the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality (2009)

7. Drechsler, J.: Multiple imputation of missing values in the wave 2007 of the IAB Establishment Panel. IAB Discussion Paper (6) (2010)

8. Drechsler, J., Bender, S., Rässler, S.: Comparing fully and partially synthetic data sets for statistical disclosure control in the German IAB Establishment Panel. Transactions on Data Privacy 1, 105–130 (2008)

9. Drechsler, J., Dundler, A., Bender, S., Rässler, S., Zwick, T.: A new approach for disclosure control in the IAB Establishment Panel – Multiple imputation for a better data access. Advances in Statistical Analysis 92, 439–458 (2008)

10. Drechsler, J., Reiter, J.P.: Accounting for intruder uncertainty due to sampling when estimating identification disclosure risks in partially synthetic data. In: Domingo-Ferrer, J., Saygin, Y. (eds.) Privacy in Statistical Databases, pp. 227–238. Springer, Heidelberg (2008)

11. Drechsler, J., Reiter, J.P.: Disclosure risk and data utility for partially synthetic data: An empirical study using the German IAB Establishment Survey. Journal of Official Statistics 25, 589–603 (2009)

12. Fienberg, S.E.: A radical proposal for the provision of micro-data samples and the preservation of confidentiality. Tech. rep., Department of Statistics, Carnegie Mellon University (1994)

13. Fischer, G., Janik, F., Müller, D., Schmucker, A.: The IAB Establishment Panel – from sample to survey to projection. Tech. rep., FDZ-Methodenreport No. 1 (2008)

14. Gomatam, S., Karr, A.F., Reiter, J.P., Sanil, A.P.: Data dissemination and disclosure limitation in a world without microdata: A risk-utility framework for remote access servers. Statistical Science 20, 163–177 (2005)

15. Graham, P., Penny, R.: Multiply imputed synthetic data files. Tech. rep., University of Otago (2005), http://www.uoc.otago.ac.nz/departments/pubhealth/pgrahpub.htm

16. Graham, P., Young, J., Penny, R.: Multiply imputed synthetic data: Evaluation of hierarchical Bayesian imputation models. Journal of Official Statistics 25, 407–426 (2009)

17. Hsu, C.-W., Chang, C.-C., Lin, C.-J.: A Practical Guide to Support Vector Classification. Technical report, Department of Computer Science, National Taiwan University (2010)

18. Karr, A.F., Kohnen, C.N., Oganian, A., Reiter, J.P., Sanil, A.P.: A framework for evaluating the utility of data altered to protect confidentiality. The American Statistician 60, 224–232 (2006)

19. Kölling, A.: The IAB Establishment Panel. Journal of Applied Social Science Studies 120, 291–300 (2000)

20. Lin, H.-T., Lin, C.-J., Weng, R.C.: A note on Platt's probabilistic outputs for support vector machines. Technical report, Department of Computer Science, National Taiwan University (2003)

21. Little, R.J.A.: Statistical analysis of masked data. Journal of Official Statistics 9, 407–426 (1993)

22. Meng, X.-L.: Multiple-imputation inferences with uncongenial sources of input (disc: pp. 558–573). Statistical Science 9, 538–558 (1994)

23. Moguerza, J.M., Muñoz, A.: Support Vector Machines with Applications (with discussion). Statistical Science 21, 322–362 (2006)

24. Platt, J.: Probabilities for SV machines. In: Smola, A., Bartlett, P., Schölkopf, B., Schuurmans, D. (eds.) Advances in Large Margin Classifiers, pp. 61–74. MIT Press, Cambridge (2000)

25. Reiter, J.P.: Inference for partially synthetic, public use microdata sets. Survey Methodology 29, 181–189 (2003)

26. Reiter, J.P.: Releasing multiply-imputed, synthetic public use microdata: An illustration and empirical study. Journal of the Royal Statistical Society, Series A 168, 185–205 (2005)

27. Reiter, J.P.: Using CART to generate partially synthetic, public use microdata. Journal of Official Statistics 21, 441–462 (2005)

28. Rubin, D.B.: Discussion: Statistical disclosure limitation. Journal of Official Statistics 9, 462–468 (1993)

29. Wahba, G.: Multivariate function and operator estimation, based on smoothing splines and reproducing kernels. In: Casdagli, M., Eubank, S. (eds.) Proc. of Nonlinear Modeling and Forecasting, SFI Studies in the Science of Complexity, vol. XII, pp. 95–112. Addison-Wesley, Reading (1992)

30. Wahba, G.: Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In: Schölkopf, B., Burges, C.J.C., Smola, A. (eds.) Advances in Kernel Methods – Support Vector Learning, pp. 69–88. MIT Press, Cambridge (1999)

31. Wu, T.-F., Lin, C.-J., Weng, R.C.: Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research 5, 975–1005 (2004)
