

Lecture Notes in Computer Science 6344

Commenced Publication in 1973
Founding and Former Series Editors:

Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board

David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Alfred Kobsa, University of California, Irvine, CA, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, TU Dortmund University, Germany
Madhu Sudan, Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos, University of California, Los Angeles, CA, USA
Doug Tygar, University of California, Berkeley, CA, USA
Gerhard Weikum, Max Planck Institute for Informatics, Saarbruecken, Germany


Josep Domingo-Ferrer
Emmanouil Magkos (Eds.)

Privacy in Statistical Databases

UNESCO Chair in Data Privacy International Conference, PSD 2010
Corfu, Greece, September 22-24, 2010
Proceedings


Volume Editors

Josep Domingo-Ferrer
Universitat Rovira i Virgili
Department of Computer Engineering and Mathematics
UNESCO Chair in Data Privacy
Av. Països Catalans 26, 43007 Tarragona, Catalonia, Spain
E-mail: [email protected]

Emmanouil Magkos
Ionian University, Department of Informatics
Plateia Tsirigoti 7, 49100 Corfu, Greece
E-mail: [email protected]

Library of Congress Control Number: 2010934215

CR Subject Classification (1998): H.2, D.4.6, K.6.5, C.2, E.3, E.1

LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI

ISSN 0302-9743

ISBN-10 3-642-15837-4 Springer Berlin Heidelberg New York ISBN-13 978-3-642-15837-7 Springer Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.

springer.com

© Springer-Verlag Berlin Heidelberg 2010 Printed in Germany

Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180


Preface

Privacy in statistical databases is a discipline whose purpose is to provide solutions to the tension between the social, political, economic and corporate demand for accurate information, and the legal and ethical obligation to protect the privacy of the various parties involved. Those parties are the respondents (the individuals and enterprises to which the database records refer), the data owners (those organizations spending money in data collection) and the users (the ones querying the database or the search engine, who would like their queries to stay confidential). Beyond law and ethics, there are also practical reasons for data-collecting agencies and corporations to invest in respondent privacy: if individual respondents feel their privacy guaranteed, they are likely to provide more accurate responses. Data owner privacy is primarily motivated by practical considerations: if an enterprise collects data at its own expense, it may wish to minimize leakage of those data to other enterprises (even to those with whom joint data exploitation is planned). Finally, user privacy results in increased user satisfaction, even if it may curtail the ability of the database owner to profile users.

There are at least two traditions in statistical database privacy, both of which started in the 1970s: the first one stems from official statistics, where the discipline is also known as statistical disclosure control (SDC), and the second one originates from computer science and database technology. In official statistics, the basic concern is respondent privacy. In computer science, the initial motivation was also respondent privacy but, from 2000 onwards, growing attention has been devoted to owner privacy (privacy-preserving data mining) and user privacy (private information retrieval). In the last few years, the interest and the achievements of computer scientists in the topic have substantially increased, as reflected in the contents of this volume.

“Privacy in Statistical Databases 2010” (PSD 2010) was held under the sponsorship of the UNESCO Chair in Data Privacy, which has provided a stable umbrella for the PSD biennial conference series since PSD 2008, held in Istanbul. Previous PSD conferences were PSD 2006, the final conference of the Eurostat-funded CENEX-SDC project, held in Rome in 2006, and PSD 2004, the final conference of the EU FP5 CASC project (IST-2000-25069), held in Barcelona in 2004. Proceedings of PSD 2008, PSD 2006 and PSD 2004 were published by Springer in LNCS 5262, LNCS 4302 and LNCS 3050, respectively.

The four PSD conferences held so far are a follow-up of a series of high-quality technical conferences on SDC, which started twelve years ago with “Statistical Data Protection-SDP’98”, held in Lisbon in 1998 and with proceedings published by OPOCE, and continued with the AMRADS project SDC Workshop, held in Luxemburg in 2001 and with proceedings published by Springer in LNCS 2316.


The PSD 2010 Program Committee accepted for publication in this volume 25 papers out of 37 full submissions. Furthermore, 5 additional submissions were received which were reviewed for short presentation at the conference and inclusion in the companion CD proceedings. Papers came from 16 different countries from 4 different continents. Each submitted paper received at least two reviews.

The revised versions of the 25 accepted papers in this volume are a fine blend of contributions from official statistics and computer science. Topics covered include tabular data protection, microdata protection, synthetic data, differential privacy, on-line databases and remote access, privacy-preserving protocols, and legal issues.

We are indebted to many people. First, to the Government of Catalonia for financial support to the UNESCO Chair in Data Privacy, which enabled the latter to sponsor PSD 2010. Also, to the Organization Committee for making the conference possible and especially to Jesús Manjón, who helped prepare these proceedings. In evaluating the papers we were assisted by the Program Committee and the following external reviewers: Anne-Sophie Charest, Jörg Drechsler, Sara Foresti, Flavio Foschi, Gabriel Ghinita and Peter-Paul de Wolf.

We also wish to thank all the authors of submitted papers and apologize for possible omissions.

July 2010 Josep Domingo-Ferrer

Emmanouil Magkos


Organization

Program Committee

John Abowd, Cornell University, USA
Bettina Berendt, Katholieke Universiteit Leuven, Belgium
Elisa Bertino, CERIAS, Purdue University, USA
Jordi Castro, Polytechnical University of Catalonia, Spain
Lawrence Cox, Nat. Center for Health Statistics, USA
Vassilios Chrissikopoulos, Ionian University, Greece
Josep Domingo-Ferrer, Universitat Rovira i Virgili, Catalonia, Spain
Mark Elliot, Manchester University, UK
Stephen Fienberg, Carnegie Mellon University, USA
Luisa Franconi, ISTAT, Italy
Sarah Gießing, Destatis, Germany
Anco Hundepool, Statistics Netherlands, The Netherlands
Julia Lane, National Science Foundation, USA
Emmanouil Magkos, Ionian University, Greece
Bradley Malin, Vanderbilt University, USA
Josep M. Mateo-Sanz, Universitat Rovira i Virgili, Catalonia, Spain
Krish Muralidhar, University of Kentucky, USA
Jean-Marc Museux, EUROSTAT, European Union
Silvia Polettini, University of Naples, Italy
Yosef Rinott, Hebrew University, Israel
Gerd Ronning, University of Tübingen, Germany
Juan José Salazar, University of La Laguna, Spain
Pierangela Samarati, University of Milan, Italy
Yücel Saygın, Sabancı University, Turkey
Eric Schulte-Nordholt, Statistics Netherlands, The Netherlands
Natalie Shlomo, University of Southampton, UK
Vicenç Torra, IIIA-CSIC, Catalonia, Spain
Vassilios Verykios, University of Thessaly, Greece
William E. Winkler, Census Bureau, USA
Laura Zayatz, Census Bureau, USA

Program Chair

Josep Domingo-Ferrer, Universitat Rovira i Virgili, Catalonia, Spain

General Co-chairs

Emmanouil Magkos, Ionian University, Greece
Vassilios Chrissikopoulos, Ionian University, Greece


Organization Committee

Ioannis Karydis, Ionian University, Greece
Jesús Manjón, Universitat Rovira i Virgili, Catalonia, Spain
Alexandros Panaretos, Ionian University, Greece
Glòria Pujol, Universitat Rovira i Virgili, Catalonia, Spain
Spyros Sioutas, Ionian University, Greece


Table of Contents

Tabular Data Protection

Privacy Disclosure Analysis and Control for 2D Contingency Tables Containing Inaccurate Data . . . . 1
  Bing Liang, Kevin Chiew, Yingjiu Li, and Yanjiang Yang

A Tool for Analyzing and Fixing Infeasible RCTA Instances . . . . 17
  Jordi Castro and José A. González

Branch-and-Cut versus Cut-and-Branch Algorithms for Cell Suppression . . . . 29
  Juan-José Salazar-González

Data Swapping for Protecting Census Tables . . . . 41
  Natalie Shlomo, Caroline Tudor, and Paul Groom

Eliminating Small Cells from Census Counts Tables: Some Considerations on Transition Probabilities . . . . 52
  Sarah Giessing and Jörg Höhne

Three Ways to Deal with a Set of Linked SBS Tables Using τ-argus . . . . 66
  Peter-Paul de Wolf and Anco Hundepool

Microdata Protection

IPUMS-International Statistical Disclosure Controls: 159 Census Microdata Samples in Dissemination, 100+ in Preparation . . . . 74
  Robert McCaa, Steven Ruggles, and Matt Sobek

Uncertainty for Anonymity and 2-Dimensional Range Query Distortion . . . . 85
  Spyros Sioutas, Emmanouil Magkos, Ioannis Karydis, and Vassilios S. Verykios

PRAM Optimization Using an Evolutionary Algorithm . . . . 97
  Jordi Marés and Vicenç Torra

Multiplicative Noise Protocols . . . . 107
  Anna Oganian

Measurement Error and Statistical Disclosure Control . . . . 118
  Natalie Shlomo

Semantic Microaggregation for the Anonymization of Query Logs . . . . 127
  Arnau Erola, Jordi Castellà-Roca, Guillermo Navarro-Arribas, and Vicenç Torra

Data Environment Analysis and the Key Variable Mapping System . . . . 138
  Mark Elliot, Susan Lomax, Elaine Mackey, and Kingsley Purdam

Synthetic Data

Using Support Vector Machines for Generating Synthetic Datasets . . . . 148
  Jörg Drechsler

Synthetic Data for Small Area Estimation . . . . 162
  Joseph W. Sakshaug and Trivellore E. Raghunathan

Disclosure Risk of Synthetic Population Data with Application in the Case of EU-SILC . . . . 174
  Matthias Templ and Andreas Alfons

Differential Privacy

Differential Privacy and the Risk-Utility Tradeoff for Multi-dimensional Contingency Tables . . . . 187
  Stephen E. Fienberg, Alessandro Rinaldo, and Xiaolin Yang

Does Differential Privacy Protect Terry Gross’ Privacy? . . . . 200
  Krish Muralidhar and Rathindra Sarathy

Some Additional Insights on Applying Differential Privacy for Numeric Data . . . . 210
  Rathindra Sarathy and Krish Muralidhar

On-Line Databases and Remote Access

Remote Data Access and the Risk of Disclosure from Linear Regression: An Empirical Study . . . . 220
  Philipp Bleninger, Jörg Drechsler, and Gerd Ronning

The Microdata Analysis System at the U.S. Census Bureau . . . . 234
  Jason Lucero and Laura Zayatz

Establishing an Infrastructure for Remote Access to Microdata at Eurostat . . . . 249
  Wolf Heinrich Reuter and Jean-Marc Museux

Privacy-Preserving Protocols

Coprivacy: Towards a Theory of Sustainable Privacy . . . . 258
  Josep Domingo-Ferrer

Privacy-Preserving Record Linkage . . . . 269
  Rob Hall and Stephen E. Fienberg

Legal Issues

Strategies to Achieve SDC Harmonisation at European Level: Multiple Countries, Multiple Files, Multiple Surveys . . . . 284
  Daniela Ichim and Luisa Franconi

Author Index . . . . 297


Privacy Disclosure Analysis and Control for 2D Contingency Tables Containing Inaccurate Data

Bing Liang¹, Kevin Chiew¹, Yingjiu Li¹, and Yanjiang Yang²

¹ School of Information Systems, Singapore Management University
80 Stamford Road, Singapore 178902

² Institute for Infocomm Research
#21-01 Connexis South Tower, 1 Fusionopolis Way, Singapore 138632

Abstract. 2D (two-dimensional) contingency tables are used in many aspects of our daily life. In practice, errors may be incurred when generating or editing such a table, hence the data contained in the table could be inaccurate. Even so, it is still possible for a knowledgeable snooper who has acquired information about the error distributions to decipher some private information from a released table. This paper investigates the estimation of privacy disclosure probability for contingency tables with inaccurate data based on Fréchet bounds, and proposes two optimization solutions for the control of privacy disclosure so as to preserve private information. Our estimation of privacy disclosure probability and the optimization solutions are also applicable to error-free tables, which can be regarded as a special case where there are no errors.

The effectiveness of the solutions is verified by rigorous experiments.

Keywords: contingency table, privacy disclosure, Fréchet bounds.

1 Introduction

Some organizations such as census bureaus and health care agencies collect statistical information about individuals; however, with the responsibility of preserving individual privacy (e.g., salary, treatment frequency of a certain disease), they only publish aggregated data of individuals for public service. Typically, these individual data are presented in the form of 2D contingency tables with non-negative integers as cell values, but only marginal values (i.e., the sums of cell values of a row, a column, or the whole table) are released to the public as aggregated information, as shown in Figure 1.¹ Although the aggregated information is assumed to be privacy-preserved, it still suffers from the risk of inferential disclosure [4], by which a data snooper may infer some private information about

1 Section 2 will show what types of privacy disclosure patterns can be found from the table shown in the figure.

J. Domingo-Ferrer and E. Magkos (Eds.): PSD 2010, LNCS 6344, pp. 1–16, 2010.

© Springer-Verlag Berlin Heidelberg 2010


       T1   T2   T3   T4
Bob   (5)  (1)  (0)  (0)    6
Dan   (1)  (1)  (0)  (0)    2
Tim   (3)  (0)  (1)  (1)    5
       9    2    1    1    13

Fig. 1. An example of a contingency table (numbers in parentheses are not released)

individual data values. Disclosure of private and sensitive information may compromise individual and organizational privacy, confidentiality, and national interests [23, 18]. Therefore, data owners need to carefully assess the potential privacy disclosure and take effective actions to protect the data before they are released to the public.

1.1 Related Work

While various techniques have been developed for limiting the disclosure of private information from contingency tables² (see surveys and reviews in [8, 9]), they can be roughly classified into two categories, namely restriction-based and perturbation-based.

The restriction-based techniques limit the disclosure of private information by adding restrictions on queries [2, 25, 24]. These restrictions include the number of values aggregated in each query, the common values aggregated in different queries [7], and the rank of a matrix representing an array of answered queries. Some other restriction-based methods, such as partition [22], microaggregation [8], suppression and generalisation [6], k-anonymity [5], ℓ-diversity [19], and t-closeness [13], are proposed to sanitize data before release by imposing restrictions on data structures.

The perturbation-based techniques, on the other hand, prevent the disclosure of data privacy by adding random noise to source data [3, 14, 20], query answers [1], or data structures [21]. However, as pointed out in the early study by our research team [15], it has been reported in [10, 11] that the original sensitive information can be estimated accurately from the perturbed data; therefore, it is suggested that the perturbed data be carefully examined before release to the public.

Closely related are our research team's earlier studies on protecting contingency tables [16, 15, 17], on which this study builds by focusing on contingency tables with inaccurate data. The disclosure control method proposed in [17] can be classified as a restriction-based technique, and is developed based on four types of disclosure patterns that were identified for the first time based on Fréchet bounds. Two other new disclosure patterns are identified in [15] for 2D dynamic tables, together with disclosure control methods developed to prevent data snoopers from inferring the data change from such

² We henceforth use the terms 2D contingency tables, contingency tables, and tables interchangeably if no confusion arises.


2D dynamic tables. Besides the discussion in [17] on extending the disclosure analysis to high-dimensional tables, the disclosure patterns for 3D tables (data cubes) are further studied in [16] with bounds tighter than the Fréchet bounds, and control methods are proposed to prevent data cubes from leaking data privacy.

1.2 Our Research Scenario and Solution

An inadequacy of the existing research is that errors in contingency tables have not been taken into account. In practice, it is quite common that errors are incurred when generating, editing, and reviewing the tables, leading to inaccurate data in the contingency tables released to the public. On the other hand, some knowledgeable snoopers may have accumulated, either directly or indirectly, experience in making such contingency tables and may have acquired knowledge of the error distributions by comparing, summarizing, and inducing from existing tables. Therefore, even with errors, it is still possible for such knowledgeable snoopers to decipher some private information from these tables.

In this paper, we investigate the estimation of privacy disclosure probability and the control of disclosure for cells in a contingency table that contains inaccurate data. In detail, after revisiting the four types of privacy disclosure patterns based on Fréchet bounds identified in our team's earlier study [15], we model the error distributions of marginal values as normal distributions based on the central limit theorem, and estimate the probability of each type of privacy disclosure pattern based on the marginal values and their error distributions. The probability estimations for the four types of privacy disclosure patterns are also applicable to error-free tables, which can be regarded as a special case where both the means and variances of the errors approach zero (i.e., μ → 0 and σ² → 0). Given these estimations, we propose two methods, known as marginal value merge and marginal value vibration, for the control of privacy disclosure of a cell in a table. With either method, we can guarantee that the disclosure probability of a cell is within 0.5³ even if a knowledgeable data snooper possesses full knowledge of the error distributions. Although both methods are developed against knowledgeable snoopers, they are also applicable against naive data snoopers who have no idea of the error distributions, given the understanding that a naive snooper cannot infer more privacy from a released table than a knowledgeable snooper can.

Moreover, both methods are applicable to error-free tables. We also conduct rigorous experiments to verify the effectiveness of both methods against inferential attacks from knowledgeable data snoopers. To the best of our knowledge, this paper is the first study on the estimation and control of privacy disclosure for 2D contingency tables with inaccurate data.

³ From a viewpoint of security, we can assume that a cell in a table is safe against privacy disclosure unless the disclosure probability is higher than 0.5. This is because an attacker can make a blind guess about a cell with 50% confidence in any situation, safe or unsafe, just like tossing a coin with either side as the outcome. Unless the attacker claims with over 50% confidence that the cell is unsafe, we do not have to worry that he/she knows the privacy of the cell.


The remaining sections of the paper are organized as follows. We first introduce preliminaries, including the different types of privacy disclosure patterns and the error distribution of marginal values, in Section 2, followed by the estimation of privacy disclosure probability in Section 3 and methods for the control of privacy disclosure in Section 4. We present experimental results with analysis in Section 5 before concluding the paper in Section 6 by pointing out directions for future study.

2 Preliminaries

Before proceeding to estimate the probabilities of privacy disclosure based on Fréchet bounds for 2D contingency tables, we revisit the four types of privacy disclosure patterns identified in our team's early study [15] and introduce the modeling of error distributions.

2.1 Revisit of Privacy Disclosure Patterns

Definition 1 (2D Contingency Table). A 2D contingency table $X$ is an $m \times n$ table with non-negative cell values $x_{ij}$, denoted as $X = \{x_{ij} \mid x_{ij} \ge 0, \ 1 \le i \le m, \ 1 \le j \le n\}$.

Definition 2 (Marginal Values). The marginal values of a contingency table are defined as

$$x_{i+} = \sum_{j=1}^{n} x_{ij}, \qquad x_{+j} = \sum_{i=1}^{m} x_{ij}, \qquad x_{++} = \sum_{i=1}^{m} \sum_{j=1}^{n} x_{ij}.$$

Definition 3 (Fréchet Bounds). The Fréchet bounds of any cell value $x_{ij}$ of a contingency table are defined as $F_{\ell}(x_{ij}) \le x_{ij} \le F_u(x_{ij})$, where the Fréchet lower bound is $F_{\ell}(x_{ij}) = \max\{x_{i+} + x_{+j} - x_{++},\ 0\}$ and the Fréchet upper bound is $F_u(x_{ij}) = \min\{x_{i+},\ x_{+j}\}$.

It can be proven that the Fréchet bounds of $x_{ij}$ are the exact bounds (i.e., the supremum and the infimum) that a data snooper can derive for $x_{ij}$ given the marginal values [16]. In what follows, we revisit the four types of privacy disclosure patterns [15] of a cell value in a contingency table based on its Fréchet bounds. For simplicity, we omit the adjective "Fréchet" when referring to the lower or upper bound hereafter if no confusion arises.
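As a small illustration (not part of the paper), the Fréchet bounds of every cell can be computed directly from the released marginals. The sketch below does this for the table of Figure 1; the function name `frechet_bounds` is our own.

```python
# Sketch: computing Fréchet bounds for each cell from released marginals.
# Values follow Figure 1; the function name is our own.

def frechet_bounds(row_margins, col_margins):
    """Return a matrix of (lower, upper) Fréchet bounds for every cell."""
    total = sum(row_margins)
    assert total == sum(col_margins), "row and column margins must agree"
    return [[(max(r + c - total, 0), min(r, c)) for c in col_margins]
            for r in row_margins]

# Figure 1: rows Bob, Dan, Tim; columns T1..T4.
row_margins = [6, 2, 5]
col_margins = [9, 2, 1, 1]
bounds = frechet_bounds(row_margins, col_margins)

# Cell (Bob, T1): lower = max(6 + 9 - 13, 0) = 2, upper = min(6, 9) = 6.
print(bounds[0][0])  # (2, 6)
```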

Definition 4 (Existence Disclosure). A cell value $x_{ij}$ suffers from existence disclosure if its lower bound is positive, i.e., $F_{\ell}(x_{ij}) > 0$.

Definition 5 (τ-Upward Disclosure). A cell value $x_{ij}$ suffers from τ-upward disclosure if its lower bound is greater than a positive threshold $\tau$, i.e., $F_{\ell}(x_{ij}) > \tau > 0$.

By these two definitions, if there is no existence disclosure for a cell value, then there is no τ-upward disclosure for this cell value. On the other hand, if there is no τ-upward disclosure for a cell value, there may still be existence disclosure for this cell value.


Definition 6 (τ-Downward Disclosure). A cell value $x_{ij}$ suffers from τ-downward disclosure if its upper bound is less than a positive threshold $\tau$, i.e., $F_u(x_{ij}) < \tau$ where $\tau > 0$.

Definition 7 (τ-Approximation Disclosure). A cell value $x_{ij}$ suffers from τ-approximation disclosure if the difference between its upper and lower bounds is less than a positive threshold $\tau$, i.e., $F_u(x_{ij}) - F_{\ell}(x_{ij}) < \tau$ where $\tau > 0$.

From the above definitions, we have the following corollary.

Corollary 1. For any cell value $x_{ij}$ suffering from τ-approximation disclosure, if this cell value is free of existence disclosure (i.e., $F_{\ell}(x_{ij}) = 0$), then the τ-approximation disclosure of cell value $x_{ij}$ degenerates to τ-downward disclosure.

The following examples may give intuitive understandings of these types of privacy disclosure. In the table of Figure 1, there are existence disclosure at cell (Tim, T1) and τ-upward disclosure at cell (Bob, T1) for τ = 1.9. If column T1 records patient-treatment information, then a snooper can easily infer that Bob receives the treatment for a specific disease at least twice (9 + 6 − 13 = 2) and Tim at least once (9 + 5 − 13 = 1). If the specific disease is highly sensitive, such as AIDS, their respective privacy is violated and they may suffer from discrimination from society. There is τ-downward disclosure at cell (Dan, T2) for τ = 2.1. If column T2 represents certain academic examination results, then a snooper knows the upper bound of Dan's score (i.e., less than 2.1) and can further decipher whether Dan's academic performance is satisfactory or not. Another example in Figure 1: there is τ-approximation disclosure at cell (Bob, T1) for τ = 4.1. If column T1 stands for salary scales, then Bob's salary scale is known to lie between 2 and 6, and the actual salary scale of Bob can be revealed with high probability.
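The four patterns can be checked mechanically from the bounds. The following sketch (our own helper names; values and thresholds from the examples above) reproduces those checks for Figure 1.

```python
# Sketch: testing the four disclosure patterns of Definitions 4-7 on Figure 1.
# Helper names are our own; thresholds follow the examples in the text.

def bounds(r, c, total):
    """Fréchet bounds of a cell from its row margin r, column margin c."""
    return max(r + c - total, 0), min(r, c)

def existence(lo, hi):           return lo > 0        # Definition 4
def upward(lo, hi, tau):         return lo > tau      # Definition 5
def downward(lo, hi, tau):       return hi < tau      # Definition 6
def approximation(lo, hi, tau):  return hi - lo < tau # Definition 7

total = 13
bob_t1 = bounds(6, 9, total)   # (2, 6)
dan_t2 = bounds(2, 2, total)   # (0, 2)
tim_t1 = bounds(5, 9, total)   # (1, 5)

print(existence(*tim_t1))                # True
print(upward(*bob_t1, tau=1.9))          # True: lower bound 2 > 1.9
print(downward(*dan_t2, tau=2.1))        # True: upper bound 2 < 2.1
print(approximation(*bob_t1, tau=4.1))   # True: 6 - 2 = 4 < 4.1
```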

2.2 Modeling of Error Distribution of Marginal Values

As aforementioned, a released contingency table may contain inaccurate data because errors could be incurred when generating, editing, and reviewing the table. The following relationship holds for any cell value $x_{ij}$, where $e_{ij}$ denotes the error and $r_{ij}$ the correct value of $x_{ij}$:

$$r_{ij} = x_{ij} + e_{ij} \qquad (1)$$

We use $\mu_{ij}$ and $\sigma_{ij}^2$ to denote the mean and variance of the error $e_{ij}$ of cell value $x_{ij}$. Regardless of the distribution of the error $e_{ij}$, both its mean $\mu_{ij}$ and variance $\sigma_{ij}^2$ are statistically obtainable from empirical data when generating the tables.

To estimate the probability of privacy disclosure for a cell value based on its Fréchet bounds, we need to know the error distribution of the marginal values. Assume that the error of any cell value is independent of the others. Let

$$e_{i+} = \sum_{j=1}^{n} e_{ij}, \qquad e_{+j} = \sum_{i=1}^{m} e_{ij}, \qquad e_{++} = \sum_{i=1}^{m} \sum_{j=1}^{n} e_{ij}$$

where $e_{i+}$ denotes the error of marginal value $x_{i+}$, $e_{+j}$ the error of marginal value $x_{+j}$, and $e_{++}$ the error of marginal value $x_{++}$. Based on Lyapunov's central limit theorem, if the size of a table is big enough (say $30 \times 30$), the errors of the marginal values follow normal distributions:

$$e_{i+} \sim N(\mu_{i+}, \sigma_{i+}^2), \qquad e_{+j} \sim N(\mu_{+j}, \sigma_{+j}^2), \qquad e_{++} \sim N(\mu_{++}, \sigma_{++}^2)$$

where

$$\mu_{i+} = \sum_{j=1}^{n} \mu_{ij}, \quad \sigma_{i+}^2 = \sum_{j=1}^{n} \sigma_{ij}^2, \qquad \mu_{+j} = \sum_{i=1}^{m} \mu_{ij}, \quad \sigma_{+j}^2 = \sum_{i=1}^{m} \sigma_{ij}^2, \qquad \mu_{++} = \sum_{i=1}^{m} \sum_{j=1}^{n} \mu_{ij}, \quad \sigma_{++}^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \sigma_{ij}^2.$$
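Under the independence assumption, the marginal error parameters are simply sums of the per-cell means and variances. A minimal sketch (our own function name and toy inputs):

```python
# Sketch: aggregating per-cell error means/variances (mu_ij, sigma^2_ij) into
# the parameters of the normal marginal-error distributions. Names are our own.

def marginal_error_params(mu, var):
    """mu, var: m x n matrices of per-cell error means and variances."""
    m, n = len(mu), len(mu[0])
    mu_row = [sum(mu[i][j] for j in range(n)) for i in range(m)]    # mu_{i+}
    var_row = [sum(var[i][j] for j in range(n)) for i in range(m)]  # sigma^2_{i+}
    mu_col = [sum(mu[i][j] for i in range(m)) for j in range(n)]    # mu_{+j}
    var_col = [sum(var[i][j] for i in range(m)) for j in range(n)]  # sigma^2_{+j}
    mu_tot = sum(mu_row)                                            # mu_{++}
    var_tot = sum(var_row)                                          # sigma^2_{++}
    return mu_row, var_row, mu_col, var_col, mu_tot, var_tot

# Toy 2x2 example: each cell error has mean 0.1 and variance 0.04,
# so e.g. mu_{i+} = 0.2 per row and sigma^2_{++} = 0.16.
mu = [[0.1, 0.1], [0.1, 0.1]]
var = [[0.04, 0.04], [0.04, 0.04]]
print(marginal_error_params(mu, var))
```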

3 Estimation of Privacy Disclosure Probability

With the above definitions of privacy disclosure and modeling of error distributions, in what follows we investigate the probability of a cell value suffering from a certain type of privacy disclosure.

Let

$$r_{i+} = \sum_{j=1}^{n} r_{ij}, \qquad r_{+j} = \sum_{i=1}^{m} r_{ij}, \qquad r_{++} = \sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij}$$

where $r_{i+}$, $r_{+j}$, and $r_{++}$ represent the correct marginal values. From Equation (1) we have

$$r_{i+} + r_{+j} - r_{++} = x_{i+} + x_{+j} - x_{++} + e_{i+} + e_{+j} - e_{++} \qquad (2)$$

Let $\hat{F}(r_{ij}) = r_{i+} + r_{+j} - r_{++}$, $\hat{F}(x_{ij}) = x_{i+} + x_{+j} - x_{++}$, and $\hat{e}_{ij} = e_{i+} + e_{+j} - e_{++}$; then Equation (2) can be transformed into

$$\hat{F}(r_{ij}) = \hat{F}(x_{ij}) + \hat{e}_{ij}.$$

Let $\hat{\mu}_{ij}$ and $\hat{\sigma}_{ij}^2$ denote the mean and variance of $\hat{e}_{ij}$. Given the assumption that all errors of cell values are mutually independent, we have

$$\hat{\mu}_{ij} = \mu_{i+} + \mu_{+j} - \mu_{++} \qquad (3)$$

and, from $\hat{e}_{ij} = e_{i+} + e_{+j} - e_{++} = e_{ij} - \sum_{s \ne i} \sum_{t \ne j} e_{st}$, we have

$$\hat{\sigma}_{ij}^2 = \sigma_{ij}^2 + \sum_{s \ne i} \sum_{t \ne j} \sigma_{st}^2 \doteq \sigma_{++}^2 - \sigma_{i+}^2 - \sigma_{+j}^2 \qquad (4)$$

Let $P_e(r_{ij})$, $P_u^{\tau}(r_{ij})$, $P_d^{\tau}(r_{ij})$, and $P_a^{\tau}(r_{ij})$ respectively denote the probabilities of the correct value $r_{ij}$ suffering from existence disclosure, τ-upward disclosure, τ-downward disclosure, and τ-approximation disclosure. We have

$$P_e(r_{ij}) = 1 - \Phi\left(\frac{-\hat{F}(x_{ij}) - \hat{\mu}_{ij}}{\hat{\sigma}_{ij}}\right), \qquad P_u^{\tau}(r_{ij}) = 1 - \Phi\left(\frac{\tau - \hat{F}(x_{ij}) - \hat{\mu}_{ij}}{\hat{\sigma}_{ij}}\right)$$

$$P_d^{\tau}(r_{ij}) \doteq \Phi(A) + \Phi(B) - \Phi(A) \cdot \Phi(B)$$

$$P_a^{\tau}(r_{ij}) \doteq P_d^{\tau}(r_{ij}) \cdot \left(1 - P_e(r_{ij})\right) + \left(\Phi(C) + \Phi(D) - \Phi(C) \cdot \Phi(D)\right) \cdot P_e(r_{ij})$$

where $\Phi(x)$ is the cumulative distribution function (cdf for short) of the standard normal distribution⁴, and $A$, $B$, $C$, and $D$ are defined as follows:

$$A = \frac{\tau - x_{i+} - \mu_{i+}}{\sigma_{i+}}, \qquad B = \frac{\tau - x_{+j} - \mu_{+j}}{\sigma_{+j}}, \qquad C = \frac{\tau - x_{++} + x_{i+} - \mu_{++} + \mu_{i+}}{\sqrt{\sigma_{++}^2 - \sigma_{i+}^2}}, \qquad D = \frac{\tau - x_{++} + x_{+j} - \mu_{++} + \mu_{+j}}{\sqrt{\sigma_{++}^2 - \sigma_{+j}^2}}.$$

Detailed calculations of the above disclosure probabilities are given in Appendix A.
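The probabilities $P_e$ and $P_u^{\tau}$ above involve only $\Phi$, which the standard library can supply via the identity $\Phi(x) = \tfrac{1}{2}(1 + \operatorname{erf}(x/\sqrt{2}))$. A minimal sketch (the numeric inputs below are hypothetical, not from the paper):

```python
# Sketch: estimating P_e and P_u^tau from a cell's Fréchet-bound statistic and
# (assumed known) error parameters. Numeric inputs are hypothetical examples.
from math import erf, sqrt

def phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_existence(f_hat, mu_hat, sigma_hat):
    """P_e(r_ij) = 1 - Phi((-F_hat(x_ij) - mu_hat) / sigma_hat)."""
    return 1.0 - phi((-f_hat - mu_hat) / sigma_hat)

def p_upward(f_hat, mu_hat, sigma_hat, tau):
    """P_u^tau(r_ij) = 1 - Phi((tau - F_hat(x_ij) - mu_hat) / sigma_hat)."""
    return 1.0 - phi((tau - f_hat - mu_hat) / sigma_hat)

# Hypothetical cell: F_hat(x_ij) = x_i+ + x_+j - x_++ = 2, small errors.
f_hat, mu_hat, sigma_hat = 2.0, 0.1, 0.5
print(round(p_existence(f_hat, mu_hat, sigma_hat), 4))  # close to 1
print(round(p_upward(f_hat, mu_hat, sigma_hat, 1.9), 4))
```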

4 Control of Privacy Disclosure

As analyzed in the previous section, given the marginal values of a table and their error distributions, the table owner can estimate the probability of privacy disclosure; so can a knowledgeable data snooper who has acquired the error distributions by some means in addition to the released marginal values. Therefore, if a table owner can somehow re-organize or camouflage the data before releasing them to the public such that the probability of privacy disclosure is not greater than a predefined probability threshold $p_{\tau}$, then the table owner can safely assume that the released table is privacy-preserved, without worrying about attacks from either naive or knowledgeable snoopers. In our scenario, we set $p_{\tau} = 0.5$.⁵

Definition 8 (Unsafe Cell Value). A cell value is assumed unsafe if and only if the probability that it suffers from any type of privacy disclosure is greater than 0.5.

With the above understanding, we propose two methods below for re-organizing or camouflaging a table before releasing it. These two methods, known as marginal value merge and marginal value vibration, ensure that there is no unsafe cell value in a released table, regardless of whether or not the original table is error-free. The trade-off of both methods is an acceptable potential loss of, or inaccuracy in, the released information.

⁴ The cdf $\Phi(x)$ is defined as $\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{u^2}{2}\right) du$, where $x \in \mathbb{R}$. For a random variable $X$ following the normal distribution, i.e., $X \sim N(\mu, \sigma^2)$, and a given $x$, the probability that $X \le x$ is $P(X \le x) = \Phi\left(\frac{x - \mu}{\sigma}\right)$.

5 See footnote 3 in subsection 1.2 for a further explanation.


4.1 Marginal Value Merge

Similar to the methods for data sanitization [5, 6], the main idea of the method of marginal value merge (MVM for short) is to re-organize a table by merging several marginal values together into a group and releasing their sum instead of the individual marginal values. In other words, several rows or columns of a table are merged into one row or column, and the marginal value of this merged row or column is the sum of the marginal values of the rows or columns before the merge. Taking Figure 1 as an example, we may put columns T1 and T2 into a group and release the merged marginal value 11 instead of the two individual marginal values 9 and 2. The marginal values merged into a group do not necessarily appear adjacent or consecutive in the original table. This is because a released table has usually been normalized to at least 1NF, without any constraint on the appearing order of rows and columns. Regardless of whether or not a table contains inaccurate data, the MVM method applies to circumstances in which users may not need to know the details of each marginal value in a group.

Technically, with the MVM method we first estimate the disclosure probability for each cell value and identify the unsafe cell values, say $r_{ij}$. Then we merge several rows (or columns) together; for example, we merge the marginal value $x_{i+}$ (to which the unsafe cell value $r_{ij}$ contributes) with another non-zero marginal value $x_{h+}$ and release the new marginal value $x_{v+} = x_{i+} + x_{h+}$. To minimize the information loss, it is preferable to maximize the number of marginal values after the merge, as shown by the following optimization:

$$Z = \max\{\text{number of marginal values}\}$$

subject to

$$P(r_{ij}) \le 0.5$$

where $r_{ij}$ ($2 \le i + j < m + n$) is the correct cell value after the merge and $P(r_{ij})$ is the probability of a certain type of privacy disclosure for cell value $r_{ij}$. Theoretically, the optimal solutions lie among $2^{m \times n}$ possible groupings; practically, however, the search space is much smaller because many of the groupings are meaningless.

Taking Figure 1 as an example, if T1 and T2 stand for the weight and height of a person respectively, it does not make sense to merge them into a group.

Therefore, given a tolerable information loss, we can always find a locally optimal solution among an array of feasible solutions to this optimization. This is intuitive, as demonstrated by the extreme case in which all marginal values are merged into one group and the grand total x_{++} is the only element released, leading to a nearly useless table.
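A minimal sketch of the MVM idea (our illustration, not the authors' exact optimizer): `disclosure_prob` is a hypothetical stand-in for the estimators of Section 3, and unsafe marginals are greedily merged with the largest remaining partner until every released (possibly merged) marginal is safe.

```python
def mvm_release(row_marginals, disclosure_prob, threshold=0.5):
    """Greedily merge unsafe marginals until every released (merged) marginal
    keeps its disclosure probability <= threshold.
    Returns a list of (original-index group, summed marginal) pairs."""
    groups = [([i], m) for i, m in enumerate(row_marginals)]
    changed = True
    while changed:
        changed = False
        for a in range(len(groups)):
            if disclosure_prob(groups[a][1]) > threshold:
                # merge with the largest other non-zero group to dilute the small marginal
                partners = [b for b in range(len(groups)) if b != a and groups[b][1] > 0]
                if not partners:
                    break
                b = max(partners, key=lambda k: groups[k][1])
                merged = (sorted(groups[a][0] + groups[b][0]),
                          groups[a][1] + groups[b][1])
                groups = [g for k, g in enumerate(groups) if k not in (a, b)] + [merged]
                changed = True
                break
    return groups

# Toy example in the spirit of Figure 1: the small marginal 2 is unsafe
# under this stand-in estimator, so it is merged with 9 and released as 11.
prob = lambda m: 1.0 if m < 5 else 0.0   # hypothetical disclosure estimator
released = mvm_release([9, 2, 7], prob)
```

The greedy pairing is only one feasible grouping strategy; a real optimizer would also reject merges that are semantically meaningless (e.g., weight with height).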

4.2 Marginal Value Vibration

The main idea of the marginal value vibration method (MVV for short) is to camouflage a table by adding vibrations to some marginal values before releasing them, i.e., by increasing or decreasing some marginal values. This idea can be depicted by the following optimization:

$$\delta^2 = \min\left\{ \sum_{i=1}^{m} \delta_{i+}^2 + \sum_{j=1}^{n} \delta_{+j}^2 \right\}$$

subject to

$$P(r_{ij}) \le 0.5 \quad\text{and}\quad \sum_{i=1}^{m} \left(x_{i+} + \delta_{i+}\right) = \sum_{j=1}^{n} \left(x_{+j} + \delta_{+j}\right)$$

where P(r_{ij}) is the probability of a certain type of privacy disclosure for cell value r_{ij}, δ_{i+} is the vibration for marginal value x_{i+}, and δ_{+j} is the vibration for marginal value x_{+j}. We can obtain the optimized vibrations δ_{1+}, δ_{2+}, ..., δ_{m+} and δ_{+1}, δ_{+2}, ..., δ_{+n} by using simulated annealing [12] to solve the above optimization problem.

The MVV method is applicable to circumstances where inaccurate marginal values in a released table are acceptable to the public. For error-free tables, the MVV method amounts to intentionally introducing noise into a table for the purpose of data camouflage, so as to preserve data privacy.
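A rough sketch of how such an optimization could be attacked with simulated annealing (our illustration only: the penalty weights, move sizes, and cooling schedule are arbitrary choices, and `unsafe_prob` is a hypothetical stand-in for the disclosure estimators):

```python
import math, random

def mvv_anneal(row_m, col_m, unsafe_prob, threshold=0.5,
               steps=5000, t0=50.0, seed=0):
    """Find vibrations (d_row, d_col) that minimize the sum of squared
    vibrations, with (a) unsafe_prob(marginal + delta) <= threshold for every
    marginal and (b) equal grand totals, folded in as large penalties."""
    rng = random.Random(seed)
    d_row, d_col = [0.0] * len(row_m), [0.0] * len(col_m)

    def cost(dr, dc):
        c = sum(d * d for d in dr) + sum(d * d for d in dc)
        gap = (sum(x + d for x, d in zip(row_m, dr))
               - sum(x + d for x, d in zip(col_m, dc)))
        c += 1e6 * gap * gap                 # grand totals must match
        for x, d in zip(row_m + col_m, dr + dc):
            if unsafe_prob(x + d) > threshold:
                c += 1e6                     # no unsafe marginal allowed
        return c

    cur = best = cost(d_row, d_col)
    best_state = (d_row[:], d_col[:])
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-9          # linear cooling
        dr, dc = d_row[:], d_col[:]
        vec = dr if rng.random() < 0.5 else dc
        vec[rng.randrange(len(vec))] += rng.gauss(0.0, 1.0)
        cand = cost(dr, dc)
        if cand < cur or rng.random() < math.exp((cur - cand) / t):
            d_row, d_col, cur = dr, dc, cand
            if cur < best:
                best, best_state = cur, (d_row[:], d_col[:])
    return best_state

# Hypothetical usage: vibrate a 2x2 table's marginals so none stays below 9.
d_row, d_col = mvv_anneal([10.0, 10.0], [12.0, 8.0],
                          lambda m: 1.0 if m < 9 else 0.0)
```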

5 Experiments

In what follows, we conduct experiments to verify (1) the estimation of privacy disclosure probability, and (2) the effectiveness of both methods for risk control.

We synthesize a 30×30 contingency table such that (1) the cell values in row 2 and column 10 follow a uniform distribution ranging from 15 to 25, and (2) the cell values in the other rows and columns follow a uniform distribution^6 ranging from 500 to 1500. An example of such a table could be the stock values of certain commercial or industrial entities during different periods of time: the cell values in a row stand for the stock values of one entity over 30 periods of time, whereas the cell values in a column stand for the stock values of all entities in one period of time. Row 2 may stand for the data of a low-stock-value entity, whereas column 10 may stand for a special period of time during which the stock values of all entities are very low for some reason such as an economic crisis.
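The synthetic table described above can be generated as follows (a sketch; the random draws and seed are ours, not the authors' exact data):

```python
import random

def synthesize_table(m=30, n=30, low_row=1, low_col=9, seed=42):
    """m x n contingency table: cells in row `low_row` or column `low_col`
    (0-based; row 2 / column 10 in the paper's 1-based numbering) are drawn
    uniformly from [15, 25], all other cells uniformly from [500, 1500]."""
    rng = random.Random(seed)
    table = [[rng.randint(15, 25) if i == low_row or j == low_col
              else rng.randint(500, 1500)
              for j in range(n)] for i in range(m)]
    row_marginals = [sum(r) for r in table]
    col_marginals = [sum(c) for c in zip(*table)]
    return table, row_marginals, col_marginals

table, row_m, col_m = synthesize_table()
```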

We set the mean and variance of the error of a marginal value to 0.01 and 100 times the marginal value respectively, i.e., μ_{i+} = 0.01 x_{i+}, σ_{i+}^2 = 100 x_{i+} and

6 We assume a uniform distribution for all cell values in our experiments as an example. It does not matter what distribution the cell values follow, because our theoretical analysis shows that the estimation of the disclosure probability depends on the means and variances of the cell value errors rather than on the distribution of the cell values or of the errors.



Table 1. Number n of unsafe cell values

  τ   480  485  490  495  500
  n     0    0    1    1    1

Table 2. Privacy disclosure probability of unsafe cell value r_{2,10}

  τ                   480     485     490     495     500
  P_d^τ(r_{2,10})  0.4901  0.4999  0.5097  0.5195  0.5293

μ_{+j} = 0.01 x_{+j}, σ_{+j}^2 = 100 x_{+j}.^7 All calculations of formulas and optimizations are carried out with Matlab R2008a and Java. The experiments are run on a laptop PC with a Core Duo CPU at 2 GHz clock rate and 2 GB of RAM, running Microsoft Windows XP Professional Version 2002 with SP3.

5.1 Estimation of Privacy Disclosure Probability

This set of experiments examines the probabilities of privacy disclosure. Based on the estimation formulas derived in Section 3, we calculate the disclosure probability for each cell value with Matlab and identify all unsafe cell values, i.e., those with over 0.5 probability for any type of privacy disclosure. We focus on τ = 500, which is the critical value differentiating the cell values in row 2 or column 10 from the other cell values, which are not less than 500. In our experiments, we vary the τ value from 480 to 500. The experimental results show that: (1) there is no unsafe cell value suffering from existence disclosure, because existence disclosure happens for a cell value only if this value is much greater than all others, and since all cell values in the table follow uniform distributions, there is no such exceptionally great cell value; (2) there is no unsafe cell value suffering from τ-upward disclosure, due to the absence of existence disclosure (refer to Definitions 4 and 5); and (3) when τ > 485, there is only one unsafe cell value, r_{2,10}, found with over 0.5 probability of τ-downward disclosure (or, equivalently, τ-approximation disclosure^8), as shown in Tables 1 and 2.

5.2 Control of Privacy Disclosure

After data re-organization by the MVM method or data camouflage by the MVV method, all unsafe cell values are resolved. The experimental results also show that, after grouping and merging marginal values with the MVM method for τ = 500, we get 59 marginal values, among which 58 are the same as

7 The choice of the error variance of a marginal value is reasonable: taking marginal value x_{i+} as an example, σ_{i+} = 10√x_{i+} is less than x_{i+} (since x_{i+} ≥ 29 × 500 + 15 > 100 in our scenario). Thus in the worst case the released marginal value with error is x_{i+} − σ_{i+}, which is still greater than zero, as required for a contingency table.

8 By Corollary 1, the τ-approximation disclosure of a cell value degenerates to its τ-downward disclosure if there is no existence disclosure for this cell value.


Privacy Disclosure Analysis and Control for 2D Contingency Tables 11

[Figure 2 appears here: it plots the privacy disclosure probability P_d(r_{2,10}) against τ, with curves "before MVM or MVV", "after MVM", and "after MVV"; panel (a) compares over 0 ≤ τ ≤ 1000 and panel (b) gives a close-up for 480 ≤ τ ≤ 500.]

Fig. 2. Comparisons of privacy disclosure probability for cell value r_{2,10}

before, and only one is new (merged from two previous marginal values). This is because, in our experiments, the grouping and merging operate on marginal values x_{2+} and x_{h+} where h ≠ 2. Moreover, this optimization can be finished within seconds of running time under the system configuration mentioned earlier in this section.

Figure 2 gives a comparison of the τ-downward disclosure risk for cell value r_{2,10} under different conditions. Figure 2(a) reveals that cell value r_{2,10} is more likely upper-bounded by τ when τ is close to 1000. However, this is not a privacy to preserve, because most cell values range from 500 to 1500, whereas the cell values in row 2 or column 10 are even smaller, ranging from 15 to 25. Moreover, the total vibration comes to δ² = 2048 with |δ|/x_{++} = 5.3 × 10⁻⁵, so that the curve after the MVV operation is very close to the curve before any operation. In other words, while data privacy is preserved by the MVV method, the vibrations are negligible compared with the total sum of all cell values; hence the data quality is guaranteed for practical usage. Figure 2(b) is a close-up view of the comparison for 480 ≤ τ ≤ 500, from which we can see that the disclosure probability is not greater than 0.5 after the operation of either the MVM or the MVV method. This optimization can be finished within minutes of running time under the same system configuration.
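These magnitudes can be sanity-checked arithmetically (our arithmetic; the grand total below is an estimate from the cell-value ranges, not the authors' exact figure): |δ| = √2048 ≈ 45.3, and a 30×30 table with 841 cells near 1000 plus 59 cells (row 2 and column 10, overlap counted once) near 20 has a grand total around 842,000, which indeed gives a relative vibration of about 5 × 10⁻⁵.

```python
import math

delta = math.sqrt(2048.0)        # total vibration magnitude, ~45.25

# Rough grand total of the synthetic table: 841 "normal" cells with mean 1000
# plus 59 small cells with mean 20 (hypothetical midpoint estimate).
x_total = 841 * 1000 + 59 * 20   # 842180

rel = delta / x_total
print(f"|delta| = {delta:.2f}, |delta|/x_total = {rel:.1e}")  # ~5.4e-05
```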

5.3 Experimental Conclusion and Finding

In summary, we can draw the following conclusions from our experiments: (1) the estimation of the privacy disclosure probability can identify all unsafe cell values; (2) both methods can effectively resolve unsafe cell values; (3) the MVM method leads to information loss, but it does not decrease the utility of a table in circumstances where the merged marginal values are not of significant interest to users; (4) the MVV method introduces inaccuracy into the released marginal values of a table, but it does not lose any useful information of the table; (5) a table owner can choose between the two methods depending on whether accuracy or completeness of the table is the main concern.



Moreover, an interesting finding from the experiments concerns the limitation of the classical analysis of Fréchet bounds. The classical analysis can tell no more than that any cell value in row 2 or column 10 may suffer from τ-downward disclosure where τ = min{x_{2+}, x_{+10}}. This is even less than what one can infer with just intuition: that cell value x_{2,10} could be unsafe with a potentially higher probability of privacy disclosure than the other cell values in row 2 and column 10, given that either marginal value x_{2+} or x_{+10} is much smaller than the other marginal values. However, our analysis, which is also based on Fréchet bounds, can locate cell value x_{2,10} under a smaller τ value with a quantitative probability of disclosure. Thus we claim that our analysis and experiments shed new light on the classical analysis of Fréchet bounds.

6 Conclusion

We have presented a pioneering study on the estimation of the privacy disclosure probability and on control methods for 2D contingency tables containing inaccurate data. In conclusion, we have made the following contributions. First, we have estimated the probability of each type of privacy disclosure pattern based on the error distribution of the marginal values. Second, we have proposed two methods for the control of privacy disclosure, so as to preserve private information that could otherwise be inferred from released contingency tables. Third, we have conducted experiments which verify the correctness of our theoretical analysis and the effectiveness of both proposed methods for privacy disclosure control. Fourth, our analysis and methods are also applicable to error-free tables.

For further study, a straightforward direction is to investigate disclosure control methods for specific contingency tables such as sparse tables; a further worthwhile endeavor would be a quantitative analysis of the relationship between data utility and privacy preservation in contingency tables.

References

1. Beck, L.L.: A security mechanism for statistical databases. ACM Transactions on Database Systems 5(3), 316–338 (1980)

2. Brodsky, A., Farkas, C., Jajodia, S.: Secure databases: constraints, inference channels, and monitoring disclosures. IEEE Transactions on Knowledge and Data Engineering 12(6), 900–919 (2000)

3. Chen, K., Liu, L.: Privacy preserving data classification with rotation perturbation.

In: Proceedings of the 5th IEEE International Conference on Data Mining (ICDM 2005), Houston, TX, USA, November 27-30, pp. 589–592 (2005)

4. Chowdhury, S.D., Duncan, G.T., Krishnan, R., Roehrig, S.F., Mukherjee, S.: Disclosure detection in multivariate categorical databases: auditing confidentiality protection through two new matrix operators. Management Science 45(12), 1710–1723 (1999)

5. Ciriani, V., De Capitani di Vimercati, S., Foresti, S., Samarati, P.: k-anonymity. In: Secure Data Management in Decentralized Systems, pp. 323–353 (2007)

(23)

6. Cox, L.H.: On properties of multi-dimensional statistical tables. Journal of Statistical Planning and Inference 117(2), 251–273 (2003)

7. Dobra, A., Fienberg, S.E.: Bounds for cell entries in contingency tables induced by fixed marginal totals with applications to disclosure limitation. Statistical Journal of the United Nations Economic Commission for Europe 18(1), 363–371 (2001)

8. Domingo-Ferrer, J., Mateo-Sanz, J.M.: Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering 14(1), 189–201 (2002)

9. Farkas, C., Jajodia, S.: The inference problem: a survey. SIGKDD Explorations 4(2), 6–11 (2002)

10. Huang, Z., Du, W., Chen, B.: Deriving private information from randomized data.

In: Proceedings of the 2005 ACM SIGMOD International Conference on Manage- ment of Data, Baltimore, MD, USA, June 14-16, pp. 37–48 (2005)

11. Kargupta, H., Datta, S., Wang, Q., Sivakumar, K.: On the privacy preserving properties of random data perturbation techniques. In: Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), Melbourne, FL, USA, December 19-22, pp. 99–106 (2003)

12. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: efficient full-domain k-anonymity. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, Baltimore, MD, USA, June 14-16, pp. 49–60 (2005)
13. Li, N., Li, T., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and ℓ-diversity. In: Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE 2007), Istanbul, Turkey, April 15-20, pp. 106–115 (2007)

14. Liu, K., Kargupta, H., Ryan, J.: Random projection-based multiplicative data perturbation for privacy preserving distributed data mining. IEEE Transactions on Knowledge and Data Engineering 18(1), 92–106 (2006)

15. Lu, H., Li, Y.: Disclosure analysis and control in statistical databases. In: Jajodia, S., Lopez, J. (eds.) ESORICS 2008. LNCS, vol. 5283, pp. 146–160. Springer, Heidelberg (2008)

16. Lu, H., Li, Y.: Practical inference control for data cubes. IEEE Transactions on Dependable Secure Computing 5(2), 87–98 (2008)

17. Lu, H., Li, Y., Wu, X.: On the disclosure risk in dynamic two-dimensional contingency tables (extended abstract). In: Proceedings of the 2nd International Conference on Information System Security (ICISS 2006), Kolkata, India, December 17-21, pp. 349–352 (2006)

18. Lui, S.M., Qiu, L.: Individual privacy and organizational privacy in business analytics. In: Proceedings of the 40th Hawaii International Conference on System Sciences (HICSS 2007), Waikoloa, Big Island, Hawaii, USA, January 3-6, p. 216b (2007)

19. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: ℓ-diversity: privacy beyond k-anonymity. In: Proceedings of the 22nd International Conference on Data Engineering (ICDE 2006), Atlanta, GA, USA, April 3-8, pp. 24–35 (2006)
20. Muralidhar, K., Sarathy, R.: A general additive data perturbation method for database security. Management Science 45(10), 1399–1415 (1999)

21. Schlörer, J.: Security of statistical databases: multidimensional transformation. ACM Transactions on Database Systems 6(1), 95–112 (1981)
22. Schlörer, J.: Information loss in partitioned statistical databases. Computer Journal 26(3), 218–223 (1983)

23. Sweeney, L.: Achieving k-anonymity privacy protection using generalization and suppression. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5), 571–588 (2002)



24. Wang, L., Jajodia, S., Wijesekera, D.: Securing OLAP data cubes against privacy breaches. In: Proceedings of the IEEE Symposium on Security and Privacy (S&P 2004), Berkeley, CA, USA, May 9-12, pp. 161–175 (2004)

25. Wang, L., Li, Y., Wijesekera, D., Jajodia, S.: Precisely answering multi-dimensional range queries without privacy breaches. In: Snekkenes, E., Gollmann, D. (eds.) ESORICS 2003. LNCS, vol. 2808, pp. 100–115. Springer, Heidelberg (2003)

A Estimation of Privacy Disclosure Probability

A.1 Estimation of Existence Disclosure Probability

By the definition of existence disclosure (i.e., F(r_{ij}) > 0, see Definition 4) we have

$$P_e(r_{ij}) = P\big(F(r_{ij}) > 0\big) = P\big(\hat{F}(r_{ij}) > 0\big) = P\big(\hat{e}_{ij} + \hat{F}(x_{ij}) > 0\big) = 1 - P\big(\hat{e}_{ij} \le -\hat{F}(x_{ij})\big) = 1 - \Phi\left(\frac{-\hat{F}(x_{ij}) - \hat{\mu}_{ij}}{\hat{\sigma}_{ij}}\right)$$

where Φ(x) is the cumulative distribution function (cdf for short) of the standard normal distribution and μ̂_{ij} and σ̂_{ij} are defined by Equations (3) and (4). This result covers the special case of an error-free table, for which μ̂_{ij} → 0 and σ̂_{ij} → 0⁺ hold, because from F(r_{ij}) > 0, or equivalently F̂(r_{ij}) > 0 (Definition 4), we have

$$\lim_{R} P_e(r_{ij}) = 1 - \lim_{R} \Phi\left(\frac{-\hat{F}(x_{ij}) - \hat{\mu}_{ij}}{\hat{\sigma}_{ij}}\right) = 1 - \Phi(-\infty) = 1 - 0 = 1$$

where R denotes {μ̂_{ij} → 0, σ̂_{ij} → 0⁺}.

A.2 Estimation of τ-Upward Disclosure Probability

Similarly, by the definition of τ-upward disclosure (Definition 5) we have

$$P_u^{\tau}(r_{ij}) = P\big(F(r_{ij}) > \tau\big) = P\big(\hat{F}(r_{ij}) > \tau\big) = P\big(\hat{e}_{ij} + \hat{F}(x_{ij}) > \tau\big) = 1 - P\big(\hat{e}_{ij} \le \tau - \hat{F}(x_{ij})\big) = 1 - \Phi\left(\frac{\tau - \hat{F}(x_{ij}) - \hat{\mu}_{ij}}{\hat{\sigma}_{ij}}\right)$$

where μ̂_{ij} and σ̂_{ij} are defined by Equations (3) and (4). Similar to the previous analysis, for error-free tables, from F(r_{ij}) > τ, or equivalently τ − F̂(x_{ij}) < 0, we have

$$\lim_{R} P_u^{\tau}(r_{ij}) = 1 - \lim_{R} \Phi\left(\frac{\tau - \hat{F}(x_{ij}) - \hat{\mu}_{ij}}{\hat{\sigma}_{ij}}\right) = 1 - \Phi(-\infty) = 1 - 0 = 1$$

where R denotes {μ̂_{ij} → 0, σ̂_{ij} → 0⁺}.


A.3 Estimation of τ-Downward Disclosure Probability

By the definition of τ-downward disclosure (Definition 6) we have

$$P_d^{\tau}(r_{ij}) = P\big(F_u(r_{ij}) < \tau\big) = P\big(\min\{r_{i+}, r_{+j}\} < \tau\big) = P\big(\min\{e_{i+} + x_{i+},\; e_{+j} + x_{+j}\} < \tau\big)$$

The marginal value errors e_{i+} and e_{+j} are not mutually independent because they share the element e_{ij}. Given that the size of the table is big enough (e.g., 30×30) and the contribution of the shared element e_{ij} to either e_{i+} or e_{+j} is quite small (e.g., less than 5%), the above calculation can proceed as follows:

$$P_d^{\tau}(r_{ij}) \doteq 1 - P\big(e_{i+} + x_{i+} \ge \tau\big) \cdot P\big(e_{+j} + x_{+j} \ge \tau\big) = 1 - \big(1 - P(e_{i+} < \tau - x_{i+})\big) \cdot \big(1 - P(e_{+j} < \tau - x_{+j})\big) = \Phi(A) + \Phi(B) - \Phi(A)\cdot\Phi(B) \quad (5)$$

where $A = \frac{\tau - x_{i+} - \mu_{i+}}{\sigma_{i+}}$ and $B = \frac{\tau - x_{+j} - \mu_{+j}}{\sigma_{+j}}$.

Now consider the special case of an error-free table, for which μ_{i+} → 0, μ_{+j} → 0, σ_{i+} → 0⁺, and σ_{+j} → 0⁺ hold. Let R denote {μ_{i+} → 0, σ_{i+} → 0⁺, μ_{+j} → 0, σ_{+j} → 0⁺}. By definition we have F_u(r_{ij}) = min{e_{i+} + x_{i+}, e_{+j} + x_{+j}} < τ, which covers the following two situations:

(1) e_{i+} + x_{i+} < τ and e_{+j} + x_{+j} < τ.

These inequalities are equivalent to x_{i+} < τ and x_{+j} < τ under the assumption of error-free tables (i.e., e_{i+} → 0 and e_{+j} → 0). Therefore, from Equation (5) we have

$$\lim_{R} P_d^{\tau}(r_{ij}) \doteq \Phi(+\infty) + \Phi(+\infty) - \Phi(+\infty)\times\Phi(+\infty) = 1 + 1 - 1\times 1 = 1$$

(2) e_{i+} + x_{i+} < τ and e_{+j} + x_{+j} ≥ τ, or e_{i+} + x_{i+} ≥ τ and e_{+j} + x_{+j} < τ.

Without loss of generality, we consider the first case, which is equivalent to x_{i+} < τ and x_{+j} ≥ τ under the error-free assumption. From Equation (5) we have

$$\lim_{R} P_d^{\tau}(r_{ij}) \doteq \Phi(+\infty) + \Phi\left(\frac{\tau - x_{+j} - \mu_{+j}}{\sigma_{+j}}\right) - \Phi(+\infty)\times\Phi\left(\frac{\tau - x_{+j} - \mu_{+j}}{\sigma_{+j}}\right) = 1$$

since Φ(+∞) = 1 and the remaining terms cancel (1 + c − 1·c = 1 for any c).
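Equation (5) is straightforward to evaluate numerically. The sketch below plugs in hypothetical marginals of 600 with the paper's error model (μ = 0.01x, σ² = 100x), which lands slightly above 0.5 at τ = 500, the same order as Table 2 (the exact value there depends on the actual synthetic marginals, which we do not have):

```python
import math

def phi(x):
    # Standard normal cdf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_downward(tau, x_row, x_col, mu_row, sigma_row, mu_col, sigma_col):
    # Equation (5): Phi(A) + Phi(B) - Phi(A) * Phi(B)
    a = (tau - x_row - mu_row) / sigma_row
    b = (tau - x_col - mu_col) / sigma_col
    return phi(a) + phi(b) - phi(a) * phi(b)

# Hypothetical marginals x_{2+} = x_{+10} = 600, error model mu = 0.01x, sigma^2 = 100x:
x = 600.0
p = p_downward(500.0, x, x, 0.01 * x, math.sqrt(100 * x), 0.01 * x, math.sqrt(100 * x))
```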
