[Figure: The x-axis corresponds to the original age (respectively, the original ZIP code), while the y-axis corresponds to the median of the generalized age (respectively, ZIP) range.]
Thesis Goal
In this thesis, we assume that a data publication strategy is composed of a deterrence strategy and a data manipulation strategy. In particular, we assume that the data to be published is composed of a set of tuples in the form of a relational table.
Problem Statement
Several examples of regulations with explicit identity protection include HIPAA [26] in the United States and the Data Protection Directive [37] in the European Union. In this setting, identity disclosure means that the identity of the subject of a tuple in the published dataset, or the identity of a group of individuals associated with some sensitive information in the dataset, is inadvertently revealed.
Specific Aims
Specific Aim 1. Develop an accurate and efficient model to quantify identity disclosure risk.
Specific Aim 2. Develop methods for evaluating the parameters of the identity disclosure risk model for various adversaries and available external identifiable resources.
Specific Aim 3. Develop algorithms to search for data publishing solutions that balance disclosure risk and data utility.
There are two challenges we must overcome to achieve this goal: 1) formalizing the solution space in a way that allows efficient search algorithms to be built, and 2) developing efficient and scalable dual-objective optimization algorithms (e.g., via heuristic-based search) to find optimal solutions in an extremely large solution space.
Contributions
Therefore, we propose a new approach to investigate the feasibility and effectiveness of this penalty by examining, via linear regression, how the value of the data changes over time. Because this value cannot be measured directly, in this thesis we use a proxy in the form of the impact of the publications that depend on the data.
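A minimal sketch of such a model (the column names `impact` and `period` and the numbers below are illustrative assumptions, not the thesis's data) fits ordinary least squares to publication impact as a function of the time elapsed since the data release:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: each row is a publication that used the released dataset.
# `period` is the number of years between the data release and the publication;
# `impact` is the impact measure (e.g., journal impact factor) of the publication.
pubs = pd.DataFrame({
    "period": [0, 1, 1, 2, 3, 3, 4, 5],
    "impact": [6.1, 5.8, 5.2, 4.9, 5.0, 4.4, 4.6, 4.1],
})

# Ordinary least squares: impact ~ period.
# A negative, significant slope would suggest the value of the data decays over time.
model = smf.ols("impact ~ period", data=pubs).fit()
print(model.params)    # intercept and slope
print(model.pvalues)   # significance of the slope
```

A slope statistically indistinguishable from zero would instead suggest that impact and time-to-publication are uncorrelated.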
Dissertation Outline
Additionally, we demonstrate that our approach consistently discovers frontier policies that offer more utility and less risk than a commonly accepted health data de-identification policy (HIPAA Safe Harbor). The problem of how to mitigate identity disclosure while keeping de-identified published datasets useful for secondary purposes is an essential part of the more general challenge of privacy-preserving data publishing.
Privacy in Data Publishing
We then examine the areas of computational disclosure control and statistical disclosure risk assessment, with a particular focus on the issue of identity disclosure. In this thesis, we concentrate on the identity disclosure problem because it is the primary focus of current regulation.
Computational Disclosure Control
These models also often assume that the de-identified dataset and the external source are drawn from the same population. Thus, a join of the de-identified and external datasets on the quasi-identifier attributes can map a record in the de-identified dataset to a set of individual identities.
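As a hedged illustration (the attribute names and values are hypothetical, not from the thesis), such a linkage can be expressed as a join on the quasi-identifier attributes:

```python
import pandas as pd

# De-identified dataset: quasi-identifiers plus a sensitive attribute.
deidentified = pd.DataFrame({
    "age": [34, 34, 51],
    "zip": ["37203", "37203", "37212"],
    "diagnosis": ["flu", "asthma", "diabetes"],
})

# External identified dataset (e.g., a voter registration list).
external = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [34, 34, 51],
    "zip": ["37203", "37203", "37212"],
})

# Joining on the quasi-identifiers maps each de-identified record
# to the set of candidate identities sharing its quasi-identifier values.
linked = deidentified.merge(external, on=["age", "zip"])
print(linked)
```

Here the two records with (34, 37203) each link to two candidate identities, while the (51, 37212) record links to exactly one, illustrating why small equivalence groups carry higher disclosure risk.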
Methods for Data Protection
In situations in which the population is extremely large compared to the size of the de-identified dataset, the additional distortion is very high. Thus, each specific value of each attribute is generalized in the same way within each cluster of the dataset.
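To make the notion of generalization concrete, here is a minimal sketch (the cut-offs and formats are illustrative assumptions, not the thesis's actual hierarchies) of generalizing age into 10-year ranges and ZIP codes into 3-digit prefixes:

```python
def generalize_age(age: int) -> str:
    """Map an exact age to a 10-year range, e.g. 34 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def generalize_zip(zip_code: str) -> str:
    """Truncate a 5-digit ZIP to its 3-digit prefix, e.g. '37203' -> '372**'."""
    return zip_code[:3] + "**"

record = {"age": 34, "zip": "37203"}
print(generalize_age(record["age"]), generalize_zip(record["zip"]))
# -> 30-39 372**
```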
Identity Disclosure Control Using Risk Management
Given the definition of k-anonymity, 1/k can be considered a disclosure risk limit in a specific situation, in which the adversary has access to an external dataset covering exactly the same population as the de-identified dataset, and the amount of generalization is a special case of a data utility metric. For example, the adversary in [27, 70] is formalized as an opponent of the data publisher in a Stackelberg game.
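As a small worked sketch (illustrative records, not thesis data), the per-record re-identification risk under this assumption is the reciprocal of the record's equivalence-group size:

```python
from collections import Counter

# Each tuple is the quasi-identifier part of a record: (generalized age, generalized ZIP).
records = [("30-39", "372**"), ("30-39", "372**"), ("30-39", "372**"),
           ("50-59", "372**"), ("50-59", "372**")]

group_sizes = Counter(records)

# Assuming the external dataset covers exactly the same population,
# the risk for a record is 1 / (size of its equivalence group).
for qi, size in group_sizes.items():
    print(qi, "group size:", size, "risk:", 1 / size)

# The dataset satisfies k-anonymity for k = the smallest group size,
# so 1/k bounds the per-record disclosure risk.
print("k =", min(group_sizes.values()))
```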
Adversarial Modeling and MDPs
For example, if the adversary's action is to access an external dataset, the reward is the negative of the cost of that dataset. In particular, we propose to use a factored MDP (FMDP) to represent the adversary's decision-making process.
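As a hedged sketch of this modeling style (the states, actions, costs, and payoff below are illustrative assumptions, not the thesis's actual parameterization), the adversary's problem can be written as a small MDP and solved by value iteration:

```python
# Minimal MDP sketch: the adversary decides whether to buy an external
# dataset and attempt re-identification, or to stop.
GAMMA = 1.0  # no discounting over this short horizon

# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "start": {
        "buy_external": [(1.0, "has_data", -50.0)],  # reward = -cost of the dataset
        "stop":         [(1.0, "done", 0.0)],
    },
    "has_data": {
        # The attack succeeds with some probability and yields a payoff.
        "attack": [(0.2, "done", 300.0), (0.8, "done", 0.0)],
        "stop":   [(1.0, "done", 0.0)],
    },
    "done": {},  # terminal state
}

# Value iteration over this tiny acyclic state space.
values = {s: 0.0 for s in transitions}
for _ in range(10):
    for s, actions in transitions.items():
        if actions:
            values[s] = max(
                sum(p * (r + GAMMA * values[ns]) for p, ns, r in outcomes)
                for outcomes in actions.values()
            )
print(values)  # a positive value at "start" means the attack is worth mounting
```

In this toy instance the expected payoff of attacking (60) exceeds the dataset cost (50), so a rational adversary would buy the external data.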
Game Model vs Multi-objective Optimization
In general, the solution to a game is a strategy that optimizes the objective of the game. The risk-utility (R-U) approach is particularly applicable when the data publisher can make a decision based on how utility increases as risk increases over an acceptable range.
The Economics of Identity Disclosure Attack
In addition, the value of private information is strongly influenced by factors that should not affect decision-making. Therefore, a possible way to assess the value of personal data is to measure the profit generated by a dynamic pricing strategy based on consumers' private data.
Sampling and Prior Probability
Economic approaches have been proposed to measure the cost-to-break (CTB) of a system, such as offering a reward for the first exploit of a vulnerability in the system and using the lower bound of the reward as the CTB [112].
Introduction
Re-identification Risk Quantification Framework
The adversary is modeled as a rational agent that computes an optimal policy; i.e., an optimal action to choose in each state of the FMDP. In Figure 3.2, we show the general architecture of the re-identification risk quantification framework.
Re-Identification as an FMDP
Xg (integer): the size of the equivalence group of the target record r in the external dataset. Xr (integer): the remaining number of unexplored individuals in the equivalence group. In this case, the number of remaining candidates in the equivalence group is reduced (x'[Xr] = x[Xr] − 1).
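A minimal sketch of this factored state representation (the variable names follow the text; the transition shown is only the "candidate ruled out" case described above):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class FmdpState:
    """Factored state: each field is one state variable of the FMDP."""
    Xg: int  # size of the target record's equivalence group in the external dataset
    Xr: int  # remaining number of unexplored individuals in the equivalence group

def reject_candidate(x: FmdpState) -> FmdpState:
    """Transition for verifying one candidate and ruling it out:
    the remaining candidate count decreases, x'[Xr] = x[Xr] - 1."""
    return replace(x, Xr=x.Xr - 1)

state = FmdpState(Xg=5, Xr=5)
print(reject_candidate(state))  # FmdpState(Xg=5, Xr=4)
```

Factoring the state into such variables keeps the transition model compact: each action only touches the variables it affects, rather than an enumerated flat state space.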
Algorithms
The bottom level of the Two-Level LP algorithm solves an LP and stores the value of the state x_start for each sink cluster. This happens when there is an overlap in the adversary's belief about the probability interval of the equivalence group size Xg.
Experiments
It should be noted that the total population size of NC according to the census is 9,553,967. In this case, assume that the adversary knows only the total size of the external dataset, n, and the probability density of the target record in the population, i.e., the total probability of the target's quasi-identifier values, P(r[QI]).
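Under these assumptions, a natural prior over the equivalence group size is binomial (a sketch of the standard argument, with illustrative numbers rather than the thesis's actual values): each of the n external records independently matches the target's quasi-identifier values with probability P(r[QI]).

```python
from scipy.stats import binom

n = 100_000   # assumed size of the external dataset
p = 0.00005   # assumed P(r[QI]): probability of the target's quasi-identifier values

# Prior over the equivalence group size Xg ~ Binomial(n, p).
for k in range(6):
    print(f"P(Xg = {k}) = {binom.pmf(k, n, p):.4f}")

# Expected group size is n * p; small values mean the target is highly identifiable.
print("E[Xg] =", n * p)
```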
Results
Similar to Finding 2, if the group size is [...]. It also assumes that the adversary has access to only one external resource to perform an attack; removing any of these assumptions would increase the complexity of the adversary's decision problem.

We consider the impact of a data user's publications as a measure of the value of the data obtained by the data user. The results strongly suggest that the impact of a publication and how soon it emerges after the release of the data may not be correlated. We fit linear regression models to the impact factor as a function of the length of time between the data being made available and the date the publication appeared, for sets of publications under a range of constraints.

For example, the DUC of the GAIN: International Multi-Center ADHD Genetics study states that if the user violates the provisions in the DUC, the DAC may revoke the user's access to all NIH genomic datasets. We refer to the suspension of a user's access to the data for a period as temporal punishment.

To mitigate the bias of any single journal metric, in this thesis we use both the JCR journal impact factor and the eigenfactor score of each publication, which represent the importance and the influence, respectively, of the publications in a journal. We use the release date and embargo date of the first version and the first participant set (.v1.p1). Based on how the JCR journal impact factor is calculated, the impact factors of the two years following a paper's publication year are the ones based on citations to papers published in that year; we therefore use the average of the JCR journal impact factors for the two years following the year in which the article was published. The sample sizes of the first and second datasets are 752 and 566, respectively (the sets overlap). The individual- or group-level variance contributes variance to the response variable that is independent of the random error. We also removed publications that appeared before the embargo release date of the dbGaP study. The summary statistics of the cluster sizes by dbGaP study are shown in Tables 4.5 and 4.6. The p-value is 0.190 for the dataset consisting of both primary and secondary publications, and 0.495 for the dataset consisting only of secondary publications. Model 2: jes ∼ period.

Fourth, we chose to use the embargo release date of the first version and participant set of each dbGaP study as the date the study is made available to all users, regardless of the particular version and participant set used. Consequently, our analysis may be biased by assuming that the dbGaP study data is made publicly available at the time its first version was released from embargo, rather than at the embargo release time of the specific version and participant set.

Based on the sizes of the populations in these different groups, a record transformed by a frontier policy has a slightly higher risk than its 10-anonymous counterpart, while Safe Harbor has the highest risk. On the other hand, a record in a dataset transformed by a frontier policy has less information loss than its Safe Harbor and 10-anonymous counterparts.

To characterize the transformation, consider the number of values in the domain of a quasi-identifier attribute. The original domain is represented by a bit string of 1s, while a 0 bit indicates that a partition demarcation has been removed to extend an interval (i.e., the values have been generalized). If the frontier does not contain policies that dominate α, then α is inserted into the frontier.
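A minimal sketch of this bit-string encoding (the domain and values are illustrative, not the thesis's actual attributes): each 1 keeps a partition boundary of the attribute's domain, and each 0 merges adjacent values into one generalized interval.

```python
def intervals_from_bits(values, bits):
    """Decode a generalization policy for one quasi-identifier attribute.

    `values` is the ordered domain of the attribute; bits[i] == 1 keeps the
    boundary after values[i], while 0 merges values[i] and values[i+1]
    into the same generalized interval. len(bits) == len(values) - 1.
    """
    intervals, current = [], [values[0]]
    for value, keep_boundary in zip(values[1:], bits):
        if keep_boundary:
            intervals.append(current)
            current = [value]
        else:
            current.append(value)
    intervals.append(current)
    return intervals

ages = [0, 10, 20, 30, 40]
print(intervals_from_bits(ages, [1, 1, 1, 1]))  # all 1s: the original, ungeneralized domain
print(intervals_from_bits(ages, [1, 0, 0, 1]))  # -> [[0], [10, 20, 30], [40]]
```

Under this encoding, the policy of all 1s is the original domain, and flipping bits to 0 monotonically coarsens the attribute, which is what makes the policy space a lattice.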
If we draw a line parallel to the y-axis at each frontier point in R-U space, the non-dominated region consists of the resulting rectangles. Although any of the policies in the sublattice can improve the current frontier, they can dominate one another; in contrast, a maximal chain is a maximal set of policies in the sublattice that are guaranteed to be on the new frontier. The disclosure risk for the entire generalized dataset is equal to the sum of the risks of the individual records. Next, we evaluated the effect of the sublattice heuristic (i.e., the area below the frontier) on the search. For the set of lattices in each group, we computed the mean and confidence interval of the ratio of the number of policies that update the frontier to the total number of policies. The comparison of the ranges of the k-anonymity frontier and the SHS frontier is summarized in Table 5.3; for a fair comparison, we truncate the SHS frontier to the range of the corresponding k-anonymity frontier.
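As a hedged sketch of the frontier bookkeeping (illustrative numbers; this is the generic non-dominated filter, not the thesis's exact search algorithm), a policy with risk r and utility u stays on the frontier only if no other policy is at least as good on both objectives and strictly better on one:

```python
def dominates(a, b):
    """Policy a = (risk, utility) dominates b if it is no worse on both
    objectives and strictly better on at least one."""
    return a[0] <= b[0] and a[1] >= b[1] and a != b

def frontier(policies):
    """Return the non-dominated (risk, utility) points, sorted by risk."""
    kept = [p for p in policies if not any(dominates(q, p) for q in policies)]
    return sorted(set(kept))

candidates = [(0.10, 0.40), (0.20, 0.70), (0.20, 0.55), (0.35, 0.90), (0.50, 0.85)]
print(frontier(candidates))
# -> [(0.1, 0.4), (0.2, 0.7), (0.35, 0.9)]
```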
Discussion and Conclusions
Introduction
Preliminaries
Journal Impact Factor and Eigenfactor Score
Methods
Materials
Data Imputation
Regression Analysis
Results
Discussion and Conclusions
Introduction
The Policy Space
Search Algorithms
Random Chain
Sublattice Heuristic Search
Experiments Setup
Performance Evaluation Results
Empirical Analysis Results
Discussion and Conclusions