
Dynamic de-identification policies for pandemic data sharing

Academic year: 2023

Figure captions (plots not reproduced):

  • Generalization policies with PK5, PK11, and PK20 upper bounds (each calculated as the upper bound of the 95% quantile range of 1,000 framework simulations) less than or equal to 0.01 at varying thresholds for the number of disease cases.
  • AMOC curves for Perry County, TN, detecting at least one of the simulated disparity features (left).
  • Policy with an upper bound on marketer risk (calculated as the upper bound of the 95% quantile range) (top).

INTRODUCTION

MOTIVATION

These requirements drive the need for methods that forecast how surveillance data will accumulate, so that data sharing policies can be designed to preserve patient privacy. Beyond preserving patient privacy, a data sharing policy must also support public health research. To support data-driven responses to current and future pandemics, this thesis aims to develop a de-identification method that preserves patient privacy while supporting public health research through near-real-time data sharing.

I address objective 2 in Chapter 4, in which I evaluate how well data shared under the dynamic policy approach supports the detection of disproportionately elevated infection rates within a specific subpopulation, where detection implies that the data sharing policy preserves the evidence of underlying disparate trends. The dynamic policy approach can also be reused to meet emerging data sharing needs where data records are constantly accumulating, such as for vaccine registries45,46.

THESIS STRUCTURE

I address objective 1 in Chapter 3 and in Appendix 2, where I present an approach to adaptively create policies for the public sharing of patient-level de-identified epidemiological data. The approach relies on predicting the longitudinal privacy risk from sharing a surveillance data set at different levels of demographic granularity. Such risk assessments enable the selection of a preemptive generalization policy, allowing the data sharer to de-identify new disease case records and update the surveillance dataset in near real time.
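
As a rough illustration of this selection step (not the thesis's implementation), the following Python sketch picks the most granular candidate policy whose predicted risk upper bound stays at or below the threshold; `estimate_risk_upper_bound` is a hypothetical stand-in for the simulation framework:

```python
# A rough illustration of the selection step (not the thesis's implementation):
# pick the most granular candidate policy whose predicted privacy risk upper
# bound stays at or below the threshold for the expected number of new cases.
from typing import Callable, Sequence

def select_policy(candidates: Sequence[str],
                  expected_new_cases: int,
                  estimate_risk_upper_bound: Callable[[str, int], float],
                  threshold: float = 0.01) -> str:
    """`candidates` is ordered from most to least granular generalization policy."""
    for policy in candidates:
        # `estimate_risk_upper_bound` stands in for the simulation framework's
        # predicted upper bound of the risk under this policy.
        if estimate_risk_upper_bound(policy, expected_new_cases) <= threshold:
            return policy
    return candidates[-1]  # fall back to the coarsest policy (or suppress)
```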

I apply a burst detection algorithm to measure the accuracy and timeliness with which such disparities can be detected from simulated data de-identified under several data sharing policies: three variations of the dynamic policy approach and two policies derived from publicly available COVID-19 datasets. Dynamic adaptation of data sharing policies can support the data-driven response to a pandemic by regularly releasing data with epidemiologically critical characteristics in a timely and privacy-preserving way, while preserving evidence of disparate trends.
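
The specific burst detection algorithm is not reproduced in this summary; as a generic stand-in, the sketch below flags time points whose case count exceeds a trailing baseline by a chosen number of standard deviations:

```python
# A generic stand-in for a burst detection step: flag time points whose case
# count exceeds the trailing-window mean by n_sigmas standard deviations.
import numpy as np

def detect_bursts(counts: np.ndarray, window: int = 28, n_sigmas: float = 3.0) -> list:
    """Return indices of time points flagged as bursts against a trailing baseline."""
    flagged = []
    for t in range(window, len(counts)):
        baseline = counts[t - window:t]
        if counts[t] > baseline.mean() + n_sigmas * baseline.std():
            flagged.append(t)
    return flagged
```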

RELATED WORK

  • PRIVACY LEGISLATION IN THE UNITED STATES
  • DE-IDENTIFICATION MODELS
  • PRIVACY VS. UTILITY
  • COVID-19 DISPARITIES
  • OUTBREAK DETECTION ALGORITHMS

Here, each record's probability of being re-identified is one over the size of the equivalence class in the population. Although differential privacy addresses the weakness of k-anonymity with respect to sensitive disclosures, it may not meet the de-identification standard of the HIPAA Privacy Rule in every situation65. Minimizing the generalization of the current version of a data set may limit the data sharer's ability to share updated information in the future.
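
As a minimal Python sketch of the re-identification probability just described, assuming the population census is available as a pandas DataFrame and using illustrative column names:

```python
# Minimal sketch: each shared record's re-identification probability is one over
# the size of its demographic equivalence class in the population. Column names
# are illustrative, not taken from the thesis.
import pandas as pd

QUASI_IDENTIFIERS = ["age_group", "sex", "race", "ethnicity"]  # hypothetical schema

def reidentification_probabilities(shared: pd.DataFrame,
                                   population: pd.DataFrame) -> pd.Series:
    """Return, per shared record, 1 / (population equivalence class size)."""
    class_sizes = population.groupby(QUASI_IDENTIFIERS).size().rename("class_size")
    merged = shared.merge(class_sizes.reset_index(), on=QUASI_IDENTIFIERS, how="left")
    # Groups absent from the population data are conservatively treated as unique.
    return 1.0 / merged["class_size"].fillna(1)
```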

The disparities evolved over the course of the pandemic; however, McLaren was unable to identify the source of the transient effects. Around the turn of the century, government agencies and researchers became increasingly interested in developing methods to detect bioterrorist attacks.

Table 2.1. Suppressed attributes for the Limited Data Set and Safe Harbor standards 47,50

DYNAMICALLY ADJUSTING CASE REPORTING POLICY TO MAXIMIZE PRIVACY AND PUBLIC HEALTH

  • INTRODUCTION
  • METHODS
    • Privacy risk estimation framework
    • Dynamic policy search
    • Dynamic policy evaluation
    • Case studies
    • Code
  • RESULTS
    • Dynamic policy search
    • Dynamic policy evaluation
    • Case Study: Davidson County, TN
    • Case Study: Perry County, TN
  • DISCUSSION
  • CONCLUSION
  • AVAILABILITY OF DATA AND MATERIAL

In this paper, we search for policies that satisfy a PK11 threshold of 0.01; i.e., the percentage of records falling into a demographic group of size 10 or smaller must be less than or equal to 1%. The accompanying figure shows PK11 from sharing the actual case records under the two policy sequences detailed in the middle plot.
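
A hedged sketch of this criterion, assuming shared records live in a pandas DataFrame with one column per quasi-identifier; `pk_risk` and the simulation inputs are illustrative names, and the "upper bound of the 95% quantile range" is taken as the 97.5th percentile of the simulated risks:

```python
# Hedged sketch of the PK11 criterion: PK_K is the proportion of shared records
# in a demographic group of size smaller than K; a policy passes when the upper
# bound of the 95% quantile range of PK_K over simulated datasets is <= 0.01.
import numpy as np
import pandas as pd

def pk_risk(shared: pd.DataFrame, quasi_identifiers: list, k: int = 11) -> float:
    """Proportion of records falling into a demographic group of size < k."""
    group_sizes = shared.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return float((group_sizes < k).mean())

def policy_meets_threshold(simulated_datasets: list,
                           quasi_identifiers: list,
                           k: int = 11, threshold: float = 0.01) -> bool:
    """Check the 97.5th-percentile PK_K over simulated datasets against the threshold."""
    risks = [pk_risk(df, quasi_identifiers, k) for df in simulated_datasets]
    return float(np.quantile(risks, 0.975)) <= threshold
```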

First, the dynamic prediction-based approach did not always meet the privacy risk threshold in the PK risk-based scenario. Still, the framework's policy search and policy selection approach depend on many customizable parameters (e.g., the number of simulations performed, the expected number of new disease cases, the specific bins randomly selected to simulate new cases, the size of the quantile range used for confidence that a policy will meet a certain risk threshold), which can be adjusted to mitigate the need for suppression.

Figure 3.1. Privacy risk estimation framework. The curved rectangles represent processes, the cylinders represent data, and the hexagons represent user-defined parameters

SUPPORTING COVID-19 DISPARITY INVESTIGATIONS WITH DYNAMICALLY ADJUSTING CASE REPORTING POLICIES

  • INTRODUCTION
  • METHODS
    • Data sharing policies and assumptions
    • Simulating surveillance data
    • Disparity detection
    • Experimental design
    • Code availability
  • RESULTS
    • Broad Experiment
    • Fairness Experiment
  • DISCUSSION AND CONCLUSIONS

The PK11 risk measures the privacy risk against an adversary who knows that an individual is in the dataset and a subset of the individual's quasi-identifying information. To simulate a disparity in the specified subpopulation, we first calculate the standard deviation of the subpopulation's baseline infection rate during the disparity period (Equation 3). The first experiment, which we call the broad experiment, evaluates how well each of the de-identification policies enables disparity detection at different significance thresholds.
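
A hedged sketch of this disparity injection step, assuming the subpopulation's infection rate is available as a daily array; the shift size (a multiple of the baseline standard deviation) is an illustrative assumption:

```python
# Hedged sketch of the disparity injection step: elevate the subpopulation's
# baseline infection rate during the disparity period by a multiple of its
# baseline standard deviation. The shift size is an illustrative assumption.
import numpy as np

def inject_disparity(rates: np.ndarray, start: int, end: int,
                     n_sigmas: float = 2.0) -> np.ndarray:
    """Return a copy of `rates` with [start, end) elevated by n_sigmas * baseline std."""
    elevated = rates.copy()
    sigma = rates[start:end].std()  # std of the baseline rate in the disparity period
    elevated[start:end] = elevated[start:end] + n_sigmas * sigma
    return elevated
```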

Proportion of detected disparities for Davidson County, TN, where at least one of the simulated disparity features (left) and both features (right) are detected. However, the k-anonymous policy detects both demographic features only 20% of the time at the 0.1 significance level, and the Marginal Counts policy's lack of joint statistics across demographic features prevents detection of either feature altogether. Proportion of detected disparities for Perry County, TN, where at least one of the simulated disparity features (left) and both features (right) are detected.

In Perry County, MAP detects one of the disparity features (female or 30-39 years old) almost as often as the raw data. The k-anonymous policy detects one of the features more often than RAP at statistical significance thresholds of 0.1 and 0.05, and less often at the other thresholds. None of the de-identification policies, nor the raw data, allowed for the detection of both disparity features in Perry County.

AMOC curves for Davidson County, TN, for detection of at least one of the simulated disparity features (left) and both features (right). AMOC curves for Perry County, TN, for detection of at least one of the simulated disparity features (left) and both features (right). The McNemar tests indicate insufficient evidence to conclude that the proportion of disparities detected under the RAP and MAP policies differs from that of the raw data.
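
The McNemar comparison can be sketched as follows, assuming one boolean detection outcome per simulation run for the raw data and for a policy; this uses statsmodels' mcnemar with illustrative variable names:

```python
# Sketch of the McNemar comparison: one boolean detection outcome per simulation
# run for the raw data and for a policy; tests whether detection proportions differ.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_detection(detected_raw: np.ndarray, detected_policy: np.ndarray) -> float:
    """Return the McNemar p-value for paired detection outcomes (boolean arrays)."""
    table = [
        [int(np.sum(detected_raw & detected_policy)),  int(np.sum(detected_raw & ~detected_policy))],
        [int(np.sum(~detected_raw & detected_policy)), int(np.sum(~detected_raw & ~detected_policy))],
    ]
    return float(mcnemar(table, exact=True).pvalue)
```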

In terms of supporting similar levels of detection across racial groups in Davidson County, SAP performs most consistently, with a standard deviation of 0.1 in the percentage of disparities detected across racial groups. Of the policies that generally detect disparities, MAP produces results most similar to the raw data, with a p-value of 0.666, while the k-anonymous policy is the fairest.
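
The fairness measure quoted here, the standard deviation of the percentage of disparities detected across racial groups, translates directly into code:

```python
# The fairness measure above: the standard deviation of per-group disparity
# detection rates (lower = more uniform performance across groups).
import numpy as np

def fairness_spread(detection_rate_by_group: dict) -> float:
    """detection_rate_by_group maps each racial group to its detection rate."""
    return float(np.std(list(detection_rate_by_group.values())))
```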

Table 4.1. County demographics

SUMMARY

  • DISCUSSION
  • LIMITATIONS AND FUTURE DIRECTIONS
  • CONCLUSION
  • ACKNOWLEDGEMENTS

This is because SAP, having prioritized racial granularity over age granularity, shares less detailed information overall to mitigate the privacy risk of sharing data with a stronger adversary. In this thesis, I evaluate several dynamic policies, each designed to meet a privacy risk threshold against adversaries with different types of knowledge. This investigation shows how the flexibility of the privacy risk assessment framework can inform different approaches to dynamic policy adjustment.

Furthermore, the results highlight the importance of adversarial models in the development and selection of data sharing policies. First, the dynamic prediction-based approach did not always meet the privacy risk threshold in the SAP, PK risk-based scenario. Even then, when the number of cases is overestimated (leading the framework to select a more granular policy than the actual case counts can support), the privacy risk does not always dramatically exceed the threshold.

Moreover, when policies are adjusted according to the actual case counts, the privacy risk never crosses the threshold. Third, the privacy risk estimation framework depends on random sampling methods that may not realistically simulate the pandemic spread of disease. Fifth, the utility evaluation in Chapter 4 measures the ability to detect a disparity without quantifying how accurately the disparity is represented under the data sharing policy.

Future work should consider more complex disparities and quantify how well data sharing policies preserve their features. The difference in performance between Davidson and Perry Counties suggests that none of the five data sharing policies provides uniform disparity detection performance across counties. I show that prediction-driven de-identification provides better privacy protection than static application of a data sharing policy.

MY ROLE IN MANUSCRIPT DEVELOPMENT

SUPPLEMENTARY INFORMATION FOR CHAPTER 3

The result is the proportion of the records shared in the lag period that fall into a demographic group of size smaller than K. The data generalization policy can then be selected according to the expected number of new cases, based on the results of the policy search. The PK risk considers the most unique records in the dataset, while the marketer risk measures the average uniqueness of each record in the context of the surrounding population.

The first three steps of the marketer risk estimation algorithm are identical to the first three steps of the PK risk estimation algorithm. The result is the expected proportion of records in the shared dataset that are correctly matched to records in the identified dataset. The worst-case time complexity of the marketer risk algorithm follows that of the PK risk algorithm until the marketer risk calculation in step 9.
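
A hedged sketch of the marketer risk under its usual formulation: each shared record in demographic group j is matched correctly with probability 1/F_j, where F_j is the group's size in the identified population, giving a risk of (1/n)·Σ_j f_j/F_j. Column names and the handling of unmatched groups are simplifying assumptions:

```python
# Hedged sketch of the marketer risk: the expected proportion of shared records
# correctly matched to the identified population, (1/n) * sum_j f_j / F_j.
# Groups missing from the population table are dropped here, a simplification.
import pandas as pd

def marketer_risk(shared: pd.DataFrame, population: pd.DataFrame,
                  quasi_identifiers: list) -> float:
    f = shared.groupby(quasi_identifiers).size().rename("f")       # sample group sizes
    F = population.groupby(quasi_identifiers).size().rename("F")   # population group sizes
    joined = pd.concat([f, F], axis=1, join="inner")               # groups present in both
    return float((joined["f"] / joined["F"]).sum() / len(shared))
```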

Although the proportion remains constant, the number of individuals at risk increases with the size of the dataset. The orange dotted line shows the marketer risk when the size of the shared dataset equals the size of the population. We repeat the evaluation from the main text, this time for the results of the marketer risk-based policy search.

Again, we consider a daily and a weekly release schedule in the context of the COVID-19 pandemic. The weekly release schedule policy is chosen according to the size of the cumulative dataset at the end of each week (Saturday). The mean and 95% quantile interval of the marketer risk remain below the 0.01 threshold at each time point.

The expected value and 95% quantile range of the marketer risk remain below the 0.01 threshold throughout the time interval. Cumulative case counts reported in Davidson County, according to Johns Hopkins COVID-19 tracking data.

Figure E1. PK risk estimation algorithm.

FIGURES AND TABLES

Table 2.1. Suppressed attributes for the Limited Data Set and Safe Harbor standards 47,50
Figure 3.1. Privacy risk estimation framework. The curved rectangles represent processes, the cylinders represent data, and the hexagons represent user-defined parameters
Table 3.1. The quasi-identifiers considered in this study. The middle column describes the generalization strategy for each quasi-identifier
Figure 3.2. The generalization hierarchies for age, race, sex, and ethnicity used in this paper, adapted from those of Wan et al 53
