
privacy leaks and efficient countermeasures for human


Academic year: 2023


Full text

For each individual (x-axis) from the eMERGE study, the attack score (y-axis) according to the inference attack method is quantified, and the difference in the score distribution reveals participation status in the GWAS. Figure IV.9: Effect of the number of steps in the Taylor series (i.e., k in Equation IV.11) on (a) computational accuracy and (b) running time efficiency.

Privacy Concerns in Data Sharing

For example, a well-known study showed that a man's identity (mainly his surname) could be recovered by profiling the short tandem repeats on his Y chromosome (Y-STRs) and querying public genealogy databases on the Internet, even if his identity was never originally tied to the DNA sequence [59]. Furthermore, in 2012, it was shown that releasing even basic summary information (i.e., association effect sizes or their directions) from genome-wide association studies (GWAS) could lead to study-related (or disease-related) privacy leaks [77].

Outlines for This Dissertation

Later, we note that machine learning based on cryptography is still extremely slow for large-scale tasks. First, in Section V.1, we present the design of a new privacy-preserving technology for regularized logistic regression, using distributed machine learning and cryptography.

Genetic Datasets

  • eMERGE hypothyroidism study
  • PAGE obesity study
  • EAGLE diabetes study
  • Other Public Genetic Data

From this dataset, we mainly used summary statistics associated with the meta-analysis, so individual-level records are not included. This study focuses on a binary phenotype (obese or not), includes 14,998 participants, and spans multiple ethnic groups (e.g., non-Hispanic whites, non-Hispanic blacks, Mexican-Americans, and others).

Experimental Evaluation and Reproducibility

Building Blocks for Privacy Protection

Secure Multiparty Computation (SMC)

  • Yao’s Garbled Circuit
  • Additive and Linear Secret-Sharing Schemes

In our case, the secret may be raw genetic or health-related data, or summary statistics at the institutional level. The secret is divided in such a way that it allows: 1) certain mathematical operations to be performed on the shares, and 2) the secret (either the original or a later derived value) to be reconstructed.
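The two properties above can be illustrated with a minimal additive secret-sharing sketch. This is an illustration under stated assumptions, not the dissertation's implementation: the modulus `PRIME`, the function names, and the example site totals are all hypothetical.

```python
import secrets

# Assumed modulus for illustration; any prime larger than the shared values works.
PRIME = 2**61 - 1

def share(secret, n_parties):
    """Split `secret` into n additive shares that sum to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % PRIME
    return shares + [last]

def reconstruct(shares):
    """Recover the secret (original or derived) by summing all shares."""
    return sum(shares) % PRIME

def add_shares(a, b):
    """Property 1: each party adds its own two shares locally, with no communication."""
    return [(x + y) % PRIME for x, y in zip(a, b)]

# Two hypothetical site-level counts, each split across three computing centers.
site_a = share(120, 3)
site_b = share(45, 3)
total = reconstruct(add_shares(site_a, site_b))  # property 2: reconstruct the derived sum
```

Each individual share is uniformly random on its own, so a center holding one share learns nothing about the underlying count.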

Paillier Additively Homomorphic Encryption

These pairs, each representing a part of the secret, are then distributed among several computing centers (i.e., each participant receives only one part of the secret). With this mechanism, we can claim that the secret is successfully protected because no small subset of centers (and, in the special case, no single center) can infer anything about the polynomial or the embedded secret.

Differential Privacy

  • Output Perturbation

Differential privacy essentially implies that even if a strong adversary knows the entire dataset D except for the target record (individual), they still cannot derive much information about the target from the function output. A popular way to achieve differential privacy is output perturbation, which adds calibrated artificial noise to the exact output of the function.
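As a hedged illustration of output perturbation, the sketch below adds Laplace noise to a counting query. The Laplace mechanism is one standard choice and is an assumption here, since the excerpt does not say which noise distribution the dissertation uses. The function names and the sensitivity of 1 (valid for a count) are also illustrative.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def perturbed_count(records, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy (Laplace mechanism)."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
# Hypothetical case/control labels; a smaller epsilon means more noise.
labels = [1, 0, 1, 1, 0, 0, 1]
noisy = perturbed_count(labels, lambda r: r == 1, epsilon=0.5, rng=rng)
```

The noise scale is sensitivity divided by epsilon, so tightening the privacy budget directly widens the noise added to the exact output.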

Introduction

GWAS associations) outside their respective institutions, leading to the unexpected disclosure of sensitive information of individual participants. We validate our new pipeline with synthetic and real-world studies from several large consortia.

Privacy Inference Attacks and Cryptographic Protection

Privacy Inference Attacks

  • Study Participation Status Inference from Allele Fre-
  • Inference of Exact Traits and Study Participation Status

To illustrate this, for each target individual, we measure the deviations of its genome (denoted by $Y_i$) from two different allele frequency references: the study mixture (the EAF shared during QC; denoted Std) and a public reference panel (e.g., the 1000 Genomes Project [29] or HapMap [54]; denoted Ref):

$$D(Y_{i,j}) = |Y_{i,j} - \mathrm{Ref}_j| - |Y_{i,j} - \mathrm{Std}_j| \qquad \text{(III.1)}$$

where $\mathrm{Ref}_j$ and $\mathrm{Std}_j$ denote allele frequencies for SNP j from the public reference and study mixture, respectively.
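A small sketch of the per-SNP deviation in Equation III.1 follows. Two things are assumptions made for illustration and are not stated in this excerpt: genotypes are encoded as allele fractions (0, 0.5, or 1), and the per-SNP deviations are summed into a single score per individual. The frequency values are made up.

```python
def deviation(y, ref, std):
    """Per-SNP deviation D(Y_ij) = |Y_ij - Ref_j| - |Y_ij - Std_j| (Equation III.1)."""
    return abs(y - ref) - abs(y - std)

def attack_score(genotype, ref_freqs, std_freqs):
    """Assumed aggregation: sum the per-SNP deviations for one individual."""
    return sum(deviation(y, r, s)
               for y, r, s in zip(genotype, ref_freqs, std_freqs))

# Hypothetical allele frequencies for four SNPs.
ref = [0.30, 0.50, 0.10, 0.70]   # public reference panel
std = [0.40, 0.45, 0.20, 0.60]   # study mixture (EAF from QC)
target = [0.5, 0.0, 0.5, 0.5]    # genotype encoded as allele fraction

score = attack_score(target, ref, std)
```

A positive score means the target's genome is closer to the study mixture than to the public reference, which is what an attacker would read as evidence of participation.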

Major QC procedures

  • Site-specific QC

Each site performs its own GWAS, and the result files are submitted to the Coordinating Center for QC and meta-analysis. The goal of site-specific QC is primarily to perform a series of local inspections for various quality issues.

Figure III.1: Meta-analysis QC pipeline.

2.3.1.1 Privacy Analysis

2.3.1.2 Our Protection

SE-N plot

Suppose we denote the sample variance of the beta estimate obtained by linear regression for a specific SNP j as SE_j, the variance of the phenotype as Var(Y), and the sample size as N.

2.3.2.1 Privacy Analysis

2.3.2.2 Our Protection

The P-Z plot

2.3.3.1 Privacy Analysis

2.3.3.2 Our Protection

Effect allele frequency (EAF) plot

2.3.4.1 Privacy Analysis

2.3.4.2 Our Protection

The lambda-N plot

2.3.5.1 Privacy Analysis

2.3.5.2 Our Protection

Heterogeneity Tests

The I² statistic [68] indicates the fraction of total variation in the estimate that is due to between-study heterogeneity.
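For reference, a plaintext sketch of the I² computation is given below, using the usual definition based on Cochran's Q with inverse-variance weights. The function names and the example effect sizes are illustrative assumptions, and this is not the secure version discussed in this chapter.

```python
def cochran_q(betas, ses):
    """Cochran's Q: weighted squared deviations from the pooled effect."""
    weights = [1.0 / se ** 2 for se in ses]          # inverse-variance weights
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    return sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))

def i_squared(betas, ses):
    """I^2 = max(0, (Q - df) / Q), with df = number of studies - 1."""
    q = cochran_q(betas, ses)
    df = len(betas) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q)

# Two hypothetical sites reporting very different effect sizes for one SNP.
heterogeneous = i_squared([0.1, 0.9], [0.1, 0.1])
```

Note that both inputs (per-site effect sizes and standard errors) are exactly the site-level summaries whose disclosure the privacy analysis is concerned with.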

2.3.6.1 Privacy Analysis

2.3.6.2 Our Protection

Experimental Design and Results

  • Site-level QC
    • Cross-site QC

We conduct a systematic privacy assessment on various types of summary statistics that are frequently shared outside of their original settings during meta-analysis QC. Currently, site-level QC is often performed by a central organization to check for problems in various site-level submission files (i.e., inputs to meta-analysis), such as formatting errors, missing values, nonsense values, imputation quality issues, and so on.

3.1.1.1 Effect allele frequency (EAF) plot

To demonstrate that disclosing effect allele frequencies for cross-site QC raises privacy concerns, we aim to distinguish between participating (i.e., in the study) and non-participating (i.e., held-out) individuals using EAF summaries (at the site level) and public reference genomes (e.g., the 1000 Genomes Project [29] or the HapMap project [54]). Given the genotypes of each target individual, the privacy-invasion method quantifies a per-person risk score, the distribution of which may differ significantly between study individuals and those held out.

3.1.1.2 P-Z plot

Post-analysis QC

Doing this requires sharing and contrasting summary statistics between sites. For example, several tests of heterogeneity rely on estimates of effect size and variance from GWAS results, which can be used to tease out sensitive properties.

Figure III.5: Inference of dichotomous traits on targeted individuals, using effect size estimates that are: a) as originally disclosed for QC; and b) protected using our proposal

Privacy-enhanced QC

The Center performs cross-site comparisons and post-analysis checks in an encrypted manner without the need to view data content. The gold standards for comparison are often public information or can be easily standardized and distributed to local sites (e.g., for MAF plots, the public HapMap project acts as the public baseline; for P-Z plots, the diagonal line is the gold standard).

Figure III.7: Overview of our privacy-enhanced system. Each study site performs its local QC procedures, and provides encrypted diagnostic summaries to the Center

Accuracy of Secure Heterogeneity Tests

Accuracy of Other Secure Procedures

Below we report our empirical evaluation in terms of result accuracy and system runtime. In particular, heterogeneous SNPs (with I² > 75%) are still identified as heterogeneous by the secure implementation, and normal SNPs are also correctly labeled as normal.

Computation runtime

Discussion

  • Implications for genome research
  • Limitations
  • Conclusion

In this chapter, we have demonstrated important privacy vulnerabilities in disclosing several summary statistics that are routine in QC for GWA meta-analysis. In this chapter, we introduce a novel cryptographic strategy to securely perform meta-analyses of genome-wide association studies (GWAS) in multi-site consortia.

Introduction

To address the privacy concerns related to individual genomic information and site-level summary statistics, we propose a practical protocol to securely conduct meta-analyses of genome-wide association studies (GWAS) in large multi-site consortia (Fig. IV.1). In this article, we demonstrate the design and implementation of our secure meta-analysis protocol (named SecureMA) and provide empirical evaluations using three separate multi-site genetic association studies.

Overview of Proposed Framework

  • Secure Meta-analysis Protocol
  • Setup Step of the Protocol
  • Secure Computation Step of the Protocol

Then the mediator coordinates with one randomly selected data manager to perform a secure division to derive the weighted average, the final operation of the meta-analysis (Fig. IV.1b; details in Section IV.4.2.1). The mediator is then responsible for initiating a final round of collaborative decryption by distributing the encrypted result to a majority of the trusted data managers for partial decryption (Fig. IV.1c).

Figure IV.1: The SecureMA protocol (secure computation step). (a) The process begins when a scientist submits a meta-analysis study inquiry

SecureMA for privacy-preserving meta-analysis

  • Meta-analysis
  • Secure Computation of Meta-analysis

The logarithmic transform, ln x (where x is encrypted), is approximated using secure computation techniques and a Taylor series (Section IV.4.8). Next, secure multiplication-by-constant and subtraction subprotocols (e.g., the MULC and SUB subprotocols in Section IV.4.7) are used to complete the rest of the operations in Equation IV.2, yielding the encryption E(ln Z²).
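To give a feel for the Taylor-series approximation, here is a plaintext sketch only. It assumes the common decomposition x = m · 2^e, so that ln x = e · ln 2 + ln(1 + ε) with ε = m − 1, and it uses `math.frexp` where a secure protocol would obtain the rough power-of-two estimate by other means. The exact form of Equation IV.11 is not shown in this excerpt, so the series below is an assumption.

```python
import math

def ln_taylor(x, k):
    """Approximate ln(x) with a k-term Taylor series after range reduction.

    Plaintext illustration: x = m * 2**e with 0.5 <= m < 1, so eps = m - 1
    lies in [-0.5, 0) and the series for ln(1 + eps) converges quickly.
    """
    if x <= 0:
        raise ValueError("ln is only defined for positive inputs")
    mantissa, exponent = math.frexp(x)
    eps = mantissa - 1.0
    series = sum((-1) ** (i - 1) * eps ** i / i for i in range(1, k + 1))
    return exponent * math.log(2.0) + series

# The approximation error shrinks as the number of series steps k grows.
err_small_k = abs(ln_taylor(7.5, 3) - math.log(7.5))
err_large_k = abs(ln_taylor(7.5, 30) - math.log(7.5))
```

The trade-off mirrors the one described in the sensitivity analysis: more steps improve accuracy, but in a secure setting each extra term costs additional encrypted arithmetic.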

Technical Details and Secure Implementation

  • Cryptographic Key Management and Secure Workflow
  • Meta-analysis and Protocol Participants
    • Meta-analysis of Genome-wide Association Studies
    • Protocol Participants
  • Computational Accuracy in a Controlled Setting
  • Details on Securely Computing Meta-analysis
  • SHARES: Converting Encryptions to Secret Shares
  • Garbled Circuits for Secure Division
  • Secure Arithmetic Operations
  • Secure Logarithmic Transformation
    • Logarithm Phase 1: Rough Estimate via Garbled Circuits

To do this without affecting the final result, we square the aforementioned equation IV.1 for ease of implementation. Further, it can be observed in Equation IV.3 that meta-analysis requires final division of the numerator by the denominator.

Figure IV.2: During the Setup step of the SecureMA protocol, encryption/decryption keys are generated and distributed

Results

  • Study Data
    • The eMERGE hypothyroidism study
    • The PAGE obesity study
    • The EAGLE diabetes study
  • Protection of Sensitive Information
  • Accuracy of GWAS Meta-analysis Results
  • Running time Efficiency
    • Sample size
    • Number of sites
  • Sensitivity Analysis
    • Parameters Influencing Protocol Sensitivity
    • Evaluation of the Scale-up Factor
    • Evaluation of the Maximum Exponent of the Logarithm
    • Evaluation of the Number of Steps in the Taylor Series

Nevertheless, it can be seen from Figure IV.7b that the variation in total running time is relatively small as the scaling factor increases. The total running time is confirmed to change almost linearly as the maximum exponent increases (Figure IV.8b).

Figure IV.5: Protocol accuracy. The correlation plots correspond to: (a) the p-values (secure protocol vs

Discussion

  • Analysis on GWAS Scale
  • Limitations
  • Alternative Methods to Maintain Genomic Privacy
  • Conclusion

To support noisier real-world data, it will be necessary to introduce QC processes for meta-analysis into the protocol [153]. This is a different approach from the traditional formulation based on distributed machine learning that is common in the community.

Figure IV.9: The impact of the number of steps in the Taylor series (i.e., k in Equation IV.11) on (a) computational accuracy and (b) running time efficiency

Safeguarding Regularized Logistic Regression

  • Introduction
    • Contributions
    • Outlines
  • Preliminaries

In this work, we show how to perform regularized logistic regression while preserving data privacy. First, we demonstrate that regularized logistic regression can be efficiently supported without violating privacy.

1.2.1 (Regularized) Logistic Regression

Newton-Raphson Method

This section is organized as follows: in Subsection V.1.2, background information on regularized logistic regression and Newton's method is provided; we then present the method details in Section V.1.3, followed by experimental results in Section V.1.5; we conclude in Section V.1.6. To illustrate the process, we use βold and βnew to denote the β coefficient estimates for the current and next iterations, respectively.

Privacy-preserving Regularized Logistic Regression

  • Hybrid Architecture
  • Newton-Raphson Method for ℓ2-regularized Logistic Re-
  • Distributed Model Estimation
  • Distributed Computation
  • Centralized Aggregation

Here we first demonstrate how the above-mentioned Newton-Raphson method applies to ℓ2-regularized logistic regression. Once the globally aggregated H(.) and g(.) are derived, the computing centers perform the Newton-Raphson update based on the βold estimate and then check for model convergence.
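The update loop described here can be sketched in plaintext as follows. This is a simplified illustration with no encryption or secret sharing: each institution computes its local gradient and Hessian, the center sums them together with the regularization term, and applies βnew = βold − H⁻¹g. All function names, the solver, and the toy data are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_summaries(X, y, beta):
    """Run inside one institution: gradient and Hessian of its log-likelihood."""
    d = len(beta)
    g = [0.0] * d
    H = [[0.0] * d for _ in range(d)]
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        for a in range(d):
            g[a] += (yi - p) * xi[a]
            for b in range(d):
                H[a][b] -= p * (1.0 - p) * xi[a] * xi[b]
    return g, H

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def distributed_newton(sites, lam, d, tol=1e-10, max_iter=50):
    """Center-side loop: aggregate local summaries, then update beta."""
    beta = [0.0] * d
    for _ in range(max_iter):
        # Start from the l2 regularization contribution: -lam*beta and -lam*I.
        g = [-lam * b for b in beta]
        H = [[-lam if a == b else 0.0 for b in range(d)] for a in range(d)]
        for X, y in sites:
            g_j, H_j = local_summaries(X, y, beta)
            g = [u + v for u, v in zip(g, g_j)]
            H = [[u + v for u, v in zip(r1, r2)] for r1, r2 in zip(H, H_j)]
        step = solve(H, g)                      # H^{-1} g
        beta = [b - s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:     # convergence check
            break
    return beta

# Two hypothetical institutions; the first column is the intercept.
site1 = ([[1, 0.5], [1, -1.2], [1, 2.0]], [0, 0, 1])
site2 = ([[1, -0.3], [1, 1.1], [1, -2.0], [1, 0.2]], [0, 1, 0, 1])
beta_hat = distributed_newton([site1, site2], lam=1.0, d=2)
```

Because the gradient and Hessian are sums over records, aggregating per-institution summaries gives the same estimate as pooling the raw data, which is what makes the distributed formulation possible.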

Figure V.1: Overview of our secure framework for regularized logistic regression. Each institution (possessing private data) locally computes summary statistics from its own data, and submits encrypted aggregates following a strong cryptographic scheme [13

Protecting Privacy

  • Privacy on Individual Data
  • Privacy on Aggregate Data
  • Shamir’s Secret-Sharing for Protecting Data
  • Privacy on Computations
  • Generating synthetic data

In our protocol, we leverage Shamir's secret sharing [136] to protect intermediate data (including summary statistics of institutions). To split and share the secret, we evaluate q(x) at several distinct points, yielding coordinate pairs.
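A compact sketch of Shamir's scheme is shown below: the secret is the constant term of a random polynomial q(x), the shares are the coordinate pairs (x, q(x)), and any `threshold` of them recover the secret by Lagrange interpolation at x = 0. The prime modulus and the function names are illustrative assumptions, not the parameters used in the dissertation.

```python
import secrets

PRIME = 2**127 - 1  # assumed field modulus (a Mersenne prime)

def make_shares(secret, threshold, n_shares):
    """Evaluate a random degree-(threshold-1) polynomial q with q(0) = secret."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def q(x):
        acc = 0
        for c in reversed(coeffs):       # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, q(x)) for x in range(1, n_shares + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers q(0), i.e. the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

# Split a hypothetical summary statistic among 5 centers; any 3 can recover it.
shares = make_shares(123456789, threshold=3, n_shares=5)
recovered = reconstruct(shares[:3])
```

Shares of two secrets held at the same x can be added locally to obtain a share of the sum, which is what allows aggregate statistics to be combined without being revealed. The modular inverse via `pow(den, -1, PRIME)` requires Python 3.8 or later.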

Results

  • Evaluation Datasets
  • Regression Result Accuracy
  • Running Time
  • Scalability to Large Studies

Note that the convergence scores for the Parkinsons.Motor and Parkinsons.Total studies almost overlap in the plot due to their high similarity. We report the results in Fig. V.4. We simplified the scenario by assuming that each institution contributes 10,000 records, so our estimate reflects the running time as both the number of institutions and the total number of data records increase.

Figure V.2: Model accuracy of our securely estimated β against the gold standard for four evaluation datasets

Discussion

  • Application Scenarios

Although there has been recent research on privacy-preserving ridge (linear) regression (i.e., with ℓ2 regularization) [120], it focused on a much simpler regression model (i.e., linear regression), and its model estimation process is completely different from that of regularized logistic regression (the focus of our work). The distributed process makes our secure protocol for regularized logistic regression very efficient compared to a simple centralized implementation [41].

1.6.1.1 Genetic and Biomedical Studies

While privacy protection on summary statistics has been investigated for other tasks [159], ours is the first to protect regularized logistic regression with respect to intermediate data.

1.6.1.2 Analytics for Smart Grid

1.6.1.3 Large-scale Network Analysis

Conclusion

PrivLogit: Efficient Privacy-preserving Logistic Regression by Tailoring

  • Introduction

Despite encouraging progress, few proposals have seen wide adoption in the real world for privacy-preserving logistic regression. In our proposal (called PrivLogit), we derive a constant approximation for the second-order curvature information (i.e., Hessian) in the Newton method for logistic regression.
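The idea of replacing the Hessian with a constant can be illustrated on a one-coefficient model. The sketch below assumes the well-known bound p(1 − p) ≤ 1/4, which gives the constant curvature −(1/4)·Σx² − λ; the excerpt does not state the exact approximation PrivLogit uses, so treat this form, the function names, and the toy data as assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(xs, ys, beta, lam):
    """Gradient of the l2-regularized log-likelihood for a single coefficient."""
    return sum((y - sigmoid(beta * x)) * x for x, y in zip(xs, ys)) - lam * beta

def fit(xs, ys, lam, constant_hessian, tol=1e-10, max_iter=10000):
    """Newton-style iterations with either the exact or a constant Hessian."""
    # Constant curvature bound: computed once, never updated.
    h_const = -0.25 * sum(x * x for x in xs) - lam
    beta, iters = 0.0, 0
    while iters < max_iter:
        g = gradient(xs, ys, beta, lam)
        if constant_hessian:
            h = h_const
        else:
            h = -sum(sigmoid(beta * x) * (1.0 - sigmoid(beta * x)) * x * x
                     for x in xs) - lam
        step = g / h
        beta -= step
        iters += 1
        if abs(step) < tol:
            break
    return beta, iters

xs = [0.5, -1.2, 2.0, -0.3, 1.1, -2.0, 0.2]
ys = [0, 0, 1, 0, 1, 0, 1]
beta_newton, iters_newton = fit(xs, ys, lam=1.0, constant_hessian=False)
beta_const, iters_const = fit(xs, ys, lam=1.0, constant_hessian=True)
```

Both variants reach the same optimum. The constant version typically needs more iterations, but the curvature term has to be computed (and, in a secure setting, inverted) only once.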

2.1.0.1 Contributions

Following PrivLogit, we propose and evaluate two highly efficient cryptographic protocols for privacy-preserving distributed logistic regression, viz. PrivLogit-Hessian and PrivLogit-Local. We propose two highly efficient secure protocols (i.e., PrivLogit-Hessian and PrivLogit-Local) for privacy-preserving logistic regression.

2.1.0.2 Outline

Logistic Regression and Newton Method

  • Logistic Regression
  • Distributed Newton Method

The ℓ2-regularized logistic regression introduces an additional regularization term, $-\frac{\lambda}{2}\beta^T\beta$, to the optimization objective during model estimation. The de facto approach for estimating the (regularized) logistic regression coefficient β (Equation V.8) is the Newton method (or iteratively reweighted least squares, known as IRLS) [65].

PrivLogit: A Novel Optimizer Tailored for Fast Logistic Regression

  • PrivLogit for Fast Privacy-preserving Logistic Regression

The (distributed) Newton method is widely implemented in statistical software and also underlies almost all existing privacy-preserving logistic regression solutions. Moreover, its lack of a convergence guarantee is a known issue when a poor initialization (initial guess of the coefficients) is provided [16].

2.3.3.1 Asymmetric Computational Complexity in Se-

Our new PrivLogit optimizer enables some attractive properties that seem very promising for efficient privacy-preserving logistic regression.

2.3.3.2 Constant Hessian

2.3.3.3 Decomposition of Computation

2.3.3.4 Guaranteed Model Quality

2.3.3.5 Guaranteed Model Convergence

Safeguarding PrivLogit

The Center securely aggregates the encrypted Hessians from each organization (adding a regularization term if necessary), yielding the encrypted global Hessian approximation Enc(H̃) (Step 5 in Algorithm 7 and Eq. V.14). Later, the Center must securely invert the Hessian, which is usually achieved by a secure Cholesky decomposition (see Appendix).
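For orientation, the plaintext arithmetic behind a Cholesky-based solve looks like the sketch below. A secure version would carry out the same steps on encrypted or secret-shared values, which is not shown here. Since the Hessian approximation is negative definite, the sketch assumes the factorization is applied to its negation; the function names are illustrative.

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def solve_spd(A, b):
    """Solve A x = b via forward substitution (L y = b), then back substitution (L^T x = y)."""
    L = cholesky(A)
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

# Hypothetical negated Hessian approximation and gradient.
neg_H = [[4.0, 2.0], [2.0, 3.0]]
g = [2.0, 1.0]
step = solve_spd(neg_H, g)
```

With a constant Hessian approximation, the factorization only needs to be done once and can be reused in every iteration, which is where most of the savings come from.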

Figure V.5: Distributed architecture for privacy-preserving logistic regression. Two main types of computations are involved between: 1) local Nodes and the Center; 2) different Servers/authorities at the Center.

2.4.1.1 Secure Cholesky Decomposition

PrivLogit-Local: Further Offsetting Computations to Lo-

For each iteration, local organizations only need to calculate their local gradient and log-likelihood (where j indexes each organization) and securely transmit their encryptions to the Center (Steps 3 to 7). Later, in each iteration, the local organizations derive their local summaries, such as the log-likelihood (Step 5) and the gradient (Step 6).

Theoretical Analysis and Proof

  • Complexity Analysis
  • Security Guarantees
  • Convergence Proof for PrivLogit

Using the negative definiteness of H̃ and the second-order Taylor expansion of l2(β), we obtain the lower bound of the increment of the objective function at each iteration.

Experiments

  • Datasets
  • Model Accuracy
  • Computational Performance

To do this, we examine the model quality (as measured by accuracy of coefficient estimates and subsequent predictions) estimated from PrivLogit-Hessian and PrivLogit-Local. Moreover, it also confirms that the different cryptographic protections underlying PrivLogit-Hessian and PrivLogit-Local have no influence on the model quality.

2.6.3.1 Iterations to convergence

Judging by the number of iterations to convergence, PrivLogit seems "unfavorable" compared to Newton, as PrivLogit often requires a few dozen or more iterations, while the latter seems significantly faster with only a single-digit number of iterations.

Figure V.7: Convergence iterations of PrivLogit and the Newton method baseline on real-world (upper panel) and simulated (lower panel) datasets

2.6.3.2 Convergence runtime

2.6.3.3 Relative speedup

  • Model Convergence Guarantee
  • Related Works
    • Cryptographic Protections on Logistic Regression and
    • Perturbation-based Privacy Protection
    • Improved Numerical Optimization for Regression
  • Discussion
    • Conclusion
  • QuickLogit: A Novel Paradigm for Efficient Privacy-preserving Logistic
    • Introduction
    • Preliminaries
    • QuickLogit: accelerating performance using local models

Privacy-preserving distributed machine learning (or data mining) [4] is a popular research effort to solve this problem, using distributed algorithms and cryptography such as secure multiparty computation (SMC) to support machine learning while protecting privacy. This ubiquitous privacy-preserving workflow for distributed machine learning (and logistic regression) can be summarized as follows:

Figure V.8: Relative speedup of PrivLogit-Hessian and PrivLogit-Local over the secure distributed Newton baseline (the y = 1 line), across various datasets

3.3.2.1 A Geometric Intuition

QuickLogit: A Novel Approach to Privacy-preserving

In general, as demonstrated in Algorithm 10, our QuickLogit protocol follows the above two-phase paradigm, with the first phase a one-shot effort and the second typically an iterative process. The second phase of QuickLogit, as listed in Algorithm 10, aims to refine the model estimate centrally and securely.

Phase 1: Local Models

Phase 2: Global Model Refinement

3.3.5.1 Local-institution Summary Statistics in Newton

Security Guarantees and Information Disclosure

Potential disclosure of information occurs mainly in connection with bridging between different schemes or organisations, including the transition between different schemes, and the different privacy definitions between different parties.

Theoretical Proof

  • Same Theoretical Convergence as Newton
  • Better Practical convergence than Newton
  • Computational complexity

We demonstrate that our carefully chosen initialization has an error bounded to the optimum, so it is more likely to fall close to the optimum than a random initialization. For Newton's method, it is crucial to find an initial value that is close to the optimal solution.
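The role of initialization can be sketched with a one-coefficient model in plaintext: each institution fits a local model, the local estimates are averaged, and the average seeds a global Newton refinement. Simple averaging, the shared regularization weight, the pooled plaintext refinement, and all names below are assumptions for illustration. The actual two-phase protocol is given in Algorithm 10, which is not reproduced in this excerpt.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def newton(xs, ys, lam, beta0, tol=1e-10, max_iter=100):
    """Newton's method for one-coefficient l2-regularized logistic regression."""
    beta = beta0
    for it in range(1, max_iter + 1):
        ps = [sigmoid(beta * x) for x in xs]
        g = sum((y - p) * x for x, y, p in zip(xs, ys, ps)) - lam * beta
        h = -sum(p * (1.0 - p) * x * x for x, p in zip(xs, ps)) - lam
        step = g / h
        beta -= step
        if abs(step) < tol:
            return beta, it
    return beta, max_iter

def two_phase_fit(sites, lam):
    # Phase 1 (one-shot): each institution fits its own local model.
    local_betas = [newton(xs, ys, lam, 0.0)[0] for xs, ys in sites]
    init = sum(local_betas) / len(local_betas)   # simple model averaging
    # Phase 2 (iterative): global refinement, pooled here in plaintext for brevity.
    all_x = [x for xs, _ in sites for x in xs]
    all_y = [y for _, ys in sites for y in ys]
    return newton(all_x, all_y, lam, init)

sites = [([0.5, -1.2, 2.0], [0, 0, 1]),
         ([-0.3, 1.1, -2.0, 0.2], [0, 1, 0, 1])]
beta_two_phase, refine_iters = two_phase_fit(sites, lam=1.0)
```

Because the second phase is still Newton's method on the full objective, the final estimate matches the baseline. The averaged starting point only affects how many secure iterations are needed to get there.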

Table V.5: Computational complexity of secure subprotocols.

Experiments

  • Datasets
  • Runtime Benchmarks

Real-world studies: 1) Adult data for predicting income level (whether it is greater than $50,000 or not) based on various social factors such as demographics. Certain problems are too big to solve securely; therefore, their iteration performance is based on non-secure simulations and marked with brackets in Table V.6.

3.5.2.1 Significantly reduced number of iterations to

Real-world studies: 1) Adult data for predicting income level (whether it is greater than $50,000 or not) from several social factors such as demographics; 2) Loan data from a well-known online lending platform for predicting loan application status based on anonymized personal profiles.

3.5.2.2 Dramatic runtime improvement

Guaranteed Model Accuracy

Many existing privacy-preserving machine learning protocols achieve efficiency gains by compromising model accuracy. To see this, we directly compare the final model accuracy from QuickLogit and that from Newton's baseline, as illustrated in Figure V.13.

3.5.3.1 Simple model averaging has good approxima-

3.5.3.2 Simple averaging alone is not perfect

  • Related Works
  • Discussion
    • Conclusion
  • Privacy issues with QC summary statistics and countermeasures
  • Running time of secure heterogeneity tests
  • The core variables and computations for SecureMA
  • Per-SNP running time for SecureMA and the proportion of the time ded-
  • Computational efficiency on evaluation datasets
  • Notations
  • Model convergence iterations (Iter.) and runtime (in seconds) bench-
  • Main notations
  • Computational complexity of secure subprotocols
  • Runtime benchmark (iteration counts and in seconds)


Figure V.13: Accuracy of model coefficient estimates from our QuickLogit, with Newton as baseline (x-axis)

Figures

Figure I.1: An overview of this dissertation and its chapters. 1. We first propose multiple statistical inference attacks on genomic summary statistics (Chapter III), which provides novel findings as well as serving as motivations for this dissertation; 2
Figure III.1: Meta-analysis QC pipeline.
Table III.1: Privacy issues with QC summary statistics and countermeasures.
Figure III.3: Detection of GWAS participation status on target individuals using QC effect allele frequencies
