III. 3.1.1.2 P-Z plot
III.4 Discussion
Our work has strong relevance to genome research for several reasons, especially given the urgency for both scientific data sharing (either for increased sample size or for reproducible research) and guaranteeing participant data privacy.
III.4.1 Implications for genome research.
With respect to participant privacy risk, we demonstrate that sharing various summary statistics that are routinely available and necessary to QC and meta-analysis tasks (both at the individual and study levels) is indeed equivalent to or only slightly safer than directly disclosing individual genomic information.
Concealing individual-participant data and protecting privacy have always been the primary, if not the sole, motivation for adopting and advocating summary statistics-based approaches to collaborative genome research. A popular and representative example is meta-analysis (and its essential component, QC). Here, instead, we prove that the promised advantage of meta-analysis is not fulfilled: conducting meta-analysis and QC on summary statistics does not offer a significant privacy advantage over directly handling individual-participant genome information. From this perspective, meta-analysis and QC may (and should) be subject to the same regulatory frameworks that govern individual-level genomic information, and thus to the same legal risks.
Our privacy assessment has major implications for genome research, as ever-larger study consortia are being formed nationally and internationally [135, 168]. This will continue to pose significant challenges in mandating universal trust between institutions, guaranteeing sufficient maturity in security and privacy standards among all investigators, and detecting and prosecuting inappropriate disclosure of genomic information.
With respect to privacy protection, we illustrate that technological advancements can help balance participant privacy requirements while supporting scientific workflows, even ones as complex as QC for GWA meta-analysis. A more recent work [71], published a few years after ours, reached similar conclusions by showing that cryptographic methods (using Yao’s garbled circuits [164]) are feasible and relatively efficient for protecting quality control pipelines. Overall, this is a striking message to convey to the genetics community: most existing works on genome privacy only demonstrate the privacy risks without providing proper solutions or advice, leaving scientists with the misconception that privacy is dead [42] and that one must choose between data sharing and privacy.
III.4.2 Limitations.
The inference attack in our privacy assessment focused on a simplified GWAS model that did not account for covariates such as gender, age, and ethnicity. Incorporating such factors into our attack model might boost the success rate of our attacks, but likely only marginally, since the contribution of such covariates to the GWAS regression problem is generally expected to be very small. So in theory, even though most published GWAS and meta-analyses do not report correlation effects for such covariates, our attacks would still succeed.
While we have tried our best to validate our methods and claims empirically and as comprehensively as possible, an even larger-scale evaluation would still be ideal for generalizing our findings and their significance more broadly. Due to resource constraints, our statistical inference attacks were empirically validated using a selected collection of multi-site consortia meta-analysis and GWAS datasets, together with simulated (phenotype) studies. While our methods were designed to be generic and widely applicable, we have yet to benchmark them on a wider variety of genomic and phenotypic datasets.
Our protections and the secure QC pipeline are primarily designed to safeguard intermediate data and analytics throughout the QC process, while incurring minimal changes to the original scientific or administrative workflows. While this makes our secure pipeline more accurate and easier to deploy in real studies, it may sometimes bring about the side effect of privacy leaks from the QC results themselves. For instance, the Q- or I²-statistics themselves may be leveraged for privacy inferences. However, we point out three things. First, there are currently no known studies demonstrating such risks. Second, since better protections for this scenario would most probably require revamping the complete scientific workflow while retaining scientifically critical information and decision making, we leave this as a follow-up discussion for the general scientific community. Finally, our protections could be enhanced by enforcing differential privacy on all revealed QC results, although this would certainly degrade the scientific utility of the results and of QC in general.
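To make the last point concrete, the following is a minimal sketch (not the pipeline described in this chapter) of how a heterogeneity statistic could be computed from per-site summary statistics and then perturbed before release. It uses the standard fixed-effects definitions of Cochran's Q and I² (reported here as a fraction) and the Laplace mechanism. The function names, the example inputs, and in particular the `sensitivity` and `epsilon` parameters are hypothetical: Q has no bounded sensitivity unless its inputs are clipped, so the bound used here is purely an illustrative assumption.

```python
import random


def cochran_q_i2(betas, ses):
    """Return (pooled_beta, Q, I2) from per-site effect sizes and standard errors."""
    weights = [1.0 / (se * se) for se in ses]  # inverse-variance weights
    pooled = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2 as a fraction in [0, 1); negative values are truncated to 0.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i2


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_release(value, sensitivity, epsilon, rng):
    """Laplace mechanism; `sensitivity` is an ASSUMED bound, not a derived one."""
    return value + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    pooled, q, i2 = cochran_q_i2([0.0, 0.2, 0.4], [0.1, 0.1, 0.1])
    print("exact Q:", q, "I2:", i2)
    # Hypothetical privacy parameters, chosen only for illustration.
    print("noisy Q:", noisy_release(q, sensitivity=1.0, epsilon=0.5, rng=rng))
```

With a small `epsilon` the noise scale `sensitivity / epsilon` can easily exceed the magnitude of Q itself, which illustrates the utility loss noted above.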
Due to computational efficiency considerations, we adopted a hybrid computing architecture that leverages safe and faster distributed computing. We point out that in some scenarios, it may still be necessary for the consortium center to enforce central quality control on all steps, including at the file level. As a natural extension, we hope to implement a fully centralized and secure version of the pipeline.
III.4.3 Conclusion.
In this chapter, we demonstrated important privacy vulnerabilities in disclosing various summary statistics that are routine in QC for GWA meta-analysis. We further presented the design and evaluation of our privacy-enhanced QC pipeline, which incorporates novel and practical technical countermeasures. Empirical evaluations on various real studies confirmed the privacy vulnerabilities in the traditional QC workflow. Meanwhile, our secure QC pipeline proved to support QC accurately and efficiently while guaranteeing strong privacy.
We hope that our solution can alleviate concerns over genome privacy and enable a broader scale of collaboration and data sharing in genome research.
CHAPTER IV
SecureMA: Safeguarding Meta-analysis of Genome-wide Association Studies (GWAS)
This chapter is based on our work in [159, 158]. My contributions to this work include conception and design of the study, implementation and experimental evaluation, analysis of results, writing the manuscript, and addressing reviewer comments.
Sharing genomic data is crucial to support scientific investigations such as genome-wide association studies. However, recent investigations suggest that the privacy of the individual participants in these studies can be compromised, leading to serious concerns and consequences, such as overly restricted access to data. In this chapter, we introduce a novel cryptographic strategy to securely perform meta-analysis of genome-wide association studies (GWAS) in multi-site consortia. Our methodology is useful for supporting joint studies among disparate data sites, where privacy or confidentiality is of concern. We validate our method using three multi-site association studies. Our research shows that genetic associations can be analyzed efficiently and accurately across sub-study sites, without leaking information on individual participants or site-level association summaries. In addition to the above methodological improvement, we also release our open-source software, SecureMA, for secure meta-analysis of GWAS at: http://github.com/XieConnect/SecureMA. Our customized secure computation framework is also open-source at: http://github.com/XieConnect/CircuitService.