
1.6.1.3 Large-scale Network Analysis

Many important innovations involve the analysis of social network data [109, 88, 5], including anomaly detection and novel discoveries in online social networks (such as personalization and link prediction). Social network data often contain person-level private information, making them inappropriate to share across institutions in large collaborative studies. Our framework could serve this purpose by enabling joint network analysis without disclosing private information.

V.1.7 Conclusion

In this work, we propose new cryptographic methods for preserving privacy in regularized logistic regression, a widely used statistical model across many domains. To make the model efficient in a secure setting, we adapted a distributed method for model estimation.

To further enhance privacy and prevent inference attacks on intermediate data during model estimation, we introduced strong cryptographic protections. These lead to an efficient framework for supporting regularized logistic regression across different institutions while guaranteeing strong privacy for both individual study participants and institutions.

Extensive empirical evaluations have demonstrated the efficacy of the framework in guaranteeing privacy with modest computational overhead. We hope that careful implementation of our framework will enable a wider range of cross-institution joint analytics that would otherwise be impossible due to privacy or confidentiality concerns.

V.2 PrivLogit: Efficient Privacy-preserving Logistic Regression by Tailoring Numerical Optimizers

This section is based on our work [161]. My contributions to this work include conception, design, and supervision of the study; implementation and experimental evaluation; analysis of results; and writing the manuscript and addressing reviewer comments.

Safeguarding privacy in machine learning is highly desirable, especially in collaborative studies spanning many organizations. Despite their popularity, existing cryptographic solutions for privacy-preserving distributed machine learning incur excess computational overhead, partially because they naively adopt mainstream model estimation algorithms (such as the Newton method) and fail to account for the specific characteristics of secure computing. Here, we present a contrasting perspective on designing numerical optimization methods for cryptographically secure settings. We introduce a seemingly less favorable optimization method that can in fact significantly accelerate privacy-preserving logistic regression. Leveraging this new method, which we call PrivLogit, we propose two new secure protocols for conducting logistic regression in a privacy-preserving and distributed manner. Extensive theoretical and empirical evaluations demonstrate the competitive performance of our two secure proposals while ensuring accuracy and privacy: with speedups of up to 2.3x and 8.1x, respectively, over the state of the art, and even greater gains as data scale up. This drastic improvement makes privacy-preserving logistic regression more scalable and practical for the large-scale studies common in modern science. In addition, our PrivLogit optimizer is agnostic of, and orthogonal to, existing and future performance innovations from cryptography alone, and thus can serve as a drop-in replacement in any privacy-preserving (distributed) logistic regression protocol.

V.2.1 Introduction

Logistic regression is a fundamental statistical model with wide adoption across domains such as computer science and the biomedical and social sciences (e.g., healthcare, genetics, psychology, education). To reach powerful and reliable statistical conclusions, it is increasingly popular for these disciplines to perform collaborative regression through data sharing and joint analysis across a federation of organizations [111]. This trend, however, is often hampered by serious privacy concerns, as the human subject data underlying these studies are typically considered sensitive and are strictly protected by various privacy laws and regulations [123, 72, 32]. Meanwhile, many organizations are also reluctant to reveal their data to external entities (due to concerns about privacy and business secrets), even though they still want to contribute to collaborative studies. This is increasingly common in areas such as healthcare, business, and finance.

More formally, we are interested in the following common scenario: multiple independent organizations (e.g., different institutions or medical centers) want to conduct joint analytics (e.g., logistic regression). Each possesses private data on its respective sub-population (e.g., patient health records or human genomes), but is not willing or permitted to disclose the data beyond its own organization for privacy and proprietary reasons. We focus on the horizontally partitioned setting [4]. In such a collaborative study, potential adversaries include: a distrustful aggregation center (e.g., due to breached servers or malicious employees), distrustful member organizations (due to curiosity about other organizations’ secrets or business competition), and external curious parties or hackers. The adversary’s goal is to learn privacy-sensitive information about individual data records or organizations by peeking into raw and summary-level data. The challenge is how to support such a collaborative study while preserving privacy, especially when it is difficult or economically impractical to find a fully trusted central authority.

Cryptography (secure multi-party computation, or SMC, in particular) and distributed computing are classical and reviving solutions to this challenge [4]. Numerous efforts have attempted to support data mining without disclosing raw and intermediate data [4, 155, 118, 121, 159, 17, 93], known as privacy-preserving distributed data mining. Among these, significant attention has been devoted to logistic regression [154, 155, 41, 118, 93, 7].

Despite encouraging progress, few proposals have seen wide real-world adoption for privacy-preserving logistic regression. A major reason appears to be the excess computational overhead of cryptographic protocols. While secure computation is generally expected to be slower than its non-secure counterpart, we make a surprising observation: much of the computational overhead in fact traces back to sub-optimal technical decisions made by human experts (e.g., the authors of secure protocols) and could have been avoided. For instance, nearly all existing secure protocols [155, 41, 93] directly apply mainstream (distributed) model estimation algorithms (e.g., the popular Newton method for logistic regression [65]), failing to account for the specific characteristics of secure computing and thus missing valuable opportunities for performance improvement.

In this work, we present a contrasting perspective on privacy-preserving logistic regression and propose an improved model estimation method, tailored for secure computing, that significantly accelerates the computation while guaranteeing privacy and accuracy.

In our proposal (termed PrivLogit), we derive a constant approximation of the second-order curvature information (i.e., the Hessian) in the Newton method for logistic regression.
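To sketch the idea: the logistic Hessian depends on the current iterate only through bounded variance terms, which admits a constant curvature bound. The bound shown below is the standard 1/4 bound on the logistic variance; the exact approximation derived in [161] may differ in its details.

```latex
% Newton's method for logistic regression, with p_i = \sigma(x_i^\top \beta):
\[
\nabla \ell(\beta) = X^\top\!\big(\sigma(X\beta) - y\big), \qquad
H(\beta) = X^\top W(\beta)\, X, \quad
W(\beta) = \operatorname{diag}\!\big(p_i(1-p_i)\big).
\]
% Since p_i(1-p_i) \le 1/4 for every p_i \in (0,1), the constant matrix
\[
\tilde{H} = \tfrac{1}{4}\, X^\top X \;\succeq\; H(\beta) \quad \text{for all } \beta
\]
% upper-bounds the true curvature everywhere, so it can be computed
% (and inverted, or securely processed) once and reused across all iterations.
```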

This adapted optimizer may seem counter-intuitive and “unfavorable” due to its slower convergence and increased network interaction, but it surprisingly turns out to be highly competitive in performance.
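A minimal plaintext sketch of a fixed-Hessian Newton iteration of this kind is below (in Python with NumPy). The 1/4 curvature bound and the function name are illustrative assumptions for exposition, not the paper's exact secure protocol; in the secure setting the same arithmetic would be carried out under cryptographic protection.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_fixed_hessian(X, y, iters=300):
    """Logistic regression via Newton steps with a constant Hessian bound.

    Instead of recomputing H(beta) = X^T W(beta) X at every iteration,
    use the constant upper bound H_tilde = X^T X / 4 (valid because
    p*(1-p) <= 1/4), so the expensive matrix inverse is paid only once.
    """
    n, d = X.shape
    H_inv = np.linalg.inv(X.T @ X / 4.0)  # computed once, reused every step
    beta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ beta) - y)  # gradient of the negative log-likelihood
        beta -= H_inv @ grad                  # quasi-Newton step with fixed curvature
    return beta
```

Each iteration now costs only matrix-vector products, at the price of more iterations than exact Newton; this trade-off is what makes the method attractive when per-iteration cryptographic operations dominate.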

Building on PrivLogit, we propose and evaluate two highly efficient cryptographic protocols for privacy-preserving distributed logistic regression: PrivLogit-Hessian and PrivLogit-Local.

V.2.1.0.1 Contributions