
Differentially private high dimensional sparse covariance matrix estimation

Item Type Article

Authors Wang, Di; Xu, Jinhui

Citation Wang, D., & Xu, J. (2021). Differentially private high dimensional sparse covariance matrix estimation. Theoretical Computer Science. doi:10.1016/j.tcs.2021.03.001

Eprint version Post-print

DOI 10.1016/j.tcs.2021.03.001

Publisher Elsevier BV

Journal Theoretical Computer Science

Rights NOTICE: this is the author's version of a work that was accepted for publication in Theoretical Computer Science. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document.

Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Theoretical Computer Science (2021-03-10), DOI: 10.1016/j.tcs.2021.03.001. © 2021. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/

Download date 2023-12-01 06:57:22

Link to Item http://hdl.handle.net/10754/668232

Differentially Private High Dimensional Sparse Covariance Matrix Estimation✩,✩✩

Di Wang^{a,∗}, Jinhui Xu^{b}

^{a} Division of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia

^{b} Department of Computer Science and Engineering, State University of New York at Buffalo, 338 Davis Hall, Buffalo, NY 14260

Abstract

In this paper, we study the problem of estimating the covariance matrix under differential privacy, where the underlying covariance matrix is assumed to be sparse and of high dimension. We propose a new method, called DP-Thresholding, that achieves a non-trivial $\ell_2$-norm based error bound whose dependence on the dimension is only logarithmic instead of polynomial; this is significantly better than the existing bounds, which are obtained by adding noise directly to the empirical covariance matrix. We also extend the $\ell_2$-norm based error bound to a general $\ell_w$-norm based one for any $1 \le w \le \infty$, and show that they share the same upper bound asymptotically. Our approach can be easily extended to local differential privacy. Experiments on synthetic datasets show results that are consistent with the theoretical claims.

Keywords: Differential privacy, sparse covariance estimation, high dimensional statistics

1. Introduction

In recent years, Machine Learning and Statistical Estimation have had a profound impact on many applied domains such as social sciences, genomics, and medicine. A frequently encountered challenge in their applications is how to deal with the high dimensionality of the datasets, especially for those in genomics and in educational and psychological research. A commonly adopted strategy for dealing with this issue is to assume that the underlying structures of the parameters are sparse.

Another often encountered challenge is how to handle sensitive data, such as those in social science, biomedicine, and genomics. A promising approach is to use differentially private mechanisms for the statistical inference and learning tasks.

✩ A preliminary version of this paper appeared in Proceedings of the 53rd Annual Conference on Information Sciences and Systems (CISS 2019).

✩✩ This research was supported in part by the National Science Foundation (NSF) through grants CCF-1422324 and CCF-1716400.

∗ Corresponding author

Differential Privacy (DP) [?] is a widely-accepted criterion that provides provable protection against identification and is resilient to arbitrary auxiliary information that might be available to attackers. Since its introduction over a decade ago, a rich line of work has become available, making differential privacy a compelling privacy-enhancing technology for many organizations, such as Uber [?], Google [?], and Apple [?].

Estimating or studying high dimensional datasets while keeping them (locally) differentially private can be quite challenging for many problems, such as sparse linear regression [?], sparse mean estimation [?], and the selection problem [?]. However, there is also evidence showing that for some problems the loss under privacy constraints can be quite small compared with their non-private counterparts. Examples of this nature include Empirical Risk Minimization under sparsity constraints [? ?], high dimensional sparse PCA [? ? ?], sparse inverse covariance estimation [?], and high-dimensional distribution estimation [?]. Thus, it is desirable to determine which high dimensional problems can be learned or estimated efficiently in a private manner.

In this paper, we aim to give an answer to this question for a simple but fundamental problem in machine learning and statistics, namely, estimating the underlying sparse covariance matrix of a bounded sub-Gaussian distribution. For this problem, we propose a simple but nontrivial $(\epsilon,\delta)$-DP method, DP-Thresholding, and show that the squared $\ell_w$-norm error for any $1\le w\le\infty$ is bounded by $O\big(\frac{s^2\log p}{n}+\frac{s^2\log p\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{\log^2\frac{1}{\delta}}{n^2\epsilon^4}\big)$, where $n$ is the sample size, $p$ is the dimension of the underlying space, and $s$ is the sparsity of each row of the underlying covariance matrix. Moreover, our method can be easily extended to the local differential privacy model with an upper bound of $O\big(\frac{s^2\log p\log\frac{1}{\delta}}{n\epsilon^2}\big)$. Experiments on synthetic datasets confirm the theoretical claims. To the best of our knowledge, this is the first paper studying the problem of estimating a high dimensional sparse covariance matrix under (local) differential privacy.

2. Related Work

Recently, there have been several papers studying private distribution estimation, such as [? ? ? ? ?]. For distribution estimation under the central differential privacy model, [?] considers 1-dimensional private mean estimation of a Gaussian distribution with (un)known variance. The work that is probably most closely related to ours is [?], which studies the problem of privately learning multivariate Gaussian and product distributions. The main differences from our work are the following. Firstly, our goal is to estimate the covariance of a sub-Gaussian distribution. Even though the class of distributions considered in our paper is larger than the one in [?], it carries an additional assumption requiring the $\ell_2$ norm of a sample from the distribution to be bounded by 1, which means that it does not include the general Gaussian distribution. Secondly, although [?] also considers the high dimensional case, it does not assume sparsity of the underlying covariance matrix. Thus, its error bound depends on the dimensionality $p$ polynomially, which is large in the high dimensional case ($p \gg n$), while the dependence in our paper is only logarithmic (i.e., $\log p$). Thirdly, the error in [?] is measured by the total variation distance, while ours is measured by the $\ell_w$-norm; thus, the two results are not directly comparable. Fourthly, it seems difficult to extend the methods of [?] to the local model. Recently, [?] also studied covariance matrix estimation, via iterative eigenvector sampling. However, their method is only for the low dimensional case, and the error is measured with respect to the Frobenius norm.

Distribution estimation under local differential privacy has been studied in [? ?].

However, both of them study only the 1-dimensional Gaussian distribution, which is quite different from the class of distributions considered in our paper.

In this paper, we mainly use the Gaussian mechanism on the covariance matrix, which has been studied in [? ? ?]. However, as will be shown later, simply outputting the perturbed covariance matrix can incur a large error and is thus insufficient for our problem.

Compared to these previous works, the problem in this paper is clearly more complicated, since here we assume the data lie in a high dimensional space where $p \gg n$.

3. Preliminaries

3.1. Differential Privacy

Differential privacy [?] is by now a de facto standard for statistical data privacy, constituting a strong standard of privacy guarantees for algorithms on aggregate databases. DP requires that there be no significant change in the outcome distribution under a single entry change to the dataset. We say that two datasets $D, D'$ are neighbors if they differ by only one entry, denoted as $D \sim D'$.

Definition 1 (Differential Privacy [?]). A randomized algorithm $\mathcal{A}$ is $(\epsilon,\delta)$-differentially private (DP) if for all neighboring datasets $D, D'$ and for all measurable events $S$ in the output space of $\mathcal{A}$, the following holds:

$$\mathbb{P}(\mathcal{A}(D)\in S)\le e^{\epsilon}\,\mathbb{P}(\mathcal{A}(D')\in S)+\delta.$$

When $\delta = 0$, $\mathcal{A}$ is $\epsilon$-differentially private.

We will use the Gaussian mechanism [?] to guarantee $(\epsilon,\delta)$-DP.

Definition 2 (Gaussian Mechanism [?]). Given any function $q: \mathcal{X}^n \to \mathbb{R}^p$, the Gaussian mechanism is defined as

$$\mathcal{M}_G(D, q, \epsilon) = q(D) + Y,$$

where $Y$ is drawn from the Gaussian distribution $\mathcal{N}(0, \sigma^2 I_p)$ with $\sigma \ge \frac{\sqrt{2\log(1.25/\delta)}\,\Delta_2(q)}{\epsilon}$. Here $\Delta_2(q)$ is the $\ell_2$-sensitivity of the function $q$, i.e.,

$$\Delta_2(q)=\sup_{D\sim D'}\|q(D)-q(D')\|_2.$$

The Gaussian mechanism preserves $(\epsilon,\delta)$-differential privacy.
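For concreteness, the following is a minimal Python sketch of the Gaussian mechanism for a vector-valued query; the function `gaussian_mechanism` and the mean-estimation example are our own illustration under the stated boundedness assumption, not code from the paper.

```python
import numpy as np

def gaussian_mechanism(q_value, l2_sensitivity, eps, delta, rng=None):
    """Add Gaussian noise calibrated to the l2-sensitivity of a query.

    Implements Definition 2: sigma >= sqrt(2 log(1.25/delta)) * Delta_2(q) / eps.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / eps
    return q_value + rng.normal(0.0, sigma, size=np.shape(q_value))

# Example: privately release the mean of n records with ||x_i||_2 <= 1;
# replacing one record changes the mean by at most 2/n in l2 norm.
n, p = 1000, 20
x = np.random.default_rng(0).uniform(-1, 1, size=(n, p))
x /= np.maximum(1.0, np.linalg.norm(x, axis=1, keepdims=True))  # enforce ||x_i||_2 <= 1
private_mean = gaussian_mechanism(x.mean(axis=0), l2_sensitivity=2.0 / n, eps=1.0, delta=1e-5)
```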

3.2. Private Sparse Covariance Estimation

Let $x_1, x_2, \cdots, x_n$ be $n$ random samples from a $p$-variate distribution with covariance matrix $\Sigma = (\sigma_{ij})_{1\le i,j\le p}$, where the dimensionality $p$ is assumed to be high, i.e., $p \gg n \ge \mathrm{Poly}(\log p)$.

We define the parameter space of $s$-sparse covariance matrices as follows:

$$\mathcal{G}_0(s) = \{\Sigma = (\sigma_{ij})_{1\le i,j\le p} : \sigma_{-j,j} \text{ is } s\text{-sparse for all } j \in [p]\}, \qquad (1)$$

where $\sigma_{-j,j}$ denotes the $j$-th column of $\Sigma$ with the entry $\sigma_{jj}$ removed. That is, a matrix in $\mathcal{G}_0(s)$ has at most $s$ non-zero off-diagonal elements in each column.

We assume that each $x_i$ is sampled from a zero-mean sub-Gaussian distribution with parameter $\sigma^2$, that is,

$$\mathbb{E}[x_i]=0,\qquad \mathbb{P}\{|v^{T}x_i|>t\}\le e^{-\frac{t^2}{2\sigma^2}},\quad \forall t>0 \text{ and } \|v\|_2=1. \qquad (2)$$

This means that all one-dimensional marginals of $x_i$ have sub-Gaussian tails. We also assume that, with probability 1, $\|x_i\|_2\le 1$. We note that such assumptions are quite common in the differential privacy literature, such as [?].

Let $\mathcal{F}_p(\sigma^2, s)$ denote the set of distributions of $x_i$ satisfying all the above conditions (i.e., (2) and $\|x_i\|_2\le 1$) and with covariance matrix $\Sigma\in\mathcal{G}_0(s)$. The goal of private covariance estimation is to obtain an estimator $\Sigma^{\mathrm{priv}}$ of the underlying covariance matrix $\Sigma$ based on $\{x_1,\cdots,x_n\}\sim P\in\mathcal{F}_p(\sigma^2,s)$ while preserving its privacy. In this paper, we focus on $(\epsilon,\delta)$-differential privacy. We use the $\ell_2$ norm to measure the difference between $\Sigma^{\mathrm{priv}}$ and $\Sigma$, i.e., $\|\Sigma^{\mathrm{priv}}-\Sigma\|_2$.

Lemma 1 ([?]). Let $\{x_1,\cdots,x_n\}$ be $n$ random variables sampled from a Gaussian distribution $\mathcal{N}(0,\sigma^2)$. Then

$$\mathbb{E}\max_{1\le i\le n}|x_i|\le\sigma\sqrt{2\log 2n}, \qquad (3)$$

$$\mathbb{P}\Big\{\max_{1\le i\le n}|x_i|\ge t\Big\}\le 2n\,e^{-\frac{t^2}{2\sigma^2}}. \qquad (4)$$

In particular, if $n=1$, we have $\mathbb{P}\{|x_i|\ge t\}\le 2e^{-\frac{t^2}{2\sigma^2}}$.

Lemma 2 ([?]). If $\{x_1,x_2,\cdots,x_n\}$ are sampled from a sub-Gaussian distribution as in (2) and $\Sigma^*=(\sigma^*_{ij})_{1\le i,j\le p}=\frac{1}{n}\sum_{i=1}^{n}x_ix_i^{T}$ is the empirical covariance matrix, then there exist constants $C_1$ and $\gamma>0$ such that for all $i,j\in[p]$,

$$\mathbb{P}(|\sigma^*_{ij}-\sigma_{ij}|>t)\le C_1 e^{-\frac{nt^2}{8\gamma^2}} \qquad (5)$$

for all $|t|\le\xi$, where $C_1$, $\xi$ and $\gamma$ are constants depending only on $\sigma^2$. Specifically,

$$\mathbb{P}\Big\{|\sigma^*_{ij}-\sigma_{ij}|>\gamma\sqrt{\frac{\log p}{n}}\Big\}\le C_1 p^{-8}. \qquad (6)$$

Notation. All constants and big-$O$ notation throughout the paper omit factors that are polynomial in $\sigma^2$, the sub-Gaussian parameter. Many previous papers assume the sub-Gaussian parameter to be a constant, such as [? ?].

4. Method

4.1. A First Approach

A direct way to obtain a private estimator is to perturb the empirical covariance matrix by a symmetric Gaussian matrix, which has been used in previous work on private PCA, such as [? ?]. However, as we can see below, this method introduces a large error.

By [?], for any given $0<\epsilon,\delta\le 1$ and $\{x_1,x_2,\cdots,x_n\}\sim P\in\mathcal{F}_p(\sigma^2,s)$, the following perturbation procedure is $(\epsilon,\delta)$-differentially private:

$$\tilde{\Sigma}=\Sigma^*+N=(\tilde{\sigma}_{ij})_{1\le i,j\le p}=\frac{1}{n}\sum_{i=1}^{n}x_ix_i^{T}+N, \qquad (7)$$

where $N$ is a symmetric matrix whose upper triangle (including the diagonal) consists of i.i.d. samples from $\mathcal{N}(0,\sigma_1^2)$ with $\sigma_1^2=\frac{2\log(1.25/\delta)}{n^2\epsilon^2}$, and each lower triangle entry is copied from its upper triangle counterpart. By Corollary 2.3.6 of [?], we know that $\|N\|_2\le O(\sqrt{p}\,\sigma_1)=O\Big(\frac{\sqrt{p\log\frac{1}{\delta}}}{n\epsilon}\Big)$ with high probability. We can then easily get that, with high probability (i.e., with probability at least $1-\frac{1}{p^{c}}$ for some $c>0$),

$$\|\tilde{\Sigma}-\Sigma\|_2\le\|\Sigma^*-\Sigma\|_2+\|N\|_2\le O\Big(\frac{\sqrt{p\log\frac{1}{\delta}}}{n\epsilon}\Big), \qquad (8)$$

where the second inequality is due to a theorem in Chapter 1.6.3 of [?]. However, we can see that the upper bound of the error in (8) is quite large in the high dimensional case.

Another issue with the private estimator in (7) is that it is not clear whether it is positive semi-definite, a property that is normally expected of an estimator.
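For illustration, here is a short Python sketch of this first approach (our own illustrative code, not from the paper): it builds the empirical covariance and adds a symmetric Gaussian matrix with the noise scale $\sigma_1$ specified in (7).

```python
import numpy as np

def perturbed_covariance(x, eps, delta, rng=None):
    """Naive (eps, delta)-DP covariance release via symmetric Gaussian noise, Eq. (7).

    Assumes each row x[i] satisfies ||x[i]||_2 <= 1, matching the paper's setup.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = x.shape
    sigma_star = (x.T @ x) / n                 # empirical covariance Sigma*
    sigma1 = np.sqrt(2.0 * np.log(1.25 / delta)) / (n * eps)
    upper = np.triu(rng.normal(0.0, sigma1, size=(p, p)))  # upper triangle incl. diagonal
    noise = upper + np.triu(upper, 1).T        # mirror to make the matrix symmetric
    return sigma_star + noise
```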

4.2. Post-processing via Thresholding

We note that one reason the private estimator $\tilde{\Sigma}$ in (7) fails is that some entries of the error matrix are quite large, which makes $|\tilde{\sigma}_{ij}-\sigma_{ij}|$ large for some $i,j$. More precisely, by (6) and (4) we can get the following: with probability at least $1-Cp^{-6}$, for all $1\le i,j\le p$,

$$|\tilde{\sigma}_{ij}-\sigma_{ij}|\le\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}=O\Big(\gamma\sqrt{\frac{\log p}{n}}+\frac{\sqrt{\log p\log\frac{1}{\delta}}}{n\epsilon}\Big). \qquad (9)$$

For brevity, we denote the quantity in the middle of (9) by $\tau$, i.e., $\tau=\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}$.

Thus, to reduce the error, a natural approach is the following. For those $\sigma_{ij}$ with larger values, we keep the corresponding $\tilde{\sigma}_{ij}$, so that their difference stays below the threshold. For those $\sigma_{ij}$ with values small compared with (9), the corresponding $\tilde{\sigma}_{ij}$ may still be large, so by thresholding $\tilde{\sigma}_{ij}$ to 0 we can lower the error $|\tilde{\sigma}_{ij}-\sigma_{ij}|$.

Following this idea and the thresholding methods in [?] and [?], we propose the DP-Thresholding method, which post-processes the perturbed covariance matrix in (7) with the threshold $\tau$. After thresholding, we further threshold the eigenvalues of $\hat{\Sigma}$ in order to make it positive semi-definite. See Algorithm 1 for details.

Algorithm 1 DP-Thresholding

Input: $\{x_1,x_2,\cdots,x_n\}\sim P\in\mathcal{F}_p(\sigma^2,s)$, and $\epsilon,\delta\in(0,1)$.

1: Compute

$$\tilde{\Sigma}=(\tilde{\sigma}_{ij})_{1\le i,j\le p}=\frac{1}{n}\sum_{i=1}^{n}x_ix_i^{T}+N,$$

where $N$ is a symmetric matrix whose upper triangle (including the diagonal) consists of i.i.d. samples from $\mathcal{N}(0,\sigma_1^2)$ with $\sigma_1^2=\frac{2\log(1.25/\delta)}{n^2\epsilon^2}$, and each lower triangle entry is copied from its upper triangle counterpart.

2: Define the thresholding estimator $\hat{\Sigma}=(\hat{\sigma}_{ij})_{1\le i,j\le p}$ as

$$\hat{\sigma}_{ij}=\tilde{\sigma}_{ij}\cdot I\Big[|\tilde{\sigma}_{ij}|>\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}\Big]. \qquad (10)$$

3: Let the eigen-decomposition of $\hat{\Sigma}$ be $\hat{\Sigma}=\sum_{i=1}^{p}\lambda_i v_i v_i^{T}$. Let $\lambda_i^{+}=\max\{\lambda_i,0\}$ be the positive part of $\lambda_i$, and define $\Sigma^{+}=\sum_{i=1}^{p}\lambda_i^{+}v_iv_i^{T}$.

4: return $\Sigma^{+}$.
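To make the procedure concrete, here is a compact, self-contained Python sketch of Algorithm 1 under the stated assumptions ($\|x_i\|_2\le 1$); the sub-Gaussian constant $\gamma$ from Lemma 2 is treated as a user-supplied parameter, since in practice it is not known exactly.

```python
import numpy as np

def dp_thresholding(x, eps, delta, gamma=1.0, rng=None):
    """Sketch of Algorithm 1 (DP-Thresholding).

    x: (n, p) array with ||x[i]||_2 <= 1; gamma: assumed sub-Gaussian constant.
    Returns a PSD, (eps, delta)-DP estimate of the covariance matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = x.shape

    # Step 1: perturb the empirical covariance with a symmetric Gaussian matrix.
    sigma1 = np.sqrt(2.0 * np.log(1.25 / delta)) / (n * eps)
    upper = np.triu(rng.normal(0.0, sigma1, size=(p, p)))
    sigma_tilde = (x.T @ x) / n + upper + np.triu(upper, 1).T

    # Step 2: entrywise hard thresholding at the level tau of Eq. (10).
    tau = (gamma * np.sqrt(np.log(p) / n)
           + 4.0 * np.sqrt(2.0 * np.log(1.25 / delta)) * np.sqrt(np.log(p)) / (n * eps))
    sigma_hat = np.where(np.abs(sigma_tilde) > tau, sigma_tilde, 0.0)

    # Step 3: project onto the PSD cone by zeroing out negative eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(sigma_hat)
    return (eigvecs * np.maximum(eigvals, 0.0)) @ eigvecs.T
```

Since both the entrywise thresholding and the eigenvalue truncation only post-process the already-private $\tilde{\Sigma}$, the returned matrix satisfies the same $(\epsilon,\delta)$-DP guarantee (cf. Theorem 1).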

Theorem 1. For any $0<\epsilon,\delta\le 1$, Algorithm 1 is $(\epsilon,\delta)$-differentially private.

Proof. By Section 3 of [?], we know that Step 1 makes the matrix $(\epsilon,\delta)$-differentially private. Thus, Algorithm 1 is $(\epsilon,\delta)$-differentially private due to the post-processing property of differential privacy [?].

For the matrix $\hat{\Sigma}$ in (10) after the first step of thresholding, we have the following key lemma.

Lemma 3. For every fixed $1\le i,j\le p$, there exists a constant $C>0$ such that with probability at least $1-Cp^{-9/2}$, the following holds:

$$|\hat{\sigma}_{ij}-\sigma_{ij}|\le 4\min\{|\sigma_{ij}|,\ \tau\}, \qquad (11)$$

where $\tau=\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}$ is the threshold in (10).

Proof of Lemma 3. Let $\Sigma^*=(\sigma^*_{ij})_{1\le i,j\le p}$ and $N=(n_{ij})_{1\le i,j\le p}$. Define the event $A_{ij}=\{|\tilde{\sigma}_{ij}|>\tau\}$. We have:

$$|\hat{\sigma}_{ij}-\sigma_{ij}|=|\sigma_{ij}|\cdot I(A^{c}_{ij})+|\tilde{\sigma}_{ij}-\sigma_{ij}|\cdot I(A_{ij}). \qquad (12)$$

By the triangle inequality, it is easy to see that

$$A_{ij}=\big\{|\tilde{\sigma}_{ij}-\sigma_{ij}+\sigma_{ij}|>\tau\big\}\subset\big\{|\tilde{\sigma}_{ij}-\sigma_{ij}|>\tau-|\sigma_{ij}|\big\}$$

and

$$A^{c}_{ij}=\big\{|\tilde{\sigma}_{ij}-\sigma_{ij}+\sigma_{ij}|\le\tau\big\}\subset\big\{|\tilde{\sigma}_{ij}-\sigma_{ij}|\ge|\sigma_{ij}|-\tau\big\}.$$

Depending on the value of $\sigma_{ij}$, we have the following three cases.

Case 1. $|\sigma_{ij}|\le\frac{\gamma}{4}\sqrt{\frac{\log p}{n}}+\frac{\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}=\frac{\tau}{4}$. In this case, we have

$$\mathbb{P}(A_{ij})\le\mathbb{P}\Big(|\tilde{\sigma}_{ij}-\sigma_{ij}|>\frac{3\gamma}{4}\sqrt{\frac{\log p}{n}}+\frac{3\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}\Big)\le C_1p^{-9/2}+2p^{-9/2}. \qquad (13)$$

This is due to the following:

$$\mathbb{P}\Big(|\tilde{\sigma}_{ij}-\sigma_{ij}|>\frac{3\gamma}{4}\sqrt{\frac{\log p}{n}}+\frac{3\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}\Big) \qquad (14)$$
$$\le\mathbb{P}\Big(|\sigma^{*}_{ij}-\sigma_{ij}|>\frac{3\gamma}{4}\sqrt{\frac{\log p}{n}}+\frac{3\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}-|n_{ij}|\Big) \qquad (15)$$
$$=\mathbb{P}\Big(B_{ij}\cap\Big\{\frac{3\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}-|n_{ij}|>0\Big\}\Big) \qquad (16)$$
$$\quad+\mathbb{P}\Big(B_{ij}\cap\Big\{\frac{3\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}-|n_{ij}|\le 0\Big\}\Big) \qquad (17)$$
$$\le\mathbb{P}\Big(|\sigma^{*}_{ij}-\sigma_{ij}|>\frac{3\gamma}{4}\sqrt{\frac{\log p}{n}}\Big)+\mathbb{P}\Big(\frac{3\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}\le|n_{ij}|\Big) \qquad (18)$$
$$\le C_1p^{-9/2}+2p^{-9/2}, \qquad (19)$$

where $B_{ij}$ denotes the event $B_{ij}=\Big\{|\sigma^{*}_{ij}-\sigma_{ij}|>\frac{3\gamma}{4}\sqrt{\frac{\log p}{n}}+\frac{3\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}-|n_{ij}|\Big\}$, and the last inequality is due to (5) and (4).

Thus by (12), with probability at least $1-C_1p^{-9/2}-2p^{-9/2}$, we have $|\hat{\sigma}_{ij}-\sigma_{ij}|=|\sigma_{ij}|$, which satisfies (11).

Case 2. $|\sigma_{ij}|\ge 2\gamma\sqrt{\frac{\log p}{n}}+\frac{8\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}=2\tau$. In this case, we have

$$\mathbb{P}(A^{c}_{ij})\le\mathbb{P}\Big(|\tilde{\sigma}_{ij}-\sigma_{ij}|\ge\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}\Big)\le C_1p^{-9/2}+2p^{-8},$$

where the proof is the same as that of (13)-(19). Thus, with probability at least $1-C_1p^{-9/2}-2p^{-8}$, we have

$$|\hat{\sigma}_{ij}-\sigma_{ij}|=|\tilde{\sigma}_{ij}-\sigma_{ij}|. \qquad (20)$$

Also, by (9), (11) also holds.

Case 3. Otherwise, $\frac{\gamma}{4}\sqrt{\frac{\log p}{n}}+\frac{\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}\le|\sigma_{ij}|\le 2\gamma\sqrt{\frac{\log p}{n}}+\frac{8\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}$. In this case, we have

$$|\hat{\sigma}_{ij}-\sigma_{ij}|=|\sigma_{ij}|\quad\text{or}\quad|\tilde{\sigma}_{ij}-\sigma_{ij}|. \qquad (21)$$

When $|\sigma_{ij}|\le\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}=\tau$, we can see from (9) that with probability at least $1-2p^{-6}-C_1p^{-8}$,

$$|\tilde{\sigma}_{ij}-\sigma_{ij}|\le\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}\le 4|\sigma_{ij}|.$$

Thus, (11) also holds. Otherwise, when $|\sigma_{ij}|\ge\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}$, (11) also holds. Thus, Lemma 3 is true. □

By Lemma 3, we have the following upper bound on the $\ell_2$-norm error of $\Sigma^{+}$.

Theorem 2. The output $\Sigma^{+}$ of Algorithm 1 satisfies:

$$\mathbb{E}\|\Sigma^{+}-\Sigma\|_2^2=O\Big(\frac{s^2\log p}{n}+\frac{s^2\log p\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{\log^2\frac{1}{\delta}}{n^2\epsilon^4}\Big), \qquad (22)$$

where the expectation is taken over the coins of the algorithm and the randomness of $\{x_1,x_2,\cdots,x_n\}$.

Proof of Theorem 2. We first show that $\|\Sigma^{+}-\Sigma\|_2\le 2\|\hat{\Sigma}-\Sigma\|_2$. This is due to the following:

$$\|\Sigma^{+}-\Sigma\|_2\le\|\Sigma^{+}-\hat{\Sigma}\|_2+\|\hat{\Sigma}-\Sigma\|_2\le\max_{i:\lambda_i\le 0}|\lambda_i|+\|\hat{\Sigma}-\Sigma\|_2$$
$$\le\max_{i:\lambda_i\le 0}|\lambda_i-\lambda_i(\Sigma)|+\|\hat{\Sigma}-\Sigma\|_2\le 2\|\hat{\Sigma}-\Sigma\|_2,$$

where the third inequality is due to the fact that $\Sigma$ is positive semi-definite, and the last one follows from Weyl's inequality.

This means that we only need to bound $\|\hat{\Sigma}-\Sigma\|_2$. Since $\hat{\Sigma}-\Sigma$ is symmetric, we know that $\|\hat{\Sigma}-\Sigma\|_2\le\|\hat{\Sigma}-\Sigma\|_1$ [?]. Thus, it suffices to prove that the bound in (22) holds for $\|\hat{\Sigma}-\Sigma\|_1$.

We define the event $E_{ij}$ as

$$E_{ij}=\big\{|\hat{\sigma}_{ij}-\sigma_{ij}|\le 4\min\{|\sigma_{ij}|,\ \tau\}\big\}. \qquad (23)$$

Then, by Lemma 3, we have $\mathbb{P}(E_{ij})\ge 1-2C_1p^{-9/2}$.

Let $D=(d_{ij})_{1\le i,j\le p}$, where $d_{ij}=(\hat{\sigma}_{ij}-\sigma_{ij})\cdot I(E^{c}_{ij})$. Then, we have

$$\|\hat{\Sigma}-\Sigma\|_1^2\le\|\hat{\Sigma}-\Sigma-D+D\|_1^2\le 2\|\hat{\Sigma}-\Sigma-D\|_1^2+2\|D\|_1^2$$
$$\le 4\Big(\sup_{j}\sum_{i\ne j}|\hat{\sigma}_{ij}-\sigma_{ij}|I(E_{ij})\Big)^2+2\|D\|_1^2+O\Big(\frac{\gamma^2\log p}{n}+\frac{\log p\log\frac{1}{\delta}}{n^2\epsilon^2}\Big). \qquad (24)$$

We first bound the first term of (24). By the definition of $E_{ij}$ and Lemma 3, we can upper bound it by

$$\Big(\sup_{j}\sum_{i\ne j}|\hat{\sigma}_{ij}-\sigma_{ij}|I(E_{ij})\Big)^2\le 16\Big(\sup_{j}\sum_{i\ne j}\min\{|\sigma_{ij}|,\ \tau\}\Big)^2\le 16s^2\tau^2\le O\Big(\frac{s^2\gamma^2\log p}{n}+\frac{s^2\log p\log\frac{1}{\delta}}{n^2\epsilon^2}\Big), \qquad (25)$$

where the second inequality is due to the assumption that at most $s$ elements of $(\sigma_{ij})_{i\ne j}$ are non-zero in each column.

For the second term in (24), we have

$$\mathbb{E}\|D\|_1^2\le p\sum_{ij}\mathbb{E}d_{ij}^2\le p\,\mathbb{E}\sum_{ij}\Big[(\hat{\sigma}_{ij}-\sigma_{ij})^2 I\big(E^{c}_{ij}\cap\{\hat{\sigma}_{ij}=\tilde{\sigma}_{ij}\}\big)+(\hat{\sigma}_{ij}-\sigma_{ij})^2 I\big(E^{c}_{ij}\cap\{\hat{\sigma}_{ij}=0\}\big)\Big]$$
$$\le p\,\mathbb{E}\sum_{ij}(\tilde{\sigma}_{ij}-\sigma_{ij})^2 I(E^{c}_{ij})+p\sum_{ij}\mathbb{E}\,\sigma_{ij}^2 I\big(E^{c}_{ij}\cap\{\hat{\sigma}_{ij}=0\}\big). \qquad (26)$$

For the first term in (26), we have

$$p\sum_{ij}\mathbb{E}\big\{(\tilde{\sigma}_{ij}-\sigma_{ij})^2 I(E^{c}_{ij})\big\}\le p\sum_{ij}\big[\mathbb{E}(\tilde{\sigma}_{ij}-\sigma_{ij})^6\big]^{\frac{1}{3}}\,\mathbb{P}^{\frac{2}{3}}(E^{c}_{ij}) \qquad (27)$$
$$\le Cp\cdot p^2\,\frac{\log\frac{1}{\delta}}{n^2\epsilon^2}\,p^{-3}=O\Big(\frac{\log\frac{1}{\delta}}{n^2\epsilon^2}\Big),$$

where the first inequality is due to Hölder's inequality and the second is due to the fact that, for some constant $C_3>0$,

$$\mathbb{E}(\tilde{\sigma}_{ij}-\sigma_{ij})^6\le C_3\big[\mathbb{E}(\sigma^{*}_{ij}-\sigma_{ij})^6+\mathbb{E}\,n_{ij}^6\big].$$

Since $n_{ij}$ is Gaussian, we have $\mathbb{E}\,n_{ij}^6\le C_4\sigma_1^6=O\Big(\big(\frac{\log\frac{1}{\delta}}{n^2\epsilon^2}\big)^3\Big)$ for some constant $C_4$ [?]. For the first term $\mathbb{E}(\sigma^{*}_{ij}-\sigma_{ij})^6$, since $x_i$ is sampled from the sub-Gaussian distribution (2), by the Whittle inequality (Theorem 2 in [?] or [?]), the quadratic form $\sigma^{*}_{ij}$ satisfies $\mathbb{E}(\sigma^{*}_{ij}-\sigma_{ij})^6\le\frac{C_5}{n^6}$ for some positive constant $C_5>0$.

For the second term of (26), we have

$$p\sum_{ij}\mathbb{E}\,\sigma_{ij}^2 I\big(E^{c}_{ij}\cap\{\hat{\sigma}_{ij}=0\}\big)=p\sum_{ij}\mathbb{E}\,\sigma_{ij}^2\,I\big(|\sigma_{ij}|>4\tau\big)\,I\big(|\tilde{\sigma}_{ij}|\le\tau\big)$$
$$\le p\sum_{ij}\mathbb{E}\,\sigma_{ij}^2\,I\big(|\sigma_{ij}|>4\tau\big)\,I\big(|\sigma_{ij}|-|\tilde{\sigma}_{ij}-\sigma_{ij}|\le\tau\big)$$
$$\le p\sum_{ij}\sigma_{ij}^2\,\mathbb{E}\,I\big(|\sigma_{ij}|>4\tau\big)\,I\Big(|\tilde{\sigma}_{ij}-\sigma_{ij}|\ge\frac{3}{4}|\sigma_{ij}|\Big)$$
$$\le p\sum_{ij}\sigma_{ij}^2\,\mathbb{E}\,I\big(|\sigma_{ij}|>4\tau\big)\,I\Big(|\sigma^{*}_{ij}-\sigma_{ij}|+|n_{ij}|\ge\frac{3}{4}|\sigma_{ij}|\Big)$$
$$\le p\sum_{ij}\sigma_{ij}^2\,\mathbb{P}\Big(\Big\{|\sigma^{*}_{ij}-\sigma_{ij}|\ge\frac{3}{4}|\sigma_{ij}|-|n_{ij}|\Big\}\cap\big\{|\sigma_{ij}|>4\tau\big\}\Big) \qquad (28)$$
$$=p\sum_{ij}\sigma_{ij}^2\,\mathbb{P}\Big(\Big\{|\sigma^{*}_{ij}-\sigma_{ij}|\ge\frac{3}{4}|\sigma_{ij}|-|n_{ij}|\Big\}\cap\Big\{|n_{ij}|\le\frac{1}{4}|\sigma_{ij}|\Big\}\cap\big\{|\sigma_{ij}|>4\tau\big\}\Big)$$
$$\quad+p\sum_{ij}\sigma_{ij}^2\,\mathbb{P}\Big(\Big\{|\sigma^{*}_{ij}-\sigma_{ij}|\ge\frac{3}{4}|\sigma_{ij}|-|n_{ij}|\Big\}\cap\Big\{|n_{ij}|\ge\frac{1}{4}|\sigma_{ij}|\Big\}\cap\big\{|\sigma_{ij}|>4\tau\big\}\Big) \qquad (29)$$
$$\le p\sum_{ij}\sigma_{ij}^2\,\mathbb{P}\Big(\Big\{|\sigma^{*}_{ij}-\sigma_{ij}|\ge\frac{1}{2}|\sigma_{ij}|\Big\}\cap\big\{|\sigma_{ij}|>4\tau\big\}\Big)+p\sum_{ij}\sigma_{ij}^2\,\mathbb{P}\Big(\Big\{|n_{ij}|\ge\frac{1}{4}|\sigma_{ij}|\Big\}\cap\big\{|\sigma_{ij}|>4\tau\big\}\Big), \qquad (30)$$

where $\tau=\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}$ as before.

For the second term of (30), by Lemmas 1 and 2 we have

$$p\sum_{ij}\sigma_{ij}^2\,\mathbb{P}\Big(\Big\{|n_{ij}|\ge\frac{1}{4}|\sigma_{ij}|\Big\}\cap\big\{|\sigma_{ij}|>4\tau\big\}\Big)$$
$$\le p\sum_{ij}\sigma_{ij}^2\,\mathbb{P}\Big(|n_{ij}|\ge\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{n\epsilon}\Big)\,\mathbb{P}\Big(|n_{ij}|>\frac{1}{4}\sigma_{ij}\Big)$$
$$\le Cp\sum_{ij}\sigma_{ij}^2\exp\Big(-\frac{\big(\gamma\sqrt{\frac{\log p}{n}}+4\sigma_1\sqrt{\log p}\big)^2}{2\sigma_1^2}\Big)\exp\Big(-\frac{\sigma_{ij}^2}{32\sigma_1^2}\Big)$$
$$\le Cp\sum_{ij}\sigma_{ij}^2\exp\Big(-\frac{\big(\gamma\sqrt{\frac{\log p}{n}}+4\sigma_1\sqrt{\log p}\big)^2}{2\sigma_1^2}\Big)\frac{32\sigma_1^2}{\sigma_{ij}^2}$$
$$\le C\sigma_1^2\,p\cdot p^2\exp\Big(-\frac{\gamma^2\log p}{2n\sigma_1^2}\Big)p^{-8} \qquad (31)$$
$$\le C\sigma_1^2\,p^{-5}\Big(\frac{2n\sigma_1^2}{\gamma^2\log p}\Big)^2=O\Big(\frac{\log^2\frac{1}{\delta}}{n^2\epsilon^4}\Big). \qquad (32)$$

For the first term of (30), by Lemma 2 we have

$$p\sum_{ij}\sigma_{ij}^2\,\mathbb{P}\Big(\Big\{|\sigma^{*}_{ij}-\sigma_{ij}|\ge\frac{1}{2}|\sigma_{ij}|\Big\}\cap\Big\{|\sigma_{ij}|\ge 4\gamma\sqrt{\frac{\log p}{n}}\Big\}\Big)$$
$$\le\frac{p}{n}\sum_{ij}n\sigma_{ij}^2\exp\Big(-\frac{2n\sigma_{ij}^2}{\gamma^2}\Big)I\Big(|\sigma_{ij}|\ge 4\gamma\sqrt{\frac{\log p}{n}}\Big)$$
$$=\frac{p}{n}\sum_{ij}\Big[n\sigma_{ij}^2\exp\Big(-\frac{n\sigma_{ij}^2}{\gamma^2}\Big)\Big]\exp\Big(-\frac{n\sigma_{ij}^2}{\gamma^2}\Big)I\Big(|\sigma_{ij}|\ge 4\gamma\sqrt{\frac{\log p}{n}}\Big)$$
$$\le\frac{p}{n}\sum_{ij}n\sigma_{ij}^2\,\frac{\gamma^2}{n\sigma_{ij}^2}\exp(-16\log p) \qquad (33)$$
$$\le\frac{C\gamma^2 p^3}{n}\,p^{-16}=O\Big(\frac{1}{n}\Big). \qquad (34)$$

Thus, in total we have $\mathbb{E}\|D\|_1^2=O\big(\frac{\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{\log^2\frac{1}{\delta}}{n^2\epsilon^4}+\frac{1}{n}\big)$. This means that

$$\mathbb{E}\|\hat{\Sigma}-\Sigma\|_1^2=O\Big(\frac{s^2\log p}{n}+\frac{s^2\log p\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{\log^2\frac{1}{\delta}}{n^2\epsilon^4}\Big),$$

which completes the proof. □

Corollary 1. For any $1\le w\le\infty$, the matrix $\hat{\Sigma}$ in (10) after the first step of thresholding satisfies

$$\|\hat{\Sigma}-\Sigma\|_w^2\le O\Big(\frac{s^2\log p}{n}+\frac{s^2\log p\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{\log^2\frac{1}{\delta}}{n^2\epsilon^4}\Big), \qquad (35)$$

where the $w$-norm of a matrix $A$ is defined as $\|A\|_w=\sup_{x\ne 0}\frac{\|Ax\|_w}{\|x\|_w}$. Specifically, for a matrix $A=(a_{ij})_{1\le i,j\le p}$, $\|A\|_1=\sup_j\sum_i|a_{ij}|$ is the maximum absolute column sum, and $\|A\|_\infty=\sup_i\sum_j|a_{ij}|$ is the maximum absolute row sum.

Comparing the bound in the above corollary with the optimal minimax rate $\Theta\big(\frac{s^2\log p}{n}\big)$ in [?] for the non-private case, we can see that the impact of differential privacy is an additional error of $O\big(\frac{s^2\log p\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{\log^2\frac{1}{\delta}}{n^2\epsilon^4}\big)$. It is an open problem to determine whether the bound in Theorem 2 is tight.

Proof of Corollary 1. By the Riesz-Thorin interpolation theorem [?], we have

$$\|A\|_w\le\max\{\|A\|_1,\|A\|_2,\|A\|_\infty\}$$

for any matrix $A$ and any $1\le w\le\infty$. Since $\Sigma^{+}-\Sigma$ is a symmetric matrix, we have $\|\Sigma^{+}-\Sigma\|_2\le\|\Sigma^{+}-\Sigma\|_1$ and $\|\Sigma^{+}-\Sigma\|_1=\|\Sigma^{+}-\Sigma\|_\infty$. Thus, the corollary follows from the proof of Theorem 2. □

4.3. Extension to Local Differential Privacy

One advantage of our Algorithm 1 is that it can be easily extended to the local differential privacy (LDP) model.

Differential privacy in the local model. In LDP, we have a data universe $\mathcal{X}$, $n$ players, each holding a private data record $x_i\in\mathcal{X}$, and a server that is in charge of coordinating the protocol. An LDP protocol proceeds in $T$ rounds. In each round, the server sends a message, sometimes called a query, to a subset of the players, requesting them to run a particular algorithm. Based on the queries, each player $i$ in the subset selects an algorithm $Q_i$, runs it on her data, and sends the output back to the server.

Definition 3 ([?]). An algorithm $Q$ is $(\epsilon,\delta)$-locally differentially private (LDP) if for all pairs $x,x'\in\mathcal{X}$, and for all events $E$ in the output space of $Q$, we have

$$\mathbb{P}[Q(x)\in E]\le e^{\epsilon}\,\mathbb{P}[Q(x')\in E]+\delta.$$

A multi-player protocol is $(\epsilon,\delta)$-LDP if for all possible inputs and runs of the protocol, the transcript of each player's interaction with the server is $(\epsilon,\delta)$-LDP. If $T=1$, we say that the protocol is $(\epsilon,\delta)$ non-interactive LDP.

Algorithm 2 LDP-Thresholding

Input: $\{x_1,x_2,\cdots,x_n\}\sim P\in\mathcal{F}_p(\sigma^2,s)$, and $\epsilon,\delta\in(0,1)$.

1: for each $i\in[n]$ do

2: Denote $\widetilde{x_ix_i^{T}}=x_ix_i^{T}+z_i$, where $z_i\in\mathbb{R}^{p\times p}$ is a symmetric matrix whose upper triangle (including the diagonal) consists of i.i.d. samples from $\mathcal{N}(0,\sigma_2^2)$ with $\sigma_2^2=\frac{2\log(1.25/\delta)}{\epsilon^2}$, and each lower triangle entry is copied from its upper triangle counterpart.

3: end for

4: Compute $\tilde{\Sigma}=(\tilde{\sigma}_{ij})_{1\le i,j\le p}=\frac{1}{n}\sum_{i=1}^{n}\widetilde{x_ix_i^{T}}$.

5: Define the thresholding estimator $\hat{\Sigma}=(\hat{\sigma}_{ij})_{1\le i,j\le p}$ as

$$\hat{\sigma}_{ij}=\tilde{\sigma}_{ij}\cdot I\Big[|\tilde{\sigma}_{ij}|>\gamma\sqrt{\frac{\log p}{n}}+\frac{4\sqrt{2\log(1.25/\delta)}\sqrt{\log p}}{\sqrt{n}\,\epsilon}\Big]. \qquad (36)$$

6: Let the eigen-decomposition of $\hat{\Sigma}$ be $\hat{\Sigma}=\sum_{i=1}^{p}\lambda_i v_i v_i^{T}$. Let $\lambda_i^{+}=\max\{\lambda_i,0\}$ be the positive part of $\lambda_i$, and define $\Sigma^{+}=\sum_{i=1}^{p}\lambda_i^{+}v_iv_i^{T}$.

7: return $\Sigma^{+}$.
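A minimal Python sketch of Algorithm 2 follows (again our own illustration; note that each report is computed from a single record, so the per-user noise scale $\sigma_2$ does not shrink with $n$):

```python
import numpy as np

def ldp_report(x_i, eps, delta, rng):
    """One user's (eps, delta)-LDP report: her rank-one matrix plus symmetric noise."""
    p = x_i.shape[0]
    sigma2 = np.sqrt(2.0 * np.log(1.25 / delta)) / eps   # per-user noise std
    upper = np.triu(rng.normal(0.0, sigma2, size=(p, p)))
    return np.outer(x_i, x_i) + upper + np.triu(upper, 1).T

def ldp_thresholding(x, eps, delta, gamma=1.0, rng=None):
    """Sketch of Algorithm 2: aggregate LDP reports, threshold, project to PSD."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = x.shape
    sigma_tilde = sum(ldp_report(x[i], eps, delta, rng) for i in range(n)) / n
    tau = (gamma * np.sqrt(np.log(p) / n)
           + 4.0 * np.sqrt(2.0 * np.log(1.25 / delta)) * np.sqrt(np.log(p))
             / (np.sqrt(n) * eps))
    sigma_hat = np.where(np.abs(sigma_tilde) > tau, sigma_tilde, 0.0)
    eigvals, eigvecs = np.linalg.eigh(sigma_hat)
    return (eigvecs * np.maximum(eigvals, 0.0)) @ eigvecs.T
```

Averaging the $n$ reports reduces the effective noise standard deviation by a factor of $\sqrt{n}$, which is why the threshold in (36) carries $\sqrt{n}\,\epsilon$ rather than $n\epsilon$ in its denominator.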

Inspired by Algorithm 1, it is easy to extend our DP algorithm to the LDP model. The idea is that each player perturbs her rank-one matrix $x_ix_i^{T}$ locally, and the server aggregates the noisy versions; see Algorithm 2 for details.

The following theorem shows the error bound for the output of Algorithm 2, whose proof is almost the same as that of Theorem 2.

Theorem 3. The output $\Sigma^{+}$ of Algorithm 2 satisfies:

$$\mathbb{E}\|\Sigma^{+}-\Sigma\|_2^2=O\Big(\frac{s^2\log p\log\frac{1}{\delta}}{n\epsilon^2}\Big), \qquad (37)$$

where the expectation is taken over the coins of the algorithm and the randomness of $\{x_1,x_2,\cdots,x_n\}$. Moreover, $\hat{\Sigma}$ in (36) satisfies $\|\hat{\Sigma}-\Sigma\|_w^2=O\big(\frac{s^2\log p\log\frac{1}{\delta}}{n\epsilon^2}\big)$.

Compared with the upper bound of $O\big(\frac{s^2\log p}{n}+\frac{s^2\log p\log\frac{1}{\delta}}{n^2\epsilon^2}+\frac{\log^2\frac{1}{\delta}}{n^2\epsilon^4}\big)$ in the central $(\epsilon,\delta)$-DP model, we can see that the upper bound of $O\big(\frac{s^2\log p\log\frac{1}{\delta}}{n\epsilon^2}\big)$ in the local model is much larger. We also note that the upper bound in the local model is tight, matching a lower bound recently given in [?].

5. Experiments

In this section, we evaluate the practical performance of Algorithms 1 and 2 on synthetic datasets.

Data Generation. We first generate a symmetric sparse matrix $\tilde{U}$ with sparsity ratio $sr$; that is, the matrix has $sr\times p\times p$ non-zero entries. Then, we let $U=\tilde{U}+\lambda I_p$ for some constant $\lambda$ to make $U$ positive semi-definite, and then scale it as $U=\frac{U}{c}$ for some constant $c$ which makes the norm of the samples less than 1 (with high probability)¹. Finally, we sample $\{x_1,\cdots,x_n\}$ from the multivariate Gaussian distribution $\mathcal{N}(0,U)$. In this paper, we set $\lambda=50$ and $c=200$.
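A Python sketch of this data-generation procedure (the function name and seeding are ours):

```python
import numpy as np

def generate_data(n, p, sr, lam=50.0, c=200.0, seed=0):
    """Generate n samples from N(0, U) with a sparse, shifted, rescaled covariance U."""
    rng = np.random.default_rng(seed)
    # Symmetric sparse matrix with roughly sr * p * p non-zero entries.
    mask = np.triu(rng.random((p, p)) < sr)
    u_tilde = np.where(mask, rng.normal(0.0, 1.0, size=(p, p)), 0.0)
    u_tilde = u_tilde + np.triu(u_tilde, 1).T
    # Shift by lam * I toward PSD (lam is large enough in the paper's settings),
    # then rescale by c so that sample norms stay below 1 with high probability.
    u = (u_tilde + lam * np.eye(p)) / c
    x = rng.multivariate_normal(np.zeros(p), u, size=n)
    return x, u
```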

Experimental Settings. To measure the performance, we report the $\ell_2$ and $\ell_1$ norms of the relative error, i.e., $\frac{\|\Sigma^{+}-U\|_2}{\|U\|_2}$ and $\frac{\|\Sigma^{+}-U\|_1}{\|U\|_1}$, against the sample size $n$ in three different settings: 1) We set $p=100$, $\epsilon=1$, $\delta=\frac{1}{n}$ and vary the sparsity ratio $sr\in\{0.1,0.2,0.3,0.5\}$. 2) We set $\epsilon=1$, $\delta=\frac{1}{n}$, $sr=0.2$, and let the dimensionality $p$ vary in $\{50,100,200,500\}$. 3) We fix $p=200$, $\delta=\frac{1}{n}$, $sr=0.2$ and vary the privacy level $\epsilon\in\{0.1,0.5,1,2\}$. We run each experiment 20 times and report the average error.
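Given the sketches above, the reported quantities can be computed as follows (illustrative; `generate_data` and `dp_thresholding` are the hypothetical helpers defined earlier, not code from the paper):

```python
import numpy as np

def relative_errors(n, p, sr, eps, n_trials=20):
    """Average l2 and l1 relative errors of DP-Thresholding over repeated trials."""
    errs = np.zeros((n_trials, 2))
    for t in range(n_trials):
        x, u = generate_data(n, p, sr, seed=t)
        sigma_plus = dp_thresholding(x, eps=eps, delta=1.0 / n)
        errs[t, 0] = np.linalg.norm(sigma_plus - u, 2) / np.linalg.norm(u, 2)
        errs[t, 1] = np.linalg.norm(sigma_plus - u, 1) / np.linalg.norm(u, 1)
    return errs.mean(axis=0)
```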

Experimental Results. Figures ?? and ?? show the results of DP-Thresholding (Algorithm 1) with $\ell_2$ and $\ell_1$ relative error, respectively. Figures ?? and ?? show the results of LDP-Thresholding (Algorithm 2) with $\ell_2$ and $\ell_1$ relative error, respectively. From the figures we can see the following. 1) If the sparsity ratio is large, i.e., the underlying covariance matrix is denser, the relative error is larger; this is due to the fact that the error depends on the sparsity $s$, as shown in Theorems 2 and 3. 2) The dimensionality only slightly affects the relative error. That is, even if we double the value of $p$, the error increases only slightly. This is consistent with our theoretical analysis in Theorems 2 and 3, which shows that the error of our private estimators depends only logarithmically on $p$ (i.e., $\log p$). 3) As the privacy parameter $\epsilon$ increases (which means a weaker privacy guarantee), the relative error decreases, which is also consistent with Theorems 2 and 3.

¹Although the distribution is not bounded by 1, as we can see from the previous section, we obtain the same result as long as the $\ell_2$ norm of the samples is bounded by 1.
