
2 Multiplicative Noise Protocols


Suppose that $X_o$ contains $n$ records, each with $d$ numerical attributes. Some of the attributes are nonnegative; denote them $X_o^p$. We wish to construct and release a masked data set $X_m$ with these characteristics:

$$X_m^p \geq 0 \quad (2)$$

$$E[X_m] = E[X_o] \quad (3)$$

$$\Sigma(X_m) = \Sigma(X_o), \quad (4)$$

where $X_m^p$ are the masked values of $X_o^p$ and $\Sigma(\cdot)$ means "covariance matrix of $(\cdot)$."

In [OganKarr10], a masking scheme that preserves positivity, means, and the covariance matrix of the data was proposed. The basis of this scheme is multiplicative noise, implemented by taking logarithms, applying additive, normally distributed noise, and exponentiating. This scheme works only if all the variables in the data set are nonnegative. The details are below.

Let $E$ be noise that is conditionally independent of $X_o$ given $E[X_o]$ and $\Sigma(X_o)$, and satisfies

$$E[X_o \circ \exp(E)] = E[X_o] \quad (5)$$

$$\Sigma(X_o \circ \exp(E)) = (1+k)\,\Sigma(X_o), \quad (6)$$

where $k > 0$ is an agency-chosen parameter and $\circ$ denotes elementwise matrix multiplication (Schur or Hadamard product). The exponentiation in (5), (6) and elsewhere below also takes place componentwise. Then

$$X_m = \frac{(\sqrt{1+k}-1)\,E[X_o] + X_o \circ \exp(E)}{\sqrt{1+k}} \quad (7)$$

satisfies (2)–(4).
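To see why, take means and covariances in (7) and apply (5) and (6):

$$E[X_m] = \frac{(\sqrt{1+k}-1)\,E[X_o] + E[X_o \circ \exp(E)]}{\sqrt{1+k}} = \frac{\sqrt{1+k}\,E[X_o]}{\sqrt{1+k}} = E[X_o], \qquad \Sigma(X_m) = \frac{\Sigma(X_o \circ \exp(E))}{1+k} = \Sigma(X_o),$$

and $X_m^p \geq 0$ holds because both terms in the numerator of (7) are nonnegative whenever $X_o^p \geq 0$.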


For normally distributed noise $E$, [OganKarr10] showed that the following vector of means $\mu_E$ and covariance matrix $\Sigma_E$ should be chosen for $E$ to satisfy (5) and (6):

$$\Sigma_E(i,j) = \log\left(1 + \frac{k\,\Sigma_o(i,j)}{E[X_o(i)X_o(j)]}\right), \quad i, j = 1, \ldots, d \quad (8)$$

$$\mu_E(i) = -\Sigma_E(i,i)/2, \quad i = 1, \ldots, d. \quad (9)$$

Here, $d$ is the number of dimensions in the data. (The choice (9) gives $E[\exp(E(i))] = \exp(\mu_E(i) + \Sigma_E(i,i)/2) = 1$, as (5) requires.)
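As a concrete illustration, here is a minimal NumPy sketch of the scheme (7)–(9); the function name is ours, and sample moments stand in for $E[\cdot]$ and $\Sigma_o$:

```python
import numpy as np

def mask_multiplicative(Xo, k, rng=None):
    """Mask a nonnegative data set Xo (n x d) following (7)-(9)."""
    rng = np.random.default_rng(rng)
    mu_o = Xo.mean(axis=0)
    Sigma_o = np.cov(Xo, rowvar=False)               # sample covariance, d x d
    second_moment = (Xo.T @ Xo) / len(Xo)            # matrix of E[Xo(i) Xo(j)]
    Sigma_E = np.log1p(k * Sigma_o / second_moment)  # equation (8)
    mu_E = -np.diag(Sigma_E) / 2                     # equation (9): E[exp(E)] = 1
    # Sigma_E must itself be a valid (positive semidefinite) covariance matrix
    E = rng.multivariate_normal(mu_E, Sigma_E, size=len(Xo))
    s = np.sqrt(1 + k)
    return ((s - 1) * mu_o + Xo * np.exp(E)) / s     # equation (7)
```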

If the data set contains not only nonnegative variables but variables with negative values as well, the scheme described above cannot be applied directly.

Variables with negative and positive values may lead to

$$1 + \frac{k\,\Sigma_o(i,j)}{E[X_o(i)X_o(j)]} < 0, \quad (10)$$

so the covariance matrix (8) cannot be computed. After extensive experiments, it was also noticed that (10) may hold even for some very rare distributions of values in $X_o^p$. An example of such a distribution is shown in Figure 1: the variables are negatively correlated and aligned along the axes.
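The following hypothetical example sketches how such data trigger (10); the numbers are ours, chosen only to mimic the shape in Figure 1:

```python
# Two nonnegative, negatively correlated variables aligned along the axes:
# E[X Y] is small because x and y are rarely both large, while cov(x, y)
# is strongly negative, so the argument of the log in (8) drops below zero.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(10, 50, 100), rng.uniform(0, 1, 100)])
y = np.concatenate([rng.uniform(0, 0.5, 100), rng.uniform(2, 12, 100)])
k = 0.5
cov_xy = np.cov(x, y)[0, 1]   # strongly negative
exy = np.mean(x * y)          # small positive
print(1 + k * cov_xy / exy)   # negative: the log in (8) is undefined
```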

So, for the implementation of the multiplicative noise masking strategy, it would be helpful to have a scheme applicable to all data sets. One possible solution is to convert all the variables to z-scores and make these z-scores nonnegative by adding some value (or vector, for multivariate data), denote it $lag$, such that $lag \geq |\min(Z)|$. Denote these nonnegative z-scores by $Z_{pos}$. Then we can apply the masking scheme described by (7), (8) and (9) to $Z_{pos}$ and afterwards return the resulting data to the original scale:

$$Z_m = \frac{(\sqrt{1+k}-1)\,lag + Z_{pos} \circ \exp(E_{zpos})}{\sqrt{1+k}} \quad (11)$$

$$X_m = (Z_m - lag) \circ \sigma_o + E(X_o), \quad (12)$$

Fig. 1. Example of a data set when the covariance matrix for noise cannot be computed (scatter plot of $Y$ versus $X$)

where $\sigma_o$ is the main diagonal of $\Sigma_o$ and $E_{zpos}$ has the following mean and covariance matrix:

$$\Sigma_{E_{zpos}}(i,j) = \log\left(1 + \frac{k\,\Sigma_{zpos}(i,j)}{E[Z_{pos}(i)Z_{pos}(j)]}\right), \quad i, j = 1, \ldots, d \quad (13)$$

$$\mu_{E_{zpos}}(i) = -\Sigma_{E_{zpos}}(i,i)/2, \quad i = 1, \ldots, d, \quad (14)$$

where $\Sigma_{zpos}(i,j)$ is the $(i,j)$ element of the covariance matrix of the positive z-scores.
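A sketch of this z-score variant under the same sample-moment assumptions as before (again, `mask_zscores` and its signature are illustrative, not from the paper):

```python
import numpy as np

def mask_zscores(Xo, k, lag, rng=None):
    """Mask Xo (n x d) via nonnegative z-scores, following (11)-(14)."""
    rng = np.random.default_rng(rng)
    mu_o = Xo.mean(axis=0)
    sigma_o = Xo.std(axis=0)
    Zpos = (Xo - mu_o) / sigma_o + lag               # nonnegative z-scores
    Sigma_z = np.cov(Zpos, rowvar=False)
    second_moment = (Zpos.T @ Zpos) / len(Zpos)      # E[Zpos(i) Zpos(j)]
    Sigma_E = np.log1p(k * Sigma_z / second_moment)  # equation (13)
    mu_E = -np.diag(Sigma_E) / 2                     # equation (14)
    E = rng.multivariate_normal(mu_E, Sigma_E, size=len(Zpos))
    s = np.sqrt(1 + k)
    Zm = ((s - 1) * lag + Zpos * np.exp(E)) / s      # equation (11)
    return (Zm - lag) * sigma_o + mu_o               # equation (12)
```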

Masked data $X_m$ in this case can be represented as

$$X_m = \left(\frac{Z_{pos}\circ\exp(E_{zpos}) + (\sqrt{1+k}-1)\,lag}{\sqrt{1+k}} - lag\right)\circ\sigma_o + E(X_o)$$

$$= \frac{X_o\circ\exp(E_{zpos}) - E(X_o)\circ\exp(E_{zpos}) + \sigma_o\circ lag\circ\exp(E_{zpos})}{\sqrt{1+k}} + \frac{\sqrt{1+k}\,E(X_o) - lag\circ\sigma_o}{\sqrt{1+k}}. \quad (15)$$

It is easy to see that this scheme preserves means and the covariance matrix:

$$E(X_m) = \frac{1}{\sqrt{k+1}}\left[E(X_o) - E(X_o) + \sigma_o\circ lag - \sigma_o\circ lag + \sqrt{1+k}\,E(X_o)\right] = E(X_o). \quad (16)$$

The equality in the formula above follows from the fact that the noise is independent of $X_o$ and $E(\exp(E_{zpos})) = \mathbf{1}$.

$$\Sigma_m(i,j) = \mathrm{cov}\!\left(\frac{Z_{pos}(i)\exp(E_{zpos}(i))\,\sigma_o(i)}{\sqrt{1+k}},\ \frac{Z_{pos}(j)\exp(E_{zpos}(j))\,\sigma_o(j)}{\sqrt{1+k}}\right)$$

$$= \frac{\sigma_o(i)\sigma_o(j)}{1+k}\,(1+k)\,\mathrm{cov}(Z_{pos}(i), Z_{pos}(j)) = \sigma_o(i)\sigma_o(j)\,\mathrm{cor}(X_o(i), X_o(j)) = \Sigma_o(i,j), \quad (17)$$

where $\mathrm{cov}(\cdot)$ and $\mathrm{cor}(\cdot)$ denote covariance and correlation of $(\cdot)$, respectively. Note that the second equality in the formula above follows from property (6).

Now we will show that the masking scheme (15), with a specific choice of $lag$, will never lead to the case described by (10).

First, let us see what the possible values of $lag$ are in this scheme. $lag$ should be greater than $|\min(Z)|$; however, a very big $lag$ may lead to negative masked data (this follows from equation (12)), which violates the positivity constraints for the variables $X_o^p$.

From (11), $Z_m$ is minimized when $E_{zpos} \to -\infty$:

$$\min(Z_m) > \frac{(\sqrt{1+k}-1)\,lag}{\sqrt{1+k}}.$$


From (12), $\min(X_m)$ is larger than

$$-\frac{lag}{\sqrt{1+k}}\circ\sigma_o + E(X_o). \quad (18)$$

To preserve positivity in the masked data it is enough to require positivity of (18). So we have an upper bound for $lag$:

$$lag \leq \frac{E(X_o)}{\sigma_o}\,\sqrt{1+k},$$

where division is done componentwise.

The lower bound for $lag$ is $|\min(Z)|$. For nonnegative variables with zeros, $|\min(Z)| = E(X_o)/\sigma_o$. So we can write the lower and upper bounds for $lag$ as:

$$\frac{E(X_o)}{\sigma_o} \leq lag \leq \frac{E(X_o)}{\sigma_o}\,\sqrt{1+k}. \quad (19)$$

Let us consider a few choices of $lag$ in this range. If we choose $lag = E(X_o)/\sigma_o$, then the scheme with the z-score transformation (15) is equivalent to the scheme without the transformation (7). In fact, it is straightforward to verify that the masked data in this case can be written as:

$$X_m = \frac{(\sqrt{1+k}-1)\,E[X_o] + X_o \circ \exp(E_{zpos})}{\sqrt{1+k}} \quad (20)$$

Expression (20) is almost identical to (7), except for the second term in the numerator: $X_o \circ \exp(E_{zpos})$.

Below we will show that even this term is identical in both schemes. In particular, after application of our masking scheme to the positive z-scores, the noise $E_{zpos}$ has the mean and covariance matrix defined by (14) and (13), respectively.

Note that

$$\frac{\Sigma_{zpos}(i,j)}{E[Z_{pos}(i)Z_{pos}(j)]} = \frac{\mathrm{cor}(X_o(i),X_o(j))}{E\left[\left(\frac{X_o(i)-E(X_o(i))}{\sigma_o(i)} + lag(i)\right)\left(\frac{X_o(j)-E(X_o(j))}{\sigma_o(j)} + lag(j)\right)\right]}$$

$$= \frac{\mathrm{cor}(X_o(i),X_o(j))}{E\left[\frac{X_o(i)}{\sigma_o(i)}\cdot\frac{X_o(j)}{\sigma_o(j)}\right]} = \frac{\Sigma_o(i,j)}{E[X_o(i)X_o(j)]}.$$

So, when $lag = E(X_o)/\sigma_o$, the transformation to positive z-scores does not make any changes to the original scheme (7).

Now let us consider the other extreme for $lag$: $lag = \sqrt{1+k}\,E(X_o)/\sigma_o$. It is easy to verify that the masked data in this case can be written as:

$$X_m = \frac{(\sqrt{1+k}-1)\,E[X_o]\circ\exp(E_{zpos}) + X_o\circ\exp(E_{zpos})}{\sqrt{1+k}} \quad (21)$$

The covariance matrix of the noise for this scheme is:

$$\Sigma_{E_{zpos}}(i,j) = \log\left(1 + \frac{k\,\Sigma_{zpos}(i,j)}{E[Z_{pos}(i)Z_{pos}(j)]}\right) \quad (22)$$

To prove that the expression under the logarithm in (22) is always positive, let us express it in terms of the original data:

$$Z_{pos}(i) = \frac{X_o(i) + E(X_o(i))\,(\sqrt{1+k}-1)}{\sigma_o(i)}.$$

It is easy to see that

$$E[Z_{pos}(i)Z_{pos}(j)] = \frac{E[X_o(i)X_o(j)] + k\,E(X_o(i))\,E(X_o(j))}{\sigma_o(i)\,\sigma_o(j)},$$

so

$$\Sigma_{E_{zpos}}(i,j) = \log\left(1 + \frac{k\,\sigma_o(i)\sigma_o(j)\,\mathrm{cor}(X_o(i),X_o(j))}{E[X_o(i)X_o(j)] + k\,E(X_o(i))\,E(X_o(j))}\right)$$

$$= \log\left(\frac{(1+k)\,E[X_o(i)X_o(j)]}{E[X_o(i)X_o(j)] + k\,E(X_o(i))\,E(X_o(j))}\right). \quad (23)$$

The expression under the logarithm in (23) is always positive for nonnegative $X_o$, so we can always compute $\Sigma_{E_{zpos}}$. In the same way, it is possible to show that no other value of $lag$ (in the range of its possible values) can guarantee that the left-hand side of (10) remains positive for all possible data sets.
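Continuing the earlier numerical example (the axis-aligned `x`, `y` with `k = 0.5`), a quick check that the rewritten log argument in (23) is positive even where the direct form (10) failed:

```python
# Reuses x, y, k from the earlier snippet. The log argument, rewritten as in
# (23), depends only on second moments and means of the nonnegative data.
X = np.column_stack([x, y])
second = (X.T @ X) / len(X)                       # E[Xo(i) Xo(j)]
outer = np.outer(X.mean(axis=0), X.mean(axis=0))  # E(Xo(i)) E(Xo(j))
log_arg = (1 + k) * second / (second + k * outer)
print((log_arg > 0).all())                        # True: (22) is computable
```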

When the data set contains variables that can take positive and negative values together with nonnegative variables, the scheme with the z-score transformation will work too. First the data should be made nonnegative by adding $|\min(X_o)|$, and then scheme (21) is applied to these data. Last, to return the data to the original location, we have to subtract $|\min(X_o)|$ from the result of the previous step.
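A minimal end-to-end sketch of this shift-mask-unshift recipe, reusing the illustrative `mask_zscores` above; shifting only the columns that actually contain negative values is our reading of "adding $|\min(X_o)|$":

```python
import numpy as np

def mask_general(Xo, k, rng=None):
    """Shift mixed-sign data to nonnegative, mask with the safe lag, shift back."""
    shift = np.clip(-Xo.min(axis=0), 0, None)  # |min(Xo)| for columns with negatives
    Xs = Xo + shift                            # nonnegative data
    mu, sigma = Xs.mean(axis=0), Xs.std(axis=0)
    lag = np.sqrt(1 + k) * mu / sigma          # the choice of lag behind scheme (21)
    return mask_zscores(Xs, k, lag, rng) - shift
```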
