
2 Multiplicative Noise Protocols


Suppose that $X_o$ contains $n$ records, each with $d$ numerical attributes. Some of the attributes are nonnegative; denote them $X_o^p$. We wish to construct and release a masked data set $X_m$ with these characteristics:

$$X_m^p \geq 0 \quad (2)$$

$$E[X_m] = E[X_o] \quad (3)$$

$$\Sigma(X_m) = \Sigma(X_o), \quad (4)$$

where $X_m^p$ are the masked values of $X_o^p$ and $\Sigma(\cdot)$ means "covariance matrix of $(\cdot)$."

In [OganKarr10], a masking scheme that preserves positivity, means, and the covariance matrix of the data was proposed. The basis of this scheme is multiplicative noise, implemented by taking logarithms, applying additive, normally distributed noise, and exponentiating. This scheme works only if all the variables in the data set are nonnegative. The details are below.

Let $E$ be noise that is conditionally independent of $X_o$ given $E[X_o]$ and $\Sigma(X_o)$, and satisfies

$$E[X_o \circ \exp(E)] = E[X_o] \quad (5)$$

$$\Sigma(X_o \circ \exp(E)) = (1+k)\,\Sigma(X_o), \quad (6)$$

where $k > 0$ is an agency-chosen parameter and $\circ$ denotes elementwise matrix multiplication (Schur or Hadamard product). The exponentiation in (5), (6) and elsewhere below also takes place componentwise. Then

$$X_m = \frac{(\sqrt{1+k}-1)\,E[X_o] + X_o \circ \exp(E)}{\sqrt{1+k}} \quad (7)$$

satisfies (2)–(4).
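To see why, take means and covariances in (7) and apply (5) and (6):

$$E[X_m] = \frac{(\sqrt{1+k}-1)\,E[X_o] + E[X_o \circ \exp(E)]}{\sqrt{1+k}} = \frac{\sqrt{1+k}\,E[X_o]}{\sqrt{1+k}} = E[X_o], \qquad \Sigma(X_m) = \frac{\Sigma(X_o \circ \exp(E))}{1+k} = \Sigma(X_o),$$

and $X_m^p \geq 0$ holds because both terms in the numerator of (7) are nonnegative whenever $X_o^p \geq 0$.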


For normally distributed noise $E$, [OganKarr10] showed that the following vector of means $\mu_E$ and covariance matrix $\Sigma_E$ should be chosen for $E$ to satisfy (5) and (6):

$$\Sigma_E(i,j) = \log\left(1 + \frac{k\,\Sigma_o(i,j)}{E[X_o(i)X_o(j)]}\right), \quad i, j = 1, \ldots, d \quad (8)$$

$$\mu_E(i) = -\Sigma_E(i,i)/2, \quad i = 1, \ldots, d. \quad (9)$$

Here, $d$ is the number of dimensions in the data. (The choice (9) gives $E[\exp(E(i))] = \exp(\mu_E(i) + \Sigma_E(i,i)/2) = 1$, as (5) requires.)
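As a concrete illustration, here is a minimal NumPy sketch of the scheme (7)–(9); the function name is ours, and sample moments stand in for $E[\cdot]$ and $\Sigma_o$:

```python
import numpy as np

def mask_multiplicative(Xo, k, rng=None):
    """Mask a nonnegative data set Xo (n x d) following (7)-(9)."""
    rng = np.random.default_rng(rng)
    mu_o = Xo.mean(axis=0)
    Sigma_o = np.cov(Xo, rowvar=False)               # sample covariance, d x d
    second_moment = (Xo.T @ Xo) / len(Xo)            # matrix of E[Xo(i) Xo(j)]
    Sigma_E = np.log1p(k * Sigma_o / second_moment)  # equation (8)
    mu_E = -np.diag(Sigma_E) / 2                     # equation (9): E[exp(E)] = 1
    # Sigma_E must itself be a valid (positive semidefinite) covariance matrix
    E = rng.multivariate_normal(mu_E, Sigma_E, size=len(Xo))
    s = np.sqrt(1 + k)
    return ((s - 1) * mu_o + Xo * np.exp(E)) / s     # equation (7)
```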

If the data set contains not only nonnegative variables but variables with negative values as well, the scheme described above cannot be applied directly.

Variables with negative and positive values may lead to

$$1 + \frac{k\,\Sigma_o(i,j)}{E[X_o(i)X_o(j)]} < 0, \quad (10)$$

so the covariance matrix (8) cannot be computed. After extensive experiments, it was also noticed that (10) may hold even for some very rare distributions of values in $X_o^p$. An example of such a distribution is shown in Figure 1: the variables are negatively correlated and aligned along the axes.
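The following hypothetical example sketches how such data trigger (10); the numbers are ours, chosen only to mimic the shape in Figure 1:

```python
# Two nonnegative, negatively correlated variables aligned along the axes:
# E[X Y] is small because x and y are rarely both large, while cov(x, y)
# is strongly negative, so the argument of the log in (8) drops below zero.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(10, 50, 100), rng.uniform(0, 1, 100)])
y = np.concatenate([rng.uniform(0, 0.5, 100), rng.uniform(2, 12, 100)])
k = 0.5
cov_xy = np.cov(x, y)[0, 1]   # strongly negative
exy = np.mean(x * y)          # small positive
print(1 + k * cov_xy / exy)   # negative: the log in (8) is undefined
```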

So, for the implementation of the multiplicative noise masking strategy, it would be helpful to have a scheme applicable to all data sets. One possible solution is to convert all the variables to z-scores and make these z-scores nonnegative by adding some value (or vector, for multivariate data), denote it $lag$, such that $lag \geq |\min(Z)|$. Denote these nonnegative z-scores by $Z_{pos}$. Then we can apply the masking scheme described by (7), (8) and (9) to $Z_{pos}$ and afterwards return the resulting data to the original scale:

$$Z_m = \frac{(\sqrt{1+k}-1)\,lag + Z_{pos} \circ \exp(E_{zpos})}{\sqrt{1+k}} \quad (11)$$

$$X_m = (Z_m - lag) \circ \sigma_o + E(X_o), \quad (12)$$

Fig. 1. Example of a data set when the covariance matrix for noise cannot be computed (scatter plot of $Y$ versus $X$)

where $\sigma_o$ is the main diagonal of $\Sigma_o$ and $E_{zpos}$ has the following mean and covariance matrix:

$$\Sigma_{E_{zpos}}(i,j) = \log\left(1 + \frac{k\,\Sigma_{zpos}(i,j)}{E[Z_{pos}(i)Z_{pos}(j)]}\right), \quad i, j = 1, \ldots, d \quad (13)$$

$$\mu_{E_{zpos}}(i) = -\Sigma_{E_{zpos}}(i,i)/2, \quad i = 1, \ldots, d, \quad (14)$$

where $\Sigma_{zpos}(i,j)$ is the $(i,j)$ element of the covariance matrix of the positive z-scores.
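A sketch of this z-score variant under the same sample-moment assumptions as before (again, `mask_zscores` and its signature are illustrative, not from the paper):

```python
import numpy as np

def mask_zscores(Xo, k, lag, rng=None):
    """Mask Xo (n x d) via nonnegative z-scores, following (11)-(14)."""
    rng = np.random.default_rng(rng)
    mu_o = Xo.mean(axis=0)
    sigma_o = Xo.std(axis=0)
    Zpos = (Xo - mu_o) / sigma_o + lag               # nonnegative z-scores
    Sigma_z = np.cov(Zpos, rowvar=False)
    second_moment = (Zpos.T @ Zpos) / len(Zpos)      # E[Zpos(i) Zpos(j)]
    Sigma_E = np.log1p(k * Sigma_z / second_moment)  # equation (13)
    mu_E = -np.diag(Sigma_E) / 2                     # equation (14)
    E = rng.multivariate_normal(mu_E, Sigma_E, size=len(Zpos))
    s = np.sqrt(1 + k)
    Zm = ((s - 1) * lag + Zpos * np.exp(E)) / s      # equation (11)
    return (Zm - lag) * sigma_o + mu_o               # equation (12)
```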

Masked data $X_m$ in this case can be represented as

$$X_m = \left(\frac{Z_{pos}\circ\exp(E_{zpos}) + (\sqrt{1+k}-1)\,lag}{\sqrt{1+k}} - lag\right)\circ\sigma_o + E(X_o)$$

$$= \frac{X_o\circ\exp(E_{zpos}) - E(X_o)\circ\exp(E_{zpos}) + \sigma_o\circ lag\circ\exp(E_{zpos})}{\sqrt{1+k}} + \frac{\sqrt{1+k}\,E(X_o) - lag\circ\sigma_o}{\sqrt{1+k}}. \quad (15)$$

It is easy to see that this scheme preserves means and the covariance matrix:

$$E(X_m) = \frac{1}{\sqrt{k+1}}\left[E(X_o) - E(X_o) + \sigma_o\circ lag - \sigma_o\circ lag + \sqrt{1+k}\,E(X_o)\right] = E(X_o). \quad (16)$$

The equality in the formula above follows from the fact that the noise is independent of $X_o$ and $E(\exp(E_{zpos})) = \mathbf{1}$.

$$\Sigma_m(i,j) = \mathrm{cov}\!\left(\frac{Z_{pos}(i)\exp(E_{zpos}(i))\,\sigma_o(i)}{\sqrt{1+k}},\ \frac{Z_{pos}(j)\exp(E_{zpos}(j))\,\sigma_o(j)}{\sqrt{1+k}}\right)$$

$$= \frac{\sigma_o(i)\sigma_o(j)}{1+k}\,(1+k)\,\mathrm{cov}(Z_{pos}(i), Z_{pos}(j)) = \sigma_o(i)\sigma_o(j)\,\mathrm{cor}(X_o(i), X_o(j)) = \Sigma_o(i,j), \quad (17)$$

where $\mathrm{cov}(\cdot)$ and $\mathrm{cor}(\cdot)$ denote covariance and correlation of $(\cdot)$, respectively. Note that the second equality in the formula above follows from property (6).

Now we will show that the masking scheme (15), with a specific choice of $lag$, will never lead to the case described by (10).

First, let us see what the possible values of $lag$ are in this scheme. $lag$ should be greater than $|\min(Z)|$; however, a very big $lag$ may lead to negative masked data (this follows from equation (12)), which violates the positivity constraints for the variables $X_o^p$.

From (11), $Z_m$ is minimized when $E_{zpos} \to -\infty$:

$$\min(Z_m) > \frac{(\sqrt{1+k}-1)\,lag}{\sqrt{1+k}}.$$


From (12), $\min(X_m)$ is larger than

$$-\frac{lag}{\sqrt{1+k}}\circ\sigma_o + E(X_o). \quad (18)$$

To preserve positivity in the masked data it is enough to require positivity of (18). So we have an upper bound for $lag$:

$$lag \leq \frac{E(X_o)}{\sigma_o}\,\sqrt{1+k},$$

where division is done componentwise.

The lower bound for $lag$ is $|\min(Z)|$. For nonnegative variables with zeros, $|\min(Z)| = E(X_o)/\sigma_o$. So we can write the lower and upper bounds for $lag$ as:

$$\frac{E(X_o)}{\sigma_o} \leq lag \leq \frac{E(X_o)}{\sigma_o}\,\sqrt{1+k}. \quad (19)$$

Let us consider a few choices of $lag$ in this range. If we choose $lag = E(X_o)/\sigma_o$, then the scheme with the z-score transformation (15) is equivalent to the scheme without the transformation (7). In fact, it is straightforward to verify that the masked data in this case can be written as:

$$X_m = \frac{(\sqrt{1+k}-1)\,E[X_o] + X_o \circ \exp(E_{zpos})}{\sqrt{1+k}} \quad (20)$$

Expression (20) is almost identical to (7), except for the second term in the numerator: $X_o \circ \exp(E_{zpos})$.

Below we will show that even this term is identical in both schemes. In particular, after application of our masking scheme to the positive z-scores, the noise $E_{zpos}$ has the mean and covariance matrix defined by (14) and (13), respectively.

Note that

$$\frac{\Sigma_{zpos}(i,j)}{E[Z_{pos}(i)Z_{pos}(j)]} = \frac{\mathrm{cor}(X_o(i),X_o(j))}{E\left[\left(\frac{X_o(i)-E(X_o(i))}{\sigma_o(i)} + lag(i)\right)\left(\frac{X_o(j)-E(X_o(j))}{\sigma_o(j)} + lag(j)\right)\right]}$$

$$= \frac{\mathrm{cor}(X_o(i),X_o(j))}{E\left[\frac{X_o(i)}{\sigma_o(i)}\cdot\frac{X_o(j)}{\sigma_o(j)}\right]} = \frac{\Sigma_o(i,j)}{E[X_o(i)X_o(j)]}.$$

So, when $lag = E(X_o)/\sigma_o$, the transformation to positive z-scores does not make any changes to the original scheme (7).

Now let us consider the other extreme for $lag$: $lag = \sqrt{1+k}\,E(X_o)/\sigma_o$. It is easy to verify that the masked data in this case can be written as:

$$X_m = \frac{(\sqrt{1+k}-1)\,E[X_o]\circ\exp(E_{zpos}) + X_o\circ\exp(E_{zpos})}{\sqrt{1+k}} \quad (21)$$

The covariance matrix of the noise for this scheme is:

$$\Sigma_{E_{zpos}}(i,j) = \log\left(1 + \frac{k\,\Sigma_{zpos}(i,j)}{E[Z_{pos}(i)Z_{pos}(j)]}\right) \quad (22)$$

To prove that the expression under the logarithm in (22) is always positive, let us express it in terms of the original data:

$$Z_{pos}(i) = \frac{X_o(i) + E(X_o(i))\,(\sqrt{1+k}-1)}{\sigma_o(i)}.$$

It is easy to see that

$$E[Z_{pos}(i)Z_{pos}(j)] = \frac{E[X_o(i)X_o(j)] + k\,E(X_o(i))\,E(X_o(j))}{\sigma_o(i)\,\sigma_o(j)},$$

so

$$\Sigma_{E_{zpos}}(i,j) = \log\left(1 + \frac{k\,\sigma_o(i)\sigma_o(j)\,\mathrm{cor}(X_o(i),X_o(j))}{E[X_o(i)X_o(j)] + k\,E(X_o(i))\,E(X_o(j))}\right)$$

$$= \log\left(\frac{(1+k)\,E[X_o(i)X_o(j)]}{E[X_o(i)X_o(j)] + k\,E(X_o(i))\,E(X_o(j))}\right). \quad (23)$$

The expression under the logarithm in (23) is always positive for nonnegative $X_o$, so we can always compute $\Sigma_{E_{zpos}}$. In the same way, it is possible to show that no other value of $lag$ (in the range of its possible values) can guarantee that the left-hand side of (10) remains positive for all possible data sets.
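Continuing the earlier numerical example (the axis-aligned `x`, `y` with `k = 0.5`), a quick check that the rewritten log argument in (23) is positive even where the direct form (10) failed:

```python
# Reuses x, y, k from the earlier snippet. The log argument, rewritten as in
# (23), depends only on second moments and means of the nonnegative data.
X = np.column_stack([x, y])
second = (X.T @ X) / len(X)                       # E[Xo(i) Xo(j)]
outer = np.outer(X.mean(axis=0), X.mean(axis=0))  # E(Xo(i)) E(Xo(j))
log_arg = (1 + k) * second / (second + k * outer)
print((log_arg > 0).all())                        # True: (22) is computable
```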

When the data set contains variables that can take positive and negative values together with nonnegative variables, the scheme with the z-score transformation will work too. First the data should be made nonnegative by adding $|\min(X_o)|$, and then scheme (21) is applied to these data. Last, to return the data to the original location, we have to subtract $|\min(X_o)|$ from the result of the previous step.
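A minimal end-to-end sketch of this shift-mask-unshift recipe, reusing the illustrative `mask_zscores` above; shifting only the columns that actually contain negative values is our reading of "adding $|\min(X_o)|$":

```python
import numpy as np

def mask_general(Xo, k, rng=None):
    """Shift mixed-sign data to nonnegative, mask with the safe lag, shift back."""
    shift = np.clip(-Xo.min(axis=0), 0, None)  # |min(Xo)| for columns with negatives
    Xs = Xo + shift                            # nonnegative data
    mu, sigma = Xs.mean(axis=0), Xs.std(axis=0)
    lag = np.sqrt(1 + k) * mu / sigma          # the choice of lag behind scheme (21)
    return mask_zscores(Xs, k, lag, rng) - shift
```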
