3 Generating Random Noise for Frequency Tables

The Australian Bureau of Statistics ([4],[7]) has developed a concept for a cell perturbation method. They propose that the random noise should have zero mean and a fixed variance. An alternative cell perturbation method, referred to as the “Invariant Post-tabular SDL” method, was suggested in [8]. In the following two subsections we briefly outline the two alternative concepts and discuss the technical construction of suitable probability transition matrices for a random perturbation that eliminates all small frequency counts.

3.1 How to Create Zero-Mean / Fixed Variance Cell Perturbations?

[4] propose to generate, for each cell c with non-zero cell count $i_c$, an independent integer-valued perturbation $d_c$ satisfying the following two criteria:

(a) mean of zero

(b) fixed variance V for all cells c and all frequency counts i

A third criterion, in order to meet the requirement that perturbed cells do not have a count of one or two, would be that the perturbed count $i_c + d_c$ never equals 1 or 2.

This means we look for an L x L transition matrix P ¹ containing conditional probabilities $p_{ij}$ = p(perturbed cell value is j | original cell value is i) with the following properties:

1 As index j may take a value of zero (when a cell value is changed to zero), in the following we start counting matrix and vector indices at 0, enumerating rows and columns of the L x L matrix by 0,1,2,…,L-1.

Eliminating Small Cells from Census Counts Tables 55

(1) $p_i \nu_i = 0$

(2) $p_i (\nu_i)^2 = V$

(3) $p_{ij} = 0$ for $j \in \{1,2\}$

(4) $\sum_j p_{ij} = 1$

(5) $p_{ij} = 0$ if $j < i - D$ or $j > i + D$,

(6) $p_{00} = 1$ and $p_{0j} = 0$ for $j > 0$, and of course

(7) $0 \le p_{ij} \le 1$

where $p_i$ denotes the i-th row vector of matrix P and $\nu_i$ a column vector of the noise that is added if an original value of i is turned into a value of j, i.e. the j-th entry of $\nu_i$ is (j − i); $(\nu_i)^2$ is understood entry-wise. For example, $\nu_1 = (-1,0,1,2,3,\dots,L-2)$. (1) is equivalent to (a) and expresses the requirement that the expected value of the noise should be zero. Similarly, (2) is equivalent to (b), expressing the requirement of a constant variance. (3) excludes perturbed counts of one or two, (4) and (7) are of course necessary for any transition matrix, (5) states a maximum allowed absolute perturbation of some pre-defined constant D, and (6) states that zero frequencies must not change. Note that for all rows after row D + 2, condition (3) is always satisfied when (5) holds. Hence we can facilitate the task of computing suitable transition probabilities by adding a symmetry requirement for all rows after row D + 2:

(8) $p_{i,i-k} = p_{i,i+k}$ for $k = 1,\dots,D$, if $i > D + 2$

With (8), condition (1) is always satisfied because the negative and positive deviations balance each other. (2) simplifies into

(2a) $2 \sum_{j=1,\dots,D} p_{i,i+j}\, j^2 = V$

For simplicity, in the following we therefore assume L – 1 = D + 3, applying the perturbation probabilities given by row (D + 3) of matrix P to all cell counts ≥ D + 3.
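The effect of the symmetry requirement (8) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the helper names `symmetric_row` and `row_moments` and the tail probabilities are hypothetical, chosen only so that (2a) gives V = 2 for D = 3.

```python
from fractions import Fraction

def symmetric_row(i, tail):
    """Row i of P with p[i-k] = p[i+k] = tail[k-1] for k = 1..D (condition (8))."""
    row = {i: 1 - 2 * sum(tail)}          # remaining mass stays on the diagonal
    for k, q in enumerate(tail, start=1):
        row[i - k] = q
        row[i + k] = q
    return row

def row_moments(i, row):
    """Return (total probability, mean noise, second moment of the noise)."""
    total = sum(row.values())
    mean = sum(p * (j - i) for j, p in row.items())
    second = sum(p * (j - i) ** 2 for j, p in row.items())
    return total, mean, second

D, i = 3, 7                                # i > D + 2, so (8) applies
# Hypothetical tail probabilities p_{i,i+1}, p_{i,i+2}, p_{i,i+3}
tail = [Fraction(1, 5), Fraction(1, 10), Fraction(2, 45)]
row = symmetric_row(i, tail)
total, mean, second = row_moments(i, row)

# Left-hand side of (2a): twice the weighted sum of squared deviations
variance_2a = 2 * sum(q * k ** 2 for k, q in enumerate(tail, start=1))
```

With these values the row sums to one, the mean noise is zero, and the second moment coincides with the left-hand side of (2a). The smallest perturbed count in the row is i − D = 4, so (3) holds without further restrictions.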

For every row (or cell count) i (i = 1,…,D + 2), conditions (1) to (5) can be rewritten as a system of three linear equations

(9) $A_{iD}\, x = b$, where

$A_{iD}$ is a 3 x (min(i,D) + 1 + D − k) coefficient matrix ² and b = (1,0,V)’. The elements of x correspond to the entries of row i in P that are not already zero by definition (because of (3) or (5)). The first row of $A_{iD}$ corresponds to condition (4), the second row to (1) (i.e. unbiasedness) and the third row to (2) (fixed variance V).
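The construction of the system (9) can be sketched in a few lines. This is an illustration under stated assumptions: the function names `build_system` and `expected_columns` are hypothetical, and the admissible perturbed counts are not truncated at L − 1, in line with the column count min(i,D) + 1 + D − k given in the text.

```python
def build_system(i, D, V):
    """Return (targets, A, b) for row i of P under conditions (1) to (5)."""
    # Perturbed counts j allowed by (5), without the forbidden counts 1 and 2 of (3)
    targets = [j for j in range(max(0, i - D), i + D + 1) if j not in (1, 2)]
    A = [[1] * len(targets),                  # (4): probabilities sum to one
         [j - i for j in targets],            # (1): zero-mean noise
         [(j - i) ** 2 for j in targets]]     # (2): fixed variance V
    b = [1, 0, V]
    return targets, A, b

def expected_columns(i, D):
    """Column count min(i, D) + 1 + D - k, with k as defined in the footnote."""
    k = len({1, 2} & set(range(i - D, i + D + 1)))
    return min(i, D) + 1 + D - k

targets, A, b = build_system(1, 3, 2)
```

For i = 1 and D = 3 the admissible perturbed counts are 0, 3 and 4, so the system has exactly three unknowns; for larger i the matrix has more columns than rows.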

Consider for example

$$A_{13} = \begin{Bmatrix} 1 & 1 & 1 \\ -1 & 2 & 3 \\ 1 & 4 & 9 \end{Bmatrix},$$

whose columns correspond to the admissible perturbed counts j = 0, 3, 4 of row i = 1, so that x = ($p_{10}$, $p_{13}$, $p_{14}$)’. In this simple case, the coefficient matrix is invertible. The last row of the inverse is (−1/2, −1/4, 1/4). Hence, for $p_{14}$ to be non-negative, (−1/2, −1/4, 1/4) · b = (−1/2 + V/4) must be non-negative, and hence V must be at least 2. In this case (9) has a unique solution, depending on the choice of V only. If V is exactly 2, $p_{14}$ is zero.
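The unique solution for row i = 1 with D = 3 can be reproduced with exact rational arithmetic. The sketch below assumes that the unknowns are ordered by perturbed count j = 0, 3, 4; the helper names `solve` and `row_one` are hypothetical.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick a row with a non-zero entry in this column and move it up
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

A13 = [[1, 1, 1],      # (4): probabilities sum to one
       [-1, 2, 3],     # (1): noise j - i for j = 0, 3, 4
       [1, 4, 9]]      # (2): squared noise

def row_one(V):
    """Transition probabilities (p10, p13, p14) for variance V."""
    return solve(A13, [1, 0, V])

at_two = row_one(Fraction(2))
at_two_and_a_half = row_one(Fraction(5, 2))
```

At V = 2 the probability of moving from 1 to 4 vanishes and the row reduces to (2/3, 1/3, 0); at V = 5/2 all three probabilities are strictly positive.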

In general, $A_{iD}$ has more columns than rows, so there is usually no unique solution for (9). But we can use (9) to derive feasibility intervals for x (i.e. for the $p_{ij}$).

2 k is the number of elements in {1,2} ∩ [i-D ; i+D].

56 S. Giessing and J. Höhne

A practical approach is to fix V to 2 + ε with a small positive value for ε (increasing ε, and hence the variance of the perturbation, leads to an unnecessary loss of information). The system (9) can be further strengthened by additional constraints, for example to express desirable monotonicity properties like $p_{ij} \ge p_{i,j+1}$ for j > i, or to improve symmetry by bounding the difference between $p_{i,i-1}$ and $p_{i,i+1}$.

We have experimented with D = 3, 4 and 5. One of the findings was that for small D and i, the linear programming problem derived from (9) (possibly together with the additional constraints) gives quite small intervals for x. In those cases it is usually reasonable to minimize or maximize one of the variables. In some cases, for example, we have maximized the variable corresponding to the rightmost tail of the distribution (i.e. $p_{i,i+D}$) when it was very small anyway, or we have maximized a variable at the centre of the distribution, which more or less fixes the remaining variables.
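How narrow such an interval can be is easy to see for a row with four unknowns, where (9) leaves a single degree of freedom. The sketch below does not use linear programming; as a swapped-in simplification it parametrises the solution by the rightmost tail probability t = p_{i,i+D} and intersects the bounds 0 ≤ x ≤ 1 exactly. The function names and the choice V = 2.1 are assumptions made for illustration.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square system A x = b exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        scale = M[col][col]
        M[col] = [v / scale for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def interval_for_last(i, D, V):
    """Feasible range of t = p_{i,i+D} when row i has exactly four unknowns."""
    targets = [j for j in range(max(0, i - D), i + D + 1) if j not in (1, 2)]
    assert len(targets) == 4, "sketch only covers one degree of freedom"
    cols = [[1, j - i, (j - i) ** 2] for j in targets]       # columns of A_iD
    A = [[cols[c][r] for c in range(3)] for r in range(3)]   # first three columns
    last = cols[3]

    def others(t):
        # Move the last unknown to the right-hand side and solve for the rest
        return solve(A, [rhs - t * a for rhs, a in zip([1, 0, V], last)])

    x_at_0, x_at_1 = others(Fraction(0)), others(Fraction(1))
    lo, hi = Fraction(0), Fraction(1)
    for a, c in zip(x_at_0, x_at_1):
        slope = c - a                       # each unknown is a + slope * t
        if slope == 0:
            if not 0 <= a <= 1:
                return None
            continue
        t_at_0, t_at_1 = -a / slope, (1 - a) / slope
        lo = max(lo, min(t_at_0, t_at_1))
        hi = min(hi, max(t_at_0, t_at_1))
    return (lo, hi) if lo <= hi else None

bounds = interval_for_last(2, 3, Fraction(21, 10))
```

For i = 2, D = 3 and V = 2.1 the feasible range of p_{25} is [0, 1/100], which illustrates why fixing one variable at a bound is a reasonable way to pin down the whole row in such cases.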

For larger D and i the intervals for x are wider. In those cases we first fixed a value (like 70%) for the centre of the distribution, p_ii. Afterwards we fitted each tail of the distribution pij, j > i and pij, j < i to the tails of a normal distribution using a heuristic approach outlined in the appendix. Table 1 in the appendix shows the final probability matrices for D=3,4 and 5 and compares them to the transition probabilities observed empirically for the cells of the set of controlled tables after protecting the data by SAFE. Obviously, the SAFE method results in much smaller probabilities that cell values change by less than three.

3.2 How to Combine Invariance and a “No-Small-Cells” Requirement?

The idea of the “Invariant Post-tabular SDL” method ([8]) is to preserve the frequency distribution of the cell counts. But in our setting we require the frequency of perturbed small counts (ones and twos) to be zero. So for the small counts these are aims that clearly exclude each other. The way out we propose here is to relax the goal of invariance. We only seek to preserve the frequency distribution of cell counts above three and the total frequency of all cell counts below four. This can be achieved as follows:

As shown in [8], an invariant matrix R is obtained by multiplying some pre-defined initial transition matrix P (for an example see [8]) with a suitable matrix Q. Q is obtained by transposing matrix P, multiplying each column j by the relative frequency of count j and then normalizing its rows so that the sum of each row equals one. Finally, the diagonal elements of R are increased by the transformation R* = αR + (1 − α)I, where I is the identity matrix of the appropriate size.
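The construction of R* described above can be sketched as follows. The 3 x 3 matrix P, the frequency vector v and the value of α are hypothetical toy inputs, and the function names are not taken from [8]; the point is only to show that the resulting matrix leaves v unchanged.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def invariant_matrix(P, v, alpha):
    """Return R* = alpha * P Q + (1 - alpha) * I for relative frequencies v."""
    n = len(P)
    # Transpose P and scale column i of the transpose by the frequency v[i]
    Q = [[P[i][j] * v[i] for i in range(n)] for j in range(n)]
    # Normalise every row of Q so that it sums to one
    Q = [[q / sum(row) for q in row] for row in Q]
    R = matmul(P, Q)
    return [[alpha * R[i][j] + (1 - alpha) * (1 if i == j else 0)
             for j in range(n)] for i in range(n)]

# Hypothetical initial transition matrix and relative frequencies
P = [[F(8, 10), F(2, 10), 0],
     [F(1, 10), F(8, 10), F(1, 10)],
     [0, F(2, 10), F(8, 10)]]
v = [F(1, 2), F(3, 10), F(1, 5)]

R_star = invariant_matrix(P, v, F(1, 2))
perturbed = [sum(v[i] * R_star[i][j] for i in range(3)) for j in range(3)]
```

Here `perturbed` equals `v` exactly, i.e. the expected frequency distribution after perturbation is the original one, and every row of `R_star` still sums to one. The sketch assumes that every count can be reached under P; otherwise the row normalisation of Q would divide by zero.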

We adapt this procedure using a two-stage approach. In the first stage, we compute an invariant matrix R* such that the first row gives the joint transition probabilities of all counts under four, and the first column gives the probabilities for changing a given count into a count smaller than four. The procedure to obtain R* is the same as in [8], except that we use a vector of relative frequencies, where the entries corresponding to the ones, twos and threes are added up to one joint entry v1-3. We also replace the first row of the initial transition matrix by a column vector where all entries except for the first two are zero. The first two entries can be computed as follows: Define an initial transition matrix P0 for the small counts that leads to an unbiased perturbation like


$$P_0 = \begin{Bmatrix} 1 & 0 & 0 & 0 & 0 \\ 2/3 & 0 & 0 & 1/3 & 0 \\ 1/3 & 0 & 0 & 2/3 & 0 \\ 0.1 & 0 & 0 & 0.6 & 0.3 \end{Bmatrix}.$$

Sum up the four rows of this matrix and divide by the total of the matrix entries. Use the three non-zero entries of this row vector as the first three entries of the initial transition matrix. Now, in order to arrive at a matrix $P_{1-3}$ of separate transition probabilities for the small counts remaining small counts, we add a second stage to the procedure: after stage one, multiplying the first row of R* by the total frequency of counts under four gives expected numbers $n_{u4,u4}$ and $n_{u4,3+j}$ of cells under four that remain cells under four, and of cells that change into counts (3 + j) for some j > 0. Assume without loss of generality a non-zero frequency of threes in the original table. Set the expected number of threes that change into counts (3 + j), $n_{3,3+j}$, to $n_{u4,3+j}$ and, correspondingly, $n_{i,3+j}$ to zero for i = 1, 2 and j > 0. Use the probability $p_{33}$ of threes remaining unchanged in the initial transition matrix $P_0$ to compute an expected number $n_{30}$ of threes that change into zero, by subtracting the expected number of threes that do not change or turn into 3 + j from the number of threes in the table:

$$n_{30} = V_3 - \sum_{j \ge 0} n_{3,3+j}.$$

Then compute transition probabilities $p_{1j}$ and $p_{2j}$ (j = 0, 3) so that the expected bias of the counts under four that remain counts under four is zero, i.e. $-V_1 p_{10} + 2V_1 p_{13} - 2V_2 p_{20} + V_2 p_{23} - 3V_3 p_{30} = 0$, where $V_3 p_{30} = n_{30}$. It is straightforward to show that adding

$$\begin{Bmatrix} -\beta & 0 & 0 & \beta \\ -\beta & 0 & 0 & \beta \end{Bmatrix}$$

(entries for j = 0, 1, 2, 3) to rows 2 and 3 of the initial transition matrix $P_0$ solves this problem when β = $n_{30}$ / ($V_1$ + $V_2$). Obviously, this definition of $p_{ij}$ (or $n_{ij}$, resp.) for i = 1, 2, 3 and j = 0, 3 is also consistent with R*: summing the entries of ($V_1$, $V_2$, $V_3$) · $P_{1-3}$ should equal the first entry of the first row of R* multiplied by the total number of counts under four, $V_{1-3}$. Because of the properties of the invariant matrix R*, the latter equals the total number of counts under four minus the number of counts under four that change into counts 3 + j, j > 0, i.e.

$$V_1 + V_2 + V_3 - \sum_{j > 0} n_{3,3+j},$$

which equals $V_1 + V_2 + n_{30} + n_{33}$ because of the definition of $n_{30}$. On the other hand, because the row totals of the first two rows of $P_{1-3}$ are one, summing the entries of ($V_1$, $V_2$, $V_3$) · $P_{1-3}$ yields $V_1 + V_2 + n_{30} + n_{33}$ as well. Hence, if we replace the first line of R* by the separate transition probabilities for counts under four as explained here (and attach three columns of zeros to the other lines), we get a transition matrix R**, which is almost invariant, except that for counts under four only their total frequency is preserved. An example using real data from a table of the last West German census of 1987 is given below.

Example 1

For a census table with frequencies ($V_1$, $V_2$, $V_3$, $V_4$, $V_5$, …) = (96, 32, 20, 16, 15, …) observed for counts (1, 2, 3, 4, 5, …), we computed an initial invariant matrix R* (with D = 2). Table 1 shows the first four rows and six columns of the matrix of expected frequencies obtained from ($V_{1-3}$, $V_4$, $V_5$, $V_6$, $V_7$, …) · R*.


Table 1. Expected frequencies $n_{i,j}$ of counts of i perturbed into counts of j

| i \ j | 0-3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|
| 0-3 | 144.62403 | 2.9690406 | 0.4069246 | 0 | 0 | 0 |
| 4 | 2.9690406 | 11.81552 | 1.0893785 | 0.1260606 | 0 | 0 |
| 5 | 0.4069246 | 1.0893785 | 12.211631 | 1.196397 | 0.0956693 | 0 |
| 6 | 0 | 0.1260606 | 1.196397 | 10.676843 | 0.9239783 | 0.0767213 |

We now set $n_{34}$ and $n_{35}$ to 2.9690406 and 0.4069246 and compute $n_{33}$ = 0.6 · 20 = 12. Then

$$n_{30} = V_3 - \sum_{j \ge 0} n_{3,3+j}$$

yields 20 − 0.6 · 20 − 2.9690406 − 0.4069246 = 4.6240348 (hence $p_{30}$ = 0.2312017375), resulting in β = $n_{30}$ / ($V_1$ + $V_2$) = 4.6240348 / 128 = 0.03612527.

According to the specifications above, $(p_{ij})_{i=1,2;\,j=0,3}$ are then

$$\begin{Bmatrix} 2/3 - \beta & 1/3 + \beta \\ 1/3 - \beta & 2/3 + \beta \end{Bmatrix} = \begin{Bmatrix} 0.63054139 & 0.36945861 \\ 0.29720806 & 0.70279194 \end{Bmatrix}$$

Table 3 (appendix) shows the first six rows and six columns of the matrix of expected frequencies computed as ($V_1$, $V_2$, $V_3$, $V_4$, $V_5$, …) · R**. The sum of the first two column totals in Table 3 (regarding j = 0, 3) is 148, i.e. the total observed frequency of the counts under 4 (= 96 + 32 + 20) is exactly preserved.
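The arithmetic of Example 1 can be retraced with a short script. It is a sketch under one stated assumption: the sign convention is chosen so that the zero-bias condition of Section 3.2 holds, i.e. β is subtracted from the probabilities of moving to 0 and added to those of moving to 3. All input numbers are taken from the example; the variable names are hypothetical.

```python
# Frequencies of the counts 1, 2, 3 and the diagonal entry p33 of P0
V1, V2, V3 = 96, 32, 20
p33 = 0.6

n33 = p33 * V3                          # threes that stay threes
n3_up = [2.9690406, 0.4069246]          # n34 and n35, first row of Table 1
n30 = V3 - n33 - sum(n3_up)             # threes that are turned into zeros
p30 = n30 / V3
beta = n30 / (V1 + V2)

# Adjusted probabilities for the counts 1 and 2 (assumed sign convention)
p10, p13 = 2 / 3 - beta, 1 / 3 + beta
p20, p23 = 1 / 3 - beta, 2 / 3 + beta

# Expected bias of counts under four that remain counts under four
bias = -V1 * p10 + 2 * V1 * p13 - 2 * V2 * p20 + V2 * p23 - 3 * n30

# Expected number of counts under four that stay under four
stay_small = V1 * (p10 + p13) + V2 * (p20 + p23) + n30 + n33
# Adding the counts 4 and 5 that become small restores the original total
total_small = stay_small + sum(n3_up)
```

The bias evaluates to zero up to rounding, `stay_small` reproduces the entry 144.62403 of Table 1, and `total_small` equals 148. The last step uses the symmetry of Table 1, where the expected numbers of fours and fives turned into counts under four equal $n_{34}$ and $n_{35}$.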

In: Lecture Notes in Computer Science (pages 64-68)