CHAPTER VI - Computational analysis of the random components induced by a binary equivalence r

Computational Tools

In the preceding chapters, many numerical results were obtained either by recursive calculations or Monte Carlo sampling.

In view of the importance of these results, caution had to be exercised throughout the course of these computations. As a consequence, several methods were developed in order to ensure efficiency and accuracy. This chapter describes three particular areas of investigation. First, common sampling methods such as sampling with and without replacement are compared and are shown to be equivalent, by means of suitable transformations, for the purpose of class counting. Then random storage assignment techniques used for the storage and retrieval of large families of graphs are examined and significantly improved. Finally, the random number generators used in the previous experimentation are described, their properties compared, and a method for producing reliable pseudo-random sequences is presented.

6.1. Relationship Between Various Sampling Methods

Most of the graph problems that we are concerned with require generating subgraphs, the vertices of which are selected with equal probability from the vertex set of some complete graph. Selecting such a sample is equivalent to generating a random arrangement of $k$ integers from the set $\{1, 2, \ldots, N\}$. Details on how to perform this operation efficiently ($O(k)$ operations) can be found in Reference [11].

However, when the number of vertices of the source graph increases, the requirement that the $k$ selected vertices be distinct gradually loses its importance as the probability of finding a match decreases. This is very fortunate indeed, since the labor needed to impose that constraint increases like $k$ itself. Depending upon whether all elements in the sample are distinct or not, we get the classical sampling without or with replacement, respectively.

In the case of large samples, the approach we adopt is to sample with replacement (which is obviously the easiest method), then perform a transformation to recover, if need be, the results that would have been obtained had we computed every time the precise number of distinct elements in the sample or selected samples of distinct elements.
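The two sampling schemes can be sketched as follows. This is a minimal illustration and not the procedure of the cited reference: the function names are hypothetical, and the distinct sample is drawn with a partial Fisher-Yates shuffle, which is one common way (assumed here) of reaching a cost proportional to $k$.

```python
import random


def sample_with_replacement(N, k, rng=random):
    """Draw k integers uniformly from {1, ..., N}; duplicates are allowed."""
    return [rng.randint(1, N) for _ in range(k)]


def sample_without_replacement(N, k, rng=random):
    """Draw k distinct integers from {1, ..., N} in O(k) time and space.

    A partial Fisher-Yates shuffle is run over a virtual array 1..N;
    only the positions that have been overwritten are kept in a dictionary.
    """
    if not 0 <= k <= N:
        raise ValueError("need 0 <= k <= N")
    moved = {}                      # position -> value currently stored there
    out = []
    for i in range(1, k + 1):
        j = rng.randint(i, N)       # pick among the positions not yet fixed
        value_i = moved.get(i, i)
        value_j = moved.get(j, j)
        out.append(value_j)         # value_j becomes the i-th selected element
        moved[j] = value_i          # the displaced value stays available at j
    return out


def prob_all_distinct(N, k):
    """Probability that k draws with replacement from N items show no match."""
    p = 1.0
    for i in range(k):
        p *= (N - i) / N
    return p


if __name__ == "__main__":
    rng = random.Random(2024)
    print(sample_with_replacement(10, 5, rng))
    print(sample_without_replacement(10, 5, rng))
    # The chance of a duplicate shrinks as N grows with k fixed.
    for N in (100, 10_000, 1_000_000):
        print(N, 1.0 - prob_all_distinct(N, 10))
```

The last loop illustrates the remark above: for a fixed sample size, the probability of finding a match falls off as the source population grows.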

Let the population $\mathcal{P}_N$ contain $N$ distinct elements. Our experiment is the uniform selection of samples of size $k$, designated by

$\bar{S}_k$ when sampling with replacement,
$S_k$ when sampling without replacement.

For each $\bar{S}_k$, let $s_{k'}$ designate the subset of $\bar{S}_k$ having $k'$ distinct elements, $1 \le k' \le k$, such that $k'$ is maximum. Finally, let $f$ be a function which is defined for every sample of $\mathcal{P}_N$.

Theorem

For all functions $f$ defined on samples of $\mathcal{P}_N$ for which $f(\bar{S}_k) = f(s_{k'})$, $0 \le k \le N$, then

$\left[{k \atop \nu}\right]$ being the Stirling number of the first kind.

Proof: given $k$-tuples $[x_{i_1}, x_{i_2}, \ldots, x_{i_k}]$ such that $x_{i_j} \in \mathcal{P}_N$, $j = 1, 2, \ldots, k$, the probability of finding $\nu$ distinct elements among the $k$ selected is

$$\frac{N!}{(N-\nu)!\,N^{k}} \left\{{k \atop \nu}\right\} \qquad (6.2)$$

since this is a coupon collector's problem with $N$ equally probable distinct coupons. In this formula $\left\{{k \atop \nu}\right\}$ designates Stirling numbers of the second kind, i.e. the number of ways of partitioning a set of $k$ elements into $\nu$ non-empty subsets.
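As a quick check of the coupon collector probability, the sketch below computes Stirling numbers of the second kind from their standard recurrence and evaluates the distribution of the number of distinct elements exactly. The function names are illustrative only.

```python
from fractions import Fraction
from math import factorial


def stirling2(k, v):
    """Stirling number of the second kind: partitions of k items into v blocks."""
    table = [[0] * (v + 1) for _ in range(k + 1)]
    table[0][0] = 1
    for i in range(1, k + 1):
        for j in range(1, min(i, v) + 1):
            # item i either joins one of j existing blocks or opens a new one
            table[i][j] = j * table[i - 1][j] + table[i - 1][j - 1]
    return table[k][v]


def prob_distinct(N, k, v):
    """P(exactly v distinct values among k uniform draws with replacement from N)."""
    if v > N:
        return Fraction(0)
    falling = factorial(N) // factorial(N - v)   # N (N-1) ... (N-v+1)
    return Fraction(falling * stirling2(k, v), N ** k)


if __name__ == "__main__":
    N, k = 10, 4
    dist = [prob_distinct(N, k, v) for v in range(1, k + 1)]
    print(dist)
    print(sum(dist))   # the probabilities over v = 1..k add up to one
```

Exact rational arithmetic is used so that the normalization can be verified without rounding error.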

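The expected value of the function $f$ evaluated over fixed size samples depends only upon the size of the sample; for convenience let us write $g(k) = E[f(S_k)]$ for sampling without replacement and $G(k) = E[f(\bar{S}_k)]$ for sampling with replacement, which are related, using (6.2), in the following way

$$G(k) = \sum_{\nu=1}^{k} \frac{N!}{(N-\nu)!\,N^{k}} \left\{{k \atop \nu}\right\} g(\nu) \qquad (6.3)$$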

If we now look at $g(k)$ as the $k$th component of a vector, and similarly for $G(k)$, we obtain the system

where the $k \times k$ lower triangular matrix has the form

Since this matrix is lower triangular, it is not difficult to write down its inverse by inspection, namely

We simply have to verify that indeed, the product produces the identity matrix
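The relations involved in this argument can be sketched as follows. This is a reconstruction under stated assumptions rather than the original display: the matrix name $A$ and the falling factorial $(N)_\nu = N!/(N-\nu)!$ are notation introduced here, and $\left[{k \atop \nu}\right]$ is taken to be the unsigned Stirling number of the first kind, so that the alternating sign appears explicitly.

```latex
% Assumed notation: (N)_\nu = N!/(N-\nu)!, A is a hypothetical name for the matrix.
\begin{align*}
G(k) &= \sum_{\nu=1}^{k} A_{k\nu}\, g(\nu), &
A_{k\nu} &= \frac{(N)_\nu}{N^{k}} \left\{ {k \atop \nu} \right\}, \\
g(k) &= \sum_{\nu=1}^{k} \left(A^{-1}\right)_{k\nu} G(\nu), &
\left(A^{-1}\right)_{k\nu} &= (-1)^{k-\nu}\, \frac{N^{\nu}}{(N)_k} \left[ {k \atop \nu} \right], \\
\sum_{\nu=j}^{k} \left(A^{-1}\right)_{k\nu} A_{\nu j}
  &= \frac{(N)_j}{(N)_k} \sum_{\nu=j}^{k} (-1)^{k-\nu}
     \left[ {k \atop \nu} \right] \left\{ {\nu \atop j} \right\}
   = \delta_{kj}.
\end{align*}
```

The last line uses the standard orthogonality of the two kinds of Stirling numbers; the factors $N^{\nu}$ cancel term by term, which is why the inverse can be written down by inspection. The second line is the inversion that recovers the without-replacement expectation $g$ from the with-replacement expectation $G$.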

It is important to stress that, both transformation matrices being lower triangular, the computation of $g_k$ requires only values of $G_\nu$ up to $\nu = k$. It should be clear by now that this approach will prove advantageous for all sampling problems where the function $f$ is insensitive to the presence of duplicate elements in the sample.
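The transformation can be checked numerically on a small population. The sketch below is an illustration under assumptions: the matrix entries are our reconstruction from the coupon collector probability and the Stirling numbers (signed first kind, equivalent to the alternating unsigned form), the helper names are hypothetical, and the expectations are obtained by brute-force enumeration rather than by sampling.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb, factorial


def stirling_tables(K):
    """Return (s1, S2): signed first-kind and second-kind Stirling numbers up to K."""
    s1 = [[0] * (K + 1) for _ in range(K + 1)]
    S2 = [[0] * (K + 1) for _ in range(K + 1)]
    s1[0][0] = S2[0][0] = 1
    for i in range(1, K + 1):
        for j in range(1, i + 1):
            s1[i][j] = s1[i - 1][j - 1] - (i - 1) * s1[i - 1][j]
            S2[i][j] = S2[i - 1][j - 1] + j * S2[i - 1][j]
    return s1, S2


def falling(N, v):
    return factorial(N) // factorial(N - v)


def forward_matrix(N, K):
    """Entries (N)_v S2(k, v) / N^k mapping g to G (rows k, columns v, 1-based)."""
    _, S2 = stirling_tables(K)
    return [[Fraction(falling(N, v) * S2[k][v], N ** k)
             for v in range(1, K + 1)] for k in range(1, K + 1)]


def inverse_matrix(N, K):
    """Entries s1(k, v) N^v / (N)_k mapping G back to g; requires K <= N."""
    s1, _ = stirling_tables(K)
    return [[Fraction(s1[k][v] * N ** v, falling(N, k))
             for v in range(1, K + 1)] for k in range(1, K + 1)]


def matmul(X, Y):
    return [[sum(X[i][m] * Y[m][j] for m in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def expectations(N, K, f):
    """Exact G(k) (with replacement) and g(k) (distinct samples) for k = 1..K."""
    pop = range(1, N + 1)
    G, g = [], []
    for k in range(1, K + 1):
        G.append(Fraction(sum(f(t) for t in product(pop, repeat=k)), N ** k))
        g.append(Fraction(sum(f(c) for c in combinations(pop, k)), comb(N, k)))
    return G, g


if __name__ == "__main__":
    N, K = 5, 4
    B = inverse_matrix(N, K)
    G, g = expectations(N, K, max)      # max ignores duplicate elements
    recovered = [sum(B[k][v] * G[v] for v in range(K)) for k in range(K)]
    print(recovered == g)
```

Because both matrices are lower triangular, `recovered[k]` only involves `G[0]` through `G[k]`, in line with the remark above.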

For large $N$ and sample sizes satisfying $k \ll N$ we might even operate with a straightforward sampling with replacement and never apply the transformation. We derive an estimate for the error.

Lemma

Proof: using expression (6.3) for $G(k)$ we write

the coefficient of $G(k)$ now becomes

which yields

6.2. Random Storage Assignment

In section (4.7), a large number of encoded graphs, together with their associated information, were stored and retrieved using a random storage method which will now be described in some detail, since it is an improvement over existing "hash" algorithms. Because we do restrict our comparison to linear and random probing, the reader is referred to [11] and [22] for other methods of search.

Hash algorithms are primarily distinguished by the way in which they handle collisions. Elements from a set $S = \{s_1, s_2, \ldots\}$ can be mapped into a table $T$ with $n$ available positions $\{t_0, t_1, \ldots, t_{n-1}\}$, each $t_i$ being able to accommodate a single element of $S$. Since the mapping function $g$ is in general many to one, several elements drawn from $S$ may be initially assigned the same position. If $t_j$ for $j = g(s_i)$ is already occupied, we have a collision and some vacant position elsewhere in the table must be selected. Thus, any particular hash algorithm provides a way of computing from $s_i$ a sequence $t_{j_1}, t_{j_2}, \ldots, t_{j_\nu}$ satisfying

$t_{j_1}, t_{j_2}, \ldots, t_{j_{\nu-1}}$ occupied,
$j_1, j_2, \ldots, j_\nu$ forming some permutation of $\nu$ distinct integers in $[0, n-1]$.

The element $s_i$ is subsequently assigned to slot $t_{j_\nu}$ and can be retrieved in an identical fashion, provided that neither $t_{j_1}, t_{j_2}, \ldots$ nor $t_{j_{\nu-1}}$ have been changed to vacant in the meantime.

Such an algorithm will be optimal storagewise if, given any distribution of table occupancy, the probability of assigning the next item to any of the still vacant slots is equal; optimality here means that the expected value of the length $\nu$ of the probing sequence $j_1, j_2, \ldots, j_\nu$ is minimized.

For example, the worst strategy corresponds to

$$j_\nu = j_1 + \nu - 1 \bmod n, \qquad \nu = 2, 3, \ldots, n$$

because if a collision has occurred at $t_{j_1}$, the probability that a collision will occur at $t_{j_1 + 1}$ is higher than the average over all $t$'s. Linear probing is therefore replaced by random probing, whereby a fixed permutation $\{d_1, d_2, \ldots, d_{n-1}\}$ of $\{1, 2, \ldots, n-1\}$ is used to form the probing sequence

$$j_\nu = j_1 + d_{\nu-1} \bmod n, \qquad \nu = 2, 3, \ldots, n \qquad (6.9)$$

The probability that the $(k+1)$st item entered into the table will require $\nu$ probes is

We can easily verify by induction that

$$1 + \frac{k}{n} + \frac{k(k-1)}{n(n-1)} + \cdots + \frac{k!}{n(n-1)\cdots(n-k+1)} = \frac{n+1}{n-k+1}$$

Thus, the expected value of $\nu$ when $k$ items are already in the table is

the second form being based directly on the probability that the probing sequence is of length greater than or equal to $\nu$. We rewrite

so that if $\alpha = k/n$ is the occupancy factor,

The expected value of $\nu$ when retrieving an item from the table can then be approximated by

A few values of $E(\nu)$ are
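The quantities discussed above can be tabulated with a short script. It is a sketch under assumptions: the probe distribution is the standard one for a probing sequence that visits distinct slots with every vacant slot equally likely, and the two closed forms $1/(1-\alpha)$ and $-\ln(1-\alpha)/\alpha$ are the usual approximations for that model, assumed here to correspond to the estimates of this section. All function names are illustrative.

```python
from fractions import Fraction
from math import log


def prob_probes(n, k, v):
    """P(the next insertion needs exactly v probes) with k of n slots occupied."""
    if v < 1 or v > k + 1:
        return Fraction(0)
    p = Fraction(1)
    for i in range(v - 1):                    # the first v-1 probes hit occupied slots
        p *= Fraction(k - i, n - i)
    return p * Fraction(n - k, n - (v - 1))   # the v-th probe hits a vacant slot


def expected_insert_probes(n, k):
    """Exact expected number of probes for the (k+1)st insertion, k < n."""
    return sum(v * prob_probes(n, k, v) for v in range(1, k + 2))


def expected_retrieval_probes(n, k):
    """Average, over the k stored items, of the probes spent inserting each one."""
    return sum(Fraction(n + 1, n - i + 1) for i in range(k)) / k


def approx_insert(alpha):
    return 1.0 / (1.0 - alpha)


def approx_retrieval(alpha):
    return -log(1.0 - alpha) / alpha


if __name__ == "__main__":
    n = 101
    for alpha in (0.1, 0.5, 0.75, 0.9):
        k = round(alpha * n)
        print(alpha,
              float(expected_insert_probes(n, k)), approx_insert(alpha),
              float(expected_retrieval_probes(n, k)), approx_retrieval(alpha))
```

The exact insertion cost agrees with the closed form $(n+1)/(n-k+1)$ given by the identity above, which can be confirmed for small tables by direct evaluation.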

However, the foregoing computation makes the tacit assumption that the probability of occupancy of any table position is the same, which is likely to be false if the mapping function $g$ does not satisfy that property for all samples of the set $S$. Indeed, $g$ is usually chosen on intuitive grounds, hoping that it will distribute the elements of $S$ uniformly over the table, since the distribution of the data in the $s$-space may not even be known. Consequently, the probability of collision for the $(k+1)$st element is bound to be higher than $\alpha$ and the above algorithm will not perform as well as expected.

Alternatively, consider now the following algorithm. The table $T$ has positions $(t_0, t_1, \ldots, t_{n-1})$ but $n$ is constrained to be prime. Compute:

then probe at

$$j_\nu = j_1 + (\nu - 1)\,\Delta \bmod n, \qquad \nu = 1, 2, \ldots, n \qquad (6.15)$$

until either a vacant slot or the data item is encountered.

If $g_1$ and $g_2$ are chosen in such a way that $j_1$ and $\Delta$ are uncorrelated for all $s \in S$, then the previous estimate of the length of the probing sequence will be valid. The probability that any two distinct elements from $S$ have identical hashing sequences $j_1, j_2, \ldots$ is now $O(1/n^2)$ instead of $O(1/n)$. The requirement that $n$ be prime ensures that no matter what value $\Delta$ has, all table positions will have been visited after $n$ probes.
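The role of the primality constraint is easy to check by enumeration. In the sketch below (function names are hypothetical) the probe positions form an arithmetic progression modulo $n$ with a nonzero increment `delta`; every increment covers the table when $n$ is prime, whereas a composite $n$ admits increments that revisit the same few slots.

```python
def probe_sequence(j1, delta, n):
    """Positions j1, j1 + delta, j1 + 2*delta, ... taken modulo n, n terms."""
    return [(j1 + step * delta) % n for step in range(n)]


def covers_table(n, delta, j1=0):
    """True when the n probes visit all n table positions."""
    return len(set(probe_sequence(j1, delta, n))) == n


if __name__ == "__main__":
    print(all(covers_table(11, d) for d in range(1, 11)))    # prime table size
    print([d for d in range(1, 12) if not covers_table(12, d)])  # failing increments
```

For a composite size, the increments that fail are exactly those sharing a factor with $n$, which is why a power-of-two table would restrict the increment to odd values.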

$g_1$ and $g_2$ can be chosen as follows: consider each data item (or some transform of the data item, for instance only a fractional representation of it) as an integer $x$. Using the Chinese remainder theorem in its simplest form, we know that any integer $m$ in the range $[0, n(n-1) - 1]$ can be uniquely represented by the pair of remainders $r_1, r_2$ where

$$r_1 = m \bmod n, \qquad r_2 = m \bmod (n-1)$$

Therefore, in the present case we can simply use

Clearly, for all $n(n-1)$ distinct pairs $(j_1, \Delta)$ to be feasible, the range of $x$ should be at least equal to $n(n-1)$. We can thus recommend applying to the data item a transformation which will distribute $x$ over as large a range as possible; then its remainder modulo $n(n-1)$ should be approximately uniformly distributed. This is, of course, the typical approach to "randomization" as implemented by linear congruential random number generators.
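A compact sketch of the scheme follows. Several points are assumptions made for illustration and are not taken from the text: the starting position is `x % n`, the increment is `1 + x % (n - 1)` (the shift by one keeps it in the range 1 to n-1), keys are non-negative integers, and deletions simply leave a marker behind. The class and method names are hypothetical.

```python
class DoubleHashTable:
    """Open-addressing table of prime size probed by j1 + step * delta mod n."""

    VACANT = object()    # special code: slot never used
    DELETED = object()   # special code: slot used, then freed

    def __init__(self, n):
        if n < 2 or any(n % d == 0 for d in range(2, int(n ** 0.5) + 1)):
            raise ValueError("table size must be prime")
        self.n = n
        self.slots = [self.VACANT] * n

    def _probes(self, x):
        n = self.n
        j1 = x % n                   # assumed first mapping
        delta = 1 + x % (n - 1)      # assumed second mapping, never zero
        for step in range(n):
            yield (j1 + step * delta) % n

    def insert(self, x):
        """Store x (if absent) and return its slot index."""
        first_free = None
        for j in self._probes(x):
            slot = self.slots[j]
            if slot is self.VACANT:
                if first_free is None:
                    first_free = j
                break                # x cannot lie beyond a vacant slot
            if slot is self.DELETED:
                if first_free is None:
                    first_free = j   # remember it, but keep looking for x
            elif slot == x:
                return j
        if first_free is None:
            raise OverflowError("table is full")
        self.slots[first_free] = x
        return first_free

    def find(self, x):
        """Return the slot index holding x, or None."""
        for j in self._probes(x):
            slot = self.slots[j]
            if slot is self.VACANT:
                return None
            if slot is not self.DELETED and slot == x:
                return j
        return None

    def delete(self, x):
        j = self.find(x)
        if j is None:
            return False
        self.slots[j] = self.DELETED   # keeps later probe chains intact
        return True


if __name__ == "__main__":
    table = DoubleHashTable(11)
    for key in (5, 16, 27):            # all three start at position 5
        print(key, table.insert(key))
```

Keys that differ by a multiple of $n(n-1)$, such as 5 and 115 for $n = 11$, share both remainders and therefore the whole probing sequence, which illustrates why the range of $x$ matters.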

In practice, one need not compute each time $j_1$ and $\Delta$; most of the time, for as long as the load factor is moderate, one probe will suffice to store or retrieve an item so that only $j_1$ need be calculated. Still, an objection may be raised concerning table sizes which are primes rather than powers of two. Usually, there are two motivations for choosing $n = 2^k$: tables can be simply combined or broken to form similar tables, and operations modulo $n$ are easily performed by just masking the high order bits. Let us remark, however, that once a table size has been chosen to implement a hash algorithm, it cannot in general be altered. Table extensions are usually achieved by performing multiple searches; if the item is not found in the first table, a second table is searched and so forth, but this is of course significantly worse than having a unique table set up in the first place. As far as modulo $n$ operations are concerned, the objection disappears if the algorithm is implemented in a higher level language such as FORTRAN where an honest division is actually carried out to obtain the remainder. At any rate, the cost of division should prove advantageous over the time required to generate the successive permuted increments in the classical random probing scheme as the table starts to fill up.

Finally, let us mention briefly how deletions are handled. In order to indicate which table entries are either vacant or deleted, we use two special codes which are not members of $S$. Although it is commonly said that lost space is reclaimed but lookup time not reduced, still some lookup time can be eliminated. When an item is retrieved from $t_{j_\nu}$ using the probing sequence $t_{j_1}, t_{j_2}, \ldots, t_{j_\nu}$, it

[Figure: polynomial fit of degree 5 and polynomial fit of degree 7; horizontal axis from 0 to 100.]

In document: Computational analysis of the random components induced by a binary equivalence relation (pages 138-151)
