Working With Sets - Fundamentals for Data Science, Machine Learning and Artificial Intelligence

In the code, a path is encoded by a sequence of 0 and 1 values, indicating “move east” or “move north” respectively. The function isUpperLattice() defined in lines 5-10 checks if a path is an upper lattice path by summing all the odd partial sums, and returning false if any sum ends up at a coordinate below the diagonal. Note the use of the ? : operator in line 7. Also note that in line 6, Int() is used to convert the division length(v)/2 to an integer type. In line 12, a collection of all possible lattice paths is created by applying the permutations() function from the Combinatorics package to an initial array of n zeros and n ones. The unique() function is then used to remove all duplicates. In line 13 the isUpperLattice() function is applied to each element of omega via the ‘.’ operator just after the function name. The result is a boolean array. Then omega[] selects the indices of omega where the value is true, and in the next line pA_modelI is calculated. In lines 17-31 the function randomWalkPath() is implemented, which creates a random path according to Model II. Note that the code in line 29 appends either zeros or ones to the path, depending on whether it hit the north boundary or the east boundary first. Then in line 33, the Monte Carlo estimate, pA_modelIIest, is determined. The function plotPath() defined in lines 36-50 plots a path with a specified label and color. It is then invoked in line 52 for an upper lattice path selected via rand(A), and again in the next line for a non-upper path by using setdiff(omega,A) to determine the collection of non-upper lattice paths. Functions dealing with sets are covered in more detail in the next section.

2.2 Working With Sets

As evident from the examples in Section 2.1 above, mathematical sets play an integral part in the evaluation of probability models. Subsets of the sample space Ω are also called events. By carrying out intersections, unions and differences of sets, we may often express more complicated events based on smaller ones.

A set is an unordered collection of unique elements. A set A is a subset of the set B if every element that is in A is also an element of B. The union of two sets, A and B, denoted A∪B, is the set of all elements that are in A or B, or both. The intersection of the two sets, denoted A∩B, is the set of all elements that are in both A and B. The difference, denoted A\B, is the set of all elements that are in A but not in B.

In the context of probability, the sample space Ω is often considered as the universal set. This allows us to then consider the complement of a set A, denoted A^c, which can be constructed via all elements of Ω that are not in A. Note that A^c = Ω\A. Also observe that in the presence of a universal set: A\B = A∩B^c.
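The two identities above can be checked on a small example. The following is a minimal sketch, assuming a hypothetical universal set of the integers 1 to 10 and a helper named complement() that is introduced here only for illustration.

```julia
# Hypothetical small example: a universal set and two events.
omega = Set(1:10)
A = Set([2, 3, 7])
B = Set(1:6)

# Complement relative to the universal set omega (helper name is an assumption).
complement(S) = setdiff(omega, S)

# A \ B computed directly and via A ∩ B^c.
lhs = setdiff(A, B)
rhs = intersect(A, complement(B))

println("A \\ B   = ", lhs)
println("A ∩ B^c = ", rhs)
println("Identity holds: ", lhs == rhs)
```

Since sets compare by content rather than by order, == is enough to confirm that both constructions give the same set.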

Representing Sets in Julia

Julia includes built-in capability for working with sets. Unlike an Array, a Set is an unordered collection of unique objects. Listing 2.6 illustrates how to construct a Set in Julia, and illustrates the use of the union(), intersect(), setdiff(), issubset() and in() functions.

There are also other functions related to sets that you may explore independently. These include issetequal(), symdiff(), union!(), setdiff!(), symdiff!() and intersect!(). See the online Julia documentation under “Collections and Data Structures”.

CHAPTER 2. BASIC PROBABILITY - DRAFT

Listing 2.6: Basic set operations

 1  A = Set([2,7,2,3])
 2  B = Set(1:6)
 3  omega = Set(1:10)
 4
 5  AunionB = union(A, B)
 6  AintersectionB = intersect(A, B)
 7  BdifferenceA = setdiff(B,A)
 8  Bcomplement = setdiff(omega,B)
 9  AsymDifferenceB = union(setdiff(A,B),setdiff(B,A))
10  println("A = $A, B = $B")
11  println("A union B = $AunionB")
12  println("A intersection B = $AintersectionB")
13  println("B diff A = $BdifferenceA")
14  println("B complement = $Bcomplement")
15  println("A symDifference B = $AsymDifferenceB")
16  println("The element '6' is an element of A: $(in(6,A))")
17  println("Symmetric difference and intersection are subsets of the union: ",
18      issubset(AsymDifferenceB,AunionB),", ", issubset(AintersectionB,AunionB))

A = Set([7, 2, 3]), B = Set([4, 2, 3, 5, 6, 1])
A union B = Set([7, 4, 2, 3, 5, 6, 1])
A intersection B = Set([2, 3])
B diff A = Set([4, 5, 6, 1])
B complement = Set([7, 9, 10, 8])
A symDifference B = Set([7, 4, 5, 6, 1])
The element '6' is an element of A: false
Symmetric difference and intersection are subsets of the union: true, true

In lines 1-3 three different sets are created via the Set() function (a constructor). Note that A contains only three elements, since sets are meant to be a collection of unique elements. Also note that, unlike arrays, order is not preserved. Lines 5-9 perform various operations using the sets created. Lines 10-18 create the listing output. Note the use of the functions in() and issubset() in lines 16-18.
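Some of the other set functions mentioned earlier can be tried in the same way. The sketch below is illustrative only: it reuses the sets A and B from Listing 2.6, and the variable names S and C are introduced here. It compares the built-in symdiff() with the manual construction from line 9 of the listing, and shows that the functions ending in ! modify their first argument in place.

```julia
A = Set([2, 7, 3])
B = Set(1:6)

# symdiff() returns the elements that are in exactly one of the two sets.
S = symdiff(A, B)
manual = union(setdiff(A, B), setdiff(B, A))   # as in line 9 of Listing 2.6
println("symdiff matches manual construction: ", issetequal(S, manual))

# Functions ending in ! mutate their first argument.
C = copy(A)          # work on a copy so that A itself is left unchanged
union!(C, B)         # C now holds A ∪ B
setdiff!(C, A)       # C now holds (A ∪ B) \ A, which equals B \ A
println("In-place result equals B \\ A: ", issetequal(C, setdiff(B, A)))
```

Working on a copy is a deliberate choice here, since the in-place versions would otherwise overwrite A.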

The Probability of a Union

Consider now two events (sets) A and B. If A∩B = ∅, then P(A∪B) = P(A) + P(B). However, more generally, when A and B are not disjoint, the probability of the intersection, A∩B, plays a role. For such cases the inclusion exclusion formula is useful:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{2.3}$$

To help illustrate this, consider the simple example of choosing a random lower case letter, ‘a’-‘z’. Let A be the event that the letter is a vowel (one of ‘a’, ‘e’, ‘i’, ‘o’, ‘u’). Let B be the event that the letter is one of the first three letters (one of ‘a’, ‘b’, ‘c’). Now since A∩B = {‘a’}, a set with one element, we have,

$$P(A \cup B) = \frac{5}{26} + \frac{3}{26} - \frac{1}{26} = \frac{7}{26}.$$
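The same arithmetic can be reproduced with Julia sets. This is a minimal sketch, assuming a uniformly chosen letter so that the probability of an event is its size divided by 26. The helper P() and the variable names direct and viaFormula are introduced here for illustration.

```julia
omega = Set('a':'z')
A = Set(['a', 'e', 'i', 'o', 'u'])   # vowels
B = Set(['a', 'b', 'c'])             # first three letters

# Under a uniform choice, P(E) = |E| / |omega|; the // operator keeps the result exact.
P(E) = length(E) // length(omega)

direct = P(union(A, B))                          # count the union directly
viaFormula = P(A) + P(B) - P(intersect(A, B))    # inclusion exclusion formula

println("Direct:  ", direct)
println("Formula: ", viaFormula)
```

Using rational arithmetic avoids any floating point rounding, so the two values can be compared with ==.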

For another similar example, consider the case where A is the set of vowels as before, but B = {‘x’, ‘y’, ‘z’}. In this case, since the intersection of A and B is empty, we immediately know that

P(A∪B) = (5 + 3)/26 ≈ 0.3077. While this example is elementary, we now use it to illustrate a type of conceptual error that one may make when using Monte Carlo simulation.

Consider Listing 2.7, and compare mcEst1 and mcEst2 from lines 12 and 13 respectively. Both variables are designed to be estimators of P(A∪B). However, one of them is a correct estimator and the other is faulty. In the following we look at the output given by both, and explore the fault in the underlying logic.

Listing 2.7: An innocent mistake with Monte Carlo

 1  using Random, StatsBase
 2  Random.seed!(1)
 3
 4  A = Set(['a','e','i','o','u'])
 5  B = Set(['x','y','z'])
 6  omega = 'a':'z'
 7
 8  N = 10^6
 9
10  println("mcEst1 \t \tmcEst2")
11  for _ in 1:5
12      mcEst1 = sum([in(sample(omega),A) || in(sample(omega),B) for _ in 1:N])/N
13      mcEst2 = sum([in(sample(omega),union(A,B)) for _ in 1:N])/N
14      println(mcEst1,"\t",mcEst2)
15  end

First observe line 12. In Julia, || means “or”, so at first glance the estimator mcEst1 looks sensible, since:

A∪B = the set of all elements that are in A or B.

Hence we are generating a random element via sample(omega) and checking if it is an element of A or an element of B. However, there is a subtle error. Each of the N random experiments involves two separate calls to sample(omega). Hence the code in line 12 simulates a situation where, conceptually, the sample space Ω is composed of pairs of letters (2-tuples), not single letters!

Hence the code computes the probability of the event A₁∪B₂, where

A₁ = first element of the tuple is a vowel,

B₂ = second element of the tuple is an ‘x’, ‘y’, or ‘z’ letter.

Now observe that A₁ and B₂ are not disjoint events, hence,

$$P(A_1 \cup B_2) = P(A_1) + P(B_2) - P(A_1 \cap B_2).$$

Further, it holds that P(A₁∩B₂) = P(A₁)P(B₂). This follows from independence (further explored in Section 2.3). Now that we have identified the error, we can predict the resulting output.

$$P(A_1 \cup B_2) = P(A_1) + P(B_2) - P(A_1)P(B_2) = \frac{5}{26} + \frac{3}{26} - \frac{5}{26} \cdot \frac{3}{26} \approx 0.2855.$$
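The predicted value can also be confirmed without simulation, by listing all 26² equally likely pairs of letters and counting those for which the condition tested by mcEst1 is true. The following is a sketch under that assumption. The names allPairs, hits and exact are introduced here and are not part of Listing 2.7.

```julia
A = Set(['a', 'e', 'i', 'o', 'u'])
B = Set(['x', 'y', 'z'])
omega = 'a':'z'

# All 26^2 equally likely ordered pairs (first draw, second draw).
allPairs = [(x, y) for x in omega, y in omega]

# Event effectively tested by mcEst1: first draw in A, or second draw in B.
hits = count(p -> in(p[1], A) || in(p[2], B), allPairs)
exact = hits // length(allPairs)

println(exact, " ≈ ", round(float(exact), digits=4))
```

The count is 5·26 + 26·3 − 5·3 = 193 pairs out of 676, in agreement with the calculation above.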

It can be seen from the code output, which repeats the comparison 5 times, that mcEst1 consistently underestimates the desired probability, yielding estimates near 0.2855 instead.


mcEst1      mcEst2
0.285158    0.307668
0.285686    0.307815
0.285022    0.308132
0.285357    0.307261
0.285175    0.306606

In lines 11-15 a for loop is implemented, which generates 5 pairs of Monte Carlo estimates. Note that lines 12 and 13 contain the main logic of this example. Line 12 is our incorrect simulation, and yields incorrect estimates. See the text above for a detailed explanation as to why the use of two separate calls to sample() is incorrect in this case. Line 13 is our correct simulation, and for large N yields results close to the expected result. Note that the union() function is used on A and B, instead of the “or” operator, ||, used in line 12. The important point is that only a single sample is generated for each iteration of the comprehension.
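The “or” operator itself is not the problem. It also gives a valid estimator as long as the same sampled letter is tested against both sets. The sketch below illustrates this under a few assumptions: the names oneExperiment and mcEst3 are introduced here, rand() from Base is used in place of sample() to keep the example free of package dependencies, and N is reduced to 10^5 to keep the run short.

```julia
using Random
Random.seed!(1)

A = Set(['a', 'e', 'i', 'o', 'u'])
B = Set(['x', 'y', 'z'])
omega = 'a':'z'
N = 10^5

# Draw ONE letter per experiment, then test that same letter against A and B.
function oneExperiment()
    x = rand(omega)
    return in(x, A) || in(x, B)
end

mcEst3 = sum(oneExperiment() for _ in 1:N) / N
println("Estimate: ", mcEst3, ", exact: ", 8/26)
```

Storing the draw in x before testing it is what keeps the sample space equal to single letters rather than pairs.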

Secretary with Envelopes

Now consider a more general form of the inclusion exclusion principle applied to a collection of sets, C₁, . . . , C_n. It is presented below, written in two slightly different forms:

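$$
\begin{aligned}
P\Big(\bigcup_{i=1}^{n} C_i\Big)
&= \sum_{i=1}^{n} P(C_i) - \sum_{\text{pairs}} P(C_i \cap C_j) + \sum_{\text{triplets}} P(C_i \cap C_j \cap C_k) - \ldots + (-1)^{n-1} P(C_1 \cap \ldots \cap C_n) \\
&= \sum_{i=1}^{n} P(C_i) - \sum_{i<j} P(C_i \cap C_j) + \sum_{i<j<k} P(C_i \cap C_j \cap C_k) - \ldots + (-1)^{n-1} P\Big(\bigcap_{i=1}^{n} C_i\Big).
\end{aligned}
$$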

Notice that there are n major terms. The first term deals with probabilities of individual events; the second term deals with pairs; the third with triplets; and the sequence continues until a single final term involving a single intersection is reached. The ℓ’th term has $\binom{n}{\ell}$ summands. For example, there are $\binom{n}{2}$ pairs, $\binom{n}{3}$ triplets, etc. Notice also the alternating signs via $(-1)^{\ell-1}$. It is possible to conceptually see the validity of this formula for the case of n = 3 by drawing a Venn diagram and seeing the role of all summands. In this case,

$$P(C_1 \cup C_2 \cup C_3) = P(C_1) + P(C_2) + P(C_3) - P(C_1 \cap C_2) - P(C_1 \cap C_3) - P(C_2 \cap C_3) + P(C_1 \cap C_2 \cap C_3).$$

Let us now consider a classic example that uses this inclusion exclusion principle. Assume that a secretary has an equal number of pre-labelled envelopes and business cards, n. Suppose that at the end of the day, he is in such a rush to go home that he puts each business card in an envelope at random without any thought of matching the business card to its intended recipient on the envelope.

The probability that each of the business cards will go to the correct envelope is easy to obtain. It is 1/n!, which goes to zero very quickly as n grows. However, what is the probability that each of the business cards will go to a wrong envelope?

As an aid, let A_i be the event that the i’th business card is put in the correct envelope. We have a handle on events involving intersections of distinct A_i values. For example, if n = 10, then P(A₁∩A₄∩A₆) = 7!/10!, or more generally, the probability of an intersection of k such events is p_k := (n−k)!/n!.
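The expression for p_k can be checked by direct counting for a small n. The sketch below is illustrative only: it assumes n = 6 rather than the n = 10 of the example (so that only 720 permutations need to be listed), and it uses a hypothetical helper, allPermutations(), written here so that the example needs no packages.

```julia
# Hypothetical helper (not from the text): all permutations of 1:n, built by
# inserting n into every position of each permutation of 1:(n-1).
function allPermutations(n)
    n == 0 && return [Int[]]
    result = Vector{Int}[]
    for p in allPermutations(n - 1), pos in 1:n
        q = copy(p)
        insert!(q, pos, n)
        push!(result, q)
    end
    return result
end

n = 6                    # assumed small n, so that n! = 720 outcomes are cheap to list
matched = (1, 4, 6)      # cards required to be in their correct envelopes
k = length(matched)

omega = allPermutations(n)
favourable = count(p -> all(p[i] == i for i in matched), omega)

empirical = favourable // length(omega)
fromFormula = factorial(n - k) // factorial(n)

println("Counting: ", empirical, ", formula: ", fromFormula)
```

Fixing k positions leaves the remaining n − k cards free to be arranged in (n−k)! ways, which is what the count reflects.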

The event we are seeking to evaluate is B = A^c₁∩A^c₂∩. . .∩A^c_n. Hence by De Morgan’s laws, B^c = A₁∪. . .∪A_n. Hence using the inclusion exclusion formula together with p_k, and noting that $\binom{n}{k} p_k = 1/k!$, we can simplify factorials and binomial coefficients to obtain:

$$P(B) = 1 - P(A_1 \cup \ldots \cup A_n) = 1 - \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} p_k = 1 - \sum_{k=1}^{n} \frac{(-1)^{k+1}}{k!} = \sum_{k=0}^{n} \frac{(-1)^{k}}{k!}. \tag{2.4}$$

Observe that as n → ∞ this probability converges to 1/e ≈ 0.3679, yielding a simple asymptotic approximation. Listing 2.8 evaluates P(B) in several alternative ways for n = 1, 2, . . . , 6. The function bruteSetsProbabilityAllMiss() works by creating all possibilities and counting. Although a highly inefficient way of evaluating P(B), it is presented here as it is instructive. The function formulaCalcAllMiss() evaluates the analytic solution from (2.4). Finally, the function mcAllMiss() estimates the probability via Monte Carlo simulation.

Listing 2.8: Secretary with envelopes

 1  using Random, StatsBase, Combinatorics
 2  Random.seed!(1)
 3
 4  function bruteSetsProbabilityAllMiss(n)
 5      omega = collect(permutations(1:n))
 6      matchEvents = []
 7      for i in 1:n
 8          event = []
 9          for p in omega
10              if p[i] == i
11                  push!(event,p)
12              end
13          end
14          push!(matchEvents,event)
15      end
16      noMatch = setdiff(omega,union(matchEvents...))
17      return length(noMatch)/length(omega)
18  end
19
20  formulaCalcAllMiss(n) = sum([(-1)^k/factorial(k) for k in 0:n])
21
22  function mcAllMiss(n,N)
23      function envelopeStuffer()
24          envelopes = Random.shuffle!(collect(1:n))
25          return sum([envelopes[i] == i for i in 1:n]) == 0
26      end
27      data = [envelopeStuffer() for _ in 1:N]
28      return sum(data)/N
29  end
30
31  N = 10^6
32
33  println("n\tBrute Force\tFormula\t\tMonte Carlo\tAsymptotic")
34  for n in 1:6
35      bruteForce = bruteSetsProbabilityAllMiss(n)
36      fromFormula = formulaCalcAllMiss(n)
37      fromMC = mcAllMiss(n,N)
38      println(n,"\t",round(bruteForce,digits=4),"\t\t",round(fromFormula,digits=4),
39          "\t\t",round(fromMC,digits=4),"\t\t",round(1/MathConstants.e,digits=4))
40  end


n    Brute Force    Formula    Monte Carlo    Asymptotic
1    0.0            0.0        0.0            0.3679
2    0.5            0.5        0.4994         0.3679
3    0.3333         0.3333     0.3337         0.3679
4    0.375          0.375      0.3747         0.3679
5    0.3667         0.3667     0.3665         0.3679
6    0.3681         0.3681     0.3678         0.3679

Lines 4-18 define the function bruteSetsProbabilityAllMiss(), which uses a brute force approach to calculate P(B). The nested loops in lines 7-15 populate the array matchEvents with elements of omega that have a match. The inner loop in lines 9-13 puts elements from omega in event if they satisfy an i’th match. In line 16, notice the use of the splat operator, written as three dots (...). Here union() is applied to all the elements of matchEvents. The return value in line 17 is a direct implementation via counting the elements of noMatch. The function on line 20 implements (2.4) in a straightforward manner. Lines 22-29 implement the function mcAllMiss() that estimates the probability via Monte Carlo. The inner function, envelopeStuffer(), returns a result from a single experiment. Note that shuffle!() is used to create a random permutation in line 24. The remainder of the code prints the output, and compares the results to the asymptotic formula obtained via 1/MathConstants.e.
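As a complement to the rounded values in the output, formula (2.4) can also be evaluated exactly with rational arithmetic, which makes the approach to 1/e easy to see. This is a minimal sketch. The function name exactAllMiss() is introduced here and is not part of Listing 2.8.

```julia
# Exact evaluation of (2.4) using rationals; the name exactAllMiss is an assumption.
exactAllMiss(n) = sum((-1)^k // factorial(k) for k in 0:n)

for n in 1:6
    p = exactAllMiss(n)
    gap = abs(float(p) - exp(-1))     # distance from the asymptotic value 1/e
    println("n = $n: P(B) = $p, |P(B) - 1/e| = ", round(gap, digits=6))
end
```

For n = 6 the exact value is 53/144, which rounds to the 0.3681 shown in the output above.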

An Occupancy Problem

We now consider a problem related to the previous example. Imagine now the secretary placing r identical business cards randomly into n envelopes, with r ≥ n and no limit on the number of business cards that can fit in an envelope. We now ask what is the probability that all envelopes are non-empty (i.e. occupied)?

To begin, denote A_i as the event that the i’th envelope is empty, and hence A^c_i is the event that the i’th envelope is occupied. Hence as before, we are seeking the probability of the event B =A^c₁∩A^c₂∩. . .∩A^c_n. Using the same logic as in the previous example,

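$$P(B) = 1 - P(A_1 \cup \ldots \cup A_n) = 1 - \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} \tilde{p}_k,$$

where p̃_k is the probability that k given envelopes are all empty (other envelopes may be empty as well, so at least k envelopes are empty). Now from basic counting considerations,

$$\tilde{p}_k = \frac{(n-k)^r}{n^r} = \Big(1 - \frac{k}{n}\Big)^r.$$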

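Thus we arrive at,

$$P(B) = 1 - \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} \Big(1 - \frac{k}{n}\Big)^r = \sum_{k=0}^{n} (-1)^{k} \binom{n}{k} \Big(1 - \frac{k}{n}\Big)^r. \tag{2.5}$$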

We now calculate P(B) in Listing 2.9 and compare the results to Monte Carlo simulation estimates. In the code we consider several situations by varying the number of envelopes in the range n = 1, . . . , 100, and for every n, consider the number of business cards r = Kn for K = 2, 3, 4. The results are displayed in Figure 2.4.
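For orientation, formula (2.5) and a matching Monte Carlo estimate can be sketched in a few lines. This is not the book's Listing 2.9: the function names occupancyAnalytic() and occupancyMC() and the chosen (n, r) pairs are assumptions made here, and only small n is used because binomial(n, k) with Int arguments would overflow for n near 100.

```julia
using Random
Random.seed!(1)

# Formula (2.5): probability that all n envelopes are occupied after r placements.
occupancyAnalytic(n, r) = sum((-1)^k * binomial(n, k) * (1 - k/n)^r for k in 0:n)

# Monte Carlo estimate: place r cards uniformly at random, check that every envelope was hit.
function occupancyMC(n, r, N)
    hits = 0
    for _ in 1:N
        occupied = falses(n)
        for _ in 1:r
            occupied[rand(1:n)] = true
        end
        hits += all(occupied)
    end
    return hits / N
end

for (n, r) in [(2, 4), (5, 10), (10, 30)]
    println("n = $n, r = $r: formula = ", round(occupancyAnalytic(n, r), digits=4),
            ", Monte Carlo = ", round(occupancyMC(n, r, 10^5), digits=4))
end
```

For n = 2 and r = 4 the formula gives 1 − 2(1/2)⁴ = 0.875, which is easy to confirm by hand since the only failures are the two outcomes where all four cards land in the same envelope.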
