The foundation of Bayesian statistics is a rather simple probability rule known as Bayes’ rule (also called Bayes’ theorem or Bayes’ law). Bayes’ rule is an accounting identity that obeys the axioms of probability. It is by itself uncontroversial, but this simple rule is the source of both the rich applications and the controversy over subjective elements in Bayesian statistics.
Bayes’ rule begins with relating joint and conditional probability:
P(A|B) = P(A∩B) / P(B)
Figure 9-1 illustrates this, where A is the event inside the larger circle, B is the event inside the smaller circle, and C is the intersection of events A and B, whose probability is the joint probability of A and B.
Figure 9-1: Venn diagram of events A and B, with their intersection C.
The joint probability of A and B can also be written as:
P(A∩B) = P(A|B)*P(B) or as P(A∩B) = P(B|A)*P(A)
We can rewrite the relationship between conditional and joint probability of A and B given earlier as:
P(A|B) = P(A∩B) / P(B) = P(B|A)*P(A) / P(B)
where the relation
P(A|B) = P(B|A)*P(A) / P(B)
is known as Bayes’ rule. Bayes’ rule is sometimes called the rule of inverse probability, because it shows how a conditional probability P(B|A) can be inverted to obtain the conditional probability P(A|B).
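As a quick illustration, here is a minimal Python sketch of this inversion; the probability values are made up purely for illustration and are not taken from any example in this chapter.

# Bayes' rule: invert P(B|A) into P(A|B).
def bayes_rule(p_b_given_a, p_a, p_b):
    # P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# Illustrative (made-up) values:
print(bayes_rule(p_b_given_a=0.8, p_a=0.1, p_b=0.25))  # about 0.32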
The denominator of Bayes’ rule, P(B), is the marginal probability of event B, that is, the probability of event B across all possibilities of A. In the case where A is not a single event but a set of n mutually exclusive and exhaustive events A1, ..., An, such as a set of hypotheses, we can use the law of total probability to calculate P(B):
P(B) = ∑n P(B|An)*P(An)
In this situation, Bayes’ rule provides the posterior probability of any particular one of these n hypotheses, say Aj, given that the event B has occurred:
P(Aj|B) = P(B|Aj)*P(Aj) / ∑n P(B|An)*P(An)
Because, for a given situation, P(B) is a constant, Bayes’ theorem may be written as:
P(A|B) ∝ P(B|A)*P(A)
where ∝ is the symbol for “proportional to”.
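To make the hypothesis form concrete, the following Python sketch computes P(B) by the law of total probability and then the posterior for each hypothesis; the priors and likelihoods are made-up values, not drawn from any example in this chapter.

# Posterior probabilities for a set of mutually exclusive, exhaustive hypotheses.
priors = [0.5, 0.3, 0.2]          # P(A1), P(A2), P(A3) -- made-up values
likelihoods = [0.9, 0.5, 0.1]     # P(B|A1), P(B|A2), P(B|A3) -- made-up values

# Law of total probability: P(B) = sum over n of P(B|An) * P(An)
p_b = sum(l * p for l, p in zip(likelihoods, priors))

# Bayes' rule for each hypothesis: P(Aj|B) = P(B|Aj) * P(Aj) / P(B)
posteriors = [l * p / p_b for l, p in zip(likelihoods, priors)]

print(p_b)         # 0.62
print(posteriors)  # roughly [0.726, 0.242, 0.032]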
Sometimes Bayes’ rule is written using E and H, where E stands for “evidence”
and H stands for “hypothesis”. Using E and H we can write:
P(H|E) = P(E|H)*P(H) / P(E)
In this form, P(H) represents the prior degree of belief in the hypothesis before seeing the evidence, and P(H|E) is the updated probability of the hypothesis given the evidence. In other words, Bayes’ rule updates the degree of belief in the hypothesis based on the evidence. This idea of updating beliefs is a foundation of the usefulness of Bayes’ rule and Bayesian statistics in learning.
Applying Bayes’ Rule
Let’s apply Bayes’ rule to two examples. In the first case, we will have complete information about the joint probability of two events. In the second case, we will have only select probability information to work with.
Table 9-1 shows the joint probability of two events: event A, being a membrane bound protein, and event B, having a high proportion of hydrophobic (amino acid) residues. The two columns with data represent the marginal distributions of A, being a membrane bound protein, and the complement of A (~A or Ac), not being a membrane bound protein. The two rows represent the marginal distributions of B, having a high hydrophobic content, and the complement of B (~B or Bc), having a low hydrophobic content. Each cell represents the joint probability of two events.
Let’s say we want to calculate the probability of a protein having a high hydrophobic content given that it is a membrane bound protein. To do this we can apply Bayes’ rule in this form:
P(B|A) = P(A|B)*P(B) / P(A) = P(A∩B) / P(A)
P(A∩B), the joint probability of A and B, is simply found from the corresponding cell in the joint probability table and is 0.3. P(A) can be calculated by the law of total probability, P(A|B)P(B) + P(A|~B)P(~B), or, since we have the table, as the sum of the column for event A, which is 0.4.
Therefore
P(B|A) = P(A∩B) / P(A) = 0.3 / 0.4 = 0.75
And we conclude that, given a protein is membrane bound, there is a 75% chance the protein has a high hydrophobic residue content.
Table 9-1

                              Type of Protein
Prop. Hydrophobic Residues    Membrane Bound (A)    Non-membrane Bound (~A)
High (B)                      0.3                   0.2
Low (~B)                      0.1                   0.4
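The calculation in this first example can be sketched in a few lines of Python; the dictionary below simply encodes the joint probabilities from Table 9-1, and the key names (“A”, “notA”, “B”, “notB”) are my own labels for the events.

# Joint probabilities from Table 9-1.
# A = membrane bound, notA = non-membrane bound
# B = high hydrophobic content, notB = low hydrophobic content
joint = {
    ("A", "B"): 0.3,
    ("notA", "B"): 0.2,
    ("A", "notB"): 0.1,
    ("notA", "notB"): 0.4,
}

p_a = joint[("A", "B")] + joint[("A", "notB")]   # marginal P(A) = 0.4
p_b_given_a = joint[("A", "B")] / p_a            # P(B|A) = P(A∩B) / P(A)
print(round(p_b_given_a, 2))                     # 0.75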
In the first example (above), the computation of the desired conditional probability and the use of Bayes’ rule are quite straightforward, since all the information needed is available from the joint probability table. Now let’s consider a second example, where the computation would be quite difficult were it not for Bayes’ formula.
Suppose having a gene X results in the onset of a particular disease 50% of the time. Suppose the prevalence of having the gene X is 1/1000 and the prevalence of the disease is 1%. From this, compute the probability of having gene X given that you have the disease.
We could use this information and try to produce a joint probability table for the two events – having the gene and having the disease. Or, now that we know there is Bayes’ rule, we can use it to solve this problem.
From the information above we are given:
P(Gene) = 1/1000, P(Disease) = 1/100, and
P(Disease|Gene) = 0.5
Note that what we are doing here is inverting the probability P(Disease|Gene) to calculate P(Gene|Disease), in light of the prior knowledge that P(Gene) = 1/1000 and P(Disease) = 1/100. P(Disease) can be assumed to be the marginal probability of having the disease in the population, across those with and those without the gene X.
P(Gene|Disease) = P(Disease|Gene)*P(Gene) / P(Disease)
Solving this is as simple as plugging in the numbers:
P(Gene|Disease) = 0.5 * (1/1000) / (1/100) = 0.05
This result is interpreted as follows: if you have the disease, the probability that you have the gene is 5%. Note that this is very different from the conditional probability we started with, which said that if you have the gene, there is a 50% probability you will have the disease.
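The same inversion can be written out as a short Python sketch; the three inputs are exactly the probabilities given in the problem statement, and the variable names are simply my own labels.

# Probabilities as given in the problem statement.
p_gene = 1 / 1000                 # P(Gene)
p_disease = 1 / 100               # P(Disease)
p_disease_given_gene = 0.5        # P(Disease|Gene)

# Bayes' rule: P(Gene|Disease) = P(Disease|Gene) * P(Gene) / P(Disease)
p_gene_given_disease = p_disease_given_gene * p_gene / p_disease
print(round(p_gene_given_disease, 2))   # 0.05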
Bayes’ rule could be applied over and over in order to update the probabilities of hypotheses in light of new evidence, a process known as Bayesian updating in artificial intelligence and related areas. In such a sequential framework, the posterior from one updating step becomes the prior for the subsequent step. The evidence from the data usually starts to drive the results fairly quickly, and the influence of the initial (subjective) prior diminishes. However, since we are interested in Bayes’ formula with respect to Bayesian statistics and probability models, our discussion here will continue in that direction.
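As a sketch of this sequential updating idea, the following Python snippet repeatedly applies Bayes’ rule to a single hypothesis H. The initial prior and the likelihoods P(E|H) = 0.9 and P(E|~H) = 0.3 are hypothetical values chosen only to show that each posterior becomes the prior for the next step.

# One Bayesian updating step for hypothesis H given one piece of evidence E.
def update(prior_h, p_e_given_h, p_e_given_not_h):
    # Law of total probability for the denominator P(E)
    p_e = p_e_given_h * prior_h + p_e_given_not_h * (1 - prior_h)
    # Bayes' rule: P(H|E) = P(E|H) * P(H) / P(E)
    return p_e_given_h * prior_h / p_e

belief = 0.01   # initial (subjective) prior P(H) -- made-up value
for step in range(5):
    belief = update(belief, p_e_given_h=0.9, p_e_given_not_h=0.3)
    print(step + 1, round(belief, 3))
# The belief climbs toward 1 (roughly 0.03, 0.08, 0.21, 0.45, 0.71),
# and the influence of the initial prior fades as evidence accumulates.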