Correlation and Dependence
Point Biserial Correlation
When one of the variables is binary and the other continuous, we can use the point biserial correlation coefficient. If S is a continuous variable and Y a binary variable taking the values 0 and 1, the point biserial correlation is calculated as
$$\rho = \frac{(\bar{S}_1 - \bar{S}_0)\sqrt{p(1-p)}}{\sigma_S}$$

where
$\bar{S}_1$ = mean of S when Y = 1
$\bar{S}_0$ = mean of S when Y = 0
$\sigma_S$ = sample standard deviation of S
$p$ = proportion of values where Y = 1
A VBA function to estimate the point biserial correlation coefficient is given below. Note that the first column passed to the function should be the continuous variable and the second column the binary variable.
Function Biserial(data As Range)
    ' Calculate Biserial correlation
    Dim number_columns As Double
    Dim number_rows As Double
    Dim row As Integer
    Dim x1 As Double
    Dim x0 As Double
    Dim s As Double
    Dim p As Double
    Dim average_x As Double
    Dim all_x() As Double
    number_columns = data.Columns.Count
    number_rows = data.Rows.Count
    If (number_columns <> 2) Then ' check no more than 2 columns
        Biserial = "2 columns only"
    ElseIf (number_rows < 4) Then ' We should use at least 4
        ' observations although the more the better
        Biserial = "need at least 4 rows"
    Else
        x0 = 0
        x1 = 0
        p = 0
        s = 0
        average_x = 0
        Dim number_ones As Integer
        number_ones = 0
        ReDim all_x(number_rows)
        For row = 1 To number_rows ' calculate averages and sum of binary variable
            Dim is_one As Integer
            is_one = data(row, 2).Value
            p = p + is_one
            If (is_one = 0) Then
                x0 = x0 + data(row, 1).Value
            Else
                x1 = x1 + data(row, 1).Value
                number_ones = number_ones + 1
            End If
            average_x = average_x + data(row, 1).Value
            all_x(row) = data(row, 1).Value
        Next row
        x0 = x0 / (number_rows - number_ones)
        x1 = x1 / number_ones
        average_x = average_x / number_rows
        p = p / number_rows
        For row = 1 To number_rows ' calculate standard deviation
            s = s + (all_x(row) - average_x) ^ 2
        Next row
        s = (s / (number_rows - 1)) ^ 0.5
        Dim temp As Double
        Biserial = ((x1 - x0) * (p * (1 - p)) ^ 0.5) / s ' Return Biserial correlation
    End If
End Function
EXAMPLE 11.4 CORRELATION BETWEEN OPERATIONAL RISK AND GROSS INCOME OF BUSINESS LINES
An application of this function is given in the worksheet Biserial in the workbook Operational Risk 11.xls. The worksheet is based on the following example: Let us imagine that the level of operational risk in an institution's business lines is graded as high or low based on the opinion of an OR analyst. Suppose we are interested in assessing the degree of correlation between OR risk and gross income of the business lines. Since in this case OR risk is a binary variable and gross income a continuous variable, we can use the biserial coefficient to estimate correlation. Table 11.3 provides a typical example. The final column gives the mapping of high or low into a binary variable. Using the above Biserial() function in the worksheet Biserial, the estimate of correlation is 0.57.

TABLE 11.3 OR Risk and Gross Income across Business Lines

Business line                  Operational risk   Gross income (millions $)   Risk coding
Corporate finance              Low                117.78                      0
Trading and sales              High               161.84                      1
Retail banking                 Low                117.11                      0
Commercial banking             High               161.91                      1
Payment and settlement         High               162.11                      1
Agency services and custody    High                80.45                      1
Asset management               Low                 50.57                      0
Retail brokerage               High               172.88                      1
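The Table 11.3 figure can be cross-checked outside the workbook. What follows is a minimal Python sketch, not the workbook's own code: the function name point_biserial and the list names are hypothetical, and it assumes the same conventions as the Biserial() function above (sample standard deviation with divisor n − 1, binary variable coded 0 or 1).

```python
import math


def point_biserial(s_values, y_values):
    """Point biserial correlation of continuous s_values with binary y_values.

    Assumption: mirrors the VBA Biserial() function, i.e. uses the sample
    standard deviation (divisor n - 1) of the continuous variable.
    """
    if len(s_values) != len(y_values):
        raise ValueError("both inputs must have the same length")
    n = len(s_values)
    ones = [s for s, y in zip(s_values, y_values) if y == 1]
    zeros = [s for s, y in zip(s_values, y_values) if y == 0]
    if len(ones) + len(zeros) != n:
        raise ValueError("binary variable must be coded 0 or 1")
    if n < 2 or not ones or not zeros:
        raise ValueError("need observations in both groups")

    mean_1 = sum(ones) / len(ones)      # mean of S when Y = 1
    mean_0 = sum(zeros) / len(zeros)    # mean of S when Y = 0
    p = len(ones) / n                   # proportion of values where Y = 1
    mean_all = sum(s_values) / n
    sample_sd = math.sqrt(sum((s - mean_all) ** 2 for s in s_values) / (n - 1))
    return (mean_1 - mean_0) * math.sqrt(p * (1 - p)) / sample_sd


# Data transcribed from Table 11.3 (gross income in millions $, risk coding).
gross_income = [117.78, 161.84, 117.11, 161.91, 162.11, 80.45, 50.57, 172.88]
risk_coding = [0, 1, 0, 1, 1, 1, 0, 1]

estimate = point_biserial(gross_income, risk_coding)
print(round(estimate, 2))  # 0.57
```

The sketch raises an error when one of the two groups is empty, a case in which the group mean is undefined.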
Tetrachoric Correlation
Tetrachoric correlation measures the association between two binary variables. Assume the underlying continuous variables S and T are dichotomized at unknown threshold values $\theta_S$ and $\theta_T$, respectively. Our observable measurements on S and T are denoted by $S_d$ and $T_d$, where $T_d = 1$ if $T \geq \theta_T$ (otherwise $T_d = 0$), and $S_d = 1$ if $S \geq \theta_S$ (otherwise $S_d = 0$). The joint distribution of $(S_d, T_d)$ can be summarized as a 2 × 2 contingency table.

TABLE 11.4 The General Situation for Outcomes of Binary Variables S and T with Probability of Occurrence

           T = 1     T = 0
S = 1      P11       P01       PS
S = 0      P10       P00       1 − PS
           PT        1 − PT    1

Note: Pkj is the probability that T = k and S = j, where j, k = 0 or 1.

The general situation is outlined in Table 11.4, where $P_{ij} = \text{Prob}(T_d = i, S_d = j)$. Each cell is a bivariate normal integral.
For example:

$$P_{00} = \text{Prob}(T_d = 0, S_d = 0) = \text{Prob}(T < \theta_T, S < \theta_S) = \int_{-\infty}^{\theta_T} \int_{-\infty}^{\theta_S} \Phi(t, s, r)\, ds\, dt$$
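To make the cell probabilities concrete, the sketch below evaluates them numerically in Python. It rests on assumptions not stated in the text: Φ(t, s, r) is taken to be the standard bivariate normal density (zero means, unit variances) with correlation r, the double integral is reduced to a single integral by conditioning on T, and that integral is approximated with Simpson's rule on a range truncated at −8. All function names are hypothetical.

```python
import math


def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def cell_p00(theta_t, theta_s, r, lower=-8.0, steps=2000):
    """Approximate Prob(T < theta_t, S < theta_s) for |r| < 1.

    Conditional on T = t, S is normal with mean r*t and variance 1 - r^2,
    so the inner integral over s is a univariate normal CDF.
    `steps` must be even for Simpson's rule.
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    if theta_t <= lower:
        return 0.0
    scale = math.sqrt(1.0 - r * r)

    def integrand(t):
        return std_normal_pdf(t) * std_normal_cdf((theta_s - r * t) / scale)

    h = (theta_t - lower) / steps
    total = integrand(lower) + integrand(theta_t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lower + k * h)
    return total * h / 3.0


def cell_probabilities(theta_t, theta_s, r):
    """Return the four cells P_ij = Prob(T_d = i, S_d = j) of Table 11.4."""
    p00 = cell_p00(theta_t, theta_s, r)
    below_t = std_normal_cdf(theta_t)   # Prob(T_d = 0) = 1 - P_T
    below_s = std_normal_cdf(theta_s)   # Prob(S_d = 0) = 1 - P_S
    return {
        "P00": p00,
        "P01": below_t - p00,                    # T_d = 0, S_d = 1
        "P10": below_s - p00,                    # T_d = 1, S_d = 0
        "P11": 1.0 - below_t - below_s + p00,
    }


# Both thresholds at zero with r = 0.5: the exact value of P00 is
# 1/4 + asin(0.5) / (2 * pi) = 1/3.
cells = cell_probabilities(0.0, 0.0, 0.5)
print(round(cells["P00"], 4))  # 0.3333
```

The remaining three cells follow from P00 and the two marginal probabilities, which is why only one integral has to be evaluated.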
The actual formula for the tetrachoric correlation coefficient is complex and contains an infinite series of terms. However, Pearson¹ provides an easy-to-use approximation² given by

$$\hat{\rho} = \cos\left(\frac{180^\circ}{1 + \sqrt{\dfrac{bc}{ad}}}\right)$$
where a, b, c, and d refer to the frequencies in a fourfold table in cells 11, 12, 21, and 22, respectively, and where row 1 and column 2 designate presence.
Consider Table 11.5, which provides information concerning whether a reputational risk event has occurred alongside an internal OR audit score.
TABLE 11.5 Reputational Risk Events and OR Internal Audit Score for 12 Fictional Banks

                          Recorded data                   Data mapping
Bank                      Reputational    OR audit        Reputational    OR audit
                          risk event      score           risk event      score
XYZ Bank                  No              Low             0               1
GIA Financials            No              Low             0               1
City FG Holdings          Yes             High            1               0
Financial Street Bank     Yes             High            1               0
FPG                       No              Low             0               1
Boston Regal              Yes             High            1               0
Imperial Crown            No              Low             0               1
Market DG                 Yes             Low             1               1
Coventry Provincial       No              High            0               0
Bank 10                   Yes             High            1               0
AG Swift Inc              Yes             Low             1               1
High Street Holdings      Yes             Low             1               1
For this table the tetrachoric correlation coefficient is equal to 0.58. A VBA function to calculate tetrachoric correlation is
Function Tetra(S As Range, T As Range)
    ' Function takes two binary ranges S and T
    ' Error checks
    If (S.Columns.Count > 1 Or T.Columns.Count > 1) Then
        Tetra = "Need only 1 column"
    ElseIf (S.Rows.Count < 10 Or T.Rows.Count < 10) Then
        Tetra = "Need at least 10 rows"
    ElseIf (S.Rows.Count <> T.Rows.Count) Then
        Tetra = "Need an equal number of rows"
    Else ' correlation calculation starts here
        Dim a As Integer
        Dim b As Integer
        Dim c As Integer
        Dim d As Integer
        Dim i As Integer
        a = 0
        b = 0
        c = 0
        d = 0
        Dim pi As Double
        pi = 3.14159265358979
        For i = 1 To S.Rows.Count
            If (S(i, 1) = 1 And (T(i, 1) = 1)) Then a = a + 1
            If (S(i, 1) = 1 And (T(i, 1) = 0)) Then b = b + 1
            If (S(i, 1) = 0 And (T(i, 1) = 1)) Then c = c + 1
            If (S(i, 1) = 0 And (T(i, 1) = 0)) Then d = d + 1
        Next i
        Tetra = Cos(pi / (1 + (Sqr((b * c) / (a * d)))))
    End If
End Function
The function Tetra() takes two columns which must be of equal length and have at least 10 rows. An example of the use of this function is given in the worksheet Tetrachoric. The worksheet combines the Tetra()
function with a simulation of the two binary variables “OR Audit Score”
and “Reputational Risk Event” for 12 fictional financial institutions. Press
<F9> to run the simulation.
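Note that when bc = ad, $\sqrt{bc/ad} = 1$, which means that the overall denominator is 2, the overall fraction is 90°, and $\hat{\rho} = \cos 90^\circ = 0$; therefore there is no correlation between S and T. When bc dominates over ad, the overall denominator is greater than 2, which means that the overall fraction is less than 90°, and the resulting estimate $\hat{\rho}$ is positive, as in Table 11.5, where bc = 16, ad = 3, and the estimate is 0.58. When ad dominates over bc, the overall fraction exceeds 90° and the estimate is negative.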
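The approximation can also be checked against Table 11.5 outside the workbook. The following Python sketch is illustrative only: the function name tetrachoric_approx and the list names are hypothetical, and the cell counting follows the Tetra() function above (a counts rows where both codes are 1, d rows where both are 0). Which column is passed as S and which as T is an assumption here; swapping them only exchanges b and c, which leaves the product bc unchanged.

```python
import math


def tetrachoric_approx(s_values, t_values):
    """Cosine approximation to the tetrachoric correlation.

    Assumption: cells are counted as in the VBA Tetra() function.
    """
    if len(s_values) != len(t_values):
        raise ValueError("both inputs must have the same length")
    a = b = c = d = 0
    for s, t in zip(s_values, t_values):
        if s == 1 and t == 1:
            a += 1
        elif s == 1 and t == 0:
            b += 1
        elif s == 0 and t == 1:
            c += 1
        elif s == 0 and t == 0:
            d += 1
        else:
            raise ValueError("values must be coded 0 or 1")
    if a * d == 0:
        # The ratio bc/ad is undefined when either a or d is zero.
        raise ValueError("approximation undefined when a * d = 0")
    return math.cos(math.pi / (1.0 + math.sqrt((b * c) / (a * d))))


# Data mapping columns transcribed from Table 11.5.
risk_event = [0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1]
audit_score = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1]

estimate = tetrachoric_approx(risk_event, audit_score)
print(round(estimate, 2))  # 0.58
```

For these data the counts are a = 3, b = 4, c = 4, and d = 1. Unlike the VBA function, the sketch stops with an error when a × d = 0 rather than attempting the division.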