A new statistic for the analysis of association between trait and
polymorphic marker loci
Vartan Choulakian a, Smail Mahdi b,*
a Department of Mathematics and Statistics, Université de Moncton, Moncton, New Brunswick, Canada E1A 3E9
b Department of Computer Science, Mathematics and Physics, University of the West Indies, Cave Hill Campus, P.O. Box 64, Bridgetown, Barbados
Received 5 January 1999; received in revised form 31 January 2000; accepted 4 February 2000
Abstract
Inference for detecting the existence of an association between a diallelic marker and a trait locus is based on the chi-squared statistic with one degree of freedom. For polymorphic markers with m alleles (m > 2), three approaches are mainly used in practice. First, one may use Pearson's chi-squared statistic with m − 1 degrees of freedom (d.f.), but this leads to a loss of test power. Second, one can select the allele that appears most associated and then collapse the other allele categories into a single class. This reduces the locus, in a biased way, to a diallelic system. Third, one may use the Terwilliger [J.D. Terwilliger, Am. J. Hum. Genet. 56 (1995) 777] likelihood ratio statistic, which has a non-standard, unknown limiting probability distribution.
In this paper, we propose a new statistic, LD, based on the second testing approach. We derive the asymptotic probability distribution of LD in an easy way. Simulation studies show that LD is more powerful than Pearson's chi-squared statistic with m − 1 d.f. © 2000 Elsevier Science Inc. All rights reserved.
MSC: 62H15; 62P10
Keywords: Pearson's chi-squared statistic; Contingency tables; Association
*Corresponding author. Tel.: +1-264 417 4367; fax: +1-264 425 1327.
E-mail address: [email protected] (S. Mahdi).
0025-5564/00/$ - see front matter © 2000 Elsevier Science Inc. All rights reserved.
1. Introduction
This paper addresses the statistical problem of testing for the homogeneity of two multinomial probability distributions that describe two populations: a diseased population, denoted by D, and a control population. The given data are cross-classified in an m × 2 contingency table, where 2 represents the number of populations, namely the diseased and the control, and m represents the number of allele types or categories. The numbers of chromosomes falling into category i that are sampled from the diseased and the control populations are denoted by X_i and Y_i, respectively, for i = 1, …, m. The corresponding sample sizes are n1 = X_1 + … + X_m and n2 = Y_1 + … + Y_m. Let q_i and r_i represent the probability that allele i belongs to the diseased and to the control population, respectively. The goodness-of-fit test confronts the hypothesis H0: 'q_i = r_i for i = 1, …, m' against the alternative HD: 'q_i > r_i for at least one i'.
The null hypothesis asserts that both populations are homogeneous with respect to the probability distribution of the m alleles. The one-sided alternative hypothesis states that there is at least one allele excessively associated with the diseased population D. In the case of two alleles, Pearson's chi-squared statistic is sufficient to test the above hypothesis. However, if the number of alleles is greater than 2, Pearson's chi-squared statistic loses significant test power as the number of degrees of freedom (d.f.) increases. This has already been pointed out by Terwilliger [1]. Therefore, one of two alternative methods is used to test H0. The first method is based on multiple comparisons. It reduces the m × 2 table to a set of m tables of size 2 × 2, in which each marker allele is separately tested for association with the disease allele by Pearson's chi-squared statistic with 1 d.f. The largest (most significant) chi-squared statistic is chosen as the test statistic, and a Bonferroni-type correction for multiple testing is then applied to calculate the overall p-value. The second method, proposed by Terwilliger [1], is based on a likelihood ratio statistic, Λ. However, Λ has a non-standard probability distribution and is computationally intensive.
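The multiple-comparison method described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the counts are hypothetical, the per-allele p-value is simply multiplied by m as a plain Bonferroni correction, and the 1-d.f. tail probability uses the identity Pr(χ²(1) > x) = erfc(√(x/2)).

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-squared variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))

def pearson_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def bonferroni_max_test(cases, controls):
    """Largest per-allele statistic and its Bonferroni-corrected p-value."""
    n1, n2 = sum(cases), sum(controls)
    # Allele j versus all remaining alleles collapsed into one class.
    stats = [pearson_2x2(x, y, n1 - x, n2 - y) for x, y in zip(cases, controls)]
    best = max(stats)
    return best, min(1.0, len(stats) * chi2_sf_1df(best))

# Hypothetical counts for m = 3 alleles (not taken from the paper).
cases = [30, 40, 30]
controls = [20, 45, 35]
best, p_adj = bonferroni_max_test(cases, controls)
print(f"max chi-squared = {best:.3f}, corrected p-value = {p_adj:.3f}")
```

The correction multiplies the smallest per-allele p-value by the number of tables, which is conservative when the m statistics are dependent, as they are here.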
In this paper, a new statistic, LD, is proposed which is based on the multiple comparison approach. In the next section, we introduce the statistic LD and derive its asymptotic probability distribution. We present some simulation studies which show that the statistic LD leads to a more powerful test than Pearson's chi-squared test when the number of alleles is greater than 2. In Section 3, we apply the statistic LD to test for association in a real data set. Finally, a succinct conclusion is presented in Section 4.
Terwilliger's statistic, Λ, is very interesting, and it will be studied in a separate paper.
2. Statistic LD
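Let q̃_j = X_j/n1 and r̃_j = Y_j/n2 denote the sample frequencies of allele j in the disease and control samples, with n = n1 + n2. We recall the following well-known result.
Lemma 2.1. Consider the null hypothesis H0j: q_j = r_j and the alternative H1j: q_j ≠ r_j for a fixed j, j = 1, …, m, and let θ_j denote the common value of q_j and r_j under H0j. If the sample sizes n1 and n2 are large enough, then under H0j the statistic q̃_j − r̃_j is approximately a centered normal variable with variance nθ_j(1 − θ_j)/(n1 n2). Therefore, the statistic
$$ d_j = \frac{n_1 n_2 (\tilde{q}_j - \tilde{r}_j)^2}{n\,\theta_j (1 - \theta_j)} $$
is approximately distributed as chi-squared with 1 d.f.
Proof. Let X_j and Y_j, for j = 1, …, m, denote the number of alleles of type j in the disease sample and in the control sample, respectively. Under the null hypothesis H0, the variables X_j and Y_j are independent and have binomial distributions, that is,
$$ X_j \sim B(n_1, \theta_j) \quad \text{and} \quad Y_j \sim B(n_2, \theta_j). $$
Therefore, X_j/n1 and Y_j/n2 are asymptotically normal variables with the same mean θ_j and corresponding variances θ_j(1 − θ_j)/n1 and θ_j(1 − θ_j)/n2. Thus, the variable X_j/n1 − Y_j/n2 has an asymptotically normal distribution with mean 0 and variance θ_j(1 − θ_j)/n1 + θ_j(1 − θ_j)/n2 = nθ_j(1 − θ_j)/(n1 n2).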
Remark 2.1. The statistic d_j is Pearson's chi-squared statistic calculated from the 2 × 2 contingency table, where the first row of the 2 × 2 table is the jth row of the original m × 2 contingency table and its second row is obtained by collapsing the m − 1 remaining rows of the m × 2 contingency table.
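Remark 2.1 can be checked numerically. The sketch below assumes that θ_j in d_j is replaced by the pooled sample frequency (X_j + Y_j)/n, which is what makes the closed form coincide with Pearson's statistic on the collapsed table. The function names are illustrative, and the counts are those of Table 1.

```python
def d_closed_form(x_j, y_j, n1, n2):
    """d_j with theta_j replaced by the pooled frequency (an assumption)."""
    n = n1 + n2
    theta_hat = (x_j + y_j) / n
    diff = x_j / n1 - y_j / n2
    return n1 * n2 * diff ** 2 / (n * theta_hat * (1.0 - theta_hat))

def pearson_collapsed(x_j, y_j, n1, n2):
    """Sum of (observed - expected)^2 / expected over the collapsed 2 x 2 table."""
    n = n1 + n2
    observed = [[x_j, y_j], [n1 - x_j, n2 - y_j]]  # allele j, then all others
    col_totals = [n1, n2]                          # cases, controls
    total = 0.0
    for row in observed:
        row_total = sum(row)
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            total += (obs - expected) ** 2 / expected
    return total

cases = [46, 10, 40, 34]
controls = [20, 19, 39, 58]
n1, n2 = sum(cases), sum(controls)
for j, (x, y) in enumerate(zip(cases, controls), start=1):
    print(j, round(d_closed_form(x, y, n1, n2), 3),
          round(pearson_collapsed(x, y, n1, n2), 3))
```

The two columns printed for each allele agree, which is the content of the remark under the pooled-estimate assumption.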
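In Lemma 2.1, the alternative hypothesis, H1j: q_j ≠ r_j, is two-sided. For the particular one-sided alternative hypothesis, HDj: 'q_j > r_j for a fixed j', the appropriate statistic is defined to be
$$ J_j = \begin{cases} d_j & \text{if } \tilde{q}_j > \tilde{r}_j, \\ 0 & \text{otherwise.} \end{cases} $$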
Theorem 2.1. Among the variables J_1, …, J_{m−1} there are, on average, m/2 strictly positive J_i's and m/2 − 1 identically null others. Moreover, the strictly positive ones are mutually independent and asymptotically distributed as chi-squared with 1 d.f.
Proof. Consider the set of variables J_1, …, J_{m−1}. From Remark 2.2, it follows that the number of strictly positive J_i's varies from 1 to m − 1. Then, on average, there are (1 + m − 1)/2 = m/2 strictly positive J_i's and m − 1 − m/2 = m/2 − 1 null J_i's. Furthermore, the positive J_i variables belong to the set of the positive d_i variables. Therefore, they are also mutually independent and asymptotically distributed as chi-squared with 1 d.f.
We consider now the general goodness-of-fit problem: H0: 'q_j = r_j for j = 1, …, m' versus HD: 'q_j > r_j for at least one j, j = 1, …, m'. Note that the alternative hypothesis HD is equal to the union of the HDj over j = 1, …, m. Then, the appropriate statistic is
$$ L_D = \max_{i = 1, \ldots, m-1} J_i. \qquad (3) $$
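As an illustration of (3), the sketch below evaluates the J_j values and LD for the counts of Table 1. It rests on three assumptions that are not spelled out in the surviving text: J_j equals d_j when the case frequency of allele j exceeds its control frequency and is 0 otherwise, θ_j is estimated by the pooled frequency, and the maximum runs over all m alleles (for these data the maximum over the first m − 1 alleles is the same). Function names are illustrative.

```python
def j_statistics(cases, controls):
    """One-sided per-allele statistics J_1, ..., J_m (assumed definition)."""
    n1, n2 = sum(cases), sum(controls)
    n = n1 + n2
    values = []
    for x, y in zip(cases, controls):
        diff = x / n1 - y / n2          # case frequency minus control frequency
        pooled = (x + y) / n            # pooled estimate of theta_j
        if diff > 0 and 0 < pooled < 1:
            values.append(n1 * n2 * diff ** 2 / (n * pooled * (1 - pooled)))
        else:
            values.append(0.0)          # allele not over-represented in cases
    return values

def ld_statistic(cases, controls):
    """LD taken as the largest J_j."""
    return max(j_statistics(cases, controls))

cases = [46, 10, 40, 34]
controls = [20, 19, 39, 58]
j_values = j_statistics(cases, controls)
print([round(v, 3) for v in j_values], round(ld_statistic(cases, controls), 3))
```

Under these assumptions the sketch gives 0 for alleles 2 and 4 and about 0.139 for allele 3, in line with Table 2. For allele 1 it gives about 15.23 rather than the tabulated 15.532, so the tabulated value may rely on a detail that this reading does not capture.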
To evaluate the asymptotic probability distribution of LD, we need to make the following
assumptions.
Let V be the random variable representing the number of positive J_i's out of the m − 1 variables. Let also f(v), for v = 1, …, m − 1, denote the probability mass function of V. We assume that f(v) is a symmetric function about the mean E(V) = m/2. For instance, V has the discrete uniform distribution. Under these assumptions, we state the following theorem.
Theorem 2.2. For any given large positive real number ξ, we have
$$ \Pr(L_D > \xi) \simeq 1 - \{F(\xi)\}^{m/2}, \qquad (4) $$
where F is the cumulative distribution function of the χ²(1) variable.
Proof. By Theorem 2.1, the J_i's are mutually independent. Thus, Pr(LD > ξ | V = l) = 1 − {F(ξ)}^l, and averaging over the distribution of V leads to (4) for large positive real ξ.
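One possible way to see where approximation (4) comes from is sketched below. It is a heuristic offered here, not necessarily the authors' argument. Writing ξ for the threshold in Theorem 2.2, it assumes that, given V = l, the statistic is the maximum of l independent χ²(1) variables, and it uses a first-order expansion in the small tail probability ε = 1 − F(ξ). The symbol ε is introduced only for this sketch.

```latex
\begin{align*}
\Pr(L_D > \xi \mid V = l) &= 1 - \{F(\xi)\}^{l}
  \approx l\,\varepsilon, \qquad \varepsilon = 1 - F(\xi), \\
\Pr(L_D > \xi) &= \sum_{l=1}^{m-1} f(l)\,\Pr(L_D > \xi \mid V = l)
  \approx \varepsilon \sum_{l=1}^{m-1} l\, f(l)
  = \frac{m}{2}\,\varepsilon
  \approx 1 - \{F(\xi)\}^{m/2}.
\end{align*}
```

The symmetry of f about m/2 enters only through E(V) = m/2, and both approximation steps drop terms of order ε², which is why the result is stated for large ξ.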
To see if formula (4) provides a good approximation to the asymptotic probability distribution of LD under H0, we conducted a simulation study. Several values of m are considered, specifically m = 2, 3, 4, 5, 10, 20. Table 3, in Appendix A, presents the empirical p-values, expressed as percentages, obtained from 5000 simulations under H0. The cell frequencies are generated with equal probability and n1 = n2 = 100, as in [1]. We see that the empirical p-values associated [...] p-values. In the case m = 2, LD equals Pearson's chi-squared statistic. Furthermore, a power study for the statistics LD and χ² was conducted, and the simulation results are presented in Table 4, given in Appendix A. The data in Table 4 illustrate the situation under HD where q_1 = 0.65, r_1 = 0.5, q_j = 0.35/(m − 1) and r_j = 0.5/(m − 1) for j = 2, …, m.
Table 1
Frequencies of four alleles in a sample of 130 cases and 136 controls
Allele      1     2     3     4
Case        46    10    40    34
Control     20    19    39    58
Table 2
J_i values evaluated from Table 1
Allele      1         2     3        4
J_i         15.532    0     0.139    0
Table 3
Empirical p-values in % obtained from 5000 simulations with equally likely categories^a
Target p-values    40%      30%      20%      10%      5%       1%       0.1%
m = 2
  LD               42.46    26.94    17.52     9.62    5.24     0.84     0.12
  χ²               42.46    26.94    17.52     9.62    5.24     0.82     0.12
m = 3
  LD               41.18    30.04    21.20     9.58    5.08     0.98     0.12
  χ²               39.96    29.94    19.58     9.82    4.82     1.06     0.06
m = 4
  LD               42.52    31.62    20.96    10.44    5.32     1.06     0.12
  χ²               38.78    29.02    19.92    10.16    4.78     1.14     0.04
m = 5
  LD               45.02    33.50    21.54    11.18    5.36     0.90     0.14
  χ²               39.88    29.80    20.02     9.92    4.90     1.14     0.14
m = 10
  LD               44.10    33.38    21.60    10.70    5.44     1.28     0.12
  χ²               40.02    30.36    19.72     9.80    4.28     0.80     0.04
m = 20
  LD               44.60    26.68    19.60     5.84    2.84     0.40     0.00
  χ²               34.24    24.20    14.00     5.70    2.18     0.24     0.00
^a The parameter m represents the number of allele categories.
Overall, the simulation showed that LD is at least as powerful as the χ² statistic, and strictly more powerful in every case with m > 2 (Table 4). This confirms the well-known fact that one-sided alternative hypotheses lead to more powerful tests (see, e.g., [2]).
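The null simulation described in this section can be approximated with a short Monte Carlo sketch. Everything in it is an assumption made for illustration: equally likely alleles with n1 = n2 = 100, the one-sided reading of J_j with a pooled frequency estimate, the maximum taken over all m alleles (so that the m = 2 case coincides with Pearson's statistic), and fewer replicates than the 5000 used in the paper. It is not expected to reproduce Table 3 digit for digit.

```python
import math
import random

def ld_statistic(cases, controls):
    """Largest one-sided per-allele statistic (assumed definition of LD)."""
    n1, n2 = sum(cases), sum(controls)
    n = n1 + n2
    best = 0.0
    for x, y in zip(cases, controls):
        diff = x / n1 - y / n2
        pooled = (x + y) / n
        if diff > 0 and 0 < pooled < 1:
            best = max(best, n1 * n2 * diff ** 2 / (n * pooled * (1 - pooled)))
    return best

def ld_pvalue(xi, m):
    """Approximation (4): 1 - F(xi)^(m/2), with F the chi-squared(1) c.d.f."""
    tail = math.erfc(math.sqrt(xi / 2.0))   # 1 - F(xi)
    if tail >= 1.0:
        return 1.0
    return -math.expm1((m / 2.0) * math.log1p(-tail))

def sample_counts(rng, m, size):
    """Multinomial counts for `size` draws over m equally likely alleles."""
    counts = [0] * m
    for allele in rng.choices(range(m), k=size):
        counts[allele] += 1
    return counts

def empirical_level(m, alpha, reps=2000, size=100, seed=1):
    """Share of null replicates whose approximate p-value is at most alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        cases = sample_counts(rng, m, size)
        controls = sample_counts(rng, m, size)
        if ld_pvalue(ld_statistic(cases, controls), m) <= alpha:
            rejections += 1
    return rejections / reps

level = empirical_level(m=4, alpha=0.05)
print(f"empirical level at 5% for m = 4: {100 * level:.2f}%")
```

Changing the allele probabilities passed to the sampler to unequal values for cases and controls would turn the same loop into a power study of the kind summarized in Table 4.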
3. Case study
To see how our approach works, we analyzed a data set given in [3], for which a significant association with the first allele has been found by a parametric analysis [3]. The data are summarized in Table 1.
The computed values of J_i, i = 1, …, 4, for the 2 × 2 tables constructed from Table 1 are displayed in Table 2.
From Table 2, we deduce that LD = 15.532 with a corresponding p-value of 0.00016. Therefore, we conclude that the test is not significant at the level α = 0.0001. Note that α = 0.0001 is the commonly used value in linkage analysis; see, e.g., [4] and references therein. On the other hand, Pearson's chi-squared test statistic has the value 19.18. With 3 d.f., the p-value is 0.00025; so at the level α = 0.0001, the null hypothesis is not rejected.
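The two p-values quoted above can be recomputed from the reported statistics. The sketch uses approximation (4) for LD with m = 4 and the closed-form upper tail of the chi-squared distribution with 3 d.f. The function names are illustrative, and the inputs 15.532 and 19.18 are the values reported in the text.

```python
import math

def chi2_sf_1df(x):
    """Upper tail of the chi-squared distribution with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))

def chi2_sf_3df(x):
    """Upper tail for 3 d.f.: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

def ld_pvalue(xi, m):
    """Approximation (4): 1 - F(xi)^(m/2), computed stably for small tails."""
    tail = chi2_sf_1df(xi)
    if tail >= 1.0:
        return 1.0
    return -math.expm1((m / 2.0) * math.log1p(-tail))

p_ld = ld_pvalue(15.532, m=4)
p_pearson = chi2_sf_3df(19.18)
print(f"LD p-value: {p_ld:.5f}")
print(f"Pearson p-value (3 d.f.): {p_pearson:.5f}")
```

Both values exceed α = 0.0001, which matches the conclusion that neither test rejects at that level.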
Table 4
Empirical power values in % obtained from 5000 simulations in the situation q_1 = 0.65, r_1 = 0.5, q_j = 0.35/(m − 1) and r_j = 0.5/(m − 1) for j = 2, …, m^a
Target p-values    40%      30%      20%      10%      5%       1%       0.1%
m = 2
  LD               92.06    86.98    82.46    70.54    59.40    33.74    13.02
  χ²               92.06    86.98    82.46    70.54    59.40    33.74    13.02
m = 3
  LD               88.62    84.30    75.60    65.42    53.86    30.80    12.18
  χ²               87.10    82.26    74.24    61.30    48.66    26.94     8.76
m = 4
  LD               84.76    80.58    71.04    59.10    47.64    26.86     9.58
  χ²               83.60    77.44    68.30    53.72    40.52    19.96     5.92
m = 5
  LD               82.44    76.94    69.58    56.14    43.94    23.76     8.78
  χ²               81.62    74.60    65.14    50.12    37.04    17.02     4.32
m = 10
  LD               75.32    68.06    56.46    42.80    34.62    17.90     5.86
  χ²               70.84    61.50    48.64    31.78    20.10     6.50     0.66
m = 20
  LD               47.00    38.68    28.54    19.00    12.50     3.96     0.66
  χ²               10.76     5.80     2.42     0.74     0.18     0.00     0.00
^a The parameter m represents the number of allele categories.
4. Conclusion
We have briefly reviewed the most common statistical methods in linkage analysis that test for association in case–control contingency tables. The Pearson and multiple comparison methods are based on a chi-squared limiting distribution. Terwilliger's [1] likelihood ratio statistic Λ is computationally intensive and based on an unknown non-standard asymptotic distribution. We have proposed in this paper a new statistic, LD. Its probability distribution is easily evaluated, and simulation studies have shown that it is more powerful than the classical Pearson chi-squared test.
Acknowledgements
Thanks are due to the Editor and Referees for useful comments which led to major improvements in the paper. V.C.'s research was supported by a grant from the Natural Science and Research Council of Canada. S.M.'s research was supported by a UWI Study and Travel grant.
Appendix A
See Tables 3 and 4.
References
[1] J.D. Terwilliger, A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci, Am. J. Hum. Genet. 56 (1995) 777.
[2] D.I. Tang, S.P. Lin, An approximate likelihood ratio test for comparing several treatments to a control, J. Am. Statist. Assoc. 92 (1997) 1155.
[3] J.D. Terwilliger, J. Ott, Handbook of Human Genetic Linkage, Johns Hopkins University Press, Baltimore and London, 1992.