Mathematical Biosciences, Vol. 167, Issue 2, October 2000

A new statistic for the analysis of association between trait and polymorphic marker loci

Vartan Choulakian (a), Smail Mahdi (b,*)

(a) Department of Mathematics and Statistics, Université de Moncton, Moncton, New Brunswick, Canada E1A 3E9

(b) Department of Computer Science, Mathematics and Physics, University of The West Indies, Cave Hill Campus, P.O. Box 64, Bridgetown, Barbados

Received 5 January 1999; received in revised form 31 January 2000; accepted 4 February 2000

Abstract

Inference for detecting the existence of an association between a diallelic marker and a trait locus is based on the chi-squared statistic with one degree of freedom. For polymorphic markers with $m$ alleles ($m > 2$), three approaches are mainly used in practice. First, one may use Pearson's chi-squared statistic with $m - 1$ degrees of freedom (d.f.), but this leads to a loss in test power. Second, one can select the allele that appears most associated and then collapse the other allele categories into a single class. This reduces the locus, in a biased way, to a diallelic system. Third, one may use the Terwilliger [J.D. Terwilliger, Am. J. Hum. Genet. 56 (1995) 777] likelihood ratio statistic, which has a non-standard, unknown limiting probability distribution.

In this paper, we propose a new statistic, $L_D$, based on the second testing approach. We derive the asymptotic probability distribution of $L_D$ in an easy way. Simulation studies show that $L_D$ is more powerful than Pearson's chi-squared statistic with $m - 1$ d.f. © 2000 Elsevier Science Inc. All rights reserved.

MSC: 62H15; 62P10

Keywords: Pearson's chi-squared statistic; Contingency tables; Association

1. Introduction

This paper addresses the statistical problem of testing for the homogeneity of two multinomial probability distributions that describe two populations: a diseased population denoted by $D$ and a control population denoted by $+$. The given data are cross-classified in an $m \times 2$ contingency table, where 2 represents the number of populations, namely, the diseased and the control, and where $m$ represents the number of allele types or categories. The numbers of chromosomes falling into category $i$ that are sampled from the populations $D$ and $+$ are denoted by $X_i$ and $Y_i$, respectively, for $i = 1, \ldots, m$. Therefore, the sample sizes allocated to populations $D$ and $+$ are denoted by $n_1 = \sum_{i=1}^{m} X_i$ and $n_2 = \sum_{i=1}^{m} Y_i$, respectively. Let $q_i$ and $r_i$ represent the probability that allele $i$ belongs to population $D$ and $+$, respectively. The goodness-of-fit test confronts the hypothesis $H_0$: '$q_i = r_i$ for $i = 1, \ldots, m$' against the alternative $H_D$: '$q_i > r_i$ for at least one $i$'.

* Corresponding author. Tel.: +1-264 417 4367; fax: +1-264 425 1327.
E-mail address: [email protected] (S. Mahdi).

The null hypothesis asserts that both populations are homogeneous with respect to the probability distribution of the $m$ alleles. The one-sided alternative hypothesis states that there is at least one allele excessively associated with the diseased population $D$. In the case of two alleles, Pearson's chi-squared statistic is sufficient to test the above hypothesis. However, if the number of alleles is greater than 2, Pearson's chi-squared statistic loses significant test power as the number of degrees of freedom (d.f.) increases. This has already been pointed out in Terwilliger [1]. Therefore, one of two alternative methods is used to test $H_0$. The first method is based on multiple comparisons. It reduces the $m \times 2$ table to a set of $m$ $2 \times 2$ tables in which each marker allele is separately tested, for association with the disease allele, by Pearson's chi-squared statistic with 1 d.f. The most significant chi-squared statistic is chosen as the test statistic and some sort of Bonferroni-type correction for multiple testing is then applied to calculate the overall p-value. The second method, proposed by Terwilliger [1], is based on a likelihood ratio statistic, $\Lambda$. However, $\Lambda$ has a non-standard probability distribution and is computationally intensive.
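The first, multiple-comparison method can be sketched as follows. This is only an illustration under assumptions: the text says "some sort of Bonferroni-type correction", so the plain Bonferroni rule used here (multiply the smallest p-value by $m$) is one possible choice, and the function names and the example counts are hypothetical.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared (1 d.f.) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def chi2_1df_sf(x):
    """Upper-tail probability of the chi-squared distribution with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))

def bonferroni_max_test(cases, controls):
    """Largest per-allele chi-squared and a Bonferroni-corrected p-value."""
    n1, n2 = sum(cases), sum(controls)
    # One 2x2 table per allele: that allele versus all other alleles pooled.
    stats = [chi2_2x2(x, n1 - x, y, n2 - y) for x, y in zip(cases, controls)]
    best = max(stats)
    # Plain Bonferroni correction (an assumption, see the lead-in).
    p_adj = min(1.0, len(stats) * chi2_1df_sf(best))
    return best, p_adj

# Hypothetical counts for three alleles, for illustration only.
best, p_adj = bonferroni_max_test([30, 40, 30], [20, 45, 35])
print(best, p_adj)
```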

In this paper, a new statistic, $L_D$, is proposed which is based on the multiple comparison approach. In the next section, we introduce the statistic $L_D$ and derive its asymptotic probability distribution. We present some simulation studies which show that the statistic $L_D$ leads to a more powerful test than Pearson's chi-squared test when the number of alleles is greater than 2. In Section 3, we apply the statistic $L_D$ to test for association in a real data set. Finally, a succinct conclusion is presented in Section 4.

Terwilliger's statistic, $\Lambda$, is very interesting, and it will be studied in a separate paper.

2. Statistic $L_D$

[...] with $n = n_1 + n_2$. We recall below the following well-known result.

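Lemma 2.1. Consider the null hypothesis $H_{0j}: q_j = r_j$ and the alternative $H_{1j}: q_j \neq r_j$ for a fixed $j$, $j = 1, \ldots, m$. If the sample sizes $n_1$ and $n_2$ are large enough, then under $H_{0j}$, the statistic $\tilde{q}_j - \tilde{r}_j$ is approximately a centered normal variable with variance $n h_j (1 - h_j)/(n_1 n_2)$. Therefore, the statistic

$$d_j = \frac{n_1 n_2 (\tilde{q}_j - \tilde{r}_j)^2}{n h_j (1 - h_j)} \tag{1}$$

is approximately distributed as chi-squared with 1 d.f.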

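Proof. Let $X_j$ and $Y_j$, for $j = 1, \ldots, m$, denote the number of alleles of type $j$ in the disease sample and in the control sample, respectively. Under the null hypothesis $H_0$, the variables $X_j$ and $Y_j$ are independent and have binomial distributions, that is,

$$X_j \sim B(n_1, h_j) \quad \text{and} \quad Y_j \sim B(n_2, h_j).$$

Therefore, $X_j/n_1$ and $Y_j/n_2$ are asymptotically normal variables with the same mean $h_j$ and corresponding variances $h_j(1 - h_j)/n_1$ and $h_j(1 - h_j)/n_2$. Thus, the variable $X_j/n_1 - Y_j/n_2$ has an asymptotically centered normal distribution with variance $h_j(1 - h_j)(1/n_1 + 1/n_2) = n h_j(1 - h_j)/(n_1 n_2)$.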

Remark 2.1. The statistic $d_j$ is Pearson's chi-squared statistic calculated from the $2 \times 2$ contingency table whose first row is the $j$th row of the original $m \times 2$ contingency table and whose second row is obtained by collapsing the $m - 1$ remaining rows of the $m \times 2$ contingency table.
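The equivalence stated in Remark 2.1 can be checked numerically. The sketch below assumes that $\tilde{q}_j = X_j/n_1$ and $\tilde{r}_j = Y_j/n_2$ are the sample allele frequencies and that $h_j$ is estimated by the pooled frequency $(X_j + Y_j)/n$; these definitions are not visible in the surviving text, and the function names are hypothetical.

```python
def d_stat(x_j, y_j, n1, n2):
    """d_j = n1*n2*(q~ - r~)^2 / (n*h*(1 - h)), h = pooled frequency (assumption)."""
    n = n1 + n2
    q, r = x_j / n1, y_j / n2
    h = (x_j + y_j) / n
    return n1 * n2 * (q - r) ** 2 / (n * h * (1 - h))

def pearson_2x2(x_j, y_j, n1, n2):
    """Pearson chi-squared of the collapsed 2x2 table, from observed and expected counts."""
    n = n1 + n2
    observed = [[x_j, y_j], [n1 - x_j, n2 - y_j]]  # row 1: allele j, row 2: all others
    row = [x_j + y_j, n - x_j - y_j]
    col = [n1, n2]
    return sum(
        (observed[i][k] - row[i] * col[k] / n) ** 2 / (row[i] * col[k] / n)
        for i in range(2)
        for k in range(2)
    )

# Allele 3 of Table 1: 40 of 130 case chromosomes, 39 of 136 control chromosomes.
print(d_stat(40, 39, 130, 136), pearson_2x2(40, 39, 130, 136))
```

For allele 3 of Table 1 both functions give about 0.139, in line with Table 2. Under this pooled-frequency reading, allele 1 gives about 15.23 rather than the reported 15.532, so the authors' exact definition may differ slightly from the one assumed here.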

In Lemma 2.1, the alternative hypothesis, $H_{1j}: q_j \neq r_j$, is two-sided. For the particular one-sided alternative hypothesis, $H_{Dj}$: '$q_j > r_j$ for a fixed $j$', the appropriate statistic is defined to be

$$J_j = \begin{cases} d_j & \text{if } \tilde{q}_j > \tilde{r}_j, \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Theorem 2.1. [...] positive $J_i$'s and $m/2 - 1$ identically null others. Moreover, the strictly positive ones are mutually independent and asymptotically distributed as chi-square with 1 d.f.

Proof. Consider the set of variables $J_1, \ldots, J_{m-1}$. From Remark 2.2, it follows that the number of strictly positive $J_i$'s varies from 1 to $m - 1$. Then, on average, there are $(1 + m - 1)/2 = m/2$ strictly positive $J_i$'s and $(m - 1) - m/2 = m/2 - 1$ null $J_i$'s. Furthermore, the positive $J_i$ variables belong to the set of the positive $d_i$ variables. Therefore, they are also mutually independent and asymptotically distributed as chi-square with 1 d.f.

We consider now the general goodness-of-fit problem: $H_0$: '$q_j = r_j$ for $j = 1, \ldots, m$' versus $H_D$: '$q_j > r_j$ for at least one $j$, $j = 1, \ldots, m$'. Note that the alternative hypothesis $H_D$ is equal to $\bigcup_{j=1}^{m} H_{Dj}$. Then, the appropriate statistic is

$$L_D = \max_{i = 1, \ldots, m-1} J_i. \tag{3}$$
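A minimal sketch of how the one-sided statistics and their maximum might be computed from the two rows of counts. It assumes that each one-sided statistic equals the per-allele chi-squared when the allele is more frequent among cases and zero otherwise, with the pooled frequency as the estimate of $h_j$. It also takes the maximum over all $m$ alleles, as Table 2 lists a value for every allele, whereas equation (3) indexes $i = 1, \ldots, m-1$. The function names and counts are hypothetical.

```python
def one_sided_stats(cases, controls):
    """Per-allele one-sided statistics: chi-squared if over-represented in cases, else 0."""
    n1, n2 = sum(cases), sum(controls)
    n = n1 + n2
    stats = []
    for x, y in zip(cases, controls):
        q, r = x / n1, y / n2
        if q > r:
            h = (x + y) / n  # pooled allele frequency (assumption)
            stats.append(n1 * n2 * (q - r) ** 2 / (n * h * (1 - h)))
        else:
            stats.append(0.0)
    return stats

def l_d(cases, controls):
    """Maximum of the one-sided statistics, taken over all alleles (assumption)."""
    return max(one_sided_stats(cases, controls))

# Hypothetical counts: only the first allele is more frequent among cases.
print(one_sided_stats([30, 40, 30], [20, 45, 35]))
print(l_d([30, 40, 30], [20, 45, 35]))
```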

To evaluate the asymptotic probability distribution of $L_D$, we need to make the following assumptions.

Let $V$ be the random variable representing the number of positive $J_i$'s out of the $m - 1$ variables. Let also $f(v)$, for $v = 1, \ldots, m - 1$, denote the probability mass function of $V$. We assume that $f(v)$ is a symmetric function about the mean $E(V) = m/2$. For instance, $V$ has the discrete uniform distribution. Under these assumptions, we state the following theorem.

Theorem 2.2. For any given large positive real number $\xi$, we have

$$\Pr(L_D > \xi) \simeq 1 - \{F(\xi)\}^{m/2}, \tag{4}$$

where $F$ is the cumulative distribution function of the $\chi^2_1$ variable.

Proof. By Theorem 2.1, the $J_i$ variables are mutually independent. Thus, $\Pr(L_D > \xi \mid V = l)$ [...] for large positive real $\xi$.
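The intermediate steps of this proof did not survive in the source. One plausible way to fill them, writing $\xi$ for the threshold and $\varepsilon = 1 - F(\xi)$, is sketched below. It is a reconstruction under the stated assumptions (independent positive statistics, $E(V) = m/2$) and uses a first-order expansion in $\varepsilon$; it is not the authors' own text.

```latex
\begin{align*}
\Pr(L_D \le \xi \mid V = l) &= \{F(\xi)\}^{l}
  && \text{(maximum of $l$ independent $\chi^2_1$ variables)} \\
\Pr(L_D > \xi) &= \sum_{l=1}^{m-1} f(l)\,\bigl[1 - \{F(\xi)\}^{l}\bigr] \\
  &\approx \varepsilon \sum_{l=1}^{m-1} l\, f(l) = \tfrac{m}{2}\,\varepsilon
  && \text{since } 1 - (1 - \varepsilon)^{l} \approx l\,\varepsilon \text{ for small } \varepsilon \\
  &\approx 1 - \{F(\xi)\}^{m/2}.
\end{align*}
```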

To see if formula (4) provides a good approximation to the asymptotic probability distribution of $L_D$ under $H_0$, we conducted a simulation study. Several values of $m$ are considered, specifically, $m = 2, 3, 4, 5, 10, 20$. Table 3, in Appendix A, presents the empirical p-values, expressed in percentages, which are obtained from 5000 simulations under $H_0$. The cell frequencies are generated with equal probability and $n_1 = n_2 = 100$, as in [1]. We see that the empirical p-values associated [...] p-values. In the case $m = 2$, $L_D$ equals Pearson's chi-squared statistic. Furthermore, a power study for the statistics $L_D$ and $\chi^2$ is conducted, and the simulation results obtained are presented in Table 4, given in Appendix A. The data in Table 4 illustrate the situation under $H_D$ where $q_1 = 0.65$, $r_1 = 0.5$, $q_j = 0.35/(m - 1)$ and $r_j = 0.5/(m - 1)$ for $j = 2, \ldots, m$.
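The null simulation described above could be reproduced along the following lines. This is a rough sketch, not the authors' program: it assumes equally likely alleles, $n_1 = n_2 = 100$, a maximum taken over all $m$ alleles, and a cut-off obtained by inverting formula (4). The function names, the seed and the random number generator are arbitrary choices.

```python
import random
from statistics import NormalDist

def l_d(cases, controls):
    """Largest one-sided statistic, taken over all alleles (an assumption)."""
    n1, n2 = sum(cases), sum(controls)
    n = n1 + n2
    best = 0.0
    for x, y in zip(cases, controls):
        q, r = x / n1, y / n2
        if q > r:  # only alleles over-represented among cases contribute
            h = (x + y) / n  # pooled allele frequency
            best = max(best, n1 * n2 * (q - r) ** 2 / (n * h * (1 - h)))
    return best

def cutoff(target, m):
    """Threshold at which formula (4) equals the target p-value."""
    f = (1.0 - target) ** (2.0 / m)          # required value of F(xi)
    z = NormalDist().inv_cdf((1.0 + f) / 2)  # chi-squared(1) quantile is z squared
    return z * z

def empirical_rate(m, target, n1=100, n2=100, n_sim=5000, seed=0):
    """Fraction of null simulations in which the statistic exceeds the cut-off."""
    rng = random.Random(seed)
    xi = cutoff(target, m)
    hits = 0
    for _ in range(n_sim):
        cases, controls = [0] * m, [0] * m
        for _ in range(n1):
            cases[rng.randrange(m)] += 1     # equally likely alleles under H0
        for _ in range(n2):
            controls[rng.randrange(m)] += 1
        hits += l_d(cases, controls) > xi
    return hits / n_sim

print(empirical_rate(m=4, target=0.05))
```

If formula (4) is a good approximation, the printed rate should be close to the target, which is what the rows of Table 3 are meant to check.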

Table 1
Frequencies of four alleles in a sample of 130 cases and 136 controls

Allele       1     2     3     4
Case        46    10    40    34
Control     20    19    39    58

Table 2
$J_i$ values evaluated from Table 1

Allele table    1         2    3        4
$J_i$           15.532    0    0.139    0

Table 3
Empirical p-values in % obtained from 5000 simulations with equally likely categories

Target p-values    40%      30%      20%      10%      5%      1%      0.1%

m = 2
  $L_D$            42.46    26.94    17.52     9.62    5.24    0.84    0.12
  $\chi^2$         42.46    26.94    17.52     9.62    5.24    0.82    0.12

m = 3
  $L_D$            41.18    30.04    21.20     9.58    5.08    0.98    0.12
  $\chi^2$         39.96    29.94    19.58     9.82    4.82    1.06    0.06

m = 4
  $L_D$            42.52    31.62    20.96    10.44    5.32    1.06    0.12
  $\chi^2$         38.78    29.02    19.92    10.16    4.78    1.14    0.04

m = 5
  $L_D$            45.02    33.50    21.54    11.18    5.36    0.90    0.14
  $\chi^2$         39.88    29.80    20.02     9.92    4.90    1.14    0.14

m = 10
  $L_D$            44.10    33.38    21.60    10.70    5.44    1.28    0.12
  $\chi^2$         40.02    30.36    19.72     9.80    4.28    0.80    0.04

m = 20
  $L_D$            44.60    26.68    19.60     5.84    2.84    0.40    0.00
  $\chi^2$         34.24    24.20    14.00     5.70    2.18    0.24    0.00

The simulation globally showed that $L_D$ has uniformly greater power than the $\chi^2$ statistic. This confirms the well-known fact that one-sided alternative hypotheses lead to more powerful tests (see, e.g., [2]).

3. Case study

To see how our approach works, we analyzed a data set given in [3] for which a significant association with the first allele has been found by a parametric analysis [3]. The data are summarized in Table 1.

The computed values of $J_i$, $i = 1, \ldots, 4$, for the $2 \times 2$ tables constructed from Table 1 are displayed in Table 2.

From Table 2, we deduce that $L_D = 15.532$ with a corresponding p-value of 0.00016. Therefore, we conclude that the test is not significant at the level $\alpha = 0.0001$. Note that $\alpha = 0.0001$ is the commonly used value in linkage analysis; see, e.g., [4] and references therein. On the other hand, Pearson's chi-squared test statistic has the value 19.18. With 3 d.f., the p-value is 0.00025; so at the level $\alpha = 0.0001$, the null hypothesis is not rejected.
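The two p-values quoted above can be re-derived with a few lines of code. The sketch below computes Pearson's chi-squared statistic from the counts in Table 1 and its upper-tail probability with 3 d.f., then applies formula (4) to the reported maximum of 15.532 with $m = 4$. The function names are hypothetical, and the closed-form tail expressions are standard results for 1 and 3 degrees of freedom rather than anything taken from the paper.

```python
import math

def pearson_chi2(cases, controls):
    """Pearson chi-squared statistic of the m x 2 table of allele counts."""
    n1, n2 = sum(cases), sum(controls)
    n = n1 + n2
    total = 0.0
    for x, y in zip(cases, controls):
        allele_total = x + y
        for observed, size in ((x, n1), (y, n2)):
            expected = allele_total * size / n
            total += (observed - expected) ** 2 / expected
    return total

def chi2_sf_3df(x):
    """Upper-tail probability of the chi-squared distribution with 3 d.f."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

def ld_pvalue(xi, m):
    """Formula (4): 1 - F(xi)^(m/2), with F the chi-squared(1) distribution function."""
    if xi <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(xi / 2.0))  # 1 - F(xi)
    return -math.expm1((m / 2.0) * math.log1p(-tail))

cases = [46, 10, 40, 34]      # Table 1, case row
controls = [20, 19, 39, 58]   # Table 1, control row

chi2 = pearson_chi2(cases, controls)
print(chi2, chi2_sf_3df(chi2))   # about 19.18 and about 0.00025
print(ld_pvalue(15.532, m=4))    # about 0.00016
```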

Table 4
Empirical power values in % obtained from 5000 simulations in the situation $q_1 = 0.65$, $r_1 = 0.5$, $q_j = 0.35/(m - 1)$ and $r_j = 0.5/(m - 1)$ for $j = 2, \ldots, m$ (a)

Target p-values    40%      30%      20%      10%      5%       1%       0.1%

m = 2
  $L_D$            92.06    86.98    82.46    70.54    59.40    33.74    13.02
  $\chi^2$         92.06    86.98    82.46    70.54    59.40    33.74    13.02

m = 3
  $L_D$            88.62    84.30    75.60    65.42    53.86    30.80    12.18
  $\chi^2$         87.10    82.26    74.24    61.30    48.66    26.94     8.76

m = 4
  $L_D$            84.76    80.58    71.04    59.10    47.64    26.86     9.58
  $\chi^2$         83.60    77.44    68.30    53.72    40.52    19.96     5.92

m = 5
  $L_D$            82.44    76.94    69.58    56.14    43.94    23.76     8.78
  $\chi^2$         81.62    74.60    65.14    50.12    37.04    17.02     4.32

m = 10
  $L_D$            75.32    68.06    56.46    42.80    34.62    17.90     5.86
  $\chi^2$         70.84    61.50    48.64    31.78    20.10     6.50     0.66

m = 20
  $L_D$            47.00    38.68    28.54    19.00    12.50     3.96     0.66
  $\chi^2$         10.76     5.80     2.42     0.74     0.18     0.00     0.00

(a) The parameter $m$ represents the number of allele categories, [...]

4. Conclusion

We have briefly reviewed the most common statistical methods in linkage analysis that test for association in case-control contingency tables. The Pearson and multiple comparison methods are based on a chi-squared limiting distribution. Terwilliger's [1] likelihood ratio statistic $\Lambda$ is computationally intensive and based on an unknown non-standard asymptotic distribution. We have proposed in this paper a new statistic, $L_D$. Its probability distribution is easily evaluated, and simulation studies have shown that it is more powerful than the classical Pearson chi-squared test.

Acknowledgements

Thanks are due to the Editor and Referees for useful comments which led to major improvements in the paper. V.C.'s research was supported by a grant from the Natural Science and Research Council of Canada. S.M.'s research was supported by a UWI Study and Travel grant.

Appendix A

See Tables 3 and 4.

References

[1] J.D. Terwilliger, A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci, Am. J. Hum. Genet. 56 (1995) 777.

[2] D.I. Tang, S.P. Lin, An approximate likelihood ratio test for comparing several treatments to a control, J. Am. Statist. Assoc. 92 (1997) 1155.

[3] J.D. Terwilliger, J. Ott, Handbook of Human Genetic Linkage, Johns Hopkins University, Baltimore and London, 1992.