
A new statistic for the analysis of association between trait and polymorphic marker loci

Vartan Choulakian a, Smail Mahdi b,*

a Department of Mathematics and Statistics, Université de Moncton, Moncton, New Brunswick, Canada E1A 3E9

b Department of Computer Science, Mathematics and Physics, University of the West Indies, Cave Hill Campus, P.O. Box 64, Bridgetown, Barbados

Received 5 January 1999; received in revised form 31 January 2000; accepted 4 February 2000

Abstract

Inference for detecting the existence of an association between a diallelic marker and a trait locus is based on the chi-squared statistic with one degree of freedom. For polymorphic markers with $m$ alleles ($m > 2$), three approaches are mainly used in practice. First, one may use Pearson's chi-squared statistic with $m - 1$ degrees of freedom (d.f.), but this leads to a loss in test power. Second, one can select the allele that appears most associated and collapse the other allele categories into a single class; this reduces the locus to a diallelic system, but in a biased way. Third, one may use the Terwilliger [J.D. Terwilliger, Am. J. Hum. Genet. 56 (1995) 777] likelihood ratio statistic, which has a non-standard, unknown limiting probability distribution.

In this paper, we propose a new statistic, $L_D$, based on the second testing approach. We derive the asymptotic probability distribution of $L_D$ in an easy way. Simulation studies show that $L_D$ is more powerful than Pearson's chi-squared statistic with $m - 1$ d.f. © 2000 Elsevier Science Inc. All rights reserved.

MSC: 62H15; 62P10

Keywords: Pearson's chi-squared statistic; Contingency tables; Association

1. Introduction

This paper addresses the statistical problem of testing for the homogeneity of two multinomial probability distributions that describe two populations: a diseased population denoted by $D$ and a control population denoted by $+$. The given data are cross-classified in an $(m \times 2)$ contingency table, where 2 represents the number of populations, namely the diseased and the control, and $m$ represents the number of allele types or categories. The numbers of chromosomes falling into category $i$ that are sampled from populations $D$ and $+$ are denoted by $X_i$ and $Y_i$, respectively, for $i = 1, \ldots, m$. Therefore, the sample sizes allocated to populations $D$ and $+$ are denoted by $n_1 = \sum_i X_i$ and $n_2 = \sum_i Y_i$. Let $q_i$ represent the probability that allele $i$ belongs to population $D$, and let $r_i$ represent the probability that allele $i$ belongs to population $+$. The goodness-of-fit test confronts the hypothesis $H_0$: '$q_i = r_i$ for $i = 1, \ldots, m$' against the alternative $H_D$: '$q_i > r_i$ for at least one $i$'. The null hypothesis asserts that both populations are homogeneous with respect to the probability distribution of the $m$ alleles. The one-sided alternative hypothesis states that there is at least one allele excessively associated with the diseased population $D$. In the case of two alleles, Pearson's chi-squared statistic is sufficient to test the above hypothesis. However, if the number of alleles is greater than 2, Pearson's chi-squared statistic loses significant test power as the number of degrees of freedom (d.f.) increases. This has already been pointed out in Terwilliger [1]. Therefore, one of two alternative methods is used to test $H_0$. The first method is based on multiple comparisons. It reduces the $(m \times 2)$ table to a set of $m$ $(2 \times 2)$ tables in which each marker allele is separately tested for association with the disease allele by Pearson's chi-squared statistic with 1 d.f. The largest of these chi-squared statistics is chosen as the test statistic, and a Bonferroni-type correction for multiple testing is then applied to calculate the overall p-value. The second method, proposed by Terwilliger [1], is based on a likelihood ratio statistic, $\Lambda$. However, $\Lambda$ has a non-standard probability distribution and is computationally demanding.

*Corresponding author. Tel.: +1-264 417 4367; fax: +1-264 425 1327.

E-mail address: [email protected] (S. Mahdi).

0025-5564/00/$ - see front matter © 2000 Elsevier Science Inc. All rights reserved.

In this paper, a new statistic, $L_D$, is proposed which is based on the multiple comparison approach. In the next section, we introduce the statistic $L_D$ and derive its asymptotic probability distribution. We present some simulation studies which show that the statistic $L_D$ leads to a more powerful test than Pearson's chi-squared test when the number of alleles is greater than 2. In Section 3, we apply the statistic $L_D$ to test for association in a real data set. Finally, a succinct conclusion is presented in Section 4.

Terwilliger's statistic, $\Lambda$, is very interesting, and it will be studied in a separate paper.
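The multiple-comparison method described in the introduction can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the three-allele counts are hypothetical, and the plain Bonferroni rule (multiplying the smallest single-test p-value by the number of tests) is assumed as one instance of a "Bonferroni-type" correction.

```python
import math

def chi2_1df_sf(x):
    """Upper tail probability of a chi-squared variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))

def pearson_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom > 0 else 0.0

def bonferroni_allele_test(cases, controls):
    """Test each allele against all others pooled, then Bonferroni-correct."""
    n1, n2 = sum(cases), sum(controls)
    stats = [pearson_2x2(x, n1 - x, y, n2 - y) for x, y in zip(cases, controls)]
    best = max(stats)
    # Plain Bonferroni (assumed): smallest single-test p-value times number of tests.
    p_overall = min(1.0, len(stats) * chi2_1df_sf(best))
    return best, p_overall

# Hypothetical counts for three alleles (not taken from the paper).
best, p = bonferroni_allele_test([30, 10, 10], [15, 20, 15])
print(round(best, 3), p)
```

The correction treats the $m$ tests as if any of them could be the most significant, which is what makes the overall p-value conservative.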

2. Statistic $L_D$

with $n = n_1 + n_2$. We recall below the following well-known result.

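Lemma 2.1. Consider the null hypothesis $H_{0j}: q_j = r_j$ and the alternative $H_{1j}: q_j \neq r_j$ for a fixed $j$, $j = 1, \ldots, m$. If the sample sizes $n_1$ and $n_2$ are large enough, then under $H_{0j}$, the statistic $\tilde{q}_j - \tilde{r}_j$ is approximately a centered normal variable with variance $n\theta_j(1-\theta_j)/(n_1 n_2)$. Therefore, the statistic

$$d_j = \frac{n_1 n_2 (\tilde{q}_j - \tilde{r}_j)^2}{n\,\theta_j (1-\theta_j)}$$

is approximately chi-squared distributed with 1 d.f.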

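Proof. Let $X_j$ and $Y_j$, for $j = 1, \ldots, m$, denote the number of alleles of type $j$ in the disease sample and in the control sample, respectively. Under the null hypothesis $H_0$, the variables $X_j$ and $Y_j$ are independent and have binomial distributions, that is,

$$X_j \sim B(n_1, \theta_j) \quad \text{and} \quad Y_j \sim B(n_2, \theta_j).$$

Therefore, $X_j/n_1$ and $Y_j/n_2$ are asymptotically normal variables with the same mean $\theta_j$ and corresponding variances $\theta_j(1-\theta_j)/n_1$ and $\theta_j(1-\theta_j)/n_2$. Thus, the variable $(X_j/n_1) - (Y_j/n_2)$ has a centered normal asymptotic distribution with variance $\theta_j(1-\theta_j)(1/n_1 + 1/n_2) = n\theta_j(1-\theta_j)/(n_1 n_2)$.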

Remark 2.1. The statistic $d_j$ is Pearson's chi-squared statistic calculated from the $(2 \times 2)$ contingency table, where the first row of the $(2 \times 2)$ table is the $j$th row of the original $(m \times 2)$ contingency table and its second row is obtained by collapsing the $m - 1$ remaining rows of the $(m \times 2)$ contingency table.
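The identity in Remark 2.1 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and $\theta_j$ is replaced by the pooled estimate $(X_j + Y_j)/n$, which is an assumption since the text shown here does not spell out the estimator. The counts are those of allele 3 in Table 1.

```python
def d_statistic(x, y, n1, n2):
    """d_j from Lemma 2.1, with theta_j replaced by its pooled estimate (assumption)."""
    n = n1 + n2
    q, r = x / n1, y / n2
    theta = (x + y) / n
    return n1 * n2 * (q - r) ** 2 / (n * theta * (1.0 - theta))

def pearson_collapsed(x, y, n1, n2):
    """Pearson chi-squared for the 2x2 table: allele j versus all other alleles."""
    table = [[x, y], [n1 - x, n2 - y]]   # first row: allele j; second row: the rest
    n = n1 + n2
    row = [sum(r) for r in table]
    col = [n1, n2]
    return sum((table[i][k] - row[i] * col[k] / n) ** 2 / (row[i] * col[k] / n)
               for i in range(2) for k in range(2))

# Allele 3 of Table 1: 40 of 130 cases, 39 of 136 controls.
d3 = d_statistic(40, 39, 130, 136)
chi3 = pearson_collapsed(40, 39, 130, 136)
print(d3, chi3)
```

Both routes give the same number up to floating-point error, so either form can be used in practice.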

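In Lemma 2.1, the alternative hypothesis, $H_{1j}: q_j \neq r_j$, is two-sided. For the particular one-sided alternative hypothesis, $H_{Dj}$: '$q_j > r_j$ for a fixed $j$', the appropriate statistic is defined to be

$$J_j = \begin{cases} d_j & \text{if } \tilde{q}_j > \tilde{r}_j, \\ 0 & \text{otherwise.} \end{cases}$$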

positive $J_i$'s and $((m/2) - 1)$ identically null others. Moreover, the strictly positive ones are mutually independent and asymptotically distributed as chi-square with 1 d.f.

Proof. Consider the set of variables $J_1, \ldots, J_{m-1}$. From Remark 2.2, it follows that the number of strictly positive $J_i$'s varies from 1 to $m - 1$. Then, on average there are $(1 + m - 1)/2 = m/2$ strictly positive $J_i$'s and $(m - 1 - (m/2)) = (m/2 - 1)$ null $J_i$'s. Furthermore, the positive $J_i$ variables belong to the set of the positive $d_i$ variables. Therefore, they are also mutually independent and asymptotically distributed as chi-square with 1 d.f.

We consider now the general goodness-of-fit problem: $H_0$: '$q_j = r_j$ for $j = 1, \ldots, m$' versus $H_D$: '$q_j > r_j$ for at least one $j$, $j = 1, \ldots, m$'. Note that the alternative hypothesis $H_D$ is equal to $\bigcup_{j=1}^{m} H_{Dj}$. Then, the appropriate statistic is

$$L_D = \max_{i = 1, \ldots, m-1} J_i. \qquad (3)$$

To evaluate the asymptotic probability distribution of $L_D$, we need to make the following assumptions.

Let $V$ be the random variable representing the number of positive $J_i$'s out of the $m - 1$ variables. Let also $f(v)$, for $v = 1, \ldots, m - 1$, denote the probability density function of $V$. We assume that $f(v)$ is a symmetric function about the mean $E(V) = m/2$. For instance, $V$ has the discrete uniform distribution. Under these assumptions, we state the following theorem.

Theorem 2.2. For any given large positive real number $\xi$, we have

$$\Pr[L_D > \xi] \approx 1 - \{F(\xi)\}^{m/2}, \qquad (4)$$

where $F$ is the cumulative distribution function of the $\chi^2(1)$ variable.
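Formula (4) is straightforward to evaluate. The sketch below is a hedged illustration: the function name `ld_pvalue` and the argument name `xi` (the threshold in Theorem 2.2) are hypothetical, and the $\chi^2(1)$ tail is obtained from the standard identity $1 - F(x) = \operatorname{erfc}(\sqrt{x/2})$ rather than from any routine named in the paper.

```python
import math

def ld_pvalue(xi, m):
    """Approximate Pr[L_D > xi] = 1 - F(xi)^(m/2), with F the chi-squared(1) CDF."""
    if xi <= 0:
        return 1.0
    tail = math.erfc(math.sqrt(xi / 2.0))   # 1 - F(xi)
    # log1p/expm1 keep precision when the tail probability is tiny.
    return -math.expm1((m / 2.0) * math.log1p(-tail))

# Inputs taken from the case study: L_D = 15.532 with m = 4 alleles.
print(ld_pvalue(15.532, 4))
```

For $m = 2$ the exponent is 1 and the function returns the ordinary $\chi^2(1)$ tail probability.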

Proof. By Theorem 2.1, the $J_i$ variables are mutually independent. Thus, $\Pr[L_D > \xi \mid V = l] = 1 - \{F(\xi)\}^{l}$, and averaging over the distribution of $V$ yields (4) for large positive real $\xi$.
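One way to see why (4) is plausible is a first-order expansion in the tail probability. This is offered as a sketch under the stated assumptions, not as the authors' full argument. Write $\xi$ for the threshold in Theorem 2.2 and $\varepsilon = 1 - F(\xi)$, which is small when $\xi$ is large; only $E(V) = m/2$ is used.

```latex
\begin{align*}
\Pr[L_D > \xi \mid V = l] &= 1 - \{F(\xi)\}^{l} = 1 - (1-\varepsilon)^{l}
  = l\,\varepsilon + O(\varepsilon^{2}), \\
\Pr[L_D > \xi] &= \sum_{l=1}^{m-1} f(l)\,\Pr[L_D > \xi \mid V = l]
  = E(V)\,\varepsilon + O(\varepsilon^{2}) = \tfrac{m}{2}\,\varepsilon + O(\varepsilon^{2}), \\
1 - \{F(\xi)\}^{m/2} &= 1 - (1-\varepsilon)^{m/2} = \tfrac{m}{2}\,\varepsilon + O(\varepsilon^{2}).
\end{align*}
```

The last two lines agree to first order in $\varepsilon$, which is the sense in which the approximation holds for large $\xi$.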

To see if formula (4) provides a good approximation to the asymptotic probability distribution of $L_D$ under $H_0$, we conducted a simulation study. Several values of $m$ are considered, specifically $m = 2, 3, 4, 5, 10, 20$. Table 3, in Appendix A, presents the empirical p-values, expressed as percentages, obtained from 5000 simulations under $H_0$. The cell frequencies are generated with equal probability and $n_1 = n_2 = 100$, as in [1]. We see that the empirical p-values associated

p-values. In case $m = 2$, $L_D$ = Pearson's chi-squared statistic. Furthermore, a power study for the statistics $L_D$ and $\chi^2$ was conducted, and the simulation results are presented in Table 4, given in Appendix A. The data in Table 4 illustrate the situation under $H_D$ where $q_1 = 0.65$, $r_1 = 0.5$, $q_j = 0.35/(m-1)$ and $r_j = 0.5/(m-1)$ for $j = 2, \ldots, m$.

Table 1
Frequencies of four alleles in a sample of 130 cases and 136 controls

| Allele | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| Case | 46 | 10 | 40 | 34 |
| Control | 20 | 19 | 39 | 58 |

Table 2
$J_i$ values evaluated from Table 1

| Allele table | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| $J_i$ | 15.532 | 0 | 0.139 | 0 |

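Table 3
Empirical p-values in % obtained from 5000 simulations with equally likely categories, by target p-value

| $m$ | Statistic | 40% | 30% | 20% | 10% | 5% | 1% | 0.1% |
|---|---|---|---|---|---|---|---|---|
| 2 | $L_D$ | 42.46 | 26.94 | 17.52 | 9.62 | 5.24 | 0.84 | 0.12 |
| 2 | $\chi^2$ | 42.46 | 26.94 | 17.52 | 9.62 | 5.24 | 0.82 | 0.12 |
| 3 | $L_D$ | 41.18 | 30.04 | 21.20 | 9.58 | 5.08 | 0.98 | 0.12 |
| 3 | $\chi^2$ | 39.96 | 29.94 | 19.58 | 9.82 | 4.82 | 1.06 | 0.06 |
| 4 | $L_D$ | 42.52 | 31.62 | 20.96 | 10.44 | 5.32 | 1.06 | 0.12 |
| 4 | $\chi^2$ | 38.78 | 29.02 | 19.92 | 10.16 | 4.78 | 1.14 | 0.04 |
| 5 | $L_D$ | 45.02 | 33.50 | 21.54 | 11.18 | 5.36 | 0.90 | 0.14 |
| 5 | $\chi^2$ | 39.88 | 29.80 | 20.02 | 9.92 | 4.90 | 1.14 | 0.14 |
| 10 | $L_D$ | 44.10 | 33.38 | 21.60 | 10.70 | 5.44 | 1.28 | 0.12 |
| 10 | $\chi^2$ | 40.02 | 30.36 | 19.72 | 9.80 | 4.28 | 0.80 | 0.04 |
| 20 | $L_D$ | 44.60 | 26.68 | 19.60 | 5.84 | 2.84 | 0.40 | 0.00 |
| 20 | $\chi^2$ | 34.24 | 24.20 | 14.00 | 5.70 | 2.18 | 0.24 | 0.00 |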

The simulation showed that $L_D$ has greater power than the $\chi^2$ statistic at every target p-value for each $m > 2$ considered; for $m = 2$ the two statistics coincide. This confirms the well-known fact that one-sided alternative hypotheses lead to more powerful tests (see, e.g., [2]).
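The simulation design described above (equally likely categories under $H_0$, $n_1 = n_2 = 100$, and the alternative with $q_1 = 0.65$, $r_1 = 0.5$) can be sketched as a small Monte Carlo experiment. Several choices here are assumptions rather than details from the paper: all function names are hypothetical, $\theta_j$ is estimated by the pooled frequency, critical values come from inverting formula (4), and the maximum is taken over all $m$ alleles (the text's Eq. (3) uses $i = 1, \ldots, m-1$) so that $m = 2$ reduces to Pearson's statistic as the text states. It will not reproduce Tables 3 and 4 digit for digit.

```python
import random
from statistics import NormalDist

def ld_statistic(cases, controls):
    """L_D as the largest one-sided J_i (maximum over all alleles: an assumption)."""
    n1, n2 = sum(cases), sum(controls)
    n = n1 + n2
    best = 0.0
    for x, y in zip(cases, controls):
        theta = (x + y) / n                 # pooled allele frequency estimate
        q, r = x / n1, y / n2
        if q > r and 0.0 < theta < 1.0:     # J_i is zero unless cases show an excess
            best = max(best, n1 * n2 * (q - r) ** 2 / (n * theta * (1.0 - theta)))
    return best

def critical_value(p, m):
    """Solve 1 - F(xi)^(m/2) = p for xi, with F the chi-squared(1) CDF."""
    f = (1.0 - p) ** (2.0 / m)
    z = NormalDist().inv_cdf((1.0 + f) / 2.0)   # F(xi) = 2*Phi(sqrt(xi)) - 1
    return z * z

def draw_counts(probs, size, rng):
    """One multinomial sample of allele counts."""
    counts = [0] * len(probs)
    for k in rng.choices(range(len(probs)), weights=probs, k=size):
        counts[k] += 1
    return counts

def rejection_rate(q, r, target_p, n1=100, n2=100, sims=5000, seed=0):
    """Share of simulated data sets with L_D above the formula-(4) critical value."""
    rng = random.Random(seed)
    xi = critical_value(target_p, len(q))
    hits = sum(ld_statistic(draw_counts(q, n1, rng), draw_counts(r, n2, rng)) > xi
               for _ in range(sims))
    return hits / sims

m = 4
null = [1.0 / m] * m
q_alt = [0.65] + [0.35 / (m - 1)] * (m - 1)
r_alt = [0.50] + [0.50 / (m - 1)] * (m - 1)
print("empirical level:", rejection_rate(null, null, 0.05, sims=2000))
print("empirical power:", rejection_rate(q_alt, r_alt, 0.05, sims=2000))
```

Passing the same probability vector for both populations estimates the empirical level, while passing `q_alt` and `r_alt` estimates power under the alternative used for Table 4.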

3. Case study

To see how our approach works, we analyzed a data set given in [3], for which a significant association with the first allele has been found by a parametric analysis [3]. The data are summarized in Table 1.

The computed values of $J_i$, $i = 1, \ldots, 4$, for the $(2 \times 2)$ tables constructed from Table 1 are displayed in Table 2.

From Table 2, we deduce that $L_D = 15.532$ with a corresponding p-value of 0.00016. Therefore, we conclude that the test is not significant at the level $\alpha = 0.0001$. Note that $\alpha = 0.0001$ is the commonly used value in linkage analysis; see, e.g., [4] and references therein. On the other hand, Pearson's chi-squared test statistic has the value 19.18. With 3 d.f., the p-value is 0.00025, so at the level $\alpha = 0.0001$ the null hypothesis is not rejected.
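The case-study computation can be retraced from the counts in Table 1. The sketch below rests on assumptions that should be read as such: the one-sided statistic is taken to be $d_i$ when the case frequency exceeds the control frequency and 0 otherwise, $\theta_i$ is estimated by the pooled frequency, the maximum runs over all four alleles as listed in Table 2, and the p-value uses formula (4). Variable and function names are hypothetical.

```python
import math

cases = [46, 10, 40, 34]      # Table 1, 130 cases
controls = [20, 19, 39, 58]   # Table 1, 136 controls
n1, n2 = sum(cases), sum(controls)
n = n1 + n2

def j_statistic(x, y):
    """One-sided J_i: d_i when the case frequency exceeds the control one, else 0."""
    theta = (x + y) / n                     # pooled estimate (assumption)
    q, r = x / n1, y / n2
    return n1 * n2 * (q - r) ** 2 / (n * theta * (1.0 - theta)) if q > r else 0.0

j_values = [j_statistic(x, y) for x, y in zip(cases, controls)]
l_d = max(j_values)

# Formula (4): Pr[L_D > xi] ~ 1 - F(xi)^(m/2), with F the chi-squared(1) CDF.
tail = math.erfc(math.sqrt(l_d / 2.0))
p_ld = -math.expm1((len(cases) / 2.0) * math.log1p(-tail))

# Pearson's chi-squared statistic on the full 4 x 2 table.
pearson = sum((obs - tot * (x + y) / n) ** 2 / (tot * (x + y) / n)
              for x, y in zip(cases, controls)
              for obs, tot in ((x, n1), (y, n2)))

print([round(j, 3) for j in j_values], round(l_d, 3), p_ld, round(pearson, 2))
```

Evaluated this way, alleles 2 to 4 reproduce Table 2 and Pearson's statistic comes out at 19.18, while allele 1 gives roughly 15.23 rather than the tabulated 15.532. The resulting p-value stays above $\alpha = 0.0001$ either way.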

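Table 4
Empirical power values in % obtained from 5000 simulations in the situation $q_1 = 0.65$, $r_1 = 0.5$, $q_j = 0.35/(m-1)$ and $r_j = 0.5/(m-1)$ for $j = 2, \ldots, m$ (a), by target p-value

| $m$ | Statistic | 40% | 30% | 20% | 10% | 5% | 1% | 0.1% |
|---|---|---|---|---|---|---|---|---|
| 2 | $L_D$ | 92.06 | 86.98 | 82.46 | 70.54 | 59.40 | 33.74 | 13.02 |
| 2 | $\chi^2$ | 92.06 | 86.98 | 82.46 | 70.54 | 59.40 | 33.74 | 13.02 |
| 3 | $L_D$ | 88.62 | 84.30 | 75.60 | 65.42 | 53.86 | 30.80 | 12.18 |
| 3 | $\chi^2$ | 87.10 | 82.26 | 74.24 | 61.30 | 48.66 | 26.94 | 8.76 |
| 4 | $L_D$ | 84.76 | 80.58 | 71.04 | 59.10 | 47.64 | 26.86 | 9.58 |
| 4 | $\chi^2$ | 83.60 | 77.44 | 68.30 | 53.72 | 40.52 | 19.96 | 5.92 |
| 5 | $L_D$ | 82.44 | 76.94 | 69.58 | 56.14 | 43.94 | 23.76 | 8.78 |
| 5 | $\chi^2$ | 81.62 | 74.60 | 65.14 | 50.12 | 37.04 | 17.02 | 4.32 |
| 10 | $L_D$ | 75.32 | 68.06 | 56.46 | 42.80 | 34.62 | 17.90 | 5.86 |
| 10 | $\chi^2$ | 70.84 | 61.50 | 48.64 | 31.78 | 20.10 | 6.50 | 0.66 |
| 20 | $L_D$ | 47.00 | 38.68 | 28.54 | 19.00 | 12.50 | 3.96 | 0.66 |
| 20 | $\chi^2$ | 10.76 | 5.80 | 2.42 | 0.74 | 0.18 | 0.00 | 0.00 |

(a) The parameter $m$ represents the number of allele categories,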

4. Conclusion

We have briefly reviewed the most common statistical methods in linkage analysis that test for association in case-control contingency tables. The Pearson and multiple comparison methods are based on a chi-squared limiting distribution. Terwilliger's [1] likelihood ratio statistic $\Lambda$ is computationally demanding and based on an unknown non-standard asymptotic distribution. We have proposed in this paper a new statistic $L_D$. Its probability distribution is easily evaluated and simulation studies have shown that it is more powerful than the classical Pearson chi-squared test.

Acknowledgements

Thanks are due to the Editor and Referees for useful comments which led to major improvements in the paper. V.C.'s research was supported by a grant from the Natural Science and Research Council of Canada. S.M.'s research was supported by a UWI Study and Travel grant.

Appendix A

See Tables 3 and 4.

References

[1] J.D. Terwilliger, A powerful likelihood method for the analysis of linkage disequilibrium between trait loci and one or more polymorphic marker loci, Am. J. Hum. Genet. 56 (1995) 777.

[2] D.I. Tang, S.P. Lin, An approximate likelihood ratio test for comparing several treatments to a control, J. Am. Statist. Assoc. 92 (1997) 1155.

[3] J.D. Terwilliger, J. Ott, Handbook of Human Genetic Linkage, Johns Hopkins University, Baltimore and London, 1992.
