
Single nucleotide polymorphism (SNP) data analysis by using bootstrap method

Adi Setiawan

Citation: AIP Conference Proceedings 1746, 020051 (2016); doi: 10.1063/1.4953976 View online: http://dx.doi.org/10.1063/1.4953976

View Table of Contents: http://scitation.aip.org/content/aip/proceeding/aipcp/1746?ver=pdfcov Published by the AIP Publishing


Single Nucleotide Polymorphism (SNP) Data Analysis by using Bootstrap Method

Adi Setiawan a)

Department of Mathematics, Faculty of Science and Mathematics, Satya Wacana Christian University, Jl. Diponegoro 52-60 Salatiga 50711 Indonesia

a) Corresponding author: [email protected]

Abstract. This paper describes how to analyze SNP data by using the bootstrap method. The bootstrap method is applied to the chi-square statistic $Q^2$ and the likelihood ratio statistic $G^2$ in case-control association studies. A simulation study is used to explore the properties of the method. Based on the original table, a new table can be constructed and the related values of the statistics can be computed. Figure 1(a) and Figure 1(b) present the B = 1,000,000 bootstrap values of the chi-square statistic and the likelihood ratio statistic, respectively. The bootstrap p-values for the two (original) statistical tests are $1.2 \times 10^{-5}$ and $1.9 \times 10^{-5}$, respectively. The described method can also be applied to whole-genome case-control association studies that use thousands of SNPs.

INTRODUCTION

Association studies have become very popular in the last few years (see e.g. [1] and [2]). In this paper, we focus on case-control association studies, which simply compare the genotypes of individuals who have a disease (cases) with the genotypes of individuals without the disease (controls). The proportions in each group having a characteristic of interest (for instance the numbers of alleles of a given type) are then compared to determine whether there is an association between the disease and the characteristic of interest. Association studies must use markers in the analysis, these being by definition the observable characteristics of the genome; here the markers are Single Nucleotide Polymorphisms (SNPs). Furthermore, we study how to analyze SNP data by using the bootstrap method for the chi-square statistic $Q^2$ and the likelihood ratio statistic $G^2$.

LITERATURE REVIEW

The bootstrap method has many applications in data analysis when the distribution of the population cannot be determined exactly. The method has already been applied in SNP data analysis (see e.g. [3] and [4]). In this paper, it is used in case-control association studies. Association studies are based on a case-control design and try to find marker loci associated with the disease by comparing genotype frequencies between random samples of cases (individuals with the disease of interest) and controls (individuals without the disease of interest). The methods can be classified as single marker, double marker or multiple marker methods according to whether they take into account frequencies of markers at one locus or combinations of markers at two or more loci. Under the assumptions of infinite population size, discrete generations, random mating, no selection, no migration, no mutation and equal initial genotype frequencies in the two sexes, Hardy-Weinberg equilibrium arises after one generation, and thereafter the genotype frequencies in the population are constant from generation to generation.


families that originally inhabited the region approximately 400 years ago. Genotyping was done using the Affymetrix 10K SNP chip on 27 controls and 31 cases. We summarized characteristics of the 11229 SNPs, such as the identity of the SNP on the chromosome and the genotype of every individual in the control and case samples. The genotype of an individual is defined as AA, AB or BB; a missing genotype is coded as “NoCall”, meaning that the marker did not pass the discrimination filter [5]. A case-control study with a biallelic marker was conducted on the SNP with identity 1513978 on chromosome 2, and the results are given in TABLE 1.

TABLE 1. The number of genotypes in the controls and cases samples.

| | AA | Aa | aa | Total |
|---|---|---|---|---|
| Controls | 11 | 14 | 2 | 27 |
| Cases | 29 | 2 | 0 | 31 |
| Pooled | 40 | 16 | 2 | 58 |

Single Marker Genotype-Based Method by Using the Chi-square Test

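Let $p = (p_1, p_2, p_3)$ and $q = (q_1, q_2, q_3)$ be the genotype frequencies in the populations of controls and cases, respectively. Let $X = (X_1, X_2, X_3)$ and $Y = (Y_1, Y_2, Y_3)$ be the numbers of genotypes AA, Aa and aa in the samples of controls and cases, respectively (see TABLE 2). These vectors possess Multi(n, p) and Multi(m, q) distributions, respectively. We test the null hypothesis $H_0 : p = q$.

Under the null hypothesis $H_0$, with common genotype frequencies $p_0 = (p_{10}, p_{20}, p_{30})$, the MLE of $p_{j0}$ is

$$\hat{p}_{j0} = \frac{X_j + Y_j}{n + m}, \qquad j = 1, 2, 3.$$

The test statistic $Q^2$ is defined as

$$Q^2 = \sum_{j=1}^{3} \frac{(X_j - n\hat{p}_{j0})^2}{n\hat{p}_{j0}} + \sum_{j=1}^{3} \frac{(Y_j - m\hat{p}_{j0})^2}{m\hat{p}_{j0}},$$

which has a chi-squared distribution with 2 degrees of freedom asymptotically as $n, m \to \infty$ under $H_0$.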

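TABLE 2. The number of genotypes in the controls and cases samples (notation).

| | AA | Aa | aa | Total |
|---|---|---|---|---|
| Controls | $X_1$ | $X_2$ | $X_3$ | $n$ |
| Cases | $Y_1$ | $Y_2$ | $Y_3$ | $m$ |
| Pooled | $X_1 + Y_1$ | $X_2 + Y_2$ | $X_3 + Y_3$ | $n + m$ |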

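Example 1. A case-control study with a biallelic marker was conducted on the SNP with identity 1513978 on chromosome 2; the results are given in TABLE 1. We have $Q^2 = 18.9159$, which exceeds the 5% critical value 5.99 of the chi-squared distribution with 2 degrees of freedom, so there is a (significant) association between the marker genotype and the disease.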
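As an illustration of Example 1, the following Python sketch evaluates the Pearson chi-square statistic on the counts of TABLE 1. The function names `chi_square_q2` and `chi2_pvalue_df2` are hypothetical, and skipping a genotype column whose pooled count is zero is an assumption of this sketch, not something stated in the paper. The classical p-value uses the fact that the upper tail of the chi-squared distribution with 2 degrees of freedom is exp(-x/2).

```python
import math


def chi_square_q2(controls, cases):
    """Pearson chi-square statistic Q^2 for a 2 x 3 genotype table."""
    n, m = sum(controls), sum(cases)
    q2 = 0.0
    for x, y in zip(controls, cases):
        p0 = (x + y) / (n + m)  # pooled MLE of p_j0 under H0: p = q
        if p0 == 0:
            # Assumption: a genotype absent from both samples contributes nothing.
            continue
        q2 += (x - n * p0) ** 2 / (n * p0) + (y - m * p0) ** 2 / (m * p0)
    return q2


def chi2_pvalue_df2(stat):
    """Upper-tail probability of the chi-squared distribution with 2 df."""
    return math.exp(-stat / 2.0)


controls = [11, 14, 2]  # AA, Aa, aa counts for controls (TABLE 1)
cases = [29, 2, 0]      # AA, Aa, aa counts for cases (TABLE 1)
q2 = chi_square_q2(controls, cases)
print(f"Q^2 = {q2:.4f}, classical p-value = {chi2_pvalue_df2(q2):.2e}")
```

Evaluated in floating point without intermediate rounding, the statistic comes out at roughly 18.91, in line with the value 18.9159 reported in the paper.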

Single Marker Genotype-Based Method by Using the Likelihood Ratio Test

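A likelihood ratio test is a statistical test based on the ratio between the maximum of the likelihood function under the null hypothesis and the maximum under the alternative. In this subsection, we describe this test for the single marker genotype-based method in the setting of the previous subsection. We wish to test the null hypothesis $H_0 : p = q$ versus the alternative hypothesis $H_1 : p \neq q$, where $p = (p_1, p_2, p_3)$ and $q = (q_1, q_2, q_3)$, by using the likelihood ratio test. Because X possesses a Multi(n, p) distribution and Y possesses a Multi(m, q) distribution, the log-likelihood function is given, up to an additive constant that does not depend on p and q, by

$$\ell(p, q) = \sum_{j=1}^{3} X_j \log p_j + \sum_{j=1}^{3} Y_j \log q_j .$$

Under $H_0$, the maximum of the log-likelihood function is

$$\ell(\hat{p}_0) = \sum_{j=1}^{3} (X_j + Y_j) \log \hat{p}_{j0}, \qquad \hat{p}_{j0} = \frac{X_j + Y_j}{n + m}.$$

Under $H_1$, the MLEs of $p_j$ and $q_j$ are $\hat{p}_j = X_j / n$ and $\hat{q}_j = Y_j / m$ for j = 1, 2, 3, respectively. Thus, the likelihood ratio statistic is

$$G^2 = 2\left[\ell(\hat{p}, \hat{q}) - \ell(\hat{p}_0)\right] = 2\left[\sum_{j=1}^{3} X_j \log\frac{X_j}{n\hat{p}_{j0}} + \sum_{j=1}^{3} Y_j \log\frac{Y_j}{m\hat{p}_{j0}}\right],$$

with the convention $0 \log 0 = 0$. For the data in TABLE 1, straightforward computation yields

$$\ell(\hat{p}_0) = 40\log(0.6897) + 16\log(0.2759) + 2\log(0.0344).$$

The likelihood ratio test rejects $H_0$ for large values of $G^2$, which under $H_0$ has a chi-squared distribution with 2 degrees of freedom asymptotically.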
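The likelihood ratio statistic for a 2 x 3 genotype table can be sketched in Python as follows. Here $G^2$ is computed as twice the sum of observed count times log(observed / expected), with expected counts taken from the pooled genotype frequencies. The function name `likelihood_ratio_g2` is hypothetical, and empty cells are handled with the usual convention 0 log 0 = 0, which is an assumption of this sketch.

```python
import math


def likelihood_ratio_g2(controls, cases):
    """Likelihood ratio statistic G^2 for a 2 x 3 genotype table."""
    n, m = sum(controls), sum(cases)
    g2 = 0.0
    for x, y in zip(controls, cases):
        p0 = (x + y) / (n + m)  # pooled MLE of p_j0 under H0: p = q
        # Cells with a zero count are skipped (convention 0 * log 0 = 0).
        if x > 0:
            g2 += 2.0 * x * math.log(x / (n * p0))
        if y > 0:
            g2 += 2.0 * y * math.log(y / (m * p0))
    return g2


# Counts (AA, Aa, aa) from TABLE 1 of the paper.
g2 = likelihood_ratio_g2([11, 14, 2], [29, 2, 0])
print(f"G^2 = {g2:.4f}, classical p-value = {math.exp(-g2 / 2.0):.2e}")
```

Applied to the counts of TABLE 3 (controls 16, 9, 2 and cases 22, 8, 1), the same function reproduces the value $G^2 = 1.0739$ quoted in the paper.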

RESEARCH METHODS

The bootstrap method can be described as follows:

1. Based on the original table, determine the statistic $T_{original}$.

2. Generate a new table: draw the numbers of genotypes AA, Aa and aa in controls from the Multi(n, $p_0$) distribution, where $p_0 = (p_{10}, p_{20}, p_{30})$ is estimated by $\hat{p}_{j0} = (x_j + y_j)/(n + m)$ for j = 1, 2, 3, and draw the numbers of genotypes AA, Aa and aa in cases from the Multi(m, $p_0$) distribution.

3. Based on the new table, determine the statistic $T_1^*$.

4. Repeat steps 2 and 3 a large number B of times to obtain B statistics $T_1^*, T_2^*, \ldots, T_B^*$.

5. Determine the p-value of the test as the proportion of $T_1^*, T_2^*, \ldots, T_B^*$ that are larger than $T_{original}$.
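The five steps above can be sketched in Python using only the standard library. All function names are hypothetical, the default B = 2000 is chosen only to keep the example fast (the paper uses B = 1,000,000), and treating a genotype column that is empty in a resampled table as contributing zero is an assumption of this sketch.

```python
import math
import random


def q2_stat(controls, cases):
    """Pearson chi-square statistic; empty pooled columns are skipped (assumption)."""
    n, m = sum(controls), sum(cases)
    total = 0.0
    for x, y in zip(controls, cases):
        p0 = (x + y) / (n + m)
        if p0 > 0:
            total += (x - n * p0) ** 2 / (n * p0) + (y - m * p0) ** 2 / (m * p0)
    return total


def g2_stat(controls, cases):
    """Likelihood ratio statistic with the convention 0 * log 0 = 0."""
    n, m = sum(controls), sum(cases)
    total = 0.0
    for x, y in zip(controls, cases):
        p0 = (x + y) / (n + m)
        if x > 0:
            total += 2.0 * x * math.log(x / (n * p0))
        if y > 0:
            total += 2.0 * y * math.log(y / (m * p0))
    return total


def draw_counts(rng, size, probs):
    """One Multi(size, probs) draw, returned as counts [AA, Aa, aa]."""
    counts = [0] * len(probs)
    for j in rng.choices(range(len(probs)), weights=probs, k=size):
        counts[j] += 1
    return counts


def bootstrap_pvalue(controls, cases, statistic, B=2000, seed=0):
    """Bootstrap p-value: share of resampled statistics above the original one."""
    rng = random.Random(seed)
    n, m = sum(controls), sum(cases)
    p0 = [(x + y) / (n + m) for x, y in zip(controls, cases)]
    t_original = statistic(controls, cases)          # step 1
    exceed = 0
    for _ in range(B):                               # step 4
        new_controls = draw_counts(rng, n, p0)       # step 2 (controls)
        new_cases = draw_counts(rng, m, p0)          # step 2 (cases)
        t_star = statistic(new_controls, new_cases)  # step 3
        if t_star > t_original:                      # step 5
            exceed += 1
    return exceed / B


# Counts (AA, Aa, aa) from TABLE 1 of the paper.
controls, cases = [11, 14, 2], [29, 2, 0]
print("bootstrap p-value (Q^2):", bootstrap_pvalue(controls, cases, q2_stat))
print("bootstrap p-value (G^2):", bootstrap_pvalue(controls, cases, g2_stat))
```

With a small B, a p-value of the order of $10^{-5}$ cannot be resolved and will typically be reported as 0; resolving it requires B to be much larger than the reciprocal of the p-value, which is why the paper uses one million replicates.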

The statistic T can be taken to be either the chi-square statistic $Q^2$ or the likelihood ratio statistic $G^2$. In this paper, we describe how the bootstrap method is used with both. Furthermore, a simulation study is carried out by generating new tables from the Multinomial distribution with parameters n and $p = (p_{AA}, p_{Aa}, p_{aa})$, where n = 50, in two settings: $p = q = (0.1, 0.1, 0.8)$, and $p \neq q$ with $p = (0.1, 0.1, 0.8)$ and $q = (0.1, 0.7, 0.2)$.

RESULT AND DISCUSSION

Under the null hypothesis, a new table can be constructed based on the original table. By using TABLE 1 as the original table, we can construct TABLE 3 as a new table by using step 2 of the procedure in Research Methods. For this new table, the chi-square statistic is $Q^2 = 1.0687$ and the likelihood ratio statistic is $G^2 = 1.0739$. Figure 1(a) and Figure 1(b) present the B = 1,000,000 bootstrap values of the chi-square statistic and the likelihood ratio statistic, respectively. Based on Fig. 1(a) and the original statistical value (i.e. 18.9159), the bootstrap p-value for the chi-square statistic $Q^2$ is determined as the proportion of bootstrap values of $Q^2$ that exceed 18.9159. The likelihood ratio statistic $G^2$ is treated in an analogous way. The bootstrap p-values for the two (original) statistical tests are $1.2 \times 10^{-5}$ and $1.90 \times 10^{-5}$, respectively.

TABLE 3. The number of genotypes in the controls and cases samples for Example 1 (bootstrap table).

| | AA | Aa | aa | Total |
|---|---|---|---|---|
| Controls | 16 | 9 | 2 | 27 |
| Cases | 22 | 8 | 1 | 31 |
| Pooled | 38 | 17 | 3 | 58 |

FIGURE 1. The histogram of B = 1,000,000 bootstrap statistical values based on the original table for (a) the chi-square statistic $Q^2$ and (b) the likelihood ratio statistic $G^2$.

[Figure: panel (a) "Histogram of bootstrap value Q2" and panel (b) "Histogram of bootstrap value G2"; in both panels the horizontal axis is the statistic (0 to 30) and the vertical axis is the density (0.0 to 0.4).]

Simulated case-control association data can be generated as follows. We use 50 controls and 50 cases to illustrate the method. Genotypes AA, Aa and aa are generated for the 50 individuals in the controls sample by using the Multinomial distribution with parameter 50 and

$(p_{AA}, p_{Aa}, p_{aa}) = (0.1, 0.1, 0.8)$.

Similarly, the 50 individuals in the cases sample are generated. The 10 simulated data sets for the controls and cases samples and the related p-values for $Q^2$ and $G^2$, for both the classical and the bootstrap methods, are presented in TABLE 4. At the 5% level of significance, the p-values in the table do not lead us to reject $p = q$, i.e., there is no evidence of an association between the marker and the disease. TABLE 5 presents simulated data and their p-values when we use the Multinomial distribution with parameter 50 and $(p_{AA}, p_{Aa}, p_{aa}) = (0.1, 0.1, 0.8)$ for the controls sample and $(q_{AA}, q_{Aa}, q_{aa}) = (0.1, 0.7, 0.2)$ for the cases sample. At the 5% level of significance, the p-values in the table lead us to conclude that $p \neq q$, as expected, i.e., there is an association between the marker and the disease.
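The simulation design described above can be sketched in Python as follows. The function names are hypothetical, the seed is arbitrary, and only the classical chi-square p-value (upper tail exp(-x/2) of the chi-squared distribution with 2 degrees of freedom) is computed to keep the example short; the resulting tables will differ from those in TABLE 4 and TABLE 5 because the random draws differ.

```python
import math
import random


def draw_genotypes(rng, size, probs):
    """One Multi(size, probs) draw, returned as counts [AA, Aa, aa]."""
    counts = [0] * len(probs)
    for j in rng.choices(range(len(probs)), weights=probs, k=size):
        counts[j] += 1
    return counts


def q2_stat(controls, cases):
    """Pearson chi-square statistic; empty pooled columns are skipped (assumption)."""
    n, m = sum(controls), sum(cases)
    total = 0.0
    for x, y in zip(controls, cases):
        p0 = (x + y) / (n + m)
        if p0 > 0:
            total += (x - n * p0) ** 2 / (n * p0) + (y - m * p0) ** 2 / (m * p0)
    return total


def simulate_pvalue(rng, p, q, n=50, m=50):
    """Simulate one case-control table and return it with its classical p-value."""
    controls = draw_genotypes(rng, n, p)
    cases = draw_genotypes(rng, m, q)
    p_value = math.exp(-q2_stat(controls, cases) / 2.0)  # chi-squared, 2 df
    return controls, cases, p_value


rng = random.Random(2016)  # arbitrary seed, for reproducibility only
settings = {
    "p = q": ((0.1, 0.1, 0.8), (0.1, 0.1, 0.8)),
    "p != q": ((0.1, 0.1, 0.8), (0.1, 0.7, 0.2)),
}
for label, (p, q) in settings.items():
    for _ in range(3):
        controls, cases, p_value = simulate_pvalue(rng, p, q)
        print(label, controls, cases, f"{p_value:.3g}")
```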

TABLE 4. The result of 10 simulated data sets using 50 individuals in the controls sample and 50 individuals in the cases sample, and the related p-values, for p = q = (0.1, 0.1, 0.8).

| No. | Controls nAA | Controls nAa | Controls naa | Cases nAA | Cases nAa | Cases naa | p-value $Q^2$, classical (bootstrap) | p-value $G^2$, classical (bootstrap) |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 6 | 42 | 8 | 4 | 38 | 0.1077 (0.1251) | 0.1225 (0.1231) |
| 2 | 7 | 9 | 34 | 5 | 5 | 40 | 0.3714 (0.3903) | 0.3748 (0.3869) |
| 3 | 5 | 5 | 40 | 10 | 2 | 38 | 0.2145 (0.2463) | 0.2227 (0.2350) |
| 4 | 7 | 3 | 40 | 7 | 5 | 38 | 0.7571 (0.7684) | 0.7591 (0.7686) |
| 5 | 4 | 5 | 41 | 7 | 8 | 35 | 0.3676 (0.3873) | 0.3708 (0.3834) |
| 6 | 7 | 3 | 40 | 3 | 2 | 45 | 0.3428 (0.3862) | 0.3510 (0.3738) |
| 7 | 5 | 4 | 41 | 3 | 4 | 43 | 0.7584 (0.7740) | 0.7606 (0.7732) |
| 8 | 5 | 6 | 39 | 8 | 5 | 37 | 0.6563 (0.6683) | 0.6584 (0.6671) |
| 9 | 4 | 5 | 39 | 6 | 4 | 40 | 0.7686 (0.7886) | 0.7697 (0.7858) |
| 10 | 4 | 6 | 40 | 2 | 8 | 40 | 0.6168 (0.6394) | 0.6211 (0.6396) |

TABLE 5. The result of 10 simulated data sets using 50 individuals in the controls sample and 50 individuals in the cases sample, and the related p-values, for p = (0.1, 0.1, 0.8) and q = (0.1, 0.7, 0.2).

| No. | Controls nAA | Controls nAa | Controls naa | Cases nAA | Cases nAa | Cases naa | p-value $Q^2$, classical (bootstrap) | p-value $G^2$, classical (bootstrap) |
|---|---|---|---|---|---|---|---|---|
| 1 | 5 | 3 | 42 | 4 | 28 | 18 | $6.03 \times 10^{-8}$ (0) | $3.26 \times 10^{-7}$ (0) |
| 2 | 1 | 8 | 41 | 7 | 33 | 10 | $9.05 \times 10^{-10}$ (0) | $4.12 \times 10^{-9}$ (0) |
| 3 | 8 | 3 | 39 | 4 | 36 | 10 | $3.76 \times 10^{-12}$ (0) | $8.32 \times 10^{-11}$ (0) |
| 4 | 7 | 3 | 42 | 3 | 33 | 14 | $2.27 \times 10^{-10}$ (0) | $2.65 \times 10^{-9}$ (0) |
| 5 | 4 | 4 | 42 | 5 | 33 | 12 | $3.23 \times 10^{-10}$ (0) | $2.64 \times 10^{-9}$ (0) |
| 6 | 6 | 5 | 39 | 4 | 29 | 17 | $8.22 \times 10^{-7}$ ($1.0 \times 10^{-6}$) | $2.28 \times 10^{-6}$ (0) |
| 7 | 4 | 6 | 40 | 6 | 34 | 10 | $6.0 \times 10^{-9}$ (0) | $5.60 \times 10^{-9}$ (0) |
| 8 | 7 | 6 | 37 | 1 | 39 | 10 | $6.0 \times 10^{-11}$ (0) | $2.51 \times 10^{-10}$ (0) |
| 9 | 7 | 2 | 41 | 4 | 36 | 10 | $6.0 \times 10^{-13}$ (0) | $1.33 \times 10^{-11}$ (0) |
| 10 | 6 | 2 | 42 | 4 | 38 | 8 | $6.0 \times 10^{-15}$ (0) | $7.20 \times 10^{-13}$ (0) |

The simulation study can be extended to several sample sizes and values of p and q. The described method can also be applied to whole-genome case-control association studies that use thousands of SNPs.

CONCLUSION

In this paper, we have described how to analyze SNP data by using the bootstrap method for the chi-square statistic $Q^2$ and the likelihood ratio statistic $G^2$. Recent technology makes it possible to obtain highly dense whole-genome SNP data with hundreds of thousands of SNPs (see e.g. [6] and [8]). For such high-density SNP data the bootstrap method becomes very time consuming, and finding an efficient method remains a challenge for future research.

REFERENCES

[1] D. Balding, Nature Reviews Genetics 7, 781-791 (2006).

[2] A. Setiawan, SNP Data Analysis using Logistic Regression, presented at the IndoMS International Conference on Mathematics and its Applications, 2009.

[3] F. Sambo, B. di Camillo, Minimizing Time When Applying Bootstrap to Contingency Tables Analysis of Genome-Wide Data, Learning and Intelligent Optimization, the International Conference, LION 6, Paris, France, January 16-20, 2012.

[4] O. Manor, E. Segal, PLOS Computational Biology, 9(8), page (2013).

[5] A. Setiawan, Statistical Data Analysis of Genetic Data in Twin Studies and Association Studies, Vrije Universiteit Amsterdam, the Netherlands, 2007.

[6] N. Nishida, BMC Genomics 9(1) 431 (2008).

[7] L. Pengyuan, Journal of the National Cancer Institute 100, 1326-1330 (2008).

[8] H.H.M. Draisma et al., Nature Communications, 6, 72081-9 (2015).

[9] W. Zheng et al., Nature Genetics, 41, 324-328 (2009).