
Proc. Assoc. Advmt. Anim. Breed. Genet. 17: 81-84

Gene Expression and Bioinformatics

INTEGRATING WHOLE-GENOME GENETIC-ASSOCIATION STUDIES WITH GENE EXPRESSION DATA TO PRIORITISE CANDIDATE GENES AFFECTING INTRAMUSCULAR FAT IN BEEF CATTLE TRAITS

Eva K. F. Chan and Antonio Reverter

Cooperative Research Centre for Beef Genetic Technologies

CSIRO Livestock Industries, Queensland Bioscience Precinct, 306 Carmody Rd., St Lucia QLD 4067

SUMMARY

A common problem with many genetic association studies is the daunting task of identifying functional candidate genes from a large list of positional candidates. This paper presents an approach to incorporate external gene expression data to facilitate this process. Our approach is based on the assumption that genes contributing to a complex trait will exhibit greater variability in expression level under certain biological conditions that are themselves related to the trait than non-contributing genes. Rather than using expression data from the same animals used in the genetic-association study, which are often not available, we demonstrate the feasibility of borrowing information from independent studies. Using intramuscular fat (IMF) as an illustration, we first identify five unique genetic loci encompassing 35 positional candidate genes. Analyses of differential expression from eight independent microarray experiments are then performed to identify five functional candidates.

INTRODUCTION

While genetic association studies are effective for mapping the genetic influence(s) of a phenotype to the appropriate chromosomal region(s), these mapped quantitative trait loci (QTL) are located imprecisely, implicating tens to hundreds of genes. Common approaches for pruning and prioritising candidate genes, such as fine-mapping and positional cloning, typically require large panels of animals for genotyping and phenotyping, which can be time and resource intensive.

In recent years, much attention has focused on understanding the influence of gene expression variations on phenotypic differences. This has led to an accumulation of gene expression data across many species, including cattle (see Lehnert et al., 2006, and references therein).

Stemming from the underlying hypothesis that many phenotypic differences are likely driven by changes in the amount of gene products, we argue that positional candidate genes (genes implicated by their genomic location) can be prioritised on (1) the degree of their expression variation and (2) the amount of trait variance they can explain. We further argue that information on gene expression can be borrowed from independent studies for this purpose. These may include studies examining a different physiological system, in different tissues, using different animals, and/or different organisms, so long as the experimental system of the gene expression study is known/predicted to influence the trait of interest.

By the rationale of ‘guilt-by-association’, gene expression studies conducted under conditions relevant to the trait of interest should, in principle, be appropriate. That is, given a list of positional candidate genes within a QTL, if a subset is found to be differentially expressed (DE) under circumstances known to influence the trait, then the genes in this subset are more likely to be functional candidates (to have functional relevance) than the non-DE genes. A wide variety of expression studies may be appropriate, including those comparing natural and perturbed physiological systems, different genetic backgrounds, different tissue/cell types, and different developmental stages. The advantage of this approach is that in-silico methods may partially replace additional animal experiments and reduce the amount of genotyping, sequencing, and phenotyping required.

This paper demonstrates the principle and application of this approach using intramuscular fat (IMF) in beef cattle. QTLs for this trait were identified using whole-genome association (WGA) mapping (Barendse and Reverter, 2007), and the resulting positional candidate genes were prioritised using gene expression data from eight microarray studies in cattle (Lehnert et al. 2006).

MATERIALS AND METHODS

Whole-genome SNP-association. We used SNP genotype data described in Barendse and Reverter (2007). In brief, 189 steers from the Australian Beef CRC I resource (Upton et al., 2001) were genotyped using the MegAllele™ Genotyping Bovine 10K SNP Panel (Hardenbol et al., 2005). We used 8,000 informative SNPs with a minor allele frequency > 5%. Prior to use in SNP-association analysis, IMF values were adjusted for non-genetic effects, including breed type, herd, sex, age at slaughter, and market weight endpoint.

WGA was performed using a regression approach implemented in the R/SNPassoc v. 1.3-0 package (González et al., 2006). For each SNP, five genetic inheritance models were tested: additive, dominance, recessive, co-dominance, and over-dominance allelic effects. Association significance was defined at a 5% false discovery rate (FDR) based on the Q-value calculation (Storey 2002) using the R/qvalue package (Dabney and Storey 2006). Note that, based on the adjusted r² from the linear regressions performed, we have statistical power to detect effects of r²≤0.12 at 5% FDR, but lack power to detect any recessive or over-dominance allelic effects.

Gene expression analyses. From eight independent microarray studies, measuring a total of 51 conditions on 135 microarrays of the same platform (Lehnert et al., 2004), gene expression variations were extracted for 15 contrasts: three comparing diets, two comparing cattle breeds, nine comparing developmental stages, and one examining the effect of an adipogenesis stimulant. Each array contains 7,898 clones, of which 728 unique genes (1,947 clones) have been accurately annotated, verified, and mapped onto the latest bovine genome assembly (Btau3.1, August 2006). Expression values were normalised using a multivariate mixed model (Reverter et al. 2006) to facilitate comparison of relative gene expression values across the 51 conditions and, subsequently, to construct the 15 contrasts. The significance of the number of times a gene is DE across the 15 contrasts was assessed by permutation testing, using 10,000 permutations. A threshold of P<0.05 was used to indicate that the observed number of DE calls for a given gene is unlikely to have arisen by chance.
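The permutation scheme is not described in detail here, so the following is only one plausible sketch of such a test. It assumes that gene labels are shuffled independently within each contrast, and that the P-value is the fraction of permutations in which a gene is DE at least as often as observed (with the usual +1 correction). The function name `de_count_pvalues` and the toy DE matrix are hypothetical.

```python
import random

def de_count_pvalues(de, n_perm=10000, seed=0):
    """de[c][g] is True when gene g is DE in contrast c.

    Returns the observed DE count per gene and a permutation P-value for
    seeing a count at least that large when DE labels are assigned at random.
    """
    rng = random.Random(seed)
    n_genes = len(de[0])
    observed = [sum(row[g] for row in de) for g in range(n_genes)]
    exceed = [0] * n_genes
    for _ in range(n_perm):
        counts = [0] * n_genes
        for row in de:
            shuffled = row[:]          # shuffle gene labels within one contrast
            rng.shuffle(shuffled)
            for g, flag in enumerate(shuffled):
                counts[g] += flag
        for g in range(n_genes):
            if counts[g] >= observed[g]:
                exceed[g] += 1
    pvalues = [(e + 1) / (n_perm + 1) for e in exceed]
    return observed, pvalues

# Toy DE matrix (hypothetical): 20 genes, 15 contrasts, 2 DE genes per contrast.
n_genes, n_contrasts = 20, 15
de = []
for c in range(n_contrasts):
    row = [False] * n_genes
    row[1 + c % 19] = True
    row[0 if c < 5 else 1 + (c + 7) % 19] = True  # gene 0 is DE in five contrasts
    de.append(row)

observed, pvals = de_count_pvalues(de, n_perm=2000)
```

Under this particular null a gene's count behaves like a binomial draw, so the P-values it produces need not match those reported in this paper, whose exact permutation design may differ.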

RESULTS AND DISCUSSION

IMF QTLs. Genetic association with IMF was assessed at each of 8,000 SNP markers under five inheritance models, using a linear regression approach. At a false discovery rate of 5%, eight unique SNP-QTLs were found to be associated with IMF under at least one inheritance model (Table 1). These eight SNP-QTLs fall roughly into five QTL regions: one on each of Chr2, Chr7, Chr11, and two on Chr26.
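To make the 5% FDR threshold concrete, the sketch below computes q-values in the spirit of Storey (2002). It is a simplification and not the R/qvalue package: the proportion of true nulls is estimated with a single fixed tuning value (`lam=0.5`, an assumption) rather than the package's default procedure, and the P-values are invented for illustration.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Simplified q-values: pi0 estimated at one fixed lambda (assumption)."""
    m = len(pvalues)
    # Estimated proportion of true null hypotheses, capped at 1.
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        qvalues[i] = running_min
    return qvalues

# Hypothetical P-values from ten association tests.
pvals = [0.0001, 0.0004, 0.002, 0.03, 0.2, 0.4, 0.6, 0.7, 0.8, 0.9]
qvals = storey_qvalues(pvals)
significant = [i for i, q in enumerate(qvals) if q <= 0.05]
```

Tests whose q-value falls at or below 0.05 would be declared significant at a 5% FDR; in the toy input these are the three smallest P-values.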

Typically, genes encoded physically close to a mapped QTL are considered positional candidates. In the first instance, we identified genes within 0.5 Mb of a SNP-QTL as positional candidates. This genomic range was chosen because 95% of the 8,000 SNPs are within 0.5 Mb of their closest neighbouring SNP. This bin size identified 35 positional candidates corresponding to the eight IMF SNP-QTLs, from the latest bovine genome assembly (Btau3.1, August 2006).
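The window-based definition can be sketched as a simple distance filter. The code below is illustrative only: the `Gene` record, the gene names and their coordinates are hypothetical, and the distance is assumed to be measured from the SNP to the nearest gene boundary (zero if the SNP lies inside the gene).

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    chrom: str
    start_mb: float
    end_mb: float

def distance_mb(chrom, pos_mb, gene):
    """Distance from a SNP to a gene in Mb, or None if on another chromosome."""
    if gene.chrom != chrom:
        return None
    if gene.start_mb <= pos_mb <= gene.end_mb:
        return 0.0
    return min(abs(pos_mb - gene.start_mb), abs(pos_mb - gene.end_mb))

def positional_candidates(chrom, pos_mb, genes, window_mb=0.5):
    """Genes within window_mb of the SNP, nearest first."""
    hits = []
    for gene in genes:
        d = distance_mb(chrom, pos_mb, gene)
        if d is not None and d <= window_mb:
            hits.append((gene.name, round(d, 3)))
    return sorted(hits, key=lambda h: h[1])

# Hypothetical annotation around a SNP at 60.3 Mb on chromosome 7.
genes = [
    Gene("geneA", "7", 59.95, 60.00),
    Gene("geneB", "7", 60.28, 60.35),
    Gene("geneC", "7", 61.20, 61.30),
    Gene("geneD", "11", 60.29, 60.31),
]
candidates = positional_candidates("7", 60.3, genes)
```

Here `geneB` (containing the SNP) and `geneA` (0.3 Mb away) are retained, while `geneC` falls outside the window and `geneD` is on a different chromosome.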


Table 1. SNP-QTLs associated with IMF under at least one of five genetic inheritance models

      % IMF variance explained^A         Location              Candidates^C
SNP   Add    Codom   Dom    Rec   Over   Chr   Mb      HWE^B   Name      Dist
S1    -      -       9.2    -     -      2     121.8   0.033   PLA2G2A   1.088
S2    -      12.2    12.5   -     -      7     60.3    0.003   SPARC     1.298
S3    -      10.1    10.5   -     -      7     60.3    0.003   SPARC     1.298
S4    10.7   12.0    11.9   -     -      11    55.9    0.003   RPS21     0.174
S5    10.8   11.6    -      -     -      11    55.9    0.001   RPS21     0.175
S6    10.6   11.5    11.3   -     -      11    55.9    0.001   RPS21     0.175
S7    -      10.6    10.5   -     -      26    14.4    0.475   PDLIM1    0.258
                                                               SORBS1    0.176
                                                               RPL27     0.129
S8    -      11.5    11.7   -     -      26    36.5    0.405   ZRANB1    3.884

^A Percentage of IMF phenotypic variance (adjusted r² from linear regression) explained by each inheritance model at the respective SNP-QTL. Add: additive allelic inheritance model; Codom: co-dominance; Dom: dominance; Rec: recessive; Over: over-dominance.

^B P-value for the null hypothesis that, across breed, Hardy-Weinberg equilibrium (HWE) holds.

^C Positional candidates: name and distance (Mb) to the SNP.
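The test behind the HWE P-values in Table 1 is not stated. One common choice is a chi-square goodness-of-fit test with one degree of freedom on the three genotype counts, sketched below. The function name and the example counts are hypothetical, and the sketch pools all animals rather than modelling breed.

```python
import math

def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) P-value for Hardy-Weinberg equilibrium at one SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0                    # monomorphic marker: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))

p_in_equilibrium = hwe_chisq_pvalue(50, 100, 50)    # counts match expectation
p_het_deficit = hwe_chisq_pvalue(80, 40, 80)        # strong heterozygote deficit
```

A small P-value indicates departure from HWE, which can reflect population structure (for example, mixed breeds) or genotyping problems.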

Gene expression analysis. We have available to us eight gene expression experiments investigating various factors known to influence IMF (summarised in Lehnert et al. 2006). From these we compiled 15 contrasts examining breed type, aging, diet quality, vitamin A level, and gender, all of which have been implicated in marbling and IMF (Harper and Pethick 2004).

Due to the relatively low number of annotated genes represented on these arrays compared to the entire transcriptome (~20,000 genes), expression data are available for only four of the 35 positional candidate genes. For this reason, we expanded our criterion for defining positional candidates to include the gene closest to each mapped SNP-QTL that is also represented on the microarray. This resulted in a total of seven positional candidates for which expression data are available.
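The expanded criterion amounts to a nearest-neighbour search restricted to genes that are present on the array. A minimal sketch follows, assuming distances are measured to the nearest gene boundary; the function name, gene names, coordinates and array content are hypothetical.

```python
def nearest_arrayed_gene(chrom, pos_mb, genes, on_array):
    """Closest gene to the SNP that is also represented on the microarray.

    genes maps a gene name to (chromosome, start_mb, end_mb).
    Returns (name, distance_mb), or None if no arrayed gene is on the chromosome.
    """
    best = None
    for name, (g_chrom, start, end) in genes.items():
        if g_chrom != chrom or name not in on_array:
            continue
        if start <= pos_mb <= end:
            d = 0.0
        else:
            d = min(abs(pos_mb - start), abs(pos_mb - end))
        if best is None or d < best[1]:
            best = (name, d)
    return best

# Hypothetical annotation: geneA is closest but is not on the array.
genes = {
    "geneA": ("26", 36.40, 36.45),
    "geneB": ("26", 40.30, 40.40),
    "geneC": ("2", 36.50, 36.60),
}
on_array = {"geneB", "geneC"}
hit = nearest_arrayed_gene("26", 36.5, genes, on_array)
```

Because no distance cap is applied, the selected gene can lie well beyond the original 0.5 Mb window, as with the 3.8 Mb hit in this toy example.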

For this analysis, a gene is defined as DE if its expression in a given contrast is in the extreme 10% compared to other genes. With this criterion, five of the seven candidate genes are DE in at least one of the 15 contrasts (Table 2), and the number of times a gene is DE can be used to further prioritise the list of candidates. Consequently, the most highly ranked gene is SPARC (secreted protein, acidic, cysteine-rich), which is DE in five contrasts, including three comparing muscle development and two nutritional studies. Interestingly, SPARC has previously been shown to affect adiposity (Bradshaw et al. 2003), further implicating a possible functional role in the IMF phenotype.
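The DE call and the resulting ranking can be sketched as follows. The sketch assumes that "extreme 10%" means the 10% of genes with the largest absolute contrast value, which is one reading of the definition; the function names and the toy expression values are hypothetical.

```python
def call_de(values, fraction=0.10):
    """Genes whose absolute contrast value is in the top `fraction` (assumption)."""
    n_de = max(1, round(fraction * len(values)))
    ranked = sorted(values, key=lambda g: abs(values[g]), reverse=True)
    return set(ranked[:n_de])

def de_counts(contrasts, fraction=0.10):
    """Number of contrasts in which each gene is called DE."""
    counts = {}
    for values in contrasts:
        for gene in values:
            counts.setdefault(gene, 0)
        for gene in call_de(values, fraction):
            counts[gene] += 1
    return counts

# Toy contrasts (hypothetical expression differences for ten genes).
genes = [f"g{i}" for i in range(10)]
contrast_1 = dict(zip(genes, [2.5, 0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.1, -0.3, 0.2]))
contrast_2 = dict(zip(genes, [-3.0, 0.2, 0.1, -0.4, 0.3, 0.0, -0.2, 0.1, 0.2, -0.1]))
contrast_3 = dict(zip(genes, [0.1, -0.2, 0.3, 1.8, 0.0, 0.2, -0.1, 0.3, 0.1, -0.2]))

counts = de_counts([contrast_1, contrast_2, contrast_3])
ranking = sorted(counts, key=lambda g: counts[g], reverse=True)
```

Using absolute values treats up- and down-regulation alike, so a gene such as `g0`, which changes in opposite directions in two contrasts, still accumulates two DE calls and ranks first.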

Table 2. Number (N) and probability (P) of differential expression observed across 15 experimental contrasts and for each of the seven positional candidate genes

     PLA2G2A   SPARC   RPS21   PDLIM1   SORBS1   RPL27   ZRANB1
N    0         5       0       1        2        4       2
P    0.839     0.036   0.834   0.545    0.317    0.079   0.316


CONCLUSIONS

This paper proposed and illustrated the incorporation of gene expression data to better prioritise positional candidate genes resulting from a QTL analysis. The idea of combining gene expression and phenotypic trait data is not novel, but in studies where this has been achieved, the same cohort of animals has been used to obtain both types of data. Although ideal, such a luxury is not always possible, particularly in studies using production animals, where obtaining such data from a single study group is often practically and financially prohibitive. The approach presented here makes use of the rapidly growing volume of independently generated data. Using IMF in beef cattle, we illustrated the possibility of borrowing and incorporating expression data from external sources to identify and prioritise a list of five functional candidate genes. Although expression data were available for only a relatively small number of genes, by our argument, any gene expression data corresponding to any biological/physiological system relevant to IMF may be appropriate. This also includes studies from other species, such as mouse and human studies examining obesity and fat metabolism (e.g. Nadler et al. 2000; Lee et al. 2005). With appropriate consideration of gene orthologs, incorporation of multiple gene expression data sets may be extremely powerful in assisting candidate gene selection.

Furthermore, one should not be limited to gene expression data. Additional evidence to support, or upgrade, a positional candidate to a functional candidate can be sought from further in silico studies.

REFERENCES

Barendse, W. and Reverter-Gomez, A. (2007) Patent Application WO/2007/012119.

Bradshaw, A.D., Graves, D.C., Motamed, K. and Sage, E.H. (2003) Proc. Nat. Acad. Sci. (USA) 100:6045.

Dabney, A. and Storey, J.D. (2006). qvalue: Q-value estimation for false discovery rate control. R package version 1.1 http://www.r-project.org

González, J.R., Armengol, Ll., Guinó, E., Solé, X. and Moreno, V. (2006). SNPassoc: SNPs-based whole genome association studies. R package version 1.0-2. http://www.r-project.org

Hardenbol, P., Yu, F., Belmont, J., Mackenzie, J., Bruckner, C., Brundage, T., Boudreau, A., Chow, S., Eberle, J., Erbilgin, A., Falkowski, M., Fitzgerald, R., Ghose, S., Iartchouk, O., Jain, M., Karlin-Neumann, G., Lu, X., Miao, X., Moore, B., Moorhead, M., Namsaraev, E., Pasternak, S., Prakash, E., Tran, K., Wang, Z., Jones, H.B., Davis, R.W., Willis, T.D. and Gibbs, R.A. (2005) Genome Res. 15: 269.

Harper, G.S. and Pethick, D.W. (2004) Aust. J. Exp. Agric. 44:653.

Lee, Y.H., Nair, S., Rousseau, E., Allison, D.B., Page, G.P., Tataranni, P.A., Bogardus, C. and Permana, P.A. (2005) Diabetologia 48:1776.

Lehnert, S.A., Wang, Y.H. and Byrne, K.A. (2004) Aust. J. Exp. Agric. 44:1127.

Lehnert, S.A., Wang, Y.H., Tan, S.H. and Reverter, A. (2006) Aust. J. Exp. Agric. 46:165.

Nadler, S.T., Stoehr, J.P., Schueler, K.L., Tanimoto, G., Yandell, B.S. and Attie, A.D. (2000) Proc. Nat. Acad. Sci. (USA) 97:11371.

Reverter, A., Hudson, N.J., Wang, Y.H., Tan, S.H., Barris, W., McWilliam, S.M., Bottema, C.D.K., Kister, A., Greenwood, P.L., Harper, G.S., Lehnert, S.A. and Dalrymple, B.P. (2006) Phys. Gen. 28:76.

Storey, J.D. (2002) A direct approach to false discovery rates. J. R. Stat. Soc. B 64:479-498.

Upton, W., Burrow, H.M., Dundon, A., Robinson, D.L. and Farrell, E.B. (2001) Aust. J. Exp. Agric. 41:493.