Genetic Evaluation and Marker Assisted Selection


PREDICTING GENOMIC BREEDING VALUES WITHIN AND BETWEEN POPULATIONS

B. J. Hayes¹, A. P. W. De Roos² and M. E. Goddard¹,³

¹ Animal Genetics and Genomics, Department of Primary Industries Victoria, 475 Mickleham Rd, Attwood 3049, Australia.

² Holland Genetics, PO Box 5073, 6802 EB, Arnhem, The Netherlands.

³ Faculty of Land and Food Resources, University of Melbourne, Parkville, Australia.

SUMMARY

The availability of tens of thousands of SNP markers that can be genotyped cheaply is accelerating the adoption of MAS in the livestock industries, particularly LD-MAS and Genomic selection. The comparative accuracies of these two approaches depend on the density of markers available.

Within cattle populations, approximately 30,000 evenly spaced markers are recommended for genomic selection, or for the genome wide scan used to select the markers for LD-MAS. With sub-optimal marker densities genomic selection can still be implemented, with the inclusion of a polygenic effect to capture genetic variance not captured by the markers. When tested in an Australian Holstein data set with 9918 markers, the accuracy of GEBVs for both LD-MAS and Genomic selection was above 0.70. If the goal is to predict GEBVs across different populations, a larger number of markers must be genotyped in the reference population than is required if the results are only to be used within the population. This is because the persistence of LD phase between markers and QTL across populations is less than that within populations. By comparing the persistence of phase between populations and sub-populations, we conclude that 50,000 evenly spaced markers would be required to predict GEBVs between populations as diverged as Dutch Black-and-White and Dutch Red-and-White Holsteins, while at least 150,000 evenly spaced markers would be required for more diverged populations such as Australian Holsteins and Australian Angus.

INTRODUCTION

In marker assisted selection (MAS) one or a number of regions of the genome containing genes with an effect on the trait of interest are traced with DNA markers. Breeding values are then calculated using both pedigree and the marker information. MAS can be based on markers in linkage equilibrium with a quantitative trait locus (QTL) (LE-MAS), markers in linkage disequilibrium with a QTL (LD-MAS), or based on selection of the actual mutation causing the QTL effect (Gene-MAS).

All three types of MAS are currently being used in the livestock industries (Dekkers 2004). With all three types of MAS, only a proportion of the genetic variance is captured by the markers, determined by the number of QTL traced and the proportion of total genetic variance these QTL explain. An alternative when a dense marker map is available is to divide the genome into small segments and then simultaneously estimate the effects of all these segments on the trait of interest, thereby tracing all QTL with markers. In subsequent generations, animals can be genotyped for the markers to determine which chromosome segments they carry, and the effects of the segments each animal carries can then be summed across the whole genome to predict a breeding value.

Meuwissen et al. (2001) termed this ‘Genomic selection’. In simulations, Meuwissen et al. (2001) achieved accuracies of predicting breeding values from markers alone (the correlation between true and estimated breeding values) of 0.85. Such EBVs are termed genomic EBVs, or GEBVs.


Two recent developments are resulting in a rapidly accelerating adoption of LD-MAS and genomic selection in particular. The sequencing of a number of livestock genomes, including cow, pig and chicken, has led to the discovery of tens of thousands of DNA markers, in the form of single nucleotide polymorphisms (SNPs). Concurrent with the discovery of numerous SNP markers throughout the livestock genomes has been a dramatic reduction in the cost of genotyping per SNP.

The MAS and genomic selection programs underway at present are applied within populations, for example the Holstein population within Australia or the Holstein population within the Netherlands.

The extension of such programs to predict, for example, the performance of Australian bulls in the Netherlands, or the performance of Jersey bulls from Holstein data, is complicated by the limited persistence of LD phase between markers and QTL across populations and breeds.

The aim of this paper is to firstly evaluate the marker density necessary within cattle populations for the application of LD-MAS or genomic selection. We then compare methodologies for calculating GEBVs when different marker densities are available. Finally we investigate marker density required for implementation of LD-MAS or genomic selection between populations.

MARKER DENSITY REQUIRED FOR LD-MAS AND GENOMIC SELECTION WITHIN POPULATIONS

Both LD-MAS and Genomic Selection exploit linkage disequilibrium (LD). LD between a gene affecting a quantitative trait and one or several markers can be measured by r², the proportion of variation caused by the alleles at a QTL which is explained by the markers. The extent of LD in the population is important for LD-MAS because it determines how many markers are required in an initial genome scan in order to identify a set of markers with effects on the trait of interest.

Specifically, sample size must be increased by a factor of 1/r² to detect an ungenotyped QTL, compared with the sample size for testing the QTL itself (Pritchard and Przeworski 2001). For genomic selection, the density of markers will determine the proportion of genetic variance which is captured by the markers. In the absence of knowledge of all QTL in the genome we can use marker-marker LD as a proxy for QTL-marker LD. In Figure 1, the average decline of r² with distance is given for five different cattle populations (Australian Holstein, Norwegian Red, Australian Angus, New Zealand Jersey and Dutch Holsteins).
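As a minimal sketch of the quantities discussed above, the following Python computes the signed LD correlation r (and hence r²) between two biallelic markers from phased haplotypes, together with the 1/r² sample size factor. The function names, the 0/1 allele coding and the example haplotype counts are illustrative assumptions, not taken from the data sets in this paper.

```python
from math import sqrt


def r_between_markers(haplotypes):
    """Signed LD correlation r between two biallelic markers.

    haplotypes: list of (a, b) pairs, one per phased chromosome,
    with alleles coded 0/1 (an assumed coding for this sketch).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at marker A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at marker B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    if denom == 0:
        raise ValueError("both markers must be polymorphic")
    return d / denom


def sample_size_inflation(r_squared):
    """Factor by which sample size must grow to detect an ungenotyped QTL."""
    return 1.0 / r_squared


# Hypothetical haplotype counts: 40 '11', 10 '10', 10 '01', 40 '00'.
example_haplotypes = [(1, 1)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 10 + [(0, 0)] * 40
r = r_between_markers(example_haplotypes)
r_squared = r ** 2
inflation = sample_size_inflation(r_squared)
```

With these hypothetical counts r is 0.6 and r² is 0.36, so roughly 2.8 times as many animals would be needed as when testing the QTL itself.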

The Dutch and Australian Holstein populations had a very similar decline of LD, probably because these populations are highly related (e.g. Zenger et al. 2007) and are similar in effective population size and history. The decline of LD in the Norwegian Reds was more rapid than in the Holstein populations. One explanation for this could be that the effective population size in Norwegian Red is higher than in Holstein, even though the global population is much smaller.

Effective population size in Norwegian Reds is approximately 400 (Meuwissen et al. 2002), while for the global Holstein population effective population size is close to 150 (Hayes et al. 2003), and a more limited extent of LD is expected with larger effective population size.

The marker density simulated by Meuwissen et al. (2001) for genomic selection was such that the average r² between adjacent markers was 0.2, and they demonstrated that genomic selection produced highly accurate GEBVs with this level of LD. Figure 1 implies that, for the Holstein populations at least, there must be a marker approximately every 100 kb (kilobases) or less to achieve an average r² of 0.2. As the bovine genome is approximately 3,000,000 kb, this implies that on the order of 30,000 evenly spaced markers are necessary so that every QTL in the genome can be captured in Genomic selection. In Jerseys and Norwegian Reds, a larger number of markers would be required.


In order to detect QTL with a reasonable size experiment, similar numbers of markers would be required for the initial genome wide association study prior to LD-MAS.

[Figure 1: line plot of average r² value (y-axis, 0 to 0.8) against distance in kb (x-axis, 0 to 1500) for Australian Holstein, Norwegian Red, Australian Angus, New Zealand Jersey and Dutch Holsteins.]

Figure 1. Average r² value according to the distance between SNP markers. Results are from 9918 SNPs distributed across the genome genotyped in 384 Holstein cattle or 384 Angus cattle, 403 SNPs genotyped in 783 Norwegian Red cattle, 3072 SNPs genotyped in 2430 Dutch Holstein cattle, or 351 SNPs genotyped in Jersey cattle. Norwegian Red data kindly supplied by Prof. Sigbjorn Lien, Norwegian University of Life Sciences. New Zealand Jersey data kindly supplied by Dr. Richard Spelman, Livestock Improvement Corporation.

CALCULATION OF GEBVS IN LD-MAS AND GENOMIC SELECTION

Implementation of both LD-MAS and Genomic selection conceptually proceeds in two steps: (1) estimation of the effects of markers or chromosome segments (the selected set in the case of LD-MAS) in a reference population, and (2) prediction of GEBVs for animals not in the reference population, for example selection candidates, or in a validation population.

In LD-MAS, a polygenic breeding value is included in the GEBV to pick up genetic variance not captured by the markers. In Genomic selection as specified by Meuwissen et al. (2001), a polygenic component is not included in the prediction of GEBVs. However if the available marker density is less than suggested above, inclusion of a polygenic component in the GEBV from genomic selection would recapture some of the effects of the QTL which are not in sufficient LD with markers.

Both in LD-MAS and Genomic selection, the effect of chromosome segments on a trait of interest can be estimated using either single SNPs or haplotypes of the SNP alleles. Marker haplotypes can potentially capture more of the QTL variance than single markers, as the r² between the QTL alleles and haplotypes can be larger than the r² between the QTL alleles and single marker alleles. Hayes et al. (2006) found that with 9918 SNPs genotyped in an Australian Angus population, using haplotypes would capture more of the QTL variance than using single markers. However, in simulated data the advantage of using haplotypes over single markers is reduced as the marker density increases (e.g. Grapes et al. 2004, Grapes et al. 2006). The reason for this is most likely that with a high marker density there is very often one marker in very high LD with the QTL. Using haplotypes instead of single markers increases the number of effects to be estimated and makes positioning the QTL more difficult. With lower marker density, however, there may be no marker in LD with the QTL and, consequently, a single marker regression model does not explain any QTL variation. A haplotype model, on the other hand, may still capture some QTL variation, as one group of haplotypes may contain the favourable QTL allele and another group the unfavourable allele.

There are a number of statistical approaches which can be used for calculating chromosome segment effects, using either haplotypes or single markers. Meuwissen et al. (2001) compared three statistical methods for calculating chromosome segment effects. The methods can equally be used for LD-MAS when a large number of markers is available. The methods were least squares, a best linear unbiased prediction (BLUP) method assuming equal variances associated with each chromosomal segment, and a Bayesian method (BayesB) assuming a prior distribution of the variance associated with each chromosome segment. The methods were compared with different numbers of phenotypic records. In simulations, the effects of the chromosome segments were estimated in one generation of animals, and the breeding values for the progeny of these animals were predicted based only on the markers which they carried. The results suggested that the Bayesian method was superior to the others (Table 1). The increased accuracy of the Bayesian approach occurs because this method sets many of the effects of the chromosome segments to zero, and regresses the effects of the other chromosome segments, based on a prior distribution of QTL effects.

Table 1. Correlations between true and estimated breeding values when the number of phenotypic records is varied (from Meuwissen et al. 2001, with permission from the authors)

                                          No. of phenotypic records
Method                                    500     1000    2200
Least squares                             0.124   0.204   0.318
Best linear unbiased prediction (BLUP)    0.579   0.659   0.732
BayesB                                    0.708   0.787   0.848
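To illustrate the difference between the least squares and BLUP approaches compared above, the sketch below estimates marker effects with an equal variance assumed for every marker, which is equivalent to ridge regression with a shrinkage parameter lam equal to the ratio of residual variance to per-marker variance. Setting lam to zero gives the least squares solution when one exists. The Gauss-Seidel solver, the function names, the 0/1/2 genotype coding and the assumption that lam is known are all choices made for this illustration; this is not the implementation used by Meuwissen et al. (2001), and BayesB is not shown.

```python
def snp_blup(genotypes, phenotypes, lam, n_iter=200):
    """Estimate marker effects with equal shrinkage lam (ridge-type BLUP).

    genotypes: one row per animal, entries are allele counts (0/1/2, assumed coding)
    phenotypes: one record per animal
    lam: residual variance / per-marker variance, assumed known here
    Returns (phenotype mean, marker means, marker effects).
    """
    n, m = len(genotypes), len(genotypes[0])
    mean_y = sum(phenotypes) / n
    col_means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    # Centre genotypes so the mean can be fitted separately.
    x = [[row[j] - col_means[j] for j in range(m)] for row in genotypes]
    xtx = [sum(x[i][j] ** 2 for i in range(n)) for j in range(m)]
    effects = [0.0] * m
    resid = [y - mean_y for y in phenotypes]  # residuals given current effects
    for _ in range(n_iter):
        for j in range(m):
            # Right-hand side with marker j's current contribution added back.
            rhs = sum(x[i][j] * resid[i] for i in range(n)) + xtx[j] * effects[j]
            new_effect = rhs / (xtx[j] + lam)
            delta = new_effect - effects[j]
            for i in range(n):
                resid[i] -= x[i][j] * delta
            effects[j] = new_effect
    return mean_y, col_means, effects


def predict_gebv(genotype_row, mean_y, col_means, effects):
    """Sum the estimated marker effects over the markers an animal carries."""
    return mean_y + sum((g - c) * b for g, c, b in zip(genotype_row, col_means, effects))


# Tiny hypothetical reference population: three animals, one marker.
mean_y, col_means, effects = snp_blup([[0], [1], [2]], [1.0, 2.0, 3.0], lam=2.0)
gebv = predict_gebv([2], mean_y, col_means, effects)
```

In this toy case the least squares effect would be 1.0 per allele, while lam = 2.0 shrinks it to 0.5, so the animal carrying two copies is predicted at 2.5 rather than 3.0. With many more markers than records, this shrinkage is what keeps the BLUP estimates stable where least squares overfits.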

All the above methods require high LD between markers or marker haplotypes and the QTL. An alternative to the above methods is to use the Identical by Descent (IBD) methodology originally developed for QTL mapping (e.g. Meuwissen and Goddard 2004). For a putative QTL position in the genome, these methods calculate the probability that two animals share a chromosome segment inherited from a common ancestor, and therefore carry identical QTL alleles. Linkage and linkage disequilibrium information can be used simultaneously to predict these IBD probabilities. If multiple QTL across the genome are fitted simultaneously, GEBVs can then be produced for an animal by summing up across the effects of its QTL alleles. This methodology is particularly suited to cases where marker density is low, as in this case there will be some advantage in including the linkage information in the estimation of chromosome segment effects carried by each animal.

The prediction equation for GEBVs can be made more complex than just including the additive effects of chromosome segments, for instance by adding dominance effects and epistatic interactions.

However, when the variable being predicted is ‘additive genetic value’ or breeding value it is not appropriate to include dominance and epistatic interactions. Use of haplotypes is equivalent to fitting interactions between markers, but limited to markers close to one another and alleles on the same chromosome. The semi-parametric genomic selection method of Gianola et al. (2006) fits a high level of interaction and performed well on data simulated with epistatic interactions. Neural networks might also be used to search a vast model space to find the best model. However it seems illogical to discard biological knowledge, such as whether or not markers are syntenic, and use these very general models.

ACCURACIES FOR LD-MAS AND GENOMIC SELECTION IN A SINGLE DAIRY CATTLE POPULATION

To determine the accuracy of GEBVs that is possible with a commercially available panel of 9918 bovine SNP markers, we tested both genomic selection and LD-MAS in the Australian Holstein population. Three hundred and eighty four bulls were selected from the Australian dairy bull population for genotyping. The bulls selected were those with extreme high and extreme low estimated breeding values (EBVs) for the Australian selection index, ASI = (3.8*protein) + (0.9*fat) - (0.048*milk), where the records for the sub-components are based on the performance of the bulls' daughters (a progeny test). The bulls were genotyped for 9918 genome wide SNP markers.
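The index and the selection of extreme bulls can be sketched as follows. The index weights are those given above; the function names, the bull identifiers, the EBV values and the number of bulls taken from each tail are hypothetical, as the split between high and low bulls is not stated here.

```python
def australian_selection_index(protein, fat, milk):
    """ASI = (3.8*protein) + (0.9*fat) - (0.048*milk), from component EBVs."""
    return 3.8 * protein + 0.9 * fat - 0.048 * milk


def select_extremes(asi_by_bull, n_each):
    """Return (lowest n_each, highest n_each) bull identifiers ranked on ASI."""
    if n_each < 1 or 2 * n_each > len(asi_by_bull):
        raise ValueError("n_each must be at least 1 and at most half the bulls")
    ranked = sorted(asi_by_bull, key=asi_by_bull.get)
    return ranked[:n_each], ranked[-n_each:]


# Hypothetical (protein, fat, milk) EBVs for four bulls.
component_ebvs = {
    "bull_a": (10.0, 20.0, 500.0),
    "bull_b": (5.0, 8.0, 300.0),
    "bull_c": (12.0, 25.0, 400.0),
    "bull_d": (2.0, 3.0, 100.0),
}
asi = {bull: australian_selection_index(*ebvs) for bull, ebvs in component_ebvs.items()}
low_bulls, high_bulls = select_extremes(asi, n_each=1)
```

Selecting only the tails of the index distribution increases power to detect marker effects for a fixed genotyping budget, at the cost of a sample that is not representative of the whole bull population.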

LD-MAS. The data set was split in half at random. In the first half of the data, all 9918 SNPs were tested individually for their effect on ASI using the model

$$\mathrm{DYD}_i = \mu + a_i + \mathrm{SNP1}_i + \mathrm{SNP2}_i + e_i$$

where $\mathrm{DYD}_i$ is the daughter yield deviation of bull $i$, $\mu$ is the mean, $a_i$ is a polygenic breeding value with $a \sim N(0, A\sigma^2_A)$, $A$ being a matrix of additive genetic relationships among the Australian dairy bull population, $\sigma^2_A$ is the additive genetic variance not explained by the markers, $\mathrm{SNP1}_i$ and $\mathrm{SNP2}_i$ are the effects of the SNP alleles carried by animal $i$, and $e_i$ is a residual. SNPs were taken as significant if the F-value exceeded 10.84 (i.e. P<0.001). Thirty SNPs exceeded this significance threshold. The effects of the SNPs were then evaluated by fitting all 30 SNPs simultaneously as fixed effects in a multiple regression. The SNP effects were then regressed by 0.5, as suggested by Hayes et al. (2006). As bulls in the second half of the data set were also included in the pedigree, polygenic breeding values were calculated for these 192 bulls (without using their DYDs). A GEBV was then calculated for the second set of 192 bulls as the sum of this polygenic effect and the vector of SNP effects multiplied by their genotypes. The GEBVs were correlated with the DYDs of these bulls.
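The final step of the LD-MAS procedure described above can be sketched as follows: the GEBV of a validation bull is its polygenic breeding value plus the sum of the regressed SNP effects multiplied by its genotypes. The function name, the coding of genotypes as allele counts (0/1/2) and all numerical values are assumptions made for illustration; only the regression factor of 0.5 comes from the text.

```python
def ld_mas_gebv(polygenic_bv, snp_effects, genotypes, regression=0.5):
    """GEBV = polygenic breeding value + sum(regression * SNP effect * allele count).

    snp_effects: estimated effects of the significant SNPs (one per SNP)
    genotypes: allele counts of the animal at the same SNPs (assumed 0/1/2 coding)
    regression: factor applied to the SNP effects (0.5 in the text)
    """
    if len(snp_effects) != len(genotypes):
        raise ValueError("one genotype is required per SNP effect")
    marker_part = sum(regression * b * g for b, g in zip(snp_effects, genotypes))
    return polygenic_bv + marker_part


# Hypothetical bull: polygenic value 5.0 and three significant SNPs.
gebv = ld_mas_gebv(5.0, snp_effects=[2.0, -1.0, 0.5], genotypes=[2, 1, 0])
```

Here the unregressed marker contribution would be 3.0, the regression halves it to 1.5, and the GEBV is 6.5. Regressing the effects guards against the overestimation that results from selecting SNPs on their significance in the same data used to estimate them.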

Genomic selection. From the 384 bulls, 192 were chosen at random for the prediction of SNP effects (using a program kindly provided by Prof. Larry Schaeffer). For the remaining 192 bulls, GEBVs were predicted based on their SNP genotypes for all 9918 markers, without using a polygenic effect. The GEBVs were correlated with the DYDs of these bulls.

The correlation of GEBVs and DYDs for LD-MAS was 0.71. For Genomic selection the correlation was 0.72. This suggests in our data set at least, that in LD-MAS including a polygenic effect is compensating to some degree for the genetic variance not captured by the 30 markers. The lower accuracy of Genomic selection in our data than reported by Meuwissen et al. (2001) is probably a result of sub-optimal marker density in our data set, as well as the limited number of records. However these results should be treated with some caution. The calculation of accuracy was derived by splitting the data set in two at random, predicting the SNP effects from half the data, calculating GEBVs in the second half using genotypes only, and correlating these predicted EBVs with progeny test results. The accuracy derived in this way may not reflect the accuracies achieved for GEBVs of a group of young animals with the SNP effects predicted from older animals.


APPLYING LD-MAS AND GENOMIC SELECTION BETWEEN POPULATIONS

In practice, LD-MAS or Genomic selection are always applied in a population that is different to the reference population where the marker effects are estimated. It might be that the selection candidates are from the same breed, but are younger than the reference population, or they could be from a different selection line or breed. MAS relies on the phase of LD between markers and QTL being the same in the selection candidates as in the reference population. However as the two populations diverge, this is less and less likely to be the case, especially if the distance between markers and QTL is relatively large. The statistic r is a measure of LD between two markers in a population, but can also be used to measure the persistence of the LD phases across populations. While the r² statistic between two SNP markers at the same distance in different breeds or populations can be the same value even if the phases of the haplotypes are reversed, they will only have the same value and sign for the r statistic if the phase is the same in both breeds or populations. For marker pairs of a given distance, the correlation between their r in two populations, corr(r1,r2), is equal to the correlation of the effects of the marker between both populations, for markers that have that same distance to a QTL. If this correlation is 1, the marker effects are equal in both populations. If this correlation is zero, a marker in population 1 is useless in population 2. A high correlation between r values means that the marker effect persists across the populations. In LD-MAS and Genomic selection, if the chromosome segment effects are estimated in population 1, and GEBVs in that population can be predicted with an accuracy x1, then the GEBVs of animals in population 2 may be predicted from the chromosome segment effects of population 1 with an accuracy x2 = x1*corr(r1,r2). For each set of populations, one can work out the marker density that is required to obtain a corr(r1,r2) = 0.9.

Here, we calculate the correlation of r values across different breeds and populations as an indicator of how far the same marker phase is likely to persist between these breeds and populations. This is used to give an indication of the marker density required to ensure marker-QTL phase persists across populations and/or breeds, which would be necessary for the application of LD-MAS or Genomic selection using the same marker set and SNP effects across the breeds or populations.
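The persistence of phase statistic and its use in x2 = x1*corr(r1,r2) can be sketched as below. The input is assumed to be the signed r values for the same marker pairs (within one distance class) in each of two populations; the function names and the example values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def persistence_of_phase(r_pop1, r_pop2):
    """corr(r1, r2): correlation of signed r for the same marker pairs."""
    return pearson(r_pop1, r_pop2)


def across_population_accuracy(x1, corr_r):
    """x2 = x1 * corr(r1, r2), as given in the text."""
    return x1 * corr_r


# Hypothetical signed r values for four marker pairs in population 1.
r_pop1 = [0.8, -0.6, 0.4, -0.2]
same_phase = persistence_of_phase(r_pop1, [0.8, -0.6, 0.4, -0.2])
reversed_phase = persistence_of_phase(r_pop1, [-0.8, 0.6, -0.4, 0.2])
x2 = across_population_accuracy(0.8, 0.9)
```

The reversed-phase example has exactly the same r² values as the first population yet a correlation of r of -1, which is why r rather than r² is used to measure persistence of phase.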

The correlation of r values for Dutch Black-and-White bulls and Dutch Red-and-White bulls (HF_NLD and RW_NLD, respectively) was 0.9 at 30 kb (Figure 2). This indicates that at this distance r² is high in both populations and the sign of r is the same in both populations, so the LD phase is the same in both populations. If one of these SNPs was actually an unknown mutation affecting a quantitative trait, the other SNP could be used in MAS and the favourable SNP allele would be the same in both breeds.

So if we had a maximum distance between marker and QTL of 30 kb (50,000 evenly spaced markers across the genome) we could do a reasonable job of predicting GEBVs of Dutch Red-and-White Holsteins from a Dutch Black-and-White reference population. For Holstein and Angus breeds (HF_AUS and ANG_AUS), the correlation of r is above 0.9 only at 10 kb or less. Using the same logic as above, at least 150,000 evenly spaced markers would be required to predict genomic EBVs for the Australian Angus population from an Australian Holstein reference population (although this is unlikely to be required in practice!). For Australian Holsteins and Dutch Holsteins, the correlation of r values was above 0.9 up to 100 kb, reflecting the fact that there are common bulls used in the two populations (e.g. Zenger et al. 2007).
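The arithmetic behind these marker numbers can be made explicit. With evenly spaced markers, a QTL is never further from its nearest marker than half the marker spacing, so a maximum marker-QTL distance of d kb corresponds to a spacing of 2d kb. The sketch below assumes the approximate genome length of 3,000,000 kb used earlier in this paper; the function name is hypothetical.

```python
GENOME_LENGTH_KB = 3_000_000  # approximate bovine genome length used in the text


def markers_for_max_distance(max_marker_qtl_kb, genome_kb=GENOME_LENGTH_KB):
    """Evenly spaced markers needed so no QTL is further than the given distance."""
    spacing_kb = 2 * max_marker_qtl_kb  # a QTL lies at most half a spacing from a marker
    return round(genome_kb / spacing_kb)


# Distances at which the correlation of r reaches 0.9 in the text.
black_and_white_vs_red_and_white = markers_for_max_distance(30)  # 50,000 markers
holstein_vs_angus = markers_for_max_distance(10)                 # 150,000 markers
```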


[Figure 2: line plot of the correlation of r (y-axis, -0.20 to 1.00) against marker distance in kb (x-axis, 0 to 1000) for the population pairs HF_NLD vs HF_AUS, HF_NLD vs RW_NLD, HF_NLD vs HF_NZL, HF_AUS vs HF_NZL, HF_NZL vs JER_NZL, HF_AUS vs ANG_AUS and ANG_AUS vs JER_NZL.]

Figure 2. Correlation between r values for various cattle populations or sub-populations, as a function of marker distance.

Calculations of the correlation of r values between sub-divisions of the same population across time are indicative of persistency of phase across generations. For example the correlation of r values between Dutch Holstein bulls before 1995 and Dutch Holstein calves born in 2006 was 0.9 at 75 kb, indicating approximately 20,000 markers are required to predict GEBVs for Dutch calves born in 2006 from Dutch Holstein bulls born before 1995 (data not shown). Another important conclusion that can be drawn from this information is that with 20,000 markers, the predictions of chromosome segment effects should be usable for two generations, as accuracy will be reduced only slightly (by a factor 0.9) by breakdown of LD phase over this time.

In the above, we have assumed that effects of QTL alleles are similar in different breeds and populations. For some QTL which have been traced to known mutations, the alleles do act reasonably similarly in different breeds and populations. For example, the A allele of the DGAT1 gene results in increased fat yield and reduced protein yield and milk volume in New Zealand Holstein-Friesians, Jerseys and Ayrshires (Spelman et al. 2002). However, while the sizes of the effects are consistent for protein and milk volume in the Holstein-Friesian and Jersey breeds, the size of the fat response in Holstein-Friesians is nearly double that for Jerseys (Spelman et al. 2002). Another problem is that we have assumed that the same mutations affecting production traits are polymorphic in different breeds. This is true for some well characterised mutations such as the K232A mutation in DGAT1, which is polymorphic in Holsteins, Jerseys, Ayrshires and some Bos indicus breeds (Spelman et al. 2002, Kaupe et al. 2004). Other mutations, such as some of the functional mutations in the myostatin gene, appear to be breed specific (Dunner et al. 2003). One solution would be to use a multi-breed reference population, so that all the genetic variants are captured. Finally, genotype by environment interaction may also reduce the accuracy of predicted GEBVs when the chromosome segment effects are estimated from animals in another population.

CONCLUSIONS

Within Holstein cattle populations, approximately 30,000 evenly spaced markers are recommended for genomic selection, or for the genome wide scan used to select the markers for LD-MAS. With sub-optimal marker densities genomic selection can still be implemented, with the inclusion of a polygenic effect to capture genetic variance not captured by the markers. Using linkage information and haplotypes of markers could also be beneficial in this situation. When tested in an Australian Holstein data set with 9,918 markers, the accuracy of GEBVs for both LD-MAS and Genomic selection was above 0.70. If the goal is to predict GEBVs across different populations, a larger number of markers must be genotyped in the reference population than is required if the results are only to be used within the population. This is because the persistence of LD phase between markers and QTL across populations is less than within populations. By comparing the persistence of phase between populations and sub-populations, we conclude that 50,000 evenly spaced markers would be required to predict GEBVs between populations as diverged as Dutch Black-and-White and Dutch Red-and-White Holsteins, and at least 150,000 evenly spaced markers for populations as diverged as Australian Holsteins and Australian Angus. While it is unlikely that there will be a need to predict milk production GEBVs for Angus cattle from Holsteins, it is likely that it will be desirable to predict GEBVs for food conversion efficiency for different beef breeds for example.

ACKNOWLEDGMENTS

The program for calculating GEBVs was kindly provided by Prof Larry Schaeffer, University of Guelph. Norwegian Red data were kindly supplied by Prof. Sigbjorn Lien, Norwegian University of Life Sciences. Jersey data were kindly supplied by Dr. Richard Spelman, Livestock Improvement Corporation.

REFERENCES

Dekkers, J.C. (2004) J Anim Sci. 82 E-Suppl:E313

Dunner, S., Miranda, M.E., Amigues, Y. et al. (2003) Genet Sel Evol. 35:103

Gianola, D., Fernando, R.L. and Stella, A. (2006) Genetics 173:1761

Grapes, L., Dekkers, J.C., Rothschild, M.F. and Fernando, R.L. (2004) Genetics 166:1561

Grapes, L., Firat, M.Z., Dekkers, J.C., Rothschild, M.F. and Fernando, R.L. (2006) Genetics 172:1955

Hayes, B.J., Visscher, P.M., McPartlan, H. and Goddard, M.E. (2003) Genome Res. 13:635

Hayes, B.J., Chamberlain, A.C. and Goddard, M.E. (2006) Proc. 8th World Congr. Genet. Appl. Livest. Prod.

Kaupe, B., Winter, A., Fries, R. and Erhardt, G. (2004) J Dairy Res. 71:182

Meuwissen, T.H. and Goddard, M.E. (2004) Genet Sel Evol. 36:261

Meuwissen, T.H.E., Hayes, B.J. and Goddard, M.E. (2001) Genetics 157:1819

Meuwissen, T.H.E., Karlsen, A., Lien, S., Olsaker, I. and Goddard, M.E. (2002) Genetics 161:373

Pritchard, J.K. and Przeworski, M. (2001) Am J Hum Genet. 69:1

Spelman, R.J., Ford, C.A., McElhinney, et al. (2002) J Dairy Sci. 85:3514

Zenger, K.R., Khatkar, M.S., Cavanagh, J.A., Hawken, R.J. and Raadsma, H.W. (2007) Anim Genet. 38:7