|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ANIMAL GENETICS |


* University of Life Sciences, Department of Animal and Aquacultural Sciences, PO Box 5003, N-1432 Ås, Norway;
and
AKVAFORSK (Institute of Aquaculture Research Ltd.), PO Box 5010, N-1432 Ås, Norway; and
Roslin Institute (Edinburgh), Roslin, Midlothian EH25 9PS, United Kingdom
| Abstract |
|---|
|
|
|---|
Key Words: accuracy of selection breeding value estimation dense marker map genome-wide selection marker-assisted selection
| INTRODUCTION |
|---|
|
|
|---|
Meuwissen et al. (2001)
made a first step toward predicting a total genetic value using a genome-wide dense map of highly informative markers. The method was termed genomic selection, and the idea was to estimate the effects of all genes or chromosomal segments simultaneously. The effect of these segments is summed to predict the total breeding value. Selection can then be based on these breeding values. By treating the markers or haplotypes as random effects, the problem of estimating large numbers of haplotype effects from a limited number of animals can be managed. Meuwissen et al. (2001)
compared different methods for predicting breeding values based on haplotype effects and found accuracies in the range of 0.79 to 0.85. Although the accuracy obtained depends on the genomic and phenotypic models assumed, these accuracies were obtained only by phenotyping the parental and grandparental generation and were greater than the accuracy obtained if the breeding values of both parents had been known without error.
Factors affecting the accuracy of the prediction of the genotypes are largely unknown. At present it is unknown how dense markers need to be, particularly if they vary in information content. Therefore, the main aim of this paper was to examine how the accuracy of predicted breeding values responded to changes in marker densities with 2 different classes of markers such as microsatellites and SNP. Additionally, we examined if effects of marker genotypes should be used or if it was advantageous to use haplotype effects instead, due to their increased informativeness.
| MATERIALS AND METHODS |
|---|
|
|
|---|
This study examined by simulation the changes in accuracy resulting from changing the type and density of markers and the approach to estimating effects with the evaluation. Throughout all comparisons, a Markov chain Monte Carlo approach was taken to the analysis with markers and QTL obtained from 1,000 generations of mutation selection balance.
Population
A stochastic model was simulated with a base generation of 100 unrelated animals (50 males and 50 females). To form generation 1 and later generations, sires and dams were mated randomly, but excluding selfing, for 1,000 generations. The effective population size (Ne) was 100. The number of offspring born in generation t = 1,001 was increased to 1,000 using a factorial mating design involving all 50 sires and 50 dams from generation t = 1,000, where each sire was mated to 20 dams with 1 offspring per mating pair. In generation t = 1,002 a further 1,000 individuals were created by randomly sampling with replacement the sire and dam from among the 500 sires and 500 dams in generation t = 1,001.
Genome
In total, 10 chromosomes of equal length were simulated in the genome, such that the total genome size was 10 morgan. Markers were distributed evenly along the chromosomes but with differing densities. Marker densities are expressed throughout the paper in terms of Ne and morgan if not otherwise stated, because linkage disequilibrium is a function of 4Nec, where c = the distance between the loci, and thus the same linkage disequilibrium is obtained when doubling Ne and halving c (i.e., doubling marker density). For example, 8Ne/morgan is equivalent to 800 markers per chromosome in our case, where Ne = 100, or 8,000 markers in total. In total, 4 different density schemes were evaluated for each of the 2 marker types. Using microsatellites, the 4 density schemes were 2Ne/morgan, 1Ne/morgan, 0.5Ne/morgan, and 0.25Ne/morgan, whereas, using SNP markers, the densities were 8Ne/morgan, 4Ne/morgan, 2Ne/morgan, and 1Ne/morgan. The total number of putative QTL, which can turn into a real QTL when a mutation occurs, was kept constant at 100 per chromosome for all marker density schemes. Marker and QTL positions are illustrated in Table 1
for 1 chromosome for the different marker density schemes. For example, with a density of 2Ne/morgan, the chromosome begins with 2 markers, then 1 putative QTL, followed by 2 markers, a putative QTL, and so forth, in total 302 loci per chromosome.
|
At t = 0 there was no polymorphism in marker or QTL loci. Each generation, for each individual at each locus, a random number was drawn in the simulation to test for the occurrence of a mutation. The mutation rate for the marker positions was 2.5 x 10–3 per locus per generation. The mutation rate for the QTL positions was 2.5 x 10–5 per locus per generation, and whether a putative QTL caused genetic variance or not depended on the mutations at these putative QTL positions. For each new mutation at a QTL, an allelic effect was drawn from the gamma distribution with a shape parameter β = 0.4 and a scale parameter of 1.66 (Hayes and Goddard, 2001
). The shape parameter was chosen to reflect the empirical distribution of the QTL effects, and the scale parameter was chosen such that the expected total genetic variance equaled 1. The gamma distribution yields only positive effects; therefore, the QTL effect was sampled to be positive or negative with probability 0.5. Typically, the number of segregating QTL was between 5 and 6% of the total number of QTL, and the distribution of the QTL allele frequencies resembled a U-shaped distribution as in the Wright-Fisher mutation drift equilibrium (Wright, 1931
, 1935
). Figure 1
shows the total genetic variance and heterozygosity of microsatellite markers plotted against the number of generations.
|
The SNP markers were obtained by recoding microsatellite alleles. First, the evolutionary tree of how the microsatellite alleles evolved was stored (e.g., allele 4 may have come from a mutation in allele 2 and so on). Second, one of the mutations in this evolutionary tree was assumed visible and the others were invisible, which resulted in only 2 alleles, 1 mutated allele and 1 ancestral allele. The mutation to be visible was chosen such that the SNP allele frequency was as close as possible to 0.5 in generations t = 1,001 and 1,002. Figure 2
shows a typical minor allele frequency of the SNP markers, illustrated using the 1Ne/morgan density scheme. The minor allele frequencies of the SNP markers showed approximately a uniform distribution with an overrepresentation of marker alleles with intermediate frequencies, which reflects the effect in practice of prescreening SNP markers and selecting the most informative. This resulted in a typical number of segregating SNP markers of 98 to 99% of the total number of markers.
|
Only the 1,000 individuals in generation t = 1,001 were assumed to have a phenotypic record. Phenotypic records in generation t = 1,001 were obtained by Pi = TBVi +
i, where TBVi = the true breeding value of the ith animal and
i sampled from N(0,
e2). The environmental variance (
e2) was set equal to the true genetic variance (
g2); therefore, the heritability was 0.5 for every replicate. The genetic variance varied somewhat from replicate to replicate, but was on average close to 1 (Figure 1
).
The effect of doubling the number of phenotypes was tested by doubling the number of animals in t = 1,001 to 2,000 using the factorial mating design described above, except that each sire was mated to 40 dams with 1 offspring per mating pair. In this scenario, a further 2,000 unphenotyped individuals were created in t = 1,002 by randomly sampling with replacement its sire and dam among the 1,000 sires and 1,000 dams in generation t = 1,001.
Estimation of Marker and Haplotype Effects
The markers were treated either as marker genotype effects or as 2 markers combined into a haplotype. When haplotypes are referred to in this text, it means 2 neighboring markers combined into a haplotype, unless otherwise stated. Haplotypes used densities of 1Ne/morgan and 2Ne/morgan using microsatellites and 4Ne/morgan and 8Ne/morgan using SNP markers. Recombination between the markers followed Haldanes mapping function (Haldane, 1919
). In each simulated population, the BayesB method of Meuwissen et al. (2001)
was used to estimate the effects of single markers and haplotypes in generation t = 1,001. The model at the level of the data was: y = µ1n +
iXigi + e, where y = the vector of phenotypes; 1n = a vector of n ones;
i = the summation over all markers (haplotypes); Xi = a design matrix for the ith marker; gi = the vector of marker (haplotype) effects; and e = the error. The variance of the marker effects is
2gi, which is estimated for every marker using an informative prior distribution. The distribution of the genetic variances across loci resembles a situation where there are many loci with no genetic variance (not segregating) and some with genetic variance. The prior distribution, therefore, is a mixture of distributions with a probability
, at
2gi = 0 and an inverted chi-square distribution X–2(v, S) with probability (1–
) for
2gi > 0. The parameters of the prior distribution were v = 4.234 and S = 0.0429 (Meuwissen et al., 2001
). The probability
depends on the density of the markers and varies with different marker densities, because with more markers, it becomes less likely for marker i to be closest to a QTL. At 0.25, 0.5, 1, 2, 4, and 8Ne/morgan, the values of
were 0.212, 0.106, 0.053, 0.0265, 0.01325, and 0.006625, respectively.
Sampling from a prior for
2gi, which was a mixture distribution, was by a Metropolis-Hastings algorithm that sampled
2gi from p(
2gi|y*), where the prior distribution, p(
2gi) was used as the distribution to suggest updates for the Metropolis-Hastings chain (Gilks et al., 1996
) and y* denotes the data y corrected for the mean and all other genetic effects except the marker (haplotype) effect (gi). The Metropolis Hastings chain was run for 10,000 cycles using a burn-in period of 1,000 cycles. Given
2gi, marker (haplotype) effects, gi, were sampled from p(gi|
2gi) using Gibbs sampler (Sørensen and Gianola, 2002
).
Linkage Disequilibrium and Marker Informativeness
Linkage disequilibrium (LD) was estimated as the average r2 value for 2 adjacent SNP markers using 10 replicates for all 1,000 animals in generation t = 1,001 for the 4 different SNP density schemes. The formulae used to calculate LD was r2 = D2/(p1p2q1q2), where D = the coefficient of disequilibrium. The frequency of allele 1 and 2 in locus 1 and frequency of allele 1 and 2 in locus 2, are p1, p2 = (1–p1), q1 and q2 = (1–q1), respectively. Figure 3
shows the estimated r2 value for the 4 different density schemes when adjacent SNP markers were evaluated. The estimated LD was similar to the expected value of LD if a population is in recombination drift balance and allows for mutations, which is approximately 1/(2 + 4Nec), where c = the recombination rate between the markers (Tenesa et al., 2007
). The LD increased with increasing marker density, as expected. Marker informativeness was calculated for both microsatellites and SNP markers using the polymorphism information content (PIC; Lynch and Walsh, 1998
).
|
Generation t = 1,002 was not phenotyped, but breeding values in generation t = 1,002 were estimated using the phenotypic records of t = 1,001 and the genomic records in t = 1,001 and 1,002. The EBV were made either using marker genotypes or combining neighboring marker genotypes into haplotypes. The EBV were compared with the TBV in generation t = 1,002. Estimated breeding values of animal j were obtained after estimating the marker-haplotype effects from:

where Xji denotes the marker genotype of animal j at locus i in generation t = 1,002 and gi = the estimate of the marker or haplotype effects, which were estimated on animals in generation t = 1,001.
Twenty replicated simulations were performed per marker density and marker type as described above. A regression analysis was performed of TBV on estimated breeding values. The regression analysis resulted in a regression coefficient (b) and a correlation coefficient (r), where the regression coefficient reflects the bias of the breeding value estimate, b = 1 denotes unbiased estimates, and the correlation coefficient reflects the accuracy of predicting the breeding values.
| RESULTS |
|---|
|
|
|---|
Typically, the number of microsatellite alleles was between 1 and 14, with an average of 6 alleles per locus genome-wide. Table 2
shows the marker informativeness (PIC) for microsatellites and SNP in generation t = 1,001. Mean values for the PIC genome-wide for microsatellites and SNP markers were typically 0.459 and 0.234, respectively. These can be compared with theoretical maxima for PIC of 0.810 for a 6-allele microsatellite and 0.375 for a biallelic SNP, where the maxima are achieved with equal frequency among the alleles at a locus.
|
Four schemes with different densities for microsatellites and SNP markers were compared. The regression coefficients (b) of TBV on EBV and the accuracy of selection (i.e., the correlation between TBV and EBV) are given in Tables 3
, 4
, and 5
. The accuracy of the EBV varied from 0.626 to 0.827 for microsatellites using a marker density of 0.25Ne/morgan and 2Ne/morgan, respectively (Table 3
). Hence, the accuracy of estimating the breeding values using microsatellites increased as the density of the markers increased, as expected. The regression of TBV on EBV become closer to 1 as the marker density increased, although for SNP, this relationship was less clear.
|
|
|
In general, the accuracies for both microsatellites and SNP markers increased about 1.04 to 1.07-fold when the marker density was doubled. Figure 4
summarizes the results graphically, where the average accuracy for the 20 replicates is plotted against the marker density when using marker genotypes for estimation.
|
Using Haplotypes to Estimate Breeding Values
Two different densities using haplotypes for both microsatellites and SNP were evaluated. The regression coefficients (b) of TBV on EBV and the accuracy of EBV are given in Tables 6
and 7
.
|
|
| DISCUSSION |
|---|
|
|
|---|
It is perhaps counterintuitive that the use of marker haplotypes instead of marker genotypes yielded somewhat lower accuracies at a relatively low SNP density of 4Ne/morgan, and similar tendencies were found for the microsatellite markers (Tables 6
and 7
). The covariance structure imposed by fitting marker genotypes differs from that by fitting marker haplotypes. For example, for 2 adjacent markers M and N with M11N11 denoting the marker genotypes at M and N: when fitting marker genotype, there is a covariance between records yM11N11 and yM11N22, due to the common genotype at M, whereas there is no covariance between these records when marker haplotypes are fitted. Apparently, assuming covariance between records that carry the same marker genotype yields better prediction of the QTL genotypes than assuming this covariance is absent. If we assume that one of the markers is the better predictor of the QTL, say marker M, it is apparently not always beneficial to make haplotypes with an adjacent marker, which itself may not be a good predictor of the QTL. The latter is especially the case when the adjacent markers are further apart (i.e., having relatively low r2), and thus their r2 with the QTL may be different (Table 7
).
A simulation study, as carried out here, requires several assumptions about the underlying genetic model and the population history. The genetic model assumed here is a model, just like the infinitesimal model is a model, and the assumptions will never hold exactly in any practical situation. However, we believe that our main conclusions about the effects of marker density and types of markers are general and they will qualitatively also hold for different genetic models. Nevertheless, it may be that different genetic models (e.g., with fewer large QTL) may require adjusting the prior distribution of the genetic model at hand. It is important that the prior distribution and the true underlying genetic model correspond well, and this will require research on the distribution of the QTL effects (e.g., Hayes and Goddard, 2001
). The prior distribution of the variances of marker effects used here was a mixture distribution of an inverted chi-square and a distribution with zero variance, whereas the true QTL effects were sampled from the gamma distribution. Hence, the prior distribution used for the analysis and the distribution for the sampling of the true QTL effects did not agree exactly in this study, which may explain the bias observed in the EBV (b values <1 in Tables 3
and 4
). However, we also simulated 5 replicates, where it was assumed that the QTL genotypes at the true QTL positions were known (i.e., running the model using only QTL genotypes) and the estimates of the QTL effects were used to predict the EBV. The accuracy of selection in that case was 0.919 ± 0.013, and the regression of TBV on EBV was 1.000 ± 0.023, which indicates that in our study the bias came from the use of marker effects rather than from the distributional assumptions made concerning QTL. The accuracy of 0.919 also places an upper bound for the accuracy that can be expected from using a very dense marker map. In practice, we have to be aware of possible biases of genomic selection EBV, which may be due to differences between the prior distribution used in the EBV estimation and the true underlying genetic model and from using markers instead of QTL effects. These biases may be corrected for using a cross-validation type of approach, in which some phenotypes are hidden from the analysis and later the regression of the hidden phenotypes on the predicted EBV is estimated. This regression coefficient should equal 1, and if this is not the case, the EBV can be rescaled such that the regression coefficient equals 1.
Although there are many structural similarities in the simulated populations and the parameters used in this study compared with Meuwissen et al. (2001)
, the results show that there is no barrier to achieving such accuracies using genomic evaluations in practice. In a traditional progeny test scheme in Canadian Holsteins, the accuracy of predicting the EBV of progeny-tested young bulls for production, conformation, fertility, and longevity was estimated to be 0.75 for their first EBV (Schaeffer, 2006
). Compared with this, a density of 1Ne/morgan using microsatellites or 2Ne/morgan using SNP seems sufficient to achieve a similar accuracy.
The presented simulations assumed a relatively small effective population size of Ne = 100, which generates LD between the markers and a QTL and thus causes the marker effects. The expected amount of disequilibrium in a stable population represents a balance between its creation by drift and its decay by recombination. For a randomly mating population in which the drift-recombination balance has been achieved, the expected squared correlation between the presence of 2 linked loci is E(r2) = 1/(1 + 4Nec), where c = the recombination between 2 loci (Lynch and Walsh, 1998
). Thus, if Ne is 2 times greater than in our simulations, the marker density needs to be doubled to generate the same LD and thus to have the same accuracy of selection. To test this, a simulated population with Ne = 200 using SNP markers with a marker density of 1Ne/morgan (i.e., Ne = 200) was compared with a population using Ne = 100 and a marker density of 1Ne/morgan. Table 8
shows that using Ne = 200 gave almost the same accuracy compared with using Ne = 100 using half the marker density. Note that in both schemes, marker densities are 1Ne/morgan, demonstrating that selection accuracies are similar for similar densities expressed in Ne/morgan units, which is why we used these units throughout this paper.
|
|
The estimation model assumes that there is no dominance (i.e., only the additive effects are fitted), and the average effects of the genes are estimated, which is probably satisfactory for the prediction of breeding values in most cases. When the prediction of dominance effects is important to predict total genetic values, dominance effects need to be added to the statistical model. In theory, this is not a problem, but research is needed to verify that such estimates are accurate in realistic scenarios. Another assumption in the simulation model is that the markers and QTL are evenly distributed on the chromosome, which in fact is realistic for microsatellite markers (Liu and Cordes, 2004
).
This study showed that the results of Meuwissen et al. (2001)
could be extended to SNP markers, which make dense high-throughput genotypes possible. At greater densities, one needs about twice as many SNP as microsatellites. Using such dense SNP genotyping technology may make selection for complex traits, or traits that are not widely recorded, possible, by estimating the marker effects in one generation, and using these effects in later generations to select their descendents.
1 Corresponding author: trygve.roger.solberg{at}umb.no
Received for publication January 4, 2007. Accepted for publication April 2, 2008.
| LITERATURE CITED |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
T. Luan, J. A. Woolliams, S. Lien, M. Kent, M. Svendsen, and T. H. E. Meuwissen The Accuracy of Genomic Selection in Norwegian Red Cattle Assessed by Cross-Validation Genetics, November 1, 2009; 183(3): 1119 - 1126. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Rincon, A. Islas-Trejo, J. Casellas, Y. Ronin, M. Soller, E. Lipkin, and J. F. Medrano Fine mapping and association analysis of a quantitative trait locus for milk production traits on Bos taurus autosome 4 J Dairy Sci, February 1, 2009; 92(2): 758 - 764. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |