J. Anim Sci. 2007. 85:909-918. doi:10.2527/jas.2006-668
© 2007 American Society of Animal Science
Accounting for outliers and heteroskedasticity in multibreed genetic evaluations of postweaning gain of Nelore-Hereford cattle1
F. F. Cardoso*,2,
G. J. M. Rosa
and
R. J. Tempelman
* EMBRAPA Pecuária Sul (Brazilian Agricultural Research Corporation South Cattle and Sheep Center), Bagé, RS 96401-970;
and
Department of Dairy Science, University of Wisconsin, Madison 53706; and
and
Department of Animal Science, Michigan State University, East Lansing 48824
 |
Abstract
|
|---|
The objectives of this study were to demonstrate the utility of hierarchical Bayesian models combining residual heteroskedasticity with robustness for outlier detection and muting and to evaluate the effects of such joint modeling in multibreed genetic evaluations. A 3 x 2 factorial specification of 6 residual variance models based on several distributional (Gaussian, Students t, or Slash) and variability (homoskedastic or heteroskedastic) assumptions was used to analyze 22,717 postweaning gain records from a Nelore-Hereford population (40,082 animals in the pedigree). To illustrate the utility of the 2 robust distributional specifications (Students t and Slash) for outlier detection and muting, 3 records from the same contemporary group (an extreme residual outlier, a mild residual outlier, and a near-zero residual) were chosen for further study. The posterior densities of the corresponding weighting variables of these records were used to assess their degree of Gaussian outlyingness and the ability of the robust models to mute the effects of deviant records. The Students t heteroskedastic provided the best-fit model among the 6 specifications and was preferred for genetic merit inference. Kendall rank correlations of the posterior means of the additive genetic effects of the animals, used to compare the selection order of the Students t and Gaussian models, were reasonably high across all animals within the most frequent genotypes, ranging from 0.83 to 0.91 and from 0.89 to 0.95 for the homoskedastic and the heteroskedastic versions, respectively. However, when considering only animals ranked in the top 10% by the customary Gaussian homoskedastic model, these rank correlations were reduced considerably, ranging from 0.29 to 0.57 and from 0.72 to 0.85 between the 2 residual densities within the homoskedastic and heteroskedastic versions, respectively. Rank correlations between the homoskedastic and heteroskedastic versions within each of the Gaussian and Students t error models tended to be smaller, with a range from 0.68 to 0.90 across all animals and from 0.28 to 0.67 for animals ranked in the top 10%. These results support the implementation of robust models accounting for sources of heteroskedasticity to increase the precision and stability of multibreed genetic evaluations with proper statistical treatment of deviant records.
Key Words: Bayesian inference beef cattle genetic evaluation heteroskedasticity multibreed robust model
 |
INTRODUCTION
|
|---|
The accuracy of EBV depends on the quality of the phenotypic and pedigree data available. Performance records influenced by factors that are not explicitly specified in the statistical model but that have potentially large effects (i.e., outliers) can severely bias the genetic parameter estimates and genetic evaluations (Stranden and Gianola, 1998
). Preferential treatment, disease, inappropriate contemporary group (CG) formation, and animal misidentification are potential examples of incomplete recording in beef cattle populations.
Data editing for genetic evaluation systems generally involves deleting records that are considered extremely deviant (usually 3 or more SD) from the phenotypic mean of their subclass or that fall outside the range of 60 to 140% of the ratio record or mean of its subclass (Bertrand and Wiggans, 1998
). Although this strategy seems suitable for very obvious data recording errors (e.g., digit transposition), it could be unduly inefficient if the true residual distribution strays from normality and the variances are heterogeneous across various data subclasses. Robust heteroskedastic error models, recently proposed by Cardoso et al. (2005)
for genetic evaluations, combine features of structural variance models (Foulley et al., 1992
; Kizilkaya and Tempelman, 2005
) with those of heavy-tailed densities from the normal-independent family (Lange and Sinsheimer, 1993
; Rosa et al., 2003
), such as the Students t and the Slash distributions, representing an alternative to the deletion of records that could be plausible yet considered extreme relative to the normality assumption.
The objectives of this study were as follows: 1) to demonstrate the utility of hierarchical Bayesian models combining residual heteroskedasticity with robustness for outlier detection and muting and 2) to further evaluate the effects of such joint modeling in multibreed genetic evaluations.
 |
MATERIALS AND METHODS
|
|---|
Animal Care and Use Committee approval was not obtained for this study because the data were obtained from an existing performance records database.
Nelore-Hereford Data
The data consisted of 22,717 postweaning gain (PWG) records (40,082 animals in the pedigree file) of a beef cattle population composed of Herefords, Nelores, and their crosses, belonging to 15 herds within a large-scale breeding program in Brazil, the Conexão Delta G. The animals were grown on extensive pasture conditions in 3 regions: region 1, located between 14 and 16°S latitude (5,410 records); region 2, located between 21 and 23°S (3,110 records); and region 3, located between 30 and 32°S (14,197 records). The distribution of the breed composition groups was highly unbalanced across the various regions, with a predominance of F1 offspring from Nelore dams in the tropical regions 1 and 2 and purebred Herefords in subtropical region 3. The average PWG ± SD was 98.2 ± 41.2 kg. More details on the structure of this data set were presented by Cardoso and Tempelman (2004)
.
Model Specification and Posterior Inference
Hierarchical Bayesian models combining heteroskedasticity and robustness as proposed by Cardoso et al. (2005)
were employed. A 3 x 2 factorial representation was used in the specification of 6 residual variance models, with 1 factor being the distributional specification (Gaussian, Students t, or Slash) and the second factor specifying variability (homoskedastic vs. heteroskedastic). The resulting 6 residual variance models were as follows: 1) Gaussian homoskedastic (G-HO), 2) Students t homoskedastic (T-HO), 3) Slash homoskedastic (S-HO); 4) Gaussian heteroskedastic (G-HE), 5) Students t heteroskedastic (T-HE), and 6) Slash heteroskedastic (S-HE).
For all 6 residual variance models, the same linear mixed model was specified for data as an additive function of the location parameters, as in Cardoso et al. (2005)
:
 | [1] |
Here, yijklmn = the PWG record on the nth animal within ith region, jth CG, and kth sex; µ = the overall mean; regi = the effect of the ith region (i = 1, 2, 3); cgj = the effect of the jth CG (j = 1, 2, ..., 940), with cgj ~ N(0,
2cg) for all j; sexk = the effect of the kth sex (k = 1 or k = 0 for male or female, respectively); the ßs are regression coefficients associated with postweaning test period in days (PWP, 106
PWP
483), age of dam in years (AOD, 2
AOD
12), "fixed" additive effect (A) for the Nelore breed, Nelore-Hereford dominance effect (D), Nelore-Hereford additive x additive epistatic effect (AA), and interactions between A and sex, PWP, and AOD.
The coefficient fn (fd) represents the expected proportion of Nelore alleles for animal n (dam d), whereas
n = the heterozygosity coefficient, obtained as the probability that 1 allele is derived from the Nelore breed and the other allele is derived from the Hereford breed for a randomly chosen locus from individual n. Furthermore, 2[1 fn]fn = the additive x additive epistatic coefficient for animal n based on Kinghorns (1980)
definition of recombination loss. Now, un = the additive genetic effect of the nth animal, assuming that u = {un}n = 140082 ~ N(0,G), where G = a multibreed genetic (co)variance matrix that is a function of the breed-specific genetic variances for the Nelore and Hereford breeds, the segregation variance between the 2 breeds, and genetic relationships among animals (Lo et al., 1993
).
Finally, eijklmn = a residual term with distribution according to the 6 alternative specifications as outlined above by specifying residual variances conditionally on particular record-specific weighting variables. More specifically, eijklmn ~ N(0,
jkn), where
jkn= 
j
k (
A)fn (
D)
n wn1 , is the residual scale parameter specific to the record on animal n, as determined by a multiplicative function of an overall scale parameter (
2e) and CG (
j), sex (
k), breed (
A), and heterozygosity (
D) factors for the 3 heteroskedastic error models (G-HE, THE, and S-HE). For the 3 homoskedastic error models (G-HO, T-HO, and S-HO), residual variances were more simply specified across all records as
2ejkn =
2ewn1. Here, wn = a weighting variable specific for the record on animal n and whose distribution conditionally defines the marginal residual distribution for the data density. For example, the Gaussian error specifications (G-HO and G-HE) were obtained by setting wn = 1 for all n. Otherwise, there are alternative distributional specifications on wn, each yielding a different heavier-tailed marginal density that models the deviation of the distribution of the residuals from a Gaussian distribution (Lange and Sinsheimer, 1993
).
We specifically considered 2 heavy-tailed densities: the Students t and the Slash distributions, and other alternatives are discussed in Lange and Sinsheimer (1993)
and Rosa et al. (2003)
. The Students t residual distribution (T-HE and T-HO) was obtained by assuming that
for all n with wn > 0, where
e > 0 represents the degree of freedom or robustness parameter. The Slash residual distribution (S-HE and S-HO) was alternatively specified by wn ~ p(wn|
e) =
ewn
e1, for all n, with 0 < wn
1 and
e > 0. See Cardoso et al. (2005)
for additional details.
Bayesian inference was based on Markov chain Monte Carlo (MCMC) chains of 200,000 cycles after 15,000 cycles of burn-in for all 6 models; again, full conditional densities and implementation details are provided in Cardoso et al. (2005)
. Posterior means, modes, key percentiles, and SD of the parameters were obtained from MCMC samples. Kendall rank correlations between posterior means of the additive genetic effects obtained by the various models were used to compare the ordering of the genetic evaluations of the animals for PWG selection within genotype.
 |
RESULTS AND DISCUSSION
|
|---|
Detection and Muting of Outliers
The scatter plot of standardized residuals by CG, derived from the analysis of PWG using G-HO (Figure 1
), showed 0.2% of the records with residuals lying outside the range of ± 4.0 SD. This percentage is over 30 times that expected given the G-HO error assumption. Beyond attenuating the effect of extreme records on parameter estimates, heavy-tailed error models can be used to aid the detection of outliers (Rosa et al., 2003
). The posterior density of wn provides useful information to assess the corresponding records degree of outlyingness relative to a Gaussian error assumption. Low values of wn (i.e., wn closer to zero) suggest that the corresponding record yijklmn is deviant, whereas values of wn close to 1 indicate that yijklmn matches well with the model prediction. To illustrate this point in the context of heteroskedastic multibreed genetic evaluations, we deliberately chose for further study the records of 3 F1 males from the same CG containing 160 animals and within region 2. Based on the G-HO specification, the first record (y1) represents a mild outlier, being 3.08 SD above zero; the second record (y2) represents a near-zero residual or a perfect model fit (0.02 SD); and the third record (y3) consists of an extreme outlier, being 5.57 SD from zero (Figure 1
).

View larger version (18K):
[in this window]
[in a new window]
|
Figure 1. Scatter plot of standardized residuals of postweaning gain on contemporary group label using the Gaussian homoskedastic model. Three residuals from the same contemporary group are highlighted for further inference: 1 represents a mild outlier, being +3.08 SD from zero; 2 represents a near-zero residual (perfect fit); and 3 represents an extreme outlier, at 5.57 SD from zero.
|
|
Posterior densities of the weighting variables w1, w2, and w3, corresponding to the 3 records y1, y2, and y3, respectively, are presented for the T-HE and S-HE error models in Figure 2
. We emphasize that only the heteroskedastic versions of these 2 robust models are shown in Figure 2
, because they provided a much better fit to this PWG data set based on a Bayesian model choice criterion (Cardoso et al., 2005
). One essential difference between the posterior distributions of wn in the Students t and Slash error models is that wn is defined anywhere on the positive real line for the Students t, whereas wn is defined only between 0 and 1 in the Slash error model (Lange and Sinsheimer, 1993
). This distinction results in substantially different posterior densities for each of w1, w2, and w3 between the THE and S-HE error models. Nevertheless, both models successfully identify y3 as fairly deviant. For instance, the posterior mean and 95% posterior probability interval (PPI) for w3 were, respectively, 0.18 and [0.05, 0.42] for the T-HE model, where the 95% PPI is defined by the [2.5th, 97.5th] percentiles of the posterior density. Corresponding values for the S-HE model were 0.10 and [0.02, 0.28]. Both densities are then concentrated around low values for w3, thereby implicating y3 to be an outlier relative to a G-HE error model specification and such that its effect on fixed and random effects inferences are attenuated. Conversely, w2 has a somewhat flatter distribution relative to the corresponding parameter spaces as defined for the T-HE and S-HE error models. The posterior mean and 95% PPI for w2 corresponding to these 2 models were 1.11 and [0.31, 2.39] and 0.72 and [0.25, 0.99], respectively. That is, these results suggest that y2 is not outlying relative to a G-HE specification.

View larger version (11K):
[in this window]
[in a new window]
|
Figure 2. Posterior distribution of weighting variables corresponding to record 1 (w1, a mild outlier), record 2 (w2, a nearly perfect model fit), and record 3 (w3, an extreme outlier) under 2 robust models: a Students t heteroskedastic model (panel A) and a Slash heteroskedastic model (panel B).
|
|
The case of y1 illustrates the interdependence between robustness and heteroskedasticity specifications. The standardized residual of this record from G-HE error model was 2.53, which is less than the value, 3.08, estimated using G-HO error. However, its specific residual variance, as a function of CG, sex, breed, and heterozygosity scaling factors, was estimated to be 1.38 times greater than the overall residual variance under the G-HE error model. In other words, the estimated residual variance would be substantially underestimated for that particular record in the G-HO error model. Nevertheless, the posterior mean of 0.46 and and 95% PPI of [0.12, 1.07] for w1 under the T-HE error model suggested that the influence of y1 on statistical inference is still somewhat downweighted under this model, albeit to a lesser extent relative to y3. This down-weighting for y1 was also observed under the S-HE error model, because the posterior mean and 95% PPI for w1 were then, respectively, 0.36 and [0.06, 0.89].
For the Students t distribution, the prior expectation of wn is the neutral value of 1.00; therefore, the containment of 1.00 in the confidence set for wn could be used as an objective criterion to judge outlyingness with respect to a Gaussian error specification for a desired level of precision. For the example above, the 95% PPI under T-HE model contained 1.00 for w1 and w2 but not for w3, indicating that only y3 would be considered an outlier by this criterion.
Conceivably, posterior inference on wn could also be used for data quality monitoring if, for example, small posterior means (<1.00) appeared to be clustered within certain herds or there seemed to be substantial differences for the average posterior means of these weighting variables across herds. In such cases, herds having a high frequency of low posterior means for the weighting variables could be identified as having quality problems with their data collection process. Nevertheless, this issue deserves further investigation before its application can be advised for monitoring data quality in beef cattle performance programs.
The data edits used for determining which records are outliers in animal breeding are often ad hoc in nature and rely upon empirical justifications. For example, the ratio record-mean of its subclass approach advocated by Bertrand and Wiggans (1998)
, if applied to our PWG data, would have resulted in the deletion of 1,517 records (6.7% of the data), which fall outside the range of 60 to 140%. We believe that only a fraction of these observations likely correspond to recording errors such that the approach of Bertrand and Wiggans (1998)
might be too rigid for our particular data set. Evidently, record editing and deletion should be pursued for those cases when a recording error is obvious, and methods for detecting and correcting abnormal records, such as the one used for US dairy test-day yields (Wiggans et al., 2003
), provide useful safeguards for current genetic evaluation systems. Nevertheless, it is important to recognize that the proposed heavy-tailed error models differentially weight each record for inferences such that records deemed to be Gaussian-deviant (wn << 1) provide proportionately smaller contributions to parameter estimates compared with other records. Therefore, these models provide a much more elegant statistical treatment of deviant records than the rather abrupt alternative of outlier detection and deletion (or adjustment).
Genetic Evaluations
Inferences on random additive genetic effects are used to rank animals for selection of breedstock and to predict the expected progeny differences in purebred populations. However, for multibreed populations, fixed additive and nonadditive genetic effects as well as non-additive random effects are also of interest, because they determine multibreed expected progeny differences (MB-EPD; Elzo and Wakeman, 1998
; Pollak and Quaas, 1998
).
Comparisons Within Breed Compositions.
We did not consider random nonadditive genetic effects such that inference on random additive genetic effects would suffice for selection of breedstock within breed composition. The Kendall rank correlations between error models G-HO and T-HO, G-HO and G-HE, G-HE and THE, and between T-HO and T-HE for posterior means of these additive genetic effects within the most frequent genotypes are presented in Table 1
. The rank correlation between the Gaussian and Students t error models within either heteroskedasticity classification (G-HO vs. T-HO and G-HE vs. T-HE) were reasonably high within the most frequent genotypes, being always greater than 0.83 for the homoskedastic error models and greater than 0.89 for the heteroskedastic error models. However, when we considered only animals ranked in the top 10% using the G-HO error model, the rank correlation among the additive genetic values of these top animals by G-HO and T-HO error models decreased considerably (Table 1
). These latter results have greater implications for genetic evaluations and selection decisions than the correlation results involving all animals. The correlations for top-ranked animals might be expected to be lower, because Gaussian-deviant records should lead to more extreme genetic merit predictions on the contributing animal under the G-HO error model compared with the muting effect of the T-HO error model (Stranden and Gianola, 1998
), particularly when the outlying record is the primary phenotypic source of information for these animals. Therefore, as pointed out by an anonymous reviewer, the implications of using heavy-tailed error models would be expected to be more evident for replacement seed-stock and cows with few offspring than for greatly proven sires in extensively managed beef cattle populations. This is further evident from Figure 3
(top graph), in which posterior means of additive genetic effects for Nelores, Herefords, and F1s obtained by the G-HO model are plotted against the corresponding posterior means under the T-HO error model. In this plot, we observed that several animals with extreme posterior means for genetic effects under the G-HO model had these estimates shrunk closer to 0 under the T-HO error model. For example, the F1 animal associated with y3 (the extreme outlier described previously) had a posterior mean of 13 kg under the G-HO model but only 4 kg under the T-HO error model (Figure 3
, top graph). This result is in agreement with simulated results reported by Stranden and Gianola (1998)
for homoskedastic models, in which the Students t error model was shown to attenuate the effects of preferential treatment and led to more reliable inference on genetic effects of animals associated with outlying records. We also observed a lower rank correlation between the G-HE and T-HE error models when considering only the top 10% of animals than that for all animals (Table 1
); however, the magnitude of the decrease in correlation and the degree of change in genetic effects predictions (Figure 4
, top graph) were not as substantial, because they were between the G-HO and T-HO error models. Our results suggest then that modeling residual heteroskedasticity is potentially more important than modeling robustness for genetic evaluations. This was further seen by lower rank correlations between the G-HO and G-HE error models compared with between the G-HO and T-HO error models (Table 1
). One possible reason for the relatively low correlation between genetic predictions obtained by the G-HO and G-HE error models for the top 10% of animals is that the G-HE error model allows for more balanced selection of animals across environments in accounting for heteroskedasticity, whereas a larger proportion of animals from the most variable environments tend be ranked near the top when heterogeneous variances are ignored in statistical modeling (Hill, 1984
; Gianola, 1986
). Moreover, the large difference in genetic variance estimates between the G-HO and G-HE error models (i.e., Herefords had a larger genetic variance than the Nelores in the G-HO error model, whereas the opposite was true for the G-HE error model; see also Cardoso et al., 2005
) affects the dispersion of posterior mean additive genetic effects within different breed compositions (Figure 3
, bottom graph), and, consequently, the manner in which these predictions overlap within the whole population. Furthermore, several Hereford animals had their genetic effects predictions shrunk closer toward zero with the G-HE compared with the G-HO error models. However, this was not true for the posterior mean genetic effect of the animal with record y3, which was large under both models.
View this table:
[in this window]
[in a new window]
|
Table 1. Kendall rank correlations between the posterior means of random additive genetic effects of postweaning gain for combinations of the Gaussian homoskedastic (G-HO), Students t homoskedastic (T-HO), Gaussian heteroskedastic (G-HE), and Students t heteroskedastic (T-HE) models for all animals and for animals ranked in the top 10th percentile for G-HO within the most frequent genotypes
|
|

View larger version (36K):
[in this window]
[in a new window]
|
Figure 3. Scatter plot of posterior means of additive genetic effects for postweaning gain obtained by the Gaussian homoskedastic (G-HO) and Students t homoskedastic (T-HO) models (top) and by the G-HO and Gaussian heteroskedastic (G-HE) models (bottom) for the Nelore, Hereford, and F1 breed composition groups. The highlighted F1 animals are from the same contemporary group and are associated with records representing the G-HO model; 1 represents a mild outlier, being +3.08 SD from zero; 2 represents a near-zero residual (perfect fit); and 3 represents an extreme outlier, at 5.57 SD from zero.
|
|

View larger version (27K):
[in this window]
[in a new window]
|
Figure 4. Scatter plot of posterior means of the additive genetic effects for postweaning gain obtained by the Gaussian heteroskedastic (G-HE) and Students t heteroskedastic (T-HE) models (top) and by the Students t homoskedastic (T-HO) and T-HE models (bottom) for the Nelore, Hereford, and F1 breed composition groups. The highlighted F1 animals are from the same contemporary group and are associated with records representing the G-HO model; 1 represents a mild outlier, being +3.08 SD from zero; 2 represents a near-zero residual (perfect fit); and 3 represents an extreme outlier, at 5.57 SD from zero.
|
|
Table 1
and Figures 3
and 4
provide evidence that heteroskedasticity and robustness specifications interact, because rank correlations for posterior mean genetic effects were greater between the heteroskedastic Gaussian and Students t error models (G-HE and THE) compared with their homoskedastic counterparts (G-HO and T-HO). Similarly, these correlations were larger between homoskedastic and heteroskedastic t error models (T-HO and T-HE) than between homoskedastic and heteroskedastic Gaussian error models (G-HO and G-HE). These results indicate that the effect of adding either residual error specification factor (heteroskedasticity or robustness) to the model on stable genetic evaluations is diminished when the other factor has been modeled.
Comparisons Across Breed Compositions.
If selection decisions are to be made between breedstock across various breed compositions, fixed additive (A) and non-additive (D and AA) genetic effects and the interaction of A with other effects, as assumed in equation 1, must be considered in addition to random additive genetic effects to provide MB-EPD. In this case, expected progeny performance and their differences will be a function of genetic coefficients (breed proportion, heterozygosity, and recombination loss) that will vary according to any breedstock and the genotype of its potential mate. In other words, the selection of candidate parents across breed compositions and prediction of their MB-EPD must occur within a specified mating genotype.
To illustrate this point, we simulated matings of the 2 F1 animals (F1-1 with u = 7.8 kg and F1-2 with u = 0.4 kg) previously considered in this paper in relation to records y1 and y2, along with 2 Herefords and 2 Nelores of identical additive genetic effects, and compared their future progeny genotypes, genetic coefficients, genotypic means, MB-EPD, their MB-EPD percentiles across the whole population, and the relative rank among these 6 selection candidates, according to alternate dam genotypes (Table 2
). When animals Hereford-1, F1-1, and Nelore-1 are mated to Nelore dams, they present substantially different MB-EPD (Table 2
) rankings across the whole multibreed population, being within the first, 54th, and 76th percentiles, respectively, despite having the same random additive genetic effects. In this case, differences in MB-EPD are due to different progeny genotypic means (Table 2
). However, if these same 3 animals were to be mated to Hereford dams, their MB-EPD would be oppositely ranked, being within the 22nd, 14th, and first percentiles, respectively, across the multibreed population. Moreover, among the 6 animals considered in this example in Table 2
, Nelore-1 ranks first, third, or fifth depending on whether the genotype of its mate is Hereford, F1, or Nelore, respectively. These results further demonstrate the strong dependence between mating genotype and MB-EPD ranking across breed composition. Finally, it should be observed from Table 2
that the difference between the MB-EPD of any of the 2 selection candidates of the same breed composition does not change across the different mating genotypes, being always equal to the difference in their additive genetic effects; these results support the assertion that the latter information suffices for selection within breed composition, at least when nonadditive random genetic effects are not considered.
View this table:
[in this window]
[in a new window]
|
Table 2. Progeny genotypes and genetic coefficients for Nelore proportion (f), heterozygosity ( ), and recombination loss (r); estimated1 progeny genotypic means; random additive genetic effects (u), multibreed expected progeny differences (MB-EPD); MB-EPD percentiles across the whole population (%); and relative rank among 6 selection candidates of different breed compositions2 Hereford (H), 2 F1, and 2 Nelore (N) according to the mate genotype for postweaning gain obtained by the Students t heteroskedastic model
|
|
Practical Implementation and Computational Issues.
We have established the importance of jointly considering robustness and heteroskedascity in multibreed genetic evaluations using MCMC inference and a relatively small data set. For much larger data-sets involving several million records, such as, for example, that to be analyzed by the National Beef Cattle Consortium in the United States (Pollak, 2006
), computational issues will likely arise using MCMC (Jamrozik, 2004
). These issues, however, should not discourage the consideration of the proposed robust structural error variance models in beef cattle populations similar to the Nelore-Hereford cross studied in this paper. We have demonstrated these models to substantially outperform the conventional homoskedastic Gaussian error model for data fit. If the presence of outliers is due to the data collection process, the number of extreme records should be expected to be proportional to the data set size such that the importance of model robustness is not likely to diminish even for very large datasets. We further anticipate that computationally attractive analytical procedures will be developed; for example, parameters specifying the Students t error model could be conceptually estimated using maximum likelihood based on the expectation-maximization algorithm (Lange et al., 1989
). Genetic evaluation systems would be more computationally tractable if the df parameter was treated as known much like what is commonly assumed for heritabilities. Presumably, values for these parameters would be based on previous estimates or estimates based on a subset of the data; alternatively, plausible values could be chosen from a discrete set by using a model choice criterion (Stranden and Gianola, 1998
). Similarly, structural mixed models for modeling heterogeneous residual variances could also be implemented via the expectation-maximization algorithm (San Cristobal et al., 1993
; Weigel et al., 1993
) or by simply pursuing heterogeneous residual variance adjustments, similar to those that have been routinely used for dairy cattle genetic evaluations (Wiggans and VanRaden, 1991
).
In conclusion, the implementation of robust multibreed genetic evaluations that jointly model sources of heterogeneous genetic and residual variances is recommended to reliably infer upon genetic merit of crossbred animals. Robustness allows for differential weighting of records for inference, such that extremely deviant records provide smaller contributions to genetic evaluations compared with other records. The heterogeneous variances specification properly accounts for distinct genetic and environmental variability inherent to multibreed populations. In our analysis of a Nelore-Hereford cross, a robust Students t heterogeneous model was shown to better fit the data and to lead to substantially different ranking of top animals for selection as compared with the normal homogeneous error model. Consequently, the use of genetic evaluation models based on normally distributed residuals with homogeneous genetic and residual variances may hinder genetic progress in multibreed beef cattle populations.
 |
Footnotes
|
|---|
1 Funded by Coordenação de Aperfeiçoamento de Pessoal de Nivel Superior, Brasília, Brazil, and the Michigan Agricultural Experiment Station (project MICL 01822). We are grateful to I. Misztal for making available Sparsem90 and Fspak90 and to the Brazilian genetic evaluation alliance, Conexão Delta G, for providing the Nelore-Hereford data. 
2 Corresponding author: fcardoso{at}cppsul.embrapa.br
Received for publication October 4, 2006.
Accepted for publication December 13, 2006.
 |
LITERATURE CITED
|
|---|
Bertrand, J. K., and G. R. Wiggans. 1998. Validation of data and review of results from genetic evaluation systems for US beef and dairy cattle. Proc. 6th World Congr. Gen. Appl. Livest. Prod., Armidale, Australia. 27:425432.
Cardoso, F. F., G. J. Rosa, and R. J. Tempelman. 2005. Multiple breed genetic inference using heavy-tailed structural models for heterogeneous residual variances. J. Anim. Sci. 83:17661779.[Abstract/Free Full Text]
Cardoso, F. F., and R. J. Tempelman. 2004. Hierarchical Bayes multiple-breed inference with an application to genetic evaluation of a Nelore-Hereford population. J. Anim. Sci. 82:15891601.[Abstract/Free Full Text]
Elzo, M. A., and D. L. Wakeman. 1998. Covariance components and prediction for additive and nonadditive preweaning growth genetic effects in an Angus-Brahman multibreed herd. J. Anim. Sci. 76:12901302.[Abstract/Free Full Text]
Foulley, J. L., M. S. Cristobal, D. Gianola, and S. Im. 1992. Marginal likelihood and Bayesian approaches to the analysis of heterogeneous residual variances in mixed linear Gaussian models. Comput. Stat. Data Anal. 13:291305.
Gianola, D. 1986. On selection criteria and estimation of parameters when the variance is heterogeneous. Theor. Appl. Genet. 72:671677.
Hill, W. G. 1984. On selection among groups with heterogeneous variance. Anim. Prod. 39:473477.
Jamrozik, J. 2004. Implementation issues for Markov Chain Monte Carlo methods in random regression test-day models. J. Anim. Breed. Genet. 121:113.
Kinghorn, B. 1980. The expression of recombination loss in quantitative traits. J. Anim. Breed. Genet. 97:138143.
Kizilkaya, K., and R. J. Tempelman. 2005. A general approach to mixed effects modeling of residual variances in generalized linear mixed models. Genet. Sel. Evol. 37:3156.[Medline]
Lange, K. L., R. J. Little, and J. M. Taylor. 1989. Robust statistical modeling using the t-distribution. J. Am. Stat. Assoc. 84:881896.
Lange, K. L., and J. S. Sinsheimer. 1993. Normal/independent distributions and their applications in robust regression. J. Comput. Graph. Stat. 2:175198.
Lo, L. L., R. L. Fernando, and M. Grossman. 1993. Covariance between relatives in multibreed populations: Additive-model. Theor. Appl. Genet. 87:423430.
Pollak, E. J. 2006. Multibreed genetic evaluations of beef cattle in the United States. Proc. 8th World Congr. Genet. Appl. Livest. Prod., Belo Horizonte, Brazil. CD-ROM communication no. 0301.
Pollak, E. J., and R. L. Quaas. 1998. Multibreed genetic evaluations of beef cattle. Proc. 6th World Congr. Genet. Appl. Livest. Prod., Armidale, Australia. 23:8188.
Rosa, G. J., C. R. Padovani, and D. Gianola. 2003. Robust linear mixed models with normal/independent distributions and Bayesian MCMC implementation. Biometrical J. 45:573590.
San Cristobal, M., J. L. Foulley, and E. Manfredi. 1993. Inference about multiplicative heteroskedastic components of variance in a mixed linear Gaussian model with an application to beef cattle breeding. Genet. Sel. Evol. 25:330.
Stranden, I., and D. Gianola. 1998. Attenuating effects of preferential treatment with Students-t mixed linear models: A simulation study. Genet. Sel. Evol. 30:565583.
Weigel, K. A., D. Gianola, B. S. Yandell, and J. F. Keown. 1993. Identification of factors causing heterogeneous within-herd variance components using a structural model for variances. J. Dairy Sci. 76:14661478.[Abstract]
Wiggans, G. R., and P. M. VanRaden. 1991. Method and effect of adjustment for heterogeneous variance. J. Dairy Sci. 74:43504357.[Abstract]
Wiggans, G. R., P. M. VanRaden, and J. C. Philpot. 2003. Detection and adjustment of abnormal test-day yields. J. Dairy Sci. 86:27212724.[Abstract/Free Full Text]