J. Anim Sci.
J. Anim. Sci. 2005. 83:1788-1800
© 2005 American Society of Animal Science


ANIMAL GENETICS

Estimation of genetic effects in the presence of multicollinearity in multibreed beef cattle evaluation1

V. M. Roso*,†, F. S. Schenkel*,2, S. P. Miller* and L. R. Schaeffer*

* University of Guelph, Guelph, Ontario, N1G 2W1 Canada; and † GenSys Consultores Associados S/S Ltda, Porto Alegre, RS, Brazil


    Abstract
 
Breed additive, dominance, and epistatic loss effects are of concern in the genetic evaluation of a multibreed population. Multiple regression equations used for fitting these effects may show a high degree of multicollinearity among predictor variables. Typically, when strong linear relationships exist, the regression coefficients have large SE and are sensitive to changes in the data file and to the addition or deletion of variables in the model. Generalized ridge regression methods were applied to obtain stable estimates of direct and maternal breed additive, dominance, and epistatic loss effects in the presence of multicollinearity among predictor variables. Preweaning weight gains of beef calves in Ontario, Canada, from 1986 to 1999 were analyzed. The genetic model included fixed direct and maternal breed additive, dominance, and epistatic loss effects, fixed environmental effects of age of the calf, contemporary group, and age of the dam x sex of the calf, random additive direct and maternal genetic effects, and random maternal permanent environment effect. The degree and the nature of the multicollinearity were identified and ridge regression methods were used as an alternative to ordinary least squares (LS). Ridge parameters were obtained using two different objective methods: 1) generalized ridge estimator of Hoerl and Kennard (R1); and 2) bootstrap in combination with cross-validation (R2). Both ridge regression methods outperformed the LS estimator with respect to mean squared error of predictions (MSEP) and variance inflation factors (VIF) computed over 100 bootstrap samples. The MSEP of R1 and R2 were similar, and they were 3% less than the MSEP of LS. The average VIF of LS, R1, and R2 were equal to 26.81, 6.10, and 4.18, respectively. Ridge regression methods were particularly effective in decreasing the multicollinearity involving predictor variables of breed additive effects. 
Because of a high degree of confounding between estimates of maternal dominance and direct epistatic loss effects, it was not possible to compare the relative importance of these effects with a high level of confidence. The inclusion of epistatic loss effects in the additive-dominance model did not cause noticeable reranking of sires, dams, and calves based on across-breed EBV. More precise estimates of breed effects as a result of this study may result in more stable across-breed estimated breeding values over the years.

Key Words: Crossbreeding • Dominance • Epistatic Loss • Genetic Evaluation • Ridge Regression


    Introduction
 
Breed additive, dominance, and epistatic loss effects are of concern in the genetic evaluation of a multibreed population. For fitting these effects, a multiple regression equation can be used (Koch et al., 1985; Arthur et al., 1999). Adequate interpretation of the least squares (LS) estimates depends on the assumption that the predictor variables are not strongly interrelated. Typically, when strong linear relationships exist, the regression coefficients have large SE, may have signs that are the opposite of what would be expected, and are sensitive to changes in the data file and to addition or deletion of variables in the model (Belsley, 1991). In addition, it is difficult to estimate the unique effect of an individual variable and the regression coefficients often cancel each other. This problem is known as collinearity or multicollinearity (Weisberg, 1985).

One alternative way of dealing with multicollinearity is ridge regression (Hoerl and Kennard, 1970). The ridge estimator is obtained by solving the system of equations $(X'X + kI)\hat{b}_k = X'y$, to give $\hat{b}_k = (X'X + kI)^{-1}X'y$, where k is the ridge parameter, with k > 0, and I is an identity matrix. In a generalized form, kI is replaced by a matrix K, where $K = \mathrm{diag}(k_1, k_2, \ldots, k_p)$, with $k_i \ge 0$. The ridge estimators are biased, but might be useful in providing estimates that are more precise and therefore more stable than least squares estimates when multicollinearity is of concern. With an optimal choice of the ridge parameter matrix K, the ridge estimators have smaller mean squared error than the LS estimators (Lowerre, 1974; Gruber, 1998).

The objectives of this study were to identify sources and degree of multicollinearity in the genetic evaluation of a multibreed beef cattle population and to apply ridge regression to obtain estimates of direct and maternal breed additive, dominance, and epistatic loss genetic effects, comparing them with estimates from the ordinary least squares method.


    Materials and Methods
 
Data
The data were preweaning weight gains of animals from beef herds enrolled with Beef Improvement Ontario from 1986 to 1999. The data after preliminary edits consisted of 869,050 records, including records of both purebred and crossbred animals. A subset including purebred and crossbred animals from the 10 breeds with the largest number of records (Angus [AN], Blonde d’Aquitaine [BD], Charolais [CH], Gelbvieh [GV], Hereford [HE], Limousin [LM], Maine-Anjou [MA], Salers [SA], Shorthorn [SH], and Simmental [SM]) was chosen for the analysis. In addition, only records of animals with complete information for calculating direct and maternal dominance and epistatic loss coefficients were kept, and an analysis to check for connectedness among contemporary groups (herd-year-season-management group) across breeds was performed.

The method used to check for connectedness was the total number of direct genetic links between contemporary groups due to common sires and dams (Fries, 1998; Roso et al., 2004). Contemporary groups with more than 10 calves and with at least 10 direct genetic links and two classes of direct or maternal heterozygosities were considered connected and were retained for the analysis. There were nine classes of direct and maternal heterozygosities with an interval of 0.125, ranging from 0 to 1. The resulting dataset used in the analysis included 478,466 calves with a pedigree file of 714,220 animals.

Predictor Variables of Fixed Genetic Effects
Breed Additive Effects.
Coefficients for direct and maternal breed additive effects were equal to the proportion of each breed in the breed composition of the calf and in the breed composition of the dam (Rodríguez-Almeida et al., 1997), respectively.

Dominance Effects.
Coefficients of direct (HD) and maternal (HM) dominance effects were equal to expected direct and maternal breed heterozygosities (Rodríguez-Almeida et al., 1997), respectively. The HD and HM were calculated using the following equations:

$$\mathrm{HD} = 1 - \sum_{i=1}^{nb} S_i D_i$$

$$\mathrm{HM} = 1 - \sum_{i=1}^{nb} MGS_i \, MGD_i$$

where nb is the number of breeds, and $S_i$, $D_i$, $MGS_i$, and $MGD_i$ are the fractions of the ith breed for the sire, dam, maternal grandsire, and maternal granddam breed composition, respectively.
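To illustrate the coefficients, the following minimal sketch assumes the usual definition of expected breed heterozygosity, one minus the sum over breeds of the products of the parental breed fractions. The function name and the two-breed example are hypothetical and are not taken from the study.

```python
import numpy as np

def breed_heterozygosity(parent_1, parent_2):
    """Expected breed heterozygosity: 1 - sum_i p1_i * p2_i (assumed definition)."""
    p1 = np.asarray(parent_1, dtype=float)
    p2 = np.asarray(parent_2, dtype=float)
    return 1.0 - float(np.dot(p1, p2))

# Hypothetical two-breed example; breed fractions ordered as (AN, CH).
angus = [1.0, 0.0]
charolais = [0.0, 1.0]
f1 = [0.5, 0.5]

h_f1 = breed_heterozygosity(angus, charolais)   # purebred sire x purebred dam of another breed
h_backcross = breed_heterozygosity(angus, f1)   # purebred sire x F1 dam
h_f2 = breed_heterozygosity(f1, f1)             # F1 sire x F1 dam
```

For the direct coefficient the two arguments would be the breed compositions of the sire and the dam; for the maternal coefficient, those of the maternal grandsire and the maternal granddam.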

Epistatic Loss Effects.
The assumption underlying the estimation of epistatic loss effects was that parents with larger heterozygosities produced more recombinant gametes than parents with smaller heterozygosities. Thus, the coefficients for direct (ED) and maternal (EM) epistatic loss effects were calculated as the average breed heterozygosities in uniting gametes that generated the individual (Fries et al., 2000). Epistatic loss was assumed to be proportional to the average heterozygosity observed in the parents of an individual, and it will achieve its largest value when both parents are F1. The ED and EM were calculated as follows:

$$\mathrm{ED} = \frac{H_{Sire} + H_{Dam}}{2}$$

$$\mathrm{EM} = \frac{H_{MGS} + H_{MGD}}{2}$$

where $H_{Sire}$, $H_{Dam}$, $H_{MGS}$, and $H_{MGD}$ are the expected breed heterozygosities of the sire, dam, maternal grandsire, and maternal granddam, respectively. The average epistatic loss due to the breakdown of all kinds of gene interactions involving two or more loci, as deviation from the average additive and dominance effects, will be estimated by ED and EM (Fries et al., 2002).
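A small sketch of the epistatic loss coefficient follows, assuming it is the simple average of the two parental breed heterozygosities, as the description above implies. The helper names and the two-breed crosses are illustrative only.

```python
def expected_heterozygosity(p1, p2):
    """Breed heterozygosity of an individual from its parents' breed fractions."""
    return 1.0 - sum(a * b for a, b in zip(p1, p2))

def epistatic_loss(h_parent_1, h_parent_2):
    """Average heterozygosity of the two parents (assumed definition)."""
    return 0.5 * (h_parent_1 + h_parent_2)

breed_a, breed_b = (1.0, 0.0), (0.0, 1.0)

# Heterozygosity of the parents themselves.
h_purebred_parent = expected_heterozygosity(breed_a, breed_a)  # both of its parents are breed A
h_f1_parent = expected_heterozygosity(breed_a, breed_b)        # an F1 animal used as a parent

ed_f1_calf = epistatic_loss(h_purebred_parent, h_purebred_parent)  # calf of two purebred parents
ed_backcross = epistatic_loss(h_purebred_parent, h_f1_parent)      # purebred x F1
ed_f2 = epistatic_loss(h_f1_parent, h_f1_parent)                   # F1 x F1, the largest value
```

The same function gives the maternal coefficient when it is applied to the heterozygosities of the maternal grandsire and maternal granddam.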

Multicollinearity Diagnostics
To identify possible linear dependencies among covariates included in the model, various measures of the degree of multicollinearity were obtained.

Variance Inflation Factor.
The variance inflation factor is the most common measure of multicollinearity. If $R_i^2$ is the coefficient of determination resulting when the predictor variable $X_i$ is regressed on all the remaining predictor variables, the variance inflation factor for $X_i$ ($\mathrm{VIF}_i$) is given by:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

The VIF for ordinary LS are the diagonal elements of the inverse of the simple correlation matrix. The VIF indicate the inflation in the variance of each regression coefficient compared with a situation of orthogonality. The decision to consider a VIF to be large was essentially arbitrary. Usually, values larger than 10 suggest that multicollinearity may be causing estimation problems (Chatterjee et al., 2000).
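As an illustration, the VIF can be read off the inverse of the correlation matrix among predictors. The sketch below uses NumPy and a small simulated design; the data and the function name are hypothetical and stand in for the breed covariates.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each column of X: diagonal of the inverse correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

# Simulated predictors: the third column is nearly the sum of the first two.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
x3 = z[:, 0] + z[:, 1] + 0.1 * rng.normal(size=500)
X = np.column_stack([z, x3])

vif = variance_inflation_factors(X)
print(vif)  # all three values exceed the usual threshold of 10
```

Each diagonal element equals 1/(1 - R²) for the regression of that predictor on the remaining ones, so no separate auxiliary regressions are needed.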

Condition Index.
In the presence of multicollinearity, the determinant of the correlation matrix among predictor variables is very small. Because the determinant also is equal to the product of eigenvalues $\lambda_i$, the presence of one or more small eigenvalues results in a small determinant, thereby indicating multicollinearity. A measure of multicollinearity called condition index (CI) is obtained for each eigenvalue by computing:

$$\mathrm{CI}_i = \sqrt{\frac{\lambda_{max}}{\lambda_i}}$$

where $\lambda_{max}$ is the largest eigenvalue, and $\lambda_i$ is the ith eigenvalue of the correlation matrix. Large $\mathrm{CI}_i$ indicates dependencies among covariates because $\lambda_i$ will be close to zero. Belsley (1991) suggested that a CI between 10 and 30 would indicate possible problems of multicollinearity, and CI larger than 30 suggest the presence of multicollinearity.
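A minimal sketch of the condition index follows, assuming it is the square root of the ratio of the largest eigenvalue to each eigenvalue of the correlation matrix. The 2 x 2 correlation matrix is a made-up example.

```python
import numpy as np

def condition_indices(R):
    """Condition index for each eigenvalue of a correlation matrix R."""
    eigenvalues = np.linalg.eigvalsh(R)   # returned in ascending order
    return np.sqrt(eigenvalues.max() / eigenvalues)

# Two predictors with a pairwise correlation of 0.98 (illustrative values).
R = np.array([[1.00, 0.98],
              [0.98, 1.00]])
ci = condition_indices(R)
print(ci)  # eigenvalues are 0.02 and 1.98, so the indices are sqrt(99) and 1
```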

Variance-Decomposition Proportions Associated with the Eigenvalues.
This statistic indicates variables that are involved in linear dependencies and how much of the variance of the parameter estimate is associated with each eigenvalue. Following Belsley (1991),

$$\mathrm{Var}(\hat{b}) = \hat{\sigma}^2 (X'X)^{-1} = \hat{\sigma}^2 V \Lambda^{-1} V'$$

where $\hat{\sigma}^2$ is the residual variance estimate, V is a matrix containing the eigenvectors, and $\Lambda$ is a diagonal matrix of eigenvalues (i.e., $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_p)$). Writing $V = \{v_{ij}\}$, the variance of the ith element of $\hat{b}$, the vector of regression coefficients, can be decomposed into a sum of p components, each associated with one eigenvalue, as follows:

$$\mathrm{Var}(\hat{b}_i) = \hat{\sigma}^2 \sum_{j=1}^{p} \frac{v_{ij}^2}{\lambda_j}$$

where p is the number of predictor variables.

Because eigenvalues appear in the denominator, variance components associated with dependencies (small $\lambda_j$) will be relatively large compared to the other components. Thus, a high proportion of two or more coefficients associated with the same small eigenvalue provides evidence that the corresponding dependencies are causing problems.

Let $\phi_{ij} = v_{ij}^2 / \lambda_j$ and $\phi_i = \sum_{j=1}^{p} \phi_{ij}$, with i = 1, . . ., p. The proportion of the variance of the ith regression coefficient associated with the jth component of its decomposition is obtained as follows:

$$\pi_{ji} = \frac{\phi_{ij}}{\phi_i}$$

with i, j = 1, . . ., p.

An approach recommended by Belsley et al. (1980) is to identify eigenvalues $\lambda_j$ that have a CI greater than 30. Variables with variance-decomposition proportions $\pi_{ji}$ larger than 0.5 for each of these eigenvalues are candidates for linear dependencies. Measures of multicollinearity were obtained after standardization (centering and scaling) of predictor variables, as recommended by Freund and Littell (2000). The regression procedure, option COLLINOINT, of the SAS statistical software (SAS Inst., Inc., Cary, NC) was used to perform computations.
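The variance-decomposition proportions can be sketched directly from the eigen-decomposition of the correlation matrix. The NumPy code below is only an illustration and is not the SAS procedure used in the study; the 3 x 3 correlation matrix is hypothetical, and the array `pi` is indexed as [coefficient, eigenvalue].

```python
import numpy as np

def variance_decomposition(R):
    """Condition indices and variance-decomposition proportions for R."""
    lam, V = np.linalg.eigh(R)                  # ascending eigenvalues, eigenvectors in columns
    phi = V**2 / lam                            # phi[i, j] = v_ij^2 / lambda_j
    pi = phi / phi.sum(axis=1, keepdims=True)   # share of Var(b_i) tied to eigenvalue j
    ci = np.sqrt(lam.max() / lam)
    return ci, pi

# Hypothetical correlation matrix: predictors 0 and 1 are nearly collinear.
R = np.array([[1.00, 0.95, 0.10],
              [0.95, 1.00, 0.10],
              [0.10, 0.10, 1.00]])
ci, pi = variance_decomposition(R)

# Column 0 belongs to the smallest eigenvalue (largest condition index).
print(ci[0], pi[:, 0])
```

In this toy case the first two predictors carry most of their variance on the smallest eigenvalue, whereas the third does not, which is the pattern used above to flag candidates for a linear dependency. The row sums of `phi` also reproduce the VIF.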

Genetic Analysis
The genetic model for preweaning gain, in matrix notation, was:

$$y = Xb + Fv + Za + Wm + Sp + e \qquad [1]$$

where y = vector of observations and b = vector of fixed genetic effects. This vector included direct and maternal breed additive, dominance, and epistatic loss effects; v = vector of fixed environmental effects. This vector included age of the calf as a covariate (linear and quadratic effects), and age of the dam by sex of the calf and contemporary group (herd-year-season-management group) as classification variables; a = vector of random direct additive genetic effects; m = vector of random maternal additive genetic effects; p = vector of random maternal permanent environment effects; e = vector of random residual effects; and X, F, Z, W, and S are the appropriate incidence matrices relating records to fixed genetic, fixed environmental, direct genetic, maternal genetic, and permanent environment effects, respectively. Random direct additive genetic effects, random maternal additive genetic effects, random maternal permanent environment effects, and random residual effects were assumed to have variance matrices equal to $A\sigma_a^2$, $A\sigma_m^2$, $I\sigma_p^2$, and $I\sigma_e^2$, respectively, where A is the additive numerator relationship matrix among animals and I is an identity matrix. Covariance between a and m was assumed to be equal to $A\sigma_{am}$. The estimates of $\sigma_a^2$, $\sigma_m^2$, $\sigma_{am}$, $\sigma_p^2$, and $\sigma_e^2$ used in the analyses were 254.5, 161.2, -128.6, 94.1, and 408.2 kg², respectively. These estimates were obtained by restricted maximum likelihood, using a subset containing 300,002 records from randomly sampled herds, to overcome computational limitations. The genetic model assumed homogeneity of variances, the same dominance and epistatic loss effects for crosses of different pairs of breeds, and no interactions between genetic and environmental effects.

Solutions for Model [1] were obtained using the procedure described below.

Step 1) Obtain solutions for v, a, m, and p, using the following model:

$$y^{*(t)} = Fv + Za + Wm + Sp + e$$

where t denotes the tth iteration and $y^{*(t)} = y - X\hat{b}^{(t-1)}$. In the first iteration, $\hat{b}^{(0)}$ was set to the values obtained by LS. The DMU program (Madsen and Jensen, 2000) was used to perform computations.

Step 2) Using LS or ridge regression, obtain solutions for b, using the following model:

$$y^{**(t)} = Xb + \varepsilon$$

where $y^{**(t)} = y - F\hat{v}^{(t)} - Z\hat{a}^{(t)} - W\hat{m}^{(t)} - S\hat{p}^{(t)}$, and $\hat{v}$, $\hat{a}$, $\hat{m}$, and $\hat{p}$ are solutions obtained in the first step of the tth iteration. Programs used in Step 2 were developed using the Fortran language and the IML procedure of SAS statistical software (SAS Inst., Inc., Cary, NC).

Steps 1 and 2 were repeated until convergence. Convergence was attained when the largest absolute difference between the solutions for $\hat{b}$ in the current and in the previous iteration was smaller than $10^{-4}$.
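The alternating scheme can be illustrated with a deliberately simplified sketch. It keeps only two blocks of fixed effects (X for the genetic covariates and F for the environmental effects) and omits the random effects, the relationship matrix, and the DMU and SAS programs used in the study, so it shows the block-wise iteration only, not the actual mixed-model solver. All names and the simulated data are hypothetical.

```python
import numpy as np

def alternate_solve(X, F, y, K=None, tol=1e-4, max_iter=10_000):
    """Alternate between solving for v given b (Step 1) and b given v (Step 2)."""
    p = X.shape[1]
    K = np.zeros((p, p)) if K is None else K        # K = 0 gives LS in Step 2
    b = np.linalg.lstsq(X, y, rcond=None)[0]        # starting values for b from LS
    for _ in range(max_iter):
        # Step 1: adjust records for X b, then solve for the other effects.
        v = np.linalg.lstsq(F, y - X @ b, rcond=None)[0]
        # Step 2: adjust records for F v, then solve for b by LS or ridge.
        b_new = np.linalg.solve(X.T @ X + K, X.T @ (y - F @ v))
        converged = np.max(np.abs(b_new - b)) < tol
        b = b_new
        if converged:
            break
    return b, v

# Simulated data (illustrative only).
rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 3))
F = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0, 0.5]) + F @ np.array([3.0, 1.0]) + rng.normal(size=n)

b_hat, v_hat = alternate_solve(X, F, y)
joint = np.linalg.lstsq(np.column_stack([X, F]), y, rcond=None)[0]
print(np.round(b_hat, 3), np.round(joint[:3], 3))
```

With K set to zero the iteration settles on the joint least squares solution, which is a convenient check that the two-step scheme solves the full system rather than two separate ones.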

Ridge Regression
The usual model for a multiple linear regression is:

$$y = Xb + \varepsilon$$

where y is a (n × 1) vector of observations, X is a (n × p) design matrix of rank p, and $\varepsilon$ is a (n × 1) vector of random residuals with assumptions $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = I\sigma^2$. The unknown parameter vector, b, using the least squares criterion, is estimated as $\hat{b} = (X'X)^{-1}X'y$; however, estimates and their variances could be unreliable in the presence of multicollinearity. The ridge regression estimator consists of adding a small positive number on the diagonal of the X'X matrix, causing a decrease in the variance of the estimates at the expense of introducing some bias. Thus, the ridge regression estimator of b takes the general form:

$$\hat{b}_k = (X'X + K)^{-1}X'y$$

where $K = \mathrm{diag}(k_1, k_2, \ldots, k_p)$, $k_i \ge 0$. When all $k_i$ elements are equal to zero, $\hat{b}_k$ reduces to the LS estimator. From a Bayesian viewpoint (Goldstein and Smith, 1974; Sorensen and Gianola, 2002), the ridge regression can be considered as an estimate of b from the data subject to prior knowledge about the parameter, which is supplemented by the ridge parameter k. Given that $k = \sigma^2/\sigma_b^2$, where $\sigma^2$ is the residual variance and $\sigma_b^2$ is a measure of the spread of the elements of b, large values of k imply an a priori belief that more restricted values of b are more likely than larger values, whereas small values of k imply an a priori belief that quite a large range of values of b are not unreasonable. The ridge regression is consistent with b considered as a random effect, given that variances $\sigma^2$ and $\sigma_b^2$ are known.

The variance-covariance matrix of $\hat{b}_k$ estimates is as follows:

$$\mathrm{Var}(\hat{b}_k) = \hat{\sigma}^2 (X'X + K)^{-1} X'X (X'X + K)^{-1}$$

where $\hat{\sigma}^2$ is the LS estimator of $\sigma^2$.

The mean square error (MSE), a measure of the expected squared distance between $\hat{b}_k$ and b, is as follows:

$$\mathrm{MSE}(\hat{b}_k) = E[(\hat{b}_k - b)'(\hat{b}_k - b)] = \sigma^2 \, \mathrm{tr}[(X'X + K)^{-1} X'X (X'X + K)^{-1}] + b'(Z - I)'(Z - I)b$$

where $Z = (X'X + K)^{-1}X'X$. The ridge regression is advocated when the introduction of some bias in the estimates is compensated for by a substantial decrease in the estimation error variance, resulting in smaller MSE compared with LS (Hoerl and Kennard, 1970).

The variance inflation factors of the ridge regression coefficients are the diagonal elements of the matrix $(X'X + K)^{-1} X'X (X'X + K)^{-1}$.

Ridge regression analyses were carried out in the standardized form of the model, using the correlation matrix. After estimation, the estimates were transformed to and presented on the original scale.
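The ridge estimator and its variance inflation factors in correlation form can be sketched as below, assuming the estimator is (X'X + K)⁻¹X'y with a diagonal K and that the ridge VIF are the diagonal of (X'X + K)⁻¹X'X(X'X + K)⁻¹. The correlation values and ridge parameters are invented for the example.

```python
import numpy as np

def generalized_ridge(XtX, Xty, k):
    """Generalized ridge solution and VIF for a diagonal ridge matrix K = diag(k)."""
    A_inv = np.linalg.inv(XtX + np.diag(k))
    b_k = A_inv @ Xty
    vif_k = np.diag(A_inv @ XtX @ A_inv)
    return b_k, vif_k

# Hypothetical correlation-form system: R plays the role of X'X, r_xy of X'y.
R = np.array([[1.00, 0.95, 0.10],
              [0.95, 1.00, 0.10],
              [0.10, 0.10, 1.00]])
r_xy = np.array([0.60, 0.55, 0.30])

b_ls, vif_ls = generalized_ridge(R, r_xy, np.zeros(3))                 # K = 0 reproduces LS
b_r, vif_r = generalized_ridge(R, r_xy, np.array([0.05, 0.05, 0.0]))  # shrink the collinear pair only

print(np.round(vif_ls, 2), np.round(vif_r, 2))
```

Leaving the third ridge parameter at zero reflects the idea behind the generalized form: bias is introduced only for predictors that are involved in the dependency.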

Objective Methods for Selecting the Ridge Parameter K
The theoretical optimal value of the ridge parameter K, which results in smaller MSE than that obtained with LS, depends on the unknown parameter vector b and the unknown error variance $\sigma^2$ (Hoerl and Kennard, 1970). Consequently, K must be determined empirically or estimated from the data, and there is no way to know whether the theoretical optimal value of the ridge parameter K was attained in a specific problem. Many methods have been proposed in the literature for selecting appropriate $k_i$ values, but there is no consensus on which method is the most adequate (Gruber, 1998). In general, the best method to estimate an optimal K depends on the data and model used. Here, the ridge parameter K was estimated through two objective methods.

Generalized Ridge Estimator of Hoerl and Kennard (R1).
In the Generalized Ridge Regression Estimator of Hoerl and Kennard (Hoerl and Kennard, 1970), an orthogonal transformation V is applied to reduce X'X to a diagonal matrix. We have that

$$V'X'XV = \Lambda$$

where V is a (p × p) orthogonal matrix, whose columns $v_1, v_2, \ldots, v_p$ are the eigenvectors of X'X and $\Lambda$ is a diagonal matrix of eigenvalues of X'X. Writing:

$$X^* = XV \quad \text{and} \quad \alpha = V'b$$

the model $y = Xb + \varepsilon$ can then be written as

$$y = X^*\alpha + \varepsilon$$

The generalized ridge regression procedure is then defined as follows:

$$\hat{\alpha}_k = (X^{*\prime}X^* + K)^{-1}X^{*\prime}y = (\Lambda + K)^{-1}X^{*\prime}y$$

where K is a diagonal matrix with nonnegative diagonal elements $k_1, k_2, \ldots, k_p$.

Hoerl and Kennard (1970) showed that theoretically optimal values for $k_i$, that minimize the MSE of the generalized ridge estimator, are given by $k_i = \sigma^2/\alpha_i^2$. These authors suggested an iterative procedure to estimate $k_i$. This procedure can be summarized as follows: 1) Reduce the system to canonical form; 2) Take the LS solutions as the starting point to compute $\hat{k}_{i(1)} = \hat{\sigma}^2/\hat{\alpha}_i^2$, i = 1, 2, . . ., p, where $\hat{\alpha}_i$ are the solutions from the LS equations and $\hat{\sigma}^2$ is the LS estimate of the residual variance; 3) Use the $\hat{k}_{i(j)}$ values in the ridge regression equation to obtain $\hat{\alpha}_{i(j+1)}$, where j denotes the jth iteration; and 4) Compute a new estimate for $k_i$ using $\hat{k}_{i(j+1)} = \hat{\sigma}^2/\hat{\alpha}_{i(j+1)}^2$.

Go to Step 3 until convergence of $\hat{\alpha}_i$. Convergence was achieved when the maximum difference between $\hat{\alpha}_i$ from two consecutive iterations was smaller than $10^{-7}$. After convergence the estimates $\hat{\alpha}_k$ were converted back to $\hat{b}_k$ through the equation $\hat{b}_k = V\hat{\alpha}_k$.
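The iterative procedure can be sketched in a few lines of NumPy. The sketch assumes that the columns of X were centered and scaled and that y was centered, so one extra degree of freedom is removed for the mean when the residual variance is computed. The function name, the degrees-of-freedom choice, and the simulated data are assumptions for illustration and are not taken from the study.

```python
import numpy as np

def hoerl_kennard_ridge(X, y, tol=1e-7, max_iter=1000):
    """Iterative generalized ridge in canonical form; returns b_k and the k_i."""
    n, p = X.shape
    lam, V = np.linalg.eigh(X.T @ X)          # V'X'XV = diag(lam)
    X_star = X @ V                            # canonical predictors
    c = X_star.T @ y
    alpha = c / lam                           # LS solutions in canonical form
    resid = y - X_star @ alpha
    sigma2 = resid @ resid / (n - p - 1)      # LS residual variance (assumed df)
    k = np.zeros(p)
    for _ in range(max_iter):
        with np.errstate(divide="ignore"):    # alpha_i -> 0 gives k_i -> infinity
            k = sigma2 / alpha**2
        alpha_new = c / (lam + k)             # ridge solutions for the current k
        converged = np.max(np.abs(alpha_new - alpha)) < tol
        alpha = alpha_new
        if converged:
            break
    return V @ alpha, k                       # back-transform: b_k = V alpha_k

# Simulated, standardized data with one near-dependency (illustrative only).
rng = np.random.default_rng(1)
n = 200
z = rng.normal(size=(n, 2))
X_raw = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + z[:, 1] + 0.05 * rng.normal(size=n)])
y_raw = X_raw @ np.array([1.0, 0.5, 0.2]) + rng.normal(size=n)
Xc = X_raw - X_raw.mean(axis=0)
X = Xc / np.sqrt((Xc**2).sum(axis=0))         # X'X is now the correlation matrix
y = y_raw - y_raw.mean()

b_ridge, k_hat = hoerl_kennard_ridge(X, y)
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(b_ls, 3), np.round(b_ridge, 3))
```

A canonical coefficient that is small relative to the residual variance can be driven to zero by the iteration, in which case its ridge parameter grows without bound; the sketch lets NumPy represent that case as infinity.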

Bootstrap in Combination with Cross Validation (R2).
The bootstrap and cross validation for estimating the ridge parameter, originally suggested by Delaney and Chatterjee (1986), is based on minimization of the mean squared error of prediction (MSEP). This method was extended to consider the instability of each predictor variable, avoiding the introduction of unnecessary bias to those predictor variables not seriously involved in multicollinearity. The diagonal elements ($k_i$) of the ridge parameter K were estimated as a function of a constant $\theta$ and of $\mathrm{VIF}_i$, the variance inflation factor of the ith predictor variable. A value of $\theta$ has to be chosen to generate a matrix $\hat{K}$ that minimizes the MSEP. The magnitude of the elements $\hat{k}_i$ will be proportional to the variance inflation of each predictor variable.

The MSEP was estimated by combining bootstrap with cross validation. The bootstrap is a powerful re-sampling procedure originally proposed by Efron (1979). In the bootstrap procedure, a random sample of n observations is taken with replacement from the data at hand. The sample obtained in this manner is known as a bootstrap sample. If a large number of bootstrap samples are drawn, the estimates of the parameters of interest will approach the true parameter.

A strategy using bootstrap in combination with cross-validation to estimate the ridge parameter matrix K can be summarized as follows: 1) Select a vector $\Phi$ containing values of $\theta$ between 0 and 1; 2) Choose a bootstrap sample of n observations with replacement; and 3) For each bootstrap sample and each value of $\theta$, obtain $\hat{K}$ and the ridge estimator vector $\hat{b}_k$, where $\hat{K} = \mathrm{diag}(\hat{k}_1, \hat{k}_2, \ldots, \hat{k}_p)$. Use the ridge estimator to predict observations that were not selected in the bootstrap sample. If the prediction vector for the unselected observations is $\hat{y}_k(\theta)$, the MSEP of the jth bootstrap sample and ridge parameter, given $\theta$, is

$$\mathrm{MSEP}_j(\theta) = \frac{1}{N_j}\,[y - \hat{y}_k(\theta)]'[y - \hat{y}_k(\theta)]$$

where $N_j$ is the number of unselected observations (randomly determined) in the jth bootstrap sample, over which the sum of squares is taken. 4) Repeat Steps 2 and 3 for B bootstrap samples and obtain a final average of MSEP for each $\theta$ value as:

$$\mathrm{MSEP}(\theta) = \frac{1}{B}\sum_{j=1}^{B}\mathrm{MSEP}_j(\theta)$$

A value of $\theta$ that generates a matrix of ridge parameters that minimizes the MSEP is then chosen. The MSEP were obtained for values of $\theta$ ranging from 0 to 1, with increments of 0.001, on the basis of 100 bootstrap samples.
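The search over θ can be sketched as follows. The exact function linking θ and the VIF to each ridge parameter is not reproduced here, so the sketch assumes, purely for illustration, that each ridge parameter equals θ times the VIF of its predictor divided by the largest VIF. That form, the function name, and the simulated data are hypothetical.

```python
import numpy as np

def select_theta(X, y, thetas, n_boot=100, seed=0):
    """Average out-of-bootstrap MSEP for each theta; returns the minimizer."""
    thetas = np.asarray(thetas, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    msep = np.zeros(len(thetas))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # bootstrap sample, with replacement
        out = np.setdiff1d(np.arange(n), idx)         # unselected observations
        mu, ybar = X[idx].mean(axis=0), y[idx].mean()
        Xc = X[idx] - mu
        scale = np.sqrt((Xc**2).sum(axis=0))
        Z = Xc / scale                                # Z'Z is the correlation matrix
        R = Z.T @ Z
        vif = np.diag(np.linalg.inv(R))
        Zty = Z.T @ (y[idx] - ybar)
        Z_out = (X[out] - mu) / scale                 # same standardization for held-out rows
        for t, theta in enumerate(thetas):
            k = theta * vif / vif.max()               # ASSUMED form of k_i(theta)
            b = np.linalg.solve(R + np.diag(k), Zty)
            err = y[out] - (ybar + Z_out @ b)
            msep[t] += err @ err / len(out)           # MSEP of this sample for this theta
    msep /= n_boot                                    # average over bootstrap samples
    return thetas[np.argmin(msep)], msep

# Simulated data with one near-dependency (illustrative only).
rng = np.random.default_rng(2)
n = 300
z = rng.normal(size=(n, 2))
X = np.column_stack([z[:, 0], z[:, 1], z[:, 0] + z[:, 1] + 0.05 * rng.normal(size=n)])
y = X @ np.array([1.0, 0.5, 0.2]) + rng.normal(size=n)

theta_hat, msep = select_theta(X, y, np.linspace(0.0, 1.0, 21), n_boot=20)
print(theta_hat, msep.min())
```

Because θ = 0 is part of the grid, least squares is always one of the candidates, so the selected θ can never give a larger average MSEP than LS on the same bootstrap samples.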

Alternative Analyses
Three alternative analyses were performed using Model [1]: 1) ADE-LS, an additive-dominance-epistatic (ADE) model with breed additive, dominance, and epistatic loss effects estimated by LS; 2) ADE-R1, the ADE model with breed additive, dominance, and epistatic loss effect estimated by ridge regression method R1; and 3) ADE-R2, the ADE model with breed additive, dominance, and epistatic loss effect estimated by ridge regression method R2.

Mean Squared Error of Prediction and Variance Inflation Factor
The performance of ridge regression methods has been generally evaluated in terms of the decrease in MSE compared with LS using computer simulation (Gruber, 1998). A given simulation cannot hope to cover a large range of practical situations, particularly when a large number of factors are involved. In this study, the performance of ridge regression methods was evaluated in terms of MSEP, as in Delaney and Chatterjee (1986) and Hébel et al. (1993), under the assumption that smaller MSE will result in smaller MSEP. A procedure combining bootstrap resampling and cross-validation was used to obtain the average MSEP over 100 bootstrap samples. This approach was deemed to be appropriate because sample statistics based on a large number of bootstrap samples tend to approach true parameter values (Delaney and Chatterjee, 1986). Average VIF of estimates computed by ridge regression methods and LS also were obtained and used to evaluate the performance of ridge regression methods. A model that results in lower VIF and smaller MSEP is desirable because these statistics indicate stability of estimates and ability of the model to predict future observations, respectively.

Bias Measurement
A known relationship between the ridge parameter and the variance and bias of ridge regression estimates is that, as the ridge parameter increases, the variance decreases and the bias increases. Given that $E(\hat{b}) = b$ and $E(\hat{b}_k) = (X'X + K)^{-1}X'Xb = Hb$, a measurement of the bias of the ridge regression vector $\hat{b}_k$ was computed on the basis of the Euclidean norm, || ||. Thus, a bias measurement closer to zero for a particular ridge regression method indicates smaller bias in the estimates.

Comparison of Across-Breed Estimated Breeding Values
Across-breed estimated breeding values (AB-EBV) from models that used LS and ridge regression methods for the estimation of fixed genetic effects were compared through correlations (Pearson and Spearman), and percentages of coincidence for different proportions of selected (top 1, 10, 20, and 40%) sires, dams, and calves. Across-breed estimated breeding values were calculated by adding EBV and estimates of direct breed additive effects, weighted by the breed composition of the animal. In addition to the analyses using ADE model, 3 alternative additive-dominance (AD) models (AD-AH, AD-LS, and AD-R2) were considered in the AB-EBV comparisons. For AD-AH, the preweaning gain was pre-adjusted for expected heterosis based on averages from literature. A heterosis (direct and maternal) of 5% for an animal with heterozygosity of 100% was assumed. Breed additive effects were estimated by LS. For AD-LS, breed additive and dominance effects were estimated by LS. For AD-R2, this model differed from Model AD-LS by the fact that breed additive and dominance (heterosis) effects were estimated by R2 instead of LS. The AB-EBV from Model ADE-R2 were assumed as the reference estimates for calculating Pearson and Spearman correlations, and percentages of coincidence with all other models.
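The percentage of coincidence can be read as the share of animals selected under the reference model that are also selected under the alternative model. The sketch below follows that reading, which is an assumption, and uses invented breeding values.

```python
import numpy as np

def percent_coincidence(ebv_ref, ebv_alt, top_fraction):
    """Percentage of the top animals under the reference model also in the top under the alternative."""
    n_top = max(1, int(round(top_fraction * len(ebv_ref))))
    top_ref = set(np.argsort(ebv_ref)[::-1][:n_top])
    top_alt = set(np.argsort(ebv_alt)[::-1][:n_top])
    return 100.0 * len(top_ref & top_alt) / n_top

# Ten hypothetical animals; the alternative model swaps the values of animals 1 and 5.
ebv_ref = np.array([9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0])
ebv_alt = ebv_ref.copy()
ebv_alt[[1, 5]] = ebv_alt[[5, 1]]

top20 = percent_coincidence(ebv_ref, ebv_alt, 0.20)   # 1 of 2 selected animals in common
top40 = percent_coincidence(ebv_ref, ebv_alt, 0.40)   # 3 of 4 selected animals in common
```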


    Results and Discussion
 
Multicollinearity Diagnostics
The matrices X'X and X'y in the correlation form are presented in Table 1. Coefficients of maternal dominance and direct epistatic loss effects were strongly correlated (r = 0.95). Within-breed coefficients for direct and maternal breed additive effects were always equal to or higher than 0.80. The severity of multicollinearity, however, should not be quantified solely by the magnitude of these pairwise correlations because the interrelation among three or more variables might result in a high degree of multicollinearity, even when pairwise correlations are low. Better measures of the degree of multicollinearity are given by the eigenvalues of the correlation matrix and corresponding condition indices (Table 2), variance inflation factors (Figure 1), and variance-decomposition proportions associated with the eigenvalues (Table 3).


Table 1. Correlation coefficients among predictor variables of direct (D) and maternal (M) fixed genetic effects (n = 478,466)
 

Table 2. Eigenvalues of the correlation matrix among predictor variables of fixed genetic effects and corresponding condition indices
 


Figure 1. Variance inflation factor (VIF) associated with predictor variables of direct and maternal dominance (H), epistatic loss (E), and breed additive effects. The breeds are Angus (AN), Blonde d’Aquitaine (BD), Charolais (CH), Gelbvieh (GV), Hereford (HE), Limousin (LM), Maine-Anjou (MA), Salers (SA), Shorthorn (SH), and Simmental (SM).

 

Table 3. Decomposition of the variance structure of the parameter estimates associated with the two largest condition indices (38.85 and 7.50)a
 
Eigenvalues and corresponding condition indices are presented in Table 2. The last eigenvalue was very small ($\lambda$ = 0.00189). This eigenvalue was associated with condition index 38.85, reflecting dependencies between predictor variables. The second smallest eigenvalue was equal to 0.05078, with corresponding condition index equal to 7.50. Variance inflation factors shown in Figure 1 indicate that the variance of the LS estimates of 16 out of 24 regression coefficients would be inflated by more than 10-fold (VIF larger than 10) compared with what would be expected in an orthogonal system.
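As a quick consistency check on these values, and assuming the condition index is the square root of the ratio of the largest eigenvalue to the ith eigenvalue, both reported pairs imply nearly the same largest eigenvalue (a derived figure, not one stated in the text):

$$\lambda_{max} \approx \mathrm{CI}_i^2 \, \lambda_i: \qquad 38.85^2 \times 0.00189 \approx 2.85, \qquad 7.50^2 \times 0.05078 \approx 2.86$$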

Variance-decomposition proportions associated with the largest condition index (CI = 38.85) suggest that breed composition was the main candidate for the dependencies (Table 3). For nine direct and five maternal breed additive effects, a fraction of the variance of the estimated regression coefficients larger than 50% was associated with dependencies indicated by the largest condition index. Multicollinearity involving breed composition can be partially explained by the mathematical constraint among breeds because breed portions of the breed composition of an animal add to one, and the breed composition of a calf is equal to the average breed composition of the sire and of the dam. In practice, after fitting breeds that are more representative in the data, less new information is added by fitting the remaining breeds. Similarly, after fitting the breed of the dam, less new information is added fitting the breed of the calf, and vice versa.

Combining information from Figure 1 with information from Table 3, breeds with smaller numbers of records (BD, GV, MA, SA, and SH) had lower VIF and a lower proportion of the variance of the estimates associated with linear dependences among predictor variables. In contrast, breeds with larger number of records and lower SE for the estimated regression coefficients (AN, CH, HE, LM, and SM) had higher VIF and proportion of the variance of the estimates associated with linear dependences among predictor variables.

The second largest condition index (CI = 7.50) indicates possible dependencies involving maternal dominance and direct epistatic loss effects (Table 3). Proportions equal to 85 and 83% of the variances of estimated regression coefficients of maternal dominance and direct epistatic loss effects, respectively, were associated with linear dependences between the corresponding predictor variables. This multicollinearity problem can be a consequence of the small proportion (10.7%) of crossbred sires in the data.

Ridge Parameter K and Bias Measurement
Ridge regression models that add the same amount to the diagonal of the matrix X'X are known in the literature as ordinary ridge regression (Gruber, 1998). Preliminary analyses using different ordinary ridge regression methods, however, resulted in a small reduction in the variance inflation factors and similar MSEP to LS (data not shown), in line with Delaney and Chatterjee (1986). These authors stated that the ordinary ridge regression model is not appropriate for multicollinearity caused by physical or mathematical constraints in the data. Because breed composition sums to one for each observation, a mathematical constraint was present in the data. Generalized ridge regression methods, such as those used in this investigation, are better suited to deal with this source of multicollinearity.

The ridge parameters obtained by the two objective methods are shown in Table 4. The selected constant $\theta$ for calculating the ridge parameter K in R2, that minimizes the MSEP, was equal to 0.04. The mean and the SD of the number of unselected observations over the bootstrap samples in the last iteration for solving the genetic model were equal to 176,002 and 257, respectively. The elements of the ridge parameter K obtained on the basis of R1 were generally smaller than those on the basis of R2. Consequently, smaller bias in the estimates of regression coefficients of R1, compared with R2, can be expected. Bias measurements of R1 and R2 were equal to 1.49 and 5.61%, respectively.


 
Table 4. Values of the ridge parameter K obtained by ridge regression methods R1 and R2
 
Mean Squared Error of Prediction and Variance Inflation Factor
Table 5 shows the average MSEP and VIF obtained over 100 bootstrap samples under LS and ridge regression methods. Both ridge regression methods outperformed the LS estimator with regard to MSEP and VIF. The MSEP of the two ridge regression methods were similar, and they were 3% less than the MSEP of LS. The fact that both ridge regression methods had similar MSEP suggests that specific linear combinations of estimated regression coefficients were equally determined, even though individual coefficients differed between methods (Belsley, 1991). Ridge regression methods were also superior to LS in terms of the decrease in VIF. The average VIF given by ridge regression methods R1 and R2 were 77 and 84% less, respectively, than the average VIF given by LS. Thus, larger bias in the estimates of R2 was compensated for by a substantial decrease in the variance of the estimates. Consequently, estimates obtained by ridge regression methods, notably by R2, will be less sensitive to small changes in the dataset, such as inclusion of new observations. This expectation was confirmed when ridge parameters determined using records from 1986 to 1996 were used in the estimation of fixed genetic effects of subsequent years (data not shown). The estimates of breed additive, dominance, and epistatic loss effects under ridge regression methods were more stable over the years than estimates under LS. This observation has potential practical implications in routine genetic evaluations for two reasons: 1) more consistency in the across-breed estimated breeding values can be expected in successive genetic evaluations, which can foster more confidence among producers in the genetic improvement program; and 2) the ridge parameter can be estimated less often than genetic evaluations are run, decreasing the computational demand.
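The 77 and 84% decreases follow directly from the average VIF reported in the abstract (26.81, 6.10, and 4.18 for LS, R1, and R2). The sketch below checks that arithmetic and also shows one commonly used way of computing VIF with and without a ridge parameter. The function `ridge_vif`, the toy data, and the value k = 0.05 are assumptions for illustration; the definition of VIF actually used in this study is the one given in Materials and Methods.

```python
import numpy as np

# Percentage decrease in average VIF relative to LS, from the values in the abstract.
avg_vif = {"LS": 26.81, "R1": 6.10, "R2": 4.18}
reduction = {m: 100 * (1 - avg_vif[m] / avg_vif["LS"]) for m in ("R1", "R2")}

def ridge_vif(X, K=None):
    """Diagonal of (R + K)^-1 R (R + K)^-1, with R the correlation matrix of X.

    With K = 0 this is the usual VIF, i.e. the diagonal of R^-1.
    """
    R = np.corrcoef(X, rowvar=False)
    K = np.zeros_like(R) if K is None else K
    A = np.linalg.inv(R + K)
    return np.diag(A @ R @ A)

# Hypothetical toy predictors with one near dependency.
rng = np.random.default_rng(4)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vif_ls = ridge_vif(X)
vif_ridge = ridge_vif(X, 0.05 * np.eye(3))
```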


 
Table 5. Summary of results obtained over 100 bootstrap samples for ordinary least squares (LS) and ridge regression methods R1 and R2
 
The last two rows in Table 5 are the total variance and the square of the estimates. Methods used to deal with multicollinearity problems typically generate predictors with smaller variance and smaller range of the predictor vector compared with the LS estimator. These two statistics are influenced by the magnitude of the ridge parameter. Small ridge parameters imply less restriction (shrinkage) on the size of regression coefficients, whereas large values of ridge parameters imply an a priori belief that estimates of regression coefficients should be smaller or restricted.

From Table 5, the two ridge regression methods provided a general improvement over LS, when evaluated by MSEP and average VIF obtained over a large number of bootstrap samples. Additional information for comparing the ridge regression methods, based on the decrease in instability of each parameter estimate, is presented in Figure 2. When multicollinearity was of concern, both ridge regression methods caused a substantial decrease in the VIF, but VIF given by R2 were smaller than VIF given by R1 for most predictor variables.



 
Figure 2. Variance inflation factor (VIF) associated with predictor variables of direct and maternal dominance (H), epistatic loss (E), and breed additive effects under ordinary least squares (LS) and ridge regression methods R1 and R2. The breeds are Angus (AN), Blonde d’Aquitaine (BD), Charolais (CH), Gelbvieh (GV), Hereford (HE), Limousin (LM), Maine-Anjou (MA), Salers (SA), Shorthorn (SH), and Simmental (SM).

 
Breed Additive Effects
Estimates and SE of direct and maternal breed additive effects, as deviations from AN, are presented in Table 6. Estimates of direct breed additive effects showed large SE under LS. Ridge regression methods substantially decreased the SE when predictor variables were closely associated. The SE of the estimate for direct breed additive genetic effect for the worst case of multicollinearity (HE breed; largest VIF in Figure 1) was decreased from 0.63 in LS to 0.20 in R1 and to 0.10 in R2.


 
Table 6. Estimates of direct and maternal breed additive effects on preweaning weight gain as deviations from Angus, obtained by ordinary least squares (LS) and ridge regression methods R1 and R2
 
Estimates of maternal breed additive genetic effects had a different pattern than direct effects. Ridge regression estimates of maternal breed effects for GV, HE, MA, and SM were of larger magnitude than LS estimates. Standard errors of estimates by ridge regression methods, however, were always smaller than SE given by LS. Increasing the ridge parameter K indefinitely in the ridge regression analysis will force all coefficients to zero, but for small values of k_i it is not uncommon to see a regression coefficient increase in absolute value as k_i increases (Marquardt and Snee, 1975).

Table 6 shows that estimates of direct and maternal breed additive effects of BD, GV, MA, SA, and SH still had relatively large SE under ridge regression methods compared with the remaining breeds. It was previously shown, however, that variance-decomposition proportions of maternal breed additive effects for BD, GV, MA, SA, and SH associated with the largest condition index were lower than 0.5 (Table 3). Thus, the large SE of the estimates of maternal breed effects for BD, GV, MA, SA, and SH were more likely a consequence of the relatively small number of observations in these breeds than multicollinearity involving the corresponding predictor variables.

In addition to the statistical advantages of more stable solutions for breed differences, the practical implications of the observed differences in breed solutions across the three analyses should be carefully considered with respect to across-breed selection using the AB-EBV. Results of this study indicate that estimation problems associated with multicollinearity among predictor variables, often seen in multibreed genetic evaluations, can be greatly reduced using ridge regression methods. Nevertheless, in the present study it was not possible to determine the correlation between estimated and true across-breed breeding values (accuracy) for the alternative analyses. A simulation study would likely elucidate this question.

Dominance and Epistatic Loss Effects
Dominance effects indicate deviation from average dominance within breed due to differences in gene frequencies between breeds (breed heterozygosity). Epistatic loss effects express the recombination loss due to breed heterozygosity in relation to F2 calves and F2 dams, respectively. According to Koch et al. (1985), long-term selection within a breed can increase frequencies of favorable non-allelic combinations, which result in favorable effects on phenotype. When breeds are crossed, random recombination of loci in the progeny tends to decrease the frequencies of these parental breed combinations towards Hardy-Weinberg equilibrium, resulting in recombination loss.

Estimates of dominance and epistatic loss effects and respective SE obtained by LS and by ridge regression methods are presented in Table 7. Both direct and maternal dominance effects resulted in a favorable effect on preweaning gain, whereas direct and maternal epistatic loss decreased preweaning gain, as expected. The estimate of maternal epistatic loss, however, was not different from zero (P > 0.05). Because predictor variables HM and ED were involved in multicollinearity, ridge regression methods caused substantial changes in the estimates of maternal dominance and direct epistatic loss effects. A small decrease in the SE of estimates of maternal dominance and direct epistatic loss effects was obtained through ridge regression methods. This decrease was slightly more pronounced under the R2 method. Estimated maternal dominance and direct epistatic loss effects were of opposite sign and comparable magnitude, and had large SE under LS (Table 7). Both ridge regression methods R1 and R2 seemed to slightly alleviate the multicollinearity involving maternal dominance and direct epistatic loss effects. The estimates of maternal dominance and direct epistatic loss effects decreased in magnitude from 2.28 and –2.19% in LS to 1.72 and –1.04% in R1, and to 1.55 and –0.66% in R2, respectively.


 
Table 7. Estimates of direct and maternal dominance (H) and epistatic loss (E) effects on preweaning weight gain, obtained by ordinary least squares (LS) and ridge regression methods R1 and R2
 
Estimates of direct and maternal dominance obtained in this study were lower than the range of heterosis from 3 to 8% (mean = 4%) reported by Long (1980) for preweaning gain. The causes for these low estimates of dominance effects are not clear. Partial confounding of contemporary group effects with breed composition and breed heterozygosity effects could be a reasonable explanation of the low estimates of dominance effects. Nonetheless, a preliminary analysis to check for connectedness among contemporary groups across breeds was performed, and only connected contemporary groups with at least two classes of direct or maternal heterozygosities were retained for the analysis. Additional analyses where dominance effects were estimated for each pair of breeds (including only breeds with a large number of records) to accommodate possible specific combining ability between breed pairs, excluding epistatic loss effects in the model, likewise resulted in low estimates (data not shown), which agrees with results obtained by Miller (1996).

Sampling Correlations
To obtain information on the degree of confounding between estimates given by LS, R1, and R2, sampling correlations among estimates of breed additive, dominance, and epistatic loss effects were calculated. Overall averages of absolute values for pairwise correlations among estimates under LS, R1, and R2 were equal to 0.49, 0.30, and 0.18, respectively. These correlations indicated a substantial decrease in the degree of overall association between estimates given by ridge regression methods, especially with R2, compared with LS. The decrease in the degree of association between estimates was more pronounced between estimates of direct and maternal breed effects involving different breeds than between direct and maternal breed effects for the same breed.
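Sampling correlations are obtained by scaling the covariance matrix of the estimates by the corresponding standard deviations. The sketch below uses a hypothetical two-predictor cross-product matrix with correlation 0.95 and a hypothetical ridge parameter k = 0.1; the names `sampling_correlations` and `mean_abs_offdiag` are assumptions, and the residual variance is omitted because it cancels in the correlations.

```python
import numpy as np

def sampling_correlations(cov):
    """Convert a covariance matrix of estimates into a correlation matrix."""
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

def mean_abs_offdiag(corr):
    """Average absolute pairwise correlation (off-diagonal elements only)."""
    mask = ~np.eye(corr.shape[0], dtype=bool)
    return np.abs(corr[mask]).mean()

# Hypothetical standardized cross-product matrix for two correlated predictors.
XtX = np.array([[1.0, 0.95],
                [0.95, 1.0]])
K = 0.1 * np.eye(2)  # assumed ridge parameter, for illustration only

# Covariance of estimates up to the residual variance, which cancels in the correlations.
cov_ls = np.linalg.inv(XtX)
A = np.linalg.inv(XtX + K)
cov_ridge = A @ XtX @ A

corr_ls = sampling_correlations(cov_ls)
corr_ridge = sampling_correlations(cov_ridge)
```

In this toy case the LS estimates have a sampling correlation of –0.95, and the ridge estimates are less strongly correlated in absolute value, which is the kind of decrease summarized by the averages above.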

Figure 3 shows correlations between estimates of maternal dominance and direct epistatic loss effects and between direct and maternal breed additive effects for the same breed. Averages of these correlations were equal to –0.88, –0.79, and –0.74 under LS, R1, and R2, respectively. Under ridge regression methods, breeds with more multicollinearity showed a substantial decrease in the degree of confounding between estimates of direct and maternal breed additive effects, noticeably under R2. In contrast, estimates of maternal dominance and direct epistatic loss effects were still highly correlated under both ridge regression methods. The correlation between estimates of maternal dominance and direct epistatic loss was 0.94 in the LS and 0.93 in both ridge regression methods. These results suggest that the variety of crosses available in the data, aggravated by linear dependences between HM and ED, caused by mainly purebred sires being used, did not comprise enough information to effectively separate maternal dominance and direct epistatic loss effects, regardless of the fact that both effects were statistically significant.



 
Figure 3. Sampling correlations (multiplied by –1.0) between estimates of maternal dominance (HM) and direct epistatic loss (ED) effects and between estimates of direct and maternal breed additive effects given by ordinary least squares (LS) and ridge regression methods R1 and R2. The breeds are Angus (AN), Blonde d’Aquitaine (BD), Charolais (CH), Gelbvieh (GV), Hereford (HE), Limousin (LM), Maine-Anjou (MA), Salers (SA), Shorthorn (SH), and Simmental (SM).

 
Comparison of Across-Breed Estimated Breeding Values
Comparisons of AB-EBV from additive-dominance models AD-AH, AD-LS and AD-R2, and the additive-dominance-epistatic models ADE-LS and ADE-R1 with AB-EBV from additive-dominance-epistatic model ADE-R2 with respect to Pearson and Spearman correlations and percentages of coincidence for different proportions of selected calves are depicted in Figure 4. Results for sires and dams are not shown because their pattern was similar to that for calves. Overall Pearson and Spearman correlations between AB-EBV were high, ranging from 0.85 to 1.0. The AB-EBV from Model AD-R2 were perfectly correlated with AB-EBV from Model ADE-R2. Even when only 1% of top animals were compared on the basis of AB-EBV, percentages of coincidence between AD-R2 and ADE-R2 were equal to or higher than 0.99. Both Models AD-R2 and ADE-R2 used ridge regression method R2 for estimating the fixed genetic effects, but Model AD-R2 did not include epistasis. Thus, the inclusion of epistatic loss in the genetic model did not cause re-ranking of sires, dams, and calves. This observation is also corroborated by the fact that Pearson and Spearman correlations and percentages of coincidences of Models AD-LS and ADE-LS with Model ADE-R2 were very similar. Fixed genetic effects of both Models AD-LS and ADE-LS were estimated using LS, but they differed by the fact that Model AD-LS did not include epistasis.
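The three comparison statistics can be sketched as follows, taking the percentage of coincidence to be the fraction of animals in the top proportion under one model that are also in the top proportion under the other. The function names, the simulated breeding values, and the noise level are assumptions for illustration; they are not the AB-EBV from this study.

```python
import numpy as np

def ranks(x):
    """Ranks 0..n-1 (ties broken by position, adequate for continuous values)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def compare_ebv(ebv_a, ebv_b, top_fraction):
    """Pearson and Spearman correlations and coincidence among the top animals."""
    pearson = np.corrcoef(ebv_a, ebv_b)[0, 1]
    # Spearman correlation is the Pearson correlation of the ranks.
    spearman = np.corrcoef(ranks(ebv_a), ranks(ebv_b))[0, 1]
    n_top = max(1, int(round(top_fraction * len(ebv_a))))
    top_a = set(np.argsort(ebv_a)[-n_top:])
    top_b = set(np.argsort(ebv_b)[-n_top:])
    coincidence = len(top_a & top_b) / n_top
    return pearson, spearman, coincidence

# Hypothetical breeding values from two models that mostly agree.
rng = np.random.default_rng(5)
ebv_model_1 = rng.normal(size=10_000)
ebv_model_2 = ebv_model_1 + 0.3 * rng.normal(size=10_000)

summary = {f: compare_ebv(ebv_model_1, ebv_model_2, f) for f in (0.01, 0.10, 0.20, 0.40)}
```

Because the top 1% depends only on the extreme tail, coincidence at that level can be well below the overall correlations even when the two sets of values agree closely on the whole.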



 
Figure 4. Pearson and Spearman correlations, and percentages of coincidence for different proportions of selected (top 1, 10, 20, and 40%) calves on the basis of direct AB-EBV given by different models compared to Model ADE-R2. Models are additive-dominance-epistatic model (ADE) and additive-dominance model (AD). Estimation methods are ordinary least squares (LS), ridge regression methods 1 and 2 (R1 and R2), and pre-adjustment for heterosis and use of LS for estimating breed effects (AH).

 
When the highest 1% direct AB-EBV of sires, dams, and calves under Model ADE-R2 and under Models AD-AH, AD-LS, ADE-LS, and ADE-R1 were compared, percentages of coincidence were much lower than the overall Pearson and Spearman correlations, especially in the Model AD-AH (0.66, 0.65, and 0.61 for sires, dams, and calves, respectively). These results point to important reranking of top animals. Among the 1% best sires, dams, and calves, 34, 35, and 39% selected based on ADE-R2 would not be selected based on Model AD-AH, which assumed an ad hoc heterosis of 5% for both direct and maternal effects and did not account for multicollinearity among predictor variables. As the proportion of selected animals increased to 40%, the percentages of coincidence between Models AD-AH and ADE-R2 increased to 0.78 for sires, 0.81 for dams, and 0.78 for calves. Higher percentages of coincidence between Models AD-AH and AD-LS (not shown) than between Models AD-AH and ADE-R2 suggested that practical differences between Models AD-AH and ADE-R2 were predominantly from differences in breed additive effects rather than different non-additive effects. When the 1, 10, 20, and 40% highest calves’ AB-EBV under Models AD-AH and AD-LS were compared, percentages of coincidence were 0.81, 0.87, 0.92, and 0.93, respectively.

Models AD-LS and ADE-LS were similarly correlated with Model ADE-R2. Compared with Model AD-AH, Models AD-LS and ADE-LS had larger percentages of coincidence with Model ADE-R2; however, the difference with Model ADE-R2 was still substantial. Among the 1% highest AB-EBV, approximately 30% of selected animals, based on ADE-R2, would not be selected based on Models AD-LS and ADE-LS. Among the 40% highest AB-EBV under Model ADE-R2, approximately 20% of selected animals would not be selected based on Models AD-LS and ADE-LS. Model ADE-R1 showed a larger percentage of coincidence with Model ADE-R2 than Models AD-AH, AD-LS, and ADE-LS, but differences with ADE-R2 were still considerable. These results confirm that the choice of the method to estimate the ridge parameter has consequences for genetic selection, resulting in different ranking of animals on the basis of across-breed estimated breeding values.

The ridge regression estimator is a type of weighted average between the actual data and other data taken according to an orthogonal experiment (in Bayesian terms, the prior information), for which the response values are arbitrarily set to zero (Marquardt, 1970). An alternative to ridge regression is to combine the actual data with prior information from the literature. This Bayesian procedure was used to estimate breed additive and heterosis effects in a multibreed model (Pollak and Quaas, 1998). The choice of prior distributions is not simple in practice. Pollak and Quaas (1998) used prior distributions based on expected values from literature such that neither prior information nor data would dominate the solutions. For heterosis, however, because the available data did not provide reasonable estimates, a prior distribution was chosen such that most of the weight was on the prior information. A comparison of ridge regression with the Bayesian procedure used by Pollak and Quaas (1998) would be recommended.
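The interpretation of ridge regression as real data combined with an orthogonal pseudo-experiment whose responses are zero can be verified numerically. In the minimal sketch below, the toy data and the ridge parameters `k` are assumptions; appending the rows diag(sqrt(k_i)) with zero responses and applying least squares gives the same solution as adding diag(k_i) to X'X.

```python
import numpy as np

# Hypothetical toy data and ridge parameters.
rng = np.random.default_rng(3)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
k = np.array([0.5, 2.0, 1.0])

# Direct ridge solution: (X'X + diag(k)) b = X'y.
b_ridge = np.linalg.solve(X.T @ X + np.diag(k), X.T @ y)

# Same estimator via augmentation: one orthogonal pseudo-record per coefficient,
# each with a response of zero, then ordinary least squares on the stacked data.
X_aug = np.vstack([X, np.diag(np.sqrt(k))])
y_aug = np.concatenate([y, np.zeros(p)])
b_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
```

The equality holds because the stacked cross-product matrix is X'X + diag(k) while the stacked right-hand side is still X'y. Replacing the zero responses with values taken from the literature would correspond to the informative-prior alternative discussed in the text.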


    Implications
 
Estimates of breed effects obtained by ridge regression were more precise than those obtained by least squares. The ridge regression methods were particularly effective in decreasing the degree of multicollinearity involving predictor variables of breed additive effects. The use of estimated breed differences given by ridge regression methods will produce more stable across-breed comparisons over the years, but the implications for the accuracy of across-breed comparisons should be further investigated. A simulation study would help to elucidate this matter. Due to the high degree of confounding between estimates of maternal dominance and direct epistatic loss effects, it was not possible to compare the relative importance of these effects with a high level of confidence. The inclusion of epistatic loss effects in the standard additive-dominance evaluation model did not cause appreciable reranking of animals on the basis of across-breed estimated breeding values.


    Footnotes
 
1 The authors thank Beef Improvement Ontario (BIO) for providing data, BIO and Natural Sciences and Engineering Research Council of Canada for financial support, and the Canadian Foundation for Innovation, Ontario Innovation Trust, and Compaq for supporting the required computing infrastructure. We are thankful to G. J. Umphrey, J. W. Wilton, and P. G. Sullivan for their contributions to this paper.

2 Correspondence: Dept. of Anim. and Poultry Sci., Room 018 (phone: 1-519-824-4120, ext. 58650; fax: 1-519-767-0573; e-mail: Schenkel@uoguelph.ca).

Received for publication February 4, 2005. Accepted for publication April 22, 2005.


    Literature Cited
 


Arthur, P. F., H. Hearnshaw, and P. D. Stephenson. 1999. Direct and maternal additive and heterosis effect from crossing Bos indicus and Bos taurus cattle: Cow and calf performance in two environments. Livest. Prod. Sci. 57:231–241.

Belsley, D. A. 1991. Conditioning Diagnostics, Collinearity and Weak Data in Regression. 1st ed. John Wiley and Sons, Inc., New York, NY.

Belsley, D. A., E. Kuh, and R. E. Welsch. 1980. Regression Diagnostics. 1st ed. John Wiley and Sons, Inc., New York, NY.

Chatterjee, S., A. S. Hadi, and B. Price. 2000. Regression Analysis by Example. 3rd ed. John Wiley and Sons, Inc., New York, NY.

Delaney, N. J., and S. Chatterjee. 1986. Use of the bootstrap and cross-validation in ridge regression. J. Bus. Econ. Statist. 4:255–262.

Efron, B. 1979. Bootstrap methods: Another look at the Jackknife. Ann. Stat. 7:1–26.

Freund, R., and R. C. Littell. 2000. SAS System for Regression. 3rd ed. SAS Inst., Inc. Cary, NC.

Fries, L. A. 1998. Connectability in beef cattle genetic evaluation: The heuristic approach used in MILC.FOR. Proc. 6th World Cong. Genet. Appl. Livest. Prod., Armidale, NSW, Australia. 27:449–500.

Fries, L. A., D. J. Johnston, H. Hearnshaw, and H. U. Graser. 2000. Evidence of epistatic effects on weaning weight in crossbreed beef cattle. Asian-Aust. J. Anim. Sci. 13(Suppl. B):242.

Fries, L. A., F. S. Schenkel, V. M. Roso, F. V. Brito, J. L. P. Severo, and M. L. Piccoli. 2002. "Epistazygosity" and epistatic effects. Proc. 7th World Cong. Genet. Appl. Livest. Prod., Montpellier, France. Communication No. 17–15.

Goldstein, M., and A. F. M. Smith. 1974. Ridge-type estimators for regression analysis. J. R. Stat. Soc. 36:284–291.

Gruber, M. H. J. 1998. Improving Efficiency by Shrinkage. The James-Stein and Ridge Regression Estimators. 1st ed., Marcel Dekker, Inc., New York, NY.

Hébel, P., R. Faivre, B. Goffinet, and D. Wallach. 1993. Shrinkage estimators applied to prediction of French winter wheat yield. Biometrics 49:281–293.

Hoerl, A. E., and R. W. Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12:55–67.

Koch, R. M., G. E. Dickerson, L. V. Cundiff, and K. E. Gregory. 1985. Heterosis retained in advanced generations of crosses among Angus and Hereford cattle. J. Anim. Sci. 60:1117–1132.

Long, C. R. 1980. Crossbreeding for beef productions: experimental results. J. Anim. Sci. 51:1197–1223.

Lowerre, J. M. 1974. On the mean square error of parameter estimates for some biased estimators. Technometrics 16:461–464.

Madsen, P., and J. Jensen. 2000. DMU—A package for analysing multivariate mixed models. Danish Inst. of Agric. Sci. (DIAS), Tjele, Denmark.

Marquardt, D. W. 1970. Generalized inverses, ridge regression, biased linear estimation, and nonlinear estimation. Technometrics 12:591–612.

Marquardt, D. W., and R. D. Snee. 1975. Ridge regression in practice. Am. Statist. 29:3–20.

Miller, S. P. 1996. Studies on genetic evaluation and the effect of milk yield on profit potential in a multibreed beef cattle population. Ph.D. Thesis, Univ. of Guelph, Canada.

Pollak, E. J., and R. L. Quaas. 1998. Multibreed genetic evaluation of beef cattle. Proc. 6th World Cong. Genet. Appl. Livest. Prod., Armidale, Australia 23:81–88.

Rodríguez-Almeida, F. A., L. D. Van Vleck, and K. E. Gregory. 1997. Estimation of direct and maternal breed effects for prediction of expected progeny differences for birth and weaning weights in three multibreed populations. J. Anim. Sci. 75:1203–1212.

Roso, V. M., F. S. Schenkel, and S. P. Miller. 2004. Degree of connectedness among groups of centrally tested beef bulls. Can. J. Anim. Sci. 84:37–47.

Sorensen, D., and D. Gianola. 2002. Likelihood, Bayesian, and MCMC methods in quantitative genetics. Springer-Verlag New York, Inc., New York, NY.

Weisberg, S. 1985. Applied Linear Regression. 2nd ed. John Wiley and Sons, Inc., New York, NY.


