|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ANIMAL GENETICS |
,2
* Department of Statistics, University of Nebraska, Lincoln 68583–0963; and
USDA, ARS, Roman L. Hruska US Meat Animal Research Center, Lincoln, NE 68583–0908
| Abstract |
|---|
|
|
|---|
Key Words: average information matrix genetic parameter restricted maximal likelihood standard error
| INTRODUCTION |
|---|
|
|
|---|
When some traits were not recorded on some units (e.g., animals), the standard approach used has been to discard incompletely recorded units. In the worst case, when males have one trait measured and females have another trait measured, there are no animals that have both traits measured. A similar case is observed for genotype by environment interaction, where some animals have records in one environment and other (some related) animals have records in another environment.
Although the program uses a derivative-free algorithm (simplex) to minimize –2logL|y, where L|y is the likelihood (L) given the data (y), the asymptotic SE for single-trait analyses and multiple-trait analyses with all traits measured are based on the average information matrix (AIM; Johnson and Thompson, 1995
) as implemented by Dodenhoff et al. (1998)
.
| MATERIALS AND METHODS |
|---|
|
|
|---|
The limitation that all animals with observations have all traits measured can be overcome without any changes in the set of MTDFREML programs. A model-based procedure will be described to accomplish that goal, which makes use of properties of the mixed model equations. The relatively trivial changes in the data file and model will be described.
Each missing observation for a trait is assigned a unique level of a dummy factor associated with that trait. Each missing observation can be assigned the same arbitrary pseudo-value. The program accepts that value as a real observation. The result is that the program handles the analysis as if all traits were observed for each animal with records. In essence, the estimated residual element associated with the pseudo-observation is a structural zero because the pseudo-observation is alone in a unique level of the newly created fixed factor. Because structural zeros in the estimated residual vector contribute nothing to the AIM, the resulting AIM is the same as if it was computed with a more complex algorithm. Similarly, L|y also is the same as if it was computed with missing observations ignored. (See Appendix for proof of equivalence.)
| RESULTS |
|---|
|
|
|---|
The simplest case is when the data line for an animal that includes a missing measurement for a trait includes the levels of fixed factors that would have been associated with an actual measurement. (Often these would be the same as for another trait that has an observation.) As an example, consider the 2-trait analysis for the data file partially represented in Table 1
, which is the data file used by K. Meyer (University of New England, Armidale, New South Wales, Australia, unpublished data) as an example with early versions of her DFREML program but now with some observations missing. The table presents data for the first 20 of 284 mice. Observations on the 2 traits are in the last 2 columns. A –99.00 represents a pseudo-value for a missing observation. Actual observations will be assigned level 1 for the dummy factor (MTDFREML does not accept 0 as a valid level). Each missing observation for a trait will be assigned a unique level for the dummy factor. An easy-to-program method is to assign level 2 for the first missing observation, level 3 for the next, etc. This assignment of levels would be done for each trait. Table 1
shows the result for the first 20 animals that includes all animals with missing observations. The first animal has trait 1 measured so field m1 has a 1 for the level, but the first animal does not have trait 2 measured so field m2 has a 2 as the first unique level. Animal 2 has both traits measured, so fields m1 and m2 are both 1. The third animal is missing trait 1 but has trait 2, so m1 is 2 and m2 is 1. Animal 8 is missing both traits and both would represent the second missing observation, so m1 and m2 will both be 3. The pattern for the remaining missing observations is obvious. For the first 20 animals shown for the data file, trait 1 has 3 missing observations with corresponding unique levels for m1 of 2, 3, and 4. Trait 2 has 2 missing observations with corresponding unique levels for m2 of 2 and 3.
|
(number in litter) are the same for both traits and are present even for the eighth animal, which is missing both traits. A suitable strategy for handling the case when some of the levels for those 3 factors are missing will be described later. Note that the pseudo-values –99.00 will now be treated as actual observations in the analysis with MTDFREML. For trait 1 alone with the full model including maternal effects, –2logL|y is 744.53 for the analysis with missing observations treated as missing and also is 744.53 for the analysis with –99.00 treated as observed measurements for trait 1 but with the fixed factor m1 added to the model. Estimates of variance components and genetic parameters as well as the average information matrix and asymptotic SE of the genetic parameters are also the same for both analyses. Similarly, estimates of estimable functions of the fixed effects are the same although the analysis with –99.00 as pseudo measurements also has estimates for levels 1,..., 6 of m1.
Choice of a Pseudo-Value for a Missing Observation
When 0.00 is substituted for –99.00 to denote a missing measurement, the –2logL and estimates of (co)variance components and genetic parameters with SE are the same as when –99.00 is used. Although estimates of levels for some fixed factors can be different, estimates of estimable functions will be the same. Estimates for levels of m1 are greater by 99.00 when 0.00 rather than –99.00 is substituted for missing values. That difference forces the residual effects for missing observations to be zero in both cases. The solution for level 1 of m1 was constrained to zero for both analyses (when –99.00 and 0.00 were substituted for missing values). As had to happen, similar analyses of trait 2 resulted in the same results.
Different pseudo-values could be assigned to different missing observations because the key step is to have only 1 missing observation within a level of m1 or m2. Assignment of the same pseudo-value for each missing observation is usually the easiest way to modify the data file.
Assignment of Levels of Other Factors for Missing Observations
Often a missing observation will not have levels associated with fixed factors as will actual observations. The question then is what levels of these fixed factors should be assigned when pseudo-observations are analyzed within unique levels of factor m1 (and/or m2). Three options for an analysis of a single trait were tried: 1) a unique level different from those associated with actual observations was used (in the example, level 10 was used for generation, sex, and number in the litter when actual levels were 1 to 3, 1 to 2, and 1 to 7); 2) level 1 was assigned to gn, sx, and nl when a pseudo-observation was analyzed as a real observation; and 3) the last level of gn (3), sx (2), and nl (7) was assigned. As expected, results were the same as those described before for –2logL|y, estimates of variance components and genetic parameters, and estimable functions of estimates of fixed effects.
From a practical standpoint, assigning a different unique level from those for actual observations for each fixed factor would seem preferable. The output file from MTDFPREP, MTDF66, will then provide means for each level of each factor for checking. This combines both the pseudo-observations and actual observations (see Table 2
). The means for levels with actual observations would be actual means and means for the different level would be the missing value (pseudo-values of 0.00 or –99.00, etc.). The overall mean and unadjusted SD in MTDF66, however, would be calculated from a mixture of actual and pseudo-observations. Note that the mean for level 1 of m1 (missing 1) is the unadjusted mean for actual measurements.
|
Two-Trait Analyses
Two-trait analyses were also done with the preceding ways of handling missing observations. As before, –2logL, estimates of variance components and genetic parameters, and estimates of estimable functions of fixed effects were the same when missing observations were excluded or included as pseudo-observations assigned unique levels of fixed factor m1 for trait 1 and fixed factor m2 for trait 2 for the case when levels of the other fixed factors were already available.
When levels of the other fixed factors are not available for missing observations, then modifications of the fields for fixed factors are needed. The options could be as described earlier, taking care not to modify the levels for a field shared by a trait with real measures when modifying levels of the same field for a pseudo measurement of another trait. An easy solution is to have separate fields for each trait. In contrast to the usual analysis that would include fixed effects for gn, sx, and nl for trait 1 and trait 2, the augmented analysis has a model equation for trait 1 that has the fixed effects gn1, sx1, and nl1, whereas the model equation for trait 2 has gn2, sx2, and nl2. Table 3
shows how to accomplish this in the example. Two extra sets of 3 fields were added in addition to m1 and m2. The first line of Table 3
for missing trait 2 shows that the 3 extra fields for trait 1 are the same as the original 3 fields, but the 3 extra fields for trait 2 now contain the level 10. For data line 3, the 3 extra fields for trait 1 contain level 10, and the 3 extra fields for trait 2 are the original levels. For data line 8, both sets of 3 extra fields contain level 10. (Assigning first or last levels for each factor would also work). In all 3 cases, the –2logL|y, estimates of (co)variance components, genetic parameters, and estimable functions of fixed effects were the same as with excluding the missing observations from the analysis. With each of these 3 assignments of fixed levels along with dummy fixed factors m1 and m2, estimates of SE of estimates of genetic parameters were obtained as well as the same –2logL, estimates of genetic parameters, and estimates of estimable functions of levels of fixed factors.
|
could be used, but with other genotype x environmental interaction models that would not always be true.
|
A similar multiple-trait model is often used to estimate covariance components due to genotype x environmental interaction based on sires with progeny in more than one environment. Environment is often region or country. Often thousands of animals provide observations. For a 2-trait G x E analysis, each animal would have a record only in its own environment, but to obtain a SE for the estimate of the genetic correlation between environments, each animal would also have a pseudo-value for the other environment. The numbers of unique levels for the dummy factors (m1 and m2) would effectively be the total number of animals with observations in the 2 environments, which could be many thousands, whereas the number of genetic effects would usually be twice the total number of animals, with a much smaller number with a sire model. Most of the mixed model equations would be associated with levels of the dummy factors for pseudo-observations. The 2-trait file would usually be created by joining files from the 2 regions with separate sets of fields for fixed factors and with 2 new fields for the pseudo missing observations for environment 1 or 2 very much the same as for Table 4
. The ASREML program (Gilmour et al., 2002
) appears to use a similar procedure but without external changes in the data file.
A method has been developed to obtain SE of genetic parameters with the MTDFREML program for multiple-trait analyses when some observations are missing for some traits. The method involves relatively small changes in the data file. Estimates of genetic parameters, estimable functions of fixed effects, and L|y are the same as for the original data file.
The steps in changing the data file and analysis of the augmented data file may vary, but the following are suggested:
| APPENDIX |
|---|
|
|
|---|
Let yo be the no x 1 vector of observed data, where
![]() |
The REML estimators of the (co)variance components of G and Ro are the maximum likelihood estimators of those (co)variance components based on the distribution of the residual vector Ko ' yo. The (no – po) x no matrix Ko ' is selected such that 1) po is the rank of Xo, 2) Ko ' Xo = 0, and 3) the rank of Ko ' is no – po. These requirements define REML (e.g., Harville, 1977
; Searle et al., 1992
). The distribution of the residual vector is
![]() |
Because the selection of Ko ' is based on the column space of Xo, augmenting the original design matrix Xo with additional columns, Wo, will not impact the selection of Ko ' provided that the columns of Wo are a linear function of the columns of Xo. So without loss of generality, the vector of missing value effects, mo, and design matrix, Wo, can be added to the observed model equations for the data,
![]() |
The same residual vector, Ko ' yo, can be used provided that the columns of Wo are a linear function of the columns of Xo.
Next let yu be the nu x 1 vector of unobserved data, where
![]() |
with Wu is a matrix with rank nu and with u
N(0,G) and
![]() | (1) |
The n x 1 augmented data vector, y, is then formed by concatenating the observed and unobserved data vectors. The design matrix for the fixed effects for the augmented data is
![]() |
The residual vector for the observed data, Ko ' yo, can be written in terms of the augmented data as
![]() |
It remains to be shown that K'y is a (n – p) x 1 residual vector for the augmented data that satisfies: 1) p = rank of X, the design matrix for the augmented data, 2) K'X = 0, and 3) rank of K' is n – p. Because none of the last nu rows of X are a linear function of the first no rows, the rank of X is equal to the rank of the first no rows, po, plus the rank of the last nu rows, nu. Therefore, the rank of X is p = nu + po and K'y is a (n – p) x 1 residual vector because n – p = (no + nu) – (nu + po) = no – po. With K'X = Ko ' Xo + Ko ' Wo and Ko ' selected such that Ko ' Xo + Ko ' Wo = 0, K'X must also be equal to 0. Because K' = (Ko ' 0), the rank of K' is equal to the rank of Ko ' . With Ko ' selected to have rank no – po, which is equal to n – p, the rank of K' is also equal to n – p. Therefore, K'y satisfies the 3 conditions for a REML residual vector.
| Footnotes |
|---|
2 Corresponding author: lvanvleck{at}unlnotes.unl.edu
Received for publication April 5, 2007. Accepted for publication July 2, 2007.
| LITERATURE CITED |
|---|
|
|
|---|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |