J. Anim Sci.
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


J. Anim Sci. 2007. 85:2375-2381. doi:10.2527/jas.2007-0202
© 2007 American Society of Animal Science

This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow All Versions of this Article:
jas.2007-0202v1
85/10/2375    most recent
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Kachman, S. D.
Right arrow Articles by Van Vleck, L. D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kachman, S. D.
Right arrow Articles by Van Vleck, L. D.

ANIMAL GENETICS

Technical Note: Calculation of standard errors of estimates of genetic parameters with the multiple-trait derivative-free restricted maximal likelihood programs

S. D. Kachman*,1 and L. D. Van Vleck{dagger},2

* Department of Statistics, University of Nebraska, Lincoln 68583–0963; and {dagger} USDA, ARS, Roman L. Hruska US Meat Animal Research Center, Lincoln, NE 68583–0908


    Abstract
 Top
 Abstract
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 APPENDIX
 LITERATURE CITED
 
The multiple-trait derivative-free REML set of programs was written to handle partially missing data for multiple-trait analyses as well as single-trait models. Standard errors of genetic parameters were reported for univariate models and for multiple-trait analyses only when all traits were measured on animals with records. In addition to estimating (co)variance components for multiple-trait models with partially missing data, this paper shows how the multiple-trait derivative-free REML set of programs can also estimate SE by augmenting the data file when not all animals have all traits measured. Although the standard practice has been to eliminate records with partially missing data, that practice uses only a subset of the available data. In some situations, the elimination of partial records can result in elimination of all the records, such as one trait measured in one environment and a second trait measured in a different environment. An alternative approach requiring minor modifications of the original data and model was developed that provides estimates of the SE using an augmented data set that gives the same residual log likelihood as the original data for multiple-trait analyses when not all traits are measured. Because the same residual vector is used for the original data and the augmented data, the resulting REML estimators along with their sampling properties are identical for the original and augmented data, so that SE for estimates of genetic parameters can be calculated.

Key Words: average information matrix • genetic parameter • restricted maximal likelihood • standard error


    INTRODUCTION
 Top
 Abstract
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 APPENDIX
 LITERATURE CITED
 
The multiple-trait derivative-free REML (MTDFREML; Boldman et al., 1995Go) set of programs was written to handle single-trait models and multiple-trait models with partially missing data in an expedient manner. When estimating (co)variance components and genetic parameters for multiple-trait models, the programs have not been able to estimate SE of those estimates for multiple-trait models when animals with observations do not have all traits measured.

When some traits were not recorded on some units (e.g., animals), the standard approach used has been to discard incompletely recorded units. In the worst case, when males have one trait measured and females have another trait measured, there are no animals that have both traits measured. A similar case is observed for genotype by environment interaction, where some animals have records in one environment and other (some related) animals have records in another environment.

Although the program uses a derivative-free algorithm (simplex) to minimize –2logL|y, where L|y is the likelihood (L) given the data (y), the asymptotic SE for single-trait analyses and multiple-trait analyses with all traits measured are based on the average information matrix (AIM; Johnson and Thompson, 1995Go) as implemented by Dodenhoff et al. (1998)Go.


    MATERIALS AND METHODS
 Top
 Abstract
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 APPENDIX
 LITERATURE CITED
 
Animal Care and Use Committee approval was not obtained for this study because no animals were used.

The limitation that all animals with observations have all traits measured can be overcome without any changes in the set of MTDFREML programs. A model-based procedure will be described to accomplish that goal, which makes use of properties of the mixed model equations. The relatively trivial changes in the data file and model will be described.

Each missing observation for a trait is assigned a unique level of a dummy factor associated with that trait. Each missing observation can be assigned the same arbitrary pseudo-value. The program accepts that value as a real observation. The result is that the program handles the analysis as if all traits were observed for each animal with records. In essence, the estimated residual element associated with the pseudo-observation is a structural zero because the pseudo-observation is alone in a unique level of the newly created fixed factor. Because structural zeros in the estimated residual vector contribute nothing to the AIM, the resulting AIM is the same as if it was computed with a more complex algorithm. Similarly, L|y also is the same as if it was computed with missing observations ignored. (See Appendix for proof of equivalence.)


    RESULTS
 Top
 Abstract
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 APPENDIX
 LITERATURE CITED
 
Revised Data File

The simplest case is when the data line for an animal that includes a missing measurement for a trait includes the levels of fixed factors that would have been associated with an actual measurement. (Often these would be the same as for another trait that has an observation.) As an example, consider the 2-trait analysis for the data file partially represented in Table 1Go, which is the data file used by K. Meyer (University of New England, Armidale, New South Wales, Australia, unpublished data) as an example with early versions of her DFREML program but now with some observations missing. The table presents data for the first 20 of 284 mice. Observations on the 2 traits are in the last 2 columns. A –99.00 represents a pseudo-value for a missing observation. Actual observations will be assigned level 1 for the dummy factor (MTDFREML does not accept 0 as a valid level). Each missing observation for a trait will be assigned a unique level for the dummy factor. An easy-to-program method is to assign level 2 for the first missing observation, level 3 for the next, etc. This assignment of levels would be done for each trait. Table 1Go shows the result for the first 20 animals that includes all animals with missing observations. The first animal has trait 1 measured so field m1 has a 1 for the level, but the first animal does not have trait 2 measured so field m2 has a 2 as the first unique level. Animal 2 has both traits measured, so fields m1 and m2 are both 1. The third animal is missing trait 1 but has trait 2, so m1 is 2 and m2 is 1. Animal 8 is missing both traits and both would represent the second missing observation, so m1 and m2 will both be 3. The pattern for the remaining missing observations is obvious. For the first 20 animals shown for the data file, trait 1 has 3 missing observations with corresponding unique levels for m1 of 2, 3, and 4. Trait 2 has 2 missing observations with corresponding unique levels for m2 of 2 and 3.


View this table:
[in this window]
[in a new window]

 
Table 1. Data file for the first 20 of 284 mice, modified to create levels for a dummy factor (m1) for trait 1 and a dummy factor (m2) for trait 2 which will allow analysis as if the missing value indicators (–99.00) are actual observations for m1 and m21,2
 
For the analysis with MTDFREML, an additional fixed factor is used for each trait (fields m1 and m2). In this example, levels for gn (generation), sx (sex), and n{ell} (number in litter) are the same for both traits and are present even for the eighth animal, which is missing both traits. A suitable strategy for handling the case when some of the levels for those 3 factors are missing will be described later. Note that the pseudo-values –99.00 will now be treated as actual observations in the analysis with MTDFREML.

For trait 1 alone with the full model including maternal effects, –2logL|y is 744.53 for the analysis with missing observations treated as missing and also is 744.53 for the analysis with –99.00 treated as observed measurements for trait 1 but with the fixed factor m1 added to the model. Estimates of variance components and genetic parameters as well as the average information matrix and asymptotic SE of the genetic parameters are also the same for both analyses. Similarly, estimates of estimable functions of the fixed effects are the same although the analysis with –99.00 as pseudo measurements also has estimates for levels 1,..., 6 of m1.

Choice of a Pseudo-Value for a Missing Observation

When 0.00 is substituted for –99.00 to denote a missing measurement, the –2logL and estimates of (co)variance components and genetic parameters with SE are the same as when –99.00 is used. Although estimates of levels for some fixed factors can be different, estimates of estimable functions will be the same. Estimates for levels of m1 are greater by 99.00 when 0.00 rather than –99.00 is substituted for missing values. That difference forces the residual effects for missing observations to be zero in both cases. The solution for level 1 of m1 was constrained to zero for both analyses (when –99.00 and 0.00 were substituted for missing values). As had to happen, similar analyses of trait 2 resulted in the same results.

Different pseudo-values could be assigned to different missing observations because the key step is to have only 1 missing observation within a level of m1 or m2. Assignment of the same pseudo-value for each missing observation is usually the easiest way to modify the data file.

Assignment of Levels of Other Factors for Missing Observations

Often a missing observation will not have levels associated with fixed factors as will actual observations. The question then is what levels of these fixed factors should be assigned when pseudo-observations are analyzed within unique levels of factor m1 (and/or m2). Three options for an analysis of a single trait were tried: 1) a unique level different from those associated with actual observations was used (in the example, level 10 was used for generation, sex, and number in the litter when actual levels were 1 to 3, 1 to 2, and 1 to 7); 2) level 1 was assigned to gn, sx, and nl when a pseudo-observation was analyzed as a real observation; and 3) the last level of gn (3), sx (2), and nl (7) was assigned. As expected, results were the same as those described before for –2logL|y, estimates of variance components and genetic parameters, and estimable functions of estimates of fixed effects.

From a practical standpoint, assigning a different unique level from those for actual observations for each fixed factor would seem preferable. The output file from MTDFPREP, MTDF66, will then provide means for each level of each factor for checking. This combines both the pseudo-observations and actual observations (see Table 2Go). The means for levels with actual observations would be actual means and means for the different level would be the missing value (pseudo-values of 0.00 or –99.00, etc.). The overall mean and unadjusted SD in MTDF66, however, would be calculated from a mixture of actual and pseudo-observations. Note that the mean for level 1 of m1 (missing 1) is the unadjusted mean for actual measurements.


View this table:
[in this window]
[in a new window]

 
Table 2. A condensed copy of the summary file (MTDF66) from running MTDFPREP for trait 1, with missing values (0.00) as actual observations using unique levels for each missing value for factor (m1) and one unique level (10) for each other factor (generation, sex, litter size) for missing values
 
The proceeding analyses were exploratory as SE have been available for single-trait analyses even without the trick of assigning pseudo-values for missing observations nested within unique levels of another fixed factor. The real strength is in the enhanced capability to estimate SE of genetic correlations with partially missing data.

Two-Trait Analyses

Two-trait analyses were also done with the preceding ways of handling missing observations. As before, –2logL, estimates of variance components and genetic parameters, and estimates of estimable functions of fixed effects were the same when missing observations were excluded or included as pseudo-observations assigned unique levels of fixed factor m1 for trait 1 and fixed factor m2 for trait 2 for the case when levels of the other fixed factors were already available.

When levels of the other fixed factors are not available for missing observations, then modifications of the fields for fixed factors are needed. The options could be as described earlier, taking care not to modify the levels for a field shared by a trait with real measures when modifying levels of the same field for a pseudo measurement of another trait. An easy solution is to have separate fields for each trait. In contrast to the usual analysis that would include fixed effects for gn, sx, and nl for trait 1 and trait 2, the augmented analysis has a model equation for trait 1 that has the fixed effects gn1, sx1, and nl1, whereas the model equation for trait 2 has gn2, sx2, and nl2. Table 3Go shows how to accomplish this in the example. Two extra sets of 3 fields were added in addition to m1 and m2. The first line of Table 3Go for missing trait 2 shows that the 3 extra fields for trait 1 are the same as the original 3 fields, but the 3 extra fields for trait 2 now contain the level 10. For data line 3, the 3 extra fields for trait 1 contain level 10, and the 3 extra fields for trait 2 are the original levels. For data line 8, both sets of 3 extra fields contain level 10. (Assigning first or last levels for each factor would also work). In all 3 cases, the –2logL|y, estimates of (co)variance components, genetic parameters, and estimable functions of fixed effects were the same as with excluding the missing observations from the analysis. With each of these 3 assignments of fixed levels along with dummy fixed factors m1 and m2, estimates of SE of estimates of genetic parameters were obtained as well as the same –2logL, estimates of genetic parameters, and estimates of estimable functions of levels of fixed factors.


View this table:
[in this window]
[in a new window]

 
Table 3. Data file for the first 20 of 284 mice for a 2-trait analysis modified to create levels for dummy fixed factors (m1 for trait 1 and m2 for trait 2) and also modified to create separate sets of fixed factors for traits 1 and 2, so that the actual levels are used for actual observations but with a unique level used for a missing value (0.00) analyzed as an actual observation1
 
A rigorous test of the method was to develop a file (Table 4Go) with trait 1 measured only for sex 1 and trait 2 only for sex 2. Thus, trait 2 would always be missing for animals of sex 1 and trait 1 would always be missing for animals of sex 2. As before, each missing observation (0.00) for trait 1 was assigned a unique level for factor m1, and each missing observation for trait 2 was assigned a unique level for factor m2. In this case, the number of unique levels for m1 is number of the animals of sex 2 plus 1, and the number of unique levels for m2 is the number of animals of sex 1 plus 1. For this data file, the number of levels for m1 was 1 + 134 = 135 and for m2 was 1 + 150 = 151. With large data files, many extra equations will result: basically one extra equation for each animal with an observation on one trait but no observation for the other trait for a 2-trait analysis. In this case the original levels for gn, sx, and n{ell} could be used, but with other genotype x environmental interaction models that would not always be true.


View this table:
[in this window]
[in a new window]

 
Table 4. Data file for first 20 of 284 mice for a 2-trait analysis with trait 1 measured only on sex 1 and trait 2 on sex 2 modified to create unique levels for factors m1 and m2 for missing observations (0.00) for trait 1 (m1) or trait 2 (m2)1
 
Analysis of the 2 sex-limited traits with missing observations excluded gave the same estimates of fixed and random effects, (co)variance components, heritabilities, and genetic correlations as for the analysis with missing observations for 1 trait or the other trait treated as actual observations (0.00) with corresponding unique levels of factors, m1 and m2. With factors m1 and m2 and pseudo-values as "real" observations, SE of the estimates of the genetic parameters were obtained for the 2-trait analysis even when no animals had actual measurements for both traits.

A similar multiple-trait model is often used to estimate covariance components due to genotype x environmental interaction based on sires with progeny in more than one environment. Environment is often region or country. Often thousands of animals provide observations. For a 2-trait G x E analysis, each animal would have a record only in its own environment, but to obtain a SE for the estimate of the genetic correlation between environments, each animal would also have a pseudo-value for the other environment. The numbers of unique levels for the dummy factors (m1 and m2) would effectively be the total number of animals with observations in the 2 environments, which could be many thousands, whereas the number of genetic effects would usually be twice the total number of animals, with a much smaller number with a sire model. Most of the mixed model equations would be associated with levels of the dummy factors for pseudo-observations. The 2-trait file would usually be created by joining files from the 2 regions with separate sets of fields for fixed factors and with 2 new fields for the pseudo missing observations for environment 1 or 2 very much the same as for Table 4Go. The ASREML program (Gilmour et al., 2002Go) appears to use a similar procedure but without external changes in the data file.

A method has been developed to obtain SE of genetic parameters with the MTDFREML program for multiple-trait analyses when some observations are missing for some traits. The method involves relatively small changes in the data file. Estimates of genetic parameters, estimable functions of fixed effects, and L|y are the same as for the original data file.

The steps in changing the data file and analysis of the augmented data file may vary, but the following are suggested:

1) Do the analysis with missing observations treated as missing because there will be fewer equations and nonzero coefficients in the mixed model equations. Thus, the analysis will run faster and result in best possible starting values to run with missing observations treated as actual values. At convergence, copy updated starting answer file, MTDF4, to MTDF4.1.
2a) When levels of other factors are known, the data file is modified by using the missing value indicator as a real measurement and including a new fixed factor for each trait which would have a level of 1 for an actual measurement and a unique level (easiest as, 2, 3, ..., etc.) for each missing measurement.
2b) When levels of other factors are not known, the data file can also be modified by creating new fields for other fixed factors for each trait and using the original levels when the trait is actually measured and, for data checking, using a unique level different from any of the original levels when the missing value is used as the measurement. Other options will also work.
3) When the data file with extra fixed factors is created, change levels for other factors for missing observations to a level unique for that factor. Means from MTDF66 will be correct for other levels, and the unique level will have as a mean the original "missing value indicator".
4) Next run MTDFPREP with the modified data file with a different "missing value indicator". The MTDF66 file from MTDFPREP must be interpreted carefully as the original missing value indicator will be used to calculate overall mean, unadjusted SD, and high and low observations. These items of data description can be obtained from MTDF66 from running the analysis with missing observations treated as missing (suggestion 1).
5) Then restart MTDFRUN with command MTDFRUN<MTDF4.1. MTDFRUN should need only 1 round (i.e., modify MTDF4.1 to do 1 round). Disregard "not converged" message. To be sure, compare –2logL from 1) with –2logL from 5).
6) For the extra factor, do not always ask for a summary because some analyses may have 1,000s of levels (e.g., G x E analyses). For some analyses, the user may need to change sizes of vectors included in the PARAM.DAT file and recompile MTDFPREP and MTDFRUN.


    APPENDIX
 Top
 Abstract
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 APPENDIX
 LITERATURE CITED
 
The REML estimators differ from maximum likelihood estimators because the maximum likelihood estimators are based on the distribution of the data vector, whereas REML estimators are based on the distribution of a lower dimensional linear function of the data vector or residual vector. If the same residual vector can be used for both the observed data and the augmented data, then the REML estimators along with their sampling properties will also be the same as for the maximum likelihood estimators. To show that the same residual vector can be used for both the observed and augmented data, we will first define a residual vector for the observed data and then show that this residual vector can be used for the augmented data.

Let yo be the no x 1 vector of observed data, where


Formula

The REML estimators of the (co)variance components of G and Ro are the maximum likelihood estimators of those (co)variance components based on the distribution of the residual vector Ko ' yo. The (no – po) x no matrix Ko ' is selected such that 1) po is the rank of Xo, 2) Ko ' Xo = 0, and 3) the rank of Ko ' is no – po. These requirements define REML (e.g., Harville, 1977Go; Searle et al., 1992Go). The distribution of the residual vector is


Formula

Because the selection of Ko ' is based on the column space of Xo, augmenting the original design matrix Xo with additional columns, Wo, will not impact the selection of Ko ' provided that the columns of Wo are a linear function of the columns of Xo. So without loss of generality, the vector of missing value effects, mo, and design matrix, Wo, can be added to the observed model equations for the data,


Formula

The same residual vector, Ko ' yo, can be used provided that the columns of Wo are a linear function of the columns of Xo.

Next let yu be the nu x 1 vector of unobserved data, where


Formula

with Wu is a matrix with rank nu and with u ~ N(0,G) and


Formula 1(1)

The n x 1 augmented data vector, y, is then formed by concatenating the observed and unobserved data vectors. The design matrix for the fixed effects for the augmented data is


Formula 1

The residual vector for the observed data, Ko ' yo, can be written in terms of the augmented data as


Formula 1

It remains to be shown that K'y is a (n – p) x 1 residual vector for the augmented data that satisfies: 1) p = rank of X, the design matrix for the augmented data, 2) K'X = 0, and 3) rank of K' is n – p. Because none of the last nu rows of X are a linear function of the first no rows, the rank of X is equal to the rank of the first no rows, po, plus the rank of the last nu rows, nu. Therefore, the rank of X is p = nu + po and K'y is a (n – p) x 1 residual vector because n – p = (no + nu) – (nu + po) = no – po. With K'X = Ko ' Xo + Ko ' Wo and Ko ' selected such that Ko ' Xo + Ko ' Wo = 0, K'X must also be equal to 0. Because K' = (Ko ' 0), the rank of K' is equal to the rank of Ko ' . With Ko ' selected to have rank no – po, which is equal to n – p, the rank of K' is also equal to n – p. Therefore, K'y satisfies the 3 conditions for a REML residual vector.


    Footnotes
 
1 Current address: Dep. Statistics, 340 Hardin Hall, University of Nebraska, Lincoln 68583–0963. Back

2 Corresponding author: lvanvleck{at}unlnotes.unl.edu

Received for publication April 5, 2007. Accepted for publication July 2, 2007.


    LITERATURE CITED
 Top
 Abstract
 INTRODUCTION
 MATERIALS AND METHODS
 RESULTS
 APPENDIX
 LITERATURE CITED
 


Boldman, K. G., L. A. Kriese, L. D. Van Vleck, C. P. Van Tassell, and S. D. Kachman. 1995. A manual for USE of MTDFREML. A set of programs to obtain estimates of variances and covariances [DRAFT]. USDA-ARS, Washington, DC.

Dodenhoff, J., L. D. Van Vleck, S. D. Kachman, and R. M. Koch. 1998. Parameter estimates for direct, maternal and grandmaternal genetic effects for birth weight and weaning weight in Hereford cattle. J. Anim. Sci. 76:2521–2527.[Abstract/Free Full Text]

Gilmour, A. R., B. J. Gogel, B. R. Cullis, J. J. Welham, and R. Thompson. 2002. ASReml User Guide. VSN International Ltd., Hemel Hempstead, UK.

Harville, D. A. 1977. Maximum-likelihood approaches to variance component estimation and to related problems. J. Am. Stat. Assoc. 72:320–340.[CrossRef]

Johnson, D. L., and R. Thompson. 1995. Restricted maximum likelihood estimation of variance components for univariate animal models using sparse matrix techniques and average information. J. Dairy Sci. 78:449–456.[Abstract]

Searle, S. R., G. Casella, and C. E. McCulloch. 1992. Variance Components. John Wiley and Sons Inc., New York, NY.



This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow All Versions of this Article:
jas.2007-0202v1
85/10/2375    most recent
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Kachman, S. D.
Right arrow Articles by Van Vleck, L. D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kachman, S. D.
Right arrow Articles by Van Vleck, L. D.


HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS