|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ANIMAL GENETICS |




* Department of Animal Sciences, University of Wisconsin, Madison 53706;
and
Department of Dairy Science, University of Wisconsin, Madison 53706; and
Aviagen Ltd., Newbridge, Midlothian, EH28 8SZ, United Kingdom
| Abstract |
|---|
|
|
|---|
Key Words: chicken genetic association genotype by environment interaction machine learning mortality single nucleotide polymorphism
| INTRODUCTION |
|---|
|
|
|---|
In the context of SNP subset selection, G x E can be viewed in different ways. One perspective is that the best SNP subsets selected in different environments do not overlap fully; also, the level of allelic associations between the 2 sets of SNP is low, as measured by some linkage disequilibrium (LD) parameters. Another perspective is that a subset of well-chosen SNP selected in one environment may not predict equally well the same trait in a different environment. That is, across-environment predictions would be poorer. This would imply that, in addition to genetic background (SNP), environmental factors need to be taken into account in the development of genomic-assisted selection models.
The objective of this study was to examine G x E using machine-learning techniques in a SNP-mortality association case study. The filter-wrapper feature selection of Long et al. (2007)
was applied to 4 age-hygiene environment specific strata, to test whether environment affects SNP subset selection for early and late mortality traits. Genotype x environment interactions would be present if the SNP subsets selected from 2 hygiene environments are not consistent, in terms of across-environment predictive ability and extent of LD between subsets.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Data Set
Genotypic and phenotypic data came from the Genomics Initiative Project at Aviagen Ltd. (Newbridge, UK). Phenotypes consisted of early (0 to 14 d) and late (14 to 42 d) living status (dead or alive) of 333,483 birds. These chicks were raised in either a high (H, n = 251,539) or low (L, n = 81,944) hygiene condition. The high and low hygiene environments were representative of selection nucleus and commercial level conditions, respectively. Information included sire, dam, age of dam, hatch, and sex of each bird. There were 1,552 sires and 9,921 dams; age of dam was categorized into 18 levels (30 to 64 wk), and there were 232 hatch groups.
There were 5,523 SNP genotyped on 253 of the 1,552 sires. Each SNP was biallelic (e.g., A or G alleles), and genotypes were coded as 0 (AA), 1 (AG), or 2 (GG). Detailed description of these SNP is in Long et al. (2007)
.
Sex-specific mortality was not studied, because sex was assigned post 35 d; therefore, sex-specific early mortality (0 to 14 d) could not be evaluated. Also, average mortality post 35 d was low for both sexes (data not shown), hence precluding assessment of sex-specific late mortality (14 to 42 d).
Calculation of Adjusted Sire Means and Formation of 4 Strata
To create a response variable for the sires genotyped, first, effects of factors (age of dam and hatch) potentially affecting individual bird survival were removed via a GLM model (GLMM) without sire and hygiene effects. For each bird, the residual derived from the GLMM was calculated. Then, birds were classified into 2 groups, corresponding to the hygiene environments (L and H) in which they had been raised. In each hygiene group, residuals of progeny of a sire were averaged, producing an adjusted progeny mortality mean as the response variable for each sire.
The individual record on each bird was binary (dead or alive), and the GLMM fitted was
![]()
where pijk = the death probability of bird k, progeny of a dam of age I, and born in hatch j. Here, DAi stands for the fixed effect of the ith level of age of dam (i = 1, 2,..., 18); Hj denotes the random effect of hatch j (j = 1, 2,..., 232), which was assumed normal, independent, and identically distributed as Hj
NIID(0,
2H), where
2H = the variance between hatches. Let yijk be the true binary status of bird ijk (0 = alive, 1 = dead) and
ijk be the fitted death probability using model [1]; then, the residual for a given bird is rijk = yijk –
ijk, with a sampling space of [–1, 1]. The GLMM was implemented in SAS PROC GLIMMIX (SAS Institute Inc., Cary, NC).
Model [1] was fitted to both early and late mortality data. Subsequently, birds were divided into the 2 hygiene groups, and progeny residuals were averaged for each sire, as described above. Thus, 4 strata of age-hygiene combinations were formed, with each stratum containing the adjusted progeny mortality means calculated for: 1) birds of early age raised in low hygiene (EL), 2) birds of early age raised in high hygiene (EH), 3) birds of late age raised in low hygiene (LL), and 4) birds of late age raised in high hygiene (LH). After removing SNP with missing values (i.e., those that were not genotyped on some sires), number of sires and SNP used were as follows: EL and LL—222 sires, 5,119 SNP and EH and LH—232 sires, 5,166 SNP; 201 sires had progeny in both the L and H environments.
Discretization of Phenotypes and Filter-Wrapper Feature Selection
Methods were essentially those employed in Long et al. (2007)
, designed to circumvent the difficulty encountered in standard regression models when fitting effects of a much greater number of SNP than the number of observations available. Filter-wrapper feature selection operates on sire group means and results in an optimal SNP subset. Because the procedure is tailored for discrete traits, the continuous adjusted mortality rates were discretized into 2 classes, namely, decreased and increased mortality. By means of a pair of cut-off values, c1 and c2, sire groups with intermediate means were discarded, to make the remaining sire samples (sires with mortality rates
c1 or
c2) more informative. Here, the cut-off pair (c1, c2) was such that c1 is the
quantile of the distribution of adjusted mortality rates over all sires and c2 is the 1-
quantile. Four pairs of (c1, c2) corresponding to 4
values (0.05, 0.20, 0.35, and 0.50) were adopted to generate 4 groups of case-control sire samples (group 1 to group 4) for use in the filter-wrapper procedure.
An information-gain (Mitchell, 1997
)-based filter was used to decrease the feature (SNP) space from thousands of SNP to a small number (e.g., the top 50 SNP), thus eliminating irrelevant features from further consideration. Subsequently, the wrapper optimized the performance of the 50 SNP selected in the filter step, by adopting a naïve Bayes (NB) classifier (Elkan, 1997
), acting as a black box for scoring each of the searched subsets (relative to the full set of 50 SNP), with the score being classification accuracy. The filter procedure was coded in Java, and the wrapper procedure was carried out on the Weka platform (Witten and Frank, 2005
).
For each of the 4 case-control groups, the filter-wrapper procedure selected an optimal subset of SNP (i.e., the subset with greatest NB classification accuracy). Because there were 4
groups per stratum, 4 subsets of SNP were produced at each age-hygiene condition combination. These were validated as described below.
Validation of the 4 Subsets of SNP in Each Stratum
Each of the 4 SNP subsets in a stratum has an associated NB classification accuracy. These accuracies are not comparable with each other, because each subset is specific to a certain
value and, therefore, to a different number of sire families classified into either low or high mortality groups. To compare the 4 subsets within each stratum fairly, and to arrive at an optimal subset, cross-validation was used. A fixed-effects linear model was fitted in which predictors were SNP from the subset under consideration, and response variables were the adjusted mortality rates of all sire families studied in the stratum in question. The model was

where Mi = the adjusted mortality mean of sire i (after standardization, to achieve a zero mean and unit variance); SNPij = the fixed effect of the jth genotype of SNP (chosen from the fiter-wrapper process) in sire i; and ng = the number of SNP in the subset under consideration; this model has 2 ng df. The analysis was by weighted least squares, in which the weight assigned to sire mean i was Ti/T (Ti = the number of progeny in sire family i; T = the total number of progeny over all sire families). In addition, Q-Q plots of the fitted residuals (derived from model [2] assuming normal errors) versus either normal or t-distributions with 4, 6, and 8 df (t-4, t-6, and t-8, respectively) were examined, to determine a proper error distribution for this model.
Predicted residual sum of squares (PRESS) was used as a criterion for comparing the predictive ability of the SNP subsets. Predicted residual sum of squares, as described in Ruppert et al. (2003)
, is a cross-validation-type statistic and predicts Mi using all sire means except the ith (i = 1, 2,...,N). Letting
(i) be the predicted value of the ith sire mean when the latter is not used to estimate parameters in model [2], PRESS is given by

A subset of SNP was considered best if it produced the smallest PRESS when employing them as predictors. A SAS macro was written to generate PRESS statistics for model [2], which was embedded in SAS PROC GLIMMIX (SAS Institute Inc.).
Across-Environment Prediction
Across-environment prediction assesses whether or not the predictive ability of SNP selected under one hygiene environment stands in a different hygiene environment. Variation in predictive ability of a certain SNP subset under different hygiene conditions is indicative of sensitivity of SNP to environments (i.e., G x E).
This step utilized the 4 SNP subsets derived above, in each stratum. It was performed for early and late mortality separately and in the same way. For example, there were 2 strata for early mortality: namely, EL and EH. Four subsets of SNP in low hygiene (EL) were then used to predict sire mortality means in high hygiene (EH), using model [2]. Similarly, 4 subsets in H were also used to predict sire mortality means in L. The PRESS statistics were computed to compare with the PRESS values obtained in the validation step, in which sire progeny mortality means were predicted using SNP selected from the hygiene environment in which the progeny of the sire had been raised.
Graphical Display of LD Between SNP
The magnitude of LD between 2 SNP loci is a measure of association between them, although existence of LD does not imply physical linkage (Ott, 1999
). A group of SNP could be a proxy for another if there is strong LD between the 2 groups, regardless of differences in physical locations. Pairwise LD between the 2 best SNP subsets (each selected in a different environment, L or H) was summarized graphically. The LD measure adopted was r2, the squared correlation coefficient between 2 SNP loci (Thomas, 2004
). Graphs were made with R (http://www.r-project.org/) package LDheatmap (Shin et al., 2006
).
| RESULTS |
|---|
|
|
|---|
The distribution of stratum-specific (EL, EH, LL, and LH) adjusted sire mortality means is shown in Figure 1
. The fitted densities were skewed, suggesting that adjusted mortality rates were not well described by a normal distribution.
|
The 4 case-control sire samples per stratum (from the 4 distinct
values) are shown in Table 1
(left part). Sample size (number of sires) included in the filter-wrapper procedure ranged from 24 to 232 and increased with
, as the phenotypic difference between the case and control sires became smaller.
|
. The maximum value for information gain is 1; hence, information gains of individual SNP were relatively low, except when
= 0.05 (maximum difference between 2 groups of sires among the 4
values).
|
Table 1
(right part) gives the sizes of the SNP subsets selected accordingly. In the most extreme case-control samples (
= 0.05), the number of SNP ranged from 4 to 9. On the other hand, the number of SNP ranged from 33 to 46 when
= 0.50, much greater than in the case of
= 0.05. The difference in the size of SNP subsets over different
sample groups is due to the fact that redundancy between the top 50 SNP was much greater in small
-groups than in large
-groups (Long et al., 2007
).
Figure 3
shows the cross-validation classification accuracy of NB of SNP before (full set) and after feature selection (subset). Subset selection increased classification accuracy, except in group 1 (
= 0.05), in which sire families in the case-control groups were more extreme and therefore more informative; it was hard to improve classification accuracy in that case. Classification accuracies of SNP subsets were high (>90% in most cases) and decreased with increasing
.
|
According to the Q-Q plots, a t-8 distribution fitted the standardized adjusted mortality rates best (data not shown). Hence, model [2] was fitted with a t-8 error distribution to generate PRESS statistics for the 4 SNP subsets (corresponding to
= 0.05, 0.20, 0.35, and 0.50, respectively) in each age-hygiene condition stratum as shown in Table 2
. The smallest value in each stratum corresponds to the best SNP subset selected based on a case-control sire sample specific to an
value. Based on these results, the best subsets were as follows: a) early mortality: H05 (low hygiene,
= 0.35) containing 34 SNP and H05 (high hygiene,
= 0.35) containing 35 SNP; b) late mortality: L35 (low hygiene,
= 0.35) containing 29 SNP and H05 (high hygiene,
= 0.05) containing 4 SNP.
|
The PRESS statistics for across-environment predictive ability of hygiene condition-specific SNP are in Table 3
. For comparison, PRESS values from the validation step discussed in the previous section are presented as well.
|
xy) of the 2 ratios being 
. Similarly, for late mortality,
xy = 0.77. These figures suggest that G x E decreased the efficiency of SNP-assisted prediction by 20 to 30%.
Optimal SNP subsets were not consistent over the 2 hygiene environments. For example, for early mortality, the SNP subset selected from the
= 0.20 sample group was best (smallest PRESS) among all low hygiene subsets for prediction of mortality in the high hygiene condition. However, for prediction of mortality in the low hygiene condition, the SNP subset selected from the
= 0.35 sample group was best among all low hygiene subsets.
LD Between SNP Selected from Low and High Hygiene Environments
The 2 best SNP subsets (L35 and H35 for early mortality, L35 and H05 for late mortality) spread along the genome distinctly, with no overlap in terms of physical locations (data not shown). Figure 4
shows a map of pairwise LD among SNP from the 2 subsets. Each square represents a pair of SNP, and darkness indicates magnitude of LD. The LD measure, r2, ranges from 0 (no LD) to 1 (complete LD). The 2 triangular regions represent within-subset LD, whereas the rectangular region represents between-subset LD. For example, in the top graph (early mortality), the lower triangle corresponds to the L35 subset consisting of 34 SNP; the upper triangle corresponds to the H35 subset consisting of 35 SNP. Each square in the rectangular region denotes a pair of SNP in which one comes from the L35 subset and the other one from the H35 subset.
|
| DISCUSSION |
|---|
|
|
|---|
In feature selection, it was not surprising that information gain decreased with
in each age-hygiene condition stratum: an increase of
shortens the distance between the 2 cut-off values c1 and c2. The low information gain over all groups implies that highly discriminative or major SNP affecting mortality are unlikely to exist. The usefulness of subset selection in increasing classification accuracy was corroborated.
Conventional approaches for detecting G x E rely on the assumption that genotypes rank differently among environments or exhibit differential environmental variability. For example, multiplicative (joint effect of genetic and environmental factors is the product of each) or additive models (joint effect of genetic and environmental factors is the sum of each) provide specifications under the hypothesis of no G x E (Hunter, 2005
). However, with more than 5,000 SNP as in this study, it is not feasible to examine all possible interactions between the 5,000 SNP and 2 hygiene environments. Therefore, it was reasonable to assess G x E by carrying out the SNP subset selection separately in each hygiene condition (L and H), to then evaluate the extent to which SNP subsets selected in different hygiene environments are distinct. Here, G x E was examined via across-environment prediction and by examination of the extent of SNP loci LD between L and H subsets.
The PRESS statistics measured the predictive ability of each SNP subset and ranked the 4 subset candidates (corresponding to
= 0.05, 0.20, 0.35, and 0.50) within the same hygiene environment. Likewise, prediction of mortality in 1 hygiene environment using the 4 SNP subsets selected in another hygiene environment was used to gauge G x E. It was found that, given a set of candidate SNP subsets, the best one for predicting phenotypes in one environment was not the best one in another environment. A statistic for measuring cross-prediction consistency over environments (
xy) had mean values of 0.72 (for early mortality) and 0.77 (for late mortality), suggesting that SNP x environment interaction lowered predictive ability by 20 to 30%, at least when measured in terms of PRESS. This is analogous to the fact that ranking of genotypes might change from one environment to another, due to G x E (Mulder and Bijma, 2005
). The implication is that there may not be a universally best SNP subset for predicting mortality, illustrating the interplay between genetic and environmental factors.
The idea that SNP selected under the 2 environments are different was reinforced by the extremely low level or absence of LD between the 2 best SNP subsets (L and H subsets) for early and late mortality rates. If LD was strong and frequent between 2 subsets of SNP, one would expect that a set of SNP can work as effectively as the other set in predicting a response variable. In the extreme case, in which each SNP in one subset is found to be in LD with a SNP in the other subset, these 2 subsets can be treated as tag for each other.
Due to incomplete sex information on birds, sex-specific SNP could not be investigated. When calculating residuals, there may be a need to remove other nuisance effects (e.g., contemporary groups) or half of the EBV of a dam of a bird, or its environmental effect. This was not done for 2 reasons: first, calculation of breeding value introduces the assumption of additive inheritance, which may not hold for a trait such as mortality. Second, removing dam effects under the assumption that these are entirely environmental could also remove genetic effects, and this was deemed undesirable. Naturally, if there are strong reasons to believe that a trait is strictly additive, one could introduce the additive genetic and environmental effects of the dam in the GLMM model, provided it is feasible to do so. Lastly, when individuals are tested in many different environments, the machine-learning-based feature selection method may not be able to capture G x E suitably. Instead, other statistical approaches could be used (e.g., the mixed model of Gogel et al., 1995
), in which environments are taken as random effects. When using dense marker data, semiparametric procedures such as mixed models with kernel regressions (Gianola et al., 2006
; Gianola and van Kaam, 2008
) could be tailored to capture cryptic forms of interactions that are difficult to model.
| Footnotes |
|---|
2 Corresponding author: nlong{at}wisc.edu
Received for publication March 9, 2008. Accepted for publication August 7, 2008.
| LITERATURE CITED |
|---|
|
|
|---|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |