|
|
||||||||
ANIMAL GENETICS |
The Cooperative Research Centre for Cattle and Beef Quality, CSIRO Livestock Industries, Queensland Bioscience Precinct, St. Lucia, Queensland 4067, Australia
| Abstract |
|---|
|
|
|---|
Key Words: Beef Complementary DNA Gene Expression Mixed Models
| Introduction |
|---|
|
|
|---|
Multivariate mixed-model equations (MME) are a natural framework to tackle these problems because they allow for a general covariance structure. We propose a model that incorporates the simultaneous information from seemingly independent experiments by allowing a nonzero correlation among gene expressions across experiments, while imposing a null residual covariance. In this article, we highlight the complexity of the subject by presenting the heterogeneous designs of three experiments in beef cattle; second, we introduce the selected seven-variate model to perform the joint analysis of the previously mentioned experiments; third, we present how DE genes are identified using model-based clustering, and where emphasis is given to details of gene ontology classes and genes that are jointly expressed across experiments; and finally, the results are presented and discussed in a biological context.
| Materials and Methods |
|---|
|
|
|---|
To date, we have used this array in three experiments (Exp. 1, 2, and 3), each with different design structures and alternatives of biological and technical replication (dye-swaps). Powerful experimental design depends on a consideration of the statistical consequences of comparison structure (Yang and Speed, 2002
). Practical considerations included in the development of our designs included transitivity (Townsend, 2003
) and extendibility (Kerr, 2003
). In this context, transitivity is defined by the ability to compare any two samples of interest directly from the same hybridization, as well as indirectly through transitive association across two or more hybridizations. The concept of extendibility implies that additional samples can be added to a given design in a sensible way.
In Exp. 1, we used 14 slides to contrast the gene expression profile of muscle in 10 steers fed varying quality diets, from high to medium to low. In Exp. 2, we compared the expression profile in muscle tissue between six animals, three of each of two breeds of cattle (Japanese Black and Holstein) at three time points corresponding to 11, 15, and 20 mo of age. The design of Exp. 2 comprised all possible dye-swaps within time points and across breed (12 slides), plus six slides providing information across time, within breed. Within time and breed, the RNA of the three animals was pooled, and two samples of this pool were used in the within-timeacross-breed comparisons (dye-swaps). Finally, Exp. 3 contained 22 slides and studied the mechanisms underlying in vitro adipogenesis. Duplicate flasks of fibroblasts cell cultures were compared, with one (treated culture) induced to undergo adipogenic differentiation, while the other (untreated culture) received no treatment. The gene expression profiling of each culture was explored over six time points from 0 h to 14 d. Figure 1
illustrates the configuration for the experimental design developed for each experiment. Note that the design for Exp. 1 and 3 contained two distinct configurations: a reference and a within-time pairwise comparison with dye-swap replicates.
|
The evaluation of the peculiarities of the three experiments (Figure 1
) resulted in the consideration of seven variables: y11 with 197,802 signals from the eight arrays of the reference design component of Exp. 1; y12 with 74,030 signals from the six arrays of the dye-swap component of Exp. 1; y21 with 110,308 signals from the four arrays of the dye-swaps in the breed comparison on the first time point plus the four arrays connecting the remaining time points within breed in Exp. 2; y22 with 116,409 signals from the four arrays of the dye-swaps in the breed comparison on the second time point plus the four arrays connecting the remaining time points within breed in Exp. 2; y23 with 117,687 signals from the four arrays of the dye-swaps in the breed comparison on the third time point plus the four arrays connecting the remaining time points within breed in Exp. 2; y31 with 106,591 signals from the seven arrays of the reference component of Exp. 3; and y32 with 236,671 signals from the 15 arrays of the time x fat treatment component of Exp. 3. Thus, in this notation, first and second subscripts indicate experiment of origin and design component within experiment, respectively.
Summary statistics for each of the seven variables are given in Table 1
. Note that the maximal value for the observations was constant and equal to 15.99, indicating the presence of color saturation in all variables. Note also that the apparent heteroskedasticity in the gene expression intensity readings both between and within experiments anticipates the convenience of resorting to multivariate models for the analysis of these data. In particular, green intensities from y11 originate from individual RNA, whereas all intensities from y12 originate from pooled RNA. Smaller variances are expected when pooled RNA samples are used relative to individual RNA (Kendziorski et al., 2003
; Peng et al., 2003
).
|
![]() | [1] |
where yij is the (nij x1) vector of signals from the ith experiment and the jth design component within experiment; ni is the size of yij; Xij is an (nij xpij) incidence matrix relating the pij fixed comparison group (CG) effects in ßij with signals in yij; Gij is an (nij xg) matrix relating the random gene (or array elements) effects in gij with yij; Aij is an (nij xaij) matrix relating the random gene xarray-printing-block interaction effects in aij with yij; Vij is an (nij x vij) matrix relating the random gene x variety effects in vij with yij; and eij is the (nij x 1) vector of random errors.
Features of Model [1] include the following: 1) although both experiments use the same array (making the column rank of each Gij equal to g), each may contain a different number of signals, arrays, and varieties; 2) a single fixed effect is fitted, CG and defined as the group of intensity signals originating from the same array, printing block, and fluorescent dye channel. These effects lack biological importance and their fitting aims to normalize the data by accounting for systematic effects. Main effects and first- and higher-order interactions are accounted in this definition of CG; 3) the term aij models the spot effect, accounts for spot-to-spot variability, allows us to estimate the treatment effects, and obviates the need to form ratios (Wolfinger et al., 2001
); 4) unlike the model of Chilingaryan et al. (2002)
, our approach accommodates variation between studies, and the source of multidimensionality is the various studies and components within studies, as opposed to the individual genes.
We make standard stochastic assumptions about Model [1]. Further details about this model are given in Table 2
. A total of 7,638 genes (or array elements) was included. However, not all of them were represented in each yij. The number of genes represented in yij averaged 5,856 and ranged from 3,580 for y12, to 6,749 for y11. The percentage of genes providing signals in only one, only two, and so on until all the seven variables were accounted for, was 7.0, 6.2, 6.0, 9.9, 13.0, 22.5, and 35.4%, respectively. Genes jointly evaluated across yij provided the basis for the estimation of 56 covari-ances. Covariance matrices are
g,
a,
v, and
e, for the g, a, v, and e, respectively. Their components are as follows:
|
![]() |
![]() |
![]() |
![]() |
where diag
indicates a diagonal matrix. Note there is an implied identity matrix of appropriate dimension multiplying each (co)variance component. Note also that higher accuracies in the resulting BLUP will be obtained as the number and size of off-diagonal elements (covariances) increases.
Because no hybridization was performed across experiments, no off-diagonal elements are allowed in
e. On the other extreme,
g is fully dense due to gene covariances between experiments and between components within experiments. In Exp. 2, the six slides connecting samples across time allow for the estimation of the off-diagonal elements in
a. Off-diagonal elements in
v are covariances between varieties within experiments. These are nonzero for Exp. 1 and 2, but zero for Exp. 3, as v31 contains gene x age interaction and v32 contains the interaction of gene x age x fat treatment. Finally, the MME for the seven-variate Model [1] contained 752,476 equations arranged in a 28 x 28 block-matrix structure, and 6,587,868 nonzero elements. We obtained restricted maximum likelihood estimates of covariance components and BLUP using the VCE software (Groeneveld and García-Cortés, 1998
).
Measures of Differential Expression
For each experiment, the following linear combination of the BLUP in
1j were calculated for each gene in k (k = 1 to 7,638):
![]() | [2] |
For Exp. 1, large positive d1k values are likely to belong to genes whose expression is upregulated in animals being fed the high-quality diet. Accordingly, large negative d1k values are likely to belong to upregulated genes in animals under nutritional restriction. Similar reasoning applies for d2k when contrasting Holstein vs. Japanese Black in Exp. 2, and for d3k when comparing fat treatment vs. control across the 6 time points (t = 0 to 5) in Exp. 3. Note that although the statistics in d1k, d2k, and d3k have the standard form of the hypothesis to be tested in ANOVA approaches to microarray data analyses (Cui and Churchill, 2003
), we did not conduct hypothesis testing because there is no evidence to support the assumptions of normality and homogeneity of variance, and there exists the problem of multiple comparisons. Also, this hypotheses testing would clash with the assumption of vij being random because the probability of a random variable being equal to any given value is zero.
Model-based clustering via a mixture of distributions has been proposed as a method of choice to identify DE genes (Yeung et al., 2001
; McLachlan et al., 2002
; Medvedovic et al., 2004
). Similar to the approach of Moser et al. (2004)
, our goal here was to apply model-based cluster analysis to the values in d1k, d2k, and d3k, and to determine which genes have relative levels far away from the majority. Each data point in d1k, d2k, and d3k are assumed to be independent observations from a mixture density with c (possible unknown but finite) components and with probability density function
![]() |
where
denotes the trivariate normal density function with mean µm and covariance matrix Vm, and the mixing proportions
i are constrained to be nonnegative and sum to unity. All unknown parameters are represented in
c for a c-component (or c-cluster) mixture model. In the current study, mixture models with up to five components (or clusters) were contemplated. Parameters of the mixture were estimated using the EMMIX software (McLachlan et al., 1999
). This program has various criteria for model selection, including the Akaike information criterion (Akaike, 1969
), the Bayesian information criterion (Schwartz, 1978
), the likelihood ratio test (Stram and Lee, 1994
), and the bootstrap approximation of the distribution of the likelihood ratio test (McLachlan, 1987
). Once the model of choice has been identified, the probability of each data point belonging to each cluster is given by its posterior probability.
Finally, the biological interpretation of the results was facilitated using an iterative group analysis approach (Breitling et al., 2004
) to identify functional classes of genes that are DE. Functional classes were obtained from gene ontology assignments (Ashburner et al., 2000
; http://www.geneontology.org). There were 928 gene ontology classes containing an average of nine genes each. Following Breitling et al. (2004)
, values listed in d1k, d2k, and d3k were sorted once by upregulation and once by downregulation. Moving along each sorted list, and for each gene ontology class, every time a new gene member of the class is encountered, the probability of having that many members that high up in the list by chance alone is given by the hypergeometric probability in
![]() |
where n is the total number of genes (n = 7,638), x is the total number of class members in the gene ontology, and t is the rank of the zth class member (z being the current step in the iteration). The position in the list that yields the smallest P-value determines the probability of change by chance alone for the entire gene ontology class and all class members above that position are considered as DE.
| Results and Discussion |
|---|
|
|
|---|
![]() |
![]() |
![]() |
and
e = diag{0.14 0.16 0.26 0.22 0.19 0.12 0.13}
The main effect of gene (g) accounted for the vast majority of this variation for all variables except y31 for which both g and a (for gene x array-printing-block) accounted each for approximately 45% of the total variation. Also apparent was the large gene correlation existing within and between experiments (with the exception of Exp. 3). These estimates revealed the strong goodness of fit of the model. The amount of total variation accounted by the model (given by the diagonal elements of (
g +
a +
v)(
g +
a +
v +
e) 1) ranged from 94.5% for y21 to 97.1% for y32.
Of interest is the large correlation (>0.95 in all cases) observed in Exp. 2 for a between time points (i.e., between y21, y22, and y23). Hence, the variation in size and quality of the spots for the same gene was basically identical across hybridizations. Also, the 0.76 (from 0.10/
[0.09 x 0.19]) correlation for the gene x diet inter- action between y11 and y12 indicates that a similar set of DE genes is expected to be found in both components of Exp. 1. This same correlation was much smaller (averaging 0.37) between y21, y22, and y23 and was attributed to the age effect in Exp. 2.
Estimates in 
provide the amount of total variation that can be attributed to DE genes. They averaged 3.0% and ranged from 2.3 to 5.4%. These percentages provide an indication of the expected proportion of DE genes and serve to control the false discovery rate (FDR), a major concern in gene expression (Van den Oord and Sullivan, 2003
; Meuwissen and Goddard, 2004
) as well as in genome scan mapping studies (Fernando et al., 2004
). The reasoning behind using these percentages to control the FDR can be found in Efron (2004)
, in which FDR is defined as fdr(z)
f0(z)/f(z), where f(z) = p0f0(z) + p1f1(z) is the estimated density from the observed ensemble of z-values, f0(z) is the distribution of the z-values under the null hypothesis (i.e., nonsignificant or non-DE in our context), and f1(z) is the distribution under the alternative hypothesis (i.e., DE in our context).
Finally, and to assess whether this full model offers any improvement over the individual analysis of the different experiments, a reduced model was fitted by forcing to zero the 16 covariance components defining between-study gene expression variation and given by
gi,j for i < 3 and j > 2 (i.e., a total of 10) and
gi,j for 2 < i < 6 and j > 5 (i.e., a total of six). The likelihood ratio test (Stram and Lee, 1994
) concluded strongly in favor of the full model (P < 0.001). This result is not surprising given the magnitude of the estimates of covariance components from the full model.
Measures of Differential Expression, Model-Based Clustering, and Gene Ontology Analyses
Values in d1k, d2k, and d3k averaged 0.00 and ranged from 1.85 to 3.54 for d1, from 1.79 to 1.71 for d2, and from 1.67 to 2.20 for d3. The fitting of mixtures of distributions identified three components (or clusters) as follows:
![]() |
Figure 2
illustrates the empirical marginal densities for d1k, d2k, and d3k, as well as the posterior probability (decision functions) of each value belonging to each cluster. Note that there is no y scale for the densities. This scale corresponds to probability 1 for the decision functions and densities are drawn to proportionality. The first cluster encapsulates the majority of genes (67%) for which no DE was observed. The second cluster, with a much larger variance for d1, identified 450 DE genes from Exp. 1 (of which 409 were upregulated). This cluster also contained 68 upregulated genes from Exp. 2. The third cluster contained 319 downregulated genes from Exp. 2 plus 361 DE genes from Exp. 3 (with 252 being upregulated).
|
|
|
Contrary to what was expected, the least number of comodulated genes occurred between Exp. 2 and 3, and included genes involved in cell proliferation and alpha B crystallin. The latter has been implicated in the ubiquitin pathway, where it promotes FBX-4-dependent ubiquitination (den Engelsman et al., 2003
), as well as in osteoblastic differentiation (Furushima et al., 2002
). The finding that alpha B crystallin is DE in both Exp. 2 and 3 may point to changes in protein turnover processes that accompany cell differentiation processes in both systems. The DE genes from Exp. 1 and 2 revealed genes associated with extracellular matrix, detoxification, and energy metabolism. In this case, these genes are modulated between different breeds and during different diets. Stearoyl-CoA desaturase and adipocyte enhancer binding protein 1 are examples of two genes that were found to be DE in all three experiments. These genes are involved in fatty acid synthesis, suggesting that genes modulated during in vitro adipogenesis are intrinsic to changes occurring during nutritional manipulation and breed differences.
An additional use of this analysis is to identify housekeeping genes from those elements having a consistent expression across experiments. These are likely to be included in a set of genes with a very large posterior probability of being clustered in the second components with near zero mean and small variance. Alternatively, the application of algorithms to identify housekeeping genes (Vandesompele et al., 2002
; Wilson et al., 2003
) to d1, d2, and d3 becomes straightforward.
Finally, because of the large number of genes, the statistical technique presented here still poses challenges to the control of FDR. First, many genes that agree between two experiments could be due to chance and would not pass a hypergeometric test (Man et al., 2000
). Second, the linear combinations defined in Model [2] are only illustrative. Other combinations could also have a biological meaning (e.g., high- vs. medium-quality diet in Exp. 1, and time contrasts in Exp. 2 and 3). These new definitions could potentially identify a different set of DE genes. Applying the method of Reverter et al. (2004)
to the number and distribution of DE genes found in this study (i.e., 450, 387, and 361 genes for Exp. 1, 2, and 3, respectively), yielded an estimate of the sensitivity equal to 100, 280, and 400 transcripts per million for Exp. 1, 2, and 3, respectively.
Finally, robustness to the model assumption is critical for the control of FDR. Significance analysis of microarrays (Tusher et al., 2001
) is a powerful approach; however, it does not take care of the correlation and heteroskedastic structure within and across arrays, let alone within and across microarray studies. The mixed-model approach proposed here overcomes this drawback; no significance level needs to be specified before identifying DE genes, FDR is controlled from the proportion of total variance due to gene x variety interaction, and violations to the assumptions of the model can be easily ascertained by exploring residuals.
| Implications |
|---|
|
|
|---|
| Footnotes |
|---|
2 Correspondence: 306 Carmody Rd. (phone: +61 7 3214 2392; fax: +61 7 3214 2900; e-mail: Tony.Reverter-Gomez{at}csiro.au).
Received for publication June 12, 2004. Accepted for publication August 31, 2004.
| Literature Cited |
|---|
|
|
|---|
ß-crystallin promotes FBX4-dependent ubiquitination. J. Biol. Chem. 278:46994704.This article has been cited by other articles:
![]() |
E. de la Vega, M. R. Hall, K. J. Wilson, A. Reverter, R. G. Woods, and B. M. Degnan Stress-induced gene expression profiling in the black tiger shrimp Penaeus monodon Physiol Genomics, September 11, 2007; 31(1): 126 - 138. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. Vuocolo, K. Byrne, J. White, S. McWilliam, A. Reverter, N. E. Cockett, and R. L. Tellam Identification of a gene network contributing to hypertrophy in callipyge skeletal muscle Physiol Genomics, February 12, 2007; 28(3): 253 - 272. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Reverter, N. J. Hudson, Y. Wang, S.-H. Tan, W. Barris, K. A. Byrne, S. M. McWilliam, C. D. K. Bottema, A. Kister, P. L. Greenwood, et al. A gene coexpression network for bovine skeletal muscle inferred from microarray data Physiol Genomics, December 13, 2006; 28(1): 76 - 83. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. A. Lehnert, K. A. Byrne, A. Reverter, G. S. Nattrass, P. L. Greenwood, Y. H. Wang, N. J. Hudson, and G. S. Harper Gene expression profiling of bovine skeletal muscle in response to and during recovery from chronic and severe undernutrition J Anim Sci, December 1, 2006; 84(12): 3239 - 3250. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Reverter, A. Ingham, S. A. Lehnert, S.-H. Tan, Y. Wang, A. Ratnakumar, and B. P. Dalrymple Simultaneous identification of differential gene expression and connectivity in inflammation, adipogenesis and cancer Bioinformatics, October 1, 2006; 22(19): 2396 - 2404. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Reverter, W. Barris, S. McWilliam, K. A. Byrne, Y. H. Wang, S. H. Tan, N. Hudson, and B. P. Dalrymple Validation of alternative methods of data normalization in gene co-expression studies Bioinformatics, April 1, 2005; 21(7): 1112 - 1120. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |