J. Anim Sci.
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Reverter, A.
Right arrow Articles by Lehnert, S. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Reverter, A.
Right arrow Articles by Lehnert, S. A.
J. Anim. Sci. 2003. 81:1900-1910
© 2003 American Society of Animal Science

A mixture model-based cluster analysis of DNA microarray gene expression data on Brahman and Brahman composite steers fed high-, medium-, and low-quality diets1

A. Reverter*,2, K. A. Byrne*, H. L. Bruce{dagger}, Y. H. Wang*, B. P. Dalrymple* and S. A. Lehnert*

* Cooperative Research Centre for Cattle and Beef Quality, CSIRO Livestock Industries, Queensland Bioscience Precinct, St Lucia, Queensland 4067, Australia; and {dagger} Food Science Australia, Tingalpa DC, Queensland D 4173, Australia


    Abstract
 Top
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Implications
 Literature Cited
 
The objective of this study is to explore aspects of the statistical analysis of gene expression response at the muscle tissue level to varying levels of energy and protein in the diet. Eleven Brahman and Brahman composite steers (weighing 302 ± 9.8 kg, on average) were allocated randomly into high- (HIGH), medium- (MED), and low- (LOW) quality forage diets for 27 d. After this period, a biopsy of the longissimus dorsi muscle was taken from each animal and total RNA was extracted to generate the labeled target for microarray experimentation. These targets were hybridized to a complementary DNA (cDNA) microarray of 9,274 probes from cattle muscle and subcutaneous fat cDNA libraries. After edits, 151,904 expression intensity levels of 4,747 genes were analyzed. Emphasis was given to the choice of power transformation of the intensity channel readings and to the consistency of readings within each diet quality group. The statistical approach to isolate differentially expressed genes was based on model-based clustering via a mixture of normal distributions estimated through maximal likelihood. The base-2 logarithm was found to be the optimal power transformation to normalize gene intensity levels. A two-sample t-statistic was defined as a measure of possible differential expression. For each of the three diet contrasts, HIGH vs. LOW, HIGH vs. MED, and MED vs. LOW, three clusters were found, two of which contained more than 94% genes with almost no altered gene expression levels, whereas the third cluster contained the remaining genes with a differential expression. Results from the HIGH vs. LOW contrast identified 27 genes with a greater than 95% posterior probability of belonging to the cluster of differentially expressed genes.

Key Words: Beef • Complementary DNA • Gene Expression • Maximum Likelihood • Statistical Analysis


    Introduction
 Top
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Implications
 Literature Cited
 
Gene expression technology is becoming increasingly accessible to animal scientists. The expectation is that these techniques will contribute to a greater understanding of the genetic basis of economically important traits. Moody (2001)Go provides a review of techniques for evaluating gene expression in livestock species. Recently, Lee and Hossner (2002)Go used PCR products to demonstrate that a nutrient challenge can stimulate or inhibit the expression of various well-known candidate genes involved with lipogenesis and adipose tissue metabolism.

One area of intensive development is the analysis and interpretation of the large data sets generated by these techniques and in particular complementary DNA (cDNA) microarray. Novel statistical challenges are presented because microarray data are very high dimensional with very little replication. In a typical experiment, the expression of a number of genes ranging anywhere from 1,000 to over 20,000 could be explored. However, RNA samples from only a handful of experimental animals may be available.

The Cooperative Research Centre for Cattle and Beef Quality (Beef CRC) undertook gene expression profiling of bovine muscle tissue to explore the gene regulatory pathways that control muscle development, as well as the gene expression response at the muscle tissue level to varying levels of energy and protein in the diet. Within this project, the objective of this paper is to present the analysis of cDNA microarray data on 9,274 genes on Brahman and Brahman composite steers fed high-, medium-, and low-quality diets. Initial emphasis is given to the choice of power transformation of the gene expression intensity levels and to the consistency of readings within each diet group. Finally, the statistical approach to isolate differentially expressed genes is based on model-based clustering estimated via maximum likelihood.


    Materials and Methods
 Top
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Implications
 Literature Cited
 
Animals, Diets, and Microarray Design
Three Brahman and eight Brahman composite steers weighing 302 ± 10 kg were selected at the Tropical Beef Centre of the Commonwealth Scientific and Industrial Research Organisation (CSIRO) in Rockhampton, Queensland, Australia. The Animal Ethics Committee approved all animal procedures. These steers were settled before the experiment for 2 wk in a paddock with lush forage. The steers were then weighed and randomly allocated to three diets such that the mean weight of the steers in each diet was approximately equal. The ad libitum alfalfa (lucerne) hay diet (HIGH; n = 3 animals; namely A1, A2, and A3) was designed to allow the cattle to continue to grow at 0.8 to 1.0 kg/d and was the control diet. The restricted alfalfa hay diet (MED; n = 4 animals; namely A4, A5, A6, and A7) was used to produce a moderate weight gain of about 0.3 kg/d; and the ad libitum low-quality grass-based hay diet (LOW; n = 4 animals; namely A8, A9, A10, and A11) was included to produce a weight loss of 0.3 kg/d. Steers were housed individually in pens under cover from sun and rain for the duration of the experiment.

After 27 d on the experimental diets, feed was removed from the steers at noon the day before they were scheduled to be biopsied from the longissimus dorsi (LD) muscle. Prior to biopsy, each steer was weighed. For biopsy, steers were isolated randomly and immobilized. Each received a sedative of 0.2 mg of Rompun/kg of BW s.c. in the neck. The surgical area over the LD between the 10th and 13th ribs was clipped with fine clippers and washed with an alcohol and disinfectant mixture. Once the steer showed signs of sedation (drooling, lying down), the steer was haltered and the analgesic and sedative ketamine was administered at 2.0 mg/kg of BW into the jugular vein. Once the animal was heavily sedated, a 1- to 2-g muscle sample was removed using a scalpel. The incision was then stitched internally using soluble stitches and the hide was closed with insoluble #2 nylon suture. The wound was sprayed with Chloromide (Troy Laboratories, Smithfield, NSW), and then each steer received 20 mL of Propen, a penicillin derivative, s.c. at three injection sites. Once each steer had roused, it was encouraged to stand and was returned to its pen to recover.

Biopsy samples were wrapped in aluminum foil that had been labeled with cryogenic pen written on cryogenic tape. The samples were then immersed and stored in liquid nitrogen, transported to the laboratory in Brisbane, Queensland, Australia, and stored at -70°C until total RNA was extracted to generate the labeled target for microarray experimentation. Total RNA was prepared from each sample biopsy using TRIzol reagent in accordance with the manufacturers protocol (Invitrogen, Carlsbad, CA) and stored at -80°C. Sample quantification was measured by spectrophotometry and RNA quality assessments using agarose gel electrophoresis were carried out. All samples were treated with DNase I (Ambion DNA Free Kit, Austin, TX) for the removal of DNA contamination. A second purification, using RNAeasy Mini Kit (Qiagen, Valencia, CA) and a second DNase I on-column treatment (RNase-Free DNase Set, Qiagen, Valencia, CA) was performed on 100 µg of total RNA prior to cDNA synthesis.

The main objective of the gene expression experiment was to identify which genes were differentially expressed between the HIGH and the other MED and LOW groups. To this end, a reference sample design was developed. Factors considered in designing the experiment included the availability of reference vs. treatment RNA, as well as the cost of the arrays themselves. If the degree of variation between biological replicates is thought to be very high, or not enough RNA can be collected from just one individual, then pooling samples may be beneficial.

Animals in HIGH were considered the reference sample and the total RNA from these animals was pooled and labeled with fluorescent green dye (G) using direct incorporation of Cye3-dUTP during the reverse-transcribed cDNA synthesis step. The entire HIGH pool was labeled at once and then allocated to the various hybridizations. Consequently, MED and LOW were considered the two treatments and their corresponding total RNA samples were labeled with fluorescent red dye (R) using Cye5-dUTP as above, but not pooled. For efficiency, HIGH vs. MED (HvM) and HIGH vs. LOW (HvL) comparisons were made directly by hybridizing them on the same array but separately for each animal in MED and LOW groups. The biopsy of animal A11 yielded insufficient RNA for cDNA for analysis; therefore, in order to maintain the balance of the reference design, two hybridizations were performed for animal A8 (randomly selected) in LOW. Thus, a total of eight hybridizations was carried out: four HvM resulting from one of each animal in the MED group, and four HvL resulting from twice A8, plus A9 and A10. The fluorescently labeled cDNA targets were hybridized to a cDNA microarray constructed from two cattle cDNA libraries derived from LD and subcutaneous fat tissue of a 14-mo-old grass-fed Angus steer. A total of 9,600 elements were printed in duplicate onto CMT GAPSII (Corning) glass slides using a Virtek Microarrayer with the Stealth 48 Pin head and Micro Spotting Pins (Telechem International Inc., Sunnyvale, CA) at a spacing of 220 µM at the Institute of Molecular Bioscience of the University of Queensland, Brisbane, Australia. The array consisted of 9,222 cattle probes comprised of 7,291 anonymous cDNA from the bovine skeletal muscle and subcutaneous fat cDNA libraries and 1,915 bovine EST (expressed sequence tags) selected from the muscle and fat libraries, CSIRO cattle skin library, the MARC 1-4BOV libraries (U.S. Meat Animal Research Center) and contributing Beef CRC laboratories. In addition, 16 EST-derived oligonucleotides and the Lucidea Microarray Scorecard V1.1 (Amersham Biosciences, Piscataway, NJ) were printed on the array. The DNA was covalently crosslinked to the slides using ultraviolet irradiation with Stratalinker set for 65 mJ.

Each array contained 19,200 cells arranged in 48 blocks of 20 rows x 20 columns each. Hence, EST were duplicated at least once on each array, and those EST duplicated more than once corresponded to "housekeeping" genes (i.e., genes of known sequence and function believed not to be differentially expressed between the two cell types of interest, HIGH and either LOW or MED).

The commercial image analysis software GenePixPro 3.0, and a GenePix 4000 optical scanner, both from Axon Instruments Inc. (Union City, CA) were used to quantify the gene expression level intensities. This software provides a distinct quality reading for bad quality spots that were subsequently edited from the analyses. Also, only genes that scored twice in each of the eight microarrays slides were used in this study. These editing criteria resulted in 151,904 intensity readings from 4,747 genes out of the original 9,274. Hence, the 151,904 records originated from the 4,747 genes x two replicates x two dye channels x eight array slides. For these 4,747 genes, the R and G intensity levels were background-corrected by subtracting the background (Rb and Gb) from the foreground (Rf and Gf) intensities. Thus, for analyses:


Choice of Power Transformation
It is customary to apply the log transformation (and typically the base-2 logarithm) to the intensity levels R and G so that they are more likely to have a normal distribution (which in turn will reduce the number of clusters found in a model-based clustering approach). Although microarray data may benefit from transformation, it is not immediately apparent which transformation should be used. Very few studies pay attention to the choice of power transformation (one exception being Yeung et al., 2001Go).

Many commonly used transformations, including the log transformation, are subsumed in the parametric family of power transformations from x to x({lambda}) with parameter {lambda} (Box and Cox, 1964Go):


[1]

which is continuous in {lambda} for x > 0. Following the derivations of Johnson and Wichern (1992)Go, the parameter {lambda} will be estimated by maximum likelihood. Given the observations x1, x2, ..., xn, the Box-Cox solution for the choice of an appropriate power {lambda} is the one that maximizes the following expression:


[2]

where


[3]

is the arithmetic average of the transformed observations.

The first term in Eq. [2Go] is proportional to the logarithm of a normal likelihood function, after maximizing it with respect to the population mean and variance parameters. Therefore, maximizing Eq. [2Go] is of interest in order to increase the chances of producing a transformed set of variables that adequately conform to a normal distribution. In this study, a range of values of {lambda} was explored to transform the gene expression intensity levels R and G, and the ones that maximized l({lambda}) were selected.

After taking the transformation, for each comparison (HvL and HvM), the gene expression levels were standardized by subtracting the median expression value for each dye and slide. This within-slide location standardization is based on the assumption that most genes, at least half, will not be differentially expressed. This method also assumes that the R and G dye intensities are related by a constant factor: the median. The median is chosen, as opposed to the mean, because it is a better indicator of central tendency in skewed distribution and as such, is more robust against outliers.

The resulting expression levels after transformation and median standardization were denoted as follows:

gij and g'ij
for the G levels (i.e., green levels out of the reference HIGH group) in the HvL and HvM comparison, respectively; and

rij and r'ij
for the R levels (i.e., red levels out of the treatment LOW or MED) in the HvL and HvM comparison, respectively.

In this notation, subindex i refers to gene (i = 1 to 4,747), subindex j refers to gene replicate (two within-slide x four slides in each comparison; thus j = 1 to 8).

Consistency of Expression Levels in the Reference Sample Across Comparisons
In this reference design, caution must be taken when comparing the effects of LOW groups with those of MED groups in gene expressions because comparisons were not made within slide, but across slides through the reference group HIGH. To a great extent, the accuracy of the comparison LOW vs. MED will depend on the consistency of results of the G levels belonging to HIGH across comparisons HvL and HvM, and previously denoted as gij and g'ij, respectively.

A measure of the consistency of the gij and g'ij levels was obtained by exploring the mean and the SD for each gene and calculated according to the following formulas:


[4]

and


[5]

for comparisons HvL and HvM, respectively.

Measure of Differential Expression
For each comparison, and as a measure of (possible) differential gene expression, the following two-sample t-statistic was calculated for each gene:


[6]

where and are the arithmetic mean of rij and r'ij, respectively.

The numerator in Eq. [6Go] is the difference of mean gene expression levels under the comparisons HvL and HvM for and , respectively. The denominator is the standard error of the numerator and aims at standardizing the observed differences by penalizing those with large, and thus less reliable, variation.

Provided gene expressions in gij and g'ij are consistent with each other as defined in Eq. [4Go] and [5Go], a third t-statistic can be computed to identify differentially expressed genes in the MED vs. LOW group contrast (MvL) as follows:


[7]

Model-Based Clustering
The mixture model assumes that each cluster (or component) of the data is generated by an underlying normal distribution. Each of the data in y = y1 to yn is assumed to be an independent observation from a mixture density with k (possible unknown but finite) components and with the probability density function:


[8]

where {phi}(y; µi,Vi) denotes the normal density function with mean µi and (co)variance matrix Vi, and the mixing proportions {pi}i are constrained to be non-negative and sum to unity. All unknown parameters are represented in {Phi}k for a k-component (or k-cluster) mixture model. In the present study, mixture models with up to five components (or clusters) were contemplated.

The mixture model in Eq. [8Go] was fitted to the data in and , and parameters were estimated by maximal likelihood using EMMIX software (McLachlan et al., 1999Go). This program is available at http://www.maths.uq.edu.au/~gjm/emmix/emmix.html and has many interesting features, including multiple starts of the expectation-maximization (EM) algorithm and various criteria for model selection.

In short, the EM algorithm obtains the maximal likelihood estimate of {Phi}k by iteration. In the (m+1)th iteration, the estimates of the parameters of interest are updated by:




for i = 1 to k, where


[9]

is the posterior probability that yj belongs to the ith component of the mixture.

Criteria for model selection includes the Akaike Information Criterion (AIC; Akaike, 1969Go) and the Bayesian Information Criterion (BIC; Schwartz, 1978Go):



where {upsilon}k is the number of independent parameters in {Phi}k. The selected model is the one with the smallest AIC or BIC. There is no clear consensus on which criteria is best to use, although the empirical work of Fraley and Raftery (1998)Go seems to favor BIC. Alternatively, and as proposed by McLachlan (1987)Go, the distribution of the likelihood ratio test (LRT) under the null hypothesis H0: k = k0 can be approximated by bootstrapping and a P-value against the alternative H1: k = k0 + 1 can be obtained. In the present study, a combination of the three criteria (AIC, BIC, and LRT) was explored to identify the selected model.

Comparison of Model-Based Clustering with ANOVA Methods
The available 151,904 measures of gene expression intensities (log-transformed and within-array median-centered) were further analyzed by fitting a linear model that included the effects of array (df = 7), block nested within array (df = 376), diet (df = 2), array x diet interaction (df = 6), gene (df = 4,746), and gene x diet interaction (df = 9,492). This model is similar to the one suggested by Kerr and Churchill (2001)Go within the scope of microarray data analysis. The effects of array, block, diet, and array x diet interaction are not gene specific, have no biological interest, and their fitting aims at normalizing the data. The gene effect contains the average level of gene expression, whereas the effect of interest is the interaction between genes and diets because it captures differences from overall averages that are attributable to a specific combination of diet and gene.

Initially, goodness of fit for this (completely fixed-effect) model was assessed by using the ANOVA procedure of SAS (version 8.2, SAS Inst., Inc., Cary, NC). Later, the same model was fitted by treating the effects of gene and the interaction of gene x diet as random. The optimality of mixed-model approaches to assess gene significance has been reported for spotted cDNA microarray expression data (Wolfinger et al., 2001Go) as well as for oligonucleotide array experiments (Chu et al., 2002Go). The VCE software (Groeneveld and Garcia-Cortes, 1998Go) was used to obtain REML estimates of variance components and BLUP for random effects. The effects of gene, gene x diet interaction, and residual were assumed to be independent with zero mean and variance , , and , respectively.

The difference in gene expression for gene i at diet HIGH compared with diet LOW was estimated from the difference between the corresponding BLUP for gene x diet interaction effect. Similarly, the difference in gene expression for gene i at diet HIGH compared with diet MED was estimated from the difference between the BLUP for gene x diet interaction effect. Finally, and without any distributional assumptions, the significance of such differences was estimated from the observed proportion of more extreme differences.


    Results and Discussion
 Top
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Implications
 Literature Cited
 
Choice of Power Transformation
Figure 1Go illustrates the shape of the likelihood function l({lambda}) in Eq. [2Go] for values of {lambda} ranging from -1.0 to 1.0, and for the background-corrected intensity levels observed in each dietary treatment group. Maximal likelihood estimates of {lambda} were found at -0.225, -0.075 and -0.125 for HIGH, MED, and LOW, respectively. Because the value of {lambda} = 0.0 was close to the maximum in all three treatment groups, this value was preferred because of its simplicity. Therefore, the log transformation, and in particular the base-2 logarithm, was applied to all the background-corrected intensity levels. These results agree with the log transformation being the most widely used transformation in microarray data analysis.



View larger version (13K):
[in this window]
[in a new window]
 
Figure 1. Likelihood evaluation at different values of {lambda} for the choice of power transformation to be applied to the background-corrected intensity levels, and for each treatment: high- (HIGH), medium- (MED) and low- (LOW) quality diets.

 
Data transformation aims at producing a more realistic sense of variation by breaking the dependency between magnitude and variation of intensity. In cDNA microarray data analysis, transforming digital records produced by the optical scanner to the base-2 log scale has the additional advantage of enhancing the interpretation of comparisons between red and green fluorescent intensity channel by providing units of twofold change. The optical scanner images the array, producing a digital record that takes the form of a pair of 16-bit tagged image file format (tiff), one for each channel. Recorded values range from 20 (lack of color) to 216 (maximum saturation).

The study by Yeung et al. (2001)Go compared the performance of different data transformations to assess the extent to which the Gaussian assumption holds. Working with two real gene expression data sets and three sets of simulated data, the authors concluded that the square root transformation was closer to multivariate normal than the log transformation in all cases except one of the real gene expression data sets explored. In the present study, and as shown in Figure 1Go, the likelihood of the square root (i.e., {lambda} = 0.5) and also the likelihood of the inverse of the square root (i.e., {lambda} = -0.5) are smaller than that for the log transformation (i.e., {lambda} = 0.0).

The choice of power transformation can also be affected by the data editing criteria. In this study, three selection criteria were applied, including 1) only genes scoring twice in each of the eight arrays, 2) only readings with the background intensity smaller than the foreground (and thus resulting in a positive value for the background-corrected intensity for which the logarithm is defined) and for both the red and the green channels, and 3) only readings not flagged as bad quality spots by the image analysis software. These criteria, although standard in most microarray studies, resulted in almost half of the genes being lost. Loss of information is inevitable in typical microarray analyses and is not always a negative feature because such analyses are reductionistic by nature due to the problems of irreproducibility as well as cost limitations in confirming candidate genes resulting from these analyses. Other microarray data acquisition techniques, such as the correlation between the mean and the median signal intensities described by Tran et al. (2002)Go could have resulted in a different set of genes being edited and a different power transformation being required.

The impact that the base-2 log transformation had on the gene expression data for the HIGH group in the HvL comparison is shown in Figure 2Go, where the empirical densities and the SD by amount of intensity is illustrated before and after transformation. To enhance distinguishing among plots, the intensities presented in Figure 2Go ranged from 128 (or 27) to 16,348 (or 214) when over 95% of records were observed. The plots of the SD by amount of intensity were obtained after sorting the data with 37,976 records and generating 189 groups with 200 records each, plus an additional group with 176 records. The log-transformation reduced both the skewness of the distribution and the pattern of increasing variation with amount of intensity.



View larger version (20K):
[in this window]
[in a new window]
 
Figure 2. Empirical densities (upper graphs) and the SD (lower graphs) by amount of intensity before (G) and after (log2[G]) the base-2 logarithmic transformation for the green intensities of gene expressions observed in the high-quality diet group at the high- vs. low-quality diet comparison.

 
Other recently proposed methods for variance-stabilizing data transformation (for example, Durbin et al., 2002Go; Workman et al., 2002Go) were not explored. However, homogeneity of variances in gene expression intensity levels is not a required constraint in a model-based clustering analysis.

Consistency of Expression Levels in the Reference Sample Across Comparisons
Table 1Go shows summary statistics including the mean, SD, minimum, maximum, skewness, and kurtosis coefficients for , , and corresponding to expression levels of the HIGH group in the HvL and HvM comparisons as defined in Eq. [4Go] and [5Go]. The empirical density plots (not shown) of and were indistinguishable from each other and so were those of and . Further, the correlation coefficient between and , and between and was 0.988 and 0.915, respectively.


View this table:
[in this window]
[in a new window]
 
Table 1. Summary statistics for the gene expression levels in the reference sample across comparisons
 
There is an expected pattern of gene expression levels by quality of diets from LOW to MED to HIGH. Given the strong to very strong similarities observed between gene expression levels of the reference group (HIGH) at the two comparisons, contrasts between MED and LOW could be made with reasonable accuracy. Therefore, the decision to explore differentially expressed genes in the MvL comparison through as defined in Eq. [7Go] was justified. However, no statistical test was used to determine if the means were equal for individual genes. Data analyzed in the MvL comparison could have been filtered by only including genes where at a given confidence level. Additionally, gene filtering could have been based on the within gene variation (i.e., and ). However, this additional gene-filtering criteria could have resulted in 1) an even smaller set of genes being included in the MvL comparison, and 2) an unbalanced number of observations across arrays for the MvL comparison.

Measures of Differential Expression
The reference design used in this experiment did not include a "dye swap" (i.e., a replicate experiment in which the cDNA target that was previously labeled with red dye is now labeled with green dye and vice versa). Swapping dyes aims at accounting for possible differences in the efficiency of label incorporation. In the presence of such differences, measures of differential expression based on within-slide median centered intensities are expected to be less affected than traditional measures such as the ratio of green to red intensities.

Mean and SD of the t-statistics , and were -2.27 and 3.17, -1.21 and 0.99, and -1.06 and 2.73, respectively. The scatter plots of , and against the average intensity levels are given in Figure 3Go. Horizontal lines corresponding to the bounds of the 95% empirical confidence region are also presented.



View larger version (31K):
[in this window]
[in a new window]
 
Figure 3. Scatter plots of t-statistics to evaluate high- vs. low- (top), high- vs. medium- (middle), and medium- vs. low- (bottom) quality diet comparisons against average intensity of gene expression. Horizontal lines correspond to the bounds of the 95% empirical confidence interval.

 
Differentially expressed genes are expected to be associated with extreme absolute values of the t-statistic. The lack of any apparent relationship between the absolute value of the t-statistics and the average gene expression levels (Figure 3Go) is encouraging and indicates that differentially expressed genes can be identified irrespective of the amount of expression observed for any given gene or genes.

In order to avoid extreme t-values being the result of lowly variable gene expression across replicates, several authors have suggested the computation of an adjusted t-statistic by introducing in the denominator a factor that is a function of the gene expression variability (see for instance Efron et al., 2001Go; Tusher et al., 2001Go; Lönnstedt and Speed, 2002Go). In the light of the behavior of expression levels (Figure 2Go) and t-statistics (Figure 3Go), none of these approaches were considered in this study.

Correlation coefficients between and , and , and , and were 0.566, 0.954, and 0.293, respectively. These results indicate that the biological distance at the gene expression level between groups HIGH and MED was much shorter than that between groups MED and LOW. Groups HIGH and MED were defined as steers having ad libitum and restricted access to lucerne, respectively. Irrespective of the amount of restriction, there is always scope for a feed-efficient animal to express its full genetic potential. For HIGH and MED, the animals were gaining weight, and therefore presumably building muscle and/or lying down fat. This will, in turn, make the comparison between HIGH and MED particularly difficult to assess. In this experiment, diet restriction was designed to produce less of an effect on weight gain and hence genetic expression than the high-quality diet. Instead, the low-quality diet was methodically designed to produce a weight loss. Hence, LOW is clearly different from HIGH and MED, which just differ in degree.

Model-Based Clustering and ANOVA Results
A summary of the mixture model fitting results is given in Table 2Go, where the values for the various model selection criteria (AIC, BIC, and LRT) are presented for each model ranging from one to five components. For all three comparisons, there was a dramatic decrease in both AIC and BIC, along with an increase in the log-likelihood when moving from a mixture model with one component to a model with two components. This pattern remained when fitting a model with three components, and became less conclusive with more components. The bootstrap method to approximate the distribution of the LRT to test a model with three components against a model with four components resulted in a P-value of 0.29, 0.71, and 0.52 for the HvL, HvM, and HvL comparisons, respectively. Therefore, a mixture model with three components (or clusters) was chosen for each of the three comparisons.


View this table:
[in this window]
[in a new window]
 
Table 2. Clustering results after fitting various number of components (k) and for each dietary quality group comparisona
 
The fitted mixture models for the HvL, HvM, and MvL comparisons were, respectively:




Although there is not a substantial difference between the means of the three components within (and to some extent across) models, the vast majority of records (95.6, 73.7, and 94.1% for HvL, HvM, and MvL, respectively) fall into the last two clusters with much smaller variances. The first cluster in each mixture model with a variance about 10 times bigger than the preceding two is likely to capture extreme data points corresponding to differentially expressed genes. However, only 4.4, 26.3, and 5.9% of records from HvL, HvM, and MVL, respectively, were contained in this first cluster. The strong similarities observed between and are further manifested in the resulting mixture models for HvL and MvL.

When the 151,904 measures of gene expression intensities (log-transformed and within-array median-centered), averaging 1.68 (SD = 1.93) and ranging from -8.04 to 8.30, were analyzed via ANOVA, the fixed effect model containing the effects of array, block nested within array, diet, array x diet interaction, gene and gene x diet interaction explained 98% of the total variation. All the effects included in the model were significant sources of variation (P < 0.001) with gene and gene x diet interaction accounting for 81 and 3% of the total variation, respectively. When gene and gene x diet interaction were treated as random and the remaining effects as fixed, REML estimates of , , and were 2.880, 0.128, and 0.241, respectively.

Cluster analysis techniques have played a major role in determining which genes are differentially expressed across two kinds of tissue samples or samples obtained under two experimental conditions. Tibshirani et al. (1999)Go presented a review of clustering methods for the analysis of DNA microarray data, including hierarchical clustering (Eisen et al., 1998Go), K-means clustering (Tavazoie et al., 1999Go), and self-organizing maps (Tamayo et al., 1999Go). A comparative review of statistical methods (not only clustering) is given by Pan (2002)Go, including the two-sample Welch’s t-test (Welch, 1951Go) with unequal variances, a regression modeling approach (Thomas et al., 2001Go), a mixture model approach (Pan et al., 2001Go), the empirical Bayesian method of Efron et al. (2001)Go, and the significance analysis of microarray method of Tusher et al. (2001)Go.

Model-based clustering via mixture of distributions has been identified as a method of choice to identify which genes have differential expression levels. Model-based clustering has a clear definition that a cluster is a subpopulation with a certain distribution, the clustering results are stable, and several statistical methods can be applied to estimate the number of clusters (Yeung et al., 2001Go; Pan et al., 2002aGo). In addition, model-based clustering has been shown to provide an elegant framework to calculate the power of detecting a specified magnitude of change (Rekaya, 2002Go), as well as to estimate the number of replicates needed for precise inferences (Pan et al., 2002bGo).

Besides the ability to determine the number of clusters, model-based clustering has the additional appeal of providing posterior probabilities of observations belonging to each cluster (Figure 4Go). A gene is classified to a cluster if its posterior probability of belonging to that cluster is the largest.



View larger version (27K):
[in this window]
[in a new window]
 
Figure 4. Posterior probability (P) of a gene being in each cluster as a function of the t-statistic to evaluate high- vs. low- (top), high- vs. medium- (middle), and medium- vs. low- (bottom) quality diet comparisons. A gene is classified to a cluster if its posterior probability is the largest.

 
For the HvL comparison, there were 27 genes with a greater than 95% posterior probability of belonging to cluster 1. These differentially expressed genes had extreme expression measures such that <-16.65 or >10.10, and are listed in Table 3Go along with their t-statistic values for each comparison. In order to compare these results with those from the mixed-model approach, the difference in BLUP for gene x high-quality diet minus gene x low-quality diet interaction solution along with the significance of such difference is also presented in Table 3Go. Most of these genes were up-regulated (positive t-statistics) in the HIGH treatment group and therefore attributed to muscle development. Many of these genes are in fact oligonucleotide EST or clones of various lengths of the same gene; the sequence and real-time PCR will be the basis of future research.


View this table:
[in this window]
[in a new window]
 
Table 3. Differentially expressed genes with a posterior probability of belonging to cluster 1 greater than 95% in the high- vs. low-quality diet comparison along with the values for the t-statistics , , and for the comparisons high- vs. low-, high- vs. medium- and medium- vs. low-quality diet, respectively, and BLUP for the difference between gene x high-quality diet minus gene x low-quality diet interaction
 
Although the results from the ANOVA approach were consistent with model-based clustering (i.e., genes allocated into cluster 1 also produced extreme BLUP for the gene x diet interaction), ANOVA results must be taken with caution because there is no evidence to support the questionable normality assumption required by ANOVA. There is also the problem of multiple comparisons if we test gene by gene.

The HvM comparison was less conclusive at identifying a single cluster containing differentially expressed genes. The fitted mixture model allocated 683 genes to cluster 1 having a <-2.51 (262 genes) or >-0.206 and <1.98 (413 genes) or >2.76 (8 genes). For the same comparison, the 70 genes with a between 1.99 and 2.69 comprised cluster 3 and were attributed to genes that, in spite of their seemingly large differences in expression between the HIGH and MED treatment groups, were not significant. However, the posterior probability of these genes belonging to cluster 1 never dropped below 26%. The exploration of these same 70 genes in the HvL contrast revealed that all of them were classified into cluster 2, but the probability of belonging to cluster 1 averaged 15.2% and ranged from 3.8 to 46.5%.

Even in the most extreme comparison (i.e., HvL with a clearer definition of differentially expressed genes) there were 195 genes with a less than 50% probability of belonging to any cluster. Further research with more replications would be needed to provide a more definitive clear-cut answer regarding the possible differential expression of these genes.


    Implications
 Top
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Implications
 Literature Cited
 
The present study explores various aspects of the statistical analysis of gene expression response at the muscle tissue level to varying levels of diet quality in Brahman and Brahman composite steers. A reference design was implemented with the ribonucleic acid from muscle tissue of the three animals fed the high-quality diet being the reference sample. Eight gene expression replications were considered per comparison for a total of 4,747 genes. The base-2 logarithmic transformation was found to be optimal to approximate normality. The behavior of the expression intensities in the reference sample was consistent across comparisons anticipating the quality of gene expression data. Differentially expressed genes were identified using model-based clustering via mixtures of distributions.


    Footnotes
 
1 The authors acknowledge the following funding bodies: the Cooperative Research Centre for Cattle and Beef Quality and its core partners, The University of New England, NSW Agriculture, CSIRO, and Queensland DPI. All CRC participants, both scientists and technical staff, who contributed to or supported the work, including those involved in animal management, data collection, laboratory analyses, and data handling are gratefully acknowledged. The assistance of A. Day and B. van den Heuvel during collection of blood samples is gratefully acknowledged. The authors wish to thank P. Allingham for performing the tissue biopsies and B. Hunter for providing the animals. Back

2 Correspondence—phone: +61-7-3214-2392; fax: +61-7-3214-2881; E-mail:
Tony.Reverter-Gomez{at}csiro.au.

Received for publication December 19, 2002. Accepted for publication April 16, 2002.


    Literature Cited
 Top
 Abstract
 Introduction
 Materials and Methods
 Results and Discussion
 Implications
 Literature Cited
 


Akaike, H. 1969. Fitting autoregressive models for prediction. Ann. Inst. Stat. Math. 21:243–247.

Box, G. E. P., and D. R. Cox. 1964. An analysis of transformations (with discussion). J. Royal Stat. Soc. 26:211–252.

Chu, T.-M., B. Weir, and R. Wolfinger. 2002. A systematic statistical linear modeling approach to oligonucleotide array experiments. Math. Biosci. 176:35–51.[Medline]

Durbin, B. P., J. S. Hardin, D. M. Hawkins, and D. M. Rocke. 2002. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics (Oxf.) 18(Suppl. 1):S105–S110.[Abstract]

Efron, B., R. Tibshirani, J. D. Storey, and V. Tusher. 2001. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96:1151–1160.

Eisen E., P. Spellman, P. Brown, and D. Botstein. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95:14863–14868.[Abstract/Free Full Text]

Fraley, C., and A. E. Raftery. 1998. How many clusters? Which clustering methods? Answers via model-based cluster analysis. Computer J. 41:578–588.

Groeneveld, E., and L. A. Garcia-Cortes. 1998. VCE 4.0, a (co)variance components package for frequentists and Bayesians. Proc. 6th World Cong. Genet. Appl. Livest. Prod., Armidale, NSW, Australia 27:455–458.

Johnson, R. A., and D. W. Wichern. 1992. Pages 164-166 in Applied Multivariate Statistical Analysis. 3rd ed. Prentice Hall, Englewood Cliffs, NJ.

Kerr, M. K., and G. A. Churchill. 2001. Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments. Proc. Nat. Acad. Sci. 98:8961–8965.[Abstract/Free Full Text]

Lee, S. H., and K. L. Hossner. 2002. Coordinate regulation of ovine adipose tissue gene expression by propionate. J. Anim. Sci. 80:2840–2849.[Abstract/Free Full Text]

Lönnstedt, I., and T. Speed. 2002. Replicated microarray data. Statistica Sinica 12:31–46.

McLachlan, G. J. 1987. On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture. Appl. Stat. 36:318–324.

McLachlan, G. J., D. Peel, K. E. Basford, and P. Adams. 1999. The EMMIX software for the fitting of mixtures of normal and t-components. J. Stat. Software 4:2. Available: http://www.jstatsoft.org/v04/i02. Accessed Dec. 18, 2002.

Moody, D. E. 2001. Genomic techniques: An overview of methods for the study of gene expression. J. Anim. Sci. 79(E. Suppl.):128–135.

Pan, W., L. Jizhen, and C. T. Le. 2001. A Mixture Model Approach to Detecting Differentially Expressed Genes with Microarray Data. Tech. Report 2001-001. Div. Biostatistics, Univ. Minnesota, St. Paul. Available: http://www.biostat.umn.edu/~weip. Accessed Dec. 18, 2002.

Pan, W. 2002. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics (Oxf.) 18:546–554.[Abstract/Free Full Text]

Pan, W., J. Lin, and C. T. Le. 2002a. Model-based cluster analysis of microarray gene-expression data. Genome Biol. 3:research0009.1–0009.8.

Pan, W., J. Lin, and C. T. Le. 2002b. How many replicates of array are required to detect gene expression changes in microarray experiments? A mixture model approach. Genome Biol. 3:research0022.1–0022.10.

Rekaya, R. 2002. Bayesian mixture models with unknown number of components: Application to power calculation is microarray experiments. Proc. 7th World Cong. Genet. Appl. Livest. Prod., Montpellier, France, CD-ROM Communication No.16-12.

Schwartz, G. 1978. Estimating the dimensions of a model. Ann. Stat. 6:461–464.

Tamayo, P., D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and applications to hematopietic differentiation. Proc. Natl. Acad. Sci. 96:2907–2912.[Abstract/Free Full Text]

Tavazoie, S., J. D. Hughes, M. J. Campbell, R. J. Cho, and G. M. Church. 1999. Systematic determination of genetic network architecture. Nat. Genet. 22:281–285.[Medline]

Thomas, J. G., J. M. Olson, S. J. Tapscott, and L. P. Zhao. 2001. An efficient and robust statistical modelling approach to discover differentially expressed genes using genomic expression profiles. Genome Res. 11:1227–1236.[Abstract/Free Full Text]

Tibshirani, R., T. Hastie, M. Eisen, G. Ross, D. Botstein, and P. Brown. 1999. Clustering methods for the analysis of DNA microarray data. Tech. Report, Stanford Univ., Stanford, CA. Available: http://www-stat.stanford.edu/~tibs/research.html. Accessed Dec. 18, 2002.

Tran, P. H., D. A. Peiffer, Y. Shin, L. M. Meek, J. P. Brody, and K. W. Y. Cho. 2002. Microarray optimizations: Increasing spot accuracy and automated identification of true microarray signals. Nucleic Acids Res. 30:12e54.

Tusher, V. G., R. Tibshirani, and G. Chu. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. 98:5116–5121.[Abstract/Free Full Text]

Welch, B. L. 1951. On the comparison of several mean values: An alternative approach. Biometrika 38:330–336.[Free Full Text]

Wolfinger, R. D., G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari, and R. S. Paules. 2001. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol. 8:625–637.[Medline]

Workman, C., J. J. Jensen, H. Jarmer, R. Berka, L. Gautier, H. B. Nielsen, H.-H. Saxild, C. Nielsen, S. Brunak, and S. Knudsen. 2002. A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol. 3:research0048.1–0048.16.

Yeung, K. Y., C. Fraley, A. Murua, A. E. Raftery, and W. L. Ruzzo. 2001. Model-based clustering and data transformations for gene expression data. Bioinformatics (Oxf.) 17:977–987.[Abstract/Free Full Text]


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
Y. Lu, X. He, and S. Zhong
Cross-species microarray analysis with the OSCAR system suggests an INSR->Pax6->NQO1 neuro-protective pathway in aging and Alzheimer's disease
Nucleic Acids Res., July 13, 2007; 35(suppl_2): W105 - W114.
[Abstract] [Full Text] [PDF]


Home page
Physiol. GenomicsHome page
A. Reverter, N. J. Hudson, Y. Wang, S.-H. Tan, W. Barris, K. A. Byrne, S. M. McWilliam, C. D. K. Bottema, A. Kister, P. L. Greenwood, et al.
A gene coexpression network for bovine skeletal muscle inferred from microarray data
Physiol Genomics, December 13, 2006; 28(1): 76 - 83.
[Abstract] [Full Text] [PDF]


Home page
J ANIM SCIHome page
S. A. Lehnert, K. A. Byrne, A. Reverter, G. S. Nattrass, P. L. Greenwood, Y. H. Wang, N. J. Hudson, and G. S. Harper
Gene expression profiling of bovine skeletal muscle in response to and during recovery from chronic and severe undernutrition
J Anim Sci, December 1, 2006; 84(12): 3239 - 3250.
[Abstract] [Full Text] [PDF]


Home page
J ANIM SCIHome page
J. M. Reecy, D. M. Spurlock, and C. H. Stahl
Gene expression profiling: Insights into skeletal muscle growth and development
J Anim Sci, April 1, 2006; 84(13_suppl): E150 - E.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
A. Reverter, W. Barris, S. McWilliam, K. A. Byrne, Y. H. Wang, S. H. Tan, N. Hudson, and B. P. Dalrymple
Validation of alternative methods of data normalization in gene co-expression studies
Bioinformatics, April 1, 2005; 21(7): 1112 - 1120.
[Abstract] [Full Text] [PDF]


Home page
J ANIM SCIHome page
K. A. Byrne, Y. H. Wang, S. A. Lehnert, G. S. Harper, S. M. McWilliam, H. L. Bruce, and A. Reverter
Gene expression profiling of muscle tissue in Brahman steers during nutritional restriction
J Anim Sci, January 1, 2005; 83(1): 1 - 12.
[Abstract] [Full Text] [PDF]


Home page
J ANIM SCIHome page
A. Reverter, Y. H. Wang, K. A. Byrne, S. H. Tan, G. S. Harper, and S. A. Lehnert
Joint analysis of multiple cDNA microarray studies via multivariate mixed models applied to genetic improvement of beef cattle
J Anim Sci, December 1, 2004; 82(12): 3430 - 3439.
[Abstract] [Full Text] [PDF]


Home page
J ANIM SCIHome page
R. J. Moser, A. Reverter, C. A. Kerr, K. J. Beh, and S. A. Lehnert
A mixed-model approach for the analysis of cDNA microarray gene expression data from extreme-performing pigs after infection with Actinobacillus pleuropneumoniae
J Anim Sci, May 1, 2004; 82(5): 1261 - 1271.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Reverter, A.
Right arrow Articles by Lehnert, S. A.
Right arrow Search for Related Content
PubMed