TRIENNIAL REPRODUCTION SYMPOSIUM

* Laboratory of Mammalian Reproductive Biology and Genomics, Departments of Animal Science and Physiology, Michigan State University, East Lansing 48824-1225; and
Department of Dairy Science, University of Wisconsin, Madison 53706
| Abstract |
|---|
Key Words: microarray, data analysis, false discovery rate, biological theme, cattle
| INTRODUCTION |
|---|
Key factors influencing the success of a microarray experiment include 1) formulation of an appropriate hypothesis and rationale for the experiments; 2) choice of the microarray platform (e.g., cDNA, oligonucleotide) based on the goals of the experiment and tissue/cell type and species of interest; 3) complexity of the samples of interest and use of whole tissue vs. purified cell types; 4) experimental design issues, including appropriate biological replication, the need for technical replication, and the appropriateness of pooling of samples; 5) control of the false discovery rate and avoidance of arbitrary data analysis procedures; and 6) use of gene classification procedures, pathway analysis, and other available tools to facilitate the interpretation of biological themes. These factors are also addressed in detail in an excellent recent review by Allison et al. (2006), which is highly recommended and served as a valuable source of information during the preparation of this paper.
For the purpose of this paper, we will use the term microarray to refer to DNA microarrays used as a high throughput platform for the quantification of differences in relative RNA transcript abundance for a very large number of genes simultaneously. The paper is also presented from the perspective of a reproductive biologist (G. W. Smith) who has utilized microarray technology successfully, so emphasis is on the lessons learned and the strategies found to be successful, not on the details of the statistical theory behind the approaches described.
The power of microarray technology is self-explanatory, and the term gene expression profiling is commonly used to describe it. However, as is true for other common technologies used for quantification of the abundance of RNA for genes of interest, one must exercise caution in directly attributing changes in RNA transcript abundance to differences in transcription or in directly inferring that such changes automatically result in differences in biological activity of the genes of interest.
| HYPOTHESIS AND RATIONALE |
|---|
| CHOICE OF THE MICROARRAY PLATFORM |
|---|
In addition to the numerous cDNA microarrays available for livestock species, high-density oligonucleotide arrays for gene expression profiling of samples derived from cattle, pigs, and chickens also are now commercially available. For those farm species for which microarrays are less readily available (e.g., sheep), cross-species hybridization to existing arrays, particularly cDNA arrays, is possible but not ideal. Such approaches may yield a significant proportion of false negatives in situations where sequence similarity between a heterologous clone on the array and a corresponding transcript in the samples of interest is not sufficient to allow robust hybridization. Generally, the availability of microarrays for farm animal species is no longer a limiting factor in the execution of expression profiling experiments, but construction of custom arrays may still be warranted when the goal is examination of the regulation of rare or tissue- or cell-specific transcripts not likely to be highly represented on the existing arrays.
| COMPLEXITY OF THE SAMPLES |
|---|
| EXPERIMENTAL DESIGN AND DATA ANALYSIS |
|---|
Questions also arise about the necessity of technical replication; i.e., running multiple arrays with the same samples. Given the current status of DNA microarray technology and the available platforms, technical replication is not a prerequisite for successful microarray experiments. Technical replication only provides an estimate of the variability in measurement, whereas biological replication (analysis of multiple, independent samples per treatment group on separate slides) in essence accounts for biological variation between samples and variation in measurement.
Another common question related to the design of microarray experiments is the potential benefits and risks associated with pooling of samples. For example, pooling of samples may be considered when sample costs are low relative to the cost of the microarray procedures. Whereas pooling of samples can reduce the variation between arrays, potential outliers may get masked or may compromise the entire pool. Pooling of samples has also been proposed when the starting material (RNA) from individual samples is limited. Although this is potential justification for the pooling of samples, there are alternatives.
We have validated linear amplification procedures for use in microarray experiments when the input RNA is limiting (Patel et al., 2004) and have applied such procedures in microarray studies using bovine oocytes (Patel et al., 2005). Without amplification, RNA from thousands of oocytes would have been required to conduct these experiments using standard microarray procedures. Pooling of samples can be advantageous under certain circumstances, but not at the expense of biological replication. Independent biological samples are always required, regardless of whether the samples were derived from pools or from individuals.
Avoidance of Arbitrary Data Analysis Procedures and Control of the False Discovery Rate
Size and complexity are major obstacles to overcome in the analysis of microarray data sets. For example, the output for a single replicate with the ~15,000-gene bovine cDNA microarray that we have used (Suchyta et al., 2003) contains more than 19,000 rows and 100 columns of data. This can be overwhelming. Thus, it is tempting for investigators to employ simple, arbitrary methods of microarray data analysis, such as calculation of the mean fold-change (differential expression) alone or the use of individual t-tests. Such approaches are not advisable (Allison et al., 2006). For example, use of the mean fold-change to select genes of interest will yield arbitrary results with no associated degree of confidence. Moreover, the use of independent, gene-specific t-tests (or ANOVA) may generate unreliable variance estimates, especially in situations with a limited number of data points for each gene. In such cases, small fold-changes may be called statistically significant by chance because the estimated variability among samples happens to be spuriously small. Likewise, important differential expression among samples may be missed because of overestimated variances. To overcome this problem, significance testing approaches that combine information across genes, for example by using shrinkage estimation of variance components, are advised (for example, see Cui et al., 2005).
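The idea of borrowing information across genes can be illustrated with a minimal sketch. It is not the estimator of Cui et al. (2005) or of any particular package; it simply pulls each gene's variance toward the average variance across genes, with a hypothetical weight `prior_df` that is fixed by hand here, whereas real shrinkage methods estimate the amount of shrinkage from the data. All function names and the numbers are made up for illustration.

```python
import statistics


def shrink_variances(gene_variances, df_gene, prior_df):
    """Shrink each gene's sample variance toward the across-gene mean variance.

    prior_df is a hypothetical tuning value: it behaves like extra degrees of
    freedom borrowed from the other genes on the array.
    """
    pooled = statistics.fmean(gene_variances)  # common target variance
    return [
        (prior_df * pooled + df_gene * s2) / (prior_df + df_gene)
        for s2 in gene_variances
    ]


def t_like_statistic(mean_diff, variance, n_per_group):
    """Two-group t-like statistic for a given variance estimate."""
    standard_error = (variance * 2.0 / n_per_group) ** 0.5
    return mean_diff / standard_error


# Made-up per-gene variances of log ratios; 3 samples per group gives 4 df.
variances = [0.001, 0.04, 0.05, 0.06, 0.90]
shrunk = shrink_variances(variances, df_gene=4, prior_df=4)

for raw, new in zip(variances, shrunk):
    print(f"raw variance {raw:.3f} -> shrunken variance {new:.3f}")

# The gene with the tiny raw variance no longer produces an inflated statistic.
print(t_like_statistic(0.2, variances[0], 3), t_like_statistic(0.2, shrunk[0], 3))
```

In this toy example the very small variance is raised and the very large one is lowered, so a modest fold-change is less likely to be called significant purely because its variance was underestimated.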
Another central issue with microarray data analysis is the multiple testing problem. Because thousands of genes are tested in a single experiment, large numbers of false positives are expected even if there is no differential expression at all. For example, suppose that RNA isolated from a single bovine corpus luteum is divided into 2 aliquots, and each aliquot is subjected to microarray analysis using the previously described bovine cDNA array with approximately 15,000 genes represented (Suchyta et al., 2003). Because the 2 RNA aliquots used for microarray analysis were derived from the same sample (i.e., self-self hybridization), there should be no real differences in gene expression. However, with a gene-wise type I error rate set at 0.05, one would expect 0.05 × 15,000 = 750 false positives to be detected in this experiment.
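The arithmetic in the self-self hybridization example can be checked with a short simulation. The sketch below assumes that, when no gene is truly differentially expressed, each gene's P-value is uniformly distributed between 0 and 1 and that the tests are independent; both are simplifying assumptions, and the function name and seed are arbitrary.

```python
import random


def count_false_positives(n_genes, alpha, seed):
    """Count tests called significant when every null hypothesis is true."""
    rng = random.Random(seed)
    # Under the null hypothesis, P-values are uniform on [0, 1).
    p_values = [rng.random() for _ in range(n_genes)]
    return sum(p < alpha for p in p_values)


n_genes, alpha = 15_000, 0.05
expected = alpha * n_genes
observed = count_false_positives(n_genes, alpha, seed=1)
print(f"expected about {expected:.0f} false positives; this simulated run gave {observed}")
```

Any single run will differ somewhat from the expected count, but repeated runs scatter around 750, which is the point of the example: hundreds of genes appear "significant" although nothing differs between the aliquots.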
To control the number of false positives when performing multiple tests, the significance level adopted for each test should be more stringent than the desired overall significance level. Traditional multiple testing correction approaches that control the experiment-wise significance level (i.e., the probability of even a single false positive) have been shown to be too conservative, decreasing the power of the experiment to undesirable levels and consequently increasing the number of false negatives. For large-scale multiple testing situations, such as microarray experiments, it is more important and more sound to control the false discovery rate (FDR), defined as the expected proportion of false positives among all tests declared significant (Benjamini and Hochberg, 1995; Storey and Tibshirani, 2003).
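As an illustration of FDR control, the following is a minimal sketch of the step-up procedure of Benjamini and Hochberg (1995): sort the m P-values, find the largest rank k for which the k-th smallest P-value is at most (k/m) times the target FDR, and declare the k smallest P-values significant. The function name and the 10 P-values are made up for the example; in practice an established package would be used.

```python
def benjamini_hochberg(p_values, fdr_level):
    """Return a list of booleans, True where the test is declared significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first

    # Largest rank k with p_(k) <= (k / m) * fdr_level.
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * fdr_level:
            largest_rank = rank

    significant = [False] * m
    for index in order[:largest_rank]:
        significant[index] = True
    return significant


# Made-up P-values for 10 genes.
p_values = [0.0001, 0.0004, 0.0019, 0.0095, 0.0201,
            0.0278, 0.0298, 0.0344, 0.0459, 0.3240]

fdr_calls = benjamini_hochberg(p_values, fdr_level=0.05)
# Experiment-wise (Bonferroni-type) control for comparison: p <= 0.05 / m.
strict_calls = [p <= 0.05 / len(p_values) for p in p_values]

print("significant at FDR 0.05:", sum(fdr_calls))
print("significant with experiment-wise control:", sum(strict_calls))
```

With these numbers the FDR procedure retains more genes than the experiment-wise correction, which reflects the gain in power described above.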
Freely available software for the analysis of microarray experiments, including significance tests using shrinkage-based procedures (such as the LIMMA and R/MAANOVA packages) and FDR approaches, can be found at the Bioconductor Web site (http://www.bioconductor.org/; last accessed Oct. 3, 2006).
Interpretation of Biological Themes
After successful design and execution of a microarray experiment, with appropriate biological replication and control of the FDR during statistical analysis, investigators are still left with the exciting, yet overwhelming, task of elucidating the biological significance of gene lists in which hundreds or even thousands of differentially expressed genes are represented. Such gene lists represent a myriad of unorganized findings. It is tempting to spend months staring at a data set and performing individual PubMed searches to try to intuitively interpret biological themes de novo. It is also tempting to organize genes merely by relative fold-change or degree of differential expression, but such approaches yield no information about gene function and, on their own, do little to help interpret the findings.
However, a logical and systematic data analysis strategy to help delineate biological themes can relieve unnecessary anxiety and significantly reduce the amount of time spent trying to interpret microarray data. Such gene classification approaches are geared to reveal commonality in function that might not readily be apparent from a laborious, manual, literature-based approach alone. We have used publicly available software (Dennis et al., 2003) to group genes based on commonality in function (gene ontology) and to determine the frequency with which genes in each group are represented. Such an approach can, in itself, help reveal major biological themes within microarray data sets. Furthermore, we have successfully used a program named EASE (Hosack et al., 2003) to identify gene ontology categories that are overrepresented in gene lists from microarray data sets, i.e., categories whose genes appear at a significantly greater frequency than would be expected from the frequency with which genes of the given category are present on the array.
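The logic of an overrepresentation test can be sketched with the hypergeometric distribution: given how many genes on the array belong to a category, how likely is it that a gene list of a given size contains at least the observed number of category members by chance? The sketch below is a plain hypergeometric upper-tail probability, not the score that EASE itself reports, and the function name and counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(list_hits, list_size, array_hits, array_size):
    """P(X >= list_hits) when list_size genes are drawn without replacement
    from an array of array_size genes, array_hits of which are in the category."""
    total = comb(array_size, list_size)
    upper = min(list_size, array_hits)
    # comb() returns 0 when more genes are requested than are available,
    # so impossible combinations contribute nothing to the sum.
    tail = sum(
        comb(array_hits, k) * comb(array_size - array_hits, list_size - k)
        for k in range(list_hits, upper + 1)
    )
    return tail / total


# Hypothetical counts: 300 of 15,000 array genes (2%) belong to the category,
# and 12 of the 200 differentially expressed genes do (about 4 expected by chance).
p_value = hypergeom_upper_tail(list_hits=12, list_size=200,
                               array_hits=300, array_size=15_000)
print(f"overrepresentation P-value: {p_value:.2e}")
```

Because many categories are typically tested for each gene list, the resulting P-values are themselves subject to the multiple testing problem and would need a corresponding adjustment.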
Pathway analysis represents another useful tool to facilitate interpretation of biological themes. Using the KEGG pathway database (Kanehisa and Goto, 2000), we have obtained information on the representation of genes from microarray data sets within their respective biological pathways and on potential gene interactions. These gene classification approaches have revealed novel biological themes from our microarray data sets, leading to new hypotheses and investigations in model systems of interest that would not have otherwise been pursued (Patel et al., 2005), and have shortened the timeline from discovery/descriptive microarray studies to the formulation of specific hypotheses and the testing of gene function.
| SUMMARY |
|---|
| Footnotes |
|---|
2 Corresponding author: smithge7@msu.edu
Received for publication July 18, 2006. Accepted for publication September 17, 2006.
| LITERATURE CITED |
|---|