* Departments of Animal and Food Sciences and Range, Wildlife, and Fisheries Management, Texas Tech University, Lubbock 79409 and ARS, USDA, Bushland, TX 79012
| Abstract |
|---|
) level. Retrospective power analysis provides additional information about previous experiments that may be helpful in designing subsequent investigations. However, in retrospective power analyses, power is inversely related to observed significance level. Benefits of prospective and retrospective power analyses in beef cattle experiments are similar to those for other species; however, because of differences in the methods and conditions involved, considerations for the use of power test procedures are specific for beef cattle research. Retrospective power analyses were conducted on 78 published experiments and on two unpublished experiments. Experiments were compiled into categories that represented group (or pen) feeding, individual feeding, and metabolism studies. Estimated power in pen feeding experiments using randomized block designs (RBD, n = 30) was less than 0.80 for ADG and feed efficiency (FE), but not different from 0.80 for completely random designs (CRD, n = 4). Furthermore, estimated power was less for ADG than for FE in both design types. For individual feeding experiments using RBD (n = 4), power was not different from 0.80 for either ADG or FE; however, for CRD (n = 18), power was less than 0.80 for both ADG and FE. Power was similar for ADG and FE for both RBD and CRD in individual feeding experiments. In metabolism experiments, estimated power for nitrogen retention was less than 0.80 for Latin square designs (n = 20) but not for CRD (n = 4). Comparisons of power between experimental design types were likely influenced by the number of experiments involved. These results indicate that retrospective power in beef cattle experiments is affected by design type, and response variable measured.
Key Words: Beef Cattle, Experimental Design, Statistics
| Introduction |
|---|
Although several publications are available on the subject of power tests for use by researchers (Steidl et al., 1997; Gerard et al., 1998; Novak and Haslberger, 2000; Kuiper et al., 2002), none specifically addresses beef cattle experiments and the associated factors that make them unique. Shortcomings of power calculations as a means to evaluate research also have been reported (Thomas, 1997; Hoenig and Heisey, 2001). Our objectives were to analyze data from publications of beef cattle research to 1) estimate retrospective power for selected response variables and 2) compare retrospective power between types of experiments (pen, individual, metabolism) and types of experimental design (completely random designs, randomized block designs, and Latin square designs).
| Methods |
|---|
Both prospective and retrospective power analyses are explored in this paper. Prospective power analysis is reviewed through a Monte Carlo exercise. This is explained more thoroughly in the Results and Discussion section. The procedures used to retrospectively analyze power in experiments reviewed are explained below.
Retrospective Power Analysis
To estimate power retrospectively, the following information was used: 1) an observed F-statistic (and its degrees of freedom) associated with a test of the null hypothesis of no treatment effect in the original data set, 2) a stated α level, and 3) the noncentrality parameter associated with the F-statistic. The observed F-statistic can be calculated using treatment means, sample sizes, and the standard error of the mean. This calculation assumes that variances of treatment means are homogeneous and, in the case of randomized block designs, that both block and treatment effects are fixed.
There are several estimators of the noncentrality parameter. We used the estimator described by Johnson et al. (1995):
$$\hat{\lambda} = \frac{v_1 (v_2 - 2) F}{v_2} - v_1 \qquad [1]$$
where v1 and v2 are numerator and denominator degrees of freedom of the calculated F-statistic. Power is given by (Graybill, 1976):
$$\text{Power} = \int_{F_{\alpha; v_1, v_2}}^{\infty} F(w : v_1, v_2; \lambda)\, dw \qquad [2]$$
where $F_{\alpha; v_1, v_2}$ = the upper α probability point of the central F-distribution with v1 and v2 degrees of freedom; F(w: v1, v2; λ) is the probability density function of the noncentral F-distribution; and w is the F-statistic; when $\hat{\lambda}$ is used in Eq. 2, then power is estimated. It should be noted that $\hat{\lambda}$ (Eq. 1) is an unbiased estimator of the noncentrality parameter. However, negative estimates are possible, in which case power cannot be estimated; in these cases, $\hat{\lambda}$ was set to zero (this biases the estimator). Other common estimators of the noncentrality parameter, namely $\hat{\lambda}_1$ = v1F (used by the software "GPOWER") and $\hat{\lambda}_2$, which is computed from t, TRTMS, and MSE, where t = number of treatments, TRTMS = the treatment mean square, and MSE = the experimental error mean square (e.g., Winer et al., 1991; Kirk, 1995), are also biased estimators of the true noncentrality parameter.
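The two-step calculation described above (estimate the noncentrality parameter, then integrate the noncentral F-distribution beyond the critical value) can be sketched in a few lines of Python. This is an illustration only, not the authors' program. It assumes that Eq. 1 has the usual unbiased form v1(v2 − 2)F/v2 − v1, it evaluates the noncentral F tail as a Poisson mixture of beta tails, and it is restricted to an even numerator df (e.g., three treatments) so that each beta tail is a finite sum. All function names are hypothetical.

```python
import math

def _beta_tail(y, b, a):
    """Regularized incomplete beta I_y(b, a) for a positive integer a."""
    term, total = 1.0, 0.0
    for i in range(a):
        total += term
        term *= (b + i) / (i + 1) * (1.0 - y)
    return (y ** b) * total

def noncentral_f_sf(x, v1, v2, lam):
    """P(F' > x) for a noncentral F with v1, v2 df and noncentrality lam.

    Assumption: v1 is even, so the Poisson-mixture terms are finite sums.
    """
    if v1 % 2 != 0:
        raise ValueError("this sketch only handles even numerator df")
    y = v2 / (v2 + v1 * x)
    half = lam / 2.0
    weight, cum, total, j = math.exp(-half), 0.0, 0.0, 0
    while cum < 1.0 - 1e-12 and j < 2000:
        total += weight * _beta_tail(y, v2 / 2.0, v1 // 2 + j)
        cum += weight
        j += 1
        weight *= half / j          # next Poisson(lam / 2) probability
    return total

def f_critical(alpha, v1, v2):
    """Upper-alpha point of the central F-distribution, found by bisection."""
    lo, hi = 0.0, 1.0
    while noncentral_f_sf(hi, v1, v2, 0.0) > alpha:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if noncentral_f_sf(mid, v1, v2, 0.0) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def retrospective_power(f_obs, v1, v2, alpha=0.05):
    """Return (estimated noncentrality, estimated power) for an observed F."""
    lam_hat = max(0.0, v1 * (v2 - 2) * f_obs / v2 - v1)  # negatives set to zero
    return lam_hat, noncentral_f_sf(f_critical(alpha, v1, v2), v1, v2, lam_hat)

if __name__ == "__main__":
    # Illustrative call: an observed F with 2 and 15 df.
    print(retrospective_power(8.0699, 2, 15))
```

When the estimated noncentrality parameter is truncated at zero, the estimated power equals the α level, which is the smallest value this estimator can return.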
For each of the studies in the database, power was estimated using Eq. 2 with the noncentrality parameter estimated by Eq. 1. This estimate of power was then used for further analyses. In particular, we wished to provide descriptive information about estimated power in beef cattle research as it is related to type of experimental design, kind of experiment (e.g., pen-fed animals, individually fed animals, and metabolism trials), and response variable (e.g., ADG, feed efficiency [FE], and nitrogen retention). We are not aware of any theoretical studies of the distribution of power as a random variable. Thus, in our statistical tests that used power as the dependent variable, we employed nonparametric tests. For example, many texts suggest that power of 0.80 or greater is desirable. In our analysis, we tested the hypothesis that power equals 0.80 with a one-sample, Wilcoxon signed-ranks test. Similarly, when we compared power of ADG to power of FE, we used a Wilcoxon paired-samples test, and when we compared power in randomized block designs (RBD) to power in completely random designs (CRD) for a particular response variable (e.g., ADG), we used a Kruskal-Wallis test.
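As an illustration of the one-sample test against 0.80, the following Python sketch computes the Wilcoxon signed-rank statistic and an exact two-sided P-value by enumerating the null distribution of the rank sum. The power values in the example are hypothetical and are not taken from the database, and the function name and the exact (rather than normal-approximation) P-value are choices made for this sketch; the software actually used for the published tests is not stated here.

```python
def wilcoxon_signed_rank(values, hypothesized):
    """One-sample Wilcoxon signed-rank test of median == hypothesized.

    Returns (W_plus, exact two-sided P-value). Zero differences are dropped
    and tied absolute differences receive midranks.
    """
    diffs = [v - hypothesized for v in values if v != hypothesized]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n                      # doubled midranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):         # positions i..j share one midrank
            ranks2[order[k]] = i + j + 2
        i = j + 1
    w_plus2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # Number of sign assignments giving each doubled rank sum (subset-sum DP).
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    n_outcomes = 2 ** n
    lower = sum(counts[: w_plus2 + 1]) / n_outcomes
    upper = sum(counts[w_plus2:]) / n_outcomes
    return w_plus2 / 2.0, min(1.0, 2.0 * min(lower, upper))

if __name__ == "__main__":
    # Hypothetical retrospective power estimates from eight experiments.
    estimated_power = [0.35, 0.52, 0.61, 0.44, 0.91, 0.28, 0.73, 0.15]
    print(wilcoxon_signed_rank(estimated_power, 0.80))
```

A small P-value together with a small positive rank sum would indicate that estimated power tends to fall below 0.80.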
| Results and Discussion |
|---|
) level. For example, suppose a researcher is studying ADG by 300-kg crossbred steers. Even when steers of a similar breeding and weight class are fed the same control diet, ADG will not be the same for each animal because of variability resulting from experimental error (i.e., variation among experimental units treated the same). Now, suppose that it is known that the variance in ADG among animals treated alike is σ² = 0.01. If the researcher is interested in studying ADG by animals fed three different diets, he or she randomly selects animals from the target population (300-kg crossbred steers) and randomly assigns them to diets, with r animals per diet. The experimental design is therefore a CRD, with t = 3 treatments and r replications per treatment. Under the assumptions that experimental errors are normally and independently distributed with a common variance in each treatment, an ANOVA and its accompanying F-test are used to test the null hypothesis that there is no difference in mean ADG among the three treatments. The alternative hypothesis is that the three treatment means are not all equal.
Two distinct situations are possible. First, suppose that there is no true difference in ADG among the three diets (i.e., the null hypothesis is true). Because of variability in ADG, even among animals treated alike, it is possible that an experiment might yield data that would lead the investigator to reject the null hypothesis that there is no difference in ADG among the three diets; this would be a Type I error, and its probability of commission is controlled by selecting a desired α level, typically α = 0.05.
A second kind of error is possible. Suppose that there actually is a difference in ADG among treatments. Again, because of experimental error, it is possible that an experiment might yield data that would lead the investigator not to reject the null hypothesis; this would be a Type II error, and its probability is denoted ß.
Clearly, it is undesirable to reject the null hypothesis when it is in fact true. Likewise, it is undesirable to conclude that there is no difference among treatment means when in fact a true difference exists. This leads to the concept of statistical power: the power of a test is the probability of rejecting the null hypothesis when it is in fact false. Power is denoted by 1 − ß. It is clear that the power of a test depends, in part, on the exact nature of the inequality among treatment means.
Suppose that in addition to knowing that σ² = 0.01, the investigator knows that the true treatment means of ADG are µ1 = 0, µ2 = 0.1, and µ3 = 0.2, with an overall population mean of µ = 0.1. In this example, it is also true that the differences between each treatment mean and the grand mean are τ1 = µ1 − µ = −0.1; τ2 = µ2 − µ = 0.0; and τ3 = µ3 − µ = 0.1. Suppose that logistical considerations and available resources are such that it is possible to have r = 6 replications per treatment. Even though it might be known that treatment means differ, it is nevertheless possible that, because of experimental error, an experiment might yield data that would lead to the conclusion that there is no difference among treatment means (i.e., a Type II error). With this possibility in mind, the investigator might ask, "Given that I know that the true population means are different, and further given that ADG varies even among animals on the same diet, what is the probability of rejecting the null hypothesis with six replications per treatment?"
When the population parameters are known, power can be calculated. In this example, with σ² = 0.01 and the deviations τi of each treatment mean from the grand mean given above, the noncentrality parameter is (Graybill, 1976):
$$\lambda = \frac{E[SS(H_0)] - df(H_0)\,\sigma^2}{2\sigma^2} \qquad [3]$$
where E[SS(H0)] is the expected value of the sum of squares associated with the null hypothesis, df(H0) are the degrees of freedom associated with the null hypothesis, and σ² is the experimental error. For a CRD,
$$\lambda = \frac{r \sum_{i=1}^{t} \tau_i^2}{2\sigma^2}$$
when $\sum_{i=1}^{t} \tau_i = 0$. In this example, λ = 6. A final parameter, usually denoted φ, is calculated by $\phi = \sqrt{2\lambda / n_1}$, where n1 = df(H0) + 1. In this example, φ = 2.0. With φ, power can be determined by consulting Table T-11 in Graybill (1976). In this example, power is 0.8049; thus, the probability of rejecting the null hypothesis is 80.49%.
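The arithmetic behind these values can be written out step by step. The sketch below assumes the usual fixed-effects expectation of the treatment sum of squares in a CRD, E[SS(H0)] = (t − 1)σ² + rΣτi², and writes τi for the deviation of each treatment mean from the grand mean (a symbol chosen here for illustration).

```latex
\begin{align*}
\sum_{i=1}^{3} \tau_i^2 &= (-0.1)^2 + (0.0)^2 + (0.1)^2 = 0.02 \\
E[SS(H_0)] &= (3 - 1)(0.01) + 6(0.02) = 0.14 \\
\lambda &= \frac{0.14 - 2(0.01)}{2(0.01)} = 6 \\
\phi &= \sqrt{\frac{2(6)}{3}} = 2.0
\end{align*}
```

Under the alternative convention in which the noncentrality parameter is rΣτi²/σ² (without the factor of 2 in the denominator), the same example gives a value of 12; the two conventions lead to the same φ and the same power.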
The foregoing example can be empirically supported using Monte Carlo simulation methods. The IML procedure of SAS (SAS Inst., Inc., Cary, NC) was used to write a program that would generate experimental errors from a normal distribution with a variance of 0.01. Six experimental errors were assigned randomly to each of three treatments; then, a value of 0.1 was added to each experimental error in Treatment 2, and a value of 0.2 was added to each experimental error in Treatment 3. With randomly generated data, an ANOVA was completed with an F-statistic. This process was repeated 100,000 times in the Monte Carlo simulation.
Out of these 100,000 simulated experiments, the null hypothesis was rejected 80,497 times. Thus, the empirical power of the test statistic was 80.497%. The theoretical power, following Graybill (1976; see above) was 80.49%. We conclude that when population parameters are known, and assumptions underlying the F-test are satisfied, power can be determined.
A total of 100,000 experiments were simulated in the Monte Carlo exercise, and the null hypothesis was rejected in 80.497% of them. Clearly, power is a concept that applies to the population of experiments.
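The Monte Carlo exercise can be approximated with a short Python script. This is a minimal sketch that uses the standard-library random number generator rather than the IML procedure of SAS, so individual results will not match the published run exactly. It assumes three treatments, for which the critical F value has a closed form (two numerator df), and it uses fewer simulated experiments by default to keep the run time short. The function name and the seed are arbitrary.

```python
import random

def simulate_power(means, sigma2, r, alpha=0.05, n_sim=20000, seed=1):
    """Fraction of simulated CRD experiments in which the F-test rejects H0."""
    t = len(means)
    if t != 3:
        raise ValueError("closed-form critical value assumes t = 3 treatments")
    rng = random.Random(seed)
    v1, v2 = t - 1, t * (r - 1)
    # For 2 numerator df, P(F > x) = (1 + 2x / v2) ** (-v2 / 2); solve for x.
    f_crit = (v2 / 2.0) * (alpha ** (-2.0 / v2) - 1.0)
    sd = sigma2 ** 0.5
    rejections = 0
    for _ in range(n_sim):
        groups = [[rng.gauss(m, sd) for _ in range(r)] for m in means]
        group_means = [sum(g) / r for g in groups]
        grand_mean = sum(group_means) / t
        ss_trt = r * sum((gm - grand_mean) ** 2 for gm in group_means)
        ss_err = sum((x - gm) ** 2
                     for g, gm in zip(groups, group_means) for x in g)
        f_stat = (ss_trt / v1) / (ss_err / v2)
        if f_stat > f_crit:
            rejections += 1
    return rejections / n_sim

if __name__ == "__main__":
    # Treatment means 0, 0.1, and 0.2; error variance 0.01; six replications.
    print(simulate_power([0.0, 0.1, 0.2], 0.01, 6))
```

With the parameters of the worked example, the rejection rate should fall close to 0.80; setting all three means equal should give a rate close to the α level.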
For each of these experiments, power also can be retrospectively estimated. It is instructive to consider the possibilities that emerge from post hoc power estimation. For example, in one of the simulated experiments, the calculated F-statistic was F = 8.0699. This result can be used to estimate the noncentrality parameter using Eq. 1, from which power can be estimated using Eq. 2. For this particular simulated experiment, power is retrospectively estimated to be 80.49023%, a value very close to the true power in this example. However, another of the experiments in the Monte Carlo simulation yielded a calculated F-statistic of F = 14.731091, with an estimated power of 98.057%. And still another of the simulated experiments yielded a calculated F-statistic of F = 1.2001352, with an estimated power of 5.499%.
These results clearly demonstrate that estimated power is a random variable. Thus, any given set of experimental data can be used to estimate power, and the resulting estimate may or may not be close to true power. This is a consequence of the inherent variability in the experimental errors associated with the observed dependent variable.
A Disclaimer on Interpretation of Retrospective Power.
Retrospective power is related to observed significance level (Figure 1). It is important to appreciate that when the observed significance level is large (i.e., nonsignificant), then retrospectively estimated power will be low; conversely, when the observed significance level is small (i.e., significant), then retrospectively estimated power will be high. Hoenig and Heisey (2001) provided a theoretical explanation of this relationship, and Figure 1 provides empirical support. The importance of this relationship between observed significance level and retrospectively estimated power cannot be overstated. For example, some practitioners advocate estimating power following a nonsignificant test result, with the following in mind: if power is relatively high, then this tends to support the null hypothesis (i.e., because power is relatively high, then one would probably have found a difference if it existed, but because the null was not rejected, then there probably was not a difference). In a similar manner, if observed power is relatively low and the null hypothesis is not rejected, then a common interpretation is that perhaps there may actually have been a difference, but the low power of the test made it unlikely to reject the null hypothesis. Hoenig and Heisey (2001) termed this "thought experiment" the "power approach paradox." Given the relationship between observed significance level and retrospective power estimation (Figure 1), it is clear that "computing the observed power after observing the P-value should cause nothing to change our interpretation of the P-value" (Hoenig and Heisey, 2001).
σ² = 0.01. Further, suppose that the investigator would like to conduct an experiment (using a CRD) studying differences in ADG among three treatments. The investigator wishes to know how many experimental units are needed so that a difference between the largest and the smallest means of 0.2 kg is statistically significant at the 5% significance level with a power of 80%. Thus, four pieces of information are needed: 1) experimental error; 2) the maximum difference between treatments that one wishes to detect; 3) a significance level; and 4) power. With this information, the tables provided by Bowman and Kastenbaum (1975) can be used to determine the necessary sample size. In particular, following Bowman and Kastenbaum (1975), let µmax = 0.20 and µmin = 0.00. Then the standardized difference (µmax − µmin)/σ = 2 for the present example. With k = 3 treatments, α = 0.05, and ß = 0.20, Table 1 gives a sample size of 6 replications per treatment. The Monte Carlo simulation described above empirically confirms the accuracy of the Bowman and Kastenbaum (1975) tables.
$$m^* = 2\left(\frac{\sigma}{\delta}\right)^2 \left[\sqrt{\chi^2_{1-\alpha;\,t-1} - (t - 2)} - z_\beta\right]^2$$
where $\chi^2_{1-\alpha;\,t-1}$ is the 100(1 − α) percentile point of the χ² distribution with (t − 1) df; zß is the ß percentile point of the standard normal distribution; σ is the square root of the experimental error variance, and δ is the difference between the largest and smallest of t treatment means. The sample size needed per treatment is r = [m*] + 1, where [·] is the greatest integer function. In this example, m* = 4.73, so the approximate sample size is five replications per treatment.
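The approximation can be checked numerically. The following Python sketch assumes that the approximation has the form m* = 2(σ/δ)²[√(χ² − (t − 2)) − zß]², a form that reproduces the m* = 4.73 reported for this example. The helper functions for the χ² distribution (a series for the cumulative probability and a bisection search for the percentile point) are written here for illustration only and are not part of the original analysis.

```python
import math
from statistics import NormalDist

def chi2_cdf(x, df):
    """Chi-square cumulative probability via the lower incomplete gamma series."""
    if x <= 0.0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total) and n < 10000:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a))

def chi2_ppf(p, df):
    """Percentile point of the chi-square distribution, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def approx_replications(sigma, delta, t, alpha=0.05, beta=0.20):
    """Return (m_star, r): the approximation and the replications per treatment."""
    chi2_point = chi2_ppf(1.0 - alpha, t - 1)
    z_beta = NormalDist().inv_cdf(beta)       # negative when beta < 0.5
    m_star = 2.0 * (sigma / delta) ** 2 * (
        math.sqrt(chi2_point - (t - 2)) - z_beta) ** 2
    return m_star, math.floor(m_star) + 1

if __name__ == "__main__":
    # sigma = sqrt(0.01) = 0.1 and delta = 0.2, with three treatments.
    print(approx_replications(0.1, 0.2, 3))
```

Because m* scales with (σ/δ)², halving the detectable difference or doubling the error standard deviation roughly quadruples the number of replications required.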
Another method for determining the number of replicates required in a proposed animal experiment is detailed by Berndtson (1991). Similar to the previous methods, four pieces of information must be known in order to determine sample size. Three of these (difference between treatments, significance level, and power) are the same in each of these methods. However, in the Berndtson (1991) approach, a coefficient of variation involving the control treatment is used in tables he provided to estimate sample sizes. In addition to requiring a coefficient of variation that involves only the control treatment, his tables provide estimated sample sizes only for two-treatment experiments and only for situations involving a 5% significance level.
Because of inherent differences between research facilities, and thus in the experimental error associated with beef cattle experiments, it is to be expected that the estimated number of replications needed to achieve a desired power will vary between facilities. Continuing with the above example, suppose that a beef nutritionist wants to study ADG as affected by three treatments in a CRD; he or she wishes to know how many replications are needed to detect a difference among treatments if the largest difference between treatment means is 0.2 kg/d, with α = 0.05 and power = 0.80. She or he estimates experimental error at their research facility to be σ² = 0.0036. With a standardized difference of 3.33, the tables of Bowman and Kastenbaum (1975) give r = 3 replications for this experiment. Now, suppose that this nutritionist is collaborating with a colleague at a different facility, where the estimate of experimental error is σ² = 0.02. With a standardized difference of 1.414, the tables of Bowman and Kastenbaum (1975) give r = 11 replications for the experiment at the second facility. Thus, even when the same experiment is repeated at two different facilities, estimated sample sizes will be affected by the inherent variability in the response variable associated with each facility. Finally, it should also be appreciated that if true experimental error at the first facility was in fact σ² = 0.01, but a value of 0.0036 was used to estimate sample sizes, then an experiment with r = 3 replications would in reality have a power of 0.3857 rather than the nominal value of 0.80. Likewise, if the overestimate of 0.02 was used when the actual experimental error was 0.01, then an experiment with r = 11 replications would in reality have a power exceeding 0.99. For these reasons, it is imperative that researchers collect and retain historical data associated with their facilities, and that every effort is made to provide an accurate estimate of experimental error to be used in designing future experiments.
Pen-Feeding Experiments.
A description of experiments included in the pen-feeding database and retrospective power values for ADG and FE are given in Table 1. Thirty pen-feeding evaluations of animal performance in our database used RBD. Only four of the included studies used a CRD.
A comparison of the influence of design type on response variables and power data is presented in Table 2. In RBD, the estimated power associated with ADG was less (P < 0.001) than 0.80. Similarly, estimated power associated with FE was less (P < 0.001) than 0.80. In CRD, estimated power did not differ (P > 0.125) from 0.80 for either ADG or FE; however, sample sizes associated with these tests limit conclusions.
In studies of FE, estimated power did not differ (P > 0.520) for CRD and RBD. Nonetheless, there was an indication (P < 0.073) that RBD were more powerful than CRD when ADG was measured.
Individual Feeding Experiments.
A description of experiments included in the individual feeding database and retrospective power values for ADG and FE are given in Table 3. Eighteen studies of animal performance that were based on individual animals used a CRD, whereas only four studies used a RBD. A comparison of the influence of design type on response variables and power data is presented in Table 4. In RBD, estimated power did not differ (P > 0.125) from 0.80 for ADG or FE. However, in CRD, estimated power was less than 0.80 when ADG was measured (P < 0.003), as well as when FE was measured (P < 0.001).
In evaluation of FE, power did not differ (P > 0.853) in CRD and RBD. Similarly, in studies of ADG, power did not differ (P > 0.158) in CRD and RBD.
The relationship of estimated P-value to retrospective power in individual feeding experiments for ADG and FE is graphically presented in Figure 2.
It is also important to address the manner in which pen data are analyzed, especially the analysis of ADG and carcass data. In research settings, it is often tempting to analyze ADG and carcass measurements on an individual-animal basis because individual BW of each animal is known. When conducting pen-based studies, however, all variables should be analyzed with pen as the experimental unit. This is because treatments are applied to the pen of animals, not to the individual animal. Consider a completely randomized design with pen as the experimental unit; suppose that there are five animals per pen. In this setting, there are two sources of unexplained variation: 1) variation among animals within a pen (called "sampling error") and 2) variation among pens within a treatment ("experimental error"). Perhaps the most important assumption for the ANOVA is independence of experimental errors. When pens are randomly assigned to treatments, it is reasonable to assume independence of experimental errors; the correct error term for testing treatment effects is the experimental error mean square. It is also reasonable to suggest that social interactions among animals within a pen might result in correlated sampling errors. However, the nonindependence of sampling errors within a pen does not bias the F-test on the treatment effects when the experimental error mean square is used to test treatment effects (although it can affect the power of this test). When individual animals are used as the experimental unit instead of pen, this leads to an analysis wherein the errors comprising the error (residual) mean square (i.e., the denominator of the F-test) are not independent. This violation of independence has serious consequences on the performance of the F-statistic in detecting differences among treatment means. Therefore, it is imperative that the F-test on the treatment effects use the appropriate error mean square.
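The difference between the two analyses can be illustrated with a small Python sketch. The ADG values below are hypothetical (two treatments, three pens per treatment, five animals per pen), and the function names are chosen for this example. With equal numbers of animals per pen, an ANOVA on pen means gives the same F-test as using the pen-within-treatment mean square as the error term; the animal-level analysis is shown only to contrast its inflated denominator df.

```python
def one_way_f(groups):
    """One-way ANOVA on a list of groups; returns (F, df_numerator, df_denominator)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df1, df2 = k - 1, n - k
    return (ss_between / df1) / (ss_within / df2), df1, df2

def pen_level_f(data):
    """Appropriate analysis: each pen mean is one observation."""
    pen_means = [[sum(animals) / len(animals) for animals in pens.values()]
                 for pens in data.values()]
    return one_way_f(pen_means)

def animal_level_f(data):
    """Inappropriate analysis: animals treated as independent experimental units."""
    animals = [[x for pen in pens.values() for x in pen]
               for pens in data.values()]
    return one_way_f(animals)

# Hypothetical ADG (kg/d): treatment -> pen -> animals within the pen.
adg = {
    "control": {
        "pen1": [1.40, 1.52, 1.45, 1.38, 1.50],
        "pen2": [1.55, 1.60, 1.48, 1.58, 1.62],
        "pen3": [1.35, 1.42, 1.30, 1.44, 1.39],
    },
    "treated": {
        "pen4": [1.60, 1.66, 1.58, 1.71, 1.65],
        "pen5": [1.50, 1.57, 1.49, 1.55, 1.61],
        "pen6": [1.72, 1.68, 1.75, 1.80, 1.70],
    },
}

if __name__ == "__main__":
    print("pen as experimental unit:   ", pen_level_f(adg))
    print("animal as experimental unit:", animal_level_f(adg))
```

In this layout the pen-level test has 1 and 4 df, whereas the animal-level test has 1 and 28 df, which shows how treating animals as experimental units overstates the amount of independent information.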
Metabolism Experiments.
Our database included 20 studies of nitrogen retention that used a Latin square and four studies that used a CRD (Table 5). A comparison of the influence of design type on nitrogen retention and power data for metabolism experiments is presented in Table 6. For Latin squares, estimated power was less (P < 0.002) than 0.80. However, for CRD, estimated power did not differ (P > 0.625) from 0.80. There was no difference (P > 0.844) in estimated power between Latin squares and CRD.
| Implications |
|---|
| Footnotes |
|---|
2 Appreciation is expressed to M. S. Brown, West Texas A&M Univ., Canyon, for assistance in compiling the pen-feeding database.
3 Correspondence: Box 42141 (phone: 806-742-2516; fax: 806-742-4003; E-mail: reed.richardson{at}ttu.edu).
Received for publication July 10, 2003. Accepted for publication October 8, 2003.
| Literature Cited |
|---|