Department of Animal Science, Michigan State University, East Lansing 48824-1225
| Abstract |
|---|
Key Words: Bioequivalence Testing, Dairy Science, Experimental Design, Genetically Modified Feedstuffs, Sample Size
| Introduction |
|---|
The primary intent of this paper is to present hypothesis-testing procedures when the working hypothesis is mean equivalence. Statistical methods for bioequivalence testing have been well developed and extensively applied in pharmaceutical research (FDA, 2001). Experimental design and statistical analysis issues pertinent to both classical and bioequivalence testing on mean differences are reviewed in this paper, indicating subtle differences in design choices between the two testing scenarios. Power-of-test computations are also illustrated for both hypothesis-testing paradigms based on replicated Latin square and repeated crossover designs. The utility of observed or retrospective power for conclusions on equivalence is also criticized. Additional design and analysis considerations that may help refine classical and bioequivalence power of test in future dairy nutrition studies are discussed. Finally, some reporting recommendations for future studies are provided.
| Experimental Design |
|---|
In animal feedstuff cropping designs, test and control hybrids are typically assigned to plots within the same field or grown in adjacent fields; nevertheless, in many cases, only one or perhaps two plots per hybrid are considered. The instructions to authors for the Journal of Animal Science and the Journal of Dairy Science in 2003 explicitly define the experimental unit as "the smallest unit to which an individual treatment is imposed." Applied to feedstuff comparisons, the experimental unit definition therefore includes not only plot but also silo or silage bag if fermentation conditions differ considerably between silos (Gill, 1981). Furthermore, this definition might be logically expanded to include year of crop, as feedstuff comparisons involving GM crops have been shown to be influenced by weather conditions, such as insect pressure, drought (Novak and Haslberger, 2000), or ambient temperature at harvest (Grant et al., 2003). As statistically significant and meaningful nutrient compositional differences have not yet otherwise been generally detected between GM and conventional dairy feedstuffs (Donkin et al., 2003; Folmer et al., 2002; Grant et al., 2003; Ipharraguerre et al., 2003), it is likely that potential confounding factors due to the use of only one or two plots, silos, or cropping years per hybrid within each study have not yet become issues, with the possible exception of Grant et al. (2003). However, enough ambiguity in experimental unit definition may exist for some studies that a meta-analysis approach, which treats each individual study as an experimental unit (St-Pierre, 2001), should be advocated to periodically formalize consensus on substantial equivalence. Naturally, any experimental design issues that pertain to feedstuff nutrient profile comparisons would also influence animal performance comparisons. This was recognized by Grant et al. (2003), who harvested their GM corn silage at a very high ambient temperature, resulting in a crop with a high DM content, thereby adversely affecting fermentation conditions and subsequent DMI and milk production in cows fed the GM silage.
Rigorous designs, such as the incomplete block or lattice designs used in agronomy (Stroup, 2002), warrant greater consideration, although much larger plot sizes, fewer hybrids for comparison, and fewer replicates (plots) per hybrid would likely be required to facilitate animal feeding experiments. However, even with these robust designs, analyses based on spatial error models seem to be important. The use of conventional ANOVA often fails to adequately account for spatially correlated sources of variability (due to, e.g., fertility, moisture, or slope gradients), thereby leading to serious misrankings of crop varieties for yields (Littell et al., 1996; Stroup, 2002). Whether this might be also true for nutrient profile and livestock performance comparisons of GM vs. reference hybrids is unclear.
Feedstuff Comparisons for Animal Performance
The most statistically efficient experimental designs for dietary treatment comparisons involve blocking on cows to facilitate powerful intra-subject comparisons on treatments. The Latin square design entails such a structure whereby the two blocking factors are cow and period to facilitate comparisons on a third treatment factor. Logically, cow represents a random source of variability, whereas period may or may not, depending on the variability of DIM within a period as discussed later. Single Latin square designs do not allow the consideration of interaction among the three factors; however, replicated Latin square designs do facilitate investigation of treatment x period interactions if the same periods are considered within each square. This issue is not trivial because, for example, mean milk yield differences between diets may depend on the stage of lactation (Longuski et al., 2000). Nevertheless, a large number of studies, including GM vs. reference hybrid studies, have not considered this potentially important interaction.
Repeated crossover designs are potentially better suited for bioequivalence studies compared to replicated Latin square designs. Let us suppose that the two treatments to be considered are labeled A and B. A two-period, two-treatment crossover design is synonymous with a replicated two-period Latin square design in which half of the cows are assigned the treatment sequence AB over the two periods whereas the remaining half are assigned to the sequence BA. With this nonrepeated crossover design, however, one is not able to separately infer upon potential carryover or residual effects due to the diet fed in the preceding period. Repeated crossover designs, on the other hand, are characterized by each animal receiving a treatment for more than one period. In the four-period two-treatment design, also called the double-reversal design (Gill, 1978), the two treatment sequences, randomly assigned among cows, might naturally be ABAB and BABA. If the fourth period is not considered within each of the two sequences, then a somewhat less powerful switchback design is created, that is, with sequences ABA and BAB. This design was considered for the assessment of glyphosate-tolerant corn vs. its isogenic reference by Donkin et al. (2003). The random effects variance component due to cow x hybrid interaction is estimable in both switchback and double-reversal designs, but not in the replicated Latin square design. By analogy, this interaction term is similar to subject x drug treatment interaction modeled in human bioequivalence studies on drug testing, which, if large, indicates highly variable subject-specific differences in drug efficacy. Treatment-specific residual variabilities are also more efficiently estimated in repeated crossover designs, thereby providing a potential indication of whether or not there are individual differences in variability of response between two or more treatments. 
If two treatments lead to similar mean responses, the more desirable treatment may be the one that leads to less variability in response. For these reasons and others, the double-reversal and switchback designs have been advocated (FDA, 2001) for bioequivalence testing in medical and pharmaceutical research. Variants of these designs have been popularized in dairy science since Lucas (1956; 1957).
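The random assignment of cows to treatment sequences described above can be sketched in a few lines. This is only an illustration: the function name `assign_sequences`, the cow identifiers, and the seed are hypothetical, and the even split of cows among sequences is an assumption consistent with the balanced designs discussed here.

```python
import random

def assign_sequences(cow_ids, sequences=("ABAB", "BABA"), seed=None):
    """Randomly split cows evenly among the given treatment sequences.

    Hypothetical helper: ("ABAB", "BABA") gives a double-reversal design,
    ("ABA", "BAB") gives a switchback design.
    """
    cows = list(cow_ids)
    if len(cows) % len(sequences) != 0:
        raise ValueError("number of cows must be a multiple of the number of sequences")
    rng = random.Random(seed)
    rng.shuffle(cows)                      # randomization step
    per_sequence = len(cows) // len(sequences)
    assignment = {}
    for s, sequence in enumerate(sequences):
        for cow in cows[s * per_sequence:(s + 1) * per_sequence]:
            assignment[cow] = sequence
    return assignment

# 24 cows, 12 per sequence
double_reversal = assign_sequences(range(1, 25), seed=1)
switchback = assign_sequences(range(1, 25), sequences=("ABA", "BAB"), seed=1)
```

Each character of a sequence is the treatment fed in the corresponding period, so dropping the fourth character of the double-reversal sequences yields the switchback sequences.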
On-farm studies may warrant important design considerations, particularly when production or feeding groups or pens of animals dictate those groups to be the experimental units. However, because these environments are not highly controlled, on-farm trials are not likely to be suitable for first-stage regulatory testing (St-Pierre and Jones, 1999) as it may pertain to the evaluation of GM crops.
| Statistical Models and Analyses |
|---|
$$y_{ijk} = \mu + \alpha_i + \beta_j + \alpha\beta_{ij} + c_k + e_{ijk} \qquad [1]$$

where $y_{ijk}$ represents the observation on cow k given treatment i at period j; $\mu$ is the overall mean; $\alpha_i$ represents the fixed effect of the ith treatment, i = 1, 2, . . . , $n_t$; $\beta_j$ represents the fixed effect of the jth period, j = 1, 2, . . . , $n_p$; and $c_k$ represents the random effect of the kth cow, k = 1, 2, . . . , $n_c$, with variance component $\sigma^2_c$. The period effect is considered fixed if it is nearly synonymous with stage of lactation, provided that variability in DIM is low within each period. Therefore, the interaction $\alpha\beta_{ij}$ between the ith treatment and jth period is also considered fixed and inferable, provided the same periods are considered within each Latin square. If squares are used as blocks for starting DIM (Grant et al., 2003), then treatment by period and treatment by square interactions are both inferable, but each term is then individually only weakly associated with treatment by stage of lactation interaction.

The mixed model in Eq. [1] is typically specified as if the residual terms $e_{ijk}$ are normally, independently, and identically distributed with variance $\sigma^2_e$. However, this Latin square design is just one particular example of a repeated measures design, whereby cows are observed over time, albeit with different treatment assignments at each period. Therefore, a richer class of covariance structures could be considered for residuals across periods within cows, such as a first-order autoregressive covariance structure in which residuals in adjacent periods are specified to be more highly correlated than residuals from distal periods. Details on using SAS PROC MIXED (SAS Institute, Inc., Cary, NC) for such specifications are provided in Littell et al. (1998) and in the Appendix of this paper. Also, carryover effects may be additionally modeled in Eq. [1] provided that the replicated Latin squares are orthogonal to each other and/or are Williams designs (Williams, 1949). Carryover effects can be estimated separately from treatment x period interaction effects if squares are balanced for carryover effects but not necessarily otherwise. Coding strategies for carryover effects for use in statistical analyses can be found in Kuehl (2000).
The Repeated Crossover Design
The statistical model for the repeated crossover design does not differ much from Eq. [1] for the replicated Latin square design except that the number of periods, say, np, exceeds the number of treatments nt. For nt = 2 treatments, np = 3 defines the switchback design whereas np = 4 defines the double-reversal design. Additionally, random treatment x cow interaction effects should be considered as well:
$$y_{ijk} = \mu + \alpha_i + \beta_j + \alpha\beta_{ij} + c_k + \alpha c_{ik} + e_{ijk} \qquad [2]$$

Here, the terms in Eq. [2] are defined similarly as in Eq. [1], with the additional term being $\alpha c_{ik}$, the random treatment x cow interaction effect corresponding to the ith treatment and kth cow with variance component $\sigma^2_{\alpha c}$. As with the Latin square design, alternative repeated measures covariance structures can be modeled across periods within cows (see Appendix herein).

The fact that the treatment x cow effect serves as the ANOVA experimental error term for testing the significance of the treatment effect in repeated crossover designs has often been ignored. Consequently, reported P-values in studies that have not recognized this inherent design feature may be overstated against the classical null hypothesis. As previously indicated, statistically significant and large values of $\sigma^2_{\alpha c}$ would indicate that differences in responses between GM and reference feedstuffs are not consistent across cows and may be due to potential individually (i.e., genotype) specific allergenicity, antinutrient, or feed preference effects.
Mixed Model Analyses
The ANOVA has effectively provided the statistical inference engine for the analyses of efficient designs, such as replicated Latin squares or crossover designs. However, there are some important distinctions to be made between the two most common ANOVA inferential strategies currently used in dairy nutrition studies, namely the ordinary least squares (OLS) approach using, say, PROC GLM of SAS, and the more recently developed true mixed model (MM) analysis using, say, PROC MIXED. For the two broad types of mixed models, [1] and [2], considered in this paper, the use of OLS may dramatically understate the estimated standard error of an estimated treatment mean (SEM) because OLS ignores the between-cow variability in such a determination. Admittedly, however, the estimated standard errors of the estimated mean differences (SED) between treatments may not differ between OLS and MM analyses, provided that the design remains balanced, that is, no data are missing. Nevertheless, there are two potentially important exceptions. First, OLS treats all effects, whether labeled as fixed or random, as fixed effects, potentially leading to substantial confounding between various effects. For example, in the double-reversal design in Eq. [2], cow effects are partially confounded with treatment x period effects if cow is treated as fixed; furthermore, cow x treatment effects are partially confounded with period effects if cow x treatment is treated as fixed. This confounding may represent the historical basis as to why treatment x period and cow x treatment have not been considered as sources of variation in earlier work using OLS. Similar estimability problems with OLS inference using PROC GLM have also been pointed out by Littell et al. (1996) and in the context of bioequivalence testing by LeRoux et al. (1998). These confounding issues do not arise with a true MM analysis because cow and cow x treatment effects are appropriately treated as random.
A second advantage of MM over OLS analysis pertains to the recovery of inter-block information, or in the context of this paper, inter-cow information. Ordinary least squares facilitates only intra-block comparisons of treatments, whereas MM analysis utilizes both intra-block and inter-block information (Littell et al., 1996). Although OLS and MM inference will often differ little, if at all, for treatment mean comparisons in balanced complete block designs, this may not necessarily be true for incomplete block designs, such as Latin squares, and particularly when some data are missing.
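The contrast between the SEM and the SED can be made concrete with a short worked derivation. It is a sketch under stated assumptions: a balanced replicated Latin square analyzed with a model containing a random cow effect and an independent residual, with $\sigma^2_c$ and $\sigma^2_e$ denoting the cow and residual variance components, $n_c$ the number of cows, and $\bar{y}_{i}$ the mean of the $n_c$ records (one per cow) on treatment i.

```latex
% Each treatment mean averages one record from each of the n_c cows,
% so both variance components contribute to its sampling variance:
\mathrm{Var}(\bar{y}_{i}) = \frac{\sigma^2_c + \sigma^2_e}{n_c}
% An analysis that treats cow as fixed bases the SEM on the residual
% variance alone, which is smaller whenever \sigma^2_c > 0:
\qquad \text{versus} \qquad \frac{\sigma^2_e}{n_c}
% In a treatment difference the cow effects cancel within cow,
% so the SED does not involve \sigma^2_c in a balanced design:
\mathrm{Var}(\bar{y}_{i} - \bar{y}_{i'}) = \frac{2\sigma^2_e}{n_c}
```

Under these assumptions the two analyses can agree on the SED while disagreeing on the SEM, which is the pattern described above for balanced data.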
| Hypothesis Testing and Power of Test |
|---|
$$H_0: \delta = 0 \quad \text{vs.} \quad H_1: \delta \neq 0 \qquad [3]$$

Here $\delta = \mu_T - \mu_R$ represents the difference of inferential interest between the GM or test crop mean $\mu_T$ and the mean $\mu_R$ of its isogenic or reference counterpart.

Classical power is defined as the probability of concluding a mean difference when such a difference truly exists. Stroup (1999) demonstrated how classical power computations for $\delta$ could be readily derived for a linear mixed model of any complexity (e.g., Eq. [1] or [2], above) using SAS PROC MIXED software. Additional details are provided, albeit in an agronomic context, by Stroup (2002). Although power is readily specified as a function of Type I error rate $\alpha$ (set throughout this article as 0.05), sample size or number of cows $n_c$, and $\delta$, the most difficult inputs to elicit for power determinations are the variance components. In both Latin square and repeated crossover designs, it is $\sigma^2_e$ and not $\sigma^2_c$ that vitally determines power of test, ignoring, for the moment, the existence of nonzero $\sigma^2_{\alpha c}$. Based on various statistical consulting experiences pertaining to Latin square studies involving milk production at Michigan State University, estimates of $\sigma^2_e$ have ranged from 2 to 10 (kg/d)$^2$. The lower limit is in close agreement with much earlier sample size and design work (Gill, 1969; Gill and Magee, 1976), whereas the upper limit might be more indicative of currently higher production environments, as evidence exists that variability increases with mean milk production. This relationship is manifested in Jensen (2001), who illustrates in a figure that $\sigma^2_e$ ranges from 2 (kg/d)$^2$ in late lactation to 14 (kg/d)$^2$ in early-lactation Canadian Jerseys, with larger residual variances being further associated with higher-producing multiparous animals.

As an example, we consider classical power for replicated 4 x 4 Latin squares (i.e., each square with four cows) in the spirit of recent GM dairy feedstuff studies (Folmer et al., 2002; Grant et al., 2003; Ipharraguerre et al., 2003), concentrating on power for $\delta$ involving only two of the four hypothetical treatments. We define sufficient power as being 0.80, or 80%. Figure 1 relates power to number of 4 x 4 Latin squares where $\sigma^2_e$ is either 2 (kg/d)$^2$ or 10 (kg/d)$^2$, with $\delta$ ranging from 1 to 4 kg/d. For $\sigma^2_e$ = 2 (kg/d)$^2$, four 4 x 4 squares (i.e., $n_c$ = 16 cows) would be adequate to detect $\delta \geq$ 2 kg/d with sufficient power, whereas eight squares ($n_c$ = 32 cows) would be required to detect $\delta$ = 1 kg/d. When $\sigma^2_e$ = 10 (kg/d)$^2$, eight squares would be insufficient to detect $\delta$ = 2 kg/d, whereas four squares would be insufficient to detect $\delta$ = 3 kg/d.
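The order of magnitude of these power statements can be checked with a rough calculation. The sketch below is not the noncentral-distribution computation of Stroup (1999); it is a large-sample normal approximation that assumes a balanced replicated Latin square in which the standard error of a treatment difference is sqrt(2 x residual variance / number of cows). The function name `classical_power_approx` and its argument names are hypothetical. Because it ignores the finite error degrees of freedom, it slightly overstates power.

```python
from math import sqrt
from statistics import NormalDist

def classical_power_approx(delta, var_e, n_cows, alpha=0.05):
    """Approximate two-sided power to detect a true mean difference `delta`.

    Assumes SED = sqrt(2 * var_e / n_cows) (balanced Latin square, one
    record per cow per treatment) and a normal approximation to the test.
    """
    sed = sqrt(2.0 * var_e / n_cows)
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    shift = abs(delta) / sed
    # Probability that the test statistic falls in either rejection tail
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Scenarios discussed in the text (squares x 4 cows per square)
eight_squares_low_var = classical_power_approx(delta=1.0, var_e=2.0, n_cows=32)
eight_squares_high_var = classical_power_approx(delta=2.0, var_e=10.0, n_cows=32)
four_squares_high_var = classical_power_approx(delta=3.0, var_e=10.0, n_cows=16)
```

Under this approximation the first scenario lands just above the 0.80 threshold and the other two fall below it, in line with the conclusions drawn from Figure 1.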
|
$\sigma^2_e$ is either 2 or 10 (kg/d)$^2$ with $\sigma^2_{\alpha c}$ = 0, but recognizing that the error degrees of freedom for testing the main effect of treatment are defined by the treatment x cow interaction term. Here, $n_c$ = 16 cows would be nearly sufficient to detect $\delta$ = 1 kg/d with $\sigma^2_e$ = 2 (kg/d)$^2$, whereas the same number of cows would be insufficient to detect $\delta$ = 2 kg/d when $\sigma^2_e$ = 10 (kg/d)$^2$. Note that, in general, the power of the double-reversal design is slightly greater than that for the replicated 4 x 4 Latin square design for the same number of cows and periods, albeit the number of records per treatment for each cow is twice as large in the double-reversal design, such that only half as many treatments can be studied for the same total number of records. Nevertheless, as previously indicated, this design feature confers additional advantages for inference on treatment x cow interaction, which may be pertinent for GM feedstuff studies.
|
$\delta = \mu_T - \mu_R$, say, $\theta_L$ and $\theta_U$, respectively. It may be sufficient to specify both limits with one parameter $\theta = \theta_U = -\theta_L$ such that the equivalence interval is symmetric about zero. Naturally, these limits should not be set by the statistician but by investigators and/or regulators (Schuirmann, 1987). The null and alternative hypotheses in the TOST procedure are written as

$$H_0: \delta \leq \theta_L \ \text{or} \ \delta \geq \theta_U \quad \text{vs.} \quad H_1: \theta_L < \delta < \theta_U \qquad [4]$$

Schuirmann (1987) demonstrates that rejection of the null hypothesis in Expressions [4] with a Type I error rate of $\alpha$ is based on observing the 100(1 − 2$\alpha$)% confidence interval for $\delta$ to fall within the bioequivalence limits [$\theta_L$, $\theta_U$]. If, for example, a Type I error rate of 5% is adopted for the bioequivalence hypothesis test, then one would construct a 90% confidence interval (CI) on $\delta$. If this CI falls within [$\theta_L$, $\theta_U$], then one can reject H0: $\delta \leq \theta_L$ or $\delta \geq \theta_U$ in favor of H1: $\theta_L < \delta < \theta_U$. As with classical hypothesis testing, failing to reject H0: $\delta \leq \theta_L$ or $\delta \geq \theta_U$ does not imply that $\delta \leq \theta_L$ or $\delta \geq \theta_U$ because a Type II error, that is, failing to correctly conclude equivalence, may have been committed.
As an example, consider the data from the double-reversal design involving 24 cows, 12 within each sequence, from Problem 8.9 in Gill (1978). Using a mixed effects analysis based on Eq. [2], the 90% CI on the mean difference between the two treatments is [0.20 kg/d, 0.87 kg/d]. Since this CI falls within [−1 kg/d, +1 kg/d], average bioequivalence is established within the limits of ± 1 kg/d between the two treatments with P < 0.05. Representative SAS PROC MIXED code that can be used to provide bioequivalence inference is provided in the Appendix herein.
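The decision rule in this example reduces to checking whether the confidence interval lies entirely inside the equivalence limits. A minimal sketch follows; the function name `tost_concludes_equivalence` is hypothetical, and the interval endpoints are taken as printed in the example above.

```python
def tost_concludes_equivalence(ci_lower, ci_upper, limit_lower, limit_upper):
    """Return True if a 100(1 - 2*alpha)% CI lies strictly inside the limits.

    With alpha = 0.05 the interval passed in should be a 90% CI on the
    mean difference between the test and reference treatments.
    """
    if limit_lower >= limit_upper:
        raise ValueError("lower equivalence limit must be below the upper limit")
    if ci_lower > ci_upper:
        raise ValueError("CI endpoints are out of order")
    return limit_lower < ci_lower and ci_upper < limit_upper

# 90% CI from the example, equivalence limits of +/- 1 kg/d
equivalent = tost_concludes_equivalence(0.20, 0.87, -1.0, 1.0)
```

With tighter limits of ± 0.5 kg/d the same interval would not support equivalence, because its upper endpoint exceeds 0.5 kg/d.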
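Bioequivalence power, or one minus the probability of making a Type II error for the hypothesis test in [4], can be readily determined. Extending the presentation of Phillips (1990), the two t-statistics for the bioequivalence test are

$$T_L = \frac{\hat{\delta} - \theta_L}{\mathrm{SED}(\hat{\delta})} \quad \text{and} \quad T_U = \frac{\hat{\delta} - \theta_U}{\mathrm{SED}(\hat{\delta})}$$

where SED($\hat{\delta}$) represents the estimated standard error of $\hat{\delta} = \hat{\mu}_T - \hat{\mu}_R$, with $\hat{\mu}_T$ and $\hat{\mu}_R$ corresponding to the least squares means (LSM) of $\mu_T$ and $\mu_R$, respectively. The required point estimates and standard errors are readily determined using appropriate mixed models software (e.g., SAS PROC MIXED).

In concordance with the TOST confidence interval, bioequivalence is concluded if $T_L \geq t_{1-\alpha,\nu}$ and $T_U \leq t_{\alpha,\nu}$, where $t_{1-\alpha,\nu}$ and $t_{\alpha,\nu}$ are the 100(1 − $\alpha$)% and 100$\alpha$% percentiles of a t-distribution with $\nu$ degrees of freedom as determined by the design. The probability of this occurrence is known as the bioequivalence power (Phillips, 1990), denoted by

$$\text{Power} = \Pr\left(T_L \geq t_{1-\alpha,\nu} \ \text{and} \ T_U \leq t_{\alpha,\nu}\right) \qquad [5]$$

Here $T_L$ and $T_U$ are joint random draws from a bivariate noncentral t-distribution with $\nu$ degrees of freedom and a correlation of 1. Note that the bioequivalence power in Expression [5] depends further upon noncentrality parameters

$$\lambda_L = \frac{\delta - \theta_L}{\mathrm{SED}(\hat{\delta})} \quad \text{and} \quad \lambda_U = \frac{\delta - \theta_U}{\mathrm{SED}(\hat{\delta})}$$

with SED($\hat{\delta}$) representing the true standard error of $\hat{\delta}$.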
As an illustration involving milk production based on the two designs discussed in this paper, we consider just one value of $\delta$, specifically $\delta$ = 0 kg/d. This specification ideally defines absolute mean bioequivalence. However, it is much more likely that bioequivalence will be established when $\delta$ is not truly 0 but small enough to be considered as being bioequivalent, say, for example, −0.5 kg/d < $\delta$ < 0.5 kg/d. Bioequivalence power for any combination of $\theta_L$, $\theta_U$, and $\delta$ can be readily computed provided that $\theta_L < \delta < \theta_U$. Figure 3 provides bioequivalence power determinations vs. number of 4 x 4 Latin squares for $\theta = \theta_U = -\theta_L$ = 0.5, 1.0, 1.5, and 2.0 kg/d when a) $\sigma^2_e$ = 2 (kg/d)$^2$ and for $\theta$ = 0.5, 1.0, 1.5, . . . , 4.0 kg/d when b) $\sigma^2_e$ = 10 (kg/d)$^2$. From Figure 3a, it can be seen that there is sufficient power to conclude treatment bioequivalence within equivalence limits of ± 1.5 kg/d with four 4 x 4 Latin squares when $\sigma^2_e$ = 2 (kg/d)$^2$. However, nine squares would be required to establish sufficient power to conclude average bioequivalence within equivalence limits of ± 1.0 kg/d. If $\sigma^2_e$ = 10 (kg/d)$^2$ (Figure 3b), four squares do not provide sufficient power to establish equivalence within ± 3.0 kg/d and seven squares are needed to establish equivalence within ± 2.5 kg/d.
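A rough check on these bioequivalence power statements is possible with a normal approximation. The sketch below does not evaluate the bivariate noncentral t probability of Phillips (1990); it assumes symmetric limits, a balanced replicated Latin square with standard error of a difference equal to sqrt(2 x residual variance / number of cows), and a known standard error, so it slightly overstates power for small experiments. The function name `tost_power_approx` and its arguments are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def tost_power_approx(limit, var_e, n_cows, delta=0.0, alpha=0.05):
    """Approximate probability that the 100(1 - 2*alpha)% CI falls within
    (-limit, +limit) when the true mean difference is `delta`.

    Assumes SED = sqrt(2 * var_e / n_cows) and a normal approximation.
    """
    sed = sqrt(2.0 * var_e / n_cows)
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha)
    power = (z.cdf((limit - delta) / sed - z_crit)
             + z.cdf((limit + delta) / sed - z_crit) - 1.0)
    return max(0.0, power)   # the two one-sided regions may not overlap

# Scenarios discussed in the text (number of cows = 4 x number of squares)
four_squares = tost_power_approx(limit=1.5, var_e=2.0, n_cows=16)
nine_squares = tost_power_approx(limit=1.0, var_e=2.0, n_cows=36)
seven_squares = tost_power_approx(limit=2.5, var_e=10.0, n_cows=28)
```

Under these assumptions the three scenarios exceed 0.80, whereas eight squares with limits of ± 1.0 kg/d, or four squares with limits of ± 3.0 kg/d at the larger residual variance, fall short, which agrees with the pattern read from Figure 3.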
|
$\sigma^2_e$ = 2 (kg/d)$^2$, 16 cows are nearly sufficient to establish equivalence within ± 1.0 kg/d with 80% power, whereas 16 cows would be minimally required to establish equivalence within ± 2.5 kg/d when $\sigma^2_e$ = 10 (kg/d)$^2$.
|
= 0. Observed power, also called retrospective, or post-hoc, power, has been provided by some statistical software as a supplementary analysis to establish bioequivalence. Observed power is the power of being able to detect a mean difference as large as that observed from a data analysis. A related computation is based on the detectable effect size; that is, based on the data analysis at hand, what effect size (i.e., mean difference) would be detectable for a given power of, say, 0.80? If this effect size is small, and the corresponding P-value for the classical hypothesis test exceeds $\alpha$, then plausible evidence, presumably, exists in favor of the classical null hypothesis of equivalence. As Hoenig and Heisey (2001) point out, this use of observed power has been tenaciously advocated by a large number of peer-reviewed scientific journals and statistical textbooks!

Unfortunately, drawing conclusions on equivalence based on observed power is logically inconsistent. As noted by Hoenig and Heisey (2001), nonsignificant P-values invariably correspond to low observed powers because observed power is a 1:1 function of the P-value. Reconsider the four 4 x 4 Latin squares design. Suppose that a 2 kg/d mean difference is considered to be within the limit of equivalence, analogous to the choice of $\theta$ = 2 kg/d for the bioequivalence-testing approach. Using the observed power strategy, one might then "reject" the classical alternative hypothesis H1: $\delta$ < 0 or $\delta$ > 0 if the estimated mean difference falls within the classical acceptance region for the test, and the power for detecting an effect of magnitude $\delta$ = 2 kg/d is, say, 80% or greater. Figure 5 is an adaptation of Figures 1 and 2 from Schuirmann (1987) but for the four 4 x 4 Latin squares design. The illustrated filled triangles represent the "rejection" regions that would lead to a conclusion of bioequivalence as a function of the estimated residual standard deviation $s_e = \sqrt{\hat{\sigma}^2_e}$ and the estimated mean difference $\hat{\delta}$ that determine this rejection region using either the observed power approach or the bioequivalence-testing approach of Schuirmann (1987). That is, for any values of $s_e$ and $\hat{\delta}$ that jointly fall within the illustrated triangle, one would conclude that −2 kg/d < $\delta$ < 2 kg/d based on either of the respective approaches.
|
$\hat{\delta}$ that would conclude support for the classical null hypothesis, not even $\hat{\delta}$ = 0 kg/d, because the observed power for detecting $\delta$ = 2 kg/d is always less than 0.80 for $s_e$ > 1.96 kg/d. The rejection region for this approach is also illogical. Consider the following: Exp. 1 is executed with relatively poor precision, say, $s_e$ = 1.8 kg/d, whereas Exp. 2 is characterized by $s_e$ = 0.5 kg/d. Suppose the same mean difference ($\hat{\delta}$ = 1 kg/d) is observed in both experiments. In Exp. 1, we would conclude that −2 kg/d < $\delta$ < 2 kg/d because $s_e$ = 1.8 kg/d and $\hat{\delta}$ = 1 kg/d fall within the rejection region for concluding equivalence. However, the more precisely executed Exp. 2 would fail to draw that same conclusion! This discrepancy is troubling because one should logically expect that an experiment with better precision should be more likely to conclude true equivalence than another experiment with the same estimated mean difference but higher $s_e$. However, if the null hypothesis is mean equivalence as in classical tests, more information (i.e., smaller variance components, larger sample sizes) can only strengthen a conclusion of a nonzero mean difference. That is, using the observed power approach, poorly executed experiments would be rewarded with a greater chance of concluding equivalence.
Figure 5b illustrates the rejection region for the Schuirmann bioequivalence-testing approach as a function of $s_e$ and $\hat{\delta}$, with $\theta$ = 2 kg/d. This figure is an adaptation of Figure 2 from Schuirmann (1987) for the four 4 x 4 Latin squares design. Recall that for $\alpha$ = 0.05, the bioequivalence null hypothesis H0: $\delta \leq -\theta$ or $\delta \geq \theta$ is rejected in favor of $-\theta < \delta < \theta$ if the 90% CI for $\delta$ falls within ± $\theta$. Note that, in contrast to the observed power approach illustrated in Figure 5a, the rejection region is widest for experiments with greater precision (smallest standard error); that is, experiments with greater precision are more likely to conclude bioequivalence. Consider again Exp. 1 and 2. Using the bioequivalence-testing approach, the conclusions are directly opposite to those based on the observed power approach. That is, Exp. 1 does not lead to a bioequivalence conclusion, whereas Exp. 2 with superior precision does. Schuirmann (1987) further demonstrates that the power of the bioequivalence-testing approach far exceeds that for the observed power approach; this can be visually deduced from the relative sizes of the two rejection regions in Panels a and b of Figure 5.
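The opposite conclusions reached for Exp. 1 and Exp. 2 can be reproduced with a short sketch. It rests on assumptions that are not part of the original analysis: a normal approximation in place of t-based critical values, and a standard error of the difference equal to the residual standard deviation times sqrt(2/16) for the four 4 x 4 Latin squares design. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()

def sed_from_se(s_e, n_cows=16):
    """Standard error of a treatment difference in a balanced Latin square."""
    return s_e * sqrt(2.0 / n_cows)

def observed_power_rule(delta_hat, s_e, limit=2.0, n_cows=16, alpha=0.05, target=0.80):
    """'Equivalence' if the classical test is not significant AND the power
    to detect a difference equal to `limit` is at least `target`."""
    sed = sed_from_se(s_e, n_cows)
    z_crit = Z.inv_cdf(1.0 - alpha / 2.0)
    not_significant = abs(delta_hat) / sed < z_crit
    power_at_limit = Z.cdf(limit / sed - z_crit) + Z.cdf(-limit / sed - z_crit)
    return not_significant and power_at_limit >= target

def tost_rule(delta_hat, s_e, limit=2.0, n_cows=16, alpha=0.05):
    """Equivalence if the 100(1 - 2*alpha)% CI lies within (-limit, +limit)."""
    half_width = Z.inv_cdf(1.0 - alpha) * sed_from_se(s_e, n_cows)
    return -limit < delta_hat - half_width and delta_hat + half_width < limit

# Same estimated difference (1 kg/d), different precision
for label, s_e in (("Exp. 1", 1.8), ("Exp. 2", 0.5)):
    print(label, observed_power_rule(1.0, s_e), tost_rule(1.0, s_e))
```

Under these assumptions the observed power rule declares equivalence only for the imprecise Exp. 1, whereas the confidence interval rule declares it only for the precise Exp. 2.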
| Additional Pertinent Design and Analysis Issues |
|---|
Latin square and crossover designs that block on cows over periods have been recommended in this paper. However, as elucidated by Morris (1999), there are a number of situations in which these designs should not be used in animal science. First, if one wishes to measure the long-term cumulative effects of treatments on, for example, longevity, then animals must serve as the experimental units and not the blocking factors. Second, if treatment effects persist well beyond one period (i.e., second- or higher-order carryover effects), then designs blocking on animals should be discouraged.
Statistical Analysis
Classical and bioequivalence power computations for both the replicated Latin square and repeated crossover designs have been demonstrated in this paper. However, as previously indicated, treatment x period interaction is also inferable. Various contrasts as they pertain to this interaction can be studied under classical or bioequivalence-testing frameworks, such as the mean difference between GM and reference feedstuffs in Period 1 vs. the mean difference between the two feedstuffs in Period 4 (presuming period as fixed). Certainly, such contrasts should take precedence in hypothesis testing compared to overall treatment comparisons, as primarily addressed in this paper, if treatment x period interaction is determined to be important.
In some cases, noninferiority testing rather than bioequivalence testing may be more appropriate for the comparison of GM vs. conventional feedstuffs. That is, rather than establishing equivalence, it may be of greater interest to test whether or not the GM feedstuff is inferior to the reference feedstuff for various aspects. More details on hypothesis-testing procedures and power assessments in noninferiority testing can be found in Hauschke et al. (1999).
Historically, periods have been treated as fixed in both Latin square and repeated crossover designs. This approach is suitable if the variability in days in milk (DIM) within each treatment period is substantially less than the period length such that period and stage of lactation are nearly synonymous. However, if the range in DIM far exceeds period length, then period is less associated with stage of lactation and more reflective of seasonal or temporal management effects, such as the quality of silage as it is unloaded from the silo. In that case, period may be more appropriately treated as random with some modeled temporal correlation structure, as is possible with the use of SAS PROC MIXED. Furthermore, stage-of-lactation effects should then be modeled separately from period effects using lactation curve models, such as Wilmink's curve (Wilmink, 1987) or even high-order polynomials. Random regression models have been recently popularized in modern dairy cattle genetic evaluation programs (Jensen, 2001), where Dairy Herd Improvement Association test-day effects are used to model seasonal effects in a manner similar to Latin square periods and DIM is used to model parity, herd, and random cow-specific lactation curves, allowing for differences in lactation shape (i.e., persistency) over levels of those various factors. Treatment-specific lactation curves could be similarly modeled in replicated Latin square and crossover designs if treatment periods and DIM substantially overlap. Additionally, genetic sources of variability may be removed by fitting half-sib or full-sib families as random effects in these models as well. These implementations may substantially improve both classical and bioequivalence power of test.
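For readers unfamiliar with lactation curve models, the function usually attributed to Wilmink (1987) can be written as yield = a + b·exp(−k·DIM) + c·DIM, with the decay constant k commonly fixed near 0.05. The sketch below evaluates this curve; the function name and all parameter values are hypothetical and chosen only for illustration, not estimated from any data.

```python
from math import exp

def wilmink_curve(dim, a, b, c, k=0.05):
    """Expected daily milk yield (kg/d) at `dim` days in milk.

    a: overall level, b: early-lactation rise toward peak (usually negative),
    c: linear decline after peak, k: fixed decay constant (assumed 0.05).
    """
    if dim < 0:
        raise ValueError("days in milk must be nonnegative")
    return a + b * exp(-k * dim) + c * dim

# Hypothetical parameter values for illustration only
params = {"a": 35.0, "b": -10.0, "c": -0.05}
yields = {dim: wilmink_curve(dim, **params) for dim in (0, 30, 60, 150, 300)}
```

Treatment-specific curves, as suggested above, would correspond to letting a, b, and c differ between treatments, which is one way to model treatment by stage of lactation interaction directly.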
This paper has concentrated primarily on classical and bioequivalence testing of milk production. However, the values for $\delta$ or $\theta$ chosen could be extrapolated to other testing scenarios involving various intake, digestibility, fitness, or reproductive measures. That is, the key contributor to classical or bioequivalence power is not singly $\delta$ or $\sigma_e = \sqrt{\sigma^2_e}$, but their ratio. As an example, power determinations for $\delta$ = 1, 2, 3, 4 and $\sigma^2_e$ = 2 in Figure 1a apply to any ratio value of

$$\frac{\delta}{\sigma_e} = \frac{1}{\sqrt{2}},\ \frac{2}{\sqrt{2}},\ \frac{3}{\sqrt{2}},\ \frac{4}{\sqrt{2}}$$
Studies on GM vs. reference corn as dairy feedstuffs have involved comparisons for a number of production, growth, milk components, and feed intake responses. Multiple testing corrections, such as Bonferroni or other alternatives, might plausibly be considered further in such cases; however, Berger and Hsu (1996) argue that these corrections may not be entirely necessary based on the intersection-union principle. Furthermore, multivariate data reduction alternatives should also be considered when some response variables are highly correlated with each other. Johnson (1998), for example, demonstrated how canonical variates analysis can be effectively used as a precursor data reduction strategy; subsequent classical or bioequivalence tests can then be applied as appropriate to the resulting canonical variates.
Current discussion on bioequivalence hypothesis testing has evolved from the exposition on average bioequivalence provided in this paper to measures of population and individual bioequivalence. These aggregate measures are not only functions of mean treatment differences but also of subject x treatment interaction and treatment differences for within-cow variability. As indicated previously, the double-reversal design is particularly useful for inference on these additional measures of dispersion. Nevertheless, there is currently sufficient controversy regarding the additional utility of population and individual bioequivalence aggregate measures beyond that provided by average bioequivalence (Hauschke and Steinijans, 2000; Munk, 2000). Bioequivalence testing will therefore likely become an area for future regulatory statistical research that will be pertinent for the testing of GM crops as feedstuffs for dairy cattle.
| Final Recommendations |
|---|
Power computations illustrated in this paper seem to indicate that experiment sizes may need to be greater for bioequivalence testing as opposed to classical hypothesis testing. It should be noted that existing FDA guidelines (FDA, 2001) for drug testing advocate that equivalence limits be, respectively, 80% and 125% of the reference means (i.e., $\theta_L$ = 0.80$\mu_R$ and $\theta_U$ = 1.25$\mu_R$) for various pharmacokinetic responses. Appropriate equivalence limits should also be characterized for key response variables in studies comparing GM to their reference hybrids as feedstuffs for dairy cattle.
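To see what limits of this kind would mean on the mean difference scale used throughout this paper, the sketch below converts ratio limits into limits on the test-minus-reference difference. This is an illustrative reading, not a regulatory recommendation: it assumes the 80% and 125% limits bound the test mean relative to the reference mean, and the reference mean of 40 kg/d is hypothetical.

```python
def ratio_limits_as_differences(reference_mean, lower_ratio=0.80, upper_ratio=1.25):
    """Convert limits on the test mean, expressed as fractions of the
    reference mean, into limits on (test mean - reference mean)."""
    if reference_mean <= 0:
        raise ValueError("reference mean must be positive")
    lower = (lower_ratio - 1.0) * reference_mean
    upper = (upper_ratio - 1.0) * reference_mean
    return lower, upper

# Hypothetical reference mean milk yield of 40 kg/d
lower_limit, upper_limit = ratio_limits_as_differences(40.0)
```

The resulting limits are not symmetric about zero on the difference scale (roughly −8 and +10 kg/d for this hypothetical mean), although they are symmetric on the log ratio scale, so the single-parameter symmetric specification discussed earlier would not apply directly.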
| Implications |
|---|
| APPENDIX |
|---|
$\alpha$ = 0.05) in a double-reversal or switchback design.
SAS code
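proc mixed;
  class cow period treatment;                     /* classification factors */
  model y = treatment period treatment*period;    /* fixed effects */
  random cow cow*treatment;                       /* random effects */
  lsmeans treatment / diff alpha = .10 cl;        /* 90% CI on the mean difference */
run;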
Interpretative summary
The class statement defines the classification factors, namely cow, period, and treatment. Fixed effects treatment and period, including their interaction treatment*period, are included in the model statement, whereas random effects cow and its interaction with treatment, cow*treatment, are specified in the random statement. The lsmeans statement requests estimated treatment means and, through the diff option, their mean differences. The options alpha = .10 and cl specify a 90% CI on the treatment mean difference, as appropriate for a bioequivalence test with a Type I error rate of 0.05. Conclusions for bioequivalence are then based on whether or not the reported lower and upper confidence limits on the mean differences of interest fall within prespecified equivalence limits. If significant treatment x period interaction exists, then substitute
lsmeans treatment*period / diff alpha = .10 cl;
for the fifth line in the above code to infer upon period-specific treatment differences or alternatively treatment-specific period differences.
The same program can be used for a replicated Latin square design except that cow*treatment is not inferable and should be deleted; that is, alter the fourth line in the code above to
random cow;
The code above further presumes that residuals are identically, independently, and normally distributed. However, it may be possible, for example, that temporal correlations in the residuals exist across periods within cows. If periods are equally spaced apart, then a candidate specification might be a first-order autoregressive (AR(1)) correlation structure, in which case the following statement should be added to the code above.
repeated period / subject = cow type = ar(1);
where the ar(1) specification specifies the residual correlations between periods to be a function of the time interval between them. See Littell et al. (1996; 1998) for further details on this and many other potentially better fitting specifications.
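To make the AR(1) structure concrete, the sketch below builds the implied residual correlation matrix for equally spaced periods within a cow, in which the correlation between periods i and j is the lag-one correlation raised to the power |i − j|. The function name and the value of 0.5 for the lag-one correlation are hypothetical.

```python
def ar1_correlation_matrix(n_periods, rho):
    """Residual correlation matrix across periods within a cow under AR(1).

    Entry (i, j) equals rho ** |i - j|, so adjacent periods are more
    highly correlated than periods further apart.
    """
    if n_periods < 1:
        raise ValueError("need at least one period")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return [[rho ** abs(i - j) for j in range(n_periods)]
            for i in range(n_periods)]

# Four periods (e.g., a double-reversal design), hypothetical rho = 0.5
matrix = ar1_correlation_matrix(4, 0.5)
for row in matrix:
    print(row)
```

With a lag-one correlation of 0.5, residuals in Periods 1 and 2 have correlation 0.5, whereas those in Periods 1 and 4 have correlation 0.125.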
| Footnotes |
|---|
2 Correspondence: 1205 Anthony Hall (phone: 517-355-8445; e-mail: tempelma{at}msu.edu).
Received for publication July 3, 2003. Accepted for publication October 8, 2003.
| Literature Cited |
|---|