|
|
||||||||
ANIMAL GENETICS |

* Grup de Recerca en Remugants, Departament de Ciència Animal i dels Aliments, Universitat Autònoma de Barcelona, 08193 Bellaterra (Barcelona), Spain;
and
Àrea de Producció Animal, Centre UdL-IRTA, 25198 Lleida, Spain
| Abstract |
|---|
|
|
|---|
Key Words: model fitting parametric bootstrap proportional hazard survival analysis
| INTRODUCTION |
|---|
|
|
|---|
Whereas nonparametric approaches show a great flexibility to fit survival data, parametric models suffer from a lower flexibility due to the assumptions made on the baseline survival function. Nevertheless, parametric models can lead to substantial reductions in computational resources and time requirements (Ducrocq et al., 2000
). An accurate choice of the survival analysis technique is essential for the validity of the results obtained. However, there are only a few graphical tests described to evaluate model fitting (Ducrocq et al., 2000
), and in addition to their subjective interpretation, some incompatibilities have been observed with time-dependent effects (Tarrés et al., 2005
).
Here we propose a parametric bootstrap procedure to test survival model fitting that uses the empirical Kaplan-Meier survival function (Kaplan and Meier, 1958
) as the reference parameter. Our methodology provides an easy determination of model fit along the time-space, establishing the bootstrap interval for each time point and model. The proposed method was tested with a data set of survival up to weaning of calves of the Bruna dels Pirineus beef cattle breed, with different assumptions for the baseline survival function.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Statistical Background for Proportional Hazard Models
Using the survival model described by Cox (1972)
as a starting point, the hazard h(t) and survival S(t) functions at time t of a given animal i influenced by a set of explanatory variables wi are written as
![]() |
where h0(t) and S0(t) are the baseline hazard and survival functions, respectively, which represent the aging process of the whole population, wi is a row vector of incidences, and
is the column vector of regression coefficients including systematic, environmental permanent, and genetic effects. The baseline functions may be arbitrarily chosen in a large class of smooth functions (Cox, 1972
; Prentice and Gloeckler, 1978
) or may have a known parametric form, generally the Weibull distribution (h0(t) = 
(
t)
1 and S0(t) = exp((
t)
), where
and
are 2 positive parameters) or its simplification to an exponential density when
= 1 (Ducrocq et al., 1988a
,b
).
The attractive feature of Coxs (1972)
semiparametric methodology is that it permits the estimation of
under less stringent assumptions about the form of h0(t) and S0(t), whereas parametric models allow for a much less demanding use of computing and time resources along with reduced flexibility (Ducrocq et al., 2000
; Yazdi et al., 2002
). In this sense, the choice of an adequate baseline survival function plays a central role in survival analysis modeling, and incorrect assumptions at this point could invalidate further inferences on
(Allison, 1995
).
Unfortunately, there are no statistical tests to verify the fit of the parametric distributions to the data. Only graphical tests have been described for the Weibull and exponential distributions. Taking the baseline survival function estimated in a nonparametric approach, like the Kaplan-Meier method (SKM(t); Kaplan and Meier, 1958
) as a starting point, values of ln(ln(SKM(t))) plotted against ln(t) should result in a straight line with intercept
ln(
) and slope of
if the Weibull assumption is true. Nevertheless, when survivability is influenced by a time-dependent factor, ßTD, that changes its effect at time t1,t2,...tn, this method cannot be applied because the intercept becomes
ln(
) + ßTD, and its expectation varies with ßTD, as described by Tarrés et al. (2005)
. On the other hand, a plot of sorted generalized residuals against the expected order statistics of a censored unit exponential function should display a straight line with slope 1 and going through the origin (Cox and Oakes, 1984
). Both graphical tests allow for an easy detection of inappropriate assumptions, but they are difficult to interpret when graphics show moderate deviations [see Ducrocq et al. (2000)
for graphical examples]. As a whole, accurate methods to determine the fit of survival models to the real aging process of the analyzed populations are lacking.
Parametric Bootstrapping in Survival Analysis Models
We present here a parametric bootstrap approach to test goodness of fit of proportional hazard regression models. Bootstrap methods, introduced by Efron (1979)
, have become a routine method for approximating the distribution of a statistical quantity of interest, and they have been applied to the animal breeding framework (García-Cortés et al., 1992
; Reverter et al., 1998
). From a parametric point of view, bootstrapping methodology consists of 3 characteristic steps: a) definition of an appropriate model for the observed data, b) use of Monte Carlo methods to generate n sample data sets from the fitted model and calculate the statistic or quantity of interest (
), and c) construction of the bootstrap distribution of
.
Step 1.
Within the proportional hazards framework, several models are defined in the first step of the bootstrap, assuming Weibull, exponential, or nonparametric baselines, and different incidences of systematic and random effects, as well as time-dependent factors. The subsequent analysis of those models provides the estimated parameters of the baseline (
,
) and the regression coefficients (
) that are needed to simulate new data sets by Monte Carlo methods.
Step 2.
New data sets must be generated from
,
, and
, and the incidence vector w related to each original record. If time-dependent effects are not considered in the model, a simulated value (Ti), corresponding to the ith animal in the real data set, can be generated by the transformation method (Press et al., 1992
) as follows:
![]() |
where
is a random number generated from a uniform distribution function between 0 and 1. Note that this formula comes from the inverse of the cumulative distribution function. Under models with time-dependent effects or with the baseline estimated by nonparametric approaches, recursive methods are required to obtain simulated values of longevity (see Appendix). When a new data set is entirely simulated, the observed survival function can be estimated by the Kaplan-Meier estimate (Kaplan and Meier, 1958
), which does not require any other information or parameter than the simulated records.
The Kaplan-Meier estimate can be viewed as a weighted survival estimate of the overall population, and its validity for making inferences on the overall survival function relies on the homogeneity of the population and the validity of the proportional hazards assumption. However, these assumptions are not easily tested. Provided that the proportional hazards assumption is valid across the effects included in the model, this can be accounted for during the stochastic simulation process. The resampled Kaplan-Meier survival would then be consistent for studying the fitting of the different models. Alternatively, Kaplan-Meier estimates could be obtained on subpopulations that are sure to be homogeneous.
Step 3.
The bootstrapping procedure ends with the creation of confidence intervals for the survival probability at each temporal point t and from the samples obtained by Kaplan-Meier in step 2. Following a standard approach, the bounds of the confidence interval are usually fitted to percentiles of 0.025 and 0.975.
The final objective of the bootstrap technique presented herein was to evaluate the fit obtained by the current model and its estimates. Confidence intervals obtained for the survival expectation at each age point t were stated as the 0.025 and 0.975 percentiles of the bootstrap samples, and they were easily contrasted with the Kaplan-Meier survival function of the real data. Significant fitting deficiencies were revealed when the observed survival function was not included within the confidence interval, and they could be statistically quantified though the bootstrapped P values (Hesterberg et al., 2005
).
Bootstrapping in the Survival Analysis of Bruna dels Pirineus Beef Calves
Field Data Source.
The procedure described above was applied to a data set of Bruna dels Pirineus beef calves, previously analyzed through survival techniques by Tarrés et al. (2005)
. Bruna dels Pirineus is a beef type cattle breed selected from the old Brown Swiss, similar to the American Braunvieh, and it is used for extensive beef cattle production in the geo-climatic and management conditions of the Pyrenean mountain areas of Catalonia, Spain. Cows remain in valleys close to villages from November to June, when most calvings occur, after which animals are taken to the mountains to graze alpine pastures. After editing and excluding twin births, stillbirths and mummified fetuses, survival records from birth to weaning of 2,504 beef calves born between 1994 and 2002 were considered, all of them belonging to 3 commercial herds included in the yield recording scheme of the Bruna dels Pirineus breed. The survival time was defined as the difference between the date of death and the date of birth (complete records) and, when the date of death was unknown, we assumed that the calf record was censored at the age of weaning (180 d).
Operational Model and Assumptions for the Baseline.
Systematic effects assumed were month of calving, length of productive life, calving difficulty, and birth weight with categories and estimated hazard ratios (Tarrés et al., 2005
), as indicated in Table 1
. Moreover, the additive effect of the sire also was included (estimates not shown). Following Tarrés et al. (2005)
, 4 models were assumed, depending on the baseline distribution function: exponential, Weibull, exponential with a time-dependent effect changing at d 16 and 31, and Weibull with the same time-dependent effect.
|
Bootstrapping Procedure.
A total of 10,000 new data sets (Efron, 1979
) were simulated for each baseline assumption following the procedures described above. Parameters used in Monte Carlo simulations are shown in Table 1
, and they were previously obtained by Tarrés et al. (2005)
with the same data set. The Kaplan-Meier survival function was calculated for each data set until day 180, which was the weaning age. The bootstrap interval was defined for each age t assuming the 0.025 and 0.975 percentiles of the bootstrap samples.
To compare the results obtained by bootstrap, the Akaikes (1973)
Information Criterion (AIC) was estimated for each model and a likelihood ratio test (LRT) was carried out for each pair of parametric nested models. Note that the Weibull model and the exponential time-dependent model were not nested models, thus the likelihood ratio test between them was not calculated.
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
These values reveal stable behavior for the parametric bootstrap, providing accurate estimates of the expected survival probability with dispersion smaller than 2 percentage points. Assuming the validity of the Kaplan-Meier estimate of the survival function, this procedure allowed for a direct assessment of the model fit to the real data. The more parameterized models reached the lower dispersions in their bootstrap intervals, but this is not surprising because the bootstrap dispersion reflects the bound for the asymptotic variance of the estimates in play. In addition, these bounds obviously decrease with model parameterization (van der Laan and Robins, 2002
). This can also be related to the fact that parametric bootstrap procedures do not take into account the precision of parameters used in the stochastic simulation (Gelman et al., 1996
; Blasco, 2001
). Moreover, the change in survival probability becomes almost null from d 32 to weaning in time-dependent parametric models and in the Cox model (Table 1
), with a reduced incidence of deaths that minimizes the variability of the expected survival distribution and, therefore, the bootstrap interval.
The hazard function is constant for the exponential assumption and reduces (
< 1) or increases (
> 1) with time for the Weibull one (Allison, 1995
). A value of
smaller than 1 allows for a greater incidence of deaths after the first month of life, although this situation can also be modeled through the time-dependent exponential model, as well as the Coxs semiparametric model.
The fitting of the Exponential and Weibull models were clearly inadequate, with important over- and underestimates during the first days and the last 4 mo before weaning that provided highly significant bootstrapped P values (P < 0.001; Figures 1
and 2
). Specifically, the exponential model overestimated the survival probability from d 7 to 45, with a posterior underestimation from d 79. The Weibull model showed a slightly better fit with a substantial underestimation of calf survival in the first week of life and from d 61 to censoring age. These results are in agreement with the ones obtained by Tarrés et al. (2005)
with the same data set using graphical tests, and they are a representative example of the substantial flexibility of the parametric functions to fit survival data, previously pointed out by several authors (Allison, 1995
; Kleinbaum, 1996
). Both models seem to attain smaller biases during the first weeks, probably related to the high concentration of deaths in the first month of life (81%; Tarrés et al., 2005
), along with the fact that the complete survival records provide a substantial amount of information to the survival analysis.
|
|
|
|
Parametric survival models can be compared through the AIC, or they can be viewed as nested models and tested by a LRT (Table 2
). Both analytical procedures showed that time-dependent models were preferable for the survival analysis of Bruna dels Pirineus calves, although they do not provide substantial information about the fit of the final model. Indeed, results of AIC and LRT for nontime-dependent models indicated that the Weibull was better (P < 0.001) and, without other available models or additional information, we might accept the Weibull model when its fit was clearly poor (Figure 2
). The LRT implied that a
different than 1 was preferable although, within the exponential models, it favored the time-dependent one (P < 0.001), showing that a constant hazard function was not preferable in our population. Moreover, AIC and LRT criteria evaluated the model as a whole, and they did not detect substantial differences between the time-dependent models (P > 0.1), although the exponential one seemed preferable due to its lower number of parameters (more parsimonious model). The parametric bootstrap tested the fitting for each temporal point, providing a straightforward framework to detect problematic zones. In this sense, parametric bootstrap could become a very useful tool in parametric survival analysis.
|
and the Kaplan-Meier estimation are trivial procedures in computing (Press et al., 1992
|
| IMPLICATIONS |
|---|
|
|
|---|
| APPENDIX |
|---|
|
|
|---|
1,
2, ...,
k that change at time
0 = 0,
1, ...,
k, respectively. Note that
and
are the estimated parameters for the Weibull baseline survival function, and w*i and
* are the ith incidence row vector of the data set and the column vector of estimated regression parameters, both excluding the elements related to the time-dependent effect. Following Tarrés et al. (2005)
![]() |
According to the transformation method (Press et al., 1992
), we can generate random samples (Ti) characterized by Si(t; wi*) = S0*(t)exp(wi*
*), with the next formula:
![]() |
Obviously, this does not allow for a direct simulation of random values, and thus a recursive approach has to be applied, with the following steps: i) generate a random value (
) from a uniform distribution function between 0 and 1, ii) assume k = 1, iii) calculate Ti with the previous equation, and iv) if Ti >
k, assume k = k + 1 and return to step iii.
Simulation of Longevity Records Through a Nonparametric Baseline
In the case of baseline survival functions estimated through nonparametric methodologies (Kaplan and Meier, 1958
; Tsiatis, 1981
), a recursive approach is also required to obtain random samples (Ti) from the survival distribution: i) generate a random value (
) from an uniform distribution between 0 and 1, ii) assume t = 1, iii) calculate S(t; wi) = S0(t)exp(wi
),, and iv) if S(t; wi)
then Ti = t, or else assume t = t + 1 and return to step iii. Note that nonparametric approaches assume that survival probability is constant in the interval between 2 complete records.
| Footnotes |
|---|
3 Current address: VIT, Heidewerg 1, D-27283 Verden, Germany. ![]()
2 Corresponding author: joaquim.casellas{at}uab.es
Received for publication December 16, 2005. Accepted for publication May 12, 2006.
| LITERATURE CITED |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
A. Cecchinato, V. Bonfatti, L. Gallo, and P. Carnier Survival analysis of preweaning piglet survival in a dry-cured ham-producing crossbred line J Anim Sci, October 1, 2008; 86(10): 2486 - 2495. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Casellas, G. Caja, X. Such, and J. Piedrafita Survival analysis from birth to slaughter of Ripollesa lambs under semi-intensive management J Anim Sci, February 1, 2007; 85(2): 512 - 517. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |