Seeking Optimal Design for Animal Bioassay Studies
Toxicological Sciences
Gradient Corporation
ABSTRACT
The article highlighted in this issue is "A Statistical Evaluation of Toxicity Study Designs for the Estimation of the Benchmark Dose in Continuous Endpoints," by W. Slob, M. Moerbeek, E. Rauniomaa, and A. H. Piersma (pp. 167–185).
The experimental design of animal bioassay studies (the numbers and placement of doses and the allotment of animals among doses) tends to adhere, if not to specified rules, then at least to strong conventions. In part, this is due to mandated procedures for testing done to support regulation, and in part, it is the codified conventional wisdom gleaned from many years of experience. But precedent and simplicity also play a large role. For example, allocating animals evenly among regularly spaced doses may simply reflect the lack of a perceived reason to follow any specific alternative.
This consistency has the advantage of promoting comparability among tests on different compounds in different laboratories. Issues of interpretation can be settled for the general case and codified into guidelines and policies that promote consistency and transparency. Since no-effect levels must be among the doses actually tested, and since the demonstration of an elevated response depends on statistical power (which in turn depends on numbers of animals at different doses), there is the potential that bioassay design could be manipulated to bias the outcome. Standardization of bioassay design avoids this possibility by mandating beforehand a particular design to be considered acceptable for regulatory purposes.
Bioassays are not merely sources of data, however; they are experiments that are (or ought to be) designed to find out something in particular about the tested agent—about its ability to cause certain toxic effects, its potency in doing so, and the nature of the dose-response relationship. To make the most efficient use of resources, one should seek the experimental design that achieves the best accuracy (lack of bias) and precision (low residual uncertainty) about values of interest estimated from the data. The optimal experimental design for achieving this will depend on two things: (1) what specific values one is trying to estimate, and (2) what the true underlying dose-response relationship may be.
The first question, what one is trying to estimate, is not as straightforward as it might seem. A forthcoming report from a panel on dose selection organized by the Risk Science Institute of the International Life Sciences Institute (previewed in Olin and Rhomberg, 2005) discusses how the objectives that chronic bioassays, and in particular carcinogenicity bioassays, are asked to address have expanded over time. An initial focus on screening for a compound's ability to cause toxicity has expanded to include characterizing the shape of the dose-response curve, providing evidence for the relevance of high-dose modes of action to lower doses, serving as the basis for low-dose extrapolation, identifying no-observed-effect levels, and estimating doses associated with particular levels of response. The experimental design that is optimal for one purpose will not generally be best for others.
For endpoints believed to have exposure thresholds, the aim is usually to identify the dose level below which no adverse impacts are generated. The traditional analysis of a bioassay result is to identify a no-observed-adverse-effect level (NOAEL), judged by pair-wise comparisons of the responses at each dose against a suitable control group. Thus, the bioassay is usually designed to include at least one high dose level with clear toxicity, at least one low dose without significantly elevated toxicity, and a few doses in between that may be found to show the effect or not, so that the experimental NOAEL dose defines an approximate "bottom" to the range of doses capable of causing the effect in question.
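The pair-wise logic described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses dichotomous incidence data and a one-sided Fisher exact test for simplicity, the counts are invented, and the function names `fisher_one_sided` and `find_noael` are hypothetical. Regulatory practice may use other tests, multiplicity adjustments, and expert judgment about adversity.

```python
from math import comb

def fisher_one_sided(ctrl_resp, ctrl_n, dose_resp, dose_n):
    """One-sided Fisher exact p-value: chance of seeing at least dose_resp
    responders in the dosed group if dose has no effect."""
    total_resp = ctrl_resp + dose_resp
    total_n = ctrl_n + dose_n
    denom = comb(total_n, dose_n)
    p = 0.0
    for k in range(dose_resp, min(total_resp, dose_n) + 1):
        p += comb(total_resp, k) * comb(total_n - total_resp, dose_n - k) / denom
    return p

def find_noael(doses, responders, group_sizes, alpha=0.05):
    """doses[0] is the control group. Returns the highest tested dose below the
    lowest dose showing a significant increase (None if the lowest dose is
    already significant). This convention is an assumption of the sketch."""
    noael = None
    for i in range(1, len(doses)):
        p = fisher_one_sided(responders[0], group_sizes[0],
                             responders[i], group_sizes[i])
        if p < alpha:
            break
        noael = doses[i]
    return noael

# Invented example: 10 animals per group, control plus three doses.
doses = [0, 10, 30, 100]
responders = [1, 1, 4, 9]
noael = find_noael(doses, responders, [10, 10, 10, 10])
```

With these invented counts the NOAEL comes out at dose 30, even though 4 of 10 animals responded there. That illustrates how strongly the result depends on group size and on which doses happened to be tested.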
In recent years, there has been a good deal of attention focused on the Benchmark Dose (BMD) approach as an alternative to the NOAEL (Crump, 1984). In the benchmark dose approach, a dose-response curve is fit to all the bioassay data, and the dose (or the lower bound on the dose) corresponding to a given low level of response on the fitted curve (say, a 5% elevation in incidence over background) is selected as a characterization of the dose level at which a detectable increase in the measured effect occurs. This so-called Benchmark Dose, after the application of appropriate uncertainty factors, is then used in place of the NOAEL in determinations of acceptable intakes, reference doses, or other safe-dose determinations. Among the advantages of the BMD over the NOAEL are that all the data are used, with information about the dose-response relationship playing a role, that the estimated value need not be one of the tested doses, and that the method is not particularly sensitive to arbitrary details such as the exact curve-fitting method or the exact dose placements.
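As a rough sketch of the calculation just described, the following code finds the dose at which a fitted curve rises 5% above background. Everything specific here is an assumption for illustration: the one-hit curve, its parameter values (`p0`, `beta`), the use of added rather than extra risk, and the helper name `benchmark_dose`. A real analysis would also compute a lower confidence bound on the dose, which this sketch omits.

```python
import math

def benchmark_dose(curve, bmr, d_max, tol=1e-9):
    """Dose at which curve(d) - curve(0) first reaches bmr, found by bisection.
    Assumes the curve is monotonically increasing on [0, d_max]."""
    background = curve(0.0)
    if curve(d_max) - background < bmr:
        return None  # the benchmark response is not reached in the tested range
    lo, hi = 0.0, d_max
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if curve(mid) - background < bmr:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical fitted one-hit model for incidence (parameters are invented).
p0, beta = 0.10, 0.02

def fitted_incidence(d):
    return p0 + (1.0 - p0) * (1.0 - math.exp(-beta * d))

bmd = benchmark_dose(fitted_incidence, bmr=0.05, d_max=100.0)
```

Because the result is read off the fitted curve, it need not coincide with any of the tested doses.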
As the BMD approach becomes more widely used, and as possible modifications of mandated experimental designs are being considered, it is worthwhile to ask whether the numbers of dose levels, their placement, and the allocation of animals among doses should be reconsidered so as to enhance the precision and accuracy of BMD estimation. Since the need to array dose groups as alternative candidates for designation as a NOAEL is obviated, the experimental design is freed to better address the estimation of a particular point on the dose-response curve.
The impact of bioassay experimental design on the efficiency of estimation of Benchmark Doses has mostly been investigated for dichotomous endpoints, that is, those for which the data comprise counts of responders and non-responders at each dose (Kavlock et al., 1996; Kelly, 2001; Weller et al., 1995). In many bioassays, however, the outcomes are continuous variables such as body weight, organ weight, or physiological variables that are measured on a continuous scale in all tested individuals. Responses consist of changes in the distribution of these values among dosed animals compared against the levels (and variation) observed among controls. Such data require a different approach to dose-response analysis, and hence to Benchmark Dose definition.
In an article in the present issue of Toxicological Sciences, the question of bioassay design for continuous-endpoint Benchmark Dose estimation is examined by Slob, Moerbeek, Rauniomaa, and Piersma (2005). Their approach is to generate, by computer simulation, large numbers of hypothetical datasets from dose-response relationships of various shapes, and then to fit a variety of dose-response models to each dataset. The best-fitting model for each dataset is chosen by a likelihood ratio test, and the implied Benchmark Dose (using a Critical Effect Size of 5% change from controls) is calculated. The resulting distribution of values over simulated datasets can be compared with the true dose causing a 5% change, as determined by the original relationship from which the datasets were simulated. Using this system, one can systematically vary aspects of the experimental design (number of dose groups, the dose levels chosen, the allocation of animals among them) and assess how the accuracy and precision of the BMD estimate change. Importantly, the sensitivity to the true shape of the underlying dose-response curve (an aspect of the bioassay not under the control of the experimenter) and the performance of various experimental designs in the face of diversity in true dose-response shapes can also be systematically investigated.
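The general shape of such a simulation can be sketched as follows. This is a much-simplified illustration, not the authors' procedure. It assumes a single exponential dose-response model y = a·exp(b·d) with lognormal scatter, fits only that one model by least squares on the log scale, and skips the model-selection step. The control mean, the residual standard deviation `sigma`, the dose spacing, and the function names are all hypothetical.

```python
import math
import random
import statistics

def simulate_bmd(doses, n_per_group, sigma=0.1, ces=0.05, top_effect=0.30,
                 n_sim=500, seed=1):
    """Return (true_bmd, estimated BMDs) for one hypothetical design."""
    rng = random.Random(seed)
    a = 100.0                                      # hypothetical control mean
    b = math.log(1.0 + top_effect) / max(doses)    # top dose gives a 30% change
    true_bmd = math.log(1.0 + ces) / b             # dose giving a 5% change
    x = [d for d in doses for _ in range(n_per_group)]
    x_mean = statistics.fmean(x)
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    estimates = []
    for _ in range(n_sim):
        # Lognormal scatter around the true curve, simulated on the log scale.
        log_y = [math.log(a) + b * xi + rng.gauss(0.0, sigma) for xi in x]
        y_mean = statistics.fmean(log_y)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, log_y))
        b_hat = sxy / sxx                          # least-squares slope
        if b_hat > 0:                              # no BMD without an upward trend
            estimates.append(math.log(1.0 + ces) / b_hat)
    return true_bmd, estimates

def summarize(true_bmd, estimates):
    """Median, 5th and 95th percentile of the ratio estimate / true value."""
    ratios = [e / true_bmd for e in estimates]
    cuts = statistics.quantiles(ratios, n=20)      # 19 cut points: 5%, ..., 95%
    return statistics.median(ratios), cuts[0], cuts[-1]

# One hypothetical design: control plus three equally spaced doses, 10 animals each.
true_bmd, estimates = simulate_bmd([0.0, 1 / 3, 2 / 3, 1.0], n_per_group=10)
median_ratio, p05, p95 = summarize(true_bmd, estimates)
```

A median ratio near 1 indicates little bias, and the width of the 5th to 95th percentile interval indicates precision. Calling `simulate_bmd` again with a different dose list or group size gives a comparable summary for an alternative design.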
One might think, if the goal is to estimate the dose at which a 5% change in the control value is reached, that an optimal design would concentrate dose groups (and animal allocation) in the dose range where this is expected to occur. While this is true to some extent, the simulations of Slob et al. show clearly that it is important to include high doses (above this critical region and associated with larger effects) as well. This helps establish the shape of the fitted curve in all regions, enhancing the ability to identify a model with the appropriate shape, and it overcomes the unfavorable signal-to-noise ratio near the BMD.
Slob et al. also show that distributing the total number of animals over more dose groups (with fewer in each as a consequence) does not result in poorer performance for BMD estimation. (This can be a bad strategy for NOAEL determination, since it lowers the statistical power of pair-wise comparison with the controls.) Indeed, more dose groups provide the advantage of hedging against poor dose placement that may result from bad guesses about where the critical range of doses is likely to lie.
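The parenthetical point about NOAEL determination can be made concrete with a back-of-the-envelope power calculation. The sketch below assumes a one-sided two-sample comparison with known, equal standard deviations (a normal approximation) and no multiplicity adjustment. The effect size of one standard deviation and the group sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def pairwise_power(effect_sd_units, n_per_group, alpha=0.05):
    """Approximate power of a one-sided comparison of one dose group with
    controls, for a true difference given in units of the residual SD."""
    z = NormalDist()
    noncentrality = effect_sd_units / sqrt(2.0 / n_per_group)
    return z.cdf(noncentrality - z.inv_cdf(1.0 - alpha))

# 40 animals in total: 4 groups of 10 versus 8 groups of 5.
power_four_groups = pairwise_power(1.0, n_per_group=10)
power_eight_groups = pairwise_power(1.0, n_per_group=5)
```

Under these assumptions, halving the group size lowers the power of each pair-wise test from roughly 0.72 to roughly 0.47, which is why spreading animals thinly hurts a NOAEL analysis more than a curve-fitting analysis that pools information across all groups.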
The simulations found that selecting the "right" dose-response model (i.e., the one from which the simulated data were generated) is not critical to good performance. Often, good estimates of the BMD were obtained even though the best-fitting model was one different from the one used to generate the simulated data. Nonetheless, fitting several models of different shapes (and then selecting the best-fitting one) is important so as to avoid biases resulting from marked model misspecification.
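For readers unfamiliar with choosing among candidate models by likelihood ratio test, here is a minimal sketch for one nested pair. The models are assumed for illustration: a no-effect model y = a against an exponential model y = a·exp(b·d), both with lognormal errors. The data values are invented, and the chi-square critical value is asymptotic, so it is crude for a dataset this small.

```python
import math

CHI2_CRIT_DF1 = 3.841  # 95th percentile of chi-square with 1 degree of freedom

def lrt_dose_effect(doses, responses):
    """Likelihood ratio statistic comparing y = a with y = a * exp(b * d),
    both fitted by least squares on the log scale."""
    x = list(doses)
    log_y = [math.log(y) for y in responses]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(log_y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, log_y))
    rss_null = sum((yi - y_mean) ** 2 for yi in log_y)   # residuals of y = a
    rss_expo = rss_null - sxy ** 2 / sxx                 # residuals of y = a*exp(b*d)
    # With normal errors on the log scale, 2 * (logL1 - logL0) = n * ln(RSS0 / RSS1).
    return n * math.log(rss_null / rss_expo)

# Invented data: one observation per dose, given as log-responses for clarity.
doses = [0.0, 1.0, 2.0, 3.0]
responses = [math.exp(v) for v in (0.0, 0.1, 0.1, 0.3)]
lr = lrt_dose_effect(doses, responses)
prefer_dose_effect = lr > CHI2_CRIT_DF1
```

The more complex model is accepted only if it improves the fit by more than chance would explain. Repeating the comparison up a chain of nested models gives one way to pick a best-fitting shape.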
The simulations presume that the top dose leads to an effect size of 30%; in effect, it is presumed that one knows beforehand the dose range over which the observable modulation of response by dose occurs. Usually, range-finding experiments, prior testing, and experience with similar compounds will allow experimenters to guess at the range of doses that will prove most informative. But in practice, it may be necessary to hedge the dose selection somewhat to allow for the possibility that one's prior presumption about the appropriate dose range is mistaken. This is especially important for long-term bioassays, the expense of which may preclude retesting if one finds that all the doses are too low (so that no responses are seen) or too high (so that no dose-response is seen among generally high responses). In an earlier simulation study of NOAELs, Brand et al. (1999) found that, for certain dose-response shapes, certain experimental designs, and certain degrees of lack of prior knowledge about the dose range of ultimate interest, there was a considerable probability that an experiment would fail to detect a true effect or fail to find a significant dose-response trend. Similarly, Slob et al. find that endpoints with high residual variation (i.e., pronounced differences among individuals in the response at a given dose) risk poor performance of the bioassay in identifying the effect and in efficiently finding the BMD. Studies with more dose groups hedge against poor selection of the dose range, but they are also found in the simulations to have a higher risk of failing to find a significant dose-response relationship.
All in all, the study shows that traditional bioassay designs (such as the 4-dose OECD 28-day study design) do not perform badly for BMD estimation despite their original intent for use in identifying NOAELs, but there are improvements that could be made. Studies of the sort conducted by Slob et al. will contribute to the development of toxicity testing as it shifts its focus increasingly toward characterization of dose-response relationships.
REFERENCES
Brand, K. P., Rhomberg, L., and Evans, J. S. (1999). Estimating non-cancer uncertainty factors: Are ratios NOAELs informative? Risk Analysis 19(2), 295–308.
Crump, K. S. (1984). A new method for determining allowable daily intakes. Fundam. Appl. Toxicol. 4, 845–871.
Kavlock, R. J., Schmid, J. E., and Setzer, R. W., Jr. (1996). A simulation study on the influence of study design on the estimation of benchmark doses for developmental toxicity. Risk Analysis 16, 399–410.
Kelly, G. E. (2001). The median lethal dose – design and estimation. Statistician 50, 41–50.
Olin, S., and Rhomberg, L. R. (2005). Dose-related issues in the design and interpretation of chronic toxicity and carcinogenicity studies in rodents. Poster at the Society of Toxicology 44th Annual Meeting, New Orleans, 7 March 2005.
Slob, W., Moerbeek, M., Rauniomaa, E., and Piersma, A. H. (2005). A statistical evaluation of toxicity study designs for the estimation of the benchmark dose in continuous endpoints. Toxicol. Sci. x,: 167–185.
Weller, E. A., Catalano, P. J., and Williams, P. L. (1995). Implications of developmental toxicity study design for quantitative risk assessment. Risk Analysis 15, 567–574.
(Lorenz R. Rhomberg)