--- title: "Frequentist and hybrid calibration of one-stage ROPE-based designs for single-arm phase II trials" author: | | Riko Kelter | Institute of Medical Statistics and Computational Biology | Faculty of Medicine | University of Cologne | Cologne, Germany date: "`r format(Sys.Date(), '%d %B %Y')`" bibliography: references.bib output: rmarkdown::html_vignette: toc: true number_sections: true mathjax: default includes: in_header: mathjax-config.html vignette: > %\VignetteIndexEntry{Frequentist and hybrid calibration of one-stage ROPE-based designs for single-arm phase II trials} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, dpi = 100, fig.retina = 1, dev = "png", dev.args = list(type = "cairo-png") ) library(bfbin2arm) ``` # Introduction This vignette introduces the calibration of one-stage ROPE-based designs for single-arm phase II trials with binary endpoints, implemented in the function `design_singlearm_onestage_rope()`. In these trial types, the goal is to establish equivalence between a standard of care with success probability $p_0$ and a novel drug or treatment. For each patient, a failure or success is recorded in the single treatment arm, so the primary endpoint is binary. This offers flexibility and a wide range of applications. The design is then based on a region of practical equivalence (ROPE) around the benchmark response probability \(p_0\). The ROPE is defined as \[ \mathcal{R}_p = [p_0 - \delta,\; p_0 + \delta]\cap(0,1), \] where \(\delta > 0\) denotes the half-width of the ROPE. Equivalence is accepted when the posterior probability that \(p\) lies inside the ROPE exceeds a chosen threshold, \[ \Pr(p \in \mathcal{R}_p \mid Y=y) \ge \gamma_{\mathrm{eq}}. \] As for the Bayes-factor-based one-stage design, `bfbin2arm` provides four calibration modes for ROPE designs: - **Bayesian**: Calibrate Bayesian predictive power and predictive type-I error. - **Frequentist**: Calibrate frequentist power at a fixed point alternative and frequentist type-I error. - **Hybrid**: Calibrate Bayesian predictive power and frequentist type-I error. - **Full**: Calibrate all four operating characteristics simultaneously. In addition, the ROPE design has three important tuning parameters that influence the frequentist operating characteristics: - the posterior probability threshold \(\gamma_{\mathrm{eq}}\), - the ROPE half-width \(\delta\), - the point alternative \(dp\) at which frequentist power is evaluated. This vignette explains the calibration modes and illustrates the impact of these parameters in a worked example. # Design setup We consider a single-arm phase II trial with a binary response. We formally test the hypotheses $$H_0:p\in \mathcal{R}_p \text{ versus } H_1:p\notin \mathcal{R}_p$$ The null hypothesis $H_0$ implies that the novel drug or treatment is equivalent from a clinical perspective to the standard of care. The alternative hypothesis $H_1$ implies that it is not. In the latter case, the novel drug or treatment could either be substantially more effective or substantially less effective. In a phase II trial which aims to demonstrate equivalence between the standard of care and a novel drug or treatment, both of these results are undesirable. The benchmark response probability is \(p_0 = 0.30\), and the ROPE half-width is chosen as \(\delta = 0.12\), so that \[ \mathcal{R}_p = [0.18, 0.42]. \] The analysis prior for the response probability is a \(\mathrm{Beta}(1,1)\) distribution and is used to compute posterior ROPE probabilities at interim or final analysis. For calibration of predictive operating characteristics we use separate design priors under equivalence (\(H_1\)) and non-equivalence (\(H_0\)), - under non-equivalence: \(\mathrm{Beta}(60, 40)\), - under equivalence: \(\mathrm{Beta}(36, 84)\). ```{r design-setup} p0 <- 0.30 delta <- 0.12 a <- 1; b <- 1 # analysis prior da0 <- 60; db0 <- 40 # design prior under H0 da1 <- 36; db1 <- 84 # design prior under H1 ``` The ROPE probability threshold \(\gamma_{\mathrm{eq}}\) will be treated as a tuning parameter. In the examples below we will use \(\gamma_{\mathrm{eq}} = 0.925\), which yields a design with Bayesian predictive power close to 0.8 and a frequentist type-I error near 0.1 under our specification. # Operating characteristics For a fixed sample size \(n\), the ROPE decision rule induces an equivalence acceptance region \[ \mathcal{A}_{\mathrm{eq}}(n) = \bigl\{ y \in \{0,\dots,n\} : \Pr(p \in \mathcal{R}_p \mid Y=y) \ge \gamma_{\mathrm{eq}} \bigr\}. \] If the region is contiguous, we can write \(\mathcal{A}_{\mathrm{eq}}(n) = \{y_{\min}^{\mathrm{eq}}(n),\dots,y_{\max}^{\mathrm{eq}}(n)\}\). Predictive (Bayesian) operating characteristics are computed under the design priors: - predictive power under equivalence: \[ \mathrm{power}(n) = \Pr(\text{equivalence accepted} \mid H_1) = \sum_{y \in \mathcal{A}_{\mathrm{eq}}(n)} \Pr(Y=y \mid H_1), \] - predictive type-I error under non-equivalence: \[ \mathrm{type1}(n) = \Pr(\text{equivalence accepted} \mid H_0) = \sum_{y \in \mathcal{A}_{\mathrm{eq}}(n)} \Pr(Y=y \mid H_0). \] Frequentist operating characteristics are computed under fixed response probabilities: - frequentist power at a point alternative \(p \in \mathcal{R}_p\): \[ \mathrm{freq\_power}(n; p) = \Pr_{p}(\text{equivalence accepted}) = \sum_{y \in \mathcal{A}_{\mathrm{eq}}(n)} \binom{n}{y} p^y (1-p)^{n-y}, \] - frequentist type-I error at a point \(p\): \[ \mathrm{freq\_type1}(n; p) = \Pr_p(\text{equivalence accepted}). \] In this vignette, frequentist type-I error is defined as the worst case at the ROPE boundaries, \[ \mathrm{freq\_type1}^{\max}(n) = \max\{ \mathrm{freq\_type1}(n; p_0-\delta),\; \mathrm{freq\_type1}(n; p_0+\delta)\}. \] The calibration modes select the sample size \(n\) such that these operating characteristics meet specified targets. # Calibration modes The function `design_singlearm_onestage_rope()` supports four calibration modes, specified via the argument `calibration`. ## Bayesian calibration In **Bayesian** mode, we calibrate the design using only the predictive operating characteristics under the design priors: - predictive power under \(H_1\) must be at least `target_power`, - predictive type-I error under \(H_0\) must be at most `target_type1`. Frequentist power and frequentist type-I error are computed post hoc (if a point alternative `dp` is supplied), but they do not influence the selection of the sample size. ```{r calib-bayes, eval = FALSE} des_bayes <- design_singlearm_onestage_rope( n_min = 20, n_max = 300, p0 = p0, delta = delta, gamma_eq = 0.925, a = a, b = b, da0 = da0, db0 = db0, da1 = da1, db1 = db1, calibration = "Bayesian", target_power = 0.80, target_type1 = 0.10, sustain_n = 10 ) ``` We can inspect the results as follows: ```{r, eval = FALSE} des_bayes ``` ``` One-stage single-arm ROPE design Calibration: Bayesian Search range n: 20 to 300 Null probability p0: 0.3 ROPE half-width delta: 0.12 Probability threshold gamma_eq: 0.925 Analysis prior: Beta(1, 1) Design prior (H0): Beta(60, 40) Design prior (H1): Beta(36, 84) Target Bayesian power: 0.8 Target Bayesian type-I error: 0.1 Sustain n: 10 Selected sample size n*: 173 Bayesian power(n*): 0.8166 Bayesian type-I(n*): 0.0001 Equivalence region: [39, 63] ``` We can plot the results as follows: ```{r, eval = FALSE} plot(des_bayes) ``` ```{r echo = FALSE, out.width = "100%", fig.align = "center", fig.cap = "Figure 1: Bayesian calibration of a ROPE-based clinical phase II trial with binary endpoints."} knitr::include_graphics("figures/singlearm-onestage-rope-calibration-bayes.png") ``` The plot shows the selected sample size \(n^\ast\), the predictive power and type-I error at \(n^\ast\), and, if requested, frequentist quantities for comparison. As these are not requested, they are not shown in the upper right panel. The bottom left panel visualizes the design priors, the benchmark probability $p_0$ and the ROPE. The bottom right panel visualizes the analysis prior, the benchmark probability $p_0$ and the ROPE. ## Frequentist calibration In **frequentist** mode, we calibrate the design using frequentist power and frequentist type-I error only. This requires specification of a point alternative `dp` inside the ROPE: - frequentist power at `dp` must be at least `target_freq_power`, - frequentist type-I error (worst case at \(p_0 \pm \delta\)) must be at most `target_freq_type1`. Bayesian predictive power and predictive type-I error are then reported post hoc. ```{r calib-freq, eval = FALSE} des_freq <- design_singlearm_onestage_rope( n_min = 20, n_max = 300, p0 = p0, delta = delta, gamma_eq = 0.925, a = a, b = b, da0 = da0, db0 = db0, da1 = da1, db1 = db1, calibration = "frequentist", dp = 0.30, target_freq_power = 0.80, target_freq_type1 = 0.10, sustain_n = 10 ) des_freq ``` ``` One-stage single-arm ROPE design Calibration: frequentist Search range n: 20 to 300 Null probability p0: 0.3 ROPE half-width delta: 0.12 Probability threshold gamma_eq: 0.925 Analysis prior: Beta(1, 1) Design prior (H0): Beta(60, 40) Design prior (H1): Beta(36, 84) Frequentist power point dp: 0.3 Target frequentist power: 0.8 Target frequentist type-I error: 0.1 Sustain n: 10 Selected sample size n*: 109 Bayesian power(n*): 0.6755 Bayesian type-I(n*): 0.0002 Frequentist power(n*): 0.8227 Frequentist type-I(n*): 0.0779 at p0 - delta: 0.0749 at p0 + delta: 0.0779 Equivalence region: [26, 38] ``` This mode is useful if regulatory or design requirements are expressed in terms of frequentist power and type-I error, while still employing a Bayesian ROPE decision rule in the analysis. ```{r, eval = FALSE} plot(des_freq) ``` ```{r echo = FALSE, out.width = "100%", fig.align = "center", fig.cap = "Figure 2: Frequentist calibration of a ROPE-based clinical phase II trial with binary endpoints. In contrast to Bayesian calibration, frequentist type-I-error rates are computed as worst-case scenarios at the ROPE-boundaries. Frequentist power is calculated under a specified point value for the success probability."} knitr::include_graphics("figures/singlearm-onestage-rope-calibration-frequentist.png") ``` We can see that the selected sample size now shifts from \(n^\ast\)=173 when using Bayesian calibration to \(n^\ast\)=109 when using frequentist calibration. ## Hybrid calibration In **hybrid** mode, calibration combines a Bayesian power condition with a frequentist type-I constraint: - predictive power under \(H_1\) must be at least `target_power`, - frequentist type-I error (worst case at \(p_0 \pm \delta\)) must be at most `target_freq_type1`. Frequentist power and Bayesian predictive type-I error are computed and reported post hoc. ```{r calib-hybrid, eval = FALSE} des_hybrid <- design_singlearm_onestage_rope( n_min = 20, n_max = 300, p0 = p0, delta = delta, gamma_eq = 0.925, a = a, b = b, da0 = da0, db0 = db0, da1 = da1, db1 = db1, calibration = "hybrid", dp = 0.30, target_power = 0.80, target_freq_type1 = 0.10, sustain_n = 10 ) des_hybrid ``` ``` One-stage single-arm ROPE design Calibration: hybrid Search range n: 20 to 300 Null probability p0: 0.3 ROPE half-width delta: 0.12 Probability threshold gamma_eq: 0.925 Analysis prior: Beta(1, 1) Design prior (H0): Beta(60, 40) Design prior (H1): Beta(36, 84) Target Bayesian power: 0.8 Frequentist power point dp: 0.3 Target frequentist type-I error: 0.1 Sustain n: 10 Selected sample size n*: 173 Bayesian power(n*): 0.8166 Bayesian type-I(n*): 0.0001 Frequentist power(n*): 0.9597 Frequentist type-I(n*): 0.0784 at p0 - delta: 0.0755 at p0 + delta: 0.0784 Equivalence region: [39, 63] ``` ```{r, eval = FALSE} plot(des_hybrid) ``` ```{r echo = FALSE, out.width = "100%", fig.align = "center", fig.cap = "Figure 3: Hybrid calibration of a ROPE-based clinical phase II trial with binary endpoints. In hybrid calibration mode, Bayesian power is calibrated together with frequentist type-I-error, which often is required from a regulatory agencies perspective."} knitr::include_graphics("figures/singlearm-onestage-rope-calibration-hybrid.png") ``` Hybrid calibration may be attractive when one wants to retain the prior-based predictive power criterion while explicitly limiting the frequentist type-I error at the ROPE boundary. The resulting sample size now is identical to the one obtained in the Bayesian calibration. The above plot shows why: Bayesian power is the limiting factor in this case, as frequentist type-I-error is calibrated already for much smaller sample sizes. Adjusting the design priors to be more informative could thus further reduce the required sample size in hybrid calibration, as Bayesian power then accumulates faster. ## Full Bayes–frequentist calibration In **full** mode, all four operating characteristics are used in calibration: - predictive power under \(H_1\) ≥ `target_power`, - predictive type-I error under \(H_0\) ≤ `target_type1`, - frequentist power at `dp` ≥ `target_freq_power`, - frequentist type-I error (worst case at \(p_0 \pm \delta\)) ≤ `target_freq_type1`. This is the ROPE analogue of the “full Bayes–frequentist” calibration described for the Bayes factor design in the single-arm one-stage BF vignette. ```{r calib-full, eval = FALSE} des_full <- design_singlearm_onestage_rope( n_min = 20, n_max = 300, p0 = p0, delta = delta, gamma_eq = 0.925, a = a, b = b, da0 = da0, db0 = db0, da1 = da1, db1 = db1, calibration = "full", dp = 0.30, target_power = 0.80, target_type1 = 0.10, target_freq_power = 0.80, target_freq_type1 = 0.10, sustain_n = 10 ) print(des_full) ``` ```{r, eval = FALSE} plot(des_full) ``` ```{r echo = FALSE, out.width = "100%", fig.align = "center", fig.cap = "Figure 4: Full calibration of a ROPE-based clinical phase II trial with binary endpoints. In full calibration mode, Bayesian and frequentist power and type-I-error must be calibrated simultaneously, which is the strongest form of calibration."} knitr::include_graphics("figures/singlearm-onestage-rope-calibration-full.png") ``` For the chosen priors, ROPE width, and \(\gamma_{\mathrm{eq}} = 0.925\), this yields a design with: - \(n^\ast = 173\), - predictive power ≈ 0.82 under \(H_1\), - predictive type-I error ≈ 0.0001 under \(H_0\), - frequentist power ≈ 0.96 at \(dp = 0.30\), - frequentist type-I error ≈ 0.078 at the ROPE boundary. The sustainable feasibility requirement (`sustain_n = 10`) ensures that the operating characteristics remain within target bounds for several larger sample sizes as well. # Tuning parameters for frequentist calibration When using calibration modes that involve frequentist operating characteristics, three parameters play a central role: 1. the ROPE probability threshold \(\gamma_{\mathrm{eq}}\), 2. the ROPE half-width \(\delta\), 3. the point alternative \(dp\). ## The posterior probability threshold \(\gamma_{\mathrm{eq}}\) The threshold \(\gamma_{\mathrm{eq}}\) controls how demanding the ROPE decision rule is. It is the posterior probability which is required to be located inside the ROPE to establish equivalence. Larger values of \(\gamma_{\mathrm{eq}}\) shrink the set of \(y\) for which equivalence is accepted, which: - decreases frequentist type-I error at the ROPE boundary, - typically decreases predictive power and frequentist power as well. In the current example, setting \(\gamma_{\mathrm{eq}} = 0.8\) leads to a frequentist type-I error around 0.20–0.23 at the ROPE boundary, which is incompatible with a target of 0.10. Increasing the threshold to \(\gamma_{\mathrm{eq}} = 0.925\) yields a boundary-based frequentist type-I error around 0.08, compatible with a 0.10 target, while still achieving predictive and frequentist power values above 0.8. Users can treat \(\gamma_{\mathrm{eq}}\) as a tuning parameter (similar to a Bayes factor threshold) and explore its impact on operating characteristics: ```{r gamma-sensitivity, eval = FALSE} gamma_grid <- c(0.80, 0.85, 0.90, 0.925, 0.95) res_gamma <- lapply(gamma_grid, function(gam) { design_singlearm_onestage_rope( n_min = 20, n_max = 300, p0 = p0, delta = delta, gamma_eq = gam, a = a, b = b, da0 = da0, db0 = db0, da1 = da1, db1 = db1, calibration = "frequentist", dp = 0.30, target_freq_power = 0.80, target_freq_type1 = 0.10, sustain_n = 10 ) }) ``` ## The ROPE half-width \(\delta\) The ROPE half-width \(\delta\) encodes what is considered “clinically equivalent” to \(p_0\). A narrower ROPE: - makes equivalence harder to achieve, - tends to reduce frequentist type-I error at the boundary, - but also reduces power to declare equivalence when the true \(p\) is only moderately different from \(p_0\). Conversely, a wider ROPE relaxes the equivalence notion but may increase frequentist type-I error and require more careful calibration of \(\gamma_{\mathrm{eq}}\). Users can combine changes in \(\delta\) and \(\gamma_{\mathrm{eq}}\) to achieve desired trade-offs between clinical tolerance and statistical error control. ## The Point alternative \(dp\) The point alternative `dp` determines where frequentist power is evaluated. It should lie inside the ROPE, for example at the center (`dp = p0`) or at a clinically relevant equivalence point. In frequentist or full calibration modes: - `dp` must be specified, - frequentist power at `dp` is calibrated to exceed `target_freq_power`. For pure Bayesian or hybrid calibration, `dp` is optional. If supplied, the design function reports frequentist power at `dp` post hoc. Choosing `dp` near the center of the ROPE emphasizes performance when the true response probability lies well inside the equivalence region; choosing `dp` closer to a ROPE boundary focuses on performance near the edge of equivalence. # Summary The ROPE-based one-stage design in `bfbin2arm` supports the same four calibration modes as the Bayes-factor-based design: - purely Bayesian, - purely frequentist, - hybrid, - full Bayes–frequentist. For the frequentist and full calibration modes, the interplay of the ROPE threshold \(\gamma_{\mathrm{eq}}\), the ROPE width \(\delta\), and the point alternative `dp` determines whether both Bayesian and frequentist operating characteristics can reach their targets simultaneously. The current vignette illustrates how to specify these parameters and interpret the resulting operating characteristics for a typical single-arm phase II scenario.