---
title: "ROPE-based trial design for single-arm one-stage phase II trials with binary endpoints"
author: |
  | Riko Kelter
  | Institute of Medical Statistics and Computational Biology
  | Faculty of Medicine
  | University of Cologne
  | Cologne, Germany
date: "`r format(Sys.Date(), '%d %B %Y')`"
bibliography: references.bib
output:
  rmarkdown::html_vignette:
    mathjax: default
    includes:
      in_header: mathjax-config.html
vignette: >
  %\VignetteIndexEntry{ROPE-based trial design for single-arm one-stage phase II trials with binary endpoints}
  %\VignetteEngine{knitr::rmarkdown}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 5,
  dpi = 100,
  fig.retina = 1,
  dev = "png",
  dev.args = list(type = "cairo-png")
)
library(bfbin2arm)
```

## Introduction

This vignette illustrates how to use `design_singlearm_onestage_rope()` to calibrate ROPE-based equivalence designs for single-arm phase II trials with binary endpoints. ROPE stands for the region of practical equivalence and has been proposed by @Kruschke2018, @kruschkeDoingBayesianData2014 and @Kruschke2018a, although the idea itself is older and appears under various names in different contexts; see @Kelter2021BMCHodgesLehmann, @Liao2020, @lindeDecisionsEquivalenceComparison2023, @Lakens2018, @Wellek2010 and @panUntappedPotentialBayesian2025. The idea of replacing the test of a point-null hypothesis with the test of a small interval hypothesis goes back at least to @Hodges1954.

## Setup

We consider a single-arm binomial model
\[
Y \mid p \sim \mathrm{Binomial}(n, p),
\]
where \(Y\) is the number of responders among \(n\) patients and \(p \in (0,1)\) is the true response probability under the experimental treatment. We fix a benchmark response rate \(p_0\) (e.g. historical control or standard of care) and define the risk difference
\[
\Delta = p - p_0.
\]
We work with a symmetric ROPE formulation on the risk-difference scale.
Let \(\Delta = p - p_0\) denote the risk difference between the experimental treatment and the benchmark response probability \(p_0\), and let \(\delta > 0\) be the equivalence margin. On the risk-difference scale we define \[ H_0:\; |\Delta| > \delta, \] \[ H_1:\; |\Delta| \le \delta. \] Equivalently, on the response-probability scale the ROPE is \[ [p_0 - \delta,\; p_0 + \delta] \cap (0,1), \] and the hypotheses can be written as \[ H_0:\; p \notin [\,p_0 - \delta,\; p_0 + \delta\,], \] \[ H_1:\; p \in [\,p_0 - \delta,\; p_0 + \delta\,]. \] ## The region of practical equivalence (ROPE) The **region of practical equivalence (ROPE)** on the risk-difference scale is \[ \mathcal{R}_\Delta = [-\delta, \delta], \] where \(\delta > 0\) is a prespecified equivalence margin. Equivalently, on the response-probability scale the ROPE for \(p\) is \[ \mathcal{R}_p = [p_0 - \delta,\; p_0 + \delta] \cap (0,1). \] Given a beta analysis prior \[ p \sim \mathrm{Beta}(a, b), \] the posterior after observing \(Y = y\) is \[ p \mid y \sim \mathrm{Beta}(a + y,\; b + n - y), \] and the posterior ROPE probability is \[ \Pr\bigl(p \in \mathcal{R}_p \mid y\bigr) = F_{\mathrm{Beta}(a+y,\,b+n-y)}(p_0 + \delta) - F_{\mathrm{Beta}(a+y,\,b+n-y)}(p_0 - \delta), \] with endpoints truncated to \([0,1]\) if needed. A ROPE-based equivalence decision rule declares **practical equivalence** if \[ \Pr\bigl(p \in \mathcal{R}_p \mid y\bigr) \ge \gamma_{\mathrm{eq}}, \] where \(\gamma_{\mathrm{eq}} \in (0.5, 1)\) is a pre-specified evidence threshold. 
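
The posterior ROPE probability above is a difference of two beta distribution functions, so it can be sketched in a few lines of base R. The helper name `rope_posterior_prob()` and the example values (30 responders out of 100, \(p_0 = 0.30\), \(\delta = 0.12\), \(\gamma_{\mathrm{eq}} = 0.80\)) are assumptions chosen for illustration; the helper is not part of the package.

```{r}
# Hypothetical helper (not part of the package): posterior probability that
# p lies in the ROPE [p0 - delta, p0 + delta] under a Beta(a, b) analysis
# prior, after observing y responders among n patients.
rope_posterior_prob <- function(y, n, p0, delta, a = 1, b = 1) {
  lower <- max(0, p0 - delta)  # truncate the ROPE endpoints to [0, 1]
  upper <- min(1, p0 + delta)
  pbeta(upper, a + y, b + n - y) - pbeta(lower, a + y, b + n - y)
}

# Illustrative values: 30 responders out of 100, benchmark 0.30, margin 0.12
prob_in_rope <- rope_posterior_prob(y = 30, n = 100, p0 = 0.30, delta = 0.12)
prob_in_rope

# Practical equivalence is declared if the probability reaches gamma_eq
prob_in_rope >= 0.80
```

Because `pbeta()` is vectorized, the same helper also accepts a vector of responder counts `y`, which is convenient when scanning all possible outcomes for a fixed `n`.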
## Design and analysis priors At the **design stage** we distinguish between three priors: - an **analysis prior** \(\mathrm{Beta}(a, b)\) used to compute posterior ROPE probabilities, - a **design prior under equivalence** \(H_1: \Delta \in [-\delta, \delta]\), typically \(\mathrm{Beta}(a_1, b_1)\) centred near \(p_0\), - a **design prior under non-equivalence** \(H_0: \Delta \notin [-\delta, \delta]\), typically \(\mathrm{Beta}(a_0, b_0)\) centred away from the ROPE. These design priors induce beta–binomial predictive distributions for \(Y\) under equivalence and non-equivalence, respectively. Under the equivalence design prior \(\pi_1\) we define **ROPE-based Bayesian power** as \[ \text{Power}_\text{ROPE}(n) = \Pr_{\pi_1}\bigl( \Pr(p \in \mathcal{R}_p \mid Y) \ge \gamma_{\mathrm{eq}} \bigr), \] and under the non-equivalence design prior \(\pi_0\) we define the **ROPE-based Bayesian type-I error** as \[ \alpha_\text{ROPE}(n) = \Pr_{\pi_0}\bigl( \Pr(p \in \mathcal{R}_p \mid Y) \ge \gamma_{\mathrm{eq}} \bigr). \] ## ROPE decision illustrations In this section we illustrate the ROPE-based decision rule for four prototypical outcomes in a single-arm binomial model with analysis prior \(p \sim \mathrm{Beta}(1,1)\), benchmark response rate \(p_0 = 0.30\), and ROPE \(\mathcal{R}_p = [p_0 - \delta, p_0 + \delta] = [0.18, 0.42]\) with \(\delta = 0.12\). For an observed responder count \(Y = y\) out of \(n\) patients, the posterior is \[ p \mid y \sim \mathrm{Beta}(a + y,\; b + n - y), \] and the symmetric ROPE probability is \[ \Pr\bigl(|p - p_0| \le \delta \mid y\bigr) = \Pr(p_0 - \delta \le p \le p_0 + \delta \mid y). \] We adopt the following simple decision rule: - **Equivalence accepted** if \(\Pr(|p - p_0| \le \delta \mid y) \ge \gamma_{\mathrm{eq}}\). - **Non-equivalence accepted** if \(\Pr(|p - p_0| > \delta \mid y) \ge \gamma_{\mathrm{diff}}\). - **Indecisive** otherwise, with \(\gamma_{\mathrm{eq}} = \gamma_{\mathrm{diff}} = 0.80\) in the examples below. 
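
The three-way decision rule above can be written down directly. The following is a minimal sketch in base R, assuming the settings of this section as defaults; the function name `rope_decision()` is hypothetical and the three outcomes with \(n = 100\) are illustrative.

```{r}
# Hypothetical helper (not part of the package): three-way ROPE decision
# for y responders among n patients under a Beta(a, b) analysis prior.
rope_decision <- function(y, n, p0 = 0.30, delta = 0.12, a = 1, b = 1,
                          gamma_eq = 0.80, gamma_diff = 0.80) {
  lower <- max(0, p0 - delta)
  upper <- min(1, p0 + delta)
  rope_prob <- pbeta(upper, a + y, b + n - y) - pbeta(lower, a + y, b + n - y)
  if (rope_prob >= gamma_eq) {
    "Equivalence accepted"
  } else if (1 - rope_prob >= gamma_diff) {
    "Non-equivalence accepted"
  } else {
    "Indecisive"
  }
}

# Three illustrative outcomes out of n = 100 patients
decisions <- vapply(c(30, 18, 10), rope_decision, character(1), n = 100)
decisions
```

With \(\gamma_{\mathrm{eq}} = \gamma_{\mathrm{diff}} = 0.80\) the two acceptance conditions cannot hold at the same time, so the order of the checks does not matter.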
```{r, echo = FALSE} plot_rope_posterior <- function(n, y, p0 = 0.30, delta = 0.12, a = 1, b = 1, gamma_eq = 0.80, gamma_diff = 0.80, main = "") { shape1 <- a + y shape2 <- b + n - y p_min <- max(0, p0 - delta) p_max <- min(1, p0 + delta) p_grid <- seq(0, 1, length.out = 1000) dens <- dbeta(p_grid, shape1, shape2) rope_prob <- pbeta(p_max, shape1, shape2) - pbeta(p_min, shape1, shape2) diff_prob <- 1 - rope_prob decision <- if (rope_prob >= gamma_eq) { "Equivalence accepted" } else if (diff_prob >= gamma_diff) { "Non-equivalence accepted" } else { "Indecisive" } plot(p_grid, dens, type = "n", xlab = expression(p), ylab = "Posterior density", main = main) usr <- par("usr") x_min <- usr[1] x_max <- usr[2] y_min <- usr[3] y_max <- usr[4] # Lighter matte background regions h0_col <- adjustcolor("#DCEAF7", alpha.f = 0.55) # light matte blue h1_col <- adjustcolor("#F7DDDD", alpha.f = 0.55) # light matte red # H0: outside ROPE rect(xleft = x_min, ybottom = y_min, xright = p_min, ytop = y_max, col = h0_col, border = NA) rect(xleft = p_max, ybottom = y_min, xright = x_max, ytop = y_max, col = h0_col, border = NA) # H1: inside ROPE rect(xleft = p_min, ybottom = y_min, xright = p_max, ytop = y_max, col = h1_col, border = NA) # Posterior density and benchmark lines(p_grid, dens, lwd = 2) abline(v = p0, lty = 2) # Build plotmath label explicitly rope_label <- expression(scriptstyle(R)[p]) text(x = (p_min + p_max) / 2 + 0.05, y = y_min + 0.06 * (y_max - y_min), labels = rope_label, cex = 1.05) text(x = (x_min + p_min) / 2, y = y_min + 0.50 * (y_max - y_min), labels = expression(H[0]), col = "#5B84B1", cex = 1.1) text(x = (p_max + x_max) / 2, y = y_min + 0.50 * (y_max - y_min), labels = expression(H[0]), col = "#5B84B1", cex = 1.1) text(x = (p_min + p_max) / 2, y = y_min + 0.78 * (y_max - y_min), labels = expression(H[1]), col = "#C06C84", cex = 1.1) legend("topright", legend = c( sprintf("y = %d / n = %d", y, n), sprintf("Pr(ROPE | y) = %.2f", rope_prob), sprintf("Pr(outside 
ROPE | y) = %.2f", diff_prob), sprintf("Decision: %s", decision) ), bty = "n") } ``` ### 1) Equivalence accepted We choose an outcome \((n, y)\) for which the posterior is concentrated inside the ROPE and \(\Pr(|p - p_0| \le \delta \mid y) \ge \gamma_{\mathrm{eq}}\), so the decision is to **accept equivalence**. ```{r, echo = FALSE, eval = FALSE} par(mar = c(4, 4, 3, 1)) plot_rope_posterior( n = 100, y = 30, # close to p0 * n = 30 main = "Scenario 1: Equivalence accepted" ) ``` ```{r, echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 1: Illustration of the first possible scenario in a ROPE-based clinical phase II trial with binary endpoints: Equivalence is accepted, because sufficient posterior probability mass concentrates inside the ROPE. The true data-generating process follows the alternative hypothesis, that is, equivalence indeed holds."} knitr::include_graphics("figures/singlearm-onestage-rope-scenario1.png") ``` The plot illustrates this first possible outcome. ### 2) Type-I error: equivalence concluded under the null hypothesis Conceptually, a type-I error occurs when the *true* data-generating process is non-equivalent (e.g. \(p = 0.55\) or 0.60), but the observed data still lead the ROPE rule to **accept equivalence**. Thus, $H_0$ is true and $p \notin [\,p_0 - \delta,\; p_0 + \delta\,]$ holds. In this plot we **do not** change the posterior calculation—posterior is always conditional on the observed \((n,y)\) and the analysis prior. To illustrate a type-I error, we choose \((n,y)\) such that: - \(y\) is plausible under a non-equivalence scenario (e.g. generated from \(p = 0.55\)), **and** - the resulting posterior still satisfies \(\Pr(|p - p_0| \le \delta \mid y) \ge \gamma_{\mathrm{eq}}\). 
For illustration we tune \(y\) so that this happens:

```{r echo = FALSE, eval = FALSE}
par(mar = c(4, 4, 3, 1))
plot_rope_posterior(
  n = 100, y = 35,  # pick y so that Pr(ROPE | y) >= 0.80 although the true p lies above p0 + delta
  main = "Scenario 2: Equivalence accepted (type-I error case)"
)
```

```{r echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 2: Illustration of the second possible scenario in a ROPE-based clinical phase II trial with binary endpoints: Equivalence is accepted, because sufficient posterior probability mass concentrates inside the ROPE. In contrast to the first possible scenario, the true data-generating process follows the null hypothesis. Thus, a ROPE-based type-I error occurs."}
knitr::include_graphics("figures/singlearm-onestage-rope-scenario2.png")
```

In this scenario, in contrast to scenario 1 above, the *true* \(p\) lies outside the ROPE (under $H_0$), but due to sampling variability the posterior still concentrates enough mass inside the ROPE to meet the equivalence threshold.

### 3) Indecisive result

Here we choose \((n,y)\) such that neither threshold is reached:

- \(\Pr(|p - p_0| \le \delta \mid y) < \gamma_{\mathrm{eq}}\),
- \(\Pr(|p - p_0| > \delta \mid y) < \gamma_{\mathrm{diff}}\).

The posterior spreads substantial mass both inside and outside the ROPE, and the decision is **indecisive**.
```{r echo = FALSE, eval = FALSE} par(mar = c(4, 4, 3, 1)) plot_rope_posterior( n = 100, y = 18, # tuned so ROPE probability is between ~0.3 and 0.7 main = "Scenario 3: Indecisive (posterior straddles ROPE)" ) ``` ```{r echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 3: Illustration of the third possible scenario in a ROPE-based clinical phase II trial with binary endpoints: The result is indecisive, because neither does sufficient posterior probability mass concentrate inside the ROPE, nor outside the ROPE."} knitr::include_graphics("figures/singlearm-onestage-rope-scenario3.png") ``` ### 4) Clear non-equivalence Finally, we choose an outcome where the posterior lies mostly outside the ROPE, so that \(\Pr(|p - p_0| > \delta \mid y) \ge \gamma_{\mathrm{diff}}\) and we **accept non-equivalence**. ```{r echo = FALSE, eval = FALSE} par(mar = c(4, 4, 3, 1)) plot_rope_posterior( n = 100, y = 10, # clearly below the ROPE region main = "Scenario 4: Non-equivalence accepted" ) ``` ```{r echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 4: Illustration of the fourth possible scenario in a ROPE-based clinical phase II trial with binary endpoints: Non-equivalence is accepted, because sufficient posterior probability mass concentrates outside the ROPE."} knitr::include_graphics("figures/singlearm-onestage-rope-scenario4.png") ``` In this last case, the treatment is worse than the standard of care with success probability $p_0$. ## A first ROPE-based design example We now provide a simple example of the calibration function `design_singlearm_onestage_rope()`, which calibrates a single-arm one-stage phase II design using the ROPE as the primary measure of evidence. We consider a setting with benchmark response rate \(p_0 = 0.30\) and regard differences up to 0.12 as clinically negligible. Thus the ROPE on \(p\) is \(\mathcal{R}_p = [0.18, 0.42]\). 
We use: - a **uniform analysis prior** \(\mathrm{Beta}(1,1)\), - a **non-equivalence design prior** \(\mathrm{Beta}(60,40)\) with mean 0.60, representing clearly superior response compared to 0.30 (non-equivalence); this is the design prior under $H_0$ - an **equivalence design prior** \(\mathrm{Beta}(36,84)\) with mean 0.30, representing plausible equivalence scenarios; this is the design prior under $H_1$ - an **equivalence threshold** \(\gamma_{\mathrm{eq}} = 0.80\), - a **target ROPE-based power** of 0.80 under the equivalence design prior, - a **maximum ROPE-based type-I error** of 0.10 under the non-equivalence design prior, - a **sustain requirement** of `sustain_n = 10`, meaning the criteria must hold for 10 consecutive sample sizes starting from the selected \(n^\ast\). ```{r} des_baseline <- design_singlearm_onestage_rope( n_min = 20, n_max = 200, p0 = 0.30, # benchmark response rate p0 delta = 0.12, # ROPE half-width: equivalence if p in [0.18, 0.42] gamma_eq = 0.80, # posterior ROPE probability threshold for equivalence # Analysis prior: p ~ Beta(a, b), used for posterior and ROPE decision a = 1, b = 1, # Design prior under H0 (non-equivalence): p ~ Beta(da0, db0) # Here: mean 0.60, representing clearly higher response than 0.30. da0 = 60, db0 = 40, # Design prior under H1 (equivalence): p ~ Beta(da1, db1) # Here: mean 0.30, representing plausible equivalence scenarios. 
da1 = 36, db1 = 84, # Target ROPE-based power under H1 (equivalence design prior) target_power = 0.80, # Maximum ROPE-based type-I error under H0 (non-equivalence design prior) target_type1 = 0.10, # Stability requirement: criteria must hold for 10 consecutive n values sustain_n = 10 ) ``` We can take a look at the resulting design object: ```{r} des_baseline ``` The printed output reports: - the search range for \(n\), - the ROPE specification (`p0`, `delta`, `gamma_eq`), - the analysis and design priors in beta parameterization, - the target power and type-I constraints, - the chosen `sustain_n`, - the selected sample size `Selected n`, - the ROPE-based power and type-I error at that \(n\), - and the equivalence decision region \([y_{\min}^{\mathrm{eq}}, y_{\max}^{\mathrm{eq}}]\), i.e. all responder counts \(y\) that lead to practical equivalence. ### Summarizing the design We can summarize the calibration grid and the selected design via: ```{r, eval = FALSE} summary(des_baseline) ``` The summary object (not shown here) contains: - the selected row of the grid (with `n`, `y_eq_min`, `y_eq_max`, `power`, `type1`), - the first and last 10 rows of the evaluated `n` values. In particular: - `y_eq_min` and `y_eq_max` are the smallest and largest responder counts for which the posterior ROPE probability exceeds `gamma_eq` and equivalence would be concluded; - `power` is the ROPE-based Bayesian power under the equivalence design prior at that `n`; - `type1` is the ROPE-based Bayesian type-I error under the non-equivalence design prior at that `n`. These summaries allow you to inspect how power and type-I error evolve with increasing sample size, and how the equivalence decision region moves. 
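
To make the quantities `y_eq_min`, `y_eq_max`, `power` and `type1` concrete, the following sketch recomputes them from first principles for a single sample size. It assumes that power and type-I error are the beta-binomial predictive probabilities of the equivalence region under the two design priors, as defined earlier; the helpers `dbetabinom()` and `rope_oc()` are hypothetical names, `n = 94` is only an example value, and the package's internal implementation may differ.

```{r}
# Beta-binomial predictive pmf of Y for n patients when p ~ Beta(a, b)
# (hypothetical helper, written in base R)
dbetabinom <- function(y, n, a, b) {
  exp(lchoose(n, y) + lbeta(y + a, n - y + b) - lbeta(a, b))
}

# Equivalence region and ROPE-based operating characteristics at one n
rope_oc <- function(n, p0, delta, gamma_eq, a, b, da0, db0, da1, db1) {
  y <- 0:n
  lower <- max(0, p0 - delta)
  upper <- min(1, p0 + delta)
  rope_prob <- pbeta(upper, a + y, b + n - y) - pbeta(lower, a + y, b + n - y)
  eq <- rope_prob >= gamma_eq  # responder counts that declare equivalence
  list(
    y_eq_min = if (any(eq)) min(y[eq]) else NA_integer_,
    y_eq_max = if (any(eq)) max(y[eq]) else NA_integer_,
    power    = sum(dbetabinom(y[eq], n, da1, db1)),  # under the H1 design prior
    type1    = sum(dbetabinom(y[eq], n, da0, db0))   # under the H0 design prior
  )
}

oc <- rope_oc(n = 94, p0 = 0.30, delta = 0.12, gamma_eq = 0.80,
              a = 1, b = 1, da0 = 60, db0 = 40, da1 = 36, db1 = 84)
str(oc)
```

Looping `rope_oc()` over a range of `n` gives a curve of power and type-I error against sample size, which is the kind of grid the summary reports.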
### Plotting the design The overview plot visualizes operating characteristics, priors, and a textual summary: ```{r echo = TRUE, eval = FALSE} plot(des_baseline) ``` ```{r, echo = FALSE, out.width = "100%", fig.align = "center", fig.cap = "Figure 5: Illustration of calibrated single-arm one-stage design of a ROPE-based clinical phase II trial with binary endpoint."} knitr::include_graphics("figures/singlearm-onestage-rope-fig5.png") ``` - The **upper left panel** shows ROPE-based power and type-I error as functions of \(n\), with horizontal lines at `target_power` and `target_type1`, and a vertical line at the selected `n`. - The **upper right panel** displays a textual summary of the key inputs and outputs (priors, ROPE, thresholds, selected `n`, power, type-I, and equivalence region). - The **lower left panel** displays the design priors under `H0` and `H1` overlaid: their beta densities highlight which response probabilities are regarded as typical under non-equivalence and equivalence, respectively. - The **lower right panel** displays the analysis prior \(\mathrm{Beta}(a,b)\), which governs the posterior ROPE probabilities used in the decision rule. 
You can also visualize only the operating characteristics or the decision region:

```{r echo = TRUE, eval = FALSE}
plot(des_baseline, what = "operating_characteristics")
```

```{r echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 6: Visualization of the operating characteristics of a calibrated single-arm one-stage design of a ROPE-based clinical phase II trial with binary endpoint."}
knitr::include_graphics("figures/singlearm-onestage-rope-fig6.png")
```

```{r echo = TRUE, eval = FALSE}
plot(des_baseline, what = "decision_region")
```

```{r echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 7: Visualization of the equivalence region for increasing sample size of a calibrated single-arm one-stage design of a ROPE-based clinical phase II trial with binary endpoint."}
knitr::include_graphics("figures/singlearm-onestage-rope-fig7.png")
```

The decision-region plot shows how the range of responder counts leading to equivalence changes with `n`, providing intuition about how stringent the rule is at different sample sizes.

## Example 1: Oncology phase II equivalence trial

In this section we illustrate a full ROPE-based design calibration in a setting resembling a single-arm phase II oncology trial with a binary endpoint such as objective response rate (ORR), compare @chenBayesianTwostageDesign2022, @kelterBayesianGroupSequentialPredictive2024 and @Lee2008.

For definiteness, we assume:

- Historical control ORR \(p_0 = 0.30\) based on previous phase II data.
- The new treatment is considered *clinically non-inferior / equivalent* if its true ORR lies within ±12 percentage points of \(p_0\), that is, \(\mathcal{R}_p = [0.18, 0.42]\). This is a common margin in phase II oncology trials, compare @hashimSystematicReviewNoninferiority2021.
- We want a high probability to conclude practical equivalence when the true ORR is near 0.30, and a low probability to conclude equivalence when the true ORR is clearly better or worse than 0.30 (non-equivalence).

### Clinical hypotheses and ROPE

On the response-probability scale we set \(p_0 = 0.30\) and \(\delta = 0.12\). The ROPE for equivalence is
\[
\mathcal{R}_p = [p_0 - \delta,\; p_0 + \delta] = [0.18, 0.42].
\]
We formulate the hypotheses as
\[
H_0:\; p \notin [\,p_0 - \delta,\; p_0 + \delta\,] \quad \text{(non-equivalence, clinically relevant difference)},
\]
\[
H_1:\; p \in [\,p_0 - \delta,\; p_0 + \delta\,] \quad \text{(practical equivalence)}.
\]

We adopt the following ROPE-based decision rule:

- **Accept equivalence** (\(H_1\)) if \(\Pr(p \in \mathcal{R}_p \mid y) \ge \gamma_{\mathrm{eq}}\).
- **Accept non-equivalence** (\(H_0\)) if \(\Pr(p \notin \mathcal{R}_p \mid y) \ge \gamma_{\mathrm{diff}}\).
- **Indecisive** otherwise.

For this example, we set \(\gamma_{\mathrm{eq}} = \gamma_{\mathrm{diff}} = 0.80\).

### Analysis and design priors

We separate the analysis prior from the design priors.

- **Analysis prior** for ORR:
  \[
  p \sim \mathrm{Beta}(1,1),
  \]
  a uniform prior on \((0,1)\), reflecting weak prior information.
- **Design prior under equivalence** \(H_1\):
  \[
  p \sim \mathrm{Beta}(a_1, b_1) = \mathrm{Beta}(36, 84),
  \]
  which has mean \(36 / (36 + 84) = 0.30\) and moderate concentration around \(p_0 = 0.30\). This prior represents plausible ORR values under practical equivalence.
- **Design prior under non-equivalence** \(H_0\): we consider superiority scenarios in which the ORR exceeds the upper ROPE limit of 0.42. For concreteness we choose
  \[
  p \sim \mathrm{Beta}(60, 40),
  \]
  which is centred at 0.60 and places most of its mass clearly outside the ROPE interval [0.18, 0.42]. This prior represents clinically relevant departures from equivalence (e.g. strong improvement), and is used to quantify the ROPE-based type-I error of wrongly declaring equivalence in such scenarios.
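
As a quick plausibility check of the two design priors, one can compute how much prior mass each places inside the ROPE \([0.18, 0.42]\). This is a small base-R sketch; the helper name `mass_in_rope()` is hypothetical.

```{r}
# ROPE for p0 = 0.30 and delta = 0.12
rope <- c(0.18, 0.42)

# Hypothetical helper: prior mass of a Beta(shape1, shape2) inside the ROPE
mass_in_rope <- function(shape1, shape2) {
  pbeta(rope[2], shape1, shape2) - pbeta(rope[1], shape1, shape2)
}

mass_h1 <- mass_in_rope(36, 84)  # equivalence design prior, mean 0.30
mass_h0 <- mass_in_rope(60, 40)  # non-equivalence design prior, mean 0.60
c(H1 = mass_h1, H0 = mass_h0)
```

A design prior under \(H_1\) with little mass inside the ROPE, or one under \(H_0\) with substantial mass inside it, would blur the interpretation of the resulting power and type-I error.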
These design priors induce beta–binomial predictive distributions for the response count \(Y\) under \(H_1\) and \(H_0\), respectively. Under the equivalence design prior \(\pi_1\), the ROPE-based Bayesian power is \[ \text{Power}_\text{ROPE}(n) = \Pr_{\pi_1}\bigl( \Pr(p \in \mathcal{R}_p \mid Y) \ge \gamma_{\mathrm{eq}} \bigr), \] and under the non-equivalence design prior \(\pi_0\), the ROPE-based Bayesian type-I error is \[ \alpha_\text{ROPE}(n) = \Pr_{\pi_0}\bigl( \Pr(p \in \mathcal{R}_p \mid Y) \ge \gamma_{\mathrm{eq}} \bigr). \] ### Calibration target For this oncology-inspired example we consider the following calibration goals: - ROPE-based power under \(H_1\) at least 80%: \(\text{Power}_\text{ROPE}(n) \ge 0.80\). - ROPE-based type-I error under \(H_0\) at most 10%: \(\alpha_\text{ROPE}(n) \le 0.10\). - A stability requirement `sustain_n = 10`, meaning that the criteria must hold for 10 consecutive sample sizes starting at the selected \(n^\ast\). This guards against local non-monotonicities in the discrete predictive curves. We search over a one-stage sample size range of 20 to 200 patients. ```{r} des_onc <- design_singlearm_onestage_rope( n_min = 20, n_max = 200, p0 = 0.30, delta = 0.12, gamma_eq = 0.80, # Analysis prior p ~ Beta(a, b) a = 1, b = 1, # Design priors under H0 and H1 da0 = 60, db0 = 40, # H0: non-equivalence, mean ~0.60 da1 = 36, db1 = 84, # H1: equivalence, mean ~0.3 target_power = 0.80, target_type1 = 0.10, sustain_n = 10 ) des_onc ``` The printed output shows the selected sample size \(n^\ast\), ROPE-based power and type-I error at that \(n^\ast\), and the equivalence decision region in terms of the responder counts \(y\) that lead to practical equivalence. ```{r, eval = FALSE} summary(des_onc) ``` The summary (not shown here) gives the first and last rows of the calibration grid, along with the selected design point. These values can be reported, e.g. 
as a table listing \(n^\ast\), the ROPE region \(\mathcal{R}_p = [0.18, 0.42]\), the decision thresholds \(\gamma_{\mathrm{eq}}, \gamma_{\mathrm{diff}}\) and the resulting ROPE-based power and type-I error. This is primarily helpful when analyzing a specific design or the relationship between the operating characteristics and the sample size.

### Visualization

We can inspect the operating characteristics and prior structure in more detail.

```{r echo = TRUE, eval = FALSE}
plot(des_onc)
```

```{r echo = FALSE, out.width = "100%", fig.align = "center", fig.cap = "Figure 8: Visualization of the calibrated ROPE-based oncology single-arm one-stage phase II design with binary endpoints."}
knitr::include_graphics("figures/singlearm-onestage-rope-fig8.png")
```

- The upper-left panel shows ROPE-based power and type-I error as functions of \(n\).
- The upper-right panel summarizes the design numerically.
- The lower-left panel overlays the design priors under \(H_0\) and \(H_1\).
- The lower-right panel shows the analysis prior.

For example, the equivalence design prior `Beta(36, 84)` reflects the prior belief that in realistic equivalence scenarios the ORR is close to 30%, whereas the non-equivalence design prior `Beta(60, 40)` reflects scenarios with a substantially higher ORR around 60%.

To see how the equivalence decision region changes with sample size, we can plot the decision region directly:

```{r echo = TRUE, eval = FALSE}
plot(des_onc, what = "decision_region")
```

```{r echo = FALSE, out.width = "100%", fig.align = "center", fig.cap = "Figure 9: Visualization of the equivalence region of ROPE-based oncology single-arm one-stage phase II designs with binary endpoints for increasing sample size."}
knitr::include_graphics("figures/singlearm-onestage-rope-fig9.png")
```

This plot shows, for each evaluated sample size \(n\), the range of responder counts \(y\) that would lead the trial to conclude practical equivalence.
For the selected \(n^\ast\), this region is reported in the upper-right panel of Figure 8: if 20 to 35 of the \(n^\ast = 94\) patients respond in the phase II trial, then equivalence of the novel drug or treatment to the reference probability $p_0 = 0.30$ (of the standard of care) is established. Thus, we then accept $H_1: p \in [\,p_0 - \delta,\; p_0 + \delta\,]$. Figure 8 also shows that both the Bayesian power and the type-I error rate meet their calibration targets.

## Example 2: Sensitivity analysis via grid exploration

Here we explore the impact of different design priors, ROPE half-widths $\delta$ and the posterior probability threshold $\gamma_{\mathrm{eq}}$ for establishing equivalence.

```{r grid-exploration, message=FALSE, warning=FALSE, echo = FALSE}
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(knitr)

# Fixed setup: oncology-inspired equivalence example
n_min <- 10
n_max <- 250
p0 <- 0.30

# Analysis prior
a <- 1
b <- 1

# Design priors under H0 (non-equivalence) and H1 (equivalence)
da0 <- 60
db0 <- 40  # mean = 0.60
da1 <- 36
db1 <- 84  # mean = 0.30

# Calibration targets
target_power <- 0.80
target_type1 <- 0.10
sustain_n <- 10

# Grid for ROPE half-width and posterior threshold
delta_grid <- c(0.10, 0.12, 0.15)  # 0.08 removed
gamma_eq_grid <- c(0.75, 0.80, 0.90)

grid <- expand.grid(
  delta = delta_grid,
  gamma_eq = gamma_eq_grid,
  KEEP.OUT.ATTRS = FALSE,
  stringsAsFactors = FALSE
)

# Helper to extract a concise summary from the design object
extract_design_summary <- function(fit, delta, gamma_eq) {
  tibble(
    delta = delta,
    gamma_eq = gamma_eq,
    n_star = if (!is.null(fit$n_star)) fit$n_star else NA_real_,
    power_H1 = if (!is.null(fit$selected$power)) fit$selected$power else NA_real_,
    type1_H0 = if (!is.null(fit$selected$type1)) fit$selected$type1 else NA_real_
  )
}

# Wrapper to run the design calibration for one grid point
run_design_grid <- function(delta, gamma_eq) {
  fit <- design_singlearm_onestage_rope(
    n_min = n_min,
    n_max = n_max,
    p0 = p0,
    delta = delta,
gamma_eq = gamma_eq, gamma_diff = gamma_eq, # same threshold for non-equivalence direction = "equivalence", a = a, b = b, da0 = da0, db0 = db0, da1 = da1, db1 = db1, calibration = "Bayesian", dp = NULL, target_power = target_power, target_type1 = target_type1, target_pce_h0 = NULL, target_freq_power = NULL, target_freq_type1 = NULL, sustain_n = sustain_n, return_grid = TRUE ) extract_design_summary(fit, delta, gamma_eq) } # Run the grid with the *updated* delta_grid and gamma_eq_grid results_grid <- pmap_dfr( list(grid$delta, grid$gamma_eq), run_design_grid ) %>% arrange(delta, gamma_eq) # Keep only rows where a feasible design was found results_grid_feasible <- results_grid %>% filter(!is.na(n_star), !is.na(power_H1), !is.na(type1_H0)) # Inspect which combinations dropped out (for checking) results_grid %>% mutate(feasible = !is.na(n_star)) %>% print() # Table for the vignette / paper kable( results_grid, digits = 3, caption = "Grid exploration for the oncology equivalence example: calibrated sample size n*, ROPE-based Bayesian power under H1, and ROPE-based Bayesian type-I error under H0 for different ROPE half-widths and posterior probability thresholds." 
)
```

```{r echo = FALSE, eval = FALSE}
# Plot n* versus gamma_eq, stratified by delta (feasible designs only)
ggplot(
  results_grid_feasible,
  aes(x = gamma_eq, y = n_star, color = factor(delta), group = factor(delta))
) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  labs(
    x = expression(gamma[eq]),
    y = expression(n^"*"),
    color = expression(delta),
    title = "Calibrated sample size n* across ROPE widths and posterior thresholds"
  ) +
  theme_minimal(base_size = 12)
```

```{r echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 10: Calibrated sample size n* across ROPE widths and posterior thresholds for the oncology equivalence phase II trial."}
knitr::include_graphics("figures/singlearm-onestage-rope-fig10.png")
```

```{r echo = FALSE, eval = FALSE}
# Plot type-I error versus gamma_eq, stratified by delta (feasible designs only)
ggplot(
  results_grid_feasible,
  aes(x = gamma_eq, y = type1_H0, color = factor(delta), group = factor(delta))
) +
  geom_line(linewidth = 0.8) +
  geom_point(size = 2) +
  geom_hline(
    yintercept = target_type1,
    linetype = "dashed",
    color = "grey40"
  ) +
  labs(
    x = expression(gamma[eq]),
    y = expression(alpha[H[0]](n^"*")),
    color = expression(delta),
    title = "ROPE-based Bayesian type-I error at the calibrated sample size"
  ) +
  theme_minimal(base_size = 12)
```

```{r echo = FALSE, out.width = "80%", fig.align = "center", fig.cap = "Figure 11: ROPE-based Bayesian type-I error at the calibrated sample sizes for the oncology equivalence phase II trial for different ROPE widths."}
knitr::include_graphics("figures/singlearm-onestage-rope-fig11.png")
```

---

## Example 3: Revisiting the first example with PCE(H0) and frequentist power

Here we revisit the first example of the oncology trial, now adding a target constraint on the probability of compelling evidence for $H_0$, denoted PCE(H0), and also reporting the frequentist power of the resulting design post hoc:

```{r example-pce-freq, message=FALSE, warning=FALSE}
library(dplyr)
library(tidyr)
library(purrr) library(ggplot2) library(knitr) # Oncology-inspired equivalence example: revisited n_min <- 10 n_max <- 300 p0 <- 0.30 # ROPE and evidence thresholds delta <- 0.12 gamma_eq <- 0.925 gamma_diff <- 0.90 # Analysis prior a <- 1 b <- 1 # Design priors as in the first example da0 <- 60 db0 <- 40 # non-equivalence prior (H0) da1 <- 36 db1 <- 84 # equivalence prior (H1) # Calibration targets target_power <- 0.80 # Bayesian predictive power under H1 target_type1 <- 0.10 # Bayesian predictive type-I error under H0 target_pce_h0 <- 0.80 # predictive compelling evidence for H0 target_freq_power <- 0.80 # frequentist power at dp (here dp = p0) target_freq_type1 <- 0.10 # frequentist type-I error at ROPE boundaries # Point alternative for frequentist power dp <- p0 # Design calibration in "full" mode fit_pce_freq <- design_singlearm_onestage_rope( n_min = n_min, n_max = n_max, p0 = p0, delta = delta, gamma_eq = gamma_eq, gamma_diff = gamma_diff, direction = "equivalence", a = a, b = b, da0 = da0, db0 = db0, da1 = da1, db1 = db1, calibration = "full", dp = dp, target_power = target_power, target_type1 = target_type1, target_pce_h0 = target_pce_h0, target_freq_power = target_freq_power, target_freq_type1 = target_freq_type1, sustain_n = 10, return_grid = TRUE ) fit_pce_freq ``` You can summarise and visualise the calibrated design: ```{r echo = FALSE, eval = FALSE} plot(fit_pce_freq) ``` ```{r echo = FALSE, out.width = "100%", fig.align = "center", fig.cap = "Figure 12: Calibrated one-stage ROPE-based oncology equivalence phase II design with additional constraints on the probability of compelling evidence for the null hypothesis. 
In contrast to the earlier example, the probability of compelling evidence must reach 80% now, and frequentist power and type-I-error rate must also fulfill their respective target constraints of 80% and 10%."} knitr::include_graphics("figures/singlearm-onestage-rope-fig12.png") ``` ```{r example-pce-freq-summary, message=FALSE, warning=FALSE} library(dplyr) library(tidyr) library(purrr) library(ggplot2) # Extract selected row and key operating characteristics sel <- fit_pce_freq$selected summary_tab <- tibble( quantity = c( "Selected sample size n*", "Bayesian power under H1 at n*", "Bayesian type-I error under H0 at n*", "PCE(H0) at n*", "Frequentist power at p = p0", "Frequentist type-I error (worst boundary)" ), value = c( fit_pce_freq$n_star, sel$power, sel$type1, sel$pce_h0, sel$freq_power, sel$freq_type1 ) ) kable( summary_tab, digits = 3, col.names = c("Quantity", "Value"), caption = "Operating characteristics of the calibrated equivalence design with constraints on Bayesian power, Bayesian type-I error, PCE(H0), and frequentist power/type-I error." 
)
```

Optionally, you can compare this design to the original first example (purely Bayesian calibration) by recomputing the first example and putting both designs side by side in a small table:

```{r example-pce-freq-comparison, message=FALSE, warning=FALSE}
des_onc_with_freq_power <- design_singlearm_onestage_rope(
  n_min = 20,
  n_max = 200,
  p0 = 0.30,
  delta = 0.12,
  gamma_eq = 0.80,
  # frequentist power at p = 0.3
  dp = 0.3,
  # Analysis prior p ~ Beta(a, b)
  a = 1, b = 1,
  # Design priors under H0 and H1
  da0 = 60, db0 = 40,  # H0: non-equivalence, mean ~0.60
  da1 = 36, db1 = 84,  # H1: equivalence, mean ~0.3
  target_power = 0.80,
  target_type1 = 0.10,
  target_freq_type1 = 0.10,
  target_freq_power = 0.80,
  sustain_n = 10,
  calibration = "Bayesian"
)

# Selected grid rows of both designs
sel_orig <- des_onc_with_freq_power$selected
sel_new <- fit_pce_freq$selected

comparison_tab <- tibble(
  design = c("Bayesian (original)", "Full (Bayes + frequentist + PCE(H0))"),
  # take n from the selected rows of both designs for consistency
  n_star = c(sel_orig$n, sel_new$n),
  bayes_power = c(sel_orig$power, sel_new$power),
  bayes_type1 = c(sel_orig$type1, sel_new$type1),
  pce_h0 = c(sel_orig$pce_h0, sel_new$pce_h0),
  freq_power = c(sel_orig$freq_power, sel_new$freq_power),
  freq_type1 = c(sel_orig$freq_type1, sel_new$freq_type1)
)

kable(
  comparison_tab,
  digits = 3,
  caption = "Comparison of the original Bayesian calibration and the extended design with additional constraints on PCE(H0) and frequentist power/type-I error."
)
```

This third example stays within the equivalence framework but shows how the **same posterior-threshold decision rule** can be calibrated to satisfy additional Bayesian and frequentist criteria, including a lower bound on predictive compelling evidence for \(H_0\).
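
To illustrate what the frequentist criteria measure, the following base-R sketch evaluates the probability of declaring equivalence at fixed values of \(p\). It rests on two assumptions suggested by the code comments above: frequentist power is evaluated at the point `dp`, and frequentist type-I error is the larger of the two rejection probabilities at the ROPE boundaries \(p_0 \pm \delta\). The helper `freq_oc()` is hypothetical, `n = 94` and `gamma_eq = 0.80` are example values, and the package's internal definitions may differ.

```{r}
# Hypothetical helper: frequentist operating characteristics of the
# posterior-threshold equivalence rule at a single sample size n
freq_oc <- function(n, p0, delta, gamma_eq, a = 1, b = 1, dp = p0) {
  y <- 0:n
  lower <- max(0, p0 - delta)
  upper <- min(1, p0 + delta)
  rope_prob <- pbeta(upper, a + y, b + n - y) - pbeta(lower, a + y, b + n - y)
  eq <- rope_prob >= gamma_eq  # responder counts that declare equivalence
  # Probability of declaring equivalence when the true response rate is p
  prob_eq <- function(p) sum(dbinom(y[eq], n, p))
  c(
    freq_power = prob_eq(dp),                           # at the point alternative
    freq_type1 = max(prob_eq(lower), prob_eq(upper))    # worst ROPE boundary
  )
}

freq <- freq_oc(n = 94, p0 = 0.30, delta = 0.12, gamma_eq = 0.80)
round(freq, 3)
```

Evaluating the error rate at the ROPE boundaries is demanding, because a true \(p\) sitting exactly on a boundary is hard to distinguish from one just inside the ROPE; this is one reason why adding frequentist constraints typically calls for a stricter threshold or a larger sample size.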
## Summary This vignette has shown how `design_singlearm_onestage_rope()` can be used to: - define a baseline ROPE-based equivalence design in a realistic phase II range, - quantify how the evidence threshold `gamma_eq`, ROPE width `delta`, design priors, and sustain requirement influence the required sample size, power, and type-I error. In practice, we recommend exploring such grids of tuning parameters collaboratively with clinicians, to arrive at a design where the ROPE region, evidence thresholds, and priors are all clinically interpretable and the resulting sample size is operationally feasible. ## References