Test Configuration
5.0%
80%
Control (A)
Variants
+ Add Variant
📚 Methodology — Frequentist AB Test
The Test
Two-proportion z-test for binary metrics; Welch's t-test for continuous means. Both test H₀: no difference between control and variant.
Formula
Proportion: z = (p̂_B − p̂_A) / √(p̂(1−p̂)(1/n_A + 1/n_B)), where p̂ is the pooled rate (x_A + x_B)/(n_A + n_B).
Mean: t = (x̄_B − x̄_A) / √(s²_A/n_A + s²_B/n_B)
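The proportion formula can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the function name `two_proportion_z` and the example counts are hypothetical, and the p-value uses the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value, relative lift) for variant B vs control A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under H0 (no difference between A and B).
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    lift = (rate_b - rate_a) / rate_a
    return z, p_value, lift


# Hypothetical data: 500/10,000 conversions in A, 560/10,000 in B.
z, p_value, lift = two_proportion_z(500, 10_000, 560, 10_000)
print(f"z = {z:.3f}, p = {p_value:.4f}, relative lift = {lift:.1%}")
```

With these made-up counts the relative lift is 12%, yet the p-value lands just above 0.05, which shows why a large-looking lift alone is not evidence of significance.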
Assumptions
Independent observations · Random assignment · ≥5 expected events per cell · Sufficient sample size (check Sample Size tab).
Interpretation
p < α → reject H₀. Relative lift = (rate_B − rate_A)/rate_A. A CI for the difference that excludes 0 → significant. Always check practical significance (is the lift worth shipping?).
Bias Risks
Peeking inflates Type I error · Novelty effect inflates early lifts · Simpson's Paradox can reverse aggregate results · Multiple comparisons → Bonferroni applied automatically.
No results yet
Fill in data on the left and click Run Analysis
Verdict
Statistics
Rate Comparison
Prior (Beta Distribution)
Model: Beta-Binomial. α=1, β=1 = uniform prior.
1
1
0.50%
Control (A)
Variant (B)
📚 Methodology — Bayesian AB Test
The Model
Beta-Binomial conjugate model. Prior Beta(α,β) updated with data → Posterior Beta(α+x, β+n−x). Monte Carlo (30k samples) estimates P(B>A).
Decision Metrics
P(B beats A): probability that the variant is better. Expected Loss = E[max(0, θ_A − θ_B)]: the average conversion rate given up by choosing B, counting only the cases where A is actually better.
Decision Rule
Ship when P(B>A) ≥ 95% AND Expected Loss ≤ threshold. Threshold = the largest conversion-rate loss you are willing to accept.
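The model, the two decision metrics, and the decision rule can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `bayes_ab`, the fixed seed, the example counts, and the 0.5% loss threshold are hypothetical choices, not values taken from the tool's implementation.

```python
import random


def bayes_ab(conv_a, n_a, conv_b, n_b, alpha=1.0, beta=1.0,
             draws=30_000, seed=0):
    """Monte Carlo estimate of P(B > A) and the expected loss of shipping B."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    wins = 0
    loss = 0.0
    for _ in range(draws):
        # Posterior for each arm: Beta(alpha + x, beta + n - x).
        theta_a = rng.betavariate(alpha + conv_a, beta + n_a - conv_a)
        theta_b = rng.betavariate(alpha + conv_b, beta + n_b - conv_b)
        wins += theta_b > theta_a
        loss += max(0.0, theta_a - theta_b)
    return wins / draws, loss / draws


# Hypothetical data: 500/10,000 conversions in A, 560/10,000 in B.
p_b_beats_a, expected_loss = bayes_ab(500, 10_000, 560, 10_000)

# Decision rule from the text, with an assumed 0.5% loss threshold.
ship = p_b_beats_a >= 0.95 and expected_loss <= 0.005
print(f"P(B>A) = {p_b_beats_a:.3f}, expected loss = {expected_loss:.6f}, ship = {ship}")
```

Because both posteriors are Beta distributions, sampling from them directly is enough; no MCMC is needed for this conjugate model.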
Prior Choice
α=1, β=1 = uniform (no prior belief). Increase α,β to encode historical knowledge. With large samples, the prior has negligible impact.
vs Frequentist
Bayesian gives probability statements ("90% chance B is better"). Frequentist gives yes/no at a fixed error rate. Bayesian decision rules are often described as tolerant of early stopping, but repeatedly checking P(B>A) against a fixed threshold still raises the frequentist false-positive rate, so fix a stopping rule in advance.
Results appear here
P(B beats A)
Posterior Distributions
Define Factors & Levels
Define up to 3 factors (e.g. "Button Color", "Headline", "CTA"). Each factor can have 2–4 levels. The system generates a full factorial design.
+ Add Factor (max 3)
📚 Methodology — Multivariate Testing (MVT)
Method
Full factorial design: every combination of factor levels is tested simultaneously. Main effects tested via chi-squared per factor. Interaction heatmap shows synergistic/antagonistic effects.
When to Use
Testing 2–3 independent changes at once to find the optimal combination faster than sequential A/B tests.
Sample Size Warning
Each cell needs ≥200 visitors. 3 factors × 3 levels each = 3³ = 27 cells → 5,400+ visitors minimum. At the same total traffic, power per cell is much lower than in a simple A/B test.
Interpretation
Main effect significant (p<0.05) → that factor matters. Interaction heatmap: if a cell dramatically outperforms the additive prediction, a synergistic interaction exists and the factor effects are not independent.
Gotchas
For 4+ factors use fractional factorial designs. Bonferroni-adjust α across all cell comparisons. Never pick a winner without checking statistical significance per cell.
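To make the cell count in the Sample Size Warning concrete, the sketch below enumerates a full factorial design and applies the ≥200 visitors-per-cell rule. The function name `full_factorial` and the factor levels are hypothetical examples.

```python
from itertools import product


def full_factorial(factors, min_per_cell=200):
    """Return every combination of factor levels and the minimum total traffic."""
    names = list(factors)
    # One dict per cell, e.g. {"Button Color": "red", "Headline": "H1", ...}.
    cells = [dict(zip(names, combo)) for combo in product(*factors.values())]
    return cells, len(cells) * min_per_cell


# Hypothetical design: 3 factors with 3 levels each.
factors = {
    "Button Color": ["red", "green", "blue"],
    "Headline": ["H1", "H2", "H3"],
    "CTA": ["Buy now", "Try free", "Learn more"],
}
cells, min_visitors = full_factorial(factors)
print(len(cells), "cells,", min_visitors, "visitors minimum")
```

The cell count grows multiplicatively with each added factor, which is why the text recommends fractional factorial designs for 4+ factors.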
Define factors & generate design
Full factorial analysis — main effects, interactions, winning combination
Winning Combination
Main Effects
All Combinations Ranked
Interaction Heatmap (Factor 1 × 2)
Covariate Balance Check
Verify that pre-experiment covariates (age, past revenue, days active…) are balanced between control and treatment. Imbalance = potential confounding bias. Rule: |SMD| < 0.1 = well balanced.
5.0%
Covariate
Ctrl N
Ctrl Mean
Ctrl SD
Trt N
Trt Mean
Trt SD
+ Add Covariate
📚 Methodology — Covariate Balance
The Metric
Standardized Mean Difference (SMD) = (mean_T − mean_C) / √((SD²_C + SD²_T)/2). Scale-free, so it works across covariates with different units.
Thresholds
|SMD| < 0.10 ✓ well balanced · 0.10–0.20 marginal · > 0.20 action needed.
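A short sketch of the SMD formula and the thresholds above. The helper names `smd` and `balance_label` and the example covariate values are hypothetical; a value of exactly 0.20 is treated as marginal here, which is an assumption since the thresholds do not say which side the boundary falls on.

```python
from math import sqrt


def smd(mean_t, sd_t, mean_c, sd_c):
    """Standardized mean difference between treatment and control."""
    return (mean_t - mean_c) / sqrt((sd_c ** 2 + sd_t ** 2) / 2)


def balance_label(value):
    """Map an SMD to the three bands used in the thresholds."""
    size = abs(value)
    if size < 0.10:
        return "well balanced"
    if size <= 0.20:
        return "marginal"
    return "action needed"


# Hypothetical covariate "age": control mean 34.2 (SD 10.1), treatment mean 35.0 (SD 9.8).
age_smd = smd(35.0, 9.8, 34.2, 10.1)
print(f"SMD = {age_smd:.3f} -> {balance_label(age_smd)}")
```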
Why SMD over p-value
p-values are sensitive to sample size: large N makes trivial imbalances "significant"; small N misses real ones. SMD reflects practical imbalance regardless of N.
Remedies for Imbalance
Re-randomize with stratification · CUPED (regress out a pre-experiment covariate) reduces variance by ρ², the squared correlation between covariate and metric, e.g. about 50% when ρ ≈ 0.7 · Regression adjustment includes the covariate in the analysis model.
Love Plot
Standard visualization in clinical trials. Each bar = SMD for one covariate. Yellow dashed lines mark the ±0.10 threshold.
Interpretation Guide
• |SMD| < 0.10 — Well balanced ✓
• |SMD| 0.10–0.20 — Marginal — consider CUPED
• |SMD| > 0.20 — Imbalanced — stratify or re-randomize
• p < α — Statistically significant imbalance
• SMD is independent of sample size — preferred over p-value alone
Covariate balance results appear here
Add covariates and run the check
Balance Verdict
Love Plot — Standardized Mean Differences
Per-Covariate Results
Parameters
10%
10%
5%
80%
Notes
📚 Methodology — Sample Size
Formula (Two-Proportion)
n = (z_α/2 + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂−p₁)²
where n is the sample size per variant and p₂ = p₁ × (1 + MDE%). Bonferroni: α_adj = α/k for k variants.
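The formula translates directly into code. The sketch below is an illustration only: the function name `sample_size_per_variant` and its parameter names are hypothetical, and the result is rounded up to a whole visitor.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05,
                            power=0.80, n_variants=1):
    """Two-proportion sample size per variant, with Bonferroni-adjusted alpha."""
    alpha_adj = alpha / n_variants                 # alpha_adj = alpha / k
    z_alpha = NormalDist().inv_cdf(1 - alpha_adj / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # target rate at the MDE
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 5% baseline, 10% relative MDE, alpha = 5%.
n_80 = sample_size_per_variant(0.05, 0.10, power=0.80)
n_90 = sample_size_per_variant(0.05, 0.10, power=0.90)
print(n_80, n_90, f"extra traffic for 90% power: {n_90 / n_80 - 1:.0%}")
```

Comparing the two calls gives a quick check on how much extra traffic higher power costs for a given baseline and MDE.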
MDE Explained
Minimum Detectable Effect: the smallest relative change worth detecting. 10% relative MDE on a 5% baseline = 5.5% target (0.5 percentage points absolute). Smaller MDE → larger n.
Power
80% power = 20% chance of missing a real effect (Type II error). 90% power costs ~35% more traffic. Standard: 80%.
Runtime Rules
Run ≥ 2 full business cycles (weeks) to capture Mon–Sun seasonality. Never extend based on intermediate results, because this inflates false positives (peeking problem).
Gotchas
Sample size is a minimum, not a target · Traffic spikes invalidate estimates · Continuous metrics need mean + SD (different formula) · Ratio metrics need the delta method.
Required Sample Size
MDE Sensitivity
Runtime vs Traffic
Segment Data Input
Enter conversion data by segment (Mobile/Desktop, US/EU…). Detects Simpson's Paradox and HTE.
5.0%
Segment
Ctrl N
Ctrl Conv
Var N
Var Conv
+ Add Segment
📚 Methodology — Segments, HTE & Simpson's Paradox
Simpson's Paradox
Aggregate trend reverses in subgroups. Happens when subgroup sizes are unbalanced across arms AND subgroups have different baseline rates. Always report segment-level results alongside aggregate.
HTE (Heterogeneous Treatment Effects)
Cochran's Q test measures effect heterogeneity across segments.
Q = Σ wᵢ(liftᵢ − lift̄)², df = k−1
I² = (Q−df)/Q × 100%: the proportion of variance due to true heterogeneity (set to 0 when Q < df).
I² Benchmarks
<25% Low · 25–50% Moderate · >50% Substantial → personalized rollout warranted.
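The Q and I² formulas can be sketched as below. Two assumptions are made that the text does not state: the weights wᵢ are taken as inverse variances (1/SEᵢ²), which is the conventional choice for Cochran's Q, and lift̄ is the weighted mean of the segment lifts. The function name `cochran_q` and the example numbers are hypothetical.

```python
def cochran_q(lifts, std_errors):
    """Return (Q, df, I² in percent) for per-segment lifts and their standard errors."""
    # Assumption: inverse-variance weights, w_i = 1 / SE_i^2.
    weights = [1 / se ** 2 for se in std_errors]
    mean_lift = sum(w * l for w, l in zip(weights, lifts)) / sum(weights)
    q = sum(w * (l - mean_lift) ** 2 for w, l in zip(weights, lifts))
    df = len(lifts) - 1
    # I² is floored at 0 when Q is smaller than its degrees of freedom.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2


# Hypothetical absolute lifts (variant rate minus control rate) in three segments.
q, df, i2 = cochran_q([0.010, 0.002, -0.004], [0.003, 0.003, 0.004])
print(f"Q = {q:.2f}, df = {df}, I² = {i2:.0f}%")
```

In this made-up example I² is well above 50%, which falls in the "Substantial" band of the benchmarks.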
Pre-registration Rule
Define segment cuts before the experiment; post-hoc segment fishing inflates false discoveries. Apply Bonferroni to segment-level α.
Action on HTE
Ship only to segments with positive, significant effects. Investigate mechanism: is the effect driven by a confounder or a genuine subgroup difference?
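The Simpson's Paradox condition described in this panel (unbalanced subgroup sizes across arms plus different baseline rates) is easy to reproduce with made-up numbers. The sketch below is one simple way to flag a reversal, not necessarily how the tool detects it; the function name `detect_simpson` and the segment data are hypothetical.

```python
def detect_simpson(segments):
    """segments maps a name to (ctrl_n, ctrl_conv, var_n, var_conv).

    Returns True when every segment moves one way and the aggregate moves the other.
    """
    seg_diffs = [vc / vn - cc / cn for cn, cc, vn, vc in segments.values()]
    ctrl_n = sum(s[0] for s in segments.values())
    ctrl_conv = sum(s[1] for s in segments.values())
    var_n = sum(s[2] for s in segments.values())
    var_conv = sum(s[3] for s in segments.values())
    agg_diff = var_conv / var_n - ctrl_conv / ctrl_n
    all_up = all(d > 0 for d in seg_diffs)
    all_down = all(d < 0 for d in seg_diffs)
    return (all_up and agg_diff < 0) or (all_down and agg_diff > 0)


# Variant wins in both segments (11% vs 10%, 2.5% vs 2%) but gets mostly
# low-baseline mobile traffic, so it loses in aggregate (4.2% vs 8.4%).
unbalanced = {
    "Desktop": (8000, 800, 2000, 220),
    "Mobile": (2000, 40, 8000, 200),
}
balanced = {
    "Desktop": (5000, 500, 5000, 550),
    "Mobile": (5000, 100, 5000, 125),
}
print(detect_simpson(unbalanced), detect_simpson(balanced))
```

An arm-by-segment imbalance this large would normally also trip the SRM check, which is one reason to run both.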
Segment results appear here
Aggregate vs Stratified
Simpson's Paradox Detection
Heterogeneous Treatment Effects (HTE)
Per-Segment Results
Sample Ratio Mismatch (SRM)
SRM occurs when actual traffic differs from the intended split, which signals a randomization or logging bug.
OBSERVED VISITORS
+ Add Variant
📚 Methodology — SRM Check
The Test
Chi-squared goodness-of-fit comparing observed visitor counts to expected counts under the intended split.
χ² = Σ(O−E)²/E, df = k−1.
Why It Matters
SRM invalidates the experiment: assignment is non-random. Any measured effect could be an artifact of the biased sample, not a real treatment effect.
Common Causes
Caching layers · Redirect chains losing assignment · Bot filtering applied unevenly · Logging delays · Hash-based bucketing collisions · Cookie deletion.
Threshold
p < 0.01 flags SRM (stricter than the primary test). Some teams use p < 0.001 for large experiments.
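A standard-library sketch of the SRM check, using the goodness-of-fit statistic and the p < 0.01 threshold. The names `chi2_sf` and `srm_check` are hypothetical; `chi2_sf` uses the closed-form chi-squared tail for integer degrees of freedom so that no external library is needed, and the visitor counts are made up.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term = math.exp(-x / 2)
        total = term
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return total
    # Odd df: normal tail plus a finite correction series.
    root = math.sqrt(x)
    total = math.erfc(root / math.sqrt(2))
    term = math.sqrt(2 / math.pi) * math.exp(-x / 2) * root
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return total


def srm_check(observed, intended_split, threshold=0.01):
    """Return (chi2, df, p-value, flagged) for observed visitor counts."""
    total = sum(observed)
    expected = [total * share for share in intended_split]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    p_value = chi2_sf(chi2, df)
    return chi2, df, p_value, p_value < threshold


# Hypothetical 50/50 test that recorded 10,000 vs 10,500 visitors.
chi2, df, p_value, flagged = srm_check([10_000, 10_500], [0.5, 0.5])
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p_value:.5f}, SRM flagged = {flagged}")
```

A 2.4% gap between arms looks small, but at this traffic level it is far outside what random assignment would produce.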
Action on SRM
Stop the experiment. Do not report results. Fix the pipeline. Re-run from scratch. Never attempt to "correct" SRM results statistically.
Suspicious Balance
A perfectly balanced split (p > 0.99) can also be suspicious, since true randomization has natural variance. Focus on the magnitude of deviation, not just the p-value.
Enter traffic data above
SRM Result
—
χ² Statistic
—
p-value
—
Degrees of Freedom
Expected vs Observed
Test Selector
5.0%
Groups (≥ 3 recommended)
Group
n
Mean
SD
+ Add Group
Chi-Square
Category
Observed
Expected
+ Add Category
2 rows (Group A / B). Add columns for each category.
Group
Group A
Group B
+ Add Column
T-Test
GROUP 1
GROUP 2
Z-Test
📚 Methodology — Statistical Tests Guide
ANOVA (One-Way)
Tests whether ≥3 group means differ.
F = MS_between / MS_within. A significant F only tells you that some group differs; use Tukey HSD post-hoc to identify which pairs. Effect size: η² (small=0.01, medium=0.06, large=0.14). Assumes normality + homogeneity of variance.
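Since the ANOVA input form takes n, mean, and SD per group, F and η² can be computed from those summary statistics alone. The sketch below assumes SD is the sample standard deviation (n−1 denominator); the function name `anova_from_summary` and the group values are hypothetical, and the p-value is omitted because the F distribution is not in the Python standard library.

```python
def anova_from_summary(groups):
    """groups: list of (n, mean, sd). Return (F, df_between, df_within, eta_squared)."""
    k = len(groups)
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Between-group and within-group sums of squares.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between, df_within = k - 1, total_n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


# Hypothetical groups: 30 observations each, SD 2.0, means 10, 11, 12.
f_stat, df_b, df_w, eta_sq = anova_from_summary(
    [(30, 10.0, 2.0), (30, 11.0, 2.0), (30, 12.0, 2.0)]
)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}, eta² = {eta_sq:.3f}")
```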
Chi-Square
GoF: tests whether observed frequencies match expected. Contingency: tests independence between two categorical variables. Effect size: Cramér's V (small=0.1, medium=0.3, large=0.5). Requires expected cell counts ≥5.
T-Test Variants
One-sample: tests if a mean equals a hypothesized value μ₀.
Two-sample pooled: assumes equal variances (use when Levene's test passes).
Welch's: does not assume equal variances — preferred default for two groups.
Paired: for matched/repeated measures (before/after). Effect size: Cohen's d (small=0.2, medium=0.5, large=0.8).
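For the Welch variant, the statistic, its Welch–Satterthwaite degrees of freedom, and Cohen's d can be sketched from summary statistics. The function name `welch_t` and the group values are hypothetical; Cohen's d is computed here with the pooled SD, which is one common convention, and the p-value is left out because the t distribution is not in the Python standard library.

```python
from math import sqrt


def welch_t(n1, mean1, sd1, n2, mean2, sd2):
    """Return (t, Welch-Satterthwaite df, Cohen's d) for group 1 minus group 2."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t_stat = (mean1 - mean2) / sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    # Cohen's d with the pooled standard deviation (an assumed convention).
    pooled_sd = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    cohen_d = (mean1 - mean2) / pooled_sd
    return t_stat, df, cohen_d


# Hypothetical groups: n=50, mean 10.5, SD 2.0 vs n=60, mean 9.8, SD 2.5.
t_stat, df, cohen_d = welch_t(50, 10.5, 2.0, 60, 9.8, 2.5)
print(f"t = {t_stat:.3f}, df = {df:.1f}, d = {cohen_d:.2f}")
```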
Z-Test Variants
One-proportion: tests p̂ against null p₀. Valid when np₀≥5 and n(1−p₀)≥5.
Two-proportion: same as AB Analyzer but standalone. Uses pooled SE under H₀.
One-mean (σ known): rare in practice — use t-test if σ is estimated from data.
Distribution Charts
Yellow dashed line = observed statistic. Red shaded area = rejection region at chosen α. If the yellow line falls inside red, p < α.
Select a test type & enter data
ANOVA · Chi-Square · T-Test · Z-Test
Result
Statistics
Distribution Chart
Secondary Chart