← divga.com ⚗ Statistical Tools Gagan Agrawal

⚗ AB Testing Suite

Full-Stack Experimentation Platform
📚 Methodology — Frequentist AB Test

The Test

Two-proportion z-test for binary metrics; Welch's t-test for continuous means. Tests H₀: no difference between control and variant.

Formula

Proportion: z = (p̂_B − p̂_A) / √(p̂(1−p̂)(1/n_A + 1/n_B)), where p̂ = (x_A + x_B)/(n_A + n_B) is the pooled rate
Mean: t = (x̄_B − x̄_A) / √(s²_A/n_A + s²_B/n_B)
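
As a rough illustration of the proportion formula above, the following Python sketch computes the pooled z statistic and a two-sided p-value with the standard library only. The function name `two_proportion_z` and the example counts are hypothetical and are not taken from the tool.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)            # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided tail area
    return z, p_value

# Hypothetical counts: 500/10,000 conversions in A, 560/10,000 in B.
z, p = two_proportion_z(500, 10_000, 560, 10_000)
print(f"z = {z:.3f}, p = {p:.4f}")
```

With these made-up counts the lift narrowly misses significance at α = 5%, which is the kind of borderline case where the sample-size check matters.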

Assumptions

Independent observations · Random assignment · ≥5 expected events per cell · Sufficient sample size (check Sample Size tab).

Interpretation

p < α → reject H₀. Relative lift = (rate_B − rate_A)/rate_A. CI excludes 0 → significant. Always check practical significance (is the lift worth shipping?).
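
The interpretation rules can be sketched in a few lines of Python. This is an assumption-laden example, not the tool's code: it uses a Wald interval with an unpooled standard error for the absolute difference, and the name `lift_and_ci` and the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def lift_and_ci(x_a, n_a, x_b, n_b, alpha=0.05):
    """Relative lift plus a Wald CI for the absolute difference rate_B - rate_A."""
    rate_a, rate_b = x_a / n_a, x_b / n_b
    rel_lift = (rate_b - rate_a) / rate_a
    # Unpooled standard error, a common choice for the interval (assumption).
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    diff = rate_b - rate_a
    return rel_lift, (diff - z_crit * se, diff + z_crit * se)

# Hypothetical counts: 5.0% vs 6.0% conversion on 10,000 visitors each.
rel_lift, (lo, hi) = lift_and_ci(500, 10_000, 600, 10_000)
significant = lo > 0 or hi < 0   # CI excludes 0
print(f"lift = {rel_lift:.1%}, CI = ({lo:.4f}, {hi:.4f}), significant = {significant}")
```

Whether a significant lift is worth shipping is still a separate, practical question.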

Bias Risks

Peeking inflates Type I error · Novelty effect inflates early lifts · Simpson's Paradox can reverse aggregate results · Multiple comparisons → Bonferroni applied automatically.
Prior (Beta Distribution)
Model: Beta-Binomial. α=1, β=1 gives the uniform prior.
📚 Methodology — Bayesian AB Test

The Model

Beta-Binomial conjugate model. Prior Beta(α,β) updated with data → Posterior Beta(α+x, β+n−x). Monte Carlo (30k samples) estimates P(B>A).

Decision Metrics

P(B beats A) — probability variant is better. Expected Loss = E[max(0, θ_A − θ_B)] — average cost of choosing B if wrong.

Decision Rule

Ship when P(B>A) ≥ 95% AND Expected Loss ≤ threshold. Threshold = the largest drop in conversion rate you are willing to risk.
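
A minimal sketch of the model and decision rule, assuming a uniform Beta(1,1) prior and 30,000 Monte Carlo draws as described above. The function name `bayes_ab`, the counts, and the 0.5% loss threshold are illustrative assumptions.

```python
import random

def bayes_ab(x_a, n_a, x_b, n_b, alpha=1.0, beta=1.0, draws=30_000, seed=0):
    """Monte Carlo estimate of P(B > A) and the expected loss of shipping B."""
    rng = random.Random(seed)
    wins, loss = 0, 0.0
    for _ in range(draws):
        # Posterior for each arm is Beta(alpha + x, beta + n - x).
        theta_a = rng.betavariate(alpha + x_a, beta + n_a - x_a)
        theta_b = rng.betavariate(alpha + x_b, beta + n_b - x_b)
        wins += theta_b > theta_a
        loss += max(0.0, theta_a - theta_b)
    return wins / draws, loss / draws

# Hypothetical counts: 500/10,000 in A, 600/10,000 in B.
p_b_wins, exp_loss = bayes_ab(500, 10_000, 600, 10_000)
ship = p_b_wins >= 0.95 and exp_loss <= 0.005   # 0.5% threshold is an assumption
print(f"P(B>A) = {p_b_wins:.3f}, expected loss = {exp_loss:.5f}, ship = {ship}")
```

Fixing the seed makes the estimate reproducible; with 30k draws the Monte Carlo error on P(B>A) is well under one percentage point.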

Prior Choice

α=1, β=1 = uniform (no prior belief). Increase α,β to encode historical knowledge. With large samples, the prior has negligible impact.

vs Frequentist

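Bayesian gives probability statements ("90% chance B is better"). Frequentist gives yes/no at a fixed error rate. Bayesian posteriors stay interpretable at any stopping point, but stopping the moment P(B>A) crosses the threshold still raises the false-positive rate, so fix the decision rule in advance.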
Define Factors & Levels
Define up to 3 factors (e.g. "Button Color", "Headline", "CTA"). Each factor can have 2–4 levels. The system generates a full factorial design.
📚 Methodology — Multivariate Testing (MVT)

Method

Full factorial design — every combination of factor levels is tested simultaneously. Main effects tested via chi-squared per factor. Interaction heatmap shows synergistic/antagonistic effects.

When to Use

Testing 2–3 independent changes at once to find the optimal combination faster than sequential A/B tests.

Sample Size Warning

Each cell needs ≥200 visitors. 3 factors with 3 levels each = 3³ = 27 cells → 5,400+ visitors minimum. Power per cell is much lower than a simple A/B test.
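
The cell count above can be checked by enumerating a full factorial design. This is a small sketch with made-up factor names and levels, not the tool's generator.

```python
from itertools import product

# Hypothetical factors and levels (3 factors, 3 levels each).
factors = {
    "Button Color": ["blue", "green", "red"],
    "Headline": ["short", "long", "question"],
    "CTA": ["Buy", "Try", "Learn"],
}

# Every combination of levels is one cell of the design.
cells = [dict(zip(factors, combo)) for combo in product(*factors.values())]
min_visitors = len(cells) * 200   # 200 visitors per cell, as stated above

print(len(cells), "cells,", min_visitors, "visitors minimum")
print(cells[0])
```

Adding a fourth 3-level factor would triple the cell count, which is why fractional designs become attractive at that point.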

Interpretation

Main effect significant (p<0.05) → that factor matters. Interaction heatmap: if a cell dramatically outperforms additive prediction, a synergistic interaction exists — factor effects are not independent.

Gotchas

For 4+ factors use fractional factorial designs. Bonferroni-adjust α across all cell comparisons. Never pick a winner without checking statistical significance per cell.
Covariate Balance Check
Verify that pre-experiment covariates (age, past revenue, days active…) are balanced between control and treatment. Imbalance = potential confounding bias. Rule: |SMD| < 0.1 = well balanced.
Inputs per covariate: Ctrl N, Ctrl Mean, Ctrl SD, Trt N, Trt Mean, Trt SD
📚 Methodology — Covariate Balance

The Metric

Standardized Mean Difference (SMD) = (mean_T − mean_C) / √((SD²_C + SD²_T)/2). Scale-free — works across covariates with different units.

Thresholds

|SMD| < 0.10 ✓ well balanced · 0.10–0.20 marginal · > 0.20 action needed.
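
The metric and thresholds above translate directly into code. The sketch below assumes summary statistics per arm; the helper names `smd` and `balance_label` and the age figures are hypothetical.

```python
from math import sqrt

def smd(mean_c, sd_c, mean_t, sd_t):
    """Standardized mean difference between treatment and control."""
    return (mean_t - mean_c) / sqrt((sd_c ** 2 + sd_t ** 2) / 2)

def balance_label(value):
    """Map |SMD| onto the thresholds listed above."""
    size = abs(value)
    if size < 0.10:
        return "well balanced"
    if size <= 0.20:
        return "marginal"
    return "action needed"

# Hypothetical covariate: mean age 34.0 (control) vs 35.5 (treatment), SD 10 in both.
age_smd = smd(34.0, 10.0, 35.5, 10.0)   # about 0.15
print(f"SMD = {age_smd:.2f} -> {balance_label(age_smd)}")
```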

Why SMD over p-value

p-values are sensitive to sample size — large N makes trivial imbalances "significant"; small N misses real ones. SMD reflects practical imbalance regardless of N.

Remedies for Imbalance

Re-randomize with stratification · CUPED (regress out pre-experiment covariate) reduces variance by ρ², the squared correlation between the covariate and the metric (about 50% when ρ ≈ 0.7) · Regression adjustment includes covariate in analysis model.
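
To make the CUPED remedy concrete, here is a minimal sketch on synthetic data. It assumes the standard adjustment Y_adj = Y − θ(X − mean(X)) with θ = Cov(X, Y)/Var(X); the function name `cuped_adjust` and the generated data are illustrative only. In a real experiment θ would normally be estimated on both arms pooled.

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """Return y adjusted by the pre-experiment covariate x (CUPED)."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]

rng = random.Random(1)
pre = [rng.gauss(100, 20) for _ in range(2_000)]        # synthetic pre-period metric
post = [0.8 * value + rng.gauss(0, 15) for value in pre]  # synthetic in-experiment metric
adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)
print(f"variance reduction = {reduction:.1%}")
```

The adjustment leaves the mean unchanged and only removes the part of the variance that the pre-experiment covariate explains.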

Love Plot

Standard visualization in clinical trials. Each bar = SMD for one covariate. Yellow dashed lines mark ±0.10 threshold.
Interpretation Guide
|SMD| < 0.10 — Well balanced ✓
|SMD| 0.10–0.20 — Marginal — consider CUPED
|SMD| > 0.20 — Imbalanced — stratify or re-randomize
p < α — Statistically significant imbalance
• SMD is independent of sample size — preferred over p-value alone
📚 Methodology — Sample Size

Formula (Two-Proportion)

n = (z_α/2 + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂−p₁)² visitors per arm,
where p₂ = p₁ × (1 + MDE) with MDE as a relative fraction. Bonferroni: α_adj = α/k for k variants.

MDE Explained

Minimum Detectable Effect — smallest relative change worth detecting. 10% relative MDE on 5% baseline = 5.5% target (0.5% absolute). Smaller MDE → larger n.
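
The two-proportion formula can be sketched as follows, assuming a two-sided test and a per-arm result. The function name `sample_size_per_arm` and its parameters are hypothetical; the example reuses the 5% baseline and 10% relative MDE from the text.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p1, rel_mde, alpha=0.05, power=0.80, variants=1):
    """Visitors needed per arm to detect a relative lift of rel_mde."""
    alpha_adj = alpha / variants                       # Bonferroni for k variants
    z_alpha = NormalDist().inv_cdf(1 - alpha_adj / 2)
    z_beta = NormalDist().inv_cdf(power)
    p2 = p1 * (1 + rel_mde)
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return ceil(n)

n = sample_size_per_arm(0.05, 0.10)   # 5% baseline, 10% relative MDE
print(n, "visitors per arm")
print(sample_size_per_arm(0.05, 0.05), "visitors per arm at 5% relative MDE")
```

Halving the MDE roughly quadruples the requirement, because the absolute difference enters the denominator squared.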

Power

80% power = 20% chance of missing a real effect (Type II error). 90% power costs ~35% more traffic. Standard: 80%.
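
The traffic cost of higher power follows from the (z_α/2 + z_β)² factor alone, since everything else in the sample size formula stays fixed. A quick check, assuming α = 5% two-sided (the helper name `z_factor` is hypothetical):

```python
from statistics import NormalDist

def z_factor(alpha, power):
    """The (z_alpha/2 + z_beta)^2 multiplier from the sample size formula."""
    nd = NormalDist()
    return (nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)) ** 2

extra = z_factor(0.05, 0.90) / z_factor(0.05, 0.80) - 1
print(f"90% power needs about {extra:.0%} more traffic than 80%")
```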

Runtime Rules

Run ≥ 2 full business cycles (weeks) to capture Mon–Sun seasonality. Never extend based on intermediate results — this inflates false positives (peeking problem).

Gotchas

Sample size is a minimum not a target · Traffic spikes invalidate estimates · Continuous metrics need mean + SD (different formula) · Ratio metrics need delta method.
Charts: Required Sample Size · MDE Sensitivity · Runtime vs Traffic
Segment Data Input
Enter conversion data by segment (Mobile/Desktop, US/EU…). Detects Simpson's Paradox and HTE.
Inputs per segment: Ctrl N, Ctrl Conv, Var N, Var Conv
📚 Methodology — Segments, HTE & Simpson's Paradox

Simpson's Paradox

Aggregate trend reverses in subgroups. Happens when subgroup sizes are unbalanced across arms AND subgroups have different baseline rates. Always report segment-level results alongside aggregate.
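
A small sketch of how such a reversal can be detected. The segment names and counts are illustrative numbers chosen to produce the paradox, not real experiment data, and the detection rule (all segments agree with each other but disagree with the aggregate) is one simple assumption about what counts as a reversal.

```python
# (segment, ctrl_n, ctrl_conv, var_n, var_conv): illustrative numbers only
segments = [
    ("Mobile", 270, 234, 87, 81),
    ("Desktop", 80, 55, 263, 192),
]

def rate(conv, n):
    return conv / n

# Does the variant beat control inside each segment?
segment_wins = [rate(vc, vn) > rate(cc, cn) for _, cn, cc, vn, vc in segments]

# Does the variant beat control on the pooled totals?
ctrl_total = rate(sum(s[2] for s in segments), sum(s[1] for s in segments))
var_total = rate(sum(s[4] for s in segments), sum(s[3] for s in segments))
aggregate_win = var_total > ctrl_total

# Reversal: every segment points one way, the aggregate points the other.
paradox = len(set(segment_wins)) == 1 and segment_wins[0] != aggregate_win
print(segment_wins, aggregate_win, paradox)
```

Here the variant wins in both segments yet loses overall, because most control traffic sits in the high-converting segment while most variant traffic sits in the low-converting one.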

HTE — Heterogeneous Treatment Effects

Cochran's Q test measures effect heterogeneity across segments.
Q = Σ wᵢ(liftᵢ − lift̄)², df = k−1, where wᵢ = 1/Var(liftᵢ) and lift̄ is the weighted mean lift
I² = (Q−df)/Q × 100% (floored at 0): proportion of variance due to true heterogeneity.
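
A hedged sketch of Q and I² for segment-level conversion data. It assumes the lift in each segment is the absolute difference in conversion rate and that the weights are inverse variances, which is the usual meta-analysis convention; the tool may define lift differently. The function name `heterogeneity` and the counts are hypothetical.

```python
def heterogeneity(segments):
    """segments: list of (ctrl_n, ctrl_conv, var_n, var_conv). Returns (Q, df, I2 in %)."""
    lifts, weights = [], []
    for cn, cc, vn, vc in segments:
        pc, pv = cc / cn, vc / vn
        var = pc * (1 - pc) / cn + pv * (1 - pv) / vn   # variance of the absolute lift
        lifts.append(pv - pc)
        weights.append(1 / var)                          # inverse-variance weight
    pooled = sum(w * d for w, d in zip(weights, lifts)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, lifts))
    df = len(segments) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2

# Hypothetical segments: a 2-point lift in one, almost none in the other.
q, df, i2 = heterogeneity([(5000, 250, 5000, 350), (5000, 250, 5000, 255)])
print(f"Q = {q:.2f}, df = {df}, I2 = {i2:.0f}%")
```

Q is compared against a chi-squared distribution with df degrees of freedom; I² is floored at zero because Q can fall below df by chance.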

I² Benchmarks

<25% Low · 25–50% Moderate · >50% Substantial → personalized rollout warranted.

Pre-registration Rule

Define segment cuts before the experiment — post-hoc segment fishing inflates false discoveries. Apply Bonferroni to segment-level α.

Action on HTE

Ship only to segments with positive, significant effects. Investigate mechanism — is the effect driven by a confounder or a genuine subgroup difference?
Sample Ratio Mismatch (SRM)
SRM occurs when actual traffic differs from the intended split — signals a randomization or logging bug.
📚 Methodology — SRM Check

The Test

Chi-squared goodness-of-fit comparing observed visitor counts to expected counts under the intended split. χ² = Σ(O−E)²/E, df = k−1.
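
The goodness-of-fit test can be sketched with the standard library alone. The chi-squared tail probability below uses the standard recurrence in the degrees of freedom (starting from the closed forms for df = 1 and df = 2); the function names `chi2_sf` and `srm_check`, the visitor counts, and the default 0.01 threshold are assumptions for illustration.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    if df % 2:                        # odd df: start from the df = 1 closed form
        q, nu = erfc(sqrt(x / 2)), 1
    else:                             # even df: start from the df = 2 closed form
        q, nu = exp(-x / 2), 2
    while nu < df:                    # step up two degrees of freedom at a time
        q += (x / 2) ** (nu / 2) * exp(-x / 2) / gamma(nu / 2 + 1)
        nu += 2
    return q

def srm_check(observed, intended_split, threshold=0.01):
    """Return (chi2, p, flagged) for observed counts vs the intended split."""
    total = sum(observed)
    expected = [total * share for share in intended_split]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p = chi2_sf(chi2, len(observed) - 1)
    return chi2, p, p < threshold

# Hypothetical 50/50 test that logged 5,000 vs 5,400 visitors.
chi2, p, flagged = srm_check([5000, 5400], [0.5, 0.5])
print(f"chi2 = {chi2:.2f}, p = {p:.2g}, SRM flagged = {flagged}")
```

A split of 5,000 vs 5,200 on the same design would not be flagged at this threshold, which shows how sensitive the check is to a few hundred visitors.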

Why It Matters

SRM invalidates the experiment — assignment is non-random. Any measured effect could be an artifact of the biased sample, not a real treatment effect.

Common Causes

Caching layers · Redirect chains losing assignment · Bot filtering applied unevenly · Logging delays · Hash-based bucketing collisions · Cookie deletion.

Threshold

p < 0.01 flags SRM (stricter than the primary test). Some teams use p < 0.001 for large experiments.

Action on SRM

Stop the experiment. Do not report results. Fix the pipeline. Re-run from scratch. Never attempt to "correct" SRM results statistically.

Suspicious Balance

A perfectly balanced split (p > 0.99) can also be suspicious — true randomization has natural variance. Focus on the magnitude of deviation, not just the p-value.
Test Selector: ANOVA · Chi-Square · T-Test · Z-Test
ANOVA inputs per group (≥ 3 groups recommended): n, Mean, SD
Chi-Square inputs per category: Observed, Expected
T-Test inputs: Group 1, Group 2
📚 Methodology — Statistical Tests Guide

ANOVA (One-Way)

Tests whether ≥3 group means differ. F = MS_between / MS_within. Significant F only tells you some group differs — use Tukey HSD post-hoc to identify which pairs. Effect size: η² (small=0.01, medium=0.06, large=0.14). Assumes normality + homogeneity of variance.
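
Because the tool takes n, mean, and SD per group, the F statistic and η² can be computed from summary statistics alone. The sketch below assumes that input shape; the function name `anova_from_summary` and the three example groups are hypothetical, and the p-value (which needs the F distribution) is left out.

```python
def anova_from_summary(groups):
    """groups: list of (n, mean, sd). Returns (F, df_between, df_within, eta_squared)."""
    k = len(groups)
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_b, df_w = k - 1, n_total - k
    f_stat = (ss_between / df_b) / (ss_within / df_w)   # MS_between / MS_within
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, df_b, df_w, eta_sq

# Hypothetical groups: 30 observations each, means 10, 11, 12, SD 2.
f_stat, df_b, df_w, eta2 = anova_from_summary([(30, 10.0, 2.0), (30, 11.0, 2.0), (30, 12.0, 2.0)])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}, eta squared = {eta2:.3f}")
```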

Chi-Square

GoF: tests whether observed frequencies match expected. Contingency: tests independence between two categorical variables. Effect size: Cramér's V (small=0.1, medium=0.3, large=0.5). Requires expected cell counts ≥5.

T-Test Variants

One-sample: tests if a mean equals a hypothesized value μ₀.
Two-sample pooled: assumes equal variances (use when Levene's test passes).
Welch's: does not assume equal variances — preferred default for two groups.
Paired: for matched/repeated measures (before/after). Effect size: Cohen's d (small=0.2, medium=0.5, large=0.8).
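
For the two-group case, Welch's statistic and Cohen's d can be sketched from summary statistics. This assumes the Welch-Satterthwaite approximation for the degrees of freedom and a pooled SD for Cohen's d; the function names and numbers are hypothetical, and the p-value (which needs the t distribution) is omitted.

```python
from math import sqrt

def welch_t(n1, m1, s1, n2, m2, s2):
    """Welch's t statistic (group 2 minus group 1) and its approximate df."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
    t = (m2 - m1) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def cohens_d(n1, m1, s1, n2, m2, s2):
    """Cohen's d using the pooled standard deviation."""
    pooled_sd = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (m2 - m1) / pooled_sd

# Hypothetical groups: n = 50 each, means 20 vs 22, SD 4 in both.
t, df = welch_t(50, 20.0, 4.0, 50, 22.0, 4.0)
d = cohens_d(50, 20.0, 4.0, 50, 22.0, 4.0)
print(f"t = {t:.2f}, df = {df:.1f}, d = {d:.2f}")
```

With equal sizes and equal SDs, Welch's df matches the pooled test's n1 + n2 − 2, so little is lost by using Welch as the default.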

Z-Test Variants

One-proportion: tests p̂ against null p₀. Valid when np₀≥5 and n(1−p₀)≥5.
Two-proportion: same as AB Analyzer but standalone. Uses pooled SE under H₀.
One-mean (σ known): rare in practice — use t-test if σ is estimated from data.

Distribution Charts

Yellow dashed line = observed statistic. Red shaded area = rejection region at chosen α. If the yellow line falls inside red, p < α.