Statistical Tools | Gagan Agrawal

Test Configuration

Metric Type Significance Level α

5.0%

Statistical Power 1−β

80%

Test Type

Control (A)

Visitors

Conversions

Variants

+ Add Variant

📚 Methodology — Frequentist AB Test

The Test

Two-proportion z-test for binary metrics; Welch's t-test for continuous means. Tests H₀: no difference between control and variant.

Formula

Proportion: z = (p̂_B − p̂_A) / √(p̂(1−p̂)(1/n_A + 1/n_B)), where p̂ = (x_A + x_B)/(n_A + n_B) is the pooled conversion rate under H₀
Mean: t = (x̄_B − x̄_A) / √(s²_A/n_A + s²_B/n_B)
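A minimal sketch of both formulas, using only the Python standard library. The function names and the example counts are hypothetical. The Welch helper returns the statistic and the Welch-Satterthwaite degrees of freedom only, because the t-distribution tail is not available in the standard library.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z(x_a, n_a, x_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)  # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a, var_b = sd_a**2 / n_a, sd_b**2 / n_b
    t = (mean_b - mean_a) / sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a**2 / (n_a - 1) + var_b**2 / (n_b - 1))
    return t, df


# Hypothetical data: 500/10,000 conversions in control, 560/10,000 in variant.
z, p = two_proportion_z(x_a=500, n_a=10_000, x_b=560, n_b=10_000)
relative_lift = (560 / 10_000 - 500 / 10_000) / (500 / 10_000)
print(f"z = {z:.3f}, p = {p:.4f}, relative lift = {relative_lift:.1%}")
```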

Assumptions

Independent observations · Random assignment · ≥5 expected events per cell · Sufficient sample size (check Sample Size tab).

Interpretation

p < α → reject H₀. Relative lift = (rate_B − rate_A)/rate_A. A confidence interval for the difference that excludes 0 → significant at level α. Always check practical significance (is the lift worth shipping?).

Bias Risks

Peeking inflates Type I error · Novelty effect inflates early lifts · Simpson's Paradox can reverse aggregate results · Multiple comparisons → Bonferroni applied automatically.


Prior (Beta Distribution)

Model: Beta-Binomial. α=1, β=1 gives a uniform prior.

Prior α

Prior β

Risk Threshold

0.50%

Control (A)

Visitors

Conversions

Variant (B)

Visitors

Conversions

📚 Methodology — Bayesian AB Test

The Model

Beta-Binomial conjugate model. Prior Beta(α,β) updated with data → Posterior Beta(α+x, β+n−x). Monte Carlo (30k samples) estimates P(B>A).

Decision Metrics

P(B beats A) — probability variant is better. Expected Loss = E[max(0, θ_A − θ_B)] — average cost of choosing B if wrong.

Decision Rule

Ship when P(B>A) ≥ 95% AND Expected Loss ≤ threshold. The threshold is the largest drop in conversion rate you are willing to accept if B turns out to be worse.
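The model, decision metrics and decision rule can be sketched as follows. This is an illustration under stated assumptions, not the tool's own code: the function name, the seed and the example counts are hypothetical, and the 0.005 threshold mirrors the 0.50% risk threshold shown in the form.

```python
import random


def bayes_ab(x_a, n_a, x_b, n_b, prior_alpha=1.0, prior_beta=1.0,
             draws=30_000, seed=0):
    """Monte Carlo estimate of P(B > A) and the expected loss of shipping B."""
    rng = random.Random(seed)
    wins, loss = 0, 0.0
    for _ in range(draws):
        # Posterior is Beta(alpha + conversions, beta + non-conversions).
        theta_a = rng.betavariate(prior_alpha + x_a, prior_beta + n_a - x_a)
        theta_b = rng.betavariate(prior_alpha + x_b, prior_beta + n_b - x_b)
        wins += theta_b > theta_a
        loss += max(0.0, theta_a - theta_b)
    return wins / draws, loss / draws


# Hypothetical data: 500/10,000 conversions in control, 560/10,000 in variant.
p_b_wins, expected_loss = bayes_ab(500, 10_000, 560, 10_000)
ship = p_b_wins >= 0.95 and expected_loss <= 0.005
print(f"P(B>A) = {p_b_wins:.3f}, expected loss = {expected_loss:.5f}, ship = {ship}")
```

Fixing the seed keeps the estimate reproducible between runs; with 30k draws the Monte Carlo error on P(B>A) is on the order of 0.1 percentage points.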

Prior Choice

α=1, β=1 = uniform (no prior belief). Increase α,β to encode historical knowledge. With large samples, the prior has negligible impact.

vs Frequentist

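Bayesian gives probability statements ("90% chance B is better"). Frequentist gives a yes/no decision at a fixed error rate. Bayesian decision rules are often paired with continuous monitoring, but stopping the moment P(B>A) crosses a threshold can still raise the false-positive rate, so fix the stopping rule before the test starts.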


Define Factors & Levels

ℹ Define up to 3 factors (e.g. "Button Color", "Headline", "CTA"). Each factor can have 2–4 levels. The system generates a full factorial design.

+ Add Factor (max 3)

📚 Methodology — Multivariate Testing (MVT)

Method

Full factorial design — every combination of factor levels is tested simultaneously. Main effects tested via chi-squared per factor. Interaction heatmap shows synergistic/antagonistic effects.

When to Use

Testing 2–3 independent changes at once to find the optimal combination faster than sequential A/B tests.

Sample Size Warning

Each cell needs ≥200 visitors. 3 factors with 3 levels each = 3³ = 27 cells → 27 × 200 = 5,400+ visitors minimum. Power per cell is much lower than a simple A/B test.
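The cell count and the traffic floor can be checked with a short enumeration. The factor names and levels below are hypothetical placeholders; only the 200-visitor minimum comes from the text above.

```python
from itertools import product

# Hypothetical factors and levels, for illustration only.
factors = {
    "Button Color": ["green", "blue", "orange"],
    "Headline": ["short", "long", "question"],
    "CTA": ["Buy now", "Try free", "Learn more"],
}
MIN_VISITORS_PER_CELL = 200

# Full factorial design: one cell per combination of levels.
cells = [dict(zip(factors, combo)) for combo in product(*factors.values())]
min_visitors = len(cells) * MIN_VISITORS_PER_CELL
print(len(cells), "cells,", min_visitors, "visitors minimum")
```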

Interpretation

Main effect significant (p<0.05) → that factor matters. Interaction heatmap: if a cell dramatically outperforms additive prediction, a synergistic interaction exists — factor effects are not independent.

Gotchas

For 4+ factors use fractional factorial designs. Bonferroni-adjust α across all cell comparisons. Never pick a winner without checking statistical significance per cell.


Covariate Balance Check

⚖ Verify that pre-experiment covariates (age, past revenue, days active…) are balanced between control and treatment. Imbalance = potential confounding bias. Rule: |SMD| < 0.1 = well balanced.

Significance Level α (for t-tests)

5.0%

Covariate · Ctrl N · Ctrl Mean · Ctrl SD · Trt N · Trt Mean · Trt SD

+ Add Covariate

📚 Methodology — Covariate Balance

The Metric

Standardized Mean Difference (SMD) = (mean_T − mean_C) / √((SD²_C + SD²_T)/2). Scale-free — works across covariates with different units.

Thresholds

|SMD| < 0.10 ✓ well balanced · 0.10–0.20 marginal · > 0.20 action needed.
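A small sketch of the SMD formula and the thresholds above. The function names and the example covariate values are hypothetical.

```python
from math import sqrt


def smd(mean_c, sd_c, mean_t, sd_t):
    """Standardized mean difference between treatment and control."""
    return (mean_t - mean_c) / sqrt((sd_c**2 + sd_t**2) / 2)


def balance_label(value):
    """Map |SMD| onto the three bands used in this section."""
    magnitude = abs(value)
    if magnitude < 0.10:
        return "well balanced"
    if magnitude <= 0.20:
        return "marginal"
    return "action needed"


# Hypothetical covariate: age, control mean 34.0 (SD 10), treatment mean 35.5 (SD 10).
age_smd = smd(mean_c=34.0, sd_c=10.0, mean_t=35.5, sd_t=10.0)
print(f"SMD = {age_smd:.2f} -> {balance_label(age_smd)}")
```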

Why SMD over p-value

p-values are sensitive to sample size — large N makes trivial imbalances "significant"; small N misses real ones. SMD reflects practical imbalance regardless of N.

Remedies for Imbalance

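Re-randomize with stratification · CUPED (regress out pre-experiment covariate) reduces metric variance by a fraction ρ², where ρ is the correlation between the covariate and the metric (e.g. ρ ≈ 0.7 gives roughly 50%) · Regression adjustment includes covariate in analysis model.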
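A minimal sketch of the CUPED adjustment on synthetic data, assuming a single pre-experiment covariate. The data-generating numbers, the seed and the function name are made up for illustration; the adjustment itself is the standard one, Y_adj = Y − θ(X − mean(X)) with θ = cov(X, Y)/var(X).

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Remove the part of metric y that is explained by pre-experiment covariate x."""
    mean_x, mean_y = mean(x), mean(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]


rng = random.Random(1)
pre = [rng.gauss(100, 20) for _ in range(2_000)]    # synthetic pre-period metric
post = [0.7 * p + rng.gauss(30, 15) for p in pre]   # synthetic in-experiment metric

adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)
print(f"variance reduction = {reduction:.1%}")
```

The adjusted metric keeps the same mean as the raw metric, so the estimated treatment effect is unchanged in expectation while its standard error shrinks.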

Love Plot

Standard visualization in clinical trials. Each bar = SMD for one covariate. Yellow dashed lines mark ±0.10 threshold.

Interpretation Guide

• |SMD| < 0.10 — Well balanced ✓
• |SMD| 0.10–0.20 — Marginal — consider CUPED
• |SMD| > 0.20 — Imbalanced — stratify or re-randomize
• p < α — Statistically significant imbalance
• SMD is independent of sample size — preferred over p-value alone


Parameters

Baseline Conversion Rate

10%

MDE (relative %)

10%

Significance Level α

Statistical Power

80%

Variants (excl. control) Daily Traffic

Notes

📚 Methodology — Sample Size

Formula (Two-Proportion)

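n = (z_(α/2) + z_β)² × (p₁(1−p₁) + p₂(1−p₂)) / (p₂−p₁)² visitors per arm,
where p₂ = p₁ × (1 + MDE) and z_(α/2), z_β are the standard normal critical values for a two-sided test at level α and for power 1−β. Bonferroni: α_adj = α/k for k variants.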
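The formula can be sketched with the standard library's normal quantile function. The function and parameter names are hypothetical; the defaults mirror the form values (10% baseline, 10% relative MDE, 5% α, 80% power).

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_rel, alpha=0.05, power=0.80, variants=1):
    """Visitors needed per arm for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)
    alpha_adj = alpha / variants                     # Bonferroni across variants
    z_alpha = NormalDist().inv_cdf(1 - alpha_adj / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2
    return ceil(n)


n_80 = sample_size_per_arm(baseline=0.10, mde_rel=0.10, power=0.80)
n_90 = sample_size_per_arm(baseline=0.10, mde_rel=0.10, power=0.90)
print(f"80% power: {n_80} per arm, 90% power: {n_90} per arm "
      f"({n_90 / n_80 - 1:.0%} more)")
```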

MDE Explained

Minimum Detectable Effect: the smallest relative change worth detecting. A 10% relative MDE on a 5% baseline gives a 5.5% target rate (0.5 percentage points absolute). Smaller MDE → larger n.

Power

80% power = 20% chance of missing a real effect (Type II error). 90% power costs ~35% more traffic. Standard: 80%.

Runtime Rules

Run ≥ 2 full business cycles (weeks) to capture Mon–Sun seasonality. Never extend based on intermediate results — this inflates false positives (peeking problem).

Gotchas

Sample size is a minimum not a target · Traffic spikes invalidate estimates · Continuous metrics need mean + SD (different formula) · Ratio metrics need delta method.

Required Sample Size

MDE Sensitivity

Runtime vs Traffic

Segment Data Input

ℹ Enter conversion data by segment (Mobile/Desktop, US/EU…). Detects Simpson's Paradox and HTE.

Significance Level α

5.0%

Segment · Ctrl N · Ctrl Conv · Var N · Var Conv

+ Add Segment

📚 Methodology — Segments, HTE & Simpson's Paradox

Simpson's Paradox

Aggregate trend reverses in subgroups. Happens when subgroup sizes are unbalanced across arms AND subgroups have different baseline rates. Always report segment-level results alongside aggregate.
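A worked example with made-up counts shows the reversal: the variant wins in both segments but loses in aggregate, because most variant traffic falls in the low-converting segment. The detection rule used here (all segment lifts share one sign and the aggregate has the other) is an assumption for illustration, not necessarily the tool's rule.

```python
# Hypothetical counts per segment: (ctrl_n, ctrl_conv, var_n, var_conv).
segments = {
    "Mobile": (1_000, 20, 4_000, 100),    # 2.0% vs 2.5%
    "Desktop": (4_000, 400, 1_000, 110),  # 10.0% vs 11.0%
}


def lift(ctrl_n, ctrl_conv, var_n, var_conv):
    """Absolute lift: variant rate minus control rate."""
    return var_conv / var_n - ctrl_conv / ctrl_n


segment_lifts = {name: lift(*row) for name, row in segments.items()}
totals = [sum(column) for column in zip(*segments.values())]
aggregate_lift = lift(*totals)  # 4.2% vs 8.4% once segments are pooled

# Paradox: every segment lift has the same sign, and the aggregate has the opposite sign.
signs = {value > 0 for value in segment_lifts.values()}
paradox = len(signs) == 1 and (aggregate_lift > 0) not in signs
print(segment_lifts, aggregate_lift, paradox)
```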

HTE — Heterogeneous Treatment Effects

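Cochran's Q test measures effect heterogeneity across segments.
Q = Σ wᵢ(liftᵢ − lift_avg)², df = k−1, where wᵢ is the weight of segment i (conventionally inverse-variance, 1/SEᵢ²) and lift_avg is the wᵢ-weighted mean lift.
I² = max(0, (Q−df)/Q) × 100%: the proportion of variance in lifts due to true heterogeneity rather than sampling noise.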
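A sketch of Q and I² on segment counts. It assumes absolute lift (difference in conversion rates) as the effect measure and inverse-variance weights, which is the conventional choice; the tool may define lift or weights differently. I² is floored at 0 because Q can fall below df by chance. Names and counts are hypothetical.

```python
def heterogeneity(segments):
    """segments: iterable of (ctrl_n, ctrl_conv, var_n, var_conv). Returns (Q, df, I2 in %)."""
    lifts, weights = [], []
    for ctrl_n, ctrl_conv, var_n, var_conv in segments:
        p_c, p_v = ctrl_conv / ctrl_n, var_conv / var_n
        se_sq = p_c * (1 - p_c) / ctrl_n + p_v * (1 - p_v) / var_n
        lifts.append(p_v - p_c)
        weights.append(1 / se_sq)          # inverse-variance weight (assumption)
    pooled = sum(w * l for w, l in zip(weights, lifts)) / sum(weights)
    q = sum(w * (l - pooled) ** 2 for w, l in zip(weights, lifts))
    df = len(lifts) - 1
    i_sq = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i_sq


# Hypothetical segments: a 2-point lift in the first, no lift in the second.
q, df, i_sq = heterogeneity([(5_000, 500, 5_000, 600), (5_000, 500, 5_000, 500)])
print(f"Q = {q:.2f}, df = {df}, I² = {i_sq:.0f}%")
```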

I² Benchmarks

<25% Low · 25–50% Moderate · >50% Substantial → personalized rollout warranted.

Pre-registration Rule

Define segment cuts before the experiment — post-hoc segment fishing inflates false discoveries. Apply Bonferroni to segment-level α.

Action on HTE

Ship only to segments with positive, significant effects. Investigate mechanism — is the effect driven by a confounder or a genuine subgroup difference?


Sample Ratio Mismatch (SRM)

ℹ SRM occurs when actual traffic differs from the intended split, which signals a randomization or logging bug.

Intended % per Variant

OBSERVED VISITORS

+ Add Variant

📚 Methodology — SRM Check

The Test

Chi-squared goodness-of-fit comparing observed visitor counts to expected counts under the intended split. χ² = Σ(O−E)²/E, df = k−1.

Why It Matters

SRM invalidates the experiment — assignment is non-random. Any measured effect could be an artifact of the biased sample, not a real treatment effect.

Common Causes

Caching layers · Redirect chains losing assignment · Bot filtering applied unevenly · Logging delays · Hash-based bucketing collisions · Cookie deletion.

Threshold

p < 0.01 flags SRM (stricter than the primary test). Some teams use p < 0.001 for large experiments.
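The goodness-of-fit test and the p < 0.01 flag can be sketched without external libraries. The helper chi2_sf is a hand-written closed-form series for the chi-squared upper tail with integer degrees of freedom; it and srm_check are hypothetical names, and a statistics library would normally replace the helper.

```python
from math import erfc, exp, pi, sqrt


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for chi-squared with integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2-1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= x / (2 * r)
            total += term
        return exp(-x / 2) * total
    # Odd df: normal tail plus a finite series in sqrt(x).
    chi = sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return erfc(chi / sqrt(2)) + sqrt(2 / pi) * exp(-x / 2) * total


def srm_check(observed, intended_pct, threshold=0.01):
    """Returns (chi2, p_value, flagged) for observed visitor counts per variant."""
    total = sum(observed)
    expected = [total * pct / sum(intended_pct) for pct in intended_pct]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = chi2_sf(chi2, len(observed) - 1)
    return chi2, p_value, p_value < threshold


# Hypothetical 50/50 test that logged 10,000 vs 10,450 visitors.
chi2, p_value, flagged = srm_check([10_000, 10_450], [50, 50])
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, SRM flagged = {flagged}")
```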

Action on SRM

Stop the experiment. Do not report results. Fix the pipeline. Re-run from scratch. Never attempt to "correct" SRM results statistically.

Suspicious Balance

A perfectly balanced split (p > 0.99) can also be suspicious — true randomization has natural variance. Focus on the magnitude of deviation, not just the p-value.


Test Selector

Test Type Significance Level α

5.0%

Groups (≥ 3 recommended)

Group · n · Mean · SD

+ Add Group

Chi-Square

Mode

Category Observed Expected

+ Add Category

T-Test

Sub-type

Sample Mean (x̄)

Sample SD

Hypothesized Mean (μ₀)

GROUP 1

n₁

Mean₁

SD₁

GROUP 2

n₂

Mean₂

SD₂

n (pairs)

Mean Difference (d̄)

SD of Differences

Tails

Z-Test

Sub-type

Observed Conversions (x)

Null Proportion (p₀)

n₁

Conversions₁

n₂

Conversions₂

Sample Mean (x̄)

Population σ (known)

Null Mean (μ₀)

Tails

📚 Methodology — Statistical Tests Guide

ANOVA (One-Way)

Tests whether ≥3 group means differ. F = MS_between / MS_within. Significant F only tells you some group differs — use Tukey HSD post-hoc to identify which pairs. Effect size: η² (small=0.01, medium=0.06, large=0.14). Assumes normality + homogeneity of variance.
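A sketch of the F statistic and η² computed from per-group summary statistics (n, mean, SD), the same inputs the form asks for. The function name and example groups are hypothetical. The p-value is left out because the F-distribution tail is not in the Python standard library.

```python
def one_way_anova(groups):
    """groups: list of (n, mean, sd). Returns (F, df_between, df_within, eta_squared)."""
    n_total = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * sd**2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_sq


# Hypothetical groups: 30 observations each, means 10, 11, 12, common SD 2.
f_stat, df_b, df_w, eta_sq = one_way_anova([(30, 10, 2), (30, 11, 2), (30, 12, 2)])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}, eta² = {eta_sq:.3f}")
```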

Chi-Square

GoF: tests whether observed frequencies match expected. Contingency: tests independence between two categorical variables. Effect size: Cramér's V (small=0.1, medium=0.3, large=0.5). Requires expected cell counts ≥5.

T-Test Variants

One-sample: tests if a mean equals a hypothesized value μ₀.
Two-sample pooled: assumes equal variances (use when Levene's test passes).
Welch's: does not assume equal variances — preferred default for two groups.
Paired: for matched/repeated measures (before/after). Effect size: Cohen's d (small=0.2, medium=0.5, large=0.8).

Z-Test Variants

One-proportion: tests p̂ against null p₀. Valid when np₀≥5 and n(1−p₀)≥5.
Two-proportion: same as AB Analyzer but standalone. Uses pooled SE under H₀.
One-mean (σ known): rare in practice — use t-test if σ is estimated from data.
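As an illustration of the first variant, a one-proportion z-test with the validity check from the list above. The function name and example counts are hypothetical, and only the two-sided case is shown.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(x, n, p0):
    """Two-sided z-test of an observed proportion x/n against a null value p0."""
    if n * p0 < 5 or n * (1 - p0) < 5:
        raise ValueError("normal approximation needs n*p0 >= 5 and n*(1-p0) >= 5")
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # SE uses p0, the rate under H0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 60 conversions out of 500 against a null rate of 10%.
z_one, p_one = one_proportion_z(x=60, n=500, p0=0.10)
print(f"z = {z_one:.3f}, p = {p_one:.3f}")
```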

Distribution Charts

Yellow dashed line = observed statistic. Red shaded area = rejection region at chosen α. If the yellow line falls inside red, p < α.
