A/B Testing Engineering: Statistical Significance, Experiment Design, and Feature Flag Rollouts
Build a rigorous A/B testing program — statistical significance and power calculations, minimum detectable effect, novelty effects, and feature flag-based experiment infrastructure with result analysis tooling.
A/B testing is widely misused. Most product teams run tests too short, declare success too early, and make decisions based on noise. A single p < 0.05 result with 100 users is not evidence. These are the principles and implementation patterns that make experiments produce reliable signal.
The Statistics You Need to Know
Statistical significance (p-value): The probability of seeing a result at least this extreme by random chance, assuming there is truly no effect. p < 0.05 means "less than a 5% chance this result is noise." But:
- With 20 concurrent tests at α = 0.05, you'd expect about 1 false positive even if nothing works
- Always pre-specify your primary metric before running the test
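The multiple-testing math is easy to check directly. This sketch (plain Python, numbers illustrative) computes the family-wise false positive rate for 20 concurrent tests, plus the standard Bonferroni correction:

```python
# Probability of at least one false positive across k independent tests
# run at significance level alpha, assuming no real effects exist anywhere.
alpha = 0.05
k = 20

fwer = 1 - (1 - alpha) ** k  # family-wise error rate
print(f"P(>=1 false positive across {k} tests): {fwer:.1%}")  # ~64%

# Bonferroni correction: test each at alpha/k to cap the family-wise rate
alpha_corrected = alpha / k
fwer_corrected = 1 - (1 - alpha_corrected) ** k
print(f"With Bonferroni (alpha = {alpha_corrected}): {fwer_corrected:.1%}")  # ~4.9%
```

Bonferroni is conservative; the point is that uncorrected concurrent tests make a "significant" result close to a coin flip.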
Statistical power (1-β): The probability the test detects an effect that truly exists. At 80% power, a test has a 20% chance of missing a real effect. Standard target: 80% power.
Minimum Detectable Effect (MDE): The smallest change worth detecting. If you can only detect effects ≥ 5%, a 2% improvement will be invisible. Set your MDE based on what would actually change a product decision.
Sample size calculation:
```python
# scripts/sample_size.py
from scipy import stats
import math

def required_sample_size(
    baseline_rate: float,              # Current conversion rate (e.g., 0.05 for 5%)
    minimum_detectable_effect: float,  # Relative lift to detect (e.g., 0.10 for 10% relative)
    alpha: float = 0.05,               # Significance level (two-tailed)
    power: float = 0.80,               # Statistical power
) -> int:
    """
    Calculate required sample size per variant for a two-proportion z-test.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # Two-tailed
    z_beta = stats.norm.ppf(power)
    p_pooled = (p1 + p2) / 2
    effect = abs(p2 - p1)
    n = (z_alpha * math.sqrt(2 * p_pooled * (1 - p_pooled)) +
         z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / effect ** 2
    return math.ceil(n)

# Example: 5% baseline checkout rate, want to detect 15% relative improvement
n = required_sample_size(
    baseline_rate=0.05,
    minimum_detectable_effect=0.15,  # Detect 5% → 5.75% conversion
)
print(f"Required per variant: {n:,}")    # ~14,200 per variant
print(f"Total (2 variants): {n * 2:,}")  # ~28,400 total users

# How long will the test run?
daily_users = 500
days_needed = (n * 2) / daily_users
print(f"Days needed: {days_needed:.1f}")  # ~57 days
```
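Required n scales roughly with 1/MDE², so halving the MDE roughly quadruples the sample size. A quick sweep at the same 5% baseline (the formula is repeated here so the snippet stands alone):

```python
from scipy import stats
import math

def required_sample_size(baseline_rate, minimum_detectable_effect,
                         alpha=0.05, power=0.80):
    # Same two-proportion z-test formula as scripts/sample_size.py above
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    p_bar = (p1 + p2) / 2
    n = (z_a * math.sqrt(2 * p_bar * (1 - p_bar)) +
         z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return math.ceil(n)

# Per-variant sample size at a 5% baseline, for several relative MDEs
for mde in (0.05, 0.10, 0.15, 0.20, 0.30):
    n = required_sample_size(0.05, mde)
    print(f"MDE {mde:>4.0%}: {n:>8,} per variant")
```

Chasing a 5% relative MDE at this baseline takes over 100,000 users per variant; this is why the MDE must be set by what would change a product decision, not by optimism.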
Common Experiment Mistakes
| Mistake | What Happens | Fix |
|---|---|---|
| Peeking | Checking results daily and stopping when significant | Pre-specify sample size; stop only at predetermined date |
| Multiple metrics | One of 10 metrics is significant — probably by chance | Pre-specify one primary metric; others are exploratory |
| Too short | Week 1 novelty effect inflates results | Run ≥ 2 full business cycles (2 weeks minimum) |
| No holdback | All users in experiment; no baseline | Always have a control group |
| Post-hoc segmentation | "Men 25-34 showed significance" | Pre-specify segments; correct for multiple comparisons |
| Sample ratio mismatch | 53% control vs 47% treatment (should be 50/50) | Diagnose assignment bias before analyzing results |
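The peeking problem from the table is easy to demonstrate with a simulation: run an A/A test (no real effect), check significance every day, and stop at the first p < 0.05. The false positive rate blows well past the nominal 5%. A rough sketch with made-up traffic numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def run_aa_test(daily_users=500, days=28, p=0.05, alpha=0.05):
    """Simulate an A/A test, peeking daily; return True if ever 'significant'."""
    c_conv = t_conv = c_n = t_n = 0
    for _ in range(days):
        c_n += daily_users
        t_n += daily_users
        c_conv += rng.binomial(daily_users, p)
        t_conv += rng.binomial(daily_users, p)
        p1, p2 = c_conv / c_n, t_conv / t_n
        pooled = (c_conv + t_conv) / (c_n + t_n)
        se = np.sqrt(pooled * (1 - pooled) * (1 / c_n + 1 / t_n))
        z = abs(p2 - p1) / se
        if 2 * (1 - stats.norm.cdf(z)) < alpha:
            return True  # would have stopped here and shipped a null effect
    return False

false_positives = sum(run_aa_test() for _ in range(1000))
print(f"False positive rate with daily peeking: {false_positives / 1000:.1%}")
# Expect well above the nominal 5% (typically 15-30% in this setup)
```

If you genuinely need to look early, use a sequential testing procedure with an alpha-spending plan rather than repeated fixed-horizon tests.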
Feature Flag-Based Experiment Infrastructure
```typescript
// lib/experiments.ts (server-only: ioredis cannot run in the browser)
import { Redis } from 'ioredis';
import { createHash } from 'crypto';

const redis = new Redis(process.env.REDIS_URL!);

interface Experiment {
  id: string;
  variants: Array<{ id: string; weight: number }>; // weights sum to 1
  targetingRules?: {
    newUsersOnly?: boolean;
    countries?: string[];
    minAccountAgeDays?: number;
  };
  startDate: string; // ISO timestamp — Dates don't survive a JSON round-trip
  endDate?: string;
}

// Deterministic variant assignment: same user always gets same variant
function assignVariant(userId: string, experiment: Experiment): string {
  // Hash userId + experimentId for stable, experiment-specific assignment
  const hash = createHash('sha256')
    .update(`${userId}:${experiment.id}`)
    .digest('hex');
  // Convert first 8 hex chars to a 0-1 float
  const bucket = parseInt(hash.slice(0, 8), 16) / 0xffffffff;
  // Assign based on cumulative weights
  let cumulative = 0;
  for (const variant of experiment.variants) {
    cumulative += variant.weight;
    if (bucket <= cumulative) return variant.id;
  }
  return experiment.variants[experiment.variants.length - 1].id;
}

export async function getVariant(
  userId: string,
  experimentId: string,
): Promise<string | null> {
  const experimentData = await redis.get(`experiment:${experimentId}`);
  if (!experimentData) return null;
  const experiment: Experiment = JSON.parse(experimentData);
  // Check experiment is active (parse the ISO strings back into Dates)
  const now = new Date();
  if (now < new Date(experiment.startDate)) return null;
  if (experiment.endDate && now > new Date(experiment.endDate)) return null;
  const variant = assignVariant(userId, experiment);
  // Track assignment for analysis (HyperLogLog: approximate unique user counts)
  await redis.pfadd(`exp:${experimentId}:users:${variant}`, userId);
  return variant;
}
```

The client-side React hook lives in its own file, since it must not import the Redis code above:

```typescript
// hooks/useExperiment.ts — client-side React hook
import { useEffect, useState } from 'react';
import { useAuth } from './useAuth'; // your app's auth hook

export function useExperiment(experimentId: string): {
  variant: string | null;
  isControl: boolean;
  isTreatment: (variantId: string) => boolean;
} {
  const { user } = useAuth();
  const [variant, setVariant] = useState<string | null>(null);

  useEffect(() => {
    if (!user?.id) return;
    fetch(`/api/experiments/${experimentId}/variant?userId=${user.id}`)
      .then((r) => r.json())
      .then((data) => setVariant(data.variant));
  }, [user?.id, experimentId]);

  return {
    variant,
    isControl: variant === 'control',
    isTreatment: (id: string) => variant === id,
  };
}
```
Using the experiment in a component:

```typescript
// components/PricingPage.tsx
export function PricingPage() {
  const { isTreatment } = useExperiment('pricing-page-redesign');
  // Control: original pricing table
  // Treatment: new card-based layout with annual toggle prominent
  return isTreatment('card-layout')
    ? <PricingPageV2 />
    : <PricingPageV1 />;
}
```
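Deterministic bucketing is worth sanity-checking offline. The sketch below re-implements the sha256 bucketing from assignVariant in Python (experiment id and user ids are illustrative) and confirms the split lands near the configured 50/50 weights:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants: list[tuple[str, float]]) -> str:
    """Mirror of the TypeScript assignVariant: sha256-based deterministic bucketing."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for variant_id, weight in variants:
        cumulative += weight
        if bucket <= cumulative:
            return variant_id
    return variants[-1][0]

variants = [("control", 0.5), ("card-layout", 0.5)]
counts = {"control": 0, "card-layout": 0}
for i in range(100_000):
    counts[assign_variant(f"user-{i}", "pricing-page-redesign", variants)] += 1

print(counts)  # Both counts should land near 50,000

# Determinism: the same user always gets the same variant
assert assign_variant("user-42", "pricing-page-redesign", variants) == \
       assign_variant("user-42", "pricing-page-redesign", variants)
```

A split that drifts far from the configured weights in a simulation like this points at the assignment logic itself, which is exactly what a sample ratio mismatch check would later flag in production.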
Analyzing Results
```python
# scripts/analyze_experiment.py
import numpy as np
from scipy import stats
from dataclasses import dataclass

@dataclass
class ExperimentResults:
    variant: str
    users: int
    conversions: int

    @property
    def rate(self) -> float:
        return self.conversions / self.users

def analyze_experiment(
    control: ExperimentResults,
    treatment: ExperimentResults,
    alpha: float = 0.05,
) -> dict:
    # Two-proportion z-test
    p1, n1 = control.rate, control.users
    p2, n2 = treatment.rate, treatment.users
    p_pooled = (control.conversions + treatment.conversions) / (n1 + n2)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # Two-tailed

    # Relative lift
    lift = (p2 - p1) / p1

    # 95% confidence interval for the absolute difference in rates
    se_diff = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
    z_crit = stats.norm.ppf(1 - alpha/2)
    ci_lower = (p2 - p1) - z_crit * se_diff
    ci_upper = (p2 - p1) + z_crit * se_diff

    # Sample ratio mismatch check
    expected_ratio = 0.5  # Assuming a 50/50 split
    actual_ratio = n1 / (n1 + n2)
    chi2_srm = ((actual_ratio - expected_ratio)**2 / expected_ratio) * 2 * (n1 + n2)
    p_srm = 1 - stats.chi2.cdf(chi2_srm, df=1)

    return {
        'control_rate': f'{p1:.3%}',
        'treatment_rate': f'{p2:.3%}',
        'relative_lift': f'{lift:+.1%}',
        'confidence_interval_95': f'[{ci_lower:+.3%}, {ci_upper:+.3%}]',
        'p_value': round(p_value, 4),
        'significant': p_value < alpha,
        'sample_ratio_mismatch': p_srm < 0.01,  # Warn if SRM detected
    }

# Example usage
control = ExperimentResults('control', users=3200, conversions=160)
treatment = ExperimentResults('card-layout', users=3150, conversions=181)
result = analyze_experiment(control, treatment)
print(result)
# {
#   'control_rate': '5.000%',
#   'treatment_rate': '5.746%',
#   'relative_lift': '+14.9%',
#   'confidence_interval_95': '[-0.363%, +1.855%]',
#   'p_value': 0.1873,
#   'significant': False,  # a +14.9% observed lift, but not significant at this n
#   'sample_ratio_mismatch': False
# }
```

Note that a +14.9% observed lift still fails to reach significance at roughly 3,200 users per variant — the confidence interval includes zero. This is the sample size math from earlier showing up in practice.
Sample Ratio Mismatch (SRM)
SRM occurs when your actual variant split differs significantly from the intended split. It invalidates results:
```python
from scipy import stats

# SRM causes: cookie deletion, bot traffic on one variant,
# CDN caching serving one variant to all users in a region,
# bad assignment logic (not truly random)

# Always check before analyzing results.
# Example counts with a clear mismatch (intended split was 50/50):
control_users, treatment_users = 53_000, 47_000
expected_per_group = (control_users + treatment_users) / 2

chi2, p = stats.chisquare([control_users, treatment_users],
                          f_exp=[expected_per_group, expected_per_group])
if p < 0.01:
    print("⚠️ Sample Ratio Mismatch detected — results are invalid")
    print("Investigate assignment mechanism before analyzing outcomes")
```
Experiment Governance
Experiment Review Checklist (before launching)
Hypothesis
- One primary metric defined (and pre-specified)
- Minimum detectable effect specified
- Sample size calculated and test duration set
- Target segment defined (all users? New users? Specific plan?)
Technical
- Assignment is deterministic (same user always gets same variant)
- Assignment is independent of primary metric
- No shared infrastructure that could cause spillover
- Logging confirmed working for both variants
Risk
- Rollback plan if treatment breaks something
- Feature flag can be disabled without code deploy
- Control group is a true holdback (not 0%)
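The "disable without a code deploy" item falls out of the Redis-backed design: ending an experiment is just rewriting its stored definition so getVariant() starts returning null. A minimal sketch, using a plain dict to stand in for Redis (the `experiment:<id>` key convention follows the infrastructure section; the endDate-as-ISO-string field is an assumption):

```python
import json
from datetime import datetime, timezone

def kill_experiment(store: dict, experiment_id: str) -> None:
    """Emergency stop: set endDate to 'now' so assignment stops for everyone.
    `store` stands in for Redis here (get/set on string keys)."""
    key = f"experiment:{experiment_id}"
    raw = store.get(key)
    if raw is None:
        return
    experiment = json.loads(raw)
    experiment["endDate"] = datetime.now(timezone.utc).isoformat()
    store[key] = json.dumps(experiment)

# In production the same logic runs against Redis (r.get / r.set);
# no deploy needed, the next assignment call sees the experiment as ended.
store = {"experiment:pricing-page-redesign": json.dumps({
    "id": "pricing-page-redesign",
    "variants": [{"id": "control", "weight": 0.5},
                 {"id": "card-layout", "weight": 0.5}],
    "startDate": "2024-01-01T00:00:00Z",
})}
kill_experiment(store, "pricing-page-redesign")
print("endDate" in json.loads(store["experiment:pricing-page-redesign"]))  # True
```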
---
## Working With Viprasol
We build experimentation infrastructure — feature flag systems, A/B test assignment logic, analytics event pipelines for experiment tracking, and result analysis tooling. Experimentation infrastructure is a core growth investment for product-led teams.
→ [Talk to our team](/contact) about growth engineering and experimentation.
---
See Also
- Feature Flags — the infrastructure underneath A/B tests
- Product Analytics — event tracking for experiment metrics
- Product-Led Growth — experiments in a PLG context
- Edge Computing — A/B testing at the edge (no flicker)
- Web Development Services — growth engineering and analytics
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.