A/B Testing Engineering: Statistical Significance, Experiment Design, and Feature Flag Rollouts
Build a rigorous A/B testing program — statistical significance and power calculations, minimum detectable effect, novelty effects, and feature flag-based experiment infrastructure with result analysis tooling.
A/B testing is widely misused. Most product teams run tests too short, declare success too early, and make decisions based on noise. A single p < 0.05 result with 100 users is not evidence. These are the principles and implementation patterns that make experiments produce reliable signal.
The Statistics You Need to Know
Statistical significance (p-value): The probability of seeing a result at least this extreme by random chance, assuming there is truly no effect. p < 0.05 means "less than a 5% chance this result is noise." But:
- With 20 concurrent tests at α = 0.05, you'd expect about 1 false positive even if nothing works
- Always pre-specify your primary metric before running the test
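The multiple-testing math is easy to check directly. This sketch (plain Python, numbers illustrative) computes the family-wise false positive rate for 20 concurrent tests, plus the standard Bonferroni correction:

```python
# Probability of at least one false positive across k independent tests
# run at significance level alpha, assuming no real effects exist anywhere.
alpha = 0.05
k = 20

fwer = 1 - (1 - alpha) ** k  # family-wise error rate
print(f"P(>=1 false positive across {k} tests): {fwer:.1%}")  # ~64%

# Bonferroni correction: test each at alpha/k to cap the family-wise rate
alpha_corrected = alpha / k
fwer_corrected = 1 - (1 - alpha_corrected) ** k
print(f"With Bonferroni (alpha = {alpha_corrected}): {fwer_corrected:.1%}")  # ~4.9%
```

Bonferroni is conservative; the point is that uncorrected concurrent tests make a "significant" result close to a coin flip.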
Statistical power (1-β): The probability the test detects an effect that truly exists. At 80% power, a test has a 20% chance of missing a real effect. Standard target: 80% power.
Minimum Detectable Effect (MDE): The smallest change worth detecting. If you can only detect effects ≥ 5%, a 2% improvement will be invisible. Set your MDE based on what would actually change a product decision.
Sample size calculation:
```python
# scripts/sample_size.py
from scipy import stats
import math

def required_sample_size(
    baseline_rate: float,              # Current conversion rate (e.g., 0.05 for 5%)
    minimum_detectable_effect: float,  # Relative lift to detect (e.g., 0.10 for 10% relative)
    alpha: float = 0.05,               # Significance level (two-tailed)
    power: float = 0.80,               # Statistical power
) -> int:
    """
    Calculate required sample size per variant for a two-proportion z-test.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # Two-tailed
    z_beta = stats.norm.ppf(power)
    p_pooled = (p1 + p2) / 2
    effect = abs(p2 - p1)
    n = (z_alpha * math.sqrt(2 * p_pooled * (1 - p_pooled)) +
         z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / effect ** 2
    return math.ceil(n)

# Example: 5% baseline checkout rate, want to detect 15% relative improvement
n = required_sample_size(
    baseline_rate=0.05,
    minimum_detectable_effect=0.15,  # Detect 5% → 5.75% conversion
)
print(f"Required per variant: {n:,}")    # ~14,200 per variant
print(f"Total (2 variants): {n * 2:,}")  # ~28,400 total users

# How long will the test run?
daily_users = 500
days_needed = (n * 2) / daily_users
print(f"Days needed: {days_needed:.1f}")  # ~57 days
```
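Required n scales roughly with 1/MDE², so halving the MDE roughly quadruples the sample size. A quick sweep at the same 5% baseline (the formula is repeated here so the snippet stands alone):

```python
from scipy import stats
import math

def required_sample_size(baseline_rate, minimum_detectable_effect,
                         alpha=0.05, power=0.80):
    # Same two-proportion z-test formula as scripts/sample_size.py above
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)
    z_a = stats.norm.ppf(1 - alpha / 2)
    z_b = stats.norm.ppf(power)
    p_bar = (p1 + p2) / 2
    n = (z_a * math.sqrt(2 * p_bar * (1 - p_bar)) +
         z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return math.ceil(n)

# Per-variant sample size at a 5% baseline, for several relative MDEs
for mde in (0.05, 0.10, 0.15, 0.20, 0.30):
    n = required_sample_size(0.05, mde)
    print(f"MDE {mde:>4.0%}: {n:>8,} per variant")
```

Chasing a 5% relative MDE at this baseline takes over 100,000 users per variant; this is why the MDE must be set by what would change a product decision, not by optimism.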
Common Experiment Mistakes
| Mistake | What Happens | Fix |
|---|---|---|
| Peeking | Checking results daily and stopping when significant | Pre-specify sample size; stop only at predetermined date |
| Multiple metrics | One of 10 metrics is significant — probably by chance | Pre-specify one primary metric; others are exploratory |
| Too short | Week 1 novelty effect inflates results | Run ≥ 2 full business cycles (2 weeks minimum) |
| No holdback | All users in experiment; no baseline | Always have a control group |
| Post-hoc segmentation | "Men 25-34 showed significance" | Pre-specify segments; correct for multiple comparisons |
| Sample ratio mismatch | 53% control vs 47% treatment (should be 50/50) | Diagnose assignment bias before analyzing results |
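The peeking problem from the table is easy to demonstrate with a simulation: run an A/A test (no real effect), check significance every day, and stop at the first p < 0.05. The false positive rate blows well past the nominal 5%. A rough sketch with made-up traffic numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def run_aa_test(daily_users=500, days=28, p=0.05, alpha=0.05):
    """Simulate an A/A test, peeking daily; return True if ever 'significant'."""
    c_conv = t_conv = c_n = t_n = 0
    for _ in range(days):
        c_n += daily_users
        t_n += daily_users
        c_conv += rng.binomial(daily_users, p)
        t_conv += rng.binomial(daily_users, p)
        p1, p2 = c_conv / c_n, t_conv / t_n
        pooled = (c_conv + t_conv) / (c_n + t_n)
        se = np.sqrt(pooled * (1 - pooled) * (1 / c_n + 1 / t_n))
        z = abs(p2 - p1) / se
        if 2 * (1 - stats.norm.cdf(z)) < alpha:
            return True  # would have stopped here and shipped a null effect
    return False

false_positives = sum(run_aa_test() for _ in range(1000))
print(f"False positive rate with daily peeking: {false_positives / 1000:.1%}")
# Expect well above the nominal 5% (typically 15-30% in this setup)
```

If you genuinely need to look early, use a sequential testing procedure with an alpha-spending plan rather than repeated fixed-horizon tests.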
Feature Flag-Based Experiment Infrastructure
```typescript
// lib/experiments.ts (server-only: ioredis cannot run in the browser)
import { Redis } from 'ioredis';
import { createHash } from 'crypto';

const redis = new Redis(process.env.REDIS_URL!);

interface Experiment {
  id: string;
  variants: Array<{ id: string; weight: number }>; // weights sum to 1
  targetingRules?: {
    newUsersOnly?: boolean;
    countries?: string[];
    minAccountAgeDays?: number;
  };
  startDate: string; // ISO timestamp — Dates don't survive a JSON round-trip
  endDate?: string;
}

// Deterministic variant assignment: same user always gets same variant
function assignVariant(userId: string, experiment: Experiment): string {
  // Hash userId + experimentId for stable, experiment-specific assignment
  const hash = createHash('sha256')
    .update(`${userId}:${experiment.id}`)
    .digest('hex');
  // Convert first 8 hex chars to a 0-1 float
  const bucket = parseInt(hash.slice(0, 8), 16) / 0xffffffff;
  // Assign based on cumulative weights
  let cumulative = 0;
  for (const variant of experiment.variants) {
    cumulative += variant.weight;
    if (bucket <= cumulative) return variant.id;
  }
  return experiment.variants[experiment.variants.length - 1].id;
}

export async function getVariant(
  userId: string,
  experimentId: string,
): Promise<string | null> {
  const experimentData = await redis.get(`experiment:${experimentId}`);
  if (!experimentData) return null;
  const experiment: Experiment = JSON.parse(experimentData);
  // Check experiment is active (parse the ISO strings back into Dates)
  const now = new Date();
  if (now < new Date(experiment.startDate)) return null;
  if (experiment.endDate && now > new Date(experiment.endDate)) return null;
  const variant = assignVariant(userId, experiment);
  // Track assignment for analysis (HyperLogLog: approximate unique user counts)
  await redis.pfadd(`exp:${experimentId}:users:${variant}`, userId);
  return variant;
}
```

The client-side React hook lives in its own file, since it must not import the Redis code above:

```typescript
// hooks/useExperiment.ts — client-side React hook
import { useEffect, useState } from 'react';
import { useAuth } from './useAuth'; // your app's auth hook

export function useExperiment(experimentId: string): {
  variant: string | null;
  isControl: boolean;
  isTreatment: (variantId: string) => boolean;
} {
  const { user } = useAuth();
  const [variant, setVariant] = useState<string | null>(null);

  useEffect(() => {
    if (!user?.id) return;
    fetch(`/api/experiments/${experimentId}/variant?userId=${user.id}`)
      .then((r) => r.json())
      .then((data) => setVariant(data.variant));
  }, [user?.id, experimentId]);

  return {
    variant,
    isControl: variant === 'control',
    isTreatment: (id: string) => variant === id,
  };
}
```
Using the experiment in a component:

```typescript
// components/PricingPage.tsx
export function PricingPage() {
  const { isTreatment } = useExperiment('pricing-page-redesign');
  // Control: original pricing table
  // Treatment: new card-based layout with annual toggle prominent
  return isTreatment('card-layout')
    ? <PricingPageV2 />
    : <PricingPageV1 />;
}
```
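Deterministic bucketing is worth sanity-checking offline. The sketch below re-implements the sha256 bucketing from assignVariant in Python (experiment id and user ids are illustrative) and confirms the split lands near the configured 50/50 weights:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants: list[tuple[str, float]]) -> str:
    """Mirror of the TypeScript assignVariant: sha256-based deterministic bucketing."""
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for variant_id, weight in variants:
        cumulative += weight
        if bucket <= cumulative:
            return variant_id
    return variants[-1][0]

variants = [("control", 0.5), ("card-layout", 0.5)]
counts = {"control": 0, "card-layout": 0}
for i in range(100_000):
    counts[assign_variant(f"user-{i}", "pricing-page-redesign", variants)] += 1

print(counts)  # Both counts should land near 50,000

# Determinism: the same user always gets the same variant
assert assign_variant("user-42", "pricing-page-redesign", variants) == \
       assign_variant("user-42", "pricing-page-redesign", variants)
```

A split that drifts far from the configured weights in a simulation like this points at the assignment logic itself, which is exactly what a sample ratio mismatch check would later flag in production.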
Analyzing Results
```python
# scripts/analyze_experiment.py
import numpy as np
from scipy import stats
from dataclasses import dataclass

@dataclass
class ExperimentResults:
    variant: str
    users: int
    conversions: int

    @property
    def rate(self) -> float:
        return self.conversions / self.users

def analyze_experiment(
    control: ExperimentResults,
    treatment: ExperimentResults,
    alpha: float = 0.05,
) -> dict:
    # Two-proportion z-test
    p1, n1 = control.rate, control.users
    p2, n2 = treatment.rate, treatment.users
    p_pooled = (control.conversions + treatment.conversions) / (n1 + n2)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # Two-tailed

    # Relative lift
    lift = (p2 - p1) / p1

    # 95% confidence interval for the absolute difference in rates
    se_diff = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
    z_crit = stats.norm.ppf(1 - alpha/2)
    ci_lower = (p2 - p1) - z_crit * se_diff
    ci_upper = (p2 - p1) + z_crit * se_diff

    # Sample ratio mismatch check
    expected_ratio = 0.5  # Assuming a 50/50 split
    actual_ratio = n1 / (n1 + n2)
    chi2_srm = ((actual_ratio - expected_ratio)**2 / expected_ratio) * 2 * (n1 + n2)
    p_srm = 1 - stats.chi2.cdf(chi2_srm, df=1)

    return {
        'control_rate': f'{p1:.3%}',
        'treatment_rate': f'{p2:.3%}',
        'relative_lift': f'{lift:+.1%}',
        'confidence_interval_95': f'[{ci_lower:+.3%}, {ci_upper:+.3%}]',
        'p_value': round(p_value, 4),
        'significant': p_value < alpha,
        'sample_ratio_mismatch': p_srm < 0.01,  # Warn if SRM detected
    }

# Example usage
control = ExperimentResults('control', users=3200, conversions=160)
treatment = ExperimentResults('card-layout', users=3150, conversions=181)
result = analyze_experiment(control, treatment)
print(result)
# {
#   'control_rate': '5.000%',
#   'treatment_rate': '5.746%',
#   'relative_lift': '+14.9%',
#   'confidence_interval_95': '[-0.363%, +1.855%]',
#   'p_value': 0.1873,
#   'significant': False,  # a +14.9% observed lift, but not significant at this n
#   'sample_ratio_mismatch': False
# }
```

Note that a +14.9% observed lift still fails to reach significance at roughly 3,200 users per variant — the confidence interval includes zero. This is the sample size math from earlier showing up in practice.
Sample Ratio Mismatch (SRM)
SRM occurs when your actual variant split differs significantly from the intended split. It invalidates results:
```python
from scipy import stats

# SRM causes: cookie deletion, bot traffic on one variant,
# CDN caching serving one variant to all users in a region,
# bad assignment logic (not truly random)

# Always check before analyzing results.
# Example counts with a clear mismatch (intended split was 50/50):
control_users, treatment_users = 53_000, 47_000
expected_per_group = (control_users + treatment_users) / 2

chi2, p = stats.chisquare([control_users, treatment_users],
                          f_exp=[expected_per_group, expected_per_group])
if p < 0.01:
    print("⚠️ Sample Ratio Mismatch detected — results are invalid")
    print("Investigate assignment mechanism before analyzing outcomes")
```
Experiment Governance
Experiment Review Checklist (before launching)
Hypothesis
- One primary metric defined (and pre-specified)
- Minimum detectable effect specified
- Sample size calculated and test duration set
- Target segment defined (all users? New users? Specific plan?)
Technical
- Assignment is deterministic (same user always gets same variant)
- Assignment is independent of primary metric
- No shared infrastructure that could cause spillover
- Logging confirmed working for both variants
Risk
- Rollback plan if treatment breaks something
- Feature flag can be disabled without code deploy
- Control group is a true holdback (not 0%)
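The "disable without a code deploy" item falls out of the Redis-backed design: ending an experiment is just rewriting its stored definition so getVariant() starts returning null. A minimal sketch, using a plain dict to stand in for Redis (the `experiment:<id>` key convention follows the infrastructure section; the endDate-as-ISO-string field is an assumption):

```python
import json
from datetime import datetime, timezone

def kill_experiment(store: dict, experiment_id: str) -> None:
    """Emergency stop: set endDate to 'now' so assignment stops for everyone.
    `store` stands in for Redis here (get/set on string keys)."""
    key = f"experiment:{experiment_id}"
    raw = store.get(key)
    if raw is None:
        return
    experiment = json.loads(raw)
    experiment["endDate"] = datetime.now(timezone.utc).isoformat()
    store[key] = json.dumps(experiment)

# In production the same logic runs against Redis (r.get / r.set);
# no deploy needed, the next assignment call sees the experiment as ended.
store = {"experiment:pricing-page-redesign": json.dumps({
    "id": "pricing-page-redesign",
    "variants": [{"id": "control", "weight": 0.5},
                 {"id": "card-layout", "weight": 0.5}],
    "startDate": "2024-01-01T00:00:00Z",
})}
kill_experiment(store, "pricing-page-redesign")
print("endDate" in json.loads(store["experiment:pricing-page-redesign"]))  # True
```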
---
## Working With Viprasol
We build experimentation infrastructure — feature flag systems, A/B test assignment logic, analytics event pipelines for experiment tracking, and result analysis tooling. Experimentation infrastructure is a core growth investment for product-led teams.
→ [Talk to our team](/contact) about growth engineering and experimentation.
---
See Also
- Feature Flags — the infrastructure underneath A/B tests
- Product Analytics — event tracking for experiment metrics
- Product-Led Growth — experiments in a PLG context
- Edge Computing — A/B testing at the edge (no flicker)
- Web Development Services — growth engineering and analytics
About the Author
Viprasol Tech Team
Custom Software Development Specialists
The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.