
A/B Testing Engineering: Statistical Significance, Experiment Design, and Feature Flag Rollouts

Build a rigorous A/B testing program — statistical significance and power calculations, minimum detectable effect, novelty effect, and feature flag-based experiment infrastructure.

Viprasol Tech Team
June 9, 2026
12 min read

A/B testing is widely misused. Most product teams run tests too short, declare success too early, and make decisions based on noise. A single p < 0.05 result with 100 users is not evidence. These are the principles and implementation patterns that make experiments produce reliable signal.


The Statistics You Need to Know

Statistical significance (p-value): The probability of seeing a result at least this extreme by random chance if there were actually no effect. p < 0.05 means "if there were no real effect, a result this large would appear less than 5% of the time"; it is not a 95% guarantee the effect is real. But:

  • With 20 concurrent tests, you'd expect 1 false positive by chance even if nothing works
  • Always pre-specify your primary metric before running the test
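
The simplest guard against the multiple-testing problem above is a Bonferroni correction: with k concurrent tests, compare each p-value to α/k instead of α. A minimal sketch (bonferroni_significant is an illustrative helper, not from any library):

```python
# With k concurrent tests at alpha = 0.05 each, the chance of at least one
# false positive is 1 - 0.95**k (about 64% at k = 20). Bonferroni compares
# each p-value against alpha / k to keep the overall rate near alpha.
def bonferroni_significant(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    k = len(p_values)
    return [p < alpha / k for p in p_values]

# Three concurrent tests: only the first survives the corrected threshold (~0.0167)
print(bonferroni_significant([0.001, 0.03, 0.04]))  # [True, False, False]
```

Bonferroni is conservative (it sacrifices power), which is another argument for pre-specifying a single primary metric instead.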

Statistical power (1-β): The probability the test detects an effect that truly exists. At 80% power, a test has a 20% chance of missing a real effect. Standard target: 80% power.
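
To make the power target concrete, invert the sample-size question: given n users per variant, what power does the test actually have? A sketch using the standard normal-approximation power formula for a two-proportion test (achieved_power is an illustrative helper; the 5% → 5.75% numbers match the worked example below):

```python
import math
from scipy import stats

# Normal-approximation power for a two-proportion z-test (illustrative sketch)
def achieved_power(n: int, p1: float, p2: float, alpha: float = 0.05) -> float:
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    return 1 - stats.norm.cdf(z_alpha - abs(p2 - p1) / se)

# Detecting a 5% -> 5.75% conversion change at two sample sizes:
print(f"{achieved_power(3_000, 0.05, 0.0575):.0%}")   # ~25%: most real effects missed
print(f"{achieved_power(14_200, 0.05, 0.0575):.0%}")  # ~80%: the standard target
```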

Minimum Detectable Effect (MDE): The smallest change worth detecting. If you can only detect effects ≥ 5%, a 2% improvement will be invisible. Set your MDE based on what would actually change a product decision.

Sample size calculation:

# scripts/sample_size.py
from scipy import stats
import math

def required_sample_size(
    baseline_rate: float,      # Current conversion rate (e.g., 0.05 for 5%)
    minimum_detectable_effect: float,  # Relative lift to detect (e.g., 0.10 for 10% relative)
    alpha: float = 0.05,       # Significance level (two-tailed)
    power: float = 0.80,       # Statistical power
) -> int:
    """
    Calculate required sample size per variant for a two-proportion z-test.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + minimum_detectable_effect)

    z_alpha = stats.norm.ppf(1 - alpha / 2)  # Two-tailed
    z_beta = stats.norm.ppf(power)

    p_pooled = (p1 + p2) / 2
    effect = abs(p2 - p1)

    n = (z_alpha * math.sqrt(2 * p_pooled * (1 - p_pooled)) +
         z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / effect ** 2

    return math.ceil(n)

# Example: 5% baseline checkout rate, want to detect 15% relative improvement
n = required_sample_size(
    baseline_rate=0.05,
    minimum_detectable_effect=0.15,  # Detect 5% → 5.75% conversion
)
print(f"Required per variant: {n:,}")  # ~14,200 per variant
print(f"Total (2 variants): {n * 2:,}")  # ~28,400 total users

# How long will the test run?
daily_users = 500
days_needed = (n * 2) / daily_users
print(f"Days needed: {days_needed:.1f}")  # ~57 days: small effects need big samples

Common Experiment Mistakes

| Mistake | What Happens | Fix |
|---|---|---|
| Peeking | Checking results daily and stopping when significant | Pre-specify sample size; stop only at predetermined date |
| Multiple metrics | One of 10 metrics is significant — probably by chance | Pre-specify one primary metric; others are exploratory |
| Too short | Week 1 novelty effect inflates results | Run ≥ 2 full business cycles (2 weeks minimum) |
| No holdback | All users in experiment; no baseline | Always have a control group |
| Post-hoc segmentation | "Men 25-34 showed significance" | Pre-specify segments; correct for multiple comparisons |
| Sample ratio mismatch | 53% control vs 47% treatment (should be 50/50) | Diagnose assignment bias before analyzing results |
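
Peeking deserves a demonstration. A Monte Carlo sketch with hypothetical traffic numbers: simulate A/A tests (no real effect in either arm) and compare testing once at the predetermined end against stopping at the first p < 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def z_test_p(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-tailed two-proportion z-test p-value."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - stats.norm.cdf(abs(z)))

def false_positive_rate(peek: bool, n_sims: int = 500,
                        batches: int = 20, batch_size: int = 200) -> float:
    """Fraction of A/A tests (true rate 5% in both arms) declared 'significant'."""
    total = batches * batch_size
    hits = 0
    for _ in range(n_sims):
        a = rng.binomial(1, 0.05, total)
        b = rng.binomial(1, 0.05, total)
        checkpoints = range(batch_size, total + 1, batch_size) if peek else [total]
        if any(z_test_p(a[:n].sum(), n, b[:n].sum(), n) < 0.05 for n in checkpoints):
            hits += 1
    return hits / n_sims

print(f"Test once at the end:  {false_positive_rate(peek=False):.0%}")  # ~5%
print(f"Stop at first p<0.05:  {false_positive_rate(peek=True):.0%}")   # well above 5%
```

With 20 looks per test, the effective false-positive rate typically lands several times above the nominal 5%, even though nothing works in either arm.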

Feature Flag-Based Experiment Infrastructure

// lib/experiments.ts
import { Redis } from 'ioredis';
import { createHash } from 'crypto';

const redis = new Redis(process.env.REDIS_URL!);

interface Experiment {
  id: string;
  variants: Array<{ id: string; weight: number }>;  // weights sum to 1
  targetingRules?: {
    newUsersOnly?: boolean;
    countries?: string[];
    minAccountAgeDays?: number;
  };
  startDate: Date;
  endDate?: Date;
}

// Deterministic variant assignment: same user always gets same variant
function assignVariant(userId: string, experiment: Experiment): string {
  // Hash userId + experimentId for stable, experiment-specific assignment
  const hash = createHash('sha256')
    .update(`${userId}:${experiment.id}`)
    .digest('hex');

  // Convert first 8 hex chars to 0-1 float
  const bucket = parseInt(hash.slice(0, 8), 16) / 0xFFFFFFFF;

  // Assign based on cumulative weights
  let cumulative = 0;
  for (const variant of experiment.variants) {
    cumulative += variant.weight;
    if (bucket <= cumulative) return variant.id;
  }

  return experiment.variants[experiment.variants.length - 1].id;
}

export async function getVariant(
  userId: string,
  experimentId: string,
): Promise<string | null> {
  const experimentData = await redis.get(`experiment:${experimentId}`);
  if (!experimentData) return null;

  const experiment: Experiment = JSON.parse(experimentData);

  // Check experiment is active (dates arrive as ISO strings after JSON.parse,
  // so parse them back into Dates before comparing)
  const now = new Date();
  if (now < new Date(experiment.startDate)) return null;
  if (experiment.endDate && now > new Date(experiment.endDate)) return null;

  const variant = assignVariant(userId, experiment);

  // Track assignment for analysis
  await redis.pfadd(`exp:${experimentId}:users:${variant}`, userId);

  return variant;
}

// React hook for experiment variants. This runs on the client, so in practice
// it lives in a separate module from the Redis code above and needs
// `import { useState, useEffect } from 'react'` plus your app's useAuth hook.
export function useExperiment(experimentId: string): {
  variant: string | null;
  isControl: boolean;
  isTreatment: (variantId: string) => boolean;
} {
  const { user } = useAuth();
  const [variant, setVariant] = useState<string | null>(null);

  useEffect(() => {
    if (!user?.id) return;
    fetch(`/api/experiments/${experimentId}/variant?userId=${user.id}`)
      .then(r => r.json())
      .then(data => setVariant(data.variant));
  }, [user?.id, experimentId]);

  return {
    variant,
    isControl: variant === 'control',
    isTreatment: (id: string) => variant === id,
  };
}

Using the experiment in a component:

// components/PricingPage.tsx
export function PricingPage() {
  const { isTreatment } = useExperiment('pricing-page-redesign');

  // Control: original pricing table
  // Treatment: new card-based layout with annual toggle prominent

  return isTreatment('card-layout')
    ? <PricingPageV2 />
    : <PricingPageV1 />;
}
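
For analysis-side sanity checks (verifying splits offline, debugging SRM), the deterministic bucketing in assignVariant is easy to reproduce outside the app. A Python sketch mirroring the TypeScript logic (the function name and variant-tuple format here are illustrative):

```python
import hashlib

# Mirrors the TypeScript assignVariant: sha256("userId:experimentId"),
# first 8 hex chars mapped to [0, 1], then cumulative-weight bucketing.
def assign_variant(user_id: str, experiment_id: str,
                   variants: list[tuple[str, float]]) -> str:
    digest = hashlib.sha256(f"{user_id}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    cumulative = 0.0
    for variant_id, weight in variants:
        cumulative += weight
        if bucket <= cumulative:
            return variant_id
    return variants[-1][0]  # Guard against floating-point edge cases

variants = [("control", 0.5), ("card-layout", 0.5)]
# Deterministic: the same user/experiment pair always maps to the same variant
assert assign_variant("user-123", "pricing-page-redesign", variants) == \
       assign_variant("user-123", "pricing-page-redesign", variants)
```

Running this over logged user IDs and comparing the expected split to the observed one is a cheap pre-check for sample ratio mismatch.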

Analyzing Results

# scripts/analyze_experiment.py
import numpy as np
from scipy import stats
from dataclasses import dataclass

@dataclass
class ExperimentResults:
    variant: str
    users: int
    conversions: int

    @property
    def rate(self) -> float:
        return self.conversions / self.users

def analyze_experiment(
    control: ExperimentResults,
    treatment: ExperimentResults,
    alpha: float = 0.05,
) -> dict:
    # Two-proportion z-test
    p1, n1 = control.rate, control.users
    p2, n2 = treatment.rate, treatment.users

    p_pooled = (control.conversions + treatment.conversions) / (n1 + n2)
    se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # Two-tailed

    # Relative lift
    lift = (p2 - p1) / p1

    # 95% confidence interval for the absolute difference (p2 - p1)
    se_diff = np.sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
    z_crit = stats.norm.ppf(1 - alpha/2)
    ci_lower = (p2 - p1) - z_crit * se_diff
    ci_upper = (p2 - p1) + z_crit * se_diff

    # Sample ratio mismatch check
    expected_ratio = 0.5  # Assuming 50/50 split
    actual_ratio = n1 / (n1 + n2)
    chi2_srm = ((actual_ratio - expected_ratio)**2 / expected_ratio) * 2 * (n1 + n2)
    p_srm = 1 - stats.chi2.cdf(chi2_srm, df=1)

    return {
        'control_rate': f'{p1:.3%}',
        'treatment_rate': f'{p2:.3%}',
        'relative_lift': f'{lift:+.1%}',
        'confidence_interval_95': f'[{ci_lower:+.3%}, {ci_upper:+.3%}]',
        'p_value': round(p_value, 4),
        'significant': p_value < alpha,
        'sample_ratio_mismatch': p_srm < 0.01,  # Warn if SRM detected
    }

# Example usage
control = ExperimentResults('control', users=3200, conversions=160)
treatment = ExperimentResults('card-layout', users=3150, conversions=181)

result = analyze_experiment(control, treatment)
print(result)
# {
#   'control_rate': '5.000%',
#   'treatment_rate': '5.746%',
#   'relative_lift': '+14.9%',
#   'confidence_interval_95': '[-0.363%, +1.855%]',
#   'p_value': 0.1873,
#   'significant': False,
#   'sample_ratio_mismatch': False
# }
# Note: a +14.9% observed lift, yet NOT significant. At ~3,200 users per
# variant the test is underpowered for an effect this size, and the
# confidence interval includes zero.

Sample Ratio Mismatch (SRM)

SRM occurs when your actual variant split differs significantly from the intended split. It invalidates results:

# SRM causes: cookie deletion, bot traffic on one variant,
# CDN caching serving one variant to all users in a region,
# bad assignment logic (not truly random)

# Always check before analyzing results (counts here match the example above):
from scipy import stats

control_users, treatment_users = 3200, 3150
expected_per_group = (control_users + treatment_users) / 2

chi2, p = stats.chisquare([control_users, treatment_users],
                          f_exp=[expected_per_group, expected_per_group])
if p < 0.01:
    print("⚠️ Sample Ratio Mismatch detected — results are invalid")
    print("Investigate assignment mechanism before analyzing outcomes")

Experiment Governance

Experiment Review Checklist (before launching)

Hypothesis

  • One primary metric defined (and pre-specified)
  • Minimum detectable effect specified
  • Sample size calculated and test duration set
  • Target segment defined (all users? new users? a specific plan?)

Technical

  • Assignment is deterministic (same user always gets same variant)
  • Assignment is independent of primary metric
  • No shared infrastructure that could cause spillover
  • Logging confirmed working for both variants

Risk

  • Rollback plan if treatment breaks something
  • Feature flag can be disabled without code deploy
  • Control group is a true holdback (not 0%)

---

Working With Viprasol

We build experimentation infrastructure — feature flag systems, A/B test assignment logic, analytics event pipelines for experiment tracking, and result analysis tooling. Experimentation infrastructure is a core growth investment for product-led teams.

→ [Talk to our team](/contact) about growth engineering and experimentation.

---

About the Author

Viprasol Tech Team

Custom Software Development Specialists

The Viprasol Tech team specialises in algorithmic trading software, AI agent systems, and SaaS development. With 100+ projects delivered across MT4/MT5 EAs, fintech platforms, and production AI systems, the team brings deep technical experience to every engagement. Based in India, serving clients globally.

MT4/MT5 EA Development · AI Agent Systems · SaaS Development · Algorithmic Trading
