A/B Testing: How It Works and How to Run Experiments

Learn how A/B testing works, how SignaKit handles bucketing and statistical analysis, and how to run your first controlled experiment.

What is A/B Testing?

A/B testing (also called split testing) is a method of comparing two or more versions of something — a feature, a UI design, a copy variant — by showing different versions to different users and measuring which performs better. Rather than making decisions based on intuition, A/B testing gives you statistical evidence: variation A converted at 3.2%, variation B converted at 4.1%, and the difference is statistically significant.

The core principle is control vs. treatment. The control group sees the existing experience; the treatment group sees the new one. By splitting users randomly and measuring their behavior, you can attribute differences in outcomes to the change itself rather than to external factors.

Key concepts to understand before running your first test: statistical significance (how confident you are the result is real, not random chance), sample size (how many users you need before the result is trustworthy), and peeking bias (why reading results before you have collected enough data leads to false conclusions). Each of these is covered in the sections below.

In SignaKit, every A/B test is a flag with multiple variations and a fixed traffic split. You define the split, call decide(), and SignaKit handles bucketing, exposure tracking, and results.


What makes up an A/B test

A test has three parts:

  • Variations — named experiences the user can receive. A standard A/B test has control (the existing behavior) and treatment (the new behavior). A multivariate test adds more: control, v1, v2, and so on.
  • Traffic split — the percentage of users routed to each variation. Percentages must total 100. A classic 50/50 split sends half your users to each side. A 33/33/34 split covers three variations.
  • Metrics — the conversion events you measure to determine a winner.

You configure the flag and traffic split in the dashboard. The SDK handles the rest.
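
As a mental model, the configuration the SDK works from boils down to a structure like the sketch below. The field names here are illustrative assumptions, not SignaKit's actual schema; you never write this object yourself.

// Illustrative shape of an A/B test flag. Field names are assumptions
// for the sake of the example; the real definition lives in the dashboard.
const checkoutRedesign = {
  flagKey: 'checkout-redesign',
  enabled: true,
  variations: [
    { key: 'control', percent: 50 },   // existing checkout
    { key: 'treatment', percent: 50 }, // redesigned checkout
  ], // percentages must total 100
}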


Running an experiment in code

Call decide() and branch on variationKey

decide() returns a decision object. For an A/B test, check decision.variationKey to determine which experience to render. The flag must be enabled for a decision to be returned — guard against null.

server.ts
const userCtx = client.createUserContext(userId, attributes)
const decision = userCtx.decide('checkout-redesign')

if (!decision) {
  // flag is off or user is not in the experiment
  renderDefault()
  return
}

if (decision.variationKey === 'treatment') {
  renderNewCheckout()
} else {
  // control — existing experience
  renderCurrentCheckout()
}
CheckoutPage.tsx
import { useFlag } from '@signakit/flags-react'

function CheckoutPage() {
  const decision = useFlag('checkout-redesign')

  if (decision?.variationKey === 'treatment') {
    return <NewCheckout />
  }

  return <CurrentCheckout />
}
checkout.py
user = client.create_user_context(user_id, attributes)
decision = user.decide("checkout-redesign")

if decision is None:
    render_default()
elif decision.variation_key == "treatment":
    render_new_checkout()
else:
    render_current_checkout()
checkout.go
userCtx := client.CreateUserContext(userID, attributes)
decision := userCtx.Decide("checkout-redesign")

if decision == nil {
    renderDefault(w, r)
    return
}

if decision.VariationKey == "treatment" {
    renderNewCheckout(w, r)
} else {
    renderCurrentCheckout(w, r)
}

Track conversions with trackEvent()

Exposure is recorded automatically when decide() is called — no extra call needed. To measure conversion, call trackEvent() when the user completes the action you care about.

server.ts
// After a purchase completes:
userCtx.trackEvent('purchase_completed', 49.99)

// After a button click (no value needed):
userCtx.trackEvent('signup_clicked')

Pass an optional numeric value when you want to measure revenue or quantity (sum, average) in addition to conversion rate. Omit it for binary conversion tracking.

trackEvent() is fire-and-forget. You do not need to await it or handle its return value in your request path.

Multivariate: branch on more than two variations

For tests with three or more variations, switch on all keys:

server.ts
const decision = userCtx.decide('pricing-page')

switch (decision?.variationKey) {
  case 'v1':
    return renderSimplifiedPricing()
  case 'v2':
    return renderValuePropsFirst()
  default:
    // control or no decision
    return renderCurrentPricing()
}

How Does SignaKit Assign Users to A/B Test Variations?

SignaKit uses deterministic bucketing. When decide() is called for a given userId and flagKey, it hashes the two together to produce a stable bucket number. That number maps to a variation based on the traffic split.

The result: the same user always receives the same variation for the life of the experiment, regardless of which server handles the request, which SDK version is running, or how many times decide() is called. This prevents flicker and ensures a user's experience is consistent across sessions.
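
To make the mechanism concrete, here is a minimal sketch of deterministic bucketing. The hash algorithm, bucket count, and function names are assumptions for illustration; SignaKit's internals may differ.

import { createHash } from 'node:crypto'

// Hash the flag key and user ID together into a stable bucket number.
// MD5 and the 10,000-bucket space are illustrative choices.
function bucketOf(userId: string, flagKey: string): number {
  const digest = createHash('md5').update(`${flagKey}:${userId}`).digest()
  return digest.readUInt32BE(0) % 10_000 // stable bucket in [0, 10000)
}

// Map the bucket to a variation according to the traffic split.
function assignVariation(
  userId: string,
  flagKey: string,
  split: Array<{ key: string; percent: number }>,
): string {
  const bucket = bucketOf(userId, flagKey)
  let upperBound = 0
  for (const variation of split) {
    upperBound += variation.percent * 100 // each percent covers 100 buckets
    if (bucket < upperBound) return variation.key
  }
  return split[split.length - 1].key // guard against rounding gaps
}

// The same inputs always yield the same variation:
assignVariation('user-42', 'checkout-redesign', [
  { key: 'control', percent: 50 },
  { key: 'treatment', percent: 50 },
])

Because the assignment depends only on the user ID and flag key, no storage or coordination between servers is needed.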

Do not change the traffic split of a running experiment. Shifting percentages mid-test re-buckets some users into a different variation, contaminating your results. If you need to pause a test, disable the flag entirely.

Bucketing happens locally in the SDK using the flag configuration fetched from SignaKit's CDN. There is no network call on decide().


Reading results in the dashboard

Navigate to your flag and open the Results tab. For each variation you will see:

  • Exposures — unique users who received this variation
  • Conversions — users who fired the tracked event after exposure
  • Conversion rate — conversions ÷ exposures
  • Lift — relative improvement over control (e.g. +12.4%)
  • p-value — the probability of seeing a lift at least this large if there were no real difference
  • Confidence interval — the range within which the true lift likely falls

SignaKit uses a two-proportion z-test. When the p-value drops below 0.05 (95% confidence) and remains there, the dashboard surfaces a plain-English verdict: "treatment is the winner" or "no significant difference detected."
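
For intuition, the arithmetic behind that verdict looks roughly like the sketch below: a standard two-proportion z-test, not SignaKit's exact implementation.

// Two-proportion z-test: a sketch of the standard statistic, not
// SignaKit's exact code.
function twoProportionZTest(
  controlConversions: number, controlExposures: number,
  treatmentConversions: number, treatmentExposures: number,
): { lift: number; z: number; pValue: number } {
  const p1 = controlConversions / controlExposures
  const p2 = treatmentConversions / treatmentExposures
  const pooled =
    (controlConversions + treatmentConversions) /
    (controlExposures + treatmentExposures)
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / controlExposures + 1 / treatmentExposures),
  )
  const z = (p2 - p1) / standardError
  return {
    lift: (p2 - p1) / p1,                     // relative improvement
    z,
    pValue: 2 * (1 - normalCdf(Math.abs(z))), // two-sided
  }
}

// Standard normal CDF via the Abramowitz-Stegun erf approximation.
function normalCdf(x: number): number {
  const z = Math.abs(x) / Math.SQRT2
  const t = 1 / (1 + 0.3275911 * z)
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t -
      0.284496736) * t + 0.254829592) * t
  const erf = 1 - poly * Math.exp(-z * z)
  return x >= 0 ? (1 + erf) / 2 : (1 - erf) / 2
}

// 3.2% vs 4.1% conversion on 10,000 exposures each:
twoProportionZTest(320, 10_000, 410, 10_000)
// { lift: ~+0.28, z: ~3.39, pValue: ~0.0007 } → significant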

When Can You Trust Your A/B Test Results?

Do not read results early and stop on the first promising p-value. A p-value below 0.05 at day two can easily reverse by day seven — this is called peeking bias. Guidelines:

  • Run the test long enough to collect your minimum sample size. Use a sample size calculator before launching; inputs are your baseline conversion rate, minimum detectable effect (MDE), and desired confidence level.
  • Run for at least one full week to capture day-of-week variation in user behavior.
  • Decide the stopping criteria before the test starts — not after you see the numbers.

A good rule of thumb: if you cannot afford to collect at least 1,000 exposures per variation, the results will be too noisy to act on. Consider using a multi-armed bandit instead, which adapts traffic allocation in real time without requiring a fixed sample size.
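
If you want to estimate that minimum sample size in code rather than in a calculator UI, the standard two-proportion approximation looks like this (a sketch; this function is not part of the SignaKit SDK):

// Per-variation sample size for a two-proportion test. A standard
// approximation, not a SignaKit API.
function sampleSizePerVariation(
  baselineRate: number,  // e.g. 0.05 for a 5% baseline conversion rate
  relativeMde: number,   // e.g. 0.20 to detect a 20% relative lift
  zAlpha = 1.96,         // 95% confidence, two-sided
  zBeta = 0.84,          // 80% power
): number {
  const p1 = baselineRate
  const p2 = baselineRate * (1 + relativeMde)
  const variance = p1 * (1 - p1) + p2 * (1 - p2)
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (p2 - p1) ** 2)
}

sampleSizePerVariation(0.05, 0.2) // 8146: detecting 5% → 6% needs roughly 8,000 users per variation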


Ending a test

Once a winner is confirmed:

Ship the winner

In the dashboard, set the winning variation's traffic allocation to 100% (and all others to 0%). Every user now receives the winner. Do this before cleaning up code so there is no gap in coverage.

Remove the flag from code

Replace every decide('your-flag-key') branch with the winning code path. Delete the losing variation's code.

// Before — flag in place
const decision = userCtx.decide('checkout-redesign')
if (decision?.variationKey === 'treatment') {
  renderNewCheckout()
} else {
  renderCurrentCheckout()
}

// After — winner shipped, flag removed
renderNewCheckout()

Archive the flag

Archive the flag in the dashboard to keep your flag list clean. Archived flags remain visible in historical reports but cannot be evaluated.


A/B testing vs. multi-armed bandit

A/B tests use a fixed traffic split for the duration of the experiment. This is appropriate when you want statistically rigorous results with a controlled sample. If you prefer to automatically shift traffic toward the better-performing variation as the experiment runs, use a multi-armed bandit instead.


Frequently Asked Questions

Can I change the traffic split of a running experiment?

No — and you should not try. Changing a traffic split mid-experiment re-buckets some users into a different variation, contaminating your results. A user who saw control for the first three days and then gets switched to treatment produces measurement noise. If you need to stop a test early, disable the flag entirely rather than adjusting the split.

How many users do I need for a statistically valid A/B test?

It depends on your baseline conversion rate and the minimum detectable effect (MDE) you want to measure. As a rough rule of thumb: if your baseline is 5% and you want to detect a 20% relative improvement (5% → 6%), you need roughly 8,000 users per variation at 95% confidence and 80% power. Use a sample size calculator before launching — if you cannot collect enough users in a reasonable timeframe, consider a multi-armed bandit instead, which adapts without requiring a fixed sample size.

What is the difference between an exposure and a conversion event?

An exposure is recorded automatically when decide() is called — it represents a user encountering the experiment. A conversion event is something you track explicitly with trackEvent() when a user completes the action you are measuring (a purchase, a signup, a click). SignaKit uses the ratio of conversions to exposures per variation to calculate conversion rates and lift.

What is peeking bias, and why does it matter?

Peeking bias occurs when you read results early and stop the experiment as soon as you see a promising p-value. A p-value below 0.05 at day two can easily reverse by day seven due to natural variation in user behavior. The fix: decide your stopping criteria before the experiment starts, commit to a minimum sample size and a minimum run duration, and do not act on results until both are met.

