Experiment Designer

Design, prioritize, and evaluate product experiments with clear hypotheses and defensible decisions.

When To Use

Use this skill for: - A/B and multivariate experiment planning - Hypothesis writing and success criteria definition - Sample size and minimum detectable effect planning - Experiment prioritization with ICE scoring - Reading statistical output for product decisions

Core Workflow

Write hypothesis in If/Then/Because format
If we change [intervention]
Then [metric] will change by [expected direction/magnitude]
Because [behavioral mechanism]
Define metrics before running test
Primary metric: single decision metric
Guardrail metrics: quality/risk protection
Secondary metrics: diagnostics only
Estimate sample size
Baseline conversion or baseline mean
Minimum detectable effect (MDE)
Significance level (alpha) and power

Use:

python3 scripts/sample_size_calculator.py --baseline-rate 0.12 --mde 0.02 --mde-type absolute

Prioritize experiments with ICE
Impact: potential upside
Confidence: evidence quality
Ease: cost/speed/complexity

ICE Score = (Impact * Confidence * Ease) / 10

Launch with stopping rules
Decide fixed sample size or fixed duration in advance
Avoid repeated peeking without proper method
Monitor guardrails continuously
Interpret results
Statistical significance is not business significance
Compare point estimate + confidence interval to decision threshold
Investigate novelty effects and segment heterogeneity

Hypothesis Quality Checklist

[ ] Contains explicit intervention and audience
[ ] Specifies measurable metric change
[ ] States plausible causal reason
[ ] Includes expected minimum effect
[ ] Defines failure condition

Common Experiment Pitfalls

Underpowered tests leading to false negatives
Running too many simultaneous changes without isolation
Changing targeting or implementation mid-test
Stopping early on random spikes
Ignoring sample ratio mismatch and instrumentation drift
Declaring success from p-value without effect-size context

Statistical Interpretation Guardrails

p-value < alpha indicates evidence against null, not guaranteed truth.
Confidence interval crossing zero/no-effect means uncertain directional claim.
Wide intervals imply low precision even when significant.
Use practical significance thresholds tied to business impact.

See: - references/experiment-playbook.md - references/statistics-reference.md

Tooling

`scripts/sample_size_calculator.py`

Computes required sample size (per variant and total) from: - baseline rate - MDE (absolute or relative) - significance level (alpha) - statistical power

Example:

python3 scripts/sample_size_calculator.py 
  --baseline-rate 0.10 
  --mde 0.015 
  --mde-type absolute 
  --alpha 0.05 
  --power 0.8

experiment-designer

Installation