This is a work in progress, and not quite ready for primetime
What is A/B Testing?
A/B testing is about taking an unlimited number of confounding variables and narrowing them down to two:
the change you made (A vs B)
"luck" (of the draw)
Common issues in using A/B tests:
Not understanding everything you're testing
Not fully understanding your entire audience
Not considering the impact of luck
Overestimating the scope of the impact
Not understanding the rules of the game
Great Online Tools
Great, full-featured calculators that are generally easy to use for data exploration.
Search Discovery: Sample Size Calculator
Booking.com: Power Calculator
CXL: AB Test Calculators
AB Testguide: Frequentist A/B-test Results Calculator
AB Testguide: Bayesian A/B-test Calculator
Blast Analytics: RPV Stat-Sig Calculator
These require a little more understanding, but are great for specific use cases:
Thumbtack Abba A/B testing statistics: Great Confidence Interval graphs
Matt Mazur's AB Test Calculator: Great rate distribution graphs
Evan Miller's Chi-Squared Test: The granddaddy of website optimization calculators
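Under the hood, most of the sample-size calculators linked above implement the standard two-proportion power formula. Here is a minimal sketch in Python using only the standard library (the function name and defaults are illustrative, not any particular tool's API):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a change in
    conversion rate from p1 to p2 with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_bar = (p1 + p2) / 2
    n = ((z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
          + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
         / (p1 - p2) ** 2)
    return ceil(n)

# Detecting a lift from a 5% to a 6% conversion rate needs roughly
# 8,000+ visitors in each bucket:
print(sample_size_per_arm(0.05, 0.06))
```

Plugging the same baseline and target rates into any of the calculators above should produce a figure in the same ballpark; small differences come from rounding and the exact approximation each tool uses.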
How A/B testing works
"A controlled experiment is a research study in which participants are randomly assigned to experimental and control groups.
A controlled experiment allows researchers to determine cause and effect between variables.
One drawback of controlled experiments is that they lack external validity (which means their results may not generalize to real-world settings)."
ThoughtCo, What are Controlled Experiments?
"A/B testing (also known as split testing or bucket testing) is a method of comparing two versions of a webpage or app against each other to determine which one performs better. AB testing is essentially an experiment where two or more variants of a page are shown to users at random, and statistical analysis is used to determine which variation performs better for a given conversion goal.
"Running an AB test that directly compares a variation against a current experience lets you ask focused questions about changes to your website or app, and then collect data about the impact of that change."
Optimizely, Optimization Glossary
When you run a non-controlled experiment like a pre-post analysis, there is an almost infinite number of confounding variables that you don't control. For example:
Your visitors buy more on the weekends
Your SEO reputation recently changed because of a Google algorithm update
Your email server IP address was recently put on a blacklist
A competitor has started to aggressively target your most valuable customers
A/B testing is about taking these confounding variables and narrowing them down to two:
the change you want to make: the A and the B, control versus treatment, blue versus red
random variation: sometimes referred to as luck, and perhaps more accurately as the luck of the draw
By randomly assigning your target audience into two (or more) buckets, you are able to separate the change you are making from all the variables you don't control. In addition, the various statistical frameworks provide methods to help you understand how likely it is that the impact you measure is due to your change, to luck, or to both.
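The random-assignment step is often implemented as deterministic hash-based bucketing, so a returning visitor always sees the same variant. A minimal sketch (the function name and experiment key are illustrative):

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, n_buckets: int = 2) -> int:
    """Deterministically assign a visitor to a bucket by hashing the
    experiment name and user id together. The same visitor always lands
    in the same bucket for a given experiment, but assignments are
    effectively random across visitors."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

# Stable across sessions: the same id always maps to the same bucket.
bucket = assign_bucket("visitor-42", "blue-vs-red-button")
print("control" if bucket == 0 else "treatment")
```

Hashing on both the experiment name and the user id means a visitor's bucket in one test doesn't predict their bucket in the next, which keeps experiments independent of each other.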
A/B testing does not have perfect predictive power even in the best of circumstances, and predictive accuracy decreases as mistakes are made with the A/B testing toolkit. Here are some common issues in using A/B test results:
1. Not understanding everything you're testing
TBD: what your change eg tech error and page load 2
2. Not fully understanding your entire audience
TBD: 80/20 (plats at Marriott), robots in digital
3. Not considering the impact of luck
TBD: and accounting for it
4. Overestimating the scope of the impact
TBD: eg it worked there so it will work here, really a 50% lift, wishful thinking, are you measuring the right thing
5. Not understanding the rules of the game
and accidentally cheating, eg erroneous data collection, including the garden of forking paths
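Issue 3 above, the impact of luck, is exactly what a significance test quantifies. A minimal two-proportion z-test sketch using only Python's standard library (the conversion numbers are invented for illustration):

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test on conversion counts: returns the z statistic
    and the probability of seeing a gap at least this large if A and B
    actually convert at the same rate (i.e., pure luck of the draw)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Convert |z| to a two-sided p-value via the normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Made-up numbers: 500/10,000 vs 560/10,000 looks like a 12% lift,
# but luck alone would produce a gap this large fairly often:
z, p = two_proportion_z_test(500, 10_000, 560, 10_000)
print(f"z = {z:.2f}, p-value = {p:.3f}")
```

With these invented numbers the p-value lands a little above 0.05, a good reminder that a lift that looks impressive on a dashboard can still be within the reach of random variation.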