If you’re going to run a race, you should see it through all the way to the finish line.
Many marketers fail to apply that principle when it comes to split testing. They think their test is complete before they’ve let it run the course.
They pull out too soon. They end too early. They quit before it’s done.
What happens when tests are finished prematurely?
The numbers reported are nothing but statistical salad.
With no dressing.
Simply put: results from incomplete tests are as unreliable as next year’s weather forecast. They’re as dubious as the email you just received from a Nigerian prince.
Let your tests run to completion, and you’ll be rewarded with accurate, actionable statistics.
That’s the point I want to make, in one swell article introduction.
You can stop reading now, and keep watching cat videos. As long as you get that one message: don’t end your split tests too soon.
But to get even more value from this article, let me share some more information with you.
1. Bad News: Your gut is wrong.
Sometimes digital marketers come down with a bad case of C.B.
They run tests to validate their own preconceived notions about the way things should work. Once they get the result they’re looking for, they see no reason to continue the test.
Their bias is confirmed!
Yay. How cool is that.
That’s no way to validate a hypothesis.
Confirmation bias is a real thing, and it’s screwing you over.
Sure, there’s an argument to be made about why you should listen to your gut.
But there’s also an argument, a better one, to be made about why you should admit that your gut was wrong if the numbers say so.
The whole reason you’re split testing in the first place is to avoid gut-based and erroneous decisions. You want real data and hard evidence, not some mysterious message from the meatball sub that you had for lunch.
Numbers don’t have guts. They’re not subjective.
Numbers report cold, hard, objective reality.
They don’t even eat meatball subs.
Sometimes the reality of the data will be at odds with what your gut told you. And that’s a good thing.
But you won’t know that if you don’t let your tests run to completion. Instead, you’ll likely come down with a good, old-fashioned case of C.B. once the numbers tell you what you wanted to hear.
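If you'd like to see what quitting the moment you get the answer you wanted does to your numbers, here's a rough simulation. Everything in it is made up for illustration: it runs a few hundred fake "A/A" tests where both options convert at exactly the same 3 percent, checks a simple one-sided z-test every 100 visitors, and counts how often a "winner" shows up at some peek versus only at the planned end. The visitor counts, the peek interval, and the seed are all assumptions, not numbers from any real test.

```python
import random
from math import sqrt
from statistics import NormalDist

def one_sided_p(conv_a, conv_b, n):
    """One-sided p-value that B beats A, with n visitors per option."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0, 1):          # no conversions yet (or all): nothing to test
        return 1.0
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    z = (conv_b / n - conv_a / n) / std_err
    return 1 - NormalDist().cdf(z)

def simulate(experiments=300, visitors=2000, rate=0.03,
             peek_every=100, alpha=0.05, seed=1):
    """Return (share of tests 'won' at any peek, share 'won' at the planned end)."""
    rng = random.Random(seed)
    won_at_a_peek = won_at_end = 0
    for _ in range(experiments):
        a = b = 0
        peeked_winner = False
        for i in range(1, visitors + 1):
            a += rng.random() < rate   # A and B share the same true rate
            b += rng.random() < rate
            if i % peek_every == 0 and one_sided_p(a, b, i) < alpha:
                peeked_winner = True   # a peeker would have stopped here
        won_at_a_peek += peeked_winner
        won_at_end += one_sided_p(a, b, visitors) < alpha
    return won_at_a_peek / experiments, won_at_end / experiments

peek_rate, end_rate = simulate()
print(f"False winners if you stop at the first good peek: {peek_rate:.1%}")
print(f"False winners if you wait for the planned end:    {end_rate:.1%}")
```

Since there is no real difference between the two options, every "winner" here is a false alarm. The peeking count can never be lower than the wait-until-the-end count, because the last peek is the end of the test.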
2. How long is long enough?
Once you’re convinced you should let your tests run long enough to give you accurate stats, you’re probably wondering: how long is long enough?
That’s a killer question, and if there were an exact standard number, then I wouldn’t have had to write this article.
How long is long enough, you ask?
Answer: It depends.
Yes, that answer sucks. It also has the virtue of being right.
For starters, when you think about how long a test should run, you’re probably expecting the answer to be delivered in terms of time.
Yes and no.
Sure, you’ll need to make sure that your test runs an adequate number of days, but that length of time will be determined by another metric: the sample size.
Before you can determine how long you need to run your test, you’ll first need to determine how many visitors will give you the right sample size.
Otherwise, you’re likely to get statistical noise in your results.
But here we go with another question. Now you want to know how big of a sample size you need.
Again, it depends.
You’re getting tired of that answer, aren’t you?
It’s still correct.
The right sample size for your website depends on three things:
- Your existing conversion rate
- The change you’d like to see in your conversion rate
- The level of confidence you want to have that your test will be accurate.
Let’s look at an example to help clarify things, shall we?
Say you have a 3 percent conversion rate now and you’re shooting for a 4.5 percent conversion rate. That means you’re looking to boost your conversion rate by 50 percent.
That 50 percent number is part of what determines your sample.
Next, you need to learn about p.
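What’s p? It’s the p-value: the probability that you’d see a difference at least this big if your change actually made no difference at all. It’s the number used to calculate confidence level.
If you want to say, “I’m 95 percent confident in these results,” then p needs to be .05 or lower (1 – .05 = .95, or 95%).
As you can see, the lower the p-value, the more confident you can be about the test results.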
Now, you don’t have to plug a bunch of complicated formulas into a spreadsheet to determine sample sizes. There are plenty of tools online.
Head over to Optimizely and use the free tool there to determine your sample size.
Set the conversion rate to 3 percent, the minimum detectable effect to 50 percent, and the statistical significance to 95 percent.
The cool thing about that tool is you don’t have to press any buttons when you make changes. The calculation is reported on the fly.
As you can see in this case, you need 1,800 visits per variation to get a proper read on whether your change had any effect.
Please note the “per variation” part of that sentence. That means if you’re doing A/B testing, then both the “A” and “B” options must have 1,800 visits each.
So, in total, you need 3,600 visits, equally divided between the two options.
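If you're curious what's going on under the hood, here's a minimal sketch of a textbook sample-size formula for comparing two conversion rates. It takes the same three inputs: baseline rate, the lift you want to detect, and confidence level. Fair warning: this is not the calculation the online tool runs. It also needs a fourth input, statistical power, which I've assumed to be 80 percent, and the function name is my own. So expect a different (here, larger) number than the 1,800 above. Treat both as ballpark figures.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variation(baseline, relative_lift, confidence=0.95, power=0.80):
    """Textbook fixed-sample estimate for a two-sided test of two proportions.

    power=0.80 is an assumption; the article's tool does not ask for it.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)           # e.g. 3% lifted by 50% -> 4.5%
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n = visitors_per_variation(0.03, 0.50)
print(n, "visits per variation,", 2 * n, "in total")
```

Try halving the lift to 25 percent and watch the required sample balloon. Smaller changes take far more traffic to detect.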
There’s another way to give you a great read on sample size.
You can work backwards.
To do that, use the number of visits you’ve seen for both the “A” and “B” options, and the number of conversions for each to calculate the confidence level.
Once again, there’s an online tool to help you out. Head over to VWO and plug in the numbers consistent with what we’ve been using.
Plug in 1,800 for number of visitors under both “Control” and “Variation.” For number of conversions, use 54 (3 percent) for “Control” and 81 (4.5 percent) for “Variation.” Then, click on “Calculate Significance.”
You should see that the p-value gets calculated at .009. That’s about a 99 percent confidence level (1 – .009 = .991).
Also, there’s a Yes/No box at the bottom that answers the “Significant?” question. If the answer is “Yes,” that means the difference is statistically significant at that confidence level.
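You can sanity-check that number yourself. The sketch below runs a one-sided z-test for two proportions on the same inputs. I'm assuming that's roughly what the calculator does, but I can't vouch for its exact method, and the function name is mine.

```python
from math import sqrt
from statistics import NormalDist

def one_sided_p_value(visitors_a, conversions_a, visitors_b, conversions_b):
    """p-value for 'B converts better than A' using a pooled z-test."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    return 1 - NormalDist().cdf(z)

p = one_sided_p_value(1800, 54, 1800, 81)
print(f"p-value: {p:.3f}  confidence: {1 - p:.1%}")
```

With 54 and 81 conversions out of 1,800 visits each, this lands at about .009, in line with the figure above.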
Now that you have a sample size to shoot for, it’s time to take a look at the length of time you need to run your tests.
Please understand that “stopping a test too soon” is a simplistic way of viewing the problem. There’s actually an attendant medley of issues that compound it.
Get one thing wrong, and you screw it all up. The only way to get your test properly validated is to have all your data correct.
3. Get the right sample size and time.
You might be thinking: “I’ve followed your advice and got my sample size with a high confidence level. Now all I need to do is run a test long enough to cover that sample size, amirite?”
Peep Laja of ConversionXL, who really is one of the top guys in this field, says when he first started split testing his most common mistake was ending a test too soon.
And this is important: he says he ended tests too soon even when he had a 95 percent confidence level.
At this point, you might be thinking: “Well, it’s not much of a confidence level if you can end a test at that point and still have inaccurate results.”
Correct. That’s because numbers are dumb.
I said earlier that numbers are objective and don’t lie. That’s also true.
But the statistical calculations we’ve been looking at don’t take into account variations in business cycles, days of the week, peak traffic times, seasons when conversions are more likely, etc.
Bottom line: your confidence level isn’t your bottom line.
Confidence level alone can’t validate a test’s, umm, validity.
That’s why you need a test that runs long enough to cover variations in your sales cycle.
As Laja notes, there’s no penalty for having a sample size that’s too big. There’s just a penalty for having one that’s too small.
This is a good time for me to put in a plug for “always be testing.”
Yes, I think you should always be testing. Yes, I think you should test longer durations rather than shorter ones. Even though relentless and constant testing is a virtue, you shouldn’t let this maxim make you rush the process.
In other words, yes, conduct split test after split test. But don’t shortchange all your efforts by pulling a test too early, or not giving yourself time to analyze test results, or rushing through the hypothesis phase. Every part of the test is important.
So, to circle back to the subject at hand, if you’re going to make a mistake, err on the side of having an unusually large sample size rather than a small one.
Keep in mind, though, if your test runs so long that it includes external forces that could affect the outcome (holidays, seasonal factors, weather, etc.), you run the risk of sample pollution and that could skew your tests as well.
Laja also offers this sage advice: your test should run the length of at least one, and preferably two, business cycles.
Or, as he puts it: “the sample would include all weekdays, weekends, various sources of traffic, your blog publishing schedules, newsletters, phases of the moon, weather and everything else that might influence the outcome.”
So, to continue with the numbers from above, if you get 3,600 visits in one business cycle (or, better yet, two business cycles), then the numbers from the tools might work just fine.
On the other hand, if you get 3,600 visits over the period of just a couple of days, then you should lengthen the time of your test to include a couple of business cycles.
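Here's one way to turn that into a quick planning calculation: work out how many days your traffic needs to hit the sample size, then round up to whole business cycles, with a floor of two. The 7-day cycle and the daily traffic figures are assumptions for illustration, and the function name is my own. Swap in whatever your own business cycle looks like.

```python
from math import ceil

def planned_duration_days(total_visits_needed, visits_per_day,
                          cycle_days=7, min_cycles=2):
    """Days to run: enough for the sample, rounded up to whole business cycles."""
    days_for_sample = ceil(total_visits_needed / visits_per_day)
    cycles = max(min_cycles, ceil(days_for_sample / cycle_days))
    return cycles * cycle_days

# High-traffic site: the sample fills in 2 days, but run two full cycles anyway.
print(planned_duration_days(3600, visits_per_day=1800))  # 14

# Low-traffic site: the sample size is what sets the duration.
print(planned_duration_days(3600, visits_per_day=100))   # 42
```

Rounding up to whole cycles keeps every day of the week equally represented, so a test doesn't end up with two Mondays and one Saturday.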
You can learn from Laja’s mistake or you can repeat it. The choice is yours.
4. Make sure the significance curve flattens out.
Even after you’ve followed all the other rules, you still might have to apply one more rule before you finish your testing.
If you decide to become a gung-ho statistician and use a more sophisticated tool than the two mentioned above, you’ll likely see that your conversion rate stats are delivered with a margin of error.
Here’s a quick rule of thumb for your margin of error. Take the square root of your sample size (for a sample of 900, that’s 30). Then divide 1 by that number (1 / 30 = 0.0333). And, don’t forget, the result (3/100ths, or 3 percent) represents a +/- range.
For example, with a bigger sample, you might see that the conversion rate for option A can vary by .8 percent and the conversion rate for option B can vary by .9 percent.
If your conversion rate for A is 3 percent, then the full conversion range with the .8 percent variation is from 2.2 percent (3% – .8%) to 3.8 percent (3% + .8%). That is, you should expect a conversion rate anywhere from 2.2 to 3.8 percent. If your conversion rate for B is 4 percent, then the full conversion range with the .9 percent variation is from 3.1 percent (4% – .9%) to 4.9 percent (4% + .9%). Notice that those two ranges overlap, so B’s lead could still be noise.
If you stick with a sample size of 900, you are accepting a conversion range that can vary by 3 percent, plus or minus. Or, you can keep testing.
You will always have some margin of error, but the lower the margin of error, the higher the confidence you can have in your results. (Michael Aasgaard says in his tutorial to shoot for a margin of error of +/- 1 percent.)
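The rule of thumb is easy to play with in a few lines of code. The sketch below computes the margin of error as 1 divided by the square root of the sample size, then flips it around to ask how big a sample a target margin needs. Keep in mind this is a rough approximation: it's close to the worst-case 95 percent margin, so a real stats tool will usually report something tighter. The function names are mine.

```python
from math import ceil, sqrt

def rule_of_thumb_margin(sample_size):
    """Approximate +/- margin of error: 1 divided by the square root of the sample."""
    return 1 / sqrt(sample_size)

def sample_size_for_margin(margin):
    """Invert the rule of thumb: visits needed for a target +/- margin."""
    # round() guards against floating-point fuzz before rounding up
    return ceil(round((1 / margin) ** 2, 6))

print(f"900 visits -> +/- {rule_of_thumb_margin(900):.1%}")
print(f"+/- 1% needs about {sample_size_for_margin(0.01):,} visits")
```

By this yardstick, getting from a 3 percent margin down to 1 percent takes roughly 10,000 visits instead of 900. That's why patience pays.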
Patience is a virtue, especially when it comes to split testing.
Make sure that you let your tests run long enough so that you have the right sample size, you’re covering a complete business cycle, and your margin of error is small enough.
This stuff is detailed and a bit anal, yes, but it matters.
If you want to ensure top-notch performance in your conversion efforts, you have to get every detail right.
Thankfully, this is a relatively easy one to define. Go long with your test duration, and you should be safe.