Statistical problems and other obstacles in A/B testing

Testing the functionality of various elements using A/B tests is now common practice for most website developers and operators. If sufficient traffic is available, this test procedure quickly reveals whether scenario A is more successful than scenario B. However, many obstacles can arise during the planning phase, the test phase, and the final evaluation. Here are the most common statistical errors and how you can avoid them:

The biggest mistakes in A/B test planning

Even before the test has started, you may already have set yourself up for failure if your set-up is based on untested assumptions.

Error 1: forgoing a hypothesis and playing it by ear

Probably the worst mistake you can make in the preparation stage is to forgo a hypothesis and simply hope that one of the variants you’re testing turns out to be the right one. Testing more randomly chosen variants does raise the chance of finding a winner, but it also raises the chance that this winner is a false positive that does nothing to improve the web project. At the usual 5% significance level, a single variant will show a significant improvement in 5% of cases even though no real improvement has taken place. The more variants you test, the more likely such an alpha error (a false positive) becomes: the chance of at least one is about 14% with 3 variants and 34% with 8. Without a hypothesis, you also won’t know which change is responsible for the winner’s result. If, on the other hand, you start from the hypothesis that enlarging a button will lead to an increase in conversions, you can classify the subsequent result.
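The 5%, 14% and 34% figures above can be reproduced with a few lines of Python. This is a minimal sketch that assumes every variant is compared against the original independently at the same significance level; the function name family_wise_error_rate is chosen here for illustration only.

```python
def family_wise_error_rate(num_variants: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent comparisons."""
    if num_variants < 1:
        raise ValueError("num_variants must be at least 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # Each comparison avoids a false positive with probability (1 - alpha);
    # all of them avoid one together with probability (1 - alpha) ** num_variants.
    return 1.0 - (1.0 - alpha) ** num_variants


for k in (1, 3, 8):
    print(f"{k} variant(s): {family_wise_error_rate(k):.0%}")
# prints 5%, 14% and 34% for 1, 3 and 8 variants
```

If the comparisons are not independent, the real rate will differ, but the trend stays the same: every additional variant makes a chance winner more likely.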

In summary, A/B testing should never be left to chance: always work from a hypothesis and with a limited number of variants. If you also work with tools such as Optimizely, which prevent the error rate from increasing, nothing will stand in the way of successful testing.

Error 2: determining the incorrect indicators for a test variant’s success

Key performance indicators (KPIs) that are crucial to your project also play an important role in A/B testing and shouldn’t be neglected. For blogs or news portals, more page views and clicks already count as valuable conversions, but for online shops these factors are no more than a positive trend. Indicators such as orders, returns, sales, or profits are significantly more important for stores. Because these are difficult to measure, A/B tests that use absolute profit as their main KPI take a lot of effort. In return, they predict success much more reliably than tests that only record whether a product has been placed in the shopping cart, because the customer might never actually buy that product.

It is therefore important to find the appropriate values. However, you shouldn’t choose too many different ones. Limit yourself to the essential factors and keep the predefined hypothesis in mind. This reduces the risk of mistaking a coincidental increase for a lasting one.

Error 3: categorically eliminating multivariate testing

In some cases when preparing A/B tests, you might want to test several elements in the variants. This isn’t really feasible with a simple A/B test, which is why multivariate testing is used as an alternative. This concept is often rejected since multivariate tests are considered too complex and inaccurate even though they could be the optimal solution to the aforementioned problem if used correctly. With the right tools, the various test pages are not only quickly changed, but they are also easy to analyse. With a little practice, you can work out the difference that an individually modified component makes, but your web project first needs to have enough traffic.

The chance of declaring the wrong winner increases with the number of test variants used, so it’s advisable to limit yourself to a pre-selection when using this method. To be certain that a potentially better version actually surpasses the original, you can validate the result afterwards with an A/B test. Even then, the probability of an alpha error in that follow-up test is still 5%.

Statistical problems during the test process

If the test is online and all relevant data has been recorded as desired, it would be fair to believe nothing else stands in the way of successful A/B testing. Impatience and misjudgments often mean this isn’t the case, so make sure you avoid these typical errors.

Error 4: stopping the test process prematurely

Being able to read detailed statistics while the test is running is very useful, but it often leads to premature conclusions and, in extreme cases, to tests being terminated too soon. In principle, every test requires a minimum test size, since the results usually vary greatly at the beginning. In addition, the longer the test phase lasts, the more valid the result, because random fluctuations become recognisable as such and carry less weight. If you stop the test too early, you run the risk of getting a completely wrong picture of how the variant is performing and then classifying it as far better or worse than it really is.

Since it’s not so easy to determine the optimal test time, there are various tools such as the A/B test duration calculator from VWO, which you can use to help you with the calculation. There are, of course, very good reasons for ending a test prematurely, for example, when a variant is performing badly and could jeopardise your economic interests.
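To give a feeling for what a minimum test size looks like, here is a rough sketch based on the textbook sample-size formula for comparing two conversion rates with a two-sided z-test. It is not the method used by the VWO calculator or any other specific tool; the function name, the default significance level of 5% and the default power of 80% are assumptions made for this illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_sample_size_per_variant(baseline_rate, relative_uplift,
                                alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect the given uplift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_uplift)
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must lie in (0, 1) and must differ")
    # Critical values of the standard normal distribution.
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    spread = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
              + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil(spread ** 2 / (p2 - p1) ** 2)


# Hypothetical shop: 5% baseline conversion rate.
for uplift in (0.20, 0.10):
    n = min_sample_size_per_variant(0.05, uplift)
    print(f"relative uplift {uplift:.0%}: about {n} visitors per variant")
```

The smaller the improvement you want to detect, the more visitors each variant needs, which is why results read off after only a few days are rarely trustworthy.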

Error 5: using modern test processes in order to shorten the test length

It is no secret that various A/B testing tools work with methods designed to keep the error rate as low as possible among the variants used. The Bayesian method, which is used by Optimizely and Visual Website Optimizer, promises test results even if the minimum test size hasn’t yet been reached. If you base your evaluation on results from such an early stage, however, you could run into statistical problems. On the one hand, this method relies on your own prior estimates of a variant’s success; on the other hand, it cannot recognise early, still unstable values for what they are.
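How strongly early results depend on your own estimates can be shown with a generic Beta-Binomial model. This sketch is not the model that Optimizely or Visual Website Optimizer actually implement; the function name, the two priors and the visitor numbers are hypothetical and chosen purely for illustration.

```python
def posterior_mean(conversions, visitors, prior_alpha, prior_beta):
    """Posterior mean conversion rate under a Beta(prior_alpha, prior_beta) prior."""
    if not 0 <= conversions <= visitors:
        raise ValueError("conversions must be between 0 and visitors")
    return (prior_alpha + conversions) / (prior_alpha + prior_beta + visitors)


# Two hypothetical prior estimates of the variant's conversion rate.
priors = {
    "optimistic (about 10%)": (10, 90),
    "sceptical (about 2%)": (2, 98),
}

# The same variant observed early (20 visitors) and late (2000 visitors).
for label, (conv, n) in {"early": (3, 20), "late": (300, 2000)}.items():
    for name, (a, b) in priors.items():
        estimate = posterior_mean(conv, n, a, b)
        print(f"{label}, {name} prior: {estimate:.1%}")
```

With only 20 visitors, the two priors lead to clearly different estimates; after 2000 visitors they almost agree. Early results therefore say as much about your assumptions as about the variant.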

Common errors when analysing A/B test results

Finding suitable KPIs, formulating hypotheses, and ultimately organising and carrying out the A/B test is challenging. However, the real challenge awaits you when it comes to analysing the collected values and using them to make your web project more successful. Even professionals make mistakes at this stage, but you should at least steer clear of the easily avoidable ones, such as these:

Error 6: only relying on the results of the testing tool

The testing tool doesn’t just help you to start the test and visualise the data collected; it also provides detailed information about whether the variant has made an improvement and how much it would affect the conversion rate. In addition, it declares one variant the winner. However, these tools cannot measure KPIs such as absolute sales or returns, so you have to incorporate the corresponding external data. If the results don’t meet your expectations, it might be worth taking a look at the separate results of your web analysis program, which usually provides a much more detailed overview of users’ behaviour.

Inspecting individual data points is the only way to identify rogue values (outliers) and filter them out of the overall result. The following example illustrates why this can be decisive for avoiding a wrong conclusion: the tool has shown that variant A is the optimal version since it achieved the best results. However, closer examination reveals that this is down to a single purchase by a user who happens to be a B2B customer. If you remove this purchase from the statistics, variant B suddenly shows the best result.

The same example can be applied to the shopping cart, the order rate, or various other KPIs. In each of these cases, you will notice that extreme values can strongly influence the average value and that false conclusions can quickly arise from this.
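The B2B example can be replayed with a handful of made-up orders. All order values and customer labels below are hypothetical, and the helper average_order_value exists only for this sketch.

```python
from statistics import mean

# Hypothetical orders per variant as (order value, customer type).
orders_a = [(40, "b2c"), (55, "b2c"), (38, "b2c"), (60, "b2c"), (4800, "b2b")]
orders_b = [(70, "b2c"), (65, "b2c"), (80, "b2c"), (75, "b2c"), (72, "b2c")]


def average_order_value(orders, exclude_types=()):
    """Mean order value, optionally ignoring certain customer types."""
    values = [value for value, kind in orders if kind not in exclude_types]
    if not values:
        raise ValueError("no orders left after filtering")
    return mean(values)


print("All orders:  A =", average_order_value(orders_a),
      " B =", average_order_value(orders_b))
print("Without B2B: A =", average_order_value(orders_a, {"b2b"}),
      " B =", average_order_value(orders_b, {"b2b"}))
```

With all orders included, variant A looks far ahead; once the single B2B purchase is excluded, variant B has the higher average.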

Error 7: segmenting the results too much

Examining the A/B testing data in detail and combining it with external data sources opens up a lot more options. It’s particularly common to assign results to individually defined user groups. This is how you can find out how users of a particular age group, a particular region, or a particular browser have responded to a given variant. The problem is that the more segments you compare, the higher the chance of error.

For this reason, you should make sure that the chosen groups are highly relevant to your test concept and make up a representative part of the overall users. For example, if you’re only examining males under 30 years old who access your site via tablet and only visit on weekends, you’re looking at a sample that doesn’t represent the entire audience. If you plan in advance to segment the results of an A/B test, you should also set a correspondingly longer test period.
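How much longer the test period has to be follows from simple arithmetic: a segment only receives its share of the traffic, so it takes proportionally longer to reach the minimum test size. The sketch below assumes traffic is split evenly between the variants; the function name and all numbers are hypothetical.

```python
from math import ceil


def days_needed(required_per_variant, daily_visitors,
                segment_share=1.0, num_variants=2):
    """Days until each variant has collected enough visitors from the segment."""
    if not 0 < segment_share <= 1:
        raise ValueError("segment_share must be in (0, 1]")
    if daily_visitors <= 0 or num_variants < 1:
        raise ValueError("daily_visitors and num_variants must be positive")
    # Visitors of this segment that each variant receives per day.
    per_variant_per_day = daily_visitors * segment_share / num_variants
    return ceil(required_per_variant / per_variant_per_day)


# Hypothetical site: 4000 visitors per day, 8000 visitors needed per variant.
print("Whole audience:", days_needed(8000, 4000), "days")
print("Segment with 2% of traffic:", days_needed(8000, 4000, segment_share=0.02), "days")
```

A narrow segment that accounts for only a few percent of visitors can stretch a test from days to months, which is a strong argument for keeping the number of segments small.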

Error 8: questioning the success due to vague calculations

To illustrate how switching to a new variant will affect the future conversion rate, A/B test results are often used as the basis for concrete calculations. This may be effective for presentation purposes, but such forecasts aren’t really reliable because of the many different influences involved. A/B test results only provide information about short-term changes in user behaviour; long-term effects, such as the impact on customer satisfaction, cannot be measured within the short test period, so assuming that a measured increase will persist is premature. In addition, influences such as seasonal fluctuations, supply shortages, changes in the product range, changes in the customer base, or technical problems can’t be accounted for in A/B testing.

It’s important to keep a cool head regarding statistical problems and wrong assumptions when carrying out and analysing a website’s usability test. Drawing conclusions too early could leave you disappointed with the subsequent live results even though the optimised version of your project actually works quite well. Only when you combine a cautious forecast with a clean, well-thought-out working method for the analysis will you be able to evaluate and interpret the A/B test results properly.