How we conduct A/B tests offline
And how they differ from classic online experiments
The generation of new ideas is an integral part of developing any product. Of course, not every idea will increase conversion, grow the audience, or improve some other metric. So how do you test ideas and hypotheses quickly? There are many tools for this, but one of the most popular is A/B testing, which is what this article is about.

Together with M.Video-Eldorado, our team develops internal and partner products. Like ordinary B2C products such as websites and mobile applications, these services help the business better understand customers and meet their needs, but they do so by changing internal business processes and the way the company works with its suppliers.
For example, last year we launched a data service for planning the "ideal" store assortment, and now we are working on a pricing engine that will predict the optimal price for goods based on competitors' actions and a price elasticity model. These products shape the customer experience by redefining the very essence of the retailer's product offer.
Despite their specifics (a small user audience, a long R&D cycle, a high-load architecture), when working on these services we follow the classic product approach, part of which is regular hypothesis testing with A/B tests. Returning to the examples above: before changing the assortment or prices on the shelves of 1,300 retail stores, you need to be confident that the new product offer will, at a minimum, not drive away current customers, and ideally will attract new ones.
To run such experiments, we redesigned our approach to A/B testing to account for our offline specifics, and that is what we want to share in this article.
Part 1. From theory to practice

In theory, running an A/B test is not that hard. Suppose we want to evaluate the effect of some change online (on a website or in a mobile app) or offline (in a retail store). We split the set of objects under study into two groups, control ("A") and pilot ("B"), and decide which metric we want to measure the effect on: traffic, conversion, average check, and so on. Having rolled out the change to the isolated pilot group, we measure the difference against the control group and assess its statistical significance. If the effect in the pilot group exceeds the effect in the control group, open the champagne and celebrate.
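Before diving into the nuances, here is a minimal sketch of this basic comparison in Python: the numbers are made up, and Welch's two-sample t-test stands in for whatever significance check you prefer.

```python
# A minimal sketch of the basic A/B comparison (hypothetical numbers, not real data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Weekly revenue per store, in arbitrary units: 50 control stores and 50 pilot stores.
control = rng.normal(loc=100.0, scale=15.0, size=50)
pilot = rng.normal(loc=103.0, scale=15.0, size=50)   # pretend the change added ~3%

# Welch's t-test: does the observed difference exceed random fluctuations?
t_stat, p_value = stats.ttest_ind(pilot, control, equal_var=False)
lift = pilot.mean() / control.mean() - 1

print(f"observed lift: {lift:.1%}, p-value: {p_value:.3f}")
# A small p-value (e.g. below 0.05) suggests the difference is unlikely to be pure noise.
```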
However, in practice, there are several nuances that complicate the application of this seemingly simple concept, especially in retail with a large offline presence. Here are a few of them.
Store as an object of observation

Sometimes an innovation is so large-scale that its effect is visible to the naked eye: for example, a 20% discount and the resulting jump in sales. Even such an effect requires statistical verification, but most innovations are of an optimization nature, and their effects are measured in single-digit percentages. Here we run into the fact that the effect itself is a random variable whose fluctuations can be comparable to the gain we are trying to detect.
Suppose we have changed something in the way our stores operate, say, the planograms for displaying goods on the shop floor. To measure the effect, we took 50 stores each for the pilot and control groups and defined the effect as the difference in the total sales of the 50 stores in each group over one month. We calculated the difference and got 3%: the pilot group won. Then we decided to take six weeks instead of a month, calculated the difference again and got minus 1%: the pilot group lost. We could keep doing this indefinitely and keep getting different values. This happens because a store's revenue is a random variable that fluctuates from day to day: today the store sells one amount, tomorrow another, and so on.
Accordingly, to conduct an A/B test, we need to answer at least three questions:
  1. How to measure the effect?
  2. How long should the effect be measured for this measurement to be meaningful?
  3. How to speed up the measurement of this effect?
Looking ahead: the simple answer to the first question is to take the average of a set of measurements. However, one average is not enough to draw conclusions about the results of our test, which we will discuss further below. As for the duration of measurements, we could, for example, keep measuring the total monthly revenue of the stores, but then we would have only 12 observations per year and our A/B test would drag on indefinitely.
The comparison baseline

If we do not pay attention to this, we can mistakenly form the pilot and control groups so that they are very different from each other to begin with. Staying with the store example, we could put the most successful stores in terms of revenue into the pilot group and the least successful into the control group. We would see a difference, but would it be caused by our innovation? In other words, at the start of testing we must be able to show that the pilot and control samples did not differ significantly before the A/B test. For this, just as with online services, we run A/A tests, whose design is also discussed below.
Comparability of objects is not the only issue in forming the target and control samples. Take, for example, an 85-inch TV that costs more than half a million rubles, whose sales we want to evaluate in two groups of stores. Suppose we put this TV on the shelves of stores in small towns where the average salary does not exceed 40 thousand rubles per month. In that case the TV will not sell in either the pilot or the control group, and we will not measure any effect. So for the pilot group we need to be able to choose objects where it is reasonable to expect an effect from our innovation at all.

Business context

Let's imagine that we want to test the assortment planning recommendation service we mentioned at the very beginning of the article. For the purposes of the A/B test, we fill the shelves of the pilot stores according to the service's recommendations, leave the shelves in the control group unchanged, and expect the pilot group to sell better in terms of revenue.
So we have formed the target and control groups, set up the assortment in the corresponding application, and are waiting for the effects. However, the stores are not empty: they already contain goods. In normal operations it takes time for a new assortment to reach the shelf, and different assortment categories have different, sometimes very long, turnover periods. The problem is not only the long wait, but also that some of the new assortment will reach the shelf earlier, some later (due to natural turnover), and some may never reach the stores at all because of limited stock. It turns out that we have not yet proven any effect, yet in a large number of stores our "pilot" assortment may already be partially present. Because it is only partially present, we still cannot measure the effect. And because it is a "pilot" assortment, it may well be less effective than the one that would have been on the shelf instead. As a result, we get management dissatisfaction over the potentially negative impact of the pilot on the company's revenue.
Therefore, when preparing for any A/B test (and a retail test in particular), it is important to also synchronize all related processes, physical and virtual, that can potentially affect its result and duration.
Now about everything in more detail.
Part 2. Solutions: statistical and otherwise

Let's go back to the first and most important question: how do we measure the difference between the control and pilot groups? A statistically significant result can only be obtained from a series of observations, so let's first figure out where to get them. To do this, we apply several manipulations, statistical and otherwise.
First, a purely conceptual approach: increase the granularity of measurements. How exactly depends on what we measure. Consider our example with changes to the store assortment. If we take only the difference in the total monthly sales of all stores, we get 12 observations per year, which is not much. Instead, let's compare the stores individually and weekly: the number of measurements becomes the number of stores in our sample multiplied by the number of weeks. Better. Next, suppose that sales of different product categories are independent of each other, and the number of measurements becomes the number of stores times the number of categories times the number of weeks. Perfect.
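As a back-of-the-envelope illustration of how granularity multiplies the number of observations (the store, category and week counts below are made up for the example):

```python
# Illustrative counts only: how granularity multiplies the number of observations.
n_stores = 50        # stores in each group
n_categories = 20    # assumed number of independent product categories
n_weeks = 8          # planned observation window

monthly_group_totals = 12                                  # group totals per month: 12 per year
store_week = n_stores * n_weeks                            # per-store, per-week differences
store_category_week = n_stores * n_categories * n_weeks    # add the category dimension

print(monthly_group_totals, store_week, store_category_week)  # 12, 400, 8000
```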
Now let's apply a statistical manipulation: form many subsamples from our sample and compare them with each other, artificially increasing the number of measurements once again. This method of multiplying samples is called bootstrapping. We will also need bootstrapping later to build confidence intervals.
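A minimal sketch of this resampling, on synthetic pilot-minus-control differences:

```python
# A sketch of the bootstrap used to multiply observations and, later, to build
# confidence intervals (synthetic data; the resampling logic is the point).
import numpy as np

rng = np.random.default_rng(0)
weekly_diffs = rng.normal(loc=0.5, scale=2.0, size=400)  # pilot-minus-control differences

n_bootstrap = 10_000
boot_means = np.empty(n_bootstrap)
for i in range(n_bootstrap):
    # Resample the observed differences with replacement and keep the mean of each resample.
    resample = rng.choice(weekly_diffs, size=weekly_diffs.size, replace=True)
    boot_means[i] = resample.mean()

# boot_means is now an empirical distribution of the mean difference.
print(boot_means.mean(), boot_means.std())
```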

In the context of our manipulations, two new questions arise:
  1. Is it possible to increase the granularity indefinitely in order to shorten the test duration?
  2. How to estimate how many measurements are necessary?
Measurement granularity, sample size and test duration

So what happens if we go down not to the category level but to the product level? That is, instead of comparing sales of the TV category in different stores, we go inside the category and compare sales of specific models or, say, groups of models by screen diagonal. In this case there is a risk that our measurements will be dependent. You need to be absolutely confident (read: run a separate statistical test for each product group) that sales of TVs of different models and diagonals are independent of each other. If they are dependent, each new measurement adds no valuable information to our test. In addition, because we are comparing assortments, some products may simply be missing from the control or target group. So going deeper into the product hierarchy should be done with caution.
What if we simply go from measuring at the level of a week down to the level of a day, an hour or a minute? There is no universal answer here; you have to check. In some cases increasing the granularity adds information, in others it only adds noise. After all, our sales are a random variable that fluctuates from day to day and from hour to hour even when overall performance is more or less stable. By adding noise we get no closer to our goal of testing the hypothesis in the shortest possible time.
As for the sample size, we turn to historical data to determine it. We take the historical sales of the stores we have chosen, at the granularity we consider reasonable, and calculate how long an observation period we need to collect enough measurements for our hypothesis to reach statistical significance. Is the period too long? Change the granularity and try again. We repeat this until we find a granularity that gives us an acceptable test duration.
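A rough sketch of such a duration estimate, assuming synthetic historical data and an injected 5% effect (store counts and sales figures are illustrative, not our real data):

```python
# A rough sketch of estimating test duration from historical data with an injected effect.
# All names and numbers here are illustrative assumptions, not production code.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_stores, max_weeks = 50, 26

# Historical weekly sales per store (rows: stores, columns: weeks), in arbitrary units.
control_hist = rng.normal(100.0, 30.0, size=(n_stores, max_weeks))
pilot_hist = rng.normal(100.0 * 1.05, 30.0, size=(n_stores, max_weeks))  # inject a 5% effect

for weeks in range(2, max_weeks + 1):
    # Accumulate store-week measurements over the first `weeks` weeks and test for a difference.
    t_stat, p_value = stats.ttest_ind(pilot_hist[:, :weeks].ravel(),
                                      control_hist[:, :weeks].ravel(),
                                      equal_var=False)
    if p_value < 0.05:
        print(f"about {weeks} weeks of data detect the injected 5% effect at this granularity")
        break

# A more careful estimate would repeat this over many simulated draws and take, say, the
# duration that detects the effect in 80% of them.
```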
Statistical significance itself is assessed using bootstrapping combined with Student's t-test, which we discuss in more detail below.
Let's move on.
Selection of the control group

For the pilot and control samples we need to select similar objects, in our case similar stores. There is obviously no point in comparing objects that are different to begin with: because of differences in the structure and level of demand, a regional store and a store in the center of Moscow can trade very differently. Stores can also differ in floor area, assortment and many other factors that make a comparison meaningless. So even before starting the test, we need to show that the stores chosen for it do not differ from each other, i.e. run an A/A test.
The design of such tests in our case differs from similar online experiments. To solve this problem, we use two approaches: vector distance and Student's t-test, and we go with whichever gives the smaller error.
In the first approach, we represent each sample object (in our case, a store) as a point in a multidimensional space, i.e. as a vector. Stores are easy to vectorize: just take the daily sales of each store, for example over 60 days, so that each store becomes a vector in a 60-dimensional space. Then, by the rules of vector algebra, we can compute the distance between vectors and, for each store, find the store closest to it. That store becomes its pair in the control sample. The business meaning of this vector mathematics is that for each store we find the store whose daily sales are as close to it as possible.
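A minimal sketch of this matching on synthetic data (store counts and sales figures are assumptions for the example):

```python
# A sketch of the vector-distance matching: each store is represented by its daily sales
# over 60 days, and for every pilot store we look for the closest candidate control store.
import numpy as np

rng = np.random.default_rng(2)
pilot_sales = rng.normal(100.0, 20.0, size=(10, 60))        # 10 pilot stores x 60 days
candidate_sales = rng.normal(100.0, 20.0, size=(200, 60))   # 200 candidate control stores

for i, store_vector in enumerate(pilot_sales):
    # Euclidean distance from this pilot store to every candidate control store.
    distances = np.linalg.norm(candidate_sales - store_vector, axis=1)
    best = int(np.argmin(distances))
    print(f"pilot store {i} -> control candidate {best} (distance {distances[best]:.1f})")
    # In practice one would also make sure the same candidate is not paired with two pilot stores.
```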
In the second approach, for each pair of stores we test, before starting the experiment, the hypothesis that they do not differ, in other words, that their sales are equal up to random deviations. To do this, we again use bootstrapping to collect the maximum number of observations from daily sales, and then use Student's t-test to check the hypothesis that the stores do not differ. If, for example, the p-value is close to zero, we reject the null hypothesis that the stores are equal: the A/A test is not passed, and we keep looking for a pair.
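A simplified sketch of this pairwise check on synthetic daily sales; for brevity it runs the t-test on the raw 60-day series and omits the bootstrap enlargement described above:

```python
# A simplified sketch of the pairwise A/A check: compare the daily sales of two candidate
# stores with Student's t-test (synthetic data; the bootstrap step is omitted for brevity).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
store_a = rng.normal(100.0, 20.0, size=60)   # 60 days of sales, pilot candidate
store_b = rng.normal(100.0, 20.0, size=60)   # 60 days of sales, control candidate

t_stat, p_value = stats.ttest_ind(store_a, store_b, equal_var=False)
if p_value < 0.05:
    print(f"p-value {p_value:.3f}: the stores differ, keep looking for a pair")
else:
    print(f"p-value {p_value:.3f}: no significant difference, the pair passes the A/A check")
```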
In both cases we can estimate the error, which in one case comes from the distance and in the other from applying Student's t-test. We take as the pair the store with the smaller error.

Effect evaluation

Hooray, we have enough measurements! Now we could simply take the average of all the pairwise differences in store sales between the pilot and control samples at the chosen granularity. However, this average is itself a random variable, and we need to show that the difference between the pilot and control groups is not due to its random fluctuations. To do this, the difference in means must be normalized by the range of those fluctuations, in other words, by its variance, which in turn is also a random quantity.
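For reference, the statistic behind this normalization is the textbook (Welch's) t-statistic, not a formula specific to our setup:

$$
t = \frac{\bar{x}_{\text{pilot}} - \bar{x}_{\text{control}}}{\sqrt{\dfrac{s_{\text{pilot}}^2}{n_{\text{pilot}}} + \dfrac{s_{\text{control}}^2}{n_{\text{control}}}}}
$$

where $\bar{x}$ are the group means, $s^2$ the sample variances and $n$ the number of measurements in each group: the larger the difference in means relative to the noise, the larger $t$ and the smaller the p-value.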
To answer the question of whether there is an effect, we use the hypothesis testing machinery of mathematical statistics. Our hypotheses are that after our changes the pilot and control groups differ by some percentage, for example 1, 5 or 10. We take several percentages because the smaller the effect, the more measurements are required to confirm it. Each time we ask a series of consecutive questions: is there an effect of 10%, yes or no? If not, is there an effect of 5%? And so on.
To confirm or refute these hypotheses, we use the Student's t-test mentioned above. It is based on the difference in means between the pilot and control samples, normalized by the variance of the values and the sample size. In other words, the t-test tells us whether the means differ, given a fixed allowable error of accepting a wrong hypothesis. That is, if we agree that the probability of accepting an incorrect hypothesis does not exceed 5 percent (the value usually taken), the criterion tells us whether the means coincide or differ, which is exactly what we need to check.
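One way to pose the "is there an effect of at least X%?" question with the t-test is to compare the pilot measurements against the control measurements scaled up by X%. The sketch below uses this illustrative formulation with synthetic data; it is not a verbatim copy of our production criterion.

```python
# Illustrative only: sequentially test hypotheses about an effect of at least 10%, 5%, 1%
# by comparing the pilot against the control scaled up by the hypothesized effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
control = rng.normal(100.0, 20.0, size=2_000)  # measurements at the chosen granularity
pilot = rng.normal(104.0, 20.0, size=2_000)    # true effect of about 4%

for effect in (0.10, 0.05, 0.01):
    # One-sided test: is the pilot mean greater than the control mean scaled by `effect`?
    t_stat, p_value = stats.ttest_ind(pilot, control * (1 + effect),
                                      equal_var=False, alternative="greater")
    verdict = "confirmed" if p_value < 0.05 else "not confirmed"
    print(f"effect of at least {effect:.0%}: {verdict} (p = {p_value:.3f})")
```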
To sum up, the whole process looks like this. First, we form the target and control groups. Next, we select the granularity of measurements. We take historical data and inject the expected effect on the metric into it (for example, 5%). We generate measurements from the historical data at the selected granularity and run them through Student's t-test. We look at how long it takes to confirm the hypothesis of a 5% effect (which is certainly there, since we injected it ourselves) on historical data. If the resulting duration does not suit us, we adjust the granularity and recalculate. We do this until we get a test duration that works for us. After that, we start the test on real data and wait until one of our hypotheses is confirmed.
In the case of testing the new assortment of the stores, it took us three to four weeks to confirm effects of 3-5% on the revenue of several categories.
For example, the table below, using the Refrigerators category as an example, shows how the number of observations we need (in the cells) depends on the size of the effect we want to detect (rows) and the estimation error we are willing to allow (columns).
We also use the confidence interval method to verify our effect estimate. Remember the bootstrapping we talked about above? We compute the difference in means between the target and control groups many times over bootstrap subsamples, each difference giving one point, and obtain a whole set of estimates of the difference. Among all these estimates we take the 2.5th and 97.5th percentiles: this is the confidence interval that contains our effect. If the range of effect values matches our expectations, fine; if not, we continue the test as described above. If the confidence interval crosses zero, the effect may be absent altogether and the test must definitely be continued. In our case, however, this should not happen, since we have already checked for the presence of the effect with Student's t-test.
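A minimal sketch of this interval on synthetic pairwise differences:

```python
# A sketch of the bootstrap confidence interval for the effect (synthetic measurements).
import numpy as np

rng = np.random.default_rng(5)
# Pairwise pilot-minus-control differences at the chosen granularity, e.g. store x week.
diffs = rng.normal(loc=3.0, scale=20.0, size=2_000)

boot_means = np.array([
    rng.choice(diffs, size=diffs.size, replace=True).mean()
    for _ in range(10_000)
])

lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean difference: [{lower:.2f}, {upper:.2f}]")
# If the interval crosses zero, the effect may be absent and the test should continue.
```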
Influence of related processes

Let's move on to the third point we mentioned earlier: processes. They are often forgotten. Remember why we started doing A/B tests in the first place: to measure the effect of changes. But if the changes themselves do not happen as planned, or for some reason we cannot collect the measurements, the test will fail.
What should you pay attention to when preparing for the test?
First, the planned changes actually need to happen. If we change something on a website and want to measure the change in conversion, it is quite simple: we randomly split the traffic into two streams and send part of it to the old version and part to the new one. There are no problems, or only minimal ones, in implementing the planned change. Physical stores are different. To change the assortment in stores, you first have to purchase the goods, then deliver them to the stores, then put them on the shelf, provided there is space on that shelf. You then have to make sure the "pilot" assortment does not run out, precisely because it is a pilot and is not replenished through regular purchasing.
In a retail pilot, changes usually happen on top of the company's regular processes. Running such an A/B test becomes a serious management exercise that requires high involvement not only from the product team but also from the business: the commercial team, logistics specialists, merchandisers, store employees and other specialists. The more flexible your organization, the easier it will be to run an A/B test, but in any case this task should be treated as a serious project.
Second, it is important to monitor the progress of the A/B test. For this we built a set of dashboards that let us track progress in near real time and respond to deviations. For the assortment problem, it was important for us to track two things: first, the behavior of key metrics (revenue, margin and the number of checks), to make sure we were not "sinking" the economics of the pilot stores; second, how fully the stores were stocked against our pilot assortment matrix. Examples of the dashboards are shown below.

Example 1. Revenue and margin. Note that we started measuring the metrics 3 weeks before the start of the test, since it was important for us to understand how the pilot and control stores behaved before the changes were made.
Example 2. Matrix fullness. The graph shows that over several weeks the stores "sank" in terms of how fully the matrix was represented on their shelves. However, this drawdown occurred in both the control (Matrix KM) and the pilot (Matrix Opt.) groups, and was therefore caused by a general product availability problem rather than by the A/B test itself.
Third, it is important to manage the expectations of the teams.

When we first started building an A/B testing culture in our company, one of the departments asked us to calculate the cost-benefit of running a test in order to decide whether it was worth doing at all. In other words, before we had measured anything, we were already being asked to estimate the effect. You have to understand that running a pilot in retail is an investment: the stores have to be supplied with a different offer that, by design, should be better than the current one but may, in practice, turn out to be worse. And that is exactly why the test is needed.
Another part of managing expectations is to assume reasonable effects up front. If your stakeholders expect the implemented change to move the metric by 30% and you find only 3%, the results of your tests are unlikely to please everyone.
Finally, earlier in this article we discussed how to determine the duration of the test. It should be clearly defined and agreed with the main stakeholders of your product.

Conclusions

1. Products that directly or indirectly affect the customer's offline experience differ in many ways from the usual online services, but they too require hypothesis testing using A/B tests.
2. The objects of observation in offline experiments are processes built on physical infrastructure (a store, a bank branch, any other point of sale), which calls for an alternative approach to A/B test design: the choice of metrics, the selection of pilot and control groups, the duration and granularity of measurements.
3. The success of a retail A/B test depends not only on the quality of the feature being tested and the efforts of the product team, but also on how this feature will be implemented in the objects of observation.


If all of this sounds familiar, offline A/B testing may be relevant for your product. To avoid stepping on the same rake we once did, and to prepare properly, we suggest the following steps:
  1. Determine what business metric your new feature is impacting.
  2. Form a pilot group and discuss its composition and relevance with business experts.
  3. Select a control group and perform an A/A test using vector distances and Student's t-test.
  4. Choose the sample granularity and estimate the duration of the test by injecting the effect into historical data and applying statistical hypothesis testing criteria.
  5. Coordinate the duration of the test and its expected results with all stakeholders.
  6. Assemble a multidisciplinary team of experts from key business units to manage the test.
  7. Choose a suitable time to start the test (for example, not during the high season) and the start time of measurements (for example, so that the goods have time to reach the stores).
  8. Set up monitoring of the test progress and target metrics using dashboards.
  9. Communicate regularly with the team about the progress of the test, so that everyone understands why changes whose effect has not yet been confirmed are affecting the core business.
  10. Compare the result of the pilot group with the control and draw a conclusion about the statistical significance of the changes made.
This text was prepared by the Data Studio and M.Video-Eldorado teams, which jointly implemented the practices of A/B testing and working with data in M.Video-Eldorado projects.
