Effect evaluation
Hooray, we have enough measurements! Now we could simply take the average of the pairwise differences in store sales between the pilot and control samples at the chosen granularity. However, this average is a random variable, and we need to show that the difference between the pilot and control groups is not due to its random fluctuations. To do this, the difference in means must be normalized by the range of those fluctuations, in other words, by its variance. But the resulting normalized quantity is itself also a random value.
To answer the question of whether there is an effect, we use the hypothesis-testing machinery of mathematical statistics. Our hypotheses are that after our changes the pilot and control groups differ by some percentage, say 1%, 5%, or 10%. We try several percentages because the smaller the effect, the more measurements are required to confirm its presence. Each time we ask a series of consecutive questions: is there a 10% effect, yes or no? If not, is there a 5% effect, and so on.
To confirm or refute these hypotheses, we use the Student's t-test mentioned above. It is based on the difference in means between the pilot and control samples, normalized by the variance of the values and the sample size. In other words, once we fix an acceptable error rate for accepting a wrong hypothesis (5% is the usual choice), the t-test tells us whether the means coincide or differ. This is exactly what we need to check.
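The idea above can be sketched in a few lines of Python. This is a minimal illustration with made-up numbers, not our production code: two simulated samples of store revenue, a simulated +5% effect in the pilot group, and a t-test at the conventional 5% error level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical weekly revenue measurements (arbitrary units) for illustration.
control = rng.normal(loc=100.0, scale=10.0, size=200)
pilot = rng.normal(loc=105.0, scale=10.0, size=200)  # simulated +5% effect

# Welch's t-test: the difference in means, normalized by variance and sample size.
t_stat, p_value = stats.ttest_ind(pilot, control, equal_var=False)

alpha = 0.05  # the conventional 5% error level mentioned in the text
if p_value < alpha:
    print(f"Effect detected: t = {t_stat:.2f}, p = {p_value:.4f}")
else:
    print(f"No significant effect: t = {t_stat:.2f}, p = {p_value:.4f}")
```

Here `equal_var=False` selects Welch's variant, which does not assume equal variances in the two groups; with a genuine 5% difference and 200 observations per group, the test comfortably rejects the hypothesis of equal means.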
In summary, the entire testing process looks like this. First, we form the target and control groups. Next, we select the granularity of the measurements. We take historical data and inject the expected effect on the metric into it (for example, 5%). We compute the measurements on the historical data at the selected granularity and run them through the Student's t-test. We then look at how long it takes to confirm the hypothesis of a 5% effect (which is definitely there, since we injected it ourselves). If that duration does not suit us, we adjust the granularity and recompute the test, repeating until we reach an acceptable test duration. After that, we start testing on real data and wait until one of our hypotheses is confirmed.
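The replay-on-history step can be sketched as follows. All numbers here are invented for illustration: simulated daily sales for matched store pairs, an injected +5% effect, and a day-by-day replay that reports when the injected effect first becomes statistically detectable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical historical daily sales for matched pilot/control store pairs.
n_days = 120
control_hist = rng.normal(loc=100.0, scale=10.0, size=n_days)
pilot_hist = control_hist + rng.normal(0.0, 3.0, size=n_days)  # matched stores track closely

pilot_hist = pilot_hist * 1.05  # inject the expected +5% effect into the pilot history

alpha = 0.05
days_needed = None
# Replay the history day by day; record when the injected effect is first detected.
for day in range(10, n_days + 1):
    diffs = pilot_hist[:day] - control_hist[:day]  # pairwise differences, as in the text
    _, p = stats.ttest_1samp(diffs, popmean=0.0)
    if p < alpha:
        days_needed = day
        break

print(f"Injected 5% effect first detected after {days_needed} days of data")
```

If `days_needed` comes out longer than the business is willing to wait, the granularity (or the grouping of stores) is adjusted and the replay is rerun, exactly as described above.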
So, when testing the new assortment content of stores, it took us three to four weeks to confirm effects of 3-5% on the revenue of several categories.
For example, the table below, using the Refrigerators category as an example, shows how the number of observations we need (in the cells) depends on the size of the effect we want to detect (rows) and the estimation error we are willing to allow (columns).
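The shape of such a table can be reproduced with a standard power-analysis approximation. This sketch is not the original table: the relative standard deviation of 20% and the 80% power level are assumed values chosen for illustration, and the normal-approximation formula n = 2(z₁₋α/₂ + z_power)² · (σ/Δ)² is a textbook estimate, not our exact procedure.

```python
from scipy import stats

def obs_needed(effect_frac, rel_sd, alpha=0.05, power=0.8):
    """Approximate observations per group for a two-sample t-test
    (normal approximation): n = 2 * (z_{1-a/2} + z_power)^2 * (sd/effect)^2."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    d = effect_frac / rel_sd  # standardized effect size
    return int(round(2 * (z_alpha + z_power) ** 2 / d ** 2))

# Smaller effects and stricter error levels demand many more observations.
for effect in (0.10, 0.05, 0.03):
    for alpha in (0.10, 0.05, 0.01):
        n = obs_needed(effect, rel_sd=0.20, alpha=alpha)
        print(f"effect={effect:.0%}  alpha={alpha}:  n \u2248 {n} per group")
```

The output shows the pattern the table captures: halving the target effect roughly quadruples the required number of observations.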
We also use the confidence interval method to verify our effect estimate. Remember the bootstrapping we talked about above? We compute the difference in means between the target and control groups many times over resampled data and plot each difference as a point. This gives us a set of points (values), in other words, a set of estimates of the difference. Among all these estimates, we take the 2.5% and 97.5% percentiles; that is the confidence interval containing our effect. If the range of effect values matches our expectations, fine; if not, we continue the test as described above. If the confidence interval crosses zero, there may be no effect at all, and the test must certainly be continued. In our case, however, this should not happen, since we previously confirmed the presence of the effect using the Student's t-test.
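The bootstrap interval can be sketched like this. Again, the data is simulated for illustration (a true +5% effect is built in); the resampling scheme itself is the standard percentile bootstrap.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical per-store sales for the target and control groups.
target = rng.normal(loc=105.0, scale=10.0, size=150)   # simulated +5% effect
control = rng.normal(loc=100.0, scale=10.0, size=150)

# Bootstrap: resample each group with replacement many times and
# record the difference in means each time.
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    t_sample = rng.choice(target, size=target.size, replace=True)
    c_sample = rng.choice(control, size=control.size, replace=True)
    diffs[i] = t_sample.mean() - c_sample.mean()

# The 2.5% and 97.5% percentiles of the estimates form a 95% confidence interval.
low, high = np.percentile(diffs, [2.5, 97.5])
print(f"95% CI for the mean difference: [{low:.2f}, {high:.2f}]")
if low > 0 or high < 0:
    print("Interval excludes zero: the effect is likely real.")
```

If the interval straddled zero, that would mean the effect might be absent and the test should continue, mirroring the decision rule described above.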