An alternative to p-values in A/B testing

How high-probability lower bounds (HPLBs) on the total variation distance can lead to an appealing, integrated test statistic for A/B testing

Figure 1: figure from the original paper (by authors)

Contributors: Loris Michel, Jeffrey Näf

The classical steps of a general A/B test, i.e. deciding whether two groups of observations come from different distributions (say P and Q), are:

  • Assume a null and an alternative hypothesis (here respectively, P=Q and P≠Q);
  • Define a level of significance alpha;
  • Construct a statistical test (a binary decision rejecting the null or not);
  • Derive a test statistic T;
  • Obtain a p-value from the approximate/asymptotic/exact null distribution of T.
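For concreteness, here is how these steps look in R for a classical test (a minimal sketch; we pick the two-sample Kolmogorov-Smirnov test from base R purely for illustration, any classical two-sample test would do):

# Classical two-sample testing, illustrated with the
# Kolmogorov-Smirnov test (base R, univariate)
set.seed(1)
x <- rnorm(100)              # sample from P
y <- rnorm(100, mean = 0.5)  # sample from Q
alpha <- 0.05                # level of significance
test <- ks.test(x, y)        # test statistic T and its p-value
test$p.value < alpha         # TRUE -> reject the null P = Q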

However, when such a test rejects the null, i.e. when the p-value is significant (at a given level), we still lack a measure of how strong the difference between P and Q is. In fact, the rejection status of a test can be uninformative in modern applications (complex data): with a large enough sample size (at a fixed level and power), any test will tend to reject the null, since the null is rarely exactly true. For example, it would be interesting to know how many data points actually support a distributional difference.

Therefore, based on finite samples from P and Q, a finer question than “is P different from Q?” is: “What is a probabilistic lower bound on the fraction λ of observations actually supporting a difference in distribution between P and Q?”. Formally, this translates into the construction of an estimate λˆ satisfying λˆ ≤ λ with high probability (say 1-alpha). We call such an estimate a high-probability lower bound (HPLB) on λ.

In this story we want to motivate the use of HPLBs in A/B testing and argue why the right notion for λ is the total variation distance between P and Q, i.e. TV(P, Q). We keep the explanation and details of the construction of such an HPLB for another article. You can always check our paper for more details.

Why the Total Variation Distance?

The total variation distance is a strong (fine) metric for probabilities: if two probability distributions are different, then their total variation distance is non-zero. It is usually defined as the maximal disagreement of the probabilities assigned to sets. However, it enjoys a more intuitive representation as a discrete transport of measure between the probabilities P and Q (see Figure 2):

The Total variation distance between the probability measures P and Q is the fraction of probability mass that one would need to change/move from P to obtain the probability measure Q (or vice-versa).

In practical terms the total variation distance represents the fraction of points that differ between P and Q, which is exactly the right notion for λ.

Figure 2: Top left: representation of TV(P, Q) as the difference in probability mass. Top right: the usual definition of TV(P, Q) as the maximal probability disagreement (over a sigma-algebra). Bottom: the discrete optimal transport formulation as the fraction of mass differing between P and Q (by authors).
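For discrete distributions this transport view reduces to a simple formula: TV(P, Q) is half the L1 distance between the probability vectors. A quick illustration in R (our own toy example, not from the paper):

# TV between two discrete distributions as half the L1 distance
p.probs <- c(0.5, 0.3, 0.2)
q.probs <- c(0.3, 0.3, 0.4)
0.5 * sum(abs(p.probs - q.probs))  # 0.2: a fifth of the mass must move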

How to use an HPLB, and what is its advantage?

The estimate λˆ is appealing for A/B testing because this single number conveys both statistical significance (as a p-value does) and an estimate of the effect size. It can be used as follows:

  • Define a confidence level (1-alpha);
  • Construct the HPLB λˆ based on the two samples;
  • If λˆ is zero, then do not reject the null; otherwise, if λˆ > 0, reject the null and conclude that λ (the differing fraction) is at least λˆ with probability 1-alpha.
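In code, the decision rule reads roughly as follows (a minimal sketch; lambda.hat is a placeholder for an HPLB computed from the two samples at confidence level 1-alpha, as in the worked example below):

# Sketch of the HPLB-based decision, assuming lambda.hat has already
# been computed from the two samples at confidence level 1 - alpha
alpha <- 0.05
if (lambda.hat > 0) {
  cat("Reject P = Q: a fraction of at least", lambda.hat,
      "of the observations differ, with probability", 1 - alpha, "\n")
} else {
  cat("Do not reject P = Q\n")
}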

Of course, the price to pay is that the value of λˆ depends on the chosen confidence level (1-alpha), whereas a p-value is independent of it. Nevertheless, in practice the confidence level does not vary much (it is usually set to 95%).

Consider the example of effect size in medicine. A new medication needs to show a significant effect in the experimental group compared to a placebo group that did not receive the medication. But it also matters how large the effect is. As such, one should not just report p-values, but also give some measure of the effect size. This is now widely recognised in good medical research. Indeed, a more intuitive approach to calculating TV(P, Q) has been used in the univariate setting to describe the difference between treatment and control groups. Our HPLB approach provides both a measure of significance and an effect size. Let us illustrate this with an example:

Let’s look at an example

We simulate two distributions P and Q in two dimensions. P is just a multivariate normal, while Q is a mixture of P and a multivariate normal with shifted mean.

library(mvtnorm)
library(HPLB)

set.seed(1)
n <- 2000
p <- 2

# Larger delta -> more difference between P and Q
# Smaller delta -> less difference between P and Q
delta <- 0

# Simulate X ~ P and Y ~ Q for the given delta: each Y point comes
# from the mean-shifted normal with probability delta, else from P
U <- runif(n)
X <- rmvnorm(n = n, sigma = diag(p))
Y <- (U <= delta) * rmvnorm(n = n, mean = rep(2, p), sigma = diag(p)) +
  (1 - (U <= delta)) * rmvnorm(n = n, sigma = diag(p))

plot(Y, cex = 0.8, col = "darkblue")
points(X, cex = 0.8, col = "red")

The mixture weight delta controls how strongly the two distributions differ. Varying delta from 0 to 0.9, the simulated data look like this:

Simulated data with delta=0 (top right), delta=0.05 (top left), delta=0.3 (bottom right) and delta=0.8 (bottom left). Source: author

We can then calculate the HPLB for each of these scenarios:

# Estimate the HPLB for each case (vary delta and rerun the code)

# First half of each sample: training data for a classifier
# that tries to distinguish X from Y
t.train <- c(rep(0, n/2), rep(1, n/2))
xy.train <- rbind(X[1:(n/2), ], Y[1:(n/2), ])

# Second half: held-out data on which the HPLB is computed
t.test <- c(rep(0, n/2), rep(1, n/2))
xy.test <- rbind(X[(n/2 + 1):n, ], Y[(n/2 + 1):n, ])

# A random forest estimates rho, the probability of belonging to Y
rf <- ranger::ranger(t ~ ., data.frame(t = t.train, x = xy.train))
rho <- predict(rf, data.frame(t = t.test, x = xy.test))$predictions

tvhat <- HPLB(t = t.test, rho = rho, estimator.type = "adapt")
tvhat
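To reproduce all scenarios at once, one can wrap the simulation and estimation in a function and sweep over delta (a sketch under the same settings as above; it assumes, as the code above suggests, that HPLB() returns a single numeric estimate, and exact values depend on the seed):

# Sweep over several deltas and collect the HPLB estimates
estimate.tv <- function(delta, n = 2000, p = 2) {
  U <- runif(n)
  X <- rmvnorm(n = n, sigma = diag(p))
  Y <- (U <= delta) * rmvnorm(n = n, mean = rep(2, p), sigma = diag(p)) +
    (1 - (U <= delta)) * rmvnorm(n = n, sigma = diag(p))
  t.train <- c(rep(0, n/2), rep(1, n/2))
  xy.train <- rbind(X[1:(n/2), ], Y[1:(n/2), ])
  t.test <- c(rep(0, n/2), rep(1, n/2))
  xy.test <- rbind(X[(n/2 + 1):n, ], Y[(n/2 + 1):n, ])
  rf <- ranger::ranger(t ~ ., data.frame(t = t.train, x = xy.train))
  rho <- predict(rf, data.frame(t = t.test, x = xy.test))$predictions
  HPLB(t = t.test, rho = rho, estimator.type = "adapt")
}

set.seed(1)
sapply(c(0, 0.05, 0.3, 0.8), estimate.tv)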

If we do that with the seed set above, we obtain the following estimates:

Estimated values for different deltas.

Thus the HPLB manages to (i) detect when there is indeed no difference between the two distributions, i.e. it is zero when delta is zero, (ii) detect the very small difference already when delta is only 0.05, and (iii) detect that the difference grows with delta. Again, the crucial thing to remember about these values is that they really mean something: the value 0.64 is a lower bound for the true TV with high probability. In particular, each of the numbers that is larger than zero means that a test of P=Q got rejected at the 5% level.
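As a sanity check (our own calculation, not output of the package): for the mixture Q = (1-delta)·P + delta·R one has TV(P, Q) = delta·TV(P, R), and for two Gaussians with identity covariance and mean difference m, TV(P, R) = 2Φ(||m||/2) - 1. This gives the true TV values against which the lower bounds can be compared:

# True TV(P, Q) in the simulation: TV(P, Q) = delta * TV(P, R),
# where R = N(rep(2, p), I) and TV(P, R) = 2 * pnorm(||m|| / 2) - 1
m <- rep(2, 2)                               # mean shift used above
tv.PR <- 2 * pnorm(sqrt(sum(m^2)) / 2) - 1   # ~0.843
round(c(0, 0.05, 0.3, 0.8) * tv.PR, 3)       # 0.000 0.042 0.253 0.674
# e.g. the reported 0.64 (presumably at delta = 0.8) indeed lies
# below the true TV of ~0.674, as a high-probability lower bound should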

Conclusion

When it comes to A/B testing (two-sample testing), the focus is often on the rejection status of a statistical test. When a test rejects the null hypothesis, however, it is useful in practice to have a measure of the intensity of the distributional difference. Through the construction of high-probability lower bounds on the total variation distance, we obtain a lower bound on the fraction of observations that are expected to differ, and thus an integrated answer to both whether the distributions differ and how strong the shift is.

Disclaimer and resources: We are aware that we left out many details (efficiency, construction of HPLBs, power studies, …) but hope to have opened a horizon of thinking. More details and comparisons to existing tests can be found in our paper; also check out the R package HPLB on CRAN.
