Amp Up Your A/B Testing Using Raw Analytics Data, Apache Spark, and R

Follow @trevorwithdata

Experimentation and testing (also known as A/B testing or split testing) is an essential tool in the digital information age. If you’re not sure what A/B testing is, it’s the process of comparing two versions of a web page or app screen to see which performs better. A/B testing tools such as Adobe Target, Optimizely, and Google Optimize allow you to use data and statistics to decide which of the experiences you put in front of your customers drives better results (e.g., whether a red button or a blue button leads to higher conversion rates).

What I aim to show you in this post is that you can perform some pretty sophisticated A/B testing analysis using raw analytics data (Adobe Analytics Data Feeds as my example) and some open source software. We’ll go over the two types of statistical methods generally used by A/B testing products in the market today: the “Frequentist” Null Hypothesis Significance Testing (NHST) approach and the Bayesian Inference method. And more importantly, I’m going to show you how you can achieve even greater test result accuracy than you’d be able to accomplish with a standalone A/B testing tool by leveraging the years of historical data that probably already exist in your raw analytics logs for your testing model.

Using Analytics for A/B Testing

Your standard testing tool has essentially two parts: 1) a content delivery mechanism that handles proper user sampling (only one variant per user) and 2) an algorithm for determining a winner. For this blog post, I’m primarily focusing on #2 (using Adobe Analytics as an example) mainly because analytics tools are not content delivery mechanisms. So assuming you have a reliable way to expose A/B test variations to appropriate independent sets of users with an A/B testing tool or some other method, there are a few things you’ll need to do before you can start calling winners and losers using analytics data:

Make sure you’re collecting a unique name for each test variation you’re managing – I like names that are clear, concise, and meaningful to those who will be looking at them. Something like “Home Page Hero Banner – Variation A – Started Jul 07 2018” is usually pretty effective.
Also, make sure to store those unique names in an analytics variable (like an eVar or prop) that you can pick up later in your Data Feed.
As mentioned above, make sure not to show multiple variations of the same test to the same user – this contaminates your results and wastes traffic that could otherwise help you reach statistical significance as fast as possible.
If you’re an Adobe Target and Adobe Analytics customer, make sure to turn on A4T – this will put all of your Adobe Target tests into your Analytics data. Using A4T in your Data Feeds (look for the post_tnt and post_tnt_action columns) you can have Adobe Target handle the content delivery, and use Adobe Analytics for the in-depth analysis.

Also be sure to catch up on some of my previous blog posts, since we’ll be leveraging them a lot throughout the remainder of this post.

So before I start doing any analysis, I’m going to create a clean data frame that will be used by both types of tests we’ll explore. The six test variations I’ll be looking at are named:

testA = "Checkout Flow - Existing Experience - Jul-Aug 2017"
testB = "Checkout Flow - No coupon field - Jul-Aug 2017"
testC = "Checkout Flow - Multi-step checkout - Jul-Aug 2017"
testD = "Checkout Flow - Checkout button bottom - Jul-Aug 2017"
testE = "Checkout Flow - Checkout button left - Jul-Aug 2017"
testF = "Checkout Flow - Checkout button right - Jul-Aug 2017"

I’ll start by grouping by visitor_id (making every row a unique visitor/user), then I’ll create conversion and exposure columns for each of the test variations, then I’ll remove visitors that saw multiple variations (sometimes that happens by accident), finally getting a dataset out of the result:

visitor_pool = data_feed_tbl %>%
 group_by(visitor_id) %>%
 summarize(
   Aconversion = max(ifelse(variation==testA & conversion>0, 1, 0)),
   Aexposure = max(ifelse(variation==testA, 1, 0)),
   ...
   Fconversion = max(ifelse(variation==testF & conversion>0, 1, 0)),
   Fexposure = max(ifelse(variation==testF, 1, 0)),
   date = from_unixtime(max(hit_time_gmt), "yyyy-MM-dd")
 ) %>% ungroup() %>%
 mutate(
   totalExposures = Aexposure + ... + Fexposure
 ) %>%
 filter(totalExposures == 1) %>%
 mutate(
   exposure = ifelse(Aexposure==1, "A", 
     ifelse(Bexposure==1, "B", 
       ifelse(Cexposure==1, "C", 
         ifelse(Dexposure==1, "D", 
           ifelse(Eexposure==1, "E", 
             ifelse(Fexposure==1, "F", NA)))))),
   conversion = Aconversion + ... + Fconversion
 ) %>%
 select(
   exposure,
   conversion,
   date
 )

This produces a Spark data frame where every row is a visitor, with columns indicating which test variation the visitor was exposed to, whether they converted, and the date of the test exposure. With this, I’m ready to dive in – analyzing which of these six variations is better!
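To make the clean-up logic concrete, here is a minimal local sketch in base R on a tiny in-memory data frame. The `hits` data and its values are made up for illustration (the real pipeline above runs in Spark against the Data Feed); the point is only to show a visitor who saw two variations being dropped and the rest collapsing to one row each:

```r
# Hypothetical hit-level data: one row per hit (values invented for illustration)
hits <- data.frame(
  visitor_id = c("v1", "v1", "v2", "v3", "v3", "v4"),
  variation  = c("A",  "A",  "B",  "A",  "B",  "B"),
  conversion = c(0,    1,    0,    0,    1,    1),
  stringsAsFactors = FALSE
)

# Count the distinct variations each visitor was exposed to
n_variations <- tapply(hits$variation, hits$visitor_id,
                       function(v) length(unique(v)))

# Keep only visitors exposed to exactly one variation (v3 saw both A and B)
clean_ids <- names(n_variations)[n_variations == 1]
clean <- hits[hits$visitor_id %in% clean_ids, ]

# Collapse to one row per visitor: the variation seen and a 0/1 conversion flag
by_visitor <- split(clean, clean$visitor_id)
toy_pool <- data.frame(
  visitor_id = names(by_visitor),
  exposure   = vapply(by_visitor, function(d) d$variation[1], character(1)),
  conversion = vapply(by_visitor,
                      function(d) as.numeric(max(d$conversion) > 0), numeric(1)),
  row.names  = NULL,
  stringsAsFactors = FALSE
)
print(toy_pool)
```

Visitor v3 is excluded entirely rather than assigned to whichever variation came first, which mirrors the `totalExposures == 1` filter above.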

A/B Null Hypothesis Testing Using A Basic T-Test

Null Hypothesis testing is the prevailing method for doing A/B testing in the market today. It’s what Adobe Target and Optimizely currently use in one form or another and has been around for a long time. Often referred to as the “frequentist” approach – it assumes that your data roughly conforms to a family of Normal or Gaussian distributions, most commonly the Student’s t-distribution. T-tests have been a tried and true approach for many applications and are the generally accepted method in most testing applications. Despite the wide adoption, I think this method has a few disadvantages:

Conversions that occur on the web are typically infrequent – often making conversion distributions decidedly skewed (in other words, not Normal).
Conversion counts are never negative, and conversion rates are never greater than one (a Normal distribution allows values below zero and above one), and
T-tests assume no other historical precedent or data exists – so every test is like starting from ground zero.

In theory, these assumptions are not that bad when you have a lot of data, but in practice, data is often sparse because we can’t wait months for a test result, or conversions don’t happen that often, or both.
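A quick way to see the sparse-data problem is to compute a Normal-approximation interval for a small, low-converting sample. The counts below are hypothetical, and the exact interval from base R's `binom.test` is included only as a point of comparison, not as something the post's method uses:

```r
# Hypothetical sparse test: 2 conversions out of 150 visitors
conversions <- 2
population  <- 150

cr       <- conversions / population
cr_sigma <- sqrt(cr * (1 - cr) / population)  # standard error of a proportion

# Normal-approximation 95% interval for the conversion rate
normal_interval <- c(lower = cr - 1.96 * cr_sigma,
                     upper = cr + 1.96 * cr_sigma)

# Exact (Clopper-Pearson) binomial interval, for comparison
exact_interval <- binom.test(conversions, population)$conf.int

print(normal_interval)
print(exact_interval)
```

With counts this small the lower bound of the Normal interval falls below zero, which is impossible for a conversion rate, while the exact interval stays inside the zero-to-one range.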

I find that the results of a t-test can be difficult to communicate to less sophisticated business stakeholders who don’t have much statistics background. Lift, confidence, and control variants are concepts that take a little training and are easy to misinterpret – often leading to spurious conclusions and missed business opportunities. It’s unfortunately common in our industry to wrongly infer that a test variation with a large lift (but low statistical confidence) is the winner when there’s not enough data to support that conclusion. On the other hand, several of the more sophisticated organizations I’ve seen don’t have this knowledge gap and generally prefer to think of their testing in terms of lift and confidence.

Whether your organization is sophisticated or not, I’m going to show you how to do a basic t-test using the calculations Adobe Target uses to decide on winners and losers using the dataset we just created above. To start, I’ll first need to summarize each of the test variants across visitors in my example dataset so that I can calculate the mean and variance of the conversion rate:

# First let Spark do the heavy lifting
results = visitor_pool %>%
 group_by(exposure) %>%
 summarize(
   population = n(),
   conversions = sum(conversion)
 ) %>% collect()
 
# Calculate the conversion rate and its standard error for each variation locally
results = results %>%
 mutate(
   cr = conversions/population,
   cr_sigma = sqrt(cr*(1-cr)/population)
 ) %>% 
 arrange(exposure)

Next, I’ll divide my results data frame into control and test data frames so that I can calculate lift and confidence for each. In my case, I’ll be using “A” as my control:

control = results[1,]
test_group = results[-1,]

test_lift = test_group %>%
 mutate(
   lift = cr/control$cr - 1,
   lift_sigma = abs(lift * sqrt((cr_sigma/cr)^2 + 
                (control$cr_sigma/control$cr)^2)),
   lift_upper = lift + lift_sigma*1.96,
   lift_lower = lift - lift_sigma*1.96
 )
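The `lift_sigma` line has the shape of a standard propagation-of-uncertainty calculation for a ratio: the relative errors of the two conversion rates are added in quadrature. The sketch below walks through it with hypothetical numbers (none of them come from the post's dataset). One thing worth knowing: the textbook first-order (delta method) form scales that combined relative error by the ratio `cr/control$cr`, whereas the code above scales it by `lift`, so the two give different widths. I'm showing both side by side rather than claiming which one any particular tool uses:

```r
# Hypothetical summary numbers for a control and one test variation
control_cr <- 0.030; control_n <- 10000
test_cr    <- 0.033; test_n    <- 10000

# Standard error of each conversion rate (same formula as cr_sigma above)
control_sigma <- sqrt(control_cr * (1 - control_cr) / control_n)
test_sigma    <- sqrt(test_cr * (1 - test_cr) / test_n)

ratio <- test_cr / control_cr
lift  <- ratio - 1

# Relative errors of the two rates, combined in quadrature
rel_error <- sqrt((test_sigma / test_cr)^2 + (control_sigma / control_cr)^2)

# Form used in the code above: combined relative error scaled by the lift
lift_sigma_post <- abs(lift * rel_error)

# Textbook delta-method form: combined relative error scaled by the ratio
lift_sigma_delta <- ratio * rel_error

print(c(lift = lift,
        sigma_scaled_by_lift  = lift_sigma_post,
        sigma_scaled_by_ratio = lift_sigma_delta))
```

Because subtracting one from the ratio does not change its spread, the ratio-scaled value is the one the delta method gives for the lift as well; for small lifts it is noticeably wider than the lift-scaled value.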

Finally, visualizing the results with a box plot using plotly:

And looking at the results illustrating lift (solid line) and confidence (shaded bands) over the life of the test (using a method similar to the one above, but also grouped by date, not just by variant):

If it’s not clear, the box plot is telling me that B (the blue box) has the greatest lift compared to A. Luckily, I do not have to choose between E and F (the red and purple), which have very similar performance, but if I did I would likely need to run my test longer or expose E and F to more visitors. The box plot, while industry standard, is not my favorite way to visualize the results – inevitably I’m going to be asked, “so is B the best?” or “wait, what is lift?” or “where’s A on here?” or sometimes “what is a p-value?” In my opinion, the winner of the test and its statistical significance should be evident in the test results and accompanying visualizations. Sometimes box plots seem to require more stakeholder training than anyone has time for.

The last point about this type of testing – you can see it took about four weeks in my example for the clear winner to emerge (the point where the 95% bands don’t overlap anymore). Four weeks can be a long time in engineering terms, so anything we can do to bring that timeframe down is valuable in my mind.

Bayesian Testing – An Alternative Approach

Bayesian estimation is an excellent alternative to null hypothesis testing in my opinion, and there’s an article about it here from the CXL Institute discussing both sides if you’re interested. Bayesian estimation allows you to frame the issue in a slightly different way, and describe your data using the distribution that makes the most sense (rather than assuming your data fits a Normal distribution). Instead of trying to answer, “which of these alternatives is better than a control?” it answers the question, “which of all these variants has the highest probability of being the best?” Additionally, Bayesian estimation allows you to use a “prior” distribution that can help you arrive at a winner more quickly – essentially allowing previous tests to inform the results of subsequent tests.

The reason Bayesian testing is an appealing alternative in the context of raw analytics data (like Adobe Analytics Data Feeds) is that you can build an excellent prior distribution leveraging every test you’ve ever done in the past. An excellent prior distribution allows you to put the results of any test variation in the context of all other tests you’ve ever done on your site or app.

To set up the groundwork, I’m going to rely on two distributions for my testing – the binomial distribution, and the beta distribution. The binomial distribution is perfect for describing conversions – e.g., what’s the probability that someone will convert given they’ve seen any particular variant? The beta distribution is excellent because it will allow us to form an educated prior of our conversion rate (it lives between zero and one – perfect for our use case). Together, these distributions form a conjugate pair, which makes our calculations much simpler, as you’ll see.

To build my prior distribution, I’m going to look at the conversion rate that resulted from every test I’ve ever done:

prior_tests = data_feed_tbl %>%
 group_by(visitor_id) %>%
 summarize(
   conversion = max(conversion),
   exposure = first_value(variation)
 ) %>% ungroup() %>%
 group_by(exposure) %>%
 summarize(
   conversions = sum(conversion),
   population = n_distinct(visitor_id)
 ) %>% ungroup() %>%
 mutate(
   cr = conversions/population
 ) %>%
 # Drop variations seen by fewer than 30 visitors
 filter(population >= 30) %>% 
 collect()

u=mean(prior_tests$cr)
s2=var(prior_tests$cr)

I’ve filtered out every test that had fewer than 30 users, to avoid junking up my prior. After running this on my example data, I get a mean conversion rate (across all tests) of about 3% and a variance of 0.0008 (which translates to a standard deviation of about 3%). The distribution of all my tests looks like this:

You can see the considerable majority of the tests in my test dataset (there are a couple hundred test variations represented in this histogram) have less than a 3% conversion rate. This histogram is helpful context for any new testing I do and helps me understand where I should expect to see my conversion rates fall in any future testing. To turn this data into a prior (beta) distribution, I’m going to need an alpha and beta parameter which can be calculated given a mean and variance from my historical data:

prior_alpha = ((1-u)/s2 - 1/u)*u^2
prior_beta = prior_alpha*(1/u - 1)

Based on my data, that gives me a prior alpha of 0.93 and a prior beta of 32.73. Overlaid on the previous histogram, it looks like this (not too bad):

p = plot_ly(alpha = 0.6) %>%
 add_trace(x=~prior_tests$cr, type="histogram", histnorm="probability", 
           autobinx=FALSE, xbins=list(start=0,end=.11,size=.01), 
           name="All Tests") %>%
 add_trace(x=~rbeta(1000,prior_alpha,prior_beta), type="histogram", 
           histnorm="probability", autobinx=FALSE, 
           xbins=list(start=0,end=.11,size=.01), name="Prior Fit") %>%
 layout(yaxis = list(title="Probability", type="linear"), 
        xaxis=list(title="Conversion Rate", range=c(0,.11)), 
        barmode="overlay")
p
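As a sanity check on the moment-matching formulas used for prior_alpha and prior_beta, here is a small sketch that wraps them in a helper and confirms that the resulting beta distribution reproduces the mean and variance it was given. The helper name `beta_from_moments` is my own, and the inputs are the rounded figures quoted above (3% and 0.0008), so the parameters will not exactly match 0.93 and 32.73:

```r
# Hypothetical helper: beta parameters from a mean (u) and variance (s2).
# Only valid when 0 < u < 1 and s2 < u * (1 - u); otherwise alpha or beta
# would come out non-positive.
beta_from_moments <- function(u, s2) {
  stopifnot(u > 0, u < 1, s2 > 0, s2 < u * (1 - u))
  alpha <- ((1 - u) / s2 - 1 / u) * u^2
  beta  <- alpha * (1 / u - 1)
  list(alpha = alpha, beta = beta)
}

# Rounded inputs taken from the text above
u_example  <- 0.03
s2_example <- 0.0008
fit <- beta_from_moments(u_example, s2_example)

# Mean and variance implied by the fitted beta distribution
fit_mean <- fit$alpha / (fit$alpha + fit$beta)
fit_var  <- fit$alpha * fit$beta /
  ((fit$alpha + fit$beta)^2 * (fit$alpha + fit$beta + 1))

print(c(alpha = fit$alpha, beta = fit$beta, mean = fit_mean, var = fit_var))
```

The guard at the top matters in practice: if the historical conversion rates are very spread out relative to their mean, the formulas produce negative parameters and a beta prior is not a good fit.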

Now comes the easy part (thanks to the magic of conjugate priors) – estimating the true conversion rate of each of the test variations so that we can compare them. To do that, I’ll apply the conjugate prior to the observed data to create posterior hyper-parameters. That sounds complicated, but it’s easy – just add the prior alpha to the number of conversions to get the posterior alpha, and add the prior beta to the total population minus the conversions to get the posterior beta:

posteriors = results %>%
 mutate(
   posterior_alpha = prior_alpha + conversions,
   posterior_beta = prior_beta + population - conversions
 )
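To get a feel for what the prior actually does, the sketch below applies the same update to two hypothetical variations that share a raw conversion rate of 6% but differ a lot in sample size. The prior parameters are the ones reported above; the counts are invented. The credible intervals come from `qbeta`, which is one common way to summarize a beta posterior:

```r
# Prior parameters as reported in the text above
prior_alpha_ex <- 0.93
prior_beta_ex  <- 32.73

# Hypothetical variations with the same raw conversion rate (6%)
conversions <- c(small = 3,  large = 300)
population  <- c(small = 50, large = 5000)

# Conjugate update: add successes to alpha, failures to beta
posterior_alpha <- prior_alpha_ex + conversions
posterior_beta  <- prior_beta_ex + population - conversions

raw_cr         <- conversions / population
posterior_mean <- posterior_alpha / (posterior_alpha + posterior_beta)

# 95% credible interval from the beta quantile function
lower <- qbeta(0.025, posterior_alpha, posterior_beta)
upper <- qbeta(0.975, posterior_alpha, posterior_beta)

print(rbind(raw_cr, posterior_mean, lower, upper))
```

The small sample is pulled noticeably toward the prior mean of roughly 3%, while the large sample barely moves, which is the behavior you want when early test data is sparse.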

In essence, the posterior distribution we will create using the posterior_alpha and posterior_beta parameters for each variation represents an estimate of the range in which that variation’s true conversion rate would fall if we ran this test in perpetuity. To form the resulting posterior distributions (and distribution densities) for each of the test variations A through F, just use the rbeta function, which draws random samples from a beta distribution given the two input parameters (similar to how I generated samples for the prior above). You can also use the density function, which will give you something useful to plot (as I’ll show below):

Apost = rbeta(10000, posteriors$posterior_alpha[1], 
        posteriors$posterior_beta[1])
Afit = density(Apost)
...
Fpost = rbeta(10000, posteriors$posterior_alpha[6], 
        posteriors$posterior_beta[6])
Ffit = density(Fpost)

Plotting the densities of each gives me a lovely visualization of the likely conversion rate for each test variation:

p = plot_ly(x = Bfit$x, y=Bfit$y, type="scatter", mode="lines", 
            fill="tozeroy", name="B") %>%
 add_trace(x = Cfit$x, y=Cfit$y, fill="tozeroy", name="C") %>%
 add_trace(x = Dfit$x, y=Dfit$y, fill="tozeroy", name="D") %>%
 add_trace(x = Efit$x, y=Efit$y, fill="tozeroy", name="E") %>%
 add_trace(x = Ffit$x, y=Ffit$y, fill="tozeroy", name="F") %>%
 add_trace(x = Afit$x, y=Afit$y, fill="tozeroy", name="A") %>%
 layout(xaxis=list(title="Conversion Rate"), 
        yaxis=list(title="Probability Density", showticklabels=FALSE, 
        range=c(0,450)))
p

And an associated trend (again grouping by date in addition to testing variation) that now includes the A variation that was absent from the t-test trend:

In my opinion, the distribution plot is a little easier to interpret than a box plot for a lay user – it’s clear that B has a higher conversion rate (being the furthest right on the x-axis), and that the distributions don’t overlap much (other than for E and F). Also, because I’ve already simulated each of these distributions in R, it’s quite easy for me to answer, “what’s the probability that E is better than F?”:

> sum(Epost > Fpost) / length(Epost)
[1] 0.7051

This means that there’s about a 71% chance that E is going to produce a higher conversion rate than F. This question was tough to answer with the t-test results. You’ll also notice that I arrived confidently at a winner slightly sooner than I did with the t-test – B pulls away about 3.5 weeks into the test instead of the four weeks it took with the t-test, and I have a slightly better confidence interval at the end than I do with the t-test. This increased confidence is due to the prior distribution that we introduced – we’re not starting the test without any prior knowledge like we are with the t-test.
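The same simulation idea extends from a pairwise comparison to the question “which variant has the highest probability of being the best?” The sketch below is one way to do that, assuming a data frame of posterior parameters shaped like the posteriors object above. The three rows of numbers are hypothetical stand-ins, not the post's results:

```r
set.seed(42)  # make the simulation repeatable

# Hypothetical posterior parameters for three variations
post <- data.frame(
  exposure        = c("A", "B", "C"),
  posterior_alpha = c(300.93, 360.93, 310.93),
  posterior_beta  = c(9732.73, 9672.73, 9722.73),
  stringsAsFactors = FALSE
)

n_draws <- 10000

# One column of simulated conversion rates per variation
draws <- sapply(seq_len(nrow(post)), function(i) {
  rbeta(n_draws, post$posterior_alpha[i], post$posterior_beta[i])
})
colnames(draws) <- post$exposure

# For each simulated draw, find which variation had the highest rate
best <- colnames(draws)[max.col(draws, ties.method = "first")]

# Share of draws in which each variation came out on top
prob_best <- table(factor(best, levels = post$exposure)) / n_draws
print(prob_best)
```

Each entry is the share of simulated worlds in which that variation had the highest conversion rate, so the values sum to one and can be read directly as “probability of being the best.”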

Lastly, I don’t have to worry about explaining what “control” or “lift” is, and instead can just talk about the expected conversion rates for each variation which is much easier to understand. This is a big deal to me, and saves me the trouble of having to teach statistics to business stakeholders!

Conclusion

To wrap up, let me call out a few things. Whichever testing method you prefer, both types of tests gave essentially the same result, and neither one (in this example) was inherently better than the other at picking winners and losers in the long term. The chief advantages of the Bayesian approach in my mind are interpretability and slightly faster results. However, null hypothesis testing is undoubtedly preferred by many in our industry, and it is also the most common approach. So I’ll leave it to you to decide which is better for your organization.

It’s probably also important to point out that analytics tools are not a 100% replacement option for a fully featured A/B testing tool that can handle content delivery, visitor targeting, and personalization. I should also mention that I prefer any A/B testing tool that tightly integrates with an analytics tool (like Adobe Target) because it allows me to dive even more in-depth than I would with a testing tool alone.

The main takeaway I hope you have is that whatever your testing methods, bringing the data into Adobe Analytics Data Feeds (or any raw analytics data) and R will give you the ability to compare the results and see which method best suits your business case. With out-of-the-box A/B testing software alone, you’re locked in to whatever algorithm the tool supports, and you probably won’t have the wealth of data at your disposal that you already have with Adobe Analytics.

Good luck with all your testing efforts! Please let me know on Twitter if you found this useful or have any questions.


Trevor Paulsen

Trevor is a group product manager for Adobe's Customer Journey Analytics (CJA). With a background in aerospace engineering and robotics, he has a strong foundation in estimation theory and data mining. Before leading Adobe's data science consulting team, Trevor used these skills to drive innovation in the fields of aerospace and robotics. When he's not working, Trevor enjoys engaging in big data projects and statistical analyses as a hobby. He is also a father of five and enjoys bike rides and music. All views expressed are his own.

One thought to “Amp Up Your A/B Testing Using Raw Analytics Data, Apache Spark, and R”

Casual reader says:

August 20, 2018 at 9:38 am

Hi Trevor,

Thank you for great and detailed post! It’s really informative… Could you help me understand the following calculation – lift_sigma = abs(lift * sqrt((cr_sigma/cr)^2 + (control$cr_sigma/control$cr)^2))
