Propensity Scoring in Adobe Analytics Using Data Feeds and R

Follow @trevorwithdata

When I was a kid, my favorite TV game show was “The Price Is Right.” It’s flashy and fun, and to this day I still love watching it, except for one thing: the ads. I still find it obnoxious to be bombarded by annoying (and sometimes gross) TV ads about hemorrhoid treatments or mesothelioma lawsuits! I suppose they’re trying to reach the 50+ male crowd, and on TV that’s the best that can be done. But in the digital space, you have a lot more information about each individual than TV advertisers do. For example, instead of guessing, you can know for sure if the person viewing the site is in their 50s before you serve an ad. Better still, digital marketing lets you take this idea even further. Instead of targeting 50-year-old males in the hope that a few of them would be interested in a mesothelioma lawsuit, what if you could identify the individuals in that group who are statistically likely to be interested and focus your marketing dollars on just them? It’s already possible to do this, and in this post I’m going to show you how to identify people who have a high probability of conversion, then show you how to identify the things that make them uniquely marketable.

Before we begin, you’re going to want to familiarize yourself with sparklyr and with how to create visitor-level aggregations, which we’ve covered here and here. This post picks up with a “visitor_rollup” Spark data frame, which is where I left off in the previously mentioned post.

To begin, we’re going to build a basic propensity model using logistic regression. The first step is to get the visitor rollup data frame into the right format for modeling, which means creating a binary response variable (1 or 0) out of a success variable you have in your visitor rollup. To do that, I’ll use a simple “ifelse” statement inside a dplyr mutate call, which sparklyr translates to run on the Spark data frame:

propensity_rollup = visitor_rollup %>%
  mutate(
    # Your success variable might be orders, or a
    # custom event, or even a repeat visit.
    response_var = ifelse(success_variable > 0, 1, 0)
  ) %>% select(-success_variable)

Your “success_variable” might be orders, revenue, return visits (in which case I would say visits > 1 rather than > 0), or even a file download for example. Notice I’ve also removed the original success_variable from my new propensity_rollup data frame so that I don’t accidentally use it as an input in training the model.
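To make that step concrete, here is a minimal local sketch on a toy data frame rather than a Spark one. The column names (orders, visits) and the decision to count a missing value as "no conversion" are assumptions for illustration, not part of the original rollup.

```r
# Toy stand-in for a collected visitor rollup; all column names are hypothetical
toy_rollup <- data.frame(
  visitor_id = c("a", "b", "c", "d"),
  orders     = c(0, 2, NA, 1),
  visits     = c(1, 5, 3, 1)
)

# Assumption: a missing order count means the visitor did not convert
orders_filled <- ifelse(is.na(toy_rollup$orders), 0, toy_rollup$orders)
toy_rollup$purchased <- ifelse(orders_filled > 0, 1, 0)

# For a return-visit response, the threshold is visits > 1 rather than > 0
toy_rollup$returned <- ifelse(toy_rollup$visits > 1, 1, 0)

# Drop the raw success columns so they cannot leak into the model inputs
model_input <- toy_rollup[, !(names(toy_rollup) %in% c("orders", "visits"))]
print(model_input)
```

In practice you would keep only one of the two response columns, whichever matches the success event you are modeling.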

Next, I’m going to do a little data cleanup and pull out a sample of visitors to build my propensity model from, removing the “visitor_id” column, which isn’t useful for modeling:

# Select model sample size
sample_size = 10000

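# Grab a sample of visitors for building my model.
# Note: top_n keeps the sample_size rows with the largest
# visitor_id values, so this is not a random draw.
sample_visitors = propensity_rollup %>%
  top_n(sample_size, wt=visitor_id) %>%
  collect()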

# Throw out single hit visitors (these visitors
# often screw up a model and aren't valuable)
clean_sample = sample_visitors %>%
  filter(hit_count > 1) %>%
  select(-visitor_id)
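Because the single-hit filter runs after the sample is taken, the cleaned sample ends up with fewer rows than sample_size. If you would rather draw visitors at random and keep the full sample size, the idea looks something like the sketch below. It runs in plain R on simulated data, and the data frame and its columns are made up. On the Spark side, sparklyr's sdf_sample() is one option for the random draw.

```r
set.seed(42)  # fixed seed so the draw is reproducible

# Simulated stand-in for propensity_rollup (hypothetical columns)
n_visitors <- 1000
toy_rollup <- data.frame(
  visitor_id   = sprintf("visitor_%04d", seq_len(n_visitors)),
  hit_count    = rpois(n_visitors, lambda = 3) + 1,
  response_var = rbinom(n_visitors, size = 1, prob = 0.1)
)

sample_size <- 200

# Filter first so the sample keeps its full size after cleaning
multi_hit <- toy_rollup[toy_rollup$hit_count > 1, ]

# Draw row indices uniformly at random, without replacement
picked <- sample(nrow(multi_hit), size = sample_size)

# Keep the sampled rows and drop the ID column before modeling
clean_sample <- multi_hit[picked, names(multi_hit) != "visitor_id"]
nrow(clean_sample)
```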

Now, it’s time to build the propensity model:

# Train a propensity model
prop_model = glm(response_var ~ ., family=binomial(), 
     data=clean_sample)

You can see I’ve set my formula to predict the “response_var” I set up earlier from the rest of the columns in the dataset. I’ve also set the family parameter to “binomial”, which gives me a logistic regression.
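One way to convince yourself that this call does what you expect is to fit it on simulated data where the true relationship is known. The sketch below is not the visitor rollup: the two predictors and their coefficients are invented for the check.

```r
set.seed(1)
n <- 5000

# Two made-up visitor-level predictors
toy <- data.frame(
  hit_count   = rpois(n, lambda = 5) + 2,
  email_opens = rpois(n, lambda = 1)
)

# Known log-odds of converting; plogis() maps log-odds to a probability
true_log_odds <- -3 + 0.2 * toy$hit_count + 0.8 * toy$email_opens
toy$response_var <- rbinom(n, size = 1, prob = plogis(true_log_odds))

# Same call shape as above: "." means every other column is a predictor
toy_model <- glm(response_var ~ ., family = binomial(), data = toy)

# The estimates should land near the true values of -3, 0.2 and 0.8
print(round(coef(toy_model), 2))
```

If the estimates come out far from the values you simulated, that usually points to a problem in how the data frame was prepared rather than in glm itself.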

Once done, you can see how well the model did by running the following code. Interpreting the results is hard to do if you don’t have a statistical background, but there’s a great tutorial you can read here.

summary(prop_model)
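If the summary output is hard to read, two pieces are usually worth pulling out on their own: the coefficient table and the odds ratios. This is a small sketch on simulated data (the predictor and its effect are invented), not output from the real model.

```r
set.seed(2)
n <- 3000

# One made-up predictor with a known positive effect on conversion
toy <- data.frame(email_opens = rpois(n, lambda = 1))
toy$response_var <- rbinom(
  n, size = 1,
  prob = plogis(-2 + 0.7 * toy$email_opens)
)

toy_model <- glm(response_var ~ email_opens, family = binomial(), data = toy)

# Coefficient table: estimate, standard error, z value and p-value per term
coef_table <- coef(summary(toy_model))
print(coef_table)

# Exponentiating a coefficient gives an odds ratio: the multiplicative
# change in the odds of converting for a one-unit increase in the predictor
odds_ratios <- exp(coef(toy_model))
print(odds_ratios)
```

An odds ratio above 1 means the predictor is associated with higher odds of converting, and one below 1 means lower odds.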

With my model built, I can now apply it to the rest of the visitors in my dataset (which I’ve called “local_visitor_rollup” as per my previous blog post):

propensity_scores = round(100*predict(prop_model, 
     newdata=local_visitor_rollup, 
     type="response"))

This gives me a vector with one propensity score per visitor: the modeled probability, on a scale from 0 to 100, that the visitor converts on my success variable. Awesome. But, now what?
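The type="response" argument matters here: without it, predict returns log-odds rather than probabilities. The sketch below shows the difference on simulated data and keeps each score next to its visitor ID, which is what you need for the upload later. The IDs, predictors and the 500-row training split are all assumptions for illustration.

```r
set.seed(7)
n <- 2000

# Simulated visitors (hypothetical IDs and predictors)
visitors <- data.frame(
  visitor_id  = sprintf("visitor_%04d", seq_len(n)),
  hit_count   = rpois(n, lambda = 4) + 2,
  email_opens = rpois(n, lambda = 1)
)
visitors$response_var <- rbinom(
  n, size = 1,
  prob = plogis(-3 + 0.2 * visitors$hit_count + 0.8 * visitors$email_opens)
)

# Train on the first 500 rows, leaving the ID column out of the model
train <- visitors[1:500, names(visitors) != "visitor_id"]
toy_model <- glm(response_var ~ ., family = binomial(), data = train)

# type = "response" returns probabilities; the default returns log-odds
probs    <- predict(toy_model, newdata = visitors, type = "response")
log_odds <- predict(toy_model, newdata = visitors)

# Scale to 0-100 and keep each score next to its visitor ID
toy_mapping <- data.frame(
  visitor_id        = visitors$visitor_id,
  propensity_scores = round(100 * probs)
)
print(head(toy_mapping))
```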

Having a list of visitors that are likely to convert is only cool if I know what makes them uniquely marketable. Who are they? Where do I find them? What’s the best way to reach them? So, I’m going to show you two great things you can do with this list of propensity scores – 1) analyze them to find the key differentiating factors that set them apart and 2) target them directly by publishing my likely converters to a shared segment in the Marketing Cloud Audience Library.

To analyze these propensity scores in Adobe Analytics, I’m going to be following the steps I’ve outlined previously showing how to bring the results of an R model into Adobe Analytics. To do that, I’m going to take the propensity scores I just created and export them to a file that I can upload to Adobe’s Customer Attributes (after converting the visitor IDs to their hex versions as I’ve shown how to do in my previous post) to get a table like the one below. Note: if you’re already using another statistical model with Customer Attributes, you’ll have to manually merge the two modeled columns (e.g. “cluster_assignments” and “propensity_scores”) into a single table to upload against a single visitor ID which you can do with a simple bit of code like this:

out_file = merge(
  cluster_mapping,
  propensity_mapping,
  by = "aaid"
)

Using only the propensity scores would give me a table like this:

aaid	propensity_scores
291D990B85013FAB-40001603E02EF701	10
291D9CE30514BFD6-6000017700068E38	99
291D9D7C850100A0-4000010440206AD4	2
291D9F4485013A5C-4000160520238614	23
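Writing that table to disk could look something like the sketch below. The file name and location are hypothetical, and the tab delimiter is an assumption chosen to match the table above. This post does not say which delimiter the upload expects, so check that before relying on it.

```r
# The four rows from the table above
out_file <- data.frame(
  aaid = c(
    "291D990B85013FAB-40001603E02EF701",
    "291D9CE30514BFD6-6000017700068E38",
    "291D9D7C850100A0-4000010440206AD4",
    "291D9F4485013A5C-4000160520238614"
  ),
  propensity_scores = c(10, 99, 2, 23),
  stringsAsFactors = FALSE
)

# Hypothetical output location; point this wherever you stage uploads
upload_path <- file.path(tempdir(), "propensity_scores.tsv")

# No quoting and no row names, so the file holds only the two columns
write.table(out_file, file = upload_path, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back to confirm the header and rows survived the round trip;
# forcing aaid to character stops R from trying to parse the IDs
check <- read.delim(upload_path, colClasses = c("character", "numeric"))
print(check)
```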

After uploading the file to the Customer Attributes UI, it takes a few hours for the new propensity scores to propagate down into Analytics. Once it’s there, you can start reporting on the propensity scores directly! The first thing I like to look at is a histogram of the results. To do that, just use the “propensity_scores” Customer Attributes dimension with the unique visitors metric and sort the dimension alphabetically:
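You can also sanity-check the distribution locally while waiting on the upload. This is a minimal sketch that uses simulated scores in place of the real propensity_scores vector, with ten-point bins chosen arbitrarily.

```r
set.seed(3)

# Simulated 0-100 scores, skewed toward low values
toy_scores <- round(100 * rbeta(5000, shape1 = 0.5, shape2 = 2))

# Count visitors per ten-point bin without drawing the plot.
# Bins are right-closed, so a score of 10 falls in "0-10".
bins <- hist(toy_scores, breaks = seq(0, 100, by = 10), plot = FALSE)
score_counts <- setNames(
  bins$counts,
  paste0(head(bins$breaks, -1), "-", tail(bins$breaks, -1))
)
print(score_counts)

# Uncomment to draw the histogram itself
# hist(toy_scores, breaks = seq(0, 100, by = 10),
#      main = "Propensity scores", xlab = "Score")
```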

Pretty cool, but back to my original point: I know which visitors are most likely to convert, but what makes them uniquely marketable? If you’re statistically savvy, you could look at the coefficients of the logistic regression model to see which input columns were most predictive of success, but it’s very likely that I missed many important variables or eVar values in my visitor rollup, so I’m going to do something even better. To best answer that question, I’m going to leverage my favorite feature of Analysis Workspace: Segment Compare. To use Segment Compare with my propensity model, I need to first create the segment that I want to better understand – my likely converters (those who had a 75% chance or greater of converting):

From there, I want to understand what makes visitors with a high conversion probability statistically different from everyone else. To do this, I’ll drag the segment I’ve just created into the Segment Comparison Panel:

The results have always shown me something interesting (like the following):

Looking at the results, it’s pretty easy to see the things that differentiate this high propensity segment. Visitors that are more than 75% likely to convert:

are 35 times more likely to open an email from our company,
are 30 times more likely to view our content on a mobile device,
tend to fall outside the low value cluster from my previous cluster analysis,
fall primarily into the 25-34 age group,
are more than twice as likely to live in New York City and visit at 9:00 AM,
visited my site or app more than 10 times in the last 30 days,
and are 20% more likely to be a Gold Customer.
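Segment Compare does this work inside Analysis Workspace, and its actual method is not described here. As a rough local analogue, a statement like "N times more likely" boils down to a ratio of rates between the segment and everyone else. The sketch below shows that arithmetic on simulated data. The opened_email attribute and the rates behind it are invented.

```r
set.seed(11)
n <- 5000

# Simulated scores, then the same 75-or-higher cutoff used for the segment
toy <- data.frame(
  propensity_scores = round(100 * rbeta(n, shape1 = 0.5, shape2 = 2))
)
toy$likely_converter <- toy$propensity_scores >= 75

# Made-up attribute that is more common among likely converters
open_prob <- ifelse(toy$likely_converter, 0.6, 0.05)
toy$opened_email <- rbinom(n, size = 1, prob = open_prob) == 1

# Share of each group that has the attribute
rate_segment <- mean(toy$opened_email[toy$likely_converter])
rate_others  <- mean(toy$opened_email[!toy$likely_converter])

# "Times more likely" is the ratio of the two rates
lift <- rate_segment / rate_others
print(round(c(segment = rate_segment, others = rate_others, lift = lift), 2))
```

A ratio like this says nothing about statistical significance on its own, so small segments can produce large ratios by chance.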

If you were to ask me which marketing channels, demographics, or behavioral attributes actually matter to my conversion, I now have a relevant and statistically based answer! If I’m trying to build a marketing strategy for my potential customers, I know which marketing channel to reach them through, what type of device I need to optimize for, the geography where they live, and the type of customer they’ve been in the past. That is far and away better than making big assumptions about who I think matters to my conversion.

Ok, last point – sharing this segment of likely converters directly to other Adobe solutions is pretty easy. All you have to do is check the box to publish the segment to the Marketing Cloud Audience Library, and you can now start advertising directly to them across Adobe’s solution set:

In conclusion, don’t ever do blind targeting to a demographic you don’t really even understand – instead, use a more statistically based approach to figure out what really matters and start targeting that instead!

Trevor Paulsen

Trevor is a group product manager for Adobe's Customer Journey Analytics (CJA). With a background in aerospace engineering and robotics, he has a strong foundation in estimation theory and data mining. Before leading Adobe's data science consulting team, Trevor used these skills to drive innovation in the fields of aerospace and robotics. When he's not working, Trevor enjoys engaging in big data projects and statistical analyses as a hobby. He is also a father of five and enjoys bike rides and music. All views expressed are his own.