Clustering Your Customers Using Adobe Analytics Data Feeds and R

Follow @trevorwithdata

Theodore Levitt was a Harvard economist famous for his definition of corporate purpose, which he proposed was not merely making a profit but creating and keeping customers. One of my favorite quotes comes from his book, The Marketing Imagination, in which Levitt says, “If you’re not thinking segments, you’re not thinking.” Nowadays, digital marketers collect so many data points that it is quite difficult to answer simple questions like, “What types of people are visiting my site?” or “What makes these types of visitors unique, and how do I optimize my business around them?” Segmentation is at the heart of marketing, yet it has become one of the most difficult things to do effectively.

Luckily, machine learning techniques can help us overcome these obstacles. In this post, I’m going to walk you through how to use Adobe Analytics Data Feeds with R to cluster your customers and find the most meaningful segments for marketing. To start, you’ll want to review my post on how to create visitor level aggregations using Data Feeds and R. If you’re wondering which variables or metrics to include in your visitor rollup, I usually err on the side of including anything that is 1) important to my business and 2) potentially differentiating for my customer base.

There are lots of different clustering algorithms out there – each with their own strengths and weaknesses – so this approach is certainly not the only approach you can take, but it is one I’ve used many times and found to be useful in a marketing context. To start, I’ll assume you have a sparklyr object called “visitor_rollup”. We’re going to select just a sample of visitors to build our model with, then apply the model to the entire dataset.

library(sparklyr)
library(dplyr) # provides %>%, top_n(), filter(), select(), mutate()

# Grab a sample of visitors to build a cluster model
sample_size = 10000
sample_visitors = visitor_rollup %>%
 top_n(sample_size, wt=visitor_id) %>%
 collect()

Once we’ve got the 10,000 visitor sample in our new R object “sample_visitors”, we’re ready to start modeling. First, I like to throw out any visitors that had only a single hit or row in the data – they don’t really add any value to my site and oftentimes gunk up the model. To throw them out, just use a simple filter on hit_count, and remove the visitor_id column since it doesn’t contain useful data for clustering.

# Throw out single hit visitors
clean_sample = sample_visitors %>%
 filter(hit_count > 1) %>%
 select(-visitor_id)

Next, I’m going to use a dimensionality reduction technique called principal component analysis (PCA). PCA is especially useful if you have lots of variables in your visitor rollup. At a high level, it essentially compresses all of your data into a series of uncorrelated pseudo-measurements. These pseudo-measurements are not useful for humans to read, but they really help the clustering algorithm work more effectively.
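To make the "uncorrelated pseudo-measurements" idea concrete, here is a minimal sketch on made-up data. The toy columns and their values are hypothetical and only loosely mimic a visitor rollup; the one thing it relies on is base R's prcomp, which centers each column by default.

```r
# Toy data: three correlated columns (hypothetical, not a real rollup)
set.seed(42)
n <- 500
visits <- rexp(n, rate = 0.5)
toy <- data.frame(
  visits = visits,
  hit_count = 10 * visits + rnorm(n),
  days_visited = 0.5 * visits + rnorm(n)
)

pca <- prcomp(toy)   # centers each column by default, no scaling
scores <- pca$x      # the pseudo-measurements, one column per component

# The original columns are strongly correlated...
round(cor(toy), 2)
# ...while the component scores are not (off-diagonals are ~0)
round(cor(scores), 2)

# The scores are the centered data multiplied by the rotation matrix
manual <- scale(toy, center = pca$center, scale = FALSE) %*% pca$rotation
max_diff <- max(abs(manual - scores))
```

Multiplying the raw, uncentered data by the rotation matrix instead only shifts every row by the same constant vector, so the distances between visitors are unchanged as long as the same product is applied to both the sample and the full dataset.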

# Principal Component Analysis
components = prcomp(clean_sample)
rotation_matrix = components$rotation
percents = components$sdev^2/sum(components$sdev^2)

# View the percentages to pick the number of components to use
# then apply the rotation matrix to the sample dataset
percents
pseudo_vars = 3
pseudo_observations = as.data.frame(as.matrix(clean_sample) %*% 
 as.matrix(rotation_matrix[,1:pseudo_vars]))

You’ll notice I’ve stored the resulting “rotation_matrix” which will allow me to apply PCA to my giant dataset later. I recommend looking at the “percents” variable – that’s a way to see what percent of the total variance has been encapsulated by each pseudo-measurement. I typically like to include enough pseudo-variables to capture ~95% of all variance. For digital analytics, that’s typically 2 or 3 pseudo-variables. Finally, you can see I’ve applied the rotation matrix to my clean sample to generate a set of new data that I’ll use for clustering.
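If you would rather not eyeball the percentages, the ~95% rule can be sketched in a couple of lines. The variance shares below are hypothetical placeholders standing in for a real "percents" vector, and "target" is a name I made up for the threshold.

```r
# Hypothetical per-component variance shares, standing in for `percents`
percents <- c(0.81, 0.11, 0.05, 0.02, 0.01)

# Running total of variance captured as components are added
cumulative <- cumsum(percents)

# Smallest number of components whose cumulative share reaches the target
target <- 0.95
pseudo_vars <- which(cumulative >= target)[1]
pseudo_vars
```

With these made-up shares, the first two components capture 92% of the variance and the third pushes the total past the target, so three pseudo-variables are kept.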

To do the actual clustering, I like the “mclust” package in R. It fits Gaussian mixture models using the Expectation Maximization (EM) algorithm, which I greatly prefer over k-means for this use case. K-means is great for many applications, but it has some pretty severe problems when applied to sparse or highly skewed datasets, which digital analytics data certainly is.
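For intuition about what EM is doing, here is a stripped-down sketch for a two-cluster, one-dimensional Gaussian mixture in base R. It is only an illustration under simplifying assumptions (simulated data, arbitrary starting values, a fixed number of iterations) and is not how mclust is implemented; mclust handles multivariate data and several covariance structures.

```r
# Simulated 1-D data: a big tight group and a small spread-out group
set.seed(1)
x <- c(rnorm(300, mean = 0, sd = 1), rnorm(100, mean = 6, sd = 1.5))

# Arbitrary starting guesses
mix <- c(0.5, 0.5)                            # mixing proportions
mu <- as.numeric(quantile(x, c(0.25, 0.75)))  # cluster means
sigma <- rep(sd(x), 2)                        # cluster standard deviations

for (iter in 1:200) {
  # E-step: how responsible is each cluster for each point?
  dens <- cbind(mix[1] * dnorm(x, mu[1], sigma[1]),
                mix[2] * dnorm(x, mu[2], sigma[2]))
  resp <- dens / rowSums(dens)

  # M-step: re-estimate each cluster from its weighted points
  nk <- colSums(resp)
  mix <- nk / length(x)
  mu <- colSums(resp * x) / nk
  dev2 <- cbind((x - mu[1])^2, (x - mu[2])^2)
  sigma <- sqrt(colSums(resp * dev2) / nk)
}

# Hard assignment: the more likely cluster for each point
assignment <- ifelse(resp[, 2] > 0.5, 2, 1)
round(c(mix = mix, mu = mu, sigma = sigma), 2)
```

Unlike k-means, each cluster gets its own size and spread, and every point carries a probability of membership rather than just a label, which is part of why the approach copes better with lopsided data.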

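# Select the maximum number of clusters you want and cluster the sample.
# G=1:cluster_count lets Mclust try every count from 1 up to cluster_count
# and keep the best by BIC; use G=cluster_count to force exactly that many.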
library(mclust)
cluster_count = 4
cluster_model = Mclust(pseudo_observations, G=1:cluster_count)

This should take a few minutes to run, but when it’s done, the cluster_model object will store the results I need to apply to the entire dataset. To get a look at how the model did, I like to inspect the percent of visitors it put into each cluster:

> table(cluster_model$classification) / length(cluster_model$classification)

         1          2          3          4 
0.40645329 0.32500651 0.20491803 0.06362217

You can see I’ve got one cluster with about 40% of visitors on down to a cluster with about 6% of visitors. This is a pretty common distribution I’ve seen when clustering digital analytics data. Generally, a cluster with fewer than, say, 5% of visitors isn’t going to be super useful for marketing purposes: no one wants to build campaigns around a measly 5% of the customer base unless that 5% drives an outsized share (say, more than 50%) of your business value.

Now that we’ve got a decent model, let’s apply it to the entire rollup:

# Prepping to apply cluster model to entire dataset
# (local_visitor_rollup is assumed to be a local, collected copy of visitor_rollup)
clean_rollup = local_visitor_rollup %>%
 select(-visitor_id)

# Transform total dataset into principal components
transformed_rollup = as.data.frame(as.matrix(clean_rollup) %*% 
 as.matrix(rotation_matrix[,1:pseudo_vars]))

# Apply cluster model to total dataset
model_output = predict(cluster_model, newdata=transformed_rollup)
cluster_assignments = as.numeric(as.character(model_output$classification))
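The character-then-numeric cast in the last line is the standard way to protect against one specific R gotcha: labels stored as a factor. Whether the classification actually comes back as a factor in your mclust version is something to check rather than assume; the toy vector below is hypothetical and only shows why the two casts can differ.

```r
# A factor whose labels are cluster numbers but whose level order differs
labels <- factor(c("4", "1", "3", "1"), levels = c("4", "3", "1"))

# Direct cast returns the internal level positions, not the labels
direct <- as.numeric(labels)                  # 1 3 2 3

# Going through character first recovers the labels themselves
via_char <- as.numeric(as.character(labels))  # 4 1 3 1
```

On a vector that is already numeric, the double cast is harmless, so it is a cheap safeguard either way.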

Sometimes the model_output will contain spurious values for a handful of visitors, so I cast the output to character, then back to numeric to fix it. Just one of those little R nuances. With the cluster assignments in hand, I’m ready to take a look at the results:

# View cluster assignment distribution
> table(cluster_assignments) / sum(table(cluster_assignments))
cluster_assignments
         1          2          3          4 
0.52231912 0.19362601 0.19015616 0.09389871
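One rough way to put a number on how far the full-dataset distribution moved from the sample is to compare the two sets of shares printed in this post. This is only a sketch; the variable names are mine, and total variation distance is just one reasonable choice of summary.

```r
# Cluster shares printed earlier: sample model vs. full dataset
sample_share <- c(0.40645329, 0.32500651, 0.20491803, 0.06362217)
full_share <- c(0.52231912, 0.19362601, 0.19015616, 0.09389871)

# Per-cluster change in share (positive = cluster grew on the full dataset)
shift <- full_share - sample_share
round(shift, 3)

# Total variation distance: half the sum of absolute differences
# (0 = identical distributions, 1 = completely different)
tvd <- sum(abs(shift)) / 2
round(tvd, 3)
```

Keep in mind that the sample had single-hit visitors filtered out while the full rollup scored here did not, so some growth in the first cluster is plausibly expected rather than a sign of a bad model.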

Looks like the model applied fairly well to the entire dataset since the distribution is still pretty close to the original, so I think I’ll keep it. To add the cluster assignments to my original rollup, just use a mutate:

# Put cluster assignments into the visitor rollup
local_visitor_rollup = local_visitor_rollup %>%
 mutate(cluster = cluster_assignments)

Now that I’ve added the cluster assignments to my rollup, I can do some pretty basic analysis on my original data to really get a feeling for the different types of visitors on my site:

# Summarize clusters
summary_table = local_visitor_rollup %>%
 group_by(cluster) %>%
 summarize(
   population = n(),
   median_hit_count = median(hit_count),
   median_visits = median(visits),
   median_event4s = median(event4s_triggered),
   median_days_visited = median(days_visited),
   percent_of_population = n() / nrow(local_visitor_rollup),
   percent_of_visits = sum(visits) / sum(local_visitor_rollup$visits),
   percent_of_event4s = sum(event4s_triggered) / sum(local_visitor_rollup$event4s_triggered),
   percent_of_hp_visits = sum(visits_to_homepage) / sum(local_visitor_rollup$visits_to_homepage)
 )

View(t(summary_table))

The results should look something like this (I’ve formatted them a little bit to make it easier to read):

cluster	1	2	3	4
population	164,530	60,992	59,899	29,578
median_hit_count	8 hits	29 hits	95 hits	442 hits
median_visits	1 visit	2 visits	4 visits	19 visits
median_event4s	1 event4	6 event4s	29 event4s	123 event4s
median_days_visited	1 day	1 day	3 days	10 days
percent_of_population	52%	19%	19%	9%
percent_of_visits	12%	8%	24%	56%
percent_of_event4s	2%	5%	24%	68%
percent_of_hp_visits	5%	7%	23%	65%

This is useful information: I can now see that about 9% of my visitors make up 56% of my total visits, 68% of my event4 count, and 65% of the visits to my homepage, which is certainly an interesting segment of users to dive into further. Most people don’t realize that so much of the value on their sites comes from so few visitors, and that matters when deciding where to focus marketing effort.
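A quick way to quantify that concentration is to divide each cluster's share of a metric by its share of the population. The sketch below retypes the rounded percentages from the summary table above; the "index" column names are my own invention.

```r
# Rounded shares copied from the summary table
clusters <- data.frame(
  cluster = 1:4,
  percent_of_population = c(0.52, 0.19, 0.19, 0.09),
  percent_of_visits = c(0.12, 0.08, 0.24, 0.56),
  percent_of_event4s = c(0.02, 0.05, 0.24, 0.68)
)

# Index > 1 means the cluster is over-represented relative to its size
clusters$visit_index <- clusters$percent_of_visits / clusters$percent_of_population
clusters$event4_index <- clusters$percent_of_event4s / clusters$percent_of_population

round(clusters[, c("cluster", "visit_index", "event4_index")], 2)
```

Cluster 4 accounts for roughly six times as many visits as its headcount alone would suggest, while cluster 1 sits well below 1 on both measures.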

As cool as this summary table is, you may be thinking – “This is great and all, but how do I get this back into Analytics to dive in deeper?” or “How do I share these results with our other marketing tools?” Well, stay tuned – in an upcoming blog post I’ll show you how you can actually insert these cluster results into your Adobe Analytics reporting using the Customer Attributes feature. In the meantime, good luck and I hope you find some interesting insights about your customers!

Update: You can read about how to bring these results back into Adobe Analytics’ native reporting here.


Trevor Paulsen

Trevor is a group product manager for Adobe's Customer Journey Analytics (CJA). With a background in aerospace engineering and robotics, he has a strong foundation in estimation theory and data mining. Before leading Adobe's data science consulting team, Trevor used these skills to drive innovation in the fields of aerospace and robotics. When he's not working, Trevor enjoys engaging in big data projects and statistical analyses as a hobby. He is also a father of five and enjoys bike rides and music. All views expressed are his own.