One of the hottest topics in the digital marketing space has always been marketing attribution. If you’re unfamiliar with this problem space, (I’d be surprised, but) there are lots of excellent explanations out there including this one. In a nutshell, companies have a lot of marketing outlets – search, display ads, social networks, email, and the list goes on. The attribution problem is how you assign credit for a business outcome to the different marketing outlets your brand employs given that your customers may see any combination of marketing channels in their customer journey.
There are lots of tools out there designed to allow you to slice and dice credit based on a set of logical rules, but as an avid user of R and Apache Spark, I wanted to illustrate how you can do this yourself. In fact, you can go above and beyond what most traditional multi-touch attribution vendors have to offer if you want to.
To illustrate, I’m going to walk you through a multi-touch attribution analysis and show you just how easy it is to create custom multi-touch attribution models. I also refer you to my previous guides on how to get Adobe Analytics Data Feeds, how to setup sparklyr (a dplyr interface between R and Apache Spark), and how to download and use a classification file in R.
First, let’s load up a bunch of Data Feed data:
library(dplyr) library(sparklyr) setwd("/data") # Connect to my Spark instance sc = spark_connect(master="local", version="2.2.0", config=sconfig) # Load the data feed files data_feed_local = spark_read_csv( sc=sc, name="data_feed", path="*-reportsuite_2017-*.tsv.gz", header=FALSE, delimiter="\t" ) # Load the campaign classification file campaign_class_local = spark_read_csv( sc = sc, name = "campaign_class", path = "SC_EXPORT_campaigns_classifications_reportsuite.tab", header = TRUE, delimiter = "\t" )
With the data loaded, I’m going to do a little cleanup that will make my life easier later. First, I’m going to merge the visitor id high and low into a single field, and I’m going to add a leading and trailing “,” to the event list which makes it easier to search for specific Analytics events such as orders. Then, I’ll give some friendly names to the data feed columns based on the “column_headers.tsv” file that comes with the data feed (for example, the fourth column maps to the value “V4” and contains the timestamp of a hit in my files):
data_feed_tbl = data_feed_local %>% mutate( merged_visitor_ids = paste0(V1,"_",V2), fixed_event_list = paste0(",",V5,",") ) %>% select( visitor_id = merged_visitor_ids, hit_time_gmt = V4, post_event_list = fixed_event_list, post_campaign = V7, )
Now, we’ll add in the campaign classification using a left join:
data_feed_tbl = data_feed_tbl %>% left_join(campaign_class_local, by=c("post_campaign"="Key"))
And for our final data preparation step, we’re going to create two new columns called “mid_campaign” and “conversion.” The mid_campaign field is a copy of a campaign value, but only at the time it was set – specifically not including all of the persistence logic that Adobe Analytics applies to the data during collection. Creating this unpersisted value is important because we want to apply custom attribution logic to the data rather than using the previously configured settings in Adobe Analytics. To accomplish this, we’ll create a new variable containing the value of a marketing campaign at the time it was set, which is signified by a value of “20” in the event list – the instance event for the campaign variable.
The conversion column I’ve created will signify the conversion event I want to attribute – in this case orders. Orders appear in the post_event_list with the value “1” (you’ll notice the leading and trailing commas we added earlier make it easier for me to search for the order event).
data_feed_tbl = data_feed_tbl %>% mutate( # My classification file has "Marketing_Channel" as its lookup column # The %regexp% operator is an easy way to search for a string value mid_campaign = ifelse(post_event_list %regexp% ",20,", Marketing_Channel, NA), conversion = ifelse(post_event_list %regexp% ",1,", 1, 0) )
The next part is the tricky part – grouping the sequential campaign touchpoints resulting in an order. Unfortunately, that’s not something I can easily do using just the visitor id and the conversion variable I created earlier, so to group by these order-completing sequences I’m going to use a couple of the windowing functions that sparklyr provides:
# Constructing order sequences for all visitors data_feed_tbl = data_feed_tbl %>% # Start by grouping by visitor id and sorting by time group_by(visitor_id) %>% arrange(hit_time_gmt) %>% # Next I'll copy the "conversion" column to a new "order_seq" column mutate(order_seq = ifelse(conversion > 0, 1, NA)) %>% # Using lag, I'll shift the conversions down exactly one row # Using cumsum, I'm going to carry forward the conversion until # I see another order, in which case, I'll increment order_seq mutate(order_seq = lag(cumsum(ifelse(is.na(order_seq), 0, order_seq)))) %>% # I'll clean up the first row (which lag left as NA) setting it # to "-1" if it had an order, and 0 otherwise mutate(order_seq = ifelse((row_number() == 1) & (conversion > 0), -1, ifelse(row_number() == 1, 0, order_seq))) %>% # With my new order_seq variable, I can now group by it group_by(visitor_id, order_seq) %>% mutate( # Finally, I'll blank out (with NAs) any sequences that # didn't actually contain an order order_seq = ifelse(sum(conversion)>0, order_seq, NA) # Ungrouping to get back to the original dataset ) %>% ungroup() %>% ungroup()
If this bit didn’t make sense, here’s what I just did in table form. Notice the “order_seq” column now has groupings of hits that occurred leading up to and including my conversion event – perfect for the attribution modeling I’ll do next:
To begin the actual attribution modeling, I’m going to create a new table called “attributable_sequences” containing only the marketing touch points of interest:
attributable_sequences = data_feed_tbl %>% filter(!is.na(mid_campaign) & !is.na(order_seq))
All of my attribution rules will follow the same basic outline, with only slight modifications to the actual aggregation. To illustrate the basic idea, I’ll start by grouping by visitor_id and order_seq, run the attribution logic of my choice, then aggregate the fractional orders against each marketing channel:
some_touch_orders = attributable_sequences %>% group_by(visitor_id, order_seq) %>% mutate( order_participation = <Attribution Logic Goes Here> ) %>% ungroup() %>% group_by(mid_campaign) %>% summarize(orders = sum(order_participation)) %>% collect()
For linear attribution, credit is divided equally among all touchpoints which I accomplish with the following:
order_participation = 1/n()
U shaped attribution (40% credit to first and last touches, with 20% given to the middle touchpoints) is accomplished with a few ifelse statements:
order_participation = ifelse(n() == 1, 1, ifelse(n() == 2, 0.5, ifelse(hit_time_gmt == max(hit_time_gmt), 0.4, ifelse(hit_time_gmt == min(hit_time_gmt), 0.4, 0.2/(n()-2) ) ) ) )
For a fancier half-life based time decay model where a channel’s credit decays by half every seven days, I can use an exponential decay formula, where the “half_life” value is 7*24*60*60 (7 days times 24 hours times 60 minutes time 60 seconds).
order_participation = 0.5^((hit_time_gmt - min(hit_time_gmt))*1/half_life), order_participation = order_participation/sum(order_participation)
For your basic first and last touch models, the standard first and last functions of dplyr that I would use surprisingly don’t yet work in sparklyr as of the time of this writing, so I had to tweak it a little bit using the top_n function:
last_touch_orders = attributable_sequences %>% group_by(visitor_id, order_seq) %>% top_n(1,hit_time_gmt) %>% select(visitor_id, order_seq, mid_campaign) %>% ungroup() %>% group_by(mid_campaign) %>% summarize(orders = n()) %>% collect()
For first touch, you can use the same code but merely modify hit_time_gmt in the top_n function to -hit_time_gmt.
Finally, no multi-touch attribution analysis would be complete without comparing the results of all the models to each other. To do that, you can use the merge function to join all of the tables created for each attribution model:
model_comparison = list( last_touch_orders, first_touch_orders, linear_orders, half_life_orders, u_shaped_orders) %>% Reduce(function(...) merge(..., all=TRUE, by="mid_campaign"), .) View(model_comparison)
And, here’s what I end up with:
|Marketing Channel||Last Touch||First Touch||Linear||Half Life||U-Shaped|
As you can see (and as is usually the case with rules based attribution models) the most striking differences occur between first and last touch – however, each model allows you to see the unique performance of each marketing channel in the overall customer journey. In the following post, I’ll show you how to be even smarter about your attribution modeling – moving beyond arbitrary rules-based logic into something statistically based.
Update: You can now read my next post here!