Multi-Touch Attribution Using Adobe Analytics Data Feeds and R

Follow @trevorwithdata

One of the hottest topics in the digital marketing space has always been marketing attribution. If you’re unfamiliar with this problem space, (I’d be surprised, but) there are lots of excellent explanations out there including this one. In a nutshell, companies have a lot of marketing outlets – search, display ads, social networks, email, and the list goes on. The attribution problem is how you assign credit for a business outcome to the different marketing outlets your brand employs given that your customers may see any combination of marketing channels in their customer journey.

There are lots of tools out there designed to allow you to slice and dice credit based on a set of logical rules, but as an avid user of R and Apache Spark, I wanted to illustrate how you can do this yourself. In fact, you can go above and beyond what most traditional multi-touch attribution vendors have to offer if you want to.

To illustrate, I’m going to walk you through a multi-touch attribution analysis and show you just how easy it is to create custom multi-touch attribution models. I also refer you to my previous guides on how to get Adobe Analytics Data Feeds, how to setup sparklyr (a dplyr interface between R and Apache Spark), and how to download and use a classification file in R.

First, let’s load up a bunch of Data Feed data:

library(dplyr)
library(sparklyr)
setwd("/data")

# Connect to my Spark instance
sc = spark_connect(master="local", version="2.2.0", config=sconfig)

# Load the data feed files
data_feed_local = spark_read_csv(
  sc=sc, 
  name="data_feed", 
  path="*-reportsuite_2017-*.tsv.gz", 
  header=FALSE, 
  delimiter="\t"
)

# Load the campaign classification file
campaign_class_local = spark_read_csv(
  sc = sc, 
  name = "campaign_class", 
  path = "SC_EXPORT_campaigns_classifications_reportsuite.tab", 
  header = TRUE, 
  delimiter = "\t"
)

With the data loaded, I’m going to do a little cleanup that will make my life easier later. First, I’m going to merge the visitor id high and low into a single field, and I’m going to add a leading and trailing “,” to the event list which makes it easier to search for specific Analytics events such as orders. Then, I’ll give some friendly names to the data feed columns based on the “column_headers.tsv” file that comes with the data feed (for example, the fourth column maps to the value “V4” and contains the timestamp of a hit in my files):

data_feed_tbl = data_feed_local %>%
 mutate(
   merged_visitor_ids = paste0(V1,"_",V2),
   fixed_event_list = paste0(",",V5,",")
 ) %>%
 select(
   visitor_id = merged_visitor_ids,
   hit_time_gmt = V4,
   post_event_list = fixed_event_list,
   post_campaign = V7,
 )

Now, we’ll add in the campaign classification using a left join:

data_feed_tbl = data_feed_tbl %>%
 left_join(campaign_class_local, by=c("post_campaign"="Key"))

And for our final data preparation step, we’re going to create two new columns called “mid_campaign” and “conversion.” The mid_campaign field is a copy of a campaign value, but only at the time it was set – specifically not including all of the persistence logic that Adobe Analytics applies to the data during collection. Creating this unpersisted value is important because we want to apply custom attribution logic to the data rather than using the previously configured settings in Adobe Analytics. To accomplish this, we’ll create a new variable containing the value of a marketing campaign at the time it was set, which is signified by a value of “20” in the event list – the instance event for the campaign variable.

The conversion column I’ve created will signify the conversion event I want to attribute – in this case orders. Orders appear in the post_event_list with the value “1” (you’ll notice the leading and trailing commas we added earlier make it easier for me to search for the order event).

data_feed_tbl = data_feed_tbl %>%
  mutate(
    # My classification file has "Marketing_Channel" as its lookup column
    # The %regexp% operator is an easy way to search for a string value

    mid_campaign = ifelse(post_event_list %regexp% ",20,", 
      Marketing_Channel, NA),
    conversion = ifelse(post_event_list %regexp% ",1,", 1, 0)
  )

The next part is the tricky part – grouping the sequential campaign touchpoints resulting in an order. Unfortunately, that’s not something I can easily do using just the visitor id and the conversion variable I created earlier, so to group by these order-completing sequences I’m going to use a couple of the windowing functions that sparklyr provides:

# Constructing order sequences for all visitors
data_feed_tbl = data_feed_tbl %>%
  
  # Start by grouping by visitor id and sorting by time
  group_by(visitor_id) %>%
  arrange(hit_time_gmt) %>%
  
  # Next I'll copy the "conversion" column to a new "order_seq" column
  mutate(order_seq = ifelse(conversion > 0, 1, NA)) %>%

  # Using lag, I'll shift the conversions down exactly one row
  # Using cumsum, I'm going to carry forward the conversion until
  # I see another order, in which case, I'll increment order_seq
  mutate(order_seq = lag(cumsum(ifelse(is.na(order_seq), 0, order_seq)))) %>%
  
  # I'll clean up the first row (which lag left as NA) setting it
  # to "-1" if it had an order, and 0 otherwise
  mutate(order_seq = ifelse((row_number() == 1) & (conversion > 0), 
    -1, ifelse(row_number() == 1, 0, order_seq))) %>%
  
  # With my new order_seq variable, I can now group by it
  group_by(visitor_id, order_seq) %>%
  mutate(
    
    # Finally, I'll blank out (with NAs) any sequences that
    # didn't actually contain an order 
    order_seq = ifelse(sum(conversion)>0, order_seq, NA)

  # Ungrouping to get back to the original dataset
  ) %>% ungroup() %>% ungroup()

If this bit didn’t make sense, here’s what I just did in table form. Notice the “order_seq” column now has groupings of hits that occurred leading up to and including my conversion event – perfect for the attribution modeling I’ll do next:

hit_time_gmt	conversion	order_seq
1	NA	0
2	NA	0
3	NA	0
4	1	0
5	NA	1
6	NA	1
7	1	1
8	NA	2
9	1	2

To begin the actual attribution modeling, I’m going to create a new table called “attributable_sequences” containing only the marketing touch points of interest:

attributable_sequences = data_feed_tbl %>%
 filter(!is.na(mid_campaign) & !is.na(order_seq))

All of my attribution rules will follow the same basic outline, with only slight modifications to the actual aggregation. To illustrate the basic idea, I’ll start by grouping by visitor_id and order_seq, run the attribution logic of my choice, then aggregate the fractional orders against each marketing channel:

some_touch_orders = attributable_sequences %>%
  group_by(visitor_id, order_seq) %>%
  mutate(
    order_participation = <Attribution Logic Goes Here>
  ) %>%
  ungroup() %>%
  group_by(mid_campaign) %>%
  summarize(orders = sum(order_participation)) %>% collect()

For linear attribution, credit is divided equally among all touchpoints which I accomplish with the following:

order_participation = 1/n()

U shaped attribution (40% credit to first and last touches, with 20% given to the middle touchpoints) is accomplished with a few ifelse statements:

order_participation = 
  ifelse(n() == 1, 1,
    ifelse(n() == 2, 0.5,
      ifelse(hit_time_gmt == max(hit_time_gmt), 0.4,
        ifelse(hit_time_gmt == min(hit_time_gmt), 0.4,
          0.2/(n()-2)
        )
      )
    )
  )

For a fancier half-life based time decay model where a channel’s credit decays by half every seven days, I can use an exponential decay formula, where the “half_life” value is 7*24*60*60 (7 days times 24 hours times 60 minutes time 60 seconds).

order_participation = 0.5^((hit_time_gmt - min(hit_time_gmt))*1/half_life),
order_participation = order_participation/sum(order_participation)

For your basic first and last touch models, the standard first and last functions of dplyr that I would use surprisingly don’t yet work in sparklyr as of the time of this writing, so I had to tweak it a little bit using the top_n function:

last_touch_orders = attributable_sequences %>%
 group_by(visitor_id, order_seq) %>%
 top_n(1,hit_time_gmt) %>%
 select(visitor_id, order_seq, mid_campaign) %>%
 ungroup() %>%
 group_by(mid_campaign) %>%
 summarize(orders = n()) %>% collect()

For first touch, you can use the same code but merely modify hit_time_gmt in the top_n function to -hit_time_gmt.

Finally, no multi-touch attribution analysis would be complete without comparing the results of all the models to each other. To do that, you can use the merge function to join all of the tables created for each attribution model:

model_comparison = list(
  last_touch_orders, 
  first_touch_orders, 
  linear_orders, 
  half_life_orders, 
  u_shaped_orders) %>%
Reduce(function(...) merge(..., all=TRUE, by="mid_campaign"), .)
View(model_comparison)

And, here’s what I end up with:

Marketing Channel	Last Touch	First Touch	Linear	Half Life	U-Shaped
Email	51,790	54,638	53,598	53,836	53,343
Natural Search	17,766	19,310	18,543	18,755	18,544
Affiliates	12,528	7,713	9,685	9,264	9,963
Paid Search	5,805	6,154	5,987	5,942	5,982
Display	182	234	226	236	215
Social Media	165	184	175	179	175

As you can see (and as is usually the case with rules based attribution models) the most striking differences occur between first and last touch – however, each model allows you to see the unique performance of each marketing channel in the overall customer journey. In the following post, I’ll show you how to be even smarter about your attribution modeling – moving beyond arbitrary rules-based logic into something statistically based.

Update: You can now read my next post here!

Follow @trevorwithdata

Trevor Paulsen

Trevor is a group product manager for Adobe's Customer Journey Analytics (CJA). With a background in aerospace engineering and robotics, he has a strong foundation in estimation theory and data mining. Before leading Adobe's data science consulting team, Trevor used these skills to drive innovation in the fields of aerospace and robotics. When he's not working, Trevor enjoys engaging in big data projects and statistical analyses as a hobby. He is also a father of five and enjoys bike rides and music. All views expressed are his own.

2 thoughts to “Multi-Touch Attribution Using Adobe Analytics Data Feeds and R”

Mabelly T. says:

February 2, 2018 at 10:56 am

Hi Trevor,

Great article. Since I don’t have access to Adobe, I was wondering if you could provide sample data sets or some sort of template for the very first files.

Thanks.
sara says:

February 15, 2018 at 9:07 am

Hi Trevor, very interesting article!
I would really appreciate if you could share a template of the dataset from which you start the analysis
Many thanks,

Comments are closed.