Three DB SQL’s walk into a NOSQL bar. A little while later… they all walked out, because they couldn’t find a TABLE…
Joking aside, online marketers frequently use analytics tools like Adobe Analytics, but find that the granularity and accessibility of the data in the tool doesn’t meet their needs.
A few examples:
- Loading Adobe Analytics data into a BI reporting tool like Tableau
- Maintaining an on-premise data warehouse system that forecasting and recommendations packages can run against
- Analyzing data at a visitor level or campaign level
There are many others.
I’ve met many companies who were eager to learn more about how to query Adobe Analytics Clickstream Data Feeds. Understanding the basics of the Adobe Analytics Data Feeds and running introductory queries can help set a good foundation for more complex analysis, however it’s not always obvious how to calculate some of the basic metrics and reports you would normally find in Adobe Analytics from the Data Feeds themselves. To help you with that, I’ve put together some basic SQL queries that you can use to do basic reporting and hopefully match the reports you would normally pull from Adobe Analytics directly.
Using a tool like MySQL can work well as a boiler-plate tool if the volume of data that you’re processing is relatively small (either low overall traffic volume to your site or if you’re analyzing data for an hour or in some cases a day). Once you start querying tables that have 10 MM+ rows, MySQL starts to break down. Other blog posts have and will address other platforms that scale very well for processing this data.
That being said, expressing some of these queries in SQL can provide a good understanding of how the data is structured, so let’s get started.
I’ll be using MySQL 5.7.17 in the examples below.
To start, let’s review a couple of the key Clickstream Data Feed columns needed to run our SQL queries:
|Column Name||Field Type||Description||Example|
|date_time||DATETIME||Time in readable format in Time Zone specified by RSID||0000-00-00 00:00:00|
|duplicate_purchase||TINYINT(3)||Flag indicating whether the purchase should be ignored||0=False or 1=True|
|exclude_hit||TINYINT(3)||Hit excluded by client rule, VISTA etc.||0=False or >=1 True|
|hit_source||TINYINT(3)||Flag for backend processing to know type of hit data hit is||1=web beacon|
|post_event_list||TEXT||Comma separated list of numeric ID’s representing each event sent in||1,100,201
(purchase, evar1 & event2)
|post_page_url||VARCHAR(255)||Post version URL of the page hit||http://adobe.com|
|post_pagename||VARCHAR(100)||Post version of the page name (page URL if blank)||Home Page|
|post_product_list||TEXT||Post version of the list of products in cart or purchased||category;product;qty;rev|
|post_visid_high||BIGINT(19)||Part 1 of Visitor ID. Combination creates unique ID||33434080343|
|post_visid_low||BIGINT(19)||Part 2 of Visitor ID. Combination creates unique ID||75394626767|
|visit_num||INT(10)||Number of the current visit. Incremented each return.||5|
You can request data in this format from Adobe Analytics through the Admin menu which we wrote about how to do here. Also, this post assumes that you can request the data yourself and load the data into a MySQL table called ‘online_users’.
In the columns above, you’ll notice that many of the column names are prefixed with “post”. It is important to understand that as each of the Adobe web beacons are processed within Adobe Analytics through a series of stages. The Clickstream Data Feeds reflect this by providing the original data (i.e. as the web beacon came in) as one set of columns, and how the data was finally changed (i.e. after the data went through processing in Adobe Analytics) in the “post” columns. The “post” columns are used by Adobe Analytics for reporting, so I typically recommend always using the “post” columns unless doing some debugging to see how the data changed.
You can find additional details on each of the Clickstream Data Feed columns here. With this background, let me illustrate how you would construct a few of the metrics normally used by Adobe Analytics users:
Page Views Metric
In order to count the total page views metric for a time period, you’ll need to make sure that either the pagename or the page_url column have a value. Also, you’ll need to remove records that have been excluded during processing. Clickstream Data Feed records could be excluded due to Analytics Processing Rules, VISTA rule exclusions or other settings:
/* PAGE VIEWS */ SELECT COUNT(*) FROM online_users WHERE (pagename != "") OR (pageurl !="") AND exclude_hit = 0 AND hit_source = 1;
Unique Visitors Metric
To calculate the Unique Visitors to your site for the day, use the visitor identifier columns, post_visid_high and post_visid_low. Also, to avoid any potential collision problems, I recommend adding a delimiter between the 2 columns when they are concatenated together. Also, as shown above, add the standard exclusion filters with exclude_hit and hit_source to match Adobe Analytics reports.
/* UNIQUE VISITORS (UV’s) */ SELECT CONCAT(visid_high, '-', visid_low) AS visid, COUNT(*) 'UVs' FROM online_users WHERE hit_source = 1 AND exclude_hit = 0 GROUP BY visid;
Calculating the Visits metric is almost identical to the Unique Visitors metric. Just add the visit_num metric to the concatenation to define a unique visitor with a unique visit for the day. For example, if the same unique visitor returns to the site multiple times during the day with 30 minutes of inactivity between previous hits, the visit_num column will be incremented by one each time.
/* DAILY VISITS */ SELECT CONCAT(visid_high, "-", visid_low, "-", visit_num) AS unq_visit, COUNT(*) 'Visits' FROM online_users WHERE hit_source = 1 AND exclude_hit = 0 GROUP BY unq_visit;
Custom Event Metrics
To calculate any of the standard or custom events, we need to be able to locate the numeric identifier within the “event_list” column. In the example below, the “event_list” column was used to locate the total of all purchase events where users had ordered products online. Adding a comma at the beginning and the end of the purchase event ID “1”, will ensure that only this specific event is located and not match other eVar and event numeric identifiers within the “event_list” column.
To calculate any other standard or custom events, simply swap out the “1” for the purchase event for the event report needed.
|Lookup ID#||Event Description|
|10||Shopping Cart Open|
|12||Shopping Cart Add|
|13||Shopping Cart Remove|
|14||Shopping Cart View|
|20||Instance of Campaign|
|100||Instance of Evar1|
|200||Instance of Event1|
This is a simplified way of identifying which records have a certain event. In the case outlined below, we’re counting the number of purchases in the dataset.
/* Count Events */ SELECT COUNT(*) FROM online_users WHERE CONCAT(',', event_list, ',') LIKE "%,1,%" AND hit_source = 1 AND exclude_hit = 0; Example Output: +----------+ | count(*) | +----------+ | 6 | +----------+ 1 row in set (0.00 sec)
Sometimes, you want to identify records that contain a custom event (like event 1) that has a custom increment amount.
/* Returning rows containing a custom event */ SELECT event_list FROM online_users WHERE CONCAT(',', event_list, ',') LIKE "%,200=%," AND hit_source = 1 AND exclude_hit = 0; Example Output: +--------------------+ | event_list | +--------------------+ | ,1,200=17,201, | | ,1,200=20, | | ,1,200=10, | | ,1,200=17,201, | | ,1,203,200=17,201, | | ,1,200=17,201, | +--------------------+ 6 rows in set (0.00 sec)
Your next question is likely, “But how do I SUM the values associated to event 1?”. Unfortunately, MySQL doesn’t natively have support for the string split function that we’ll need to isolate the values here like Spark SQL or sparklyr (like Trevor used in this post). However, you have the following options for handling this:
- Define a user-defined function – a Google search can reveal some good examples
- Load the data into a relational schema — for example, the product list and event list would be in separate tables and you’d join the data as part of the query
- Sum the event totals using some other tool (Bash/Awk, Excel, etc) after running the query
To complete this example, I’ll illustrate option #3:
echo "SELECT event_list FROM online_users WHERE event_list like '%,200=%,';" | mysql -u root -p -D datafeed_tests | sed 's/^.*,200=\([0-9\.]*\),.*$/\1/'|tail +2 > event_totals.txt
There is a lot happening here in one line. Let’s step through it:
echo "SELECT event_list FROM online_users WHERE event_list like '%,200=%,';" | mysql -u root -p -D datafeed_tests
This part sends the query to the database using the command line. You can run this on its own (it will prompt you for a password) and produce the following output:
event_list ,1,200=17,201, ,1,200=20, ,1,200=10, ,1,200=17,201, ,1,200=171.20,201, ,1,200=17,201,
Next, we need to extract out the custom increment amounts for event 1 (event ID 200).
This part looks at each row of output data from MySQL and searches for a string that looks like “,200=SOME NUMBER,”. If it finds a match, it returns the SOME NUMBER part of the pattern. When provided the data from MySQL as an input, it produces the following output:
event_list 17 20 10 17 171.20 17
That ‘event_list’ thing at the top is still there. Let’s use ‘tail’ to remove it.
This part removes the first row from the output. MySQL tends to return column names in the output. This just removes them. Be careful with this as your client may not print a header and you may be removing the top record in your results. Our output now looks like this:
17 20 10 17 171.20 17
If you’re on a Linux system, (non OSX), you can actually just pipe this output to paste and bc as seen below without using the ‘event_totals.txt’ file. I’m on a Mac though, so I’ll show how to use the file.
This captures the above output and stores it in a file called event_totals. For example, after running this on my test data, this file contained the following:
17 20 10 17 171.20 17
Now, to add these together, you can use the following command:
paste -sd+ event_totals.txt | bc
Let’s break this apart. The “paste -sd+ event_totals.txt” generates output that looks like this: “17+20+10+17+171.20+17”. The ‘bc’ part is just a calculator that evaluates an expression. It takes the output from the previous paste command, evaluates it, and produces the output:
As you can see, this isn’t an ideal way of summing a custom event. This is one of the weaknesses in using MySQL for this sort of thing. That being said, there are a lot of things that MySQL can do quickly if your dataset is small.
Hopefully this was a helpful introduction to a few basic queries to run against Clickstream Data Feeds. Be sure to compare the output of the SQL queries with the same Adobe Analytics reports to verify you’re doing things right. Check back for more information and examples on constructing useful reports using SQL with Adobe Analytics Data Feeds.