Automating Analytics Data Quality with ChatGPT and R

Data quality has been the pesky, perennial problem plaguing the analytics industry since its inception. We all recognize its importance—if data is unreliable, the insights derived from it are equally dubious. Despite its importance, however, addressing data quality remains a notorious challenge. Programmatic data quality solutions lack the essential human context to discern whether something appears “off.” In contrast, a human has that intuition but can never keep pace with the immense volume of data that needs inspection. So, we find ourselves entangled in a problem with no easy solution.

That was, indeed, the case—until now. Artificial intelligence (AI) has advanced to the point where human-like context can now be applied programmatically at scale. ChatGPT boasts an impressive capacity for contextual understanding and reasoning and, as you’ll see, offers a potential solution to data quality issues we couldn’t have fathomed solving previously. In this blog post, I’ll guide you through harnessing the combined power of ChatGPT and the R programming language to sift through your data, automatically identify potential issues, and—for those using Adobe Customer Journey Analytics—implement fixes with the help of some nifty CJA features.

Before diving in, you will need to install the following R libraries if you haven’t already: cjar (or adobeanalyticsr for Adobe Analytics users), chatgpt, jsonlite, knitr, and dplyr.

To install and set up cjar, follow the guide I provided in a previous blog post (thanks, Ben Woodard!). To set up the chatgpt library, you must first obtain an API key from the OpenAI website. You’ll have to provide some billing details (they charge a tiny bit of money for each API call) and then follow the helpful guide provided by Juan Cruz Rodriguez, which you can find on GitHub here.
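
Once you have your key, the chatgpt package reads it from an environment variable. Here’s a minimal sketch of that step (per the package’s documentation it looks for OPENAI_API_KEY; for anything beyond a quick test, store the key in your .Renviron file instead of a script):

# Make your OpenAI API key available to the chatgpt package
Sys.setenv(OPENAI_API_KEY = "sk-your-key-here")  # placeholder key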

By the way, if you don’t know how to use or install an R library, ChatGPT is a wonderful coach and can help you through the process!

Setting Things Up

First things first – let’s load our libraries, authenticate with cjar (or adobeanalyticsr) and set up our initial variables:

library(cjar)
library(chatgpt)
library(jsonlite)
library(knitr)
library(dplyr)

cja_auth()

data_view = "dv_someDataViewID"
dimensions = cja_get_dimensions(data_view)

The dimensions variable will now store a data frame of all the dimensions you use for reporting purposes. To make ChatGPT work best (and as a general best practice), you’ll want to ensure that all of your dimensions have been given a good description so that ChatGPT has some context to go on when checking the values for problems.
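
A quick way to spot dimensions that still need a description is to filter the dimensions data frame. Here’s a small sketch using the id, name, and description columns we’ll rely on later (exact column names may vary slightly between cjar versions):

# Dimensions with a missing or empty description
dimensions %>%
  filter(is.na(description) | description == "") %>%
  select(id, name)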

Here are a couple of things to consider when creating dimension descriptions for use with ChatGPT:

  • If the dimension items follow a specific format (e.g., they should always start with ‘product:’, or should always be a URL without URL encoding), this can be very helpful to ChatGPT in identifying values that don’t match the format.
  • ChatGPT has a wealth of knowledge outside of your implementation. When you include important dimension details like “this is a country code” or “this is a URL,” it can use its understanding of countries and internet domains to help you find errors in your data that you didn’t know were errors. For example, identifying invalid country codes or suspicious, irregular domains.

Using ChatGPT

With our dimension descriptions in place, we can now pull data from each of these dimensions for inspection by ChatGPT. To pull a simple report, use the cja_freeform_table function:

i = 1 # (We'll replace this with a loop later)
sample_vals = cja_freeform_table(
  dataviewId = data_view,
  dimensions = dimensions$id[i],
  metrics = "occurrences",
  top = 100
)

This will return the top 100 most commonly occurring values in your data from the first dimension in your dimensions variable. Due to limits on how much text the ChatGPT API can handle, you can’t inspect every dimension value in a single API call if a dimension has thousands of rows or more. So, you can either inspect just the top values or, if you run this periodically on a schedule, rotate through other values by changing the top and/or page parameters of the cja_freeform_table function.
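
For example, a scheduled run could work its way through the long tail of values by bumping the page parameter. Here’s a quick sketch (this assumes the page parameter is zero-indexed, so page = 1 would pull values 101-200; check your cjar version’s documentation):

sample_vals = cja_freeform_table(
  dataviewId = data_view,
  dimensions = dimensions$id[i],
  metrics = "occurrences",
  top = 100,
  page = 1  # second page of 100 values, assuming zero-indexed pages
)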

Next, we’re going to prep the data for ChatGPT and feed it into a very detailed prompt (the full generated prompt is repeated below so that you can read it):

sample_table = paste(kable(sample_vals, format = "pipe", row.names = FALSE), collapse = "\n")

text_response = ask_chatgpt(
  paste0(
    "I have an Adobe Analytics report for a dimension called '",
    dimensions$name[i],
    "' which has the following description: '",
    dimensions$description[i],
    "'. Can you look at the values of this report and tell me if anything looks off? For example, any misspellings or errant characters or anything else that might indicate that my Adobe Analytics implementation has a problem? You can ignore when the report says 'No Value'. When you return the response, give me only a json object with the errant values and the rationale with no other text other than a json object. Here's the report:\n\n",
    sample_table,
    "\n\nRespond with a json object where you have an array of 'errant_values', with an array of objects inside for 'value' and 'rationale'. Use as few characters as possible for the rationale to save space. Keep the response under 1000 characters, so pick only the most relevant/urgent problems to share. Again, one more time - do not respond with anything other than this json object - no other text at all. For example, do not say 'here's the requested json object' or anything like that. Just give me the top 5 errant values in json with no '\\n' characters and no other text and keep it under 1000 characters."
  )
)

Here’s the full prompt generated for the “Session Referrer” dimension I was using (shortened for readability):

*** ChatGPT input:

"I have an Adobe Analytics report for a dimension called 'Session Referrer'
which has the following description: 'The last known referring URL of the 
current page. It should be in URL format and only contain the domain of 
the referring URL. When no referring URL is present, it will show the 
value of 'typed/bookmarked'.'. Can you look at the values of this report 
and tell me if anything looks off? For example, any misspellings or errant 
characters or anything else that might indicate that my Adobe Analytics 
implementation has a problem? You can ignore when the report says 'No 
Value'. When you return the response, give me only a json object with the 
errant values and the rationale with no other text other than a json 
object. Here's the report:

|values                                  | events|
|:---------------------------------------|------:|
|bookmarked/typed                        |  34270|
|www.google.com                          |  15260|
|dev26.cevala.dev                        |   2739|
|www.bing.com                            |    983|
|statics.teams.cdn.office.net            |    548|
...
...
|com.snapchat.android                    |      1|

Respond with a json object where you have an array of 'errant_values', 
with an array of objects inside for 'value' and 'rationale'. Use as few 
characters as possible for the rationale to save space. Keep the response 
under 1000 characters, so pick only the most relevant/urgent problems to 
share. Again, one more time - do not respond with anything other than this 
json object - no other text at all. For example, do not say 'here's the 
requested json object' or anything like that. Just give me the top 5 
errant values in json with no '\n' characters and no other text and keep 
it under 1000 characters."

I tweaked this prompt after a bunch of trial and error with various reports from some sample data; however, I’m sure there’s still room for improvement. What’s amazing is that ChatGPT knows how to scan these data, find potential problems, then return the results in JSON format. Truly mind-blowing. The result of calling the ask_chatgpt function above is:

{
  "errant_values": [
    {
      "value": "bookmarked/typed",
      "rationale": "Value should only contain the domain of the referring URL"
    },
    {
      "value": "adobe.okta.com",
      "rationale": "Value seems to be an internal site, should not be a session referrer"
    },
    {
      "value": "statics.teams.cdn.office.net",
      "rationale": "Value is a CDN domain, should not be a session referrer"
    },
    {
      "value": "amp-amebaownd-com.cdn.ampproject.org",
      "rationale": "Value is a CDN domain, should not be a session referrer"
    },
    {
      "value": "usc-powerpoint.officeapps.live.com",
      "rationale": "Value is a Microsoft Office domain, should not be a session referrer"
    }
  ]
}

You’ll notice that it followed my instructions to return only a JSON object (although it still added a bunch of return characters to the string despite my telling it not to). To parse this into an R data frame, we use the jsonlite library:

data_from_text = fromJSON(text_response)
df = as.data.frame(data_from_text$errant_values)

Which yields:

|value                                |rationale                                                            |
|:------------------------------------|:--------------------------------------------------------------------|
|bookmarked/typed                     |Value should only contain the domain of the referring URL            |
|adobe.okta.com                       |Value seems to be an internal site, should not be a session referrer |
|statics.teams.cdn.office.net         |Value is a CDN domain, should not be a session referrer              |
|amp-amebaownd-com.cdn.ampproject.org |Value is a CDN domain, should not be a session referrer              |
|usc-powerpoint.officeapps.live.com   |Value is a Microsoft Office domain, should not be a session referrer |

Fixing the Problems

Now that we’ve identified some problems, let’s fix them. In CJA, these problems are simple to correct using include/exclude rules and by fixing how we treat the “No Value” line item. To do this, we’ll:

  1. Change the “bookmarked/typed” line item to “typed/bookmarked” as it’s supposed to be per the description
  2. Add some exclusion rules to remove the spurious domains

You can do both of these by editing the dimension itself in CJA.

This option doesn’t exist in Adobe Analytics, but if you’re an Adobe Analytics user, you can still fix your implementation to exclude those domains.

Automating Things

While this isolated example worked well, I would not rely on ChatGPT (at least GPT-3, which is what the API supports as of this writing) to automatically correct errors without a human inspecting them first. Instead, you can use the guide I posted previously to automatically send these detected errors as a Slack or email notification.
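
If you’d rather not dig up that post, here’s a minimal sketch of what the Slack route could look like, using the httr package and a Slack incoming webhook (the webhook URL is a placeholder, and the notify_slack helper is something I’m inventing here; it assumes you’ve compiled the detected errors into the error_table data frame we’ll build next):

library(httr)

# Placeholder webhook URL - replace with your own Slack incoming webhook
slack_webhook = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

notify_slack = function(error_table) {
  msg = paste0(
    "ChatGPT flagged ", nrow(error_table), " potentially errant dimension values:\n",
    paste(error_table$dimension_name, "-", error_table$value, collapse = "\n")
  )
  # Slack incoming webhooks accept a JSON payload with a 'text' field
  POST(slack_webhook, body = list(text = msg), encode = "json")
}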

Next, let’s loop through all of the dimensions so we don’t have to run this manually for each one. By wrapping the steps above in a find_errant_values function, we can have ChatGPT inspect each dimension in turn, then compile the results into a single table that you can turn into an alert:

find_errant_values = function(dimension_id, dimension_name, dimension_description){
  tryCatch({
    sample_vals = cja_freeform_table(
      dataviewId = data_view,
      dimensions = dimension_id,
      metrics = "occurrences",
      top = 100
    )
    sample_table = paste(kable(sample_vals, format = "pipe", row.names = FALSE), collapse = "\n")

    text_response = ask_chatgpt(paste0("I have an Adobe Analytics report for a dimension called '", dimension_name, "' which has the following description: '", dimension_description, "'. Can you look at the values of this report and tell me if anything looks off? For example, any misspellings or errant characters or anything else that might indicate that my Adobe Analytics implementation has a problem? You can ignore when the report says 'No Value'. When you return the response, give me only a json object with the errant values and the rationale with no other text other than a json object. Here's the report:\n\n", sample_table, "\n\nRespond with a json object where you have an array of 'errant_values', with an array of objects inside for 'value' and 'rationale'. Use as few characters as possible for the rationale to save space. Keep the response under 1000 characters, so pick only the most relevant/urgent problems to share. Again, one more time - do not respond with anything other than this json object - no other text at all. For example, do not say 'here's the requested json object' or anything like that. Just give me the top 5 errant values in json with no '\\n' characters and no other text and keep it under 1000 characters."))

    data_from_text = fromJSON(text_response)
    df = as.data.frame(data_from_text$errant_values)
    df$dimension_name = dimension_name
    df$dimension_id = dimension_id
    return(df)
  },
  error = function(e) {
    # Return an empty data frame with matching (character) columns so
    # rbind() in the loop below still works when a dimension errors out
    empty_dataframe = data.frame(
      value = character(0),
      rationale = character(0),
      dimension_name = character(0),
      dimension_id = character(0),
      stringsAsFactors = FALSE
    )
    return(empty_dataframe)
  })
}

error_table = find_errant_values(dimensions$id[1], dimensions$name[1], dimensions$description[1])

for(i in 2:nrow(dimensions)){
  error_table = rbind(error_table, find_errant_values(dimensions$id[i], dimensions$name[i], dimensions$description[i]))
}
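
As a side note, since we already loaded dplyr, you could build the same table with lapply and bind_rows rather than growing it one rbind at a time; the result is identical, just a bit more idiomatic:

error_table = bind_rows(
  lapply(seq_len(nrow(dimensions)), function(i) {
    find_errant_values(dimensions$id[i], dimensions$name[i], dimensions$description[i])
  })
)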

Notice that I had to wrap everything in tryCatch, as the ChatGPT output doesn’t always conform to a perfect JSON object. If one of the dimensions errors out, the loop can still continue analyzing the rest of my data.
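
To cut down on those parsing failures in the first place, you can also trim any stray text that ChatGPT wraps around the JSON before handing it to fromJSON. Here’s a small sketch of that idea (it simply keeps everything from the first '{' to the last '}'):

extract_json = function(response) {
  start = regexpr("\\{", response)            # position of the first '{'
  end = max(gregexpr("\\}", response)[[1]])   # position of the last '}'
  substr(response, start, end)
}

data_from_text = fromJSON(extract_json(text_response))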

While there’s not enough room to list all the issues detected in my data within this post, here are some remarkable findings ChatGPT identified:

  • Invalid country codes
  • Malformed product names
  • Questionable referring URLs from unknown or suspicious domains
  • Inconsistencies in dimension item naming conventions
  • URL encoding problems
  • Misspellings
  • Invalid IP addresses
  • IP addresses linked to suspicious organizations
  • Incorrectly formatted timestamps
  • Capitalization mistakes
  • Unwanted leading or trailing spaces (these drive me nuts!)
  • Extraneous JavaScript code
  • Non-production domains from test data

In my results, there were also a few false positives where ChatGPT flagged values that, on closer inspection, weren’t actually problems. This serves as a reminder that AI isn’t infallible, and it’s why I wouldn’t automatically apply the fixes it suggests (at least with the current version of ChatGPT). Nevertheless, ChatGPT’s ability to detect major issues makes it significantly easier to apply the final layer of human scrutiny. If you’re interested in reviewing my complete results, feel free to download them here.

Conclusion

Although ChatGPT isn’t flawless and doesn’t deliver impeccable insights or consistent results every time, it’s still remarkable how swiftly and effortlessly it identifies these issues. While it may not be a magic wand that resolves all your data concerns, it’s undoubtedly a valuable tool that can expedite your data quality assessments. Honestly, I was skeptical it would uncover any issues in the polished sample data I was using, but it revealed numerous items I had to correct.

One last thing worth mentioning: it’s not always clear how (or whether) ChatGPT uses the data you send to its API, so you’ll want to be careful about which data, if any, you decide to have ChatGPT check for you.

I hope you find this information useful—feel free to share your thoughts with me on Twitter!

Trevor Paulsen

Trevor is a group product manager for Adobe's Customer Journey Analytics (CJA). With a background in aerospace engineering and robotics, he has a strong foundation in estimation theory and data mining. Before leading Adobe's data science consulting team, Trevor used these skills to drive innovation in the fields of aerospace and robotics. When he's not working, Trevor enjoys engaging in big data projects and statistical analyses as a hobby. He is also a father of five and enjoys bike rides and music. All views expressed are his own.