Extract data from Google Analytics with Azure Data Factory

In a previous post I prepared an ADF pipeline to make authorised Google Analytics API requests, using an Azure Function to obtain an OAuth 2.0 access token. In this article I'm going to use the returned token to authorise an API connection and extract data from Google Analytics.

Azure Data Factory datasets provide convenient abstractions of external data stores in a variety of shapes and sizes, including REST APIs. You can extract data from a REST API endpoint with an ADF Copy data activity that uses a REST dataset as its source.

Google Analytics provides a batch data extract endpoint with URL https://analyticsreporting.googleapis.com/v4/reports:batchGet – encapsulating access to this endpoint in a dataset allows you to extract data using ADF. The REST dataset is implemented over a REST linked service which represents the API connection.

I created this linked service using the API endpoint's base URL, https://analyticsreporting.googleapis.com.

Its authentication type is Anonymous because API requests will be authorised using an OAuth token passed in the request header (instead of a username and password, for example).
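
For reference, here's a minimal sketch of what such a linked service looks like in ADF's JSON format – the name is illustrative, and yours may differ:

{
  "name": "GoogleAnalyticsRestService",
  "properties": {
    "type": "RestService",
    "typeProperties": {
      "url": "https://analyticsreporting.googleapis.com",
      "authenticationType": "Anonymous",
      "enableServerCertificateValidation": true
    }
  }
}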

The complete endpoint URL combines the base URL specified in the linked service with a relative URL configured in the REST dataset.
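
A minimal dataset definition might look like this sketch, assuming the linked service above is named GoogleAnalyticsRestService:

{
  "name": "GoogleAnalyticsBatchGet",
  "properties": {
    "type": "RestResource",
    "linkedServiceName": {
      "referenceName": "GoogleAnalyticsRestService",
      "type": "LinkedServiceReference"
    },
    "typeProperties": {
      "relativeUrl": "/v4/reports:batchGet"
    }
  }
}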

Requests are made to the /v4/reports:batchGet endpoint using a query specified in a POST request's JSON body. The request body can contain many query options, but to keep it simple I'm going to restrict it to these basic features:

  • one date range
  • some metrics – things I want to measure like sessions, bounces, duration and page views
  • some dimensions – attributes I want to break the measures down by, like channel, source and medium.

You can select up to 10 metrics – you need at least one – and up to 9 dimensions. This will look familiar if you're used to fact and dimension star schemas for reporting (although you can't request data at the grain of individual fact rows).

Here's my request body:

{
  "reportRequests": [
    {
      "viewId": "12345678",
      "dateRanges": [
        {"startDate": "2020-09-03", "endDate": "2020-09-03"}
      ],
      "metrics": [
        {"expression": "ga:sessions"}, {"expression": "ga:sessionDuration"},
        {"expression": "ga:bounces"}, {"expression": "ga:pageviews"}
      ],
      "dimensions": [
        {"name": "ga:source"}, {"name": "ga:medium"},
        {"name": "ga:hasSocialSourceReferral"}, {"name": "ga:adDistributionNetwork"}
      ]
    }
  ]
}

It's requesting:

  • total sessions, session duration, bounces and pageviews
  • on September 3rd 2020
  • from reporting view 12345678 (this is the view ID I noted earlier)
  • broken down by four dimensions.

A wide range of dimensions and metrics are available – the dimensions I've chosen here are those required to reproduce Google Analytics' default channel definitions.

Now you're ready to copy data from the API endpoint 😀. To do this you need an ADF pipeline containing a single Copy data activity. Here's how to configure it.

This is the configuration for the Copy data activity's source, using the REST dataset I described earlier.

  • Source dataset is the REST dataset
  • Request method is POST
  • Request body is the JSON object shown above
  • There are two Additional headers (sketched in JSON after this list):
    • The “Authorization” header has the form “Bearer <Token>”, where <Token> is the OAuth access token retrieved by the Azure Function activity. This is how the API request will be authorised.
    • I've also set the “Content-Type” header to “application/json”.
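
Expressed in pipeline JSON, the source configuration looks roughly like this. It's a sketch rather than a definitive implementation: I'm assuming the Azure Function activity is named "Get access token" and surfaces the token at output.access_token, so adjust the expression to match however your function returns it (the request body is the JSON object shown earlier, abbreviated here):

"source": {
  "type": "RestSource",
  "requestMethod": "POST",
  "requestBody": "{ \"reportRequests\": [ ... ] }",
  "additionalHeaders": {
    "Authorization": {
      "value": "@concat('Bearer ', activity('Get access token').output.access_token)",
      "type": "Expression"
    },
    "Content-Type": "application/json"
  }
}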

The API returns a JSON response which can easily be written out to blob storage, but I prefer to copy the data directly into a database table for ease of analysis. The API returns a maximum of 9 dimensions and 10 metrics, so it's easy to define a general-purpose sink table for response data. My table has 9 nullable text columns and 10 nullable numerics; I've included a CREATE TABLE script in the code for this post.
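
As an illustration only – the actual script is in the post's code, and these column names are hypothetical – such a table could look like this:

CREATE TABLE dbo.GoogleAnalyticsData (
    -- one nullable text column per possible dimension (the API allows up to 9)
    Dimension1 NVARCHAR(256) NULL, Dimension2 NVARCHAR(256) NULL, Dimension3 NVARCHAR(256) NULL,
    Dimension4 NVARCHAR(256) NULL, Dimension5 NVARCHAR(256) NULL, Dimension6 NVARCHAR(256) NULL,
    Dimension7 NVARCHAR(256) NULL, Dimension8 NVARCHAR(256) NULL, Dimension9 NVARCHAR(256) NULL,
    -- one nullable numeric column per possible metric (the API allows up to 10)
    Metric1 DECIMAL(18, 4) NULL, Metric2 DECIMAL(18, 4) NULL, Metric3 DECIMAL(18, 4) NULL,
    Metric4 DECIMAL(18, 4) NULL, Metric5 DECIMAL(18, 4) NULL, Metric6 DECIMAL(18, 4) NULL,
    Metric7 DECIMAL(18, 4) NULL, Metric8 DECIMAL(18, 4) NULL, Metric9 DECIMAL(18, 4) NULL,
    Metric10 DECIMAL(18, 4) NULL
);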

My sink dataset represents this table in an Azure SQL Database. The mapping between source and sink isn't complicated, but configuring it in the ADF UX can be cumbersome because of the way result sets are nested in the API response body. To keep this simple, my JSON request body specifies only one report request with a single date range.

Rather than configure the mapping by hand, I've stored it in a JSON file which can be pasted into a pipeline string variable and used in a mapping expression, e.g. @json(variables('yourVariableName')). The mapping expression must be a typed JSON value and not just the raw string – this is a convenient way to achieve that. The mapping definition is also included in the code for this post.
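
To give a flavour of what that file contains, here's a cut-down sketch of the translator JSON for this response shape – the sink column names match the hypothetical table above, and the full version maps all 9 dimensions and 10 metrics:

{
  "type": "TabularTranslator",
  "collectionReference": "$['reports'][0]['data']['rows']",
  "mappings": [
    { "source": { "path": "['dimensions'][0]" }, "sink": { "name": "Dimension1" } },
    { "source": { "path": "['dimensions'][1]" }, "sink": { "name": "Dimension2" } },
    { "source": { "path": "['metrics'][0]['values'][0]" }, "sink": { "name": "Metric1" } },
    { "source": { "path": "['metrics'][0]['values'][1]" }, "sink": { "name": "Metric2" } }
  ]
}

The collectionReference identifies the repeated rows element in the response, and each mapping path is evaluated relative to it – this is what untangles the nesting described above.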

In this post I showed you how to create a REST dataset encapsulating the Google Analytics reporting API endpoint, then how to use ADF's Copy data activity – with the OAuth token already retrieved – to extract data into a database table.

By default, the Google Analytics reporting API returns results in batches of 1000 rows (as described in the definition of pageSize). In my next post I look at how to handle the Analytics API's pagination model in ADF.
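
For reference, pageSize sits alongside the other reportRequest options – here's a sketch of a request asking for larger batches:

{
  "reportRequests": [
    {
      "viewId": "12345678",
      "dateRanges": [{"startDate": "2020-09-03", "endDate": "2020-09-03"}],
      "metrics": [{"expression": "ga:sessions"}],
      "pageSize": 5000
    }
  ]
}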

  • Code: The code for this post is available on GitHub – the T-SQL script and Copy data mapping are in the scripts folder.

