Google Analytics API pagination in Azure Data Factory

In a previous post I created a pipeline to retrieve data from the Google Analytics reporting API, using an OAuth 2.0 access token for authorisation. Azure Data Factory's Copy data activity handles various styles of paged API response, but it doesn't support the approach taken by the Google Analytics reporting API. In this post I look at how to make that work.

API consumers want fast response times, but sometimes request a lot of data. API providers typically reconcile these conflicting requirements by limiting the amount of data that can be returned from a single API call. When the amount of data to be returned exceeds a limit for the API, only a subset of the data is returned.

In this situation, the response also includes information about how to get the remaining data, such as an alternative URL or query parameter to be used in a subsequent call to the API. Using this information, an API consumer can make a series of calls to retrieve the full dataset in separate pieces. This is paging.

If a /v4/reports:batchGet API call matches more data than can be returned in a single response, the JSON response body includes a nextPageToken property. The value of this property can be specified in a subsequent request's pageToken property, allowing the next portion of the report dataset to be retrieved.
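
For illustration, here's a heavily trimmed sketch of a batchGet response – the field values are invented and the columnHeader section is omitted, but it shows where nextPageToken appears:

{
  "reports": [
    {
      "data": {
        "rows": [
          {"dimensions": ["google"], "metrics": [{"values": ["42"]}]}
        ],
        "rowCount": 25
      },
      "nextPageToken": "10"
    }
  ]
}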

An API consumer pages through a Google Analytics reporting response like this:

  1. Call the API and receive a response
  2. Check the response for a nextPageToken value
  3. If a nextPageToken value is present, call the API again, passing the value as the next request's pageToken
  4. Repeat steps 2 & 3 until the response contains no nextPageToken – i.e. there is no more data to be returned.
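
For example, if the response sketched above contains "nextPageToken": "10", step 3's follow-up call passes that value in the request's pageToken property – a minimal, illustrative request:

{
  "reportRequests": [
    {
      "viewId": "226439645",
      "dateRanges": [
        {"startDate": "2020-09-03", "endDate": "2020-09-03"}
      ],
      "metrics": [
        {"expression": "ga:sessions"}
      ],
      "pageToken": "10"
    }
  ]
}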

ADF's Copy data activity encapsulates the detail of handling paged API responses, so that normally you don't need to think about it. By specifying pagination rules, you describe the two features required for paging:

  • how the “next page” identifier is returned in an API response (so that ADF can extract that information from the response)
  • how “next page” is to be specified in an API request (so that ADF can make subsequent API calls automatically).

This is a really convenient abstraction – the Copy data activity makes it appear that the API is providing a single, uninterrupted stream of data.

A pagination rule is configured as a key-value pair on the Copy data activity's Source tab:

  • Value indicates how to find the next page token. It can be either a response header name (for APIs that return the next page token in a response header) or a JSON path (for APIs like Google Analytics which return the token inside a response JSON payload).

    For the Google Analytics response body, the JSON path is reports[0].nextPageToken. The 0 refers to the first report in the response – Google Analytics allows you to submit multiple report requests in one API call, returning a response containing multiple reports. This is more complicated to handle, so I avoid making API calls with more than one report request.

  • Name indicates how the next page token is to be passed into the subsequent API call. Depending on the pattern used by the API you're calling, you can specify an absolute URL, a request header or a URL query parameter (an example follows this list).

    Notice that the list of options does not include passing the next page token as a property in the request body! This means that pagination support in the Copy data activity simply can't be used with Google Analytics.
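
For contrast, here's roughly what a pagination rule looks like in the Copy data source JSON for a hypothetical API that does fit a supported pattern – one returning the token in its response body and accepting it as a URL query parameter named pageToken:

"paginationRules": {
  "QueryParameters.pageToken": "$.nextPageToken"
}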

The approach to managing your own pagination in Azure Data Factory is to make REST API calls inside an ADF Until activity. My new pipeline wraps the API call in an Until loop and includes a “NextPageToken” variable to keep track of next page tokens returned by successive API calls.
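
The variable declaration itself is minimal – a String whose default value of -1 acts as a sentinel meaning “no (more) pages to fetch”:

"variables": {
  "NextPageToken": {
    "type": "String",
    "defaultValue": "-1"
  }
}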

Handling the response is more complicated, because I want ADF to treat response data in two different ways:

  • as a stream of rows to be copied to storage (e.g. a database table, as in the example of my previous post)
  • as a complete dataset, from which I can look up the next page token.

I haven't found a clean way to reconcile these requirements for data in flight, so the approach I've taken is to include the returned token in the Copy data mapping. This writes it into the sink data store, where I can read it back out using a Lookup activity. The body of the Until activity chains together three activities: the Copy data activity, a Lookup activity and a Set variable activity.

The Until activity's expression is @equals(variables('NextPageToken'),'-1') – it loops until the value of “NextPageToken” is -1. The variable's default value is -1, but the loop still runs at least once, because the Until activity's expression is not evaluated until the end of each iteration.
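
For reference, here's a trimmed sketch of the Until activity's JSON – the activity name and timeout are illustrative, and the activities array is where the Copy data, Lookup and Set variable activities described below belong:

{
  "name": "Until no more pages",
  "type": "Until",
  "typeProperties": {
    "expression": {
      "value": "@equals(variables('NextPageToken'),'-1')",
      "type": "Expression"
    },
    "timeout": "0.01:00:00",
    "activities": []
  }
}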

The Copy data activity configuration is very similar to the one in my previous post, with a few differences.

The Source tab's Request body is now dynamic, because it needs to reflect the changing value of pageToken on successive API calls. I'm using string interpolation to add , "pageToken":"<NextPageToken>" to the request body (where <NextPageToken> is the value of that variable). The expression omits the property from the first API call (when “NextPageToken” has the value -1).

{
  "reportRequests": [
    {
      "viewId": "226439645",
      "dateRanges": [
        {"startDate": "2020-09-03", "endDate": "2020-09-03"}
      ],
      "metrics": [
        {"expression": "ga:sessions"}, {"expression": "ga:sessionDuration"},
        {"expression": "ga:bounces"}, {"expression": "ga:pageviews"}
      ],
      "dimensions": [
        {"name": "ga:source"}, {"name": "ga:medium"},
        {"name": "ga:hasSocialSourceReferral"}, {"name": "ga:adDistributionNetwork"}
      ],
      "pageSize": 10
      @{if(equals(variables('NextPageToken'),'-1'),'',concat(',"pageToken":"',variables('NextPageToken'),'"'))}
    }
  ]
}

I've also added an explicit pageSize value – the default is 1000 but a lower value is useful for debugging. The source has no pagination rules configured because I'm doing it myself.

The Sink table is the same one as in my previous post, but it now needs a [NextPageToken] column to store the token returned from the API call. The Mapping definition also needs to change to keep up with the sink table. The code for this post includes a revised table definition and mapping JSON.
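
As an illustration, here's a fragment of what the revised mapping might look like. The row-level mappings are unchanged from the previous post (only one is shown, with an invented sink column name); the new entry uses a $-prefixed path, which ADF resolves from the document root rather than from the collection reference, so the token value is repeated on every row written:

"translator": {
  "type": "TabularTranslator",
  "mappings": [
    {"source": {"path": "['dimensions'][0]"}, "sink": {"name": "Source"}},
    {"source": {"path": "$['reports'][0]['nextPageToken']"}, "sink": {"name": "NextPageToken"}}
  ],
  "collectionReference": "$['reports'][0]['data']['rows']"
}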

I'm using a Lookup activity to read the value of nextPageToken returned in the API response and written to the sink table. The SQL query in the activity selects the highest token value from the table, then resets the column to NULL. This ensures that each lookup sees only the token written by the most recent API call, both between iterations and on successive pipeline runs.

When the final page is returned by the API, nextPageToken is absent – the SQL COALESCE returns the value -1 in this case, which causes the Until activity to terminate.

SELECT COALESCE(MAX(NextPageToken), -1) AS NextPageToken
FROM dbo.GoogleAnalytics;

UPDATE dbo.GoogleAnalytics SET NextPageToken = NULL;

This lookup pattern is specific to my choice of Copy data sink. You don't have to use a table sink here – you could just as easily write the JSON response out to blob storage, then read it back in with a Lookup activity.

The Set variable activity (sketched after the list below) sets the value of “NextPageToken” to that returned by the Lookup activity. After that, the Until activity's expression is evaluated:

  • If the API response contains a nextPageToken, the Until activity iterates again
  • If not, “NextPageToken” is -1 and the Until activity exits.
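
Here's roughly what that Set variable activity might look like, assuming the Lookup activity is named "Lookup NextPageToken" and is set to return its first row only. The @string() conversion is there because the SQL query returns a number, while the pipeline variable is a String:

{
  "name": "Set NextPageToken",
  "type": "SetVariable",
  "typeProperties": {
    "variableName": "NextPageToken",
    "value": {
      "value": "@string(activity('Lookup NextPageToken').output.firstRow.NextPageToken)",
      "type": "Expression"
    }
  }
}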

If an error occurs in my example, “NextPageToken” is never set and the activity will run until it times out. This example is deliberately simple for clarity, but a full implementation should ensure that the Until activity exits if an error occurs.

As noted earlier, pageSize defaults to 1000, but you can increase it up to a maximum of 100000.

If you're confident that your API request will never exceed 100000 rows, you could just set pageSize to 100000 and sidestep paging altogether. If you do this, it would be prudent to add a safety-check on the number of rows returned (or the existence of nextPageToken), in case your confidence turns out to be misplaced 😁.

The paging pattern used by the Google Analytics reporting API isn't supported by the Copy data activity's pagination rules in ADF. In this post I used ADF's Until activity to page through API responses, enabling even very large Google Analytics report responses to be handled in ADF.

  • Code: The code for this post is available on GitHub – the T-SQL script and Copy data mapping are in the scripts folder.

  • Share: If you found this article useful, please share it!
