Automate to replicate

In September 2020's T-SQL Tuesday, Elizabeth Noble (b|t) wants to know what members of the SQL community have automated to make their lives easier. Thanks for asking, Elizabeth 😀.

I found this question harder to answer than other T-SQL Tuesday challenges, simply because I love automation so much 🤓. I've always been a fan of automating deployments, from batch scripts back-in-the-day to modern Azure Pipelines. With the help of pipelines, automated testing now seems like a no-brainer. I'm a data engineer, so many of my automation challenges have been in the areas of process control and code generation, often using dynamic T-SQL – it's this last one that I'm going to talk about here.

I spent a good part of my career in the UK National Health Service (NHS). We often talk about data engineering problems in terms of handling large volumes of data, but the challenge in the NHS was frequently complexity – even a small acute hospital might have a dozen or more separate systems, running on various different DBMS backends, with many tables of interest for management or central reporting.

You might need to extract data from hundreds of (mostly small) tables to populate your data warehouse – my challenge was to make this as quick and easy to implement as possible.

The great thing about this particular scenario was that the vast majority of source systems we encountered were relational DBMSs. This allowed us to exploit three features:

  • you could typically connect to a source system as a SQL Linked Server
  • you could interrogate source systems using SQL queries (with some variations in dialect)
  • source systems usually included a system catalog – either of the ISO standard INFORMATION_SCHEMA type or some vendor-specific alternative.

This allowed us to connect to source systems and extract catalog metadata into a shared catalog, stored on the data warehouse side. The shared catalog contained a common set of attributes like:

  • source linked server name
  • table name
  • column name
  • column type, length, precision, scale
  • column key participation.

We'd augment this locally by setting an include/exclude flag on columns and tables for extraction.

Combining dynamic T-SQL with the shared metadata catalog is where this gets exciting 😀. A basic implementation uses metadata to generate a local staging table to contain remote data, then builds a stored procedure definition that populates the local table from the source system using a linked server query. Here's a basic example:

DECLARE @tableName NVARCHAR(255) = N'MySourceTable'
  @sql += CHAR(13) + CHAR(10) + '  , ' + QUOTENAME(ColumnName)
FROM dbo.SharedCatalog
WHERE TableName = @tableName
  @sql = 'CREATE PROCEDURE stg.' + QUOTENAME('Extract_' + TableName) + '
TRUNCATE TABLE stg.' +  QUOTENAME(TableName) + '
INSERT INTO stg.' +  QUOTENAME(TableName) + ' (
  ' + REPLACE(SUBSTRING(@sql, 7, LEN(@sql)), '  , ', ', ') + '
FROM OPENQUERY(' + QUOTENAME(LinkedServerName) + ', 
    ' + SUBSTRING(@sql, 7, LEN(@sql)) + '
  FROM ' + QUOTENAME(TableName) + ''');'
FROM dbo.SharedCatalog
WHERE TableName = @tableName
PRINT @sql

A more sophisticated ETL pattern includes:

  • a local view of a remote table, containing the linked server query (which may include ETL watermark information)
  • a stored procedure to rebuild the view for updated ETL watermarks
  • the local staging table into which source data is loaded
  • the extract stored procedure, populating the staging table from the view
  • a history table, containing a local history of change (where necessary to support type 2 SCDs or snapshot fact tables)
  • a stored procedure to update the history table from the staging table.

If your ETL process control is metadata-driven you can go one step further, automatically configuring the various stored procedures and their dependencies for execution.

The benefits of this approach for us were the same as for any automation – save time, reduce errors! When a new source system was identified (or when we needed additional tables or columns from an existing source), we could regenerate the entire interface automatically. When we got it right, this saved days or weeks of development time. When we didn't, the same error appeared all over the place – but could be corrected in all of those places relatively easily.

The pattern we were using – extracting data over linked servers using T-SQL queries – was specific to our use case and won't apply everywhere, but patterns pop up all over data engineering. Whether it's building tables (staging, SCDs, fact patterns) or processes to support and maintain them, dynamic SQL can be a really powerful tool for automating the development of those objects.