Speaking

Test coverage for data engineering developments often isn't high, and Azure Data Factory (ADF) pipelines are no exception. In this talk I'll apply well-established testing approaches to an ADF pipeline using C# and NUnit, integrating the test suite into an Azure DevOps pipeline for regular automatic execution. The result will be an ADF instance that is re-tested in full whenever any change is made – quickly flushing out errors and breaking changes, and giving you the opportunity to fix bugs as they occur (instead of at 2am in three months' time!).
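
By way of illustration, here's a minimal smoke test of the kind the talk builds on. It assumes the Microsoft.Azure.Management.DataFactory SDK; the resource names and the TestEnvironment credential helper are placeholders, not part of the talk materials.

```csharp
// A minimal sketch: trigger an ADF pipeline run and assert it succeeds.
// Assumes the Microsoft.Azure.Management.DataFactory SDK; resource names
// and the TestEnvironment helper are hypothetical placeholders.
using System.Threading.Tasks;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.Rest;
using NUnit.Framework;

[TestFixture]
public class PipelineSmokeTests
{
    private DataFactoryManagementClient _client;

    [OneTimeSetUp]
    public void CreateClient()
    {
        // Credential acquisition (service principal, managed identity, ...)
        // is environment-specific and abstracted away here.
        ServiceClientCredentials credentials = TestEnvironment.GetCredentials();
        _client = new DataFactoryManagementClient(credentials)
        {
            SubscriptionId = TestEnvironment.SubscriptionId
        };
    }

    [Test]
    public async Task Pipeline_run_succeeds()
    {
        var runResponse = await _client.Pipelines.CreateRunAsync(
            "rg-data", "adf-demo", "PL_Stage_Customers");

        // Poll until the run reaches a terminal state.
        string status;
        do
        {
            await Task.Delay(10_000);
            var run = await _client.PipelineRuns.GetAsync(
                "rg-data", "adf-demo", runResponse.RunId);
            status = run.Status;
        } while (status == "Queued" || status == "InProgress");

        Assert.That(status, Is.EqualTo("Succeeded"));
    }
}
```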

Test coverage for data engineering developments often isn't high, and Azure Data Factory (ADF) pipelines are no exception. In this talk I'll begin by presenting a basic C#/NUnit test setup for a simple ADF pipeline, then move on to patterns for flexible test setup, isolation using dependency injection and faked external dependencies, unit vs functional tests in ADF, and calculation of test coverage. My aim is to demonstrate that test construction needn't be hard or onerous, and that developing automatable tests alongside your ADF pipelines provides real benefits in terms of prompt bug discovery and regression prevention.

Download slides
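
The isolation pattern mentioned above can be sketched in a few lines: expose external connection details as pipeline parameters, then have the test inject the address of a locally hosted fake. FakeRestServer and WaitForCompletion are hypothetical test helpers; the parameters argument to CreateRunAsync is real SDK surface.

```csharp
// Isolation by injection, sketched (within the same fixture as above):
// the pipeline reads its source URL from a parameter rather than a
// hard-coded linked service property, so the test can point it at a fake.
// FakeRestServer and WaitForCompletion are hypothetical helpers.
[Test]
public async Task Pipeline_lands_rows_from_source_api()
{
    using var fakeApi = FakeRestServer.StartWithResponse(
        path: "/customers",
        jsonBody: "[{\"id\": 1, \"name\": \"Test Customer\"}]");

    var runResponse = await _client.Pipelines.CreateRunAsync(
        "rg-data", "adf-demo", "PL_Stage_Customers",
        parameters: new Dictionary<string, object>
        {
            ["SourceBaseUrl"] = fakeApi.BaseUrl   // injected dependency
        });

    string status = await WaitForCompletion(runResponse.RunId);
    Assert.That(status, Is.EqualTo("Succeeded"));
}
```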

As data engineers we're great at putting together and managing complex process flows, but what happens if we stop trying to control the flow and start thinking about the metadata it needs instead? In this session we'll look at a variety of ETL metadata, use it to drive process execution, and watch the benefits quickly emerge. I'll talk about design principles for metadata-first process control and show how this approach reduces complexity, enhances resilience and allows a suite of ETL processes to reorganise itself adaptively.
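
To make the idea concrete, here's a small sketch (with an illustrative consume/produce metadata shape, not the session's actual schema) of execution order falling out of metadata rather than a hard-coded flow.

```csharp
// A sketch of metadata-first control: each process declares what it
// consumes and produces; a small runner derives execution order from
// that metadata instead of a hard-coded flow. Types are illustrative.
using System;
using System.Collections.Generic;
using System.Linq;

public record EtlProcess(string Name, string[] Consumes, string[] Produces);

public static class MetadataRunner
{
    public static IEnumerable<EtlProcess> ExecutionOrder(IEnumerable<EtlProcess> processes)
    {
        var pending = processes.ToList();
        var available = new HashSet<string>();   // datasets produced so far

        while (pending.Count > 0)
        {
            // A process is runnable once everything it consumes exists.
            var runnable = pending.Where(p => p.Consumes.All(available.Contains)).ToList();
            if (runnable.Count == 0)
                throw new InvalidOperationException("Metadata contains unsatisfiable dependencies.");

            foreach (var p in runnable)
            {
                yield return p;                     // a real runner executes here
                available.UnionWith(p.Produces);    // outputs unlock downstream work
                pending.Remove(p);
            }
        }
    }
}
```

Because the order is derived, adding or retiring a process means changing only its metadata row; no orchestration code needs to be touched.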

Documentation has never been this much fun! In this session I'll be introducing Graphviz – free, open-source graph visualisation software whose relevance extends well beyond traditional graph applications. I will show how we can use it to build informative visualisations of common data management artefacts, specifically SQL Server database diagrams and ETL data pipelines. Combining the approach with sources of metadata, we'll see how to generate suites of interlinked diagrams quickly and automatically, describing large and complex database and ETL systems in an easy-to-navigate way.
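
As a taste of how little code this needs, the sketch below emits Graphviz's DOT language from foreign-key metadata; in the SQL Server case the tuples would come from a query against sys.foreign_keys, but they're hard-coded here for illustration.

```csharp
// Emit a Graphviz DOT digraph from foreign-key metadata. In practice the
// tuples would come from a query against sys.foreign_keys; they are
// hard-coded here for illustration.
using System;
using System.Text;

var foreignKeys = new[]
{
    (From: "SaleLine", To: "Sale"),
    (From: "Sale",     To: "Customer"),
    (From: "Sale",     To: "Store"),
};

var dot = new StringBuilder();
dot.AppendLine("digraph erd {");
dot.AppendLine("  rankdir=LR;");
dot.AppendLine("  node [shape=box];");
foreach (var fk in foreignKeys)
    dot.AppendLine($"  \"{fk.From}\" -> \"{fk.To}\";");
dot.AppendLine("}");

Console.WriteLine(dot);   // render with: dot -Tsvg erd.dot -o erd.svg
```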

ETL development can be packed with variety or as repetitive as WHILE 1 = 1 – and when it's the latter it's time-consuming, boring and error-prone. In this session I'll get the ball rolling with some basic dynamic T-SQL before supercharging it with metadata to generate (and re-generate) a variety of ETL components. We'll wrap up with some thoughts about how to tackle this in the real world with a heady mixture of good practice and metadata abstraction.

Download slides
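
The session's generators are written in dynamic T-SQL itself; purely to illustrate the metadata-to-code pattern here, the same idea is sketched in C#: feed in a table name and column list, get back a re-runnable stage load.

```csharp
// Metadata in, T-SQL out: given a table name and its columns, generate a
// re-runnable stage load. The session does this in dynamic T-SQL itself;
// this C# version is an illustrative sketch of the same pattern.
using System;
using System.Collections.Generic;
using System.Linq;

string BuildStageLoad(string table, IReadOnlyList<string> columns)
{
    var columnList = string.Join(", ", columns.Select(c => $"[{c}]"));
    return $@"
TRUNCATE TABLE stage.[{table}];

INSERT INTO stage.[{table}] ({columnList})
SELECT {columnList}
FROM src.[{table}];";
}

Console.WriteLine(BuildStageLoad("Customer", new[] { "CustomerId", "Name", "Email" }));
```

Regenerating a component after a schema change then amounts to updating the column metadata and re-running the generator.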

There are many techniques for orchestrating ETL processes, but the difference between good ones and great ones is how they perform when things go wrong. Desirable behaviours – like fault tolerance, quick fault finding and easy resume after error – often aren't available and sometimes seem hard to achieve. In my session I'll present an approach that uses only T-SQL and the SQL Server Agent, yet also enables parallel processing, adapts to evolving workloads and provides a wide variety of monitoring and diagnostic information.
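
The session's implementation is pure T-SQL plus SQL Agent; the sketch below is a language-neutral illustration (in C#, with illustrative names) of the behaviours it targets: per-process outcome logging, parallel execution of independent work, and resume-after-error by skipping anything that already succeeded.

```csharp
// Illustrative only: the session's real implementation is T-SQL + SQL Agent.
// This sketch shows the target behaviours: per-process outcome logging
// (a log table in the real design), parallel execution, and resume-after-
// error by skipping processes that already succeeded.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

var status = new ConcurrentDictionary<string, string>();

async Task RunBatch(IEnumerable<string> processes, Func<string, Task> execute)
{
    var tasks = processes
        .Where(p => status.GetValueOrDefault(p) != "Succeeded")  // resume: skip finished work
        .Select(async p =>
        {
            try
            {
                await execute(p);
                status[p] = "Succeeded";
            }
            catch (Exception)
            {
                status[p] = "Failed";  // fault tolerance: one failure doesn't abort the batch
            }
        });

    await Task.WhenAll(tasks);  // independent processes run in parallel
}

// First run: everything executes. A re-run after fixing a failure
// executes only the processes that did not succeed.
await RunBatch(new[] { "LoadCustomers", "LoadSales" }, p => Task.CompletedTask);
```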