Set up automated testing for Azure Data Factory

This is the first article in a series about automated testing for Azure Data Factory (ADF) pipelines.

The series is aimed at people who already know a bit about ADF – if you're brand new to it, I highly recommend getting started with Cathrine Wilhelmsen's Beginner's Guide to Azure Data Factory.

Test automation allows you to run more tests, in less time, with guaranteed repeatability. If you change an existing ADF dataset definition for a new ADF pipeline, how do you know you haven't broken something else? Automatically re-testing all your ADF pipelines before deployment gives you some protection against regression faults. Automated testing is a key component of CI/CD software development approaches: including automated tests in CI/CD deployment pipelines for Azure Data Factory can significantly improve quality.

I'll be using the word pipeline a lot! I'll try to be clear about which I mean whenever it comes up; typically I'll refer to an Azure Data Factory pipeline as an ADF pipeline and a CI/CD deployment pipeline as a deployment pipeline.

I'll be considering three common kinds of software test in the context of ADF pipelines:

  • Integration test: A test of a pipeline as-is, without eliminating any effects of external dependencies.
  • Functional test: An isolated test of whether the pipeline is doing things right – is the pipeline producing the desired result?
  • Unit test: An isolated test of whether the pipeline is doing the right things – do the pipeline's activities get executed in the way you expect?

You can find all sorts of definitions of these test types online. I don't claim that my description is definitive – but when I use these terms here, this is what I mean.

An “isolated test” can mean a couple of things: either isolation of a test from its external dependencies, or from other tests. I'll be considering aspects of test isolation as they come up.

I'll be using NUnit to build tests throughout this series. NUnit describes itself as “a unit-testing framework for all .NET languages”, but you'll see that

  • it needn't be restricted to unit tests alone
  • it can be used to test anything you can interact with using a .NET language.

I'm using NUnit because it automates test execution and presents the results in a convenient way. I'll be using C# (which is a .NET language) to build ADF pipeline tests using the .NET SDK. I'll start by running tests in Visual Studio, then later in the series will build them into an Azure DevOps pipeline to run automatically on various CI/CD triggers.

I am using a set of Azure resources created specifically for this series – a single resource group containing:

  • a SQL Server with two databases:
    • [AdfTesting] contains tables where I'll be importing data
    • [ExternalSystem] represents a source system over which I have no control
  • an Azure Key Vault
  • an instance of Azure Data Factory to run tests.

My test data factory contains a number of linked services and datasets:

  • Linked services:

    • KV_AzureKeyVault is a connection to the key vault
    • LS_ASQL_ExternalSystem is a connection to the [ExternalSystem] database, using a connection string stored in key vault secret “ExternalSystemDbConnectionString”
    • LS_ASQL_AdfTesting is a connection to the [AdfTesting] database, using a connection string stored in key vault secret “AdfTestingDbConnectionString”

  • Datasets:

    • DS_ASQL_ExternalSystem provides parameterised access to tables in LS_ASQL_ExternalSystem
    • DS_ASQL_AdfTesting provides parameterised access to tables in LS_ASQL_AdfTesting

The ADF pipeline I'll be testing is called “PL_Stage_Authors”. It contains a single Copy data activity that copies data from source table [dbo].[Authors] (via DS_ASQL_ExternalSystem dataset) into staging table [stg].[Authors] (via the DS_ASQL_AdfTesting dataset):

The pipeline has been published to my test data factory. You may be used to running pipelines in Debug mode, but this is a feature of the online ADF UI – to make an ADF pipeline available to an external testing tool, it must be published to an ADF instance.

To start, I create a new NUnit project in Visual Studio, called 'AdfTests':

By default this contains a single test class containing a setup method and one test:

public class Tests
{
    [SetUp]
    public void Setup()
    {
    }
 
    [Test]
    public void Test1()
    {
        Assert.Pass();
    }
}

Any method with the [Test] attribute will be executed as a test by NUnit; the method marked [SetUp] is executed once before each [Test] method is called.
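
To see the effect of this, here's a toy example (not part of the ADF solution – the class and field names are mine). Because [SetUp] re-creates the list before every test, each test sees exactly one item, regardless of what other tests have done:

```csharp
using System.Collections.Generic;
using NUnit.Framework;

public class SetUpExample
{
    private List<int> _items;

    [SetUp]
    public void SetUp()
    {
        // runs before every [Test]: each test starts with a fresh list
        _items = new List<int>();
    }

    [Test]
    public void FirstTest()
    {
        _items.Add(1);
        Assert.AreEqual(1, _items.Count);
    }

    [Test]
    public void SecondTest()
    {
        _items.Add(2);
        Assert.AreEqual(1, _items.Count);  // SetUp has already reset the list
    }
}
```
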

Tests are automatically detected by Visual Studio's test adapter. The Test Explorer pane (accessed by clicking Test → Test Explorer) shows a grouped list of tests in my VS solution:

To run the test here, I click the “Run All Tests” button. Unsurprisingly, this test passes!

I want tests and their purposes to be as readable as possible, so I like to use the Given-When-Then formula:

  • Given a scenario under test
  • When the pipeline is run
  • Then a particular expected result is observed.

The source table in my [ExternalSystem] database has 23 rows, so I rename the class Given23Rows. I'm going to execute the ADF pipeline during the test setup (because I want to run tests on its results), so I rename the [SetUp] method to WhenPipelineIsRun. To start with, I'm just going to test that the pipeline runs without error, so I rename the [Test] method to ThenPipelineOutcomeIsSucceeded.

It may look strange written down like this, but the effect in Test Explorer is much more readable. Normal C# naming conventions do not apply! You'll notice I've also renamed the namespace to the name of the pipeline, so it's easy to see which pipeline a test is for:

Again, readability is key – I want the purpose of my test to be as clear as possible. To avoid cluttering the test, I'm going to use a separate “helper” class to do most of the background work. I'll define the helper class in a moment, but here's the test, re-written to use it:

namespace PL_Stage_Authors
{
    public class Given23Rows
    {
        private PLStageAuthorsHelper _helper;
 
        [SetUp]
        public async Task WhenPipelineIsRun()
        {
            _helper = new PLStageAuthorsHelper();
            await _helper.RunPipeline();
        }
 
        [Test]
        public void ThenPipelineOutcomeIsSucceeded()
        {
            Assert.AreEqual("Succeeded", _helper.PipelineOutcome);
        }
    }
}

The [SetUp] method calls the helper to run the pipeline, then the [Test] actually tests something! I'm using Assert.AreEqual() to test that the pipeline outcome is “Succeeded”.

All the code in this article is available on GitHub – there's a link at the end.

You may notice that the [SetUp] method is now async and returns a Task instead of void. This is because the RunPipeline() method has to be awaited – NUnit supports async setup methods, so this has no effect on how the test runs.

Helper class

The helper class has two public members:

  • property PipelineOutcome stores the outcome of the last pipeline run
  • method RunPipeline() authenticates against Azure, connects to ADF, runs the pipeline and waits for it to finish.

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Management.DataFactory;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest;

public class PLStageAuthorsHelper
{
    public string PipelineOutcome { get; private set; }

    public PLStageAuthorsHelper()
    {
        PipelineOutcome = "Unknown";
    }

    public async Task RunPipeline()
    {
        PipelineOutcome = "Unknown";

        // authenticate against Azure
        var context = new AuthenticationContext("https://login.windows.net/" + Environment.GetEnvironmentVariable("AZURE_TENANT_ID"));
        var cc = new ClientCredential(Environment.GetEnvironmentVariable("AZURE_CLIENT_ID"), Environment.GetEnvironmentVariable("AZURE_CLIENT_SECRET"));
        var authResult = await context.AcquireTokenAsync("https://management.azure.com/", cc);

        // prepare ADF client
        var cred = new TokenCredentials(authResult.AccessToken);
        using (var adfClient = new DataFactoryManagementClient(cred) { SubscriptionId = Environment.GetEnvironmentVariable("AZURE_SUBSCRIPTION_ID") })
        {
            var adfName = "firefive-adftest95-adf";  // name of data factory
            var rgName = "firefive-adftest95-rg";    // name of resource group that contains the data factory

            // run pipeline
            var response = await adfClient.Pipelines.CreateRunWithHttpMessagesAsync(rgName, adfName, "PL_Stage_Authors");
            string runId = response.Body.RunId;

            // wait for pipeline to finish
            var run = await adfClient.PipelineRuns.GetAsync(rgName, adfName, runId);
            while (run.Status == "Queued" || run.Status == "InProgress" || run.Status == "Canceling")
            {
                Thread.Sleep(2000);
                run = await adfClient.PipelineRuns.GetAsync(rgName, adfName, runId);
            }
            PipelineOutcome = run.Status;
        }
    }
}

Here I'm using some Microsoft .NET libraries to interact with Azure Data Factory, connecting and running the “PL_Stage_Authors” pipeline. If you want to see more C# examples for ADF, Microsoft's quickstart tutorial is a good place to start.

You'll also notice that the code reads four environment variables: AZURE_TENANT_ID, AZURE_CLIENT_ID, AZURE_CLIENT_SECRET and AZURE_SUBSCRIPTION_ID. This is the safest way to reference credentials – they don't appear in code or in config files for the VS project, so they don't find their way into source control and can be stored securely elsewhere.

To use the variables, you need to set their values in your environment before you launch Visual Studio. You can do this from a command prompt or in a batch script (replacing the values here with real ones for your environment):

setx AZURE_TENANT_ID "your-azure-tenant-guid"
setx AZURE_SUBSCRIPTION_ID "guid-of-azure-subscription-where-adf-created"
setx AZURE_CLIENT_ID "guid-of-application-registration-with-access-to-adf"
setx AZURE_CLIENT_SECRET "application-registration-client-secret"

Use exactly this set of variable names – in a later post I'll be using another Microsoft API that expects these environment variables to exist.
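
A missing variable only shows up later, when authentication fails, so it can be worth checking for them up front. This is a sketch of my own – the class name and error message are invented for illustration, not part of any Microsoft library:

```csharp
using System;

public static class EnvironmentCheck
{
    private static readonly string[] RequiredVariables =
    {
        "AZURE_TENANT_ID", "AZURE_SUBSCRIPTION_ID",
        "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET"
    };

    // Throws a descriptive error if any required variable is missing,
    // instead of failing later with an opaque authentication error.
    public static void AssertVariablesExist()
    {
        foreach (var name in RequiredVariables)
        {
            if (string.IsNullOrEmpty(Environment.GetEnvironmentVariable(name)))
                throw new InvalidOperationException(
                    $"Environment variable {name} is not set");
        }
    }
}
```

Calling EnvironmentCheck.AssertVariablesExist() at the start of RunPipeline() makes setup mistakes much easier to diagnose.
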

Now the test is ready to run! I hit the “Run All Tests” button in Test Explorer and wait for it to succeed 8-).

Failures

A test failure might be reported for one of two reasons:

  • The pipeline does not complete successfully – in this case the test outcome will look something like this:

    Here the test failure indicates a problem in the ADF pipeline itself. Look at the pipeline execution history in the ADF UI to find out what went wrong.

  • The pipeline fails to start. This indicates some problem with your setup – in this example I have forgotten to create the AZURE_TENANT_ID environment variable:

    You can see that this is a setup failure because the exception stack trace indicates that the error occurred in the RunPipeline() method.

In this article I showed you how to build and execute a simple ADF pipeline test using NUnit in Visual Studio. It's a good start, but the test I wrote has some problems:

  • It's very basic. It only indicates whether or not the pipeline's execution was successful, nothing more.
  • It's poorly-isolated from external dependencies. The test extracts data from an external system, so the results depend on that system as much as on the pipeline. If I have zero control over the external system, this isn't much more than a basic integration test.
  • It's only one test! I want to be able to test many pipelines, with many more tests than this.
  • Next up: In the next post I refactor the test code presented here to make it easier to reuse and extend, so that I can start assembling a suite of tests for my data factory.

  • Code: The code for the series is available on GitHub. The Visual Studio solution specific to this article is in the adf-testing-series/vs/01-FirstTest folder. It contains three projects: the NUnit project AdfTests along with database projects for the [ExternalSystem] and [AdfTesting] databases. Tables in the [ExternalSystem] database are based on Microsoft's Northwind sample database.