Automate integration tests in Azure Data Factory

This is the second article in my series about automated testing for Azure Data Factory (ADF) pipelines. If you haven't already seen the first, you may prefer to start there.

In my previous post, I set up and ran one basic test of a single pipeline. In this article, I refactor my VS testing solution to make it easier to add new tests and to test new pipelines.

The helper class I introduced in the last article does a number of different things, and does them in a very specific way:

  • uses system settings from environment variables
  • connects to a named data factory instance
  • runs a named ADF pipeline.

This allows me to run a basic test, but it's not very reusable. It also mixes a number of different features together in one class, which isn't very convenient (and isn't great design). I'm going to separate the class into a number of related helper classes to make this easier to manage.

You may prefer to follow this section in the complete VS solution. All the code in this article is available on GitHub – there's a link at the end.

My original helper class used environment variables to hold secret settings for connecting to Azure, but I also want to be able to:

  • store and access non-sensitive settings for my testing setup
  • access secrets held in my Azure Key Vault.

Non-sensitive settings

A convenient way to store non-sensitive settings for NUnit is to store them in a runsettings file. A runsettings file contains XML describing test parameters as a collection of name/value pairs:

<?xml version='1.0' encoding='utf-8'?>
<RunSettings>
    <TestRunParameters>
        <Parameter name="KeyVaultUrl" value="https://firefive-adftest95-kv.vault.azure.net" />
        <Parameter name="DataFactoryResourceGroup" value="firefive-adftest95-rg" />
        <Parameter name="DataFactoryName" value="firefive-adftest95-adf" />
    </TestRunParameters>
</RunSettings>

You'll notice I've included the URL of my key vault as the first non-sensitive setting.

To specify the runsettings file for Visual Studio:

  • save it somewhere convenient with a name ending in “.runsettings” – I'm using “tests.runsettings”
  • configure it for the solution in Visual Studio, by clicking Test → Configure Run Settings → Select Solution Wide runsettings File, then choose the file you saved.

Accessing the key vault

Microsoft provides a .NET API for connecting to an Azure Key Vault. This needs the URL of the key vault (which I've already stored in my runsettings file) and an object to provide authentication credentials. I'm going to use a DefaultAzureCredential object, like this:

var myCredential = new DefaultAzureCredential();

What's a default credential?! The DefaultAzureCredential class uses three environment variables to authenticate against Azure, which is why I don't need to specify any in the code:

  • AZURE_TENANT_ID
  • AZURE_CLIENT_ID
  • AZURE_CLIENT_SECRET

I set these up in the previous post, so I'm good to go. 8-)

Settings helper class

Bringing this together, my settings helper class looks like this:

public class SettingsHelper
{
    public string GetSetting(string settingName)
    {
        // return environment variable "settingName", if present
        var value = Environment.GetEnvironmentVariable(settingName);
        if (value?.Length > 0)
            return value;
 
        // return value of runsettings parameter "settingName", if present
        value = TestContext.Parameters[settingName];
        if (value?.Length > 0)
            return value;
 
        // if a key vault is specified, return the value of secret "settingName", if present
        if (_keyVaultClient != null)
        {
            value = _keyVaultClient.GetSecret(settingName).Value.Value;
            if (value?.Length > 0)
                return value;
        }
 
        throw new Exception($"Test setting '{settingName}' not found");
    }
 
    private readonly SecretClient _keyVaultClient;
 
    public SettingsHelper()
    {
        var kvUrl = TestContext.Parameters["KeyVaultUrl"];
        if (kvUrl?.Length > 0)
            _keyVaultClient = new SecretClient(new Uri(kvUrl), new DefaultAzureCredential());
    }
}

The core of the class is the GetSetting() method. It takes a settingName parameter, then looks for a setting of that name in three different places:

  • local environment variables
  • my runsettings file – the collection of name/value pairs is accessed conveniently in code using TestContext.Parameters
  • my key vault.

I check the key vault last because it takes the most work to check, and because if a secret named settingName isn't found I can allow the resulting exception to be thrown from the test.

My data factory helper class looks very similar to the data factory code I used in the last post. It's a subclass of SettingsHelper (line 1), so I can continue to access all helper functionality in the test through a single helper object.

  1. public class DataFactoryHelper : SettingsHelper
  2. {
  3. public string PipelineOutcome { get; private set; }
  4.  
  5. public async Task RunPipeline(string pipelineName)
  6. {
  7. PipelineOutcome = "Unknown";
  8.  
  9. // authenticate against Azure
  10. var context = new AuthenticationContext("https://login.windows.net/" + GetSetting("AZURE_TENANT_ID"));
  11. var cc = new ClientCredential(GetSetting("AZURE_CLIENT_ID"), GetSetting("AZURE_CLIENT_SECRET"));
  12. var authResult = await context.AcquireTokenAsync("https://management.azure.com/", cc);
  13.  
  14. // prepare ADF client
  15. var cred = new TokenCredentials(authResult.AccessToken);
  16. using (var adfClient = new DataFactoryManagementClient(cred) { SubscriptionId = GetSetting("AZURE_SUBSCRIPTION_ID") })
  17. {
  18. var adfName = GetSetting("DataFactoryName");
  19. var rgName = GetSetting("DataFactoryResourceGroup");
  20.  
  21. // run pipeline
  22. var response = await adfClient.Pipelines.CreateRunWithHttpMessagesAsync(rgName, adfName, pipelineName);
  23. string runId = response.Body.RunId;
  24.  
  25. // wait for pipeline to finish
  26. var run = await adfClient.PipelineRuns.GetAsync(rgName, adfName, runId);
  27. while (run.Status == "Queued" || run.Status == "InProgress" || run.Status == "Canceling")
  28. {
  29. Thread.Sleep(2000);
  30. run = await adfClient.PipelineRuns.GetAsync(rgName, adfName, runId);
  31. }
  32. PipelineOutcome = run.Status;
  33. }
  34. }
  35.  
  36. public DataFactoryHelper()
  37. {
  38. PipelineOutcome = "Unknown";
  39. }
  40. }

It has some important differences:

  • RunPipeline() (line 5) now requires a pipelineName parameter – I can use it to run any pipeline I choose.
  • All references to environment variables (lines 10, 11, 16) are now made via the GetSetting() method. If I forget to set a variable, I'll get a more helpful exception message from SettingsHelper.
  • The data factory and resource group names are also accessed via GetSetting() (lines 18, 19). If I want to point this test suite at a different ADF instance, all I need to do is edit the runsettings file.

In the same way that the settings helper is only concerned with providing access to settings, the data factory helper is now only concerned with running ADF pipelines in general. This allows me to share its functionality between tests of different pipelines.

The pipeline helper still exists – it subclasses DataFactoryHelper and manages pipeline-specific information:

public class PLStageAuthorsHelper : DataFactoryHelper
{
    public async Task RunPipeline()
    {
        await RunPipeline("PL_Stage_Authors");
    }
}

At this stage, all it's doing is supplying the pipeline name to the data factory helper.

NUnit refers to a class containing tests as a test fixture. I've been careful to reorganise my helper class without changing my Given23Rows test fixture at all – it's still simple, still readable.

I'm going to make one change anyway, because I don't like Assert.AreEqual() – it's core NUnit syntax, but remember I want my tests to be as readable as possible. I prefer to use FluentAssertions, a set of extension methods that allows me to specify expected test outcomes much more naturally. Here's my original test:

[Test]
public void ThenPipelineOutcomeIsSucceeded()
{
    Assert.AreEqual("Succeeded", _helper.PipelineOutcome);
}

And here's the same test, re-written using a fluent assertion:

[Test]
public void ThenPipelineOutcomeIsSucceeded()
{
    _helper.PipelineOutcome.Should().Be("Succeeded");
}

I've organised the changes in my Visual Studio solution like this:

  • General helper classes (not pipeline-specific helpers) are in the “Helpers” folder
  • the “Pipelines” folder will contain pipeline-specific subfolders – “PL_Stage_Authors” contains the test fixtures and helper class belonging to my ADF pipeline of the same name.

I can run this test in the same way as before and get exactly the same result. So what's the benefit? To start with, separating helper behaviour out into layers like this makes it easier to slot in other pieces of functionality. I'm going to do that now for my next test: checking the number of rows that were staged by the pipeline.

I need to make a few changes to accomplish this:

  • add a new test to Given23Rows
  • add a StagedRowCount property to the pipeline helper class PLStageAuthorsHelper
  • add a way to get the staged row count from the database.

This is the new test in Given23Rows, also written using a fluent assertion:

[Test]
public void Then23RowsAreStaged()
{
    _helper.StagedRowCount.Should().Be(32);
}

Here's PLStageAuthorsHelper updated with the new StagedRowCount property:

  1. public class PLStageAuthorsHelper : DatabaseHelper
  2. {
  3. public async Task RunPipeline()
  4. {
  5. await RunPipeline("PL_Stage_Authors");
  6. }
  7.  
  8. public int StagedRowCount
  9. {
  10. get
  11. {
  12. return RowCount("stg.Authors");
  13. }
  14. }
  15. }

I think that counting rows is likely to be generally useful for testing, so I haven't queried the database here. Instead I've called a RowCount() method (line 12), passing the parameter value “stg.Authors” – recall this is the name of the table where the pipeline writes its output.

I haven't defined RowCount() yet, but the code contains a clue about where I'm going to put it – the pipeline helper is no longer a subclass of the data factory helper (line 1) but of a new class, DatabaseHelper.

The new database helper class has a single method, RowCount(), which takes a tableName parameter. The method returns the number of rows in the specified table:

  1. public class DatabaseHelper : DataFactoryHelper
  2. {
  3. public int RowCount(string tableName)
  4. {
  5. using (var conn = new SqlConnection(GetSetting("AdfTestingDbConnectionString")))
  6. {
  7. conn.Open();
  8. using (var cmd = new SqlCommand($"SELECT COUNT(*) FROM {tableName}", conn))
  9. using (var reader = cmd.ExecuteReader())
  10. {
  11. reader.Read();
  12. return reader.GetInt32(0);
  13. }
  14. }
  15. }
  16. }

The first line in the method body (line 5) creates a connection to a database. Which one? I'm using GetSetting() to retrieve the connection string for the [AdfTesting] DB – recall in the first post I said that this was stored in the Key Vault for ADF to use, so I'm going to take advantage of that.

I'm building a SQL statement with whatever arrives in the tableName parameter. In a public-facing system (like a website) this would be vulnerable to SQL injection, but it's safe here because my testing setup is completely closed.

The database helper subclasses the data factory helper, so my helper inheritance hierarchy now looks like this:

I hit the “Run All Tests” button in Test Explorer and… whoops!

Then23RowsAreStaged has failed because the row count wasn't what the test expected. You might have seen this coming when you looked at the new test definition – I specified the test incorrectly, by expecting 32 rows instead of 23. Another reason a test can fail is that it's badly-written!

Look closely at the output from the test run above, and you'll notice that the two tests each took ten seconds to execute. In the last post I mentioned that NUnit runs a fixture's [SetUp] method before each [Test] method is called – each test took so long here because the pipeline was executed once before each test was run, two times in total.

Running [SetUp] before each [Test] helps to ensure that tests are isolated from one another. My approach has been to use [SetUp] to run an ADF pipeline, then to check a feature of its outcome in [Test] – so repeated calls to [SetUp] don't make my tests any more isolated than they already are. From now on, I'm going to run test setup (and the ADF pipeline) only once per fixture, before any of the tests are run. I do this by changing the WhenPipelineIsRun() method's attribute to [OneTimeSetUp]:

[OneTimeSetUp]
public async Task WhenPipelineIsRun()
{
    _helper = new PLStageAuthorsHelper();
    await _helper.RunPipeline();
}

Now the method will only be called once, followed by each of the [Test] methods in the class. Tests will be faster (and cheaper) to run, because I'm triggering fewer ADF pipeline runs.

Reported test durations will now seem very low – I think this is because reported test duration includes [SetUp] execution time, which is now zero.

This test is still not isolated from external dependencies. In fact, I've made the situation worse, because as soon as the source [dbo].[Authors] table has a new row inserted, my row count test will fail. Even if the data doesn't change often, I still have no control over it, so the new test is very fragile.

My two tests are really of different types:

  • ThenPipelineOutcomeIsSucceeded is a simple integration test – although it doesn't seem to do much, it shows that the pipeline can connect to the source and staging databases, that the source and target tables exist and that ADF has permission to read and write to them.
  • Then23RowsAreStaged is an attempt at a functional test. In the previous post I described a functional test as an isolated test of whether the pipeline is doing things right. Checking that it stages 23 rows is one test that it's “doing things right”, but its setup isn't isolated at all.

What makes each of these either an integration test or a functional test isn't just a question of the [Test] method – it's a combination of the test and its [OneTimeSetUp]. My attempt at a functional test is fragile because I'm setting it up like an integration test. I'll talk about how to set up functional tests in my next post.

Now I have some reusable helpers, I'm in good shape to start testing a second ADF pipeline. I have another pipeline called “PL_Stage_Titles” – it does the same thing as “PL_Stage_Authors”, but for source table [dbo].[Titles], and it has some additional logging behaviour:

  • “Log pipeline start” is a Lookup activity which calls a database stored procedure to make an entry in a log table and return an integer run ID.
  • The returned run ID is stored in a RunId pipeline variable by “Set RunId”, a Set variable activity.
  • The Copy data activity is very similar to the one in “PL_Stage_Authors”, but adds the RunId variable as a source column for insertion into [stg].[Titles] (boxed in red on the screenshot).
  • The “Log pipeline end” activity calls another stored procedure to update the log table.

Remember that to be able to test an ADF pipeline, it must be published to an instance of ADF.

The refactored helper structure makes it easier and faster for me to write new tests. To start testing “PL_Stage_Titles” quickly, I'm going to copy the integration test setup I already wrote for the “PL_Stage_Authors” pipeline. I need to:

  • copy and modify the pipeline helper class
  • copy and modify the test fixture.

I'll create the test fixture and helper in a new “PL_Stage_Titles” subfolder of my Visual Studio “Pipelines” folder.

Pipeline helper

I copy “PLStageAuthorsHelper.cs” from folder “PL_Stage_Authors” and into “PL_Stage_Titles”, then make the following changes:

  • modify the namespace to PL_Stage_Titles (line 1)
  • change the class name (and the name of the .cs file itself) to PLStageTitlesHelper (line 3)
  • revise the RunPipeline() method to run ADF pipeline “PL_Stage_Titles” (line 7)
  • revise the StagedRowCount property to count rows in “stg.Titles” (line 14).
  1. namespace PL_Stage_Titles
  2. {
  3. public class PLStageTitlesHelper : DatabaseHelper
  4. {
  5. public async Task RunPipeline()
  6. {
  7. await RunPipeline("PL_Stage_Titles");
  8. }
  9.  
  10. public int StagedRowCount
  11. {
  12. get
  13. {
  14. return RowCount("stg.Titles");
  15. }
  16. }
  17. }
  18. }

Test fixture

I copy “Given23Rows.cs” from folder “PL_Stage_Authors” and into “PL_Stage_Titles”, then make the following changes:

  • modify the namespace to PL_Stage_Titles (line 1)
  • change the class name (and the name of the .cs file itself) to GivenExternalDependencies to reflect that this is an integration test (line 3)
  • change the type of the _helper instance variable to PLStageTitlesHelper (line 5)
  • remove method Then23RowsAreStaged – a row count check is more like a functional test, so isn't compatible with this this fixture's [OneTimeSetUp].
  1. namespace PL_Stage_Titles
  2. {
  3. public class GivenExternalDependencies
  4. {
  5. private PLStageTitlesHelper _helper;
  6.  
  7. [OneTimeSetUp]
  8. public async Task WhenPipelineIsRun()
  9. {
  10. _helper = new PLStageTitlesHelper();
  11. await _helper.RunPipeline();
  12. }
  13.  
  14. [Test]
  15. public void ThenPipelineOutcomeIsSucceeded()
  16. {
  17. _helper.PipelineOutcome.Should().Be("Succeeded");
  18. }
  19. }
  20. }

I hit the “Run All Tests” button in Test Explorer. You'll notice that Test Explorer now shows the two test fixtures, grouped by namespace:

I still haven't fixed the bad test definition in Given23Rows – it's a functional test set up like an integration test so has bigger problems! I'll replace it in the next post.

In this article I refactored the basic test structure from the previous post. I separated helper functionality into smaller classes which are easier to understand, maintain and share between pipelines. Sharing functionality between tests of different pipelines makes it easier and faster to write new tests.

Using [OneTimeSetUp] makes my tests faster and cheaper to run, because it requires fewer ADF pipeline runs. This is sufficient to isolate tests from one another because I'm running ADF pipelines during test setup. The test setup I've been using here is suitable for integration testing – in the next post I'll look at pipeline isolation for reliable functional tests.

  • Next up: In the next post I look at isolating tests from external dependencies. This makes robust functional testing possible, and allows me to test a wider range of scenarios.

  • Code: The code for the series is available on Github. The Visual Studio solution specific to this article is in the adf-testing-series/vs/02-IntegrationTesting folder. It contains three projects: the NUnit project AdfTests along with database projects for the [ExternalSystems] and [AdfTesting] databases. Tables in the [ExternalSystems] database are based on Microsoft's Northwind sample database.

    The GitHub repo includes my “tests.runsettings” file, but its association with the VS solution is not persisted – this is a VS issue. Before running tests for the first time you will need to specify the solution's runsettings file as above.

  • Share: If you found this article useful, please share it!

R J Y Q S