Testing Azure Data Factory in your CI/CD pipeline

This is part four of my series about automated testing for Azure Data Factory (ADF) pipelines. If this is the first part you've read, you may prefer to start at the beginning.

In my previous post I used ADF pipeline parameters to implement dependency injection for ADF pipelines and build isolated functional tests using the NUnit testing framework. In this article, I integrate the NUnit testing solution into an Azure DevOps pipeline, so that I can run tests automatically whenever changes are made to ADF resources.

Azure Pipelines is a cloud-based service that enables you to build, test and deploy code automatically. It's part of Azure DevOps Services – I'm going to refer to Azure Pipelines as DevOps pipelines to distinguish them from the ADF pipelines I've been talking about throughout this series.

A DevOps pipeline is attached to a git repository, usually the one where you store the code you want to build, test or deploy. To use DevOps pipelines to automate my ADF pipeline testing, I'm going to need to connect my ADF instance to a git repo. If you're not already doing this, there are a number of other good reasons why you should:

  • you can save your ADF development work in progress (imagine that!)
  • each “Save All” in the ADF UI is a git commit – so you get a full history of changes to your working branch, and can undo mistakes
  • different developers can work in separate feature branches, allowing them to work independently.

Azure DevOps pipelines are triggered by changes to specified repo branches and folders – I'll be using this behaviour to re-test all my ADF pipelines automatically whenever an ADF resource is modified.

You can write a DevOps pipeline using the online “classic” editor or by defining it in a YAML file. A major advantage of using YAML is that your pipeline definition is stored as a “.yml” file in your git repo – so the pipeline definition itself is kept under source control. In this article I'll be creating a YAML DevOps pipeline for ADF testing.

All the code for this series of articles is available on GitHub, in the adf-testing-series folder of my “community” repo (link below). adf-testing-series has these subfolders:

  • adf contains Azure Data Factory resource definitions
  • devops contains the DevOps pipeline YAML (“.yml”) file
  • vs contains Visual Studio solution folders for other articles in the series – 03-FunctionalTesting and others. (03-FunctionalTesting contains the VS solution with the functional test examples I used in the previous post.)

I prefer to attach an ADF instance to a subfolder in a repo (instead of the repo root), because it enables me to organise ADF resources, ADF pipeline testing solution(s) and DevOps pipeline definitions in the same place. My data factory repository settings are shown on the right – notice that Root folder is /adf-testing-series/adf.

The DevOps pipeline I create in this article will automate the execution of ADF pipeline tests found in the 03-FunctionalTesting solution.

DevOps pipelines are created in projects, which in turn belong to organizations or project collections. If you don't already have an Azure DevOps organization, you can sign up for free – if, like me, you're using a GitHub repo, you should sign up with a GitHub account.

Once you have created and/or chosen your Azure DevOps organization and project, browse to the project homepage and click the Pipelines button in the left-hand sidebar, followed by New pipeline. This launches the four-step pipeline creation wizard:

Step 1: Connect. I select GitHub as the location of my git repo. You might be redirected to GitHub to sign in – if so, enter your GitHub credentials.

Step 2: Select. I choose my “community” repo – this is where my ADF resources are stored, and where the YAML pipeline definition will be saved. You might be redirected to GitHub to install the Azure Pipelines app – if so, select Approve and install.

Step 3: Configure. I choose Starter pipeline – this will create a basic YAML pipeline which I can modify.

Step 4: Review. By default, the YAML file will be created in the root of the repo – I don't want this, so I change the path (and the filename), saving it in folder adf-testing-series/devops. In the top right I have the option to Save and run, but clicking on the down arrow lets me choose to Save only.

Understanding the starter pipeline

The YAML starter pipeline has three sections: trigger, pool and steps:

trigger:
- master

pool:
  vmImage: 'ubuntu-latest'

steps:
- script: echo Hello, world!
  displayName: 'Run a one-line script'

- script: |
    echo Add other tasks to build, test, and deploy your project.
    echo See https://aka.ms/yaml
  displayName: 'Run a multi-line script'

  • steps is a sequence of scripts or tasks. A task is a pre-defined script – an extensive library of tasks created by Microsoft or third parties is available. Steps are the smallest units of work a DevOps pipeline can perform and are grouped into jobs – no jobs are specified here, which means that all the steps belong to the same, single job.
  • When a DevOps pipeline is run, each job is executed by an agent – software running on a virtual machine. Rather than specifying an individual agent to run a job, pool indicates a collection of agents of an appropriate type to run the pipeline. The exact choice of agent to run a DevOps pipeline is made automatically at execution time.
  • trigger defines conditions which will cause the DevOps pipeline to be executed automatically.

When the pipeline is triggered, the Azure Pipelines service requests an agent from the pool. In the case of Microsoft-hosted agents this will be a fresh virtual machine, discarded when pipeline execution is complete. The agent automatically executes a git checkout to obtain the code in the pipeline's source repo.
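
To make the implicit structure visible, the starter pipeline could be written with its single job and the checkout step spelled out. This is only a sketch of an equivalent form: the job name RunStarterSteps is a made-up label, and checkout: self stands for the git checkout that the agent otherwise performs without being asked.

```yaml
trigger:
- master

jobs:
- job: RunStarterSteps          # hypothetical job name, for illustration only
  pool:
    vmImage: 'ubuntu-latest'
  steps:
  - checkout: self              # the checkout the agent normally runs implicitly
  - script: echo Hello, world!
    displayName: 'Run a one-line script'
```

For a single-job pipeline like this one, the shorter form without jobs is easier to read, which is why I keep using it in the rest of the article.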

Now that I have a starter DevOps pipeline I can replace its default script tasks with something more useful – a Visual Studio Test task which can run NUnit tests. I configure it like this:

- task: VSTest@2
  displayName: 'Run tests'
  inputs:
    testSelector: 'testAssemblies'
    testAssemblyVer2: |
      **\AdfTests.dll
    searchFolder: 'adf-testing-series\vs\03-FunctionalTesting\tests\AdfTests\bin\Debug'
    testRunTitle: 'AdfTestRun'
    runSettingsFile: 'adf-testing-series\vs\tests.runsettings'
  env:
    AZURE_TENANT_ID: $(AzureTenantId)
    AZURE_SUBSCRIPTION_ID: $(AzureSubscriptionId)
    AZURE_CLIENT_ID: $(AzureClientId)
    AZURE_CLIENT_SECRET: $(AzureClientSecret)

VSTest is the unique name of the Visual Studio test task; @2 indicates that I am using version 2. It takes these arguments:

  • displayName is how the task will appear in the pipeline run output
  • inputs is a set of input values for the task:
    • testSelector, testAssemblyVer2 and searchFolder tell the task that the tests are defined in a “.dll” file and where to find it
    • testRunTitle is a name for the test run
    • runSettingsFile is the path to a runsettings file for the test suite. I'm using the file stored in the git repo (the same one I've been using to run tests in Visual Studio), but I could specify a different file here. For example, this would let me use VS to run tests in one ADF instance during development, then later re-run them automatically in a different ADF testing instance.
  • env is a list of environment variables that I want to set for the task – recall that these are the four environment variables used by the testing code to connect to Azure Data Factory and to the Azure Key Vault. For security reasons I want to keep their values out of source control, so I'm passing them in from DevOps pipeline variables instead of including them in the YAML file.
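
As a rough illustration of pointing the same tests at a different data factory, the task could reference another settings file or override individual test run parameters. Treat the sketch below as an assumption-laden example: ci.runsettings is a hypothetical file that does not exist in my repo, and DataFactoryName and my-ci-adf are invented names for a TestRunParameters entry and its value.

```yaml
- task: VSTest@2
  displayName: 'Run tests against a CI data factory'
  inputs:
    testSelector: 'testAssemblies'
    testAssemblyVer2: |
      **\AdfTests.dll
    searchFolder: 'adf-testing-series\vs\03-FunctionalTesting\tests\AdfTests\bin\Debug'
    testRunTitle: 'AdfTestRun'
    # Hypothetical alternative settings file (not part of the original repo)
    runSettingsFile: 'adf-testing-series\vs\ci.runsettings'
    # Optional: override single TestRunParameters values instead of swapping files.
    # 'DataFactoryName' and 'my-ci-adf' are assumed names, for illustration only.
    overrideTestrunParameters: '-DataFactoryName my-ci-adf'
  env:
    AZURE_TENANT_ID: $(AzureTenantId)
    AZURE_SUBSCRIPTION_ID: $(AzureSubscriptionId)
    AZURE_CLIENT_ID: $(AzureClientId)
    AZURE_CLIENT_SECRET: $(AzureClientSecret)
```

Either option keeps the development settings file untouched, so running tests from Visual Studio continues to work as before.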

This seems fairly straightforward, but it suggests that I need to do some other things first:

  • to be able to run any ADF pipeline tests, my ADF resources must be published to the data factory instance specified in my selected runSettingsFile
  • to obtain a “.dll” file, I need to build the test project (recall that so far the DevOps agent has only checked out the source code)
  • to populate the AZURE_… environment variables, I need first to populate the pipeline variables – while still keeping them out of source control.

The Visual Studio Build & Test tasks also require Visual Studio to be installed on the agent machine. VMs in the Microsoft-hosted windows-2019 agent pool have VS 2019 pre-installed, so I'll use this instead of the starter pipeline's pool vmImage:

pool:
  vmImage: 'windows-2019'

Publishing ADF resources automatically deserves a whole series of articles by itself, so I won't go into it in much detail here. For the purposes of this post, I'm using a simple PowerShell script which loops through the ADF resource JSON files in my git repo and deploys them using Set-… cmdlets from the Az.DataFactory PowerShell module.

The PowerShell script is included in my git repo (link below) but real deployment pipelines benefit from something a bit more sophisticated. I highly recommend Kamil Nowinski's tutorials on using DevOps pipelines to publish to ADF from ARM templates and from JSON files.

I use the Azure PowerShell task to execute my script:

- task: AzurePowerShell@4
  displayName: Publish ADF resources
  inputs:
    azureSubscription: $(PipelineServiceConnection)
    azurePowerShellVersion: latestVersion
    ScriptPath: adf-testing-series\adf\publish.ps1
    ScriptArguments: -resourceGroupName 'firefive-adftest95-rg' -dataFactoryName 'firefive-adftest95-adf' -adfFileRoot '$(System.DefaultWorkingDirectory)\adf-testing-series\adf' 

PipelineServiceConnection

$(PipelineServiceConnection) (used in the azureSubscription input for the AzurePowerShell@4 task) refers to a DevOps pipeline variable containing the name of a service connection defined in my Azure DevOps project. The service connection provides an Azure service principal which I can authorise to access other resources in Azure.

In this case, the DevOps pipeline uses the service connection to deploy resources to ADF. For this to work, the underlying service principal must be permitted to make ADF deployments.

You probably won't be surprised to learn that I can use a Visual Studio Build task to build the testing “.dll” file. Before I do, remember all those APIs I'm using (to talk to ADF, talk to the Key Vault, use FluentAssertions)? Those packages aren't stored in my git repo, so first I need to download them from NuGet. There's a predefined task for that too:

- task: NuGetCommand@2
  displayName: Restore NuGet packages
  inputs:
    command: restore
    feedsToUse: 'select'
    restoreSolution: 'adf-testing-series\vs\03-FunctionalTesting\tests\AdfTests\AdfTests.csproj'

- task: VSBuild@1
  displayName: 'Build testing project'
  inputs:
    solution: 'adf-testing-series\vs\03-FunctionalTesting\tests\AdfTests\AdfTests.csproj'
    configuration: 'Debug'
    clean: true

I'm using NuGetCommand@2 to restore packages from NuGet.org. restoreSolution specifies the path to my VS solution or project – you can see I'm specifying the AdfTests project (rather than the solution that contains it). I'm also specifying the testing project in VSBuild@1's solution argument – I don't want to waste time building the entire solution because I'm only interested in the testing project at this moment.

I'm passing values for the AZURE_… environment variables into the VSTest@2 testing task using pipeline variables, because I don't want to write their secret values into the YAML file under source control. I could store them securely as secret variables, but I can avoid having to do this by running my tests using the service principal associated with the pipeline service connection. I can extract the values I need from the service principal at runtime using the Azure CLI task:

- task: AzureCLI@2
  displayName: 'Set pipeline identity variables'
  inputs:
    azureSubscription: '$(PipelineServiceConnection)'
    scriptType: 'pscore'
    scriptLocation: 'inlineScript'
    addSpnToEnvironment: true
    inlineScript: |
      Write-Host "##vso[task.setvariable variable=AzureTenantId;issecret=true]$env:tenantId"
      Write-Host "##vso[task.setvariable variable=AzureSubscriptionId;issecret=true]$(az account show --query 'id' --output tsv)"
      Write-Host "##vso[task.setvariable variable=AzureClientId;issecret=true]$env:servicePrincipalId"
      Write-Host "##vso[task.setvariable variable=AzureClientSecret;issecret=true]$env:servicePrincipalKey"

AzureCLI@2 has an addSpnToEnvironment argument which allows me to inject the service principal identity into the task as a number of environment variables. The inlineScript argument uses ##vso[task.setvariable… to copy those values from the task environment into DevOps pipeline variables. This means that the only pipeline variable I need to configure is PipelineServiceConnection.
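
Because the variables are set with issecret=true, later steps do not see them as environment variables automatically; they have to be mapped explicitly through env, as in the VSTest@2 task. A small guard step along the following lines could catch a misconfigured service connection before the tests start. It is a sketch rather than part of my pipeline, and it assumes that a value beginning with '$' is unresolved macro text left behind by an unset variable.

```yaml
- pwsh: |
    $names = 'AZURE_TENANT_ID', 'AZURE_SUBSCRIPTION_ID', 'AZURE_CLIENT_ID', 'AZURE_CLIENT_SECRET'
    $missing = $names | Where-Object {
      $value = [Environment]::GetEnvironmentVariable($_)
      # Assumption: an unset pipeline variable leaves its macro text behind,
      # so a value starting with '$' is treated as missing too
      [string]::IsNullOrEmpty($value) -or $value.StartsWith('$')
    }
    if ($missing) { throw ("Missing identity values: " + ($missing -join ', ')) }
  displayName: 'Check pipeline identity variables'
  env:
    # Secret variables must be mapped explicitly to reach the script
    AZURE_TENANT_ID: $(AzureTenantId)
    AZURE_SUBSCRIPTION_ID: $(AzureSubscriptionId)
    AZURE_CLIENT_ID: $(AzureClientId)
    AZURE_CLIENT_SECRET: $(AzureClientSecret)
```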

Before deciding on a trigger for my testing pipeline, I need to think a bit about workflow. This isn't a technical feature – it's the organisational process that a team uses to manage development. There are all sorts of options here depending on how your team is set up. Here's the workflow I'm using:

  1. Data engineers develop ADF changes. They do this in a shared development ADF instance which is attached to a git repo. They work in feature branches and run their pipelines using the Debug option in the ADF UI. At the same time (and in the same feature branch) they write/revise NUnit test fixtures for their new/modified ADF pipelines.
  2. When an ADF change and its tests are ready, an engineer opens a pull request to merge the feature branch into the master branch.
  3. When the pull request has been reviewed and gained approval, the feature branch is merged into master. At this point, ADF pipeline testing should be triggered.

This is one reason I prefer to publish ADF resources from JSON files rather than ARM templates – a merge to master results in deployment-ready artefacts, so their publication can be triggered directly.

The trigger definition below indicates that the DevOps pipeline should run on pushes to the master branch, but only when files in certain paths are modified: ADF resource JSON files or my VS testing project. There are many other, unrelated resources in my git repo, and I don't want changes to those to cause unnecessary ADF test runs.

trigger:
  branches:
    include:
      - master
  paths:
    include: 
      - adf-testing-series/adf/*
      - adf-testing-series/vs/03-FunctionalTesting/tests/AdfTests/*

I'm triggering the DevOps pipeline on changes to data factory resources or to the testing project – when either one of those things is updated, I want to re-run all the tests.

My choice of DevOps pipeline trigger – when changes are pushed to the repo's master branch – is a consequence of my development workflow. In my example I have a single shared instance for ADF development and testing, so I have to accept limitations on how frequently I can run tests (because I don't want test runs triggered by different developers to collide). A different workflow could use different triggers – if each developer has a dedicated ADF instance (or if you create one per feature branch), you can run tests sooner and more often.
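
For comparison, a workflow with one ADF instance per feature branch might use a trigger like the one sketched here. The feature/* naming convention is an assumption, and the publish step would also have to target the branch's own data factory, which this fragment does not show.

```yaml
trigger:
  branches:
    include:
      - feature/*      # assumed branch naming convention
  paths:
    include:
      - adf-testing-series/adf/*
      - adf-testing-series/vs/03-FunctionalTesting/tests/AdfTests/*
```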

Now I have everything I need to run tests using my DevOps pipeline. I assemble the various components of the YAML file in this order:

  • trigger – DevOps pipeline trigger as above
  • pool – I'm using vmImage: 'windows-2019'
  • steps
    • task AzurePowerShell@4 publishes ADF resources to my testing data factory
    • NuGetCommand@2 restores NuGet packages to the VS testing project
    • VSBuild@1 builds the project
    • AzureCLI@2 uses the DevOps pipeline service connection name to obtain credentials
    • VSTest@2 runs the tests against the published ADF pipelines.

I won't reproduce the full script here – it's available in my GitHub repo. In the script you'll also notice I use a name expression to generate a unique label for each DevOps pipeline run.
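
As an indication of what a name expression looks like, the fragment below labels each run with a date and a daily revision counter. It is an illustrative guess at the general shape, not a copy of the expression in my repo, and the AdfTests_ prefix is invented.

```yaml
# Placed at the top level of the YAML file, alongside trigger and pool.
# Produces labels such as AdfTests_20200101.1, AdfTests_20200101.2, ...
name: AdfTests_$(Date:yyyyMMdd)$(Rev:.r)
```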

I've defined PipelineServiceConnection as a pipeline variable in the Azure DevOps UI, to keep it out of source control. This is why you don't see it in the YAML file.

A DevOps pipeline run (i.e. a full test of all my ADF pipelines) will be triggered automatically whenever a change to ADF resources or to the testing project is pushed into the git repo's master branch. If necessary, I can also trigger a DevOps pipeline run manually from the Azure DevOps UI.

If any test fails during the DevOps pipeline run, the VSTest@2 task (displayed as “Run tests”) – and the DevOps pipeline run itself – will fail. This screenshot shows the detail of a failed pipeline run:

I can see which test failed in the VSTest@2 task's output (above). This isn't very user-friendly, but the collected results of the test run are also automatically published in the DevOps project's test management area:

From here I can drill down into the set of test results, easily identifying failed tests and failure reasons:

Azure DevOps issues email notifications to subscribed users when a DevOps pipeline run completes – if I receive a notification that this pipeline has failed, it's a good indication that one or more tests have failed. A more sophisticated approach (for example, to send notifications only on failure, or to send them to some other messaging channel) would be to script your own notification tasks in the YAML pipeline.
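
A scripted notification of that kind could be as simple as a final step that runs only when an earlier step has failed. The sketch below assumes a chat-style incoming webhook that accepts a JSON body with a text field; NotificationWebhookUrl is a hypothetical secret pipeline variable, and the payload shape will differ between messaging services.

```yaml
- pwsh: |
    # Build a link back to this run from predefined pipeline variables
    $runUrl = "${env:SYSTEM_TEAMFOUNDATIONCOLLECTIONURI}${env:SYSTEM_TEAMPROJECT}/_build/results?buildId=${env:BUILD_BUILDID}"
    # Assumed payload shape: a single 'text' field
    $body = @{ text = "ADF test run ${env:BUILD_BUILDNUMBER} failed: $runUrl" } | ConvertTo-Json
    Invoke-RestMethod -Uri $env:WEBHOOK_URL -Method Post -ContentType 'application/json' -Body $body
  displayName: 'Notify on failure'
  condition: failed()            # run only if a previous step failed
  env:
    WEBHOOK_URL: $(NotificationWebhookUrl)   # hypothetical secret variable
```

Placed after the VSTest@2 task, the step is skipped on successful runs, so the channel only hears about failures.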

In this post I developed an Azure DevOps pipeline that runs all my ADF pipeline tests, triggered automatically whenever a change is made to an ADF resource or a test fixture. This is one piece of a longer CI/CD pipeline, also responsible for building, testing and deploying all the other components of my data platform implementation.

In my workflow, a change is considered “made” when it is pushed into the git repo's master branch. Exactly how frequently I can trigger test runs depends on my development workflow, because I have to publish ADF pipelines to a data factory instance before I can test them. Workflows which permit early and frequent testing enable faster feedback, which helps to improve development quality.

  • Next up: In the previous post I looked at isolating ADF pipelines, in order to verify that they're “doing things right”. In the next post I extend the approach to check that they're “doing the right things” – this is how I described a unit test.

  • Code: The code for the series is available on GitHub. The DevOps YAML pipeline from this post is in the adf-testing-series/devops folder. The DevOps pipeline publishes ADF resources from folder adf-testing-series/adf and runs the set of tests defined in the VS solution in adf-testing-series/vs/03-FunctionalTesting.

  • Share: If you found this article useful, please share it!
