Azure Data Factory, the ADF UX and Git

Published
10-Nov-2020

If you're using Azure Data Factory (ADF), you're probably using Git and almost certainly using the ADF User Experience (ADF UX) – ADF's online integrated development environment (IDE). These three components are so closely interlinked that sometimes it's hard to think about them separately – in this article I try to do exactly that.

A data factory – an instance of ADF – is an Azure resource. You can find it in the Azure portal, without even opening the ADF UX.

I find it useful to think of it in two parts:

  • a set of resource definitions (pipelines, datasets, linked services etc) that have been published to the factory
  • something I'm loosely describing as factory machinery which interprets those published resource definitions and executes pipelines. This description is neither precise nor official 😀.

The factory machinery uses metadata stored in linked service and integration runtime definitions to interact with the wider external environment – other services elsewhere in Azure and beyond.
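To make "metadata stored in linked service definitions" a little more concrete, here is a minimal sketch of a linked service definition held as a Python dictionary, with a helper that reports which integration runtime it connects through. The resource names (`ls_blob_landing`, `ir_selfhosted`) and the helper `describe_connection` are hypothetical, and the definition is trimmed to the fields needed for illustration.

```python
# Illustrative only: the names and values below are hypothetical.
linked_service = {
    "name": "ls_blob_landing",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            # Placeholder; real factories often reference a secret store here.
            "connectionString": "<connection string>",
        },
        "connectVia": {
            "referenceName": "ir_selfhosted",
            "type": "IntegrationRuntimeReference",
        },
    },
}


def describe_connection(definition: dict) -> str:
    """Summarise what a linked service connects to, and via which runtime."""
    props = definition["properties"]
    # With no explicit connectVia, assume the default Azure integration runtime.
    runtime = props.get("connectVia", {}).get(
        "referenceName", "AutoResolveIntegrationRuntime"
    )
    return f"{definition['name']}: {props['type']} via {runtime}"


print(describe_connection(linked_service))
```

The point is only that everything the machinery needs to reach an external service is data in a resource definition, which is why those definitions can be saved, versioned and deployed like any other file.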

I described the ADF UX as an IDE, but in reality it combines two different roles:

  • an online integrated development environment, used to develop and debug data factory resources (sometimes referred to as “authoring”)
  • a management tool, used to monitor the behaviour of published factory resources, verify pipeline success, diagnose faults etc.

An ADF UX session is always connected to a data factory instance. The dashed horizontal arrows between the UX and the connected instance in the diagram indicate monitoring tasks; the others indicate authoring tasks.

  • When you first open the ADF UX, you choose a data factory instance to connect to. When connecting to a factory that is not Git-enabled, your ADF UX session is loaded with a set of resource definitions copied from the published factory instance.

  • You develop, test and debug changes to pipelines and other resources in the ADF UX authoring canvas. Running pipelines in debug mode still needs compute resources, so the ADF UX uses the factory machinery from the attached ADF instance.

  • If you make changes to resource definitions in this Git-free environment, the only way to keep them is to publish them immediately to the factory instance. The factory machinery starts to use those updated resource definitions as soon as they have been published.

  • The ADF UX monitoring experience allows you to monitor the behaviour and performance of published pipelines, their triggers and activities.

  • Because the ADF UX is loaded with resource definitions already published in the factory, you are able to inspect published resources directly.

    It's useful to be able to do this, for example when a pipeline fails, because you can verify exactly what pipeline definition is being used at runtime. (Noting this as a separate task might seem strange, but bear with me).

Using Git allows you to save your factory resource definitions without having to publish them, and even better, now they're under version control 😅. To use Git, a factory instance must be linked to a Git repository in Azure Repos or GitHub.

Now when you open the ADF UX, the factory resources loaded into your session are the versions stored in Git, instead of the resources published in the factory. When connected to a Git-enabled factory, the UX gains a “Save” button – clicking “Save” stores changes made in the UX to your Git server, without publishing them to the factory.

You can still publish resources to the factory by clicking “Publish”, but you can only do this from a nominated collaboration branch in Git (usually main or equivalent). Publishing also updates a special publish branch in your repository (usually adf_publish), writing factory resources into it in the form of ARM templates. (For the purposes of this post I'm side-stepping the whole area of feature branch workflow, but using one is highly recommended).
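The load, save and publish rules described so far can be sketched as a toy model. This is not the ADF API: the classes `GitRepo`, `Factory` and `UxSession` are hypothetical, resource definitions are reduced to plain strings, and the ARM templates written to the publish branch are represented by a simple copy of the definitions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GitRepo:
    collaboration_branch: str = "main"
    publish_branch: str = "adf_publish"
    # branch name -> {resource name -> definition}
    branches: dict = field(default_factory=dict)


@dataclass
class Factory:
    published: dict = field(default_factory=dict)
    repo: Optional[GitRepo] = None  # None means the factory is not Git-enabled


class UxSession:
    def __init__(self, factory: Factory, branch: Optional[str] = None):
        self.factory = factory
        repo = factory.repo
        self.branch = (branch or repo.collaboration_branch) if repo else None
        # Git-enabled: load from the branch. Otherwise: load what is published.
        source = repo.branches.get(self.branch, {}) if repo else factory.published
        self.working = dict(source)

    def save(self) -> None:
        if self.factory.repo is None:
            raise RuntimeError("Save needs a Git-enabled factory; publish instead")
        self.factory.repo.branches[self.branch] = dict(self.working)

    def publish(self) -> None:
        repo = self.factory.repo
        if repo is None:
            # Without Git, publishing is the only way to keep changes.
            self.factory.published = dict(self.working)
            return
        if self.branch != repo.collaboration_branch:
            raise RuntimeError("Publish is only allowed from the collaboration branch")
        saved = repo.branches.get(self.branch, {})
        # Stand-in for the ARM templates written to the publish branch.
        repo.branches[repo.publish_branch] = dict(saved)
        self.factory.published = dict(saved)


factory = Factory(published={"pl_load": "v1"},
                  repo=GitRepo(branches={"main": {"pl_load": "v2"}}))
session = UxSession(factory)
print(session.working)    # the Git version, not the published one
session.working["pl_load"] = "v3"
session.save()
print(factory.published)  # unchanged until publish() is called
```

In this model the session for a Git-enabled factory never reads `factory.published`, so what the factory is actually running is invisible from inside the session.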

Notice that you can no longer use the ADF UX to inspect the resources that have been published to this factory. The UX is loaded from Git, and because the repository is linked to the factory (rather than the UX) you can't see the published definitions without disconnecting the repo. This might be a problem if you're working with a single ADF instance, but if you've got this far you're probably also using multiple factories.

A common development workflow is to use multiple data factories to provide separate environments for production, testing and development.

In the development ADF instance, the ADF UX now acts more-or-less exclusively as an authoring tool:

  • you load working definitions from Git
  • you make changes on the pipeline authoring canvas
  • you debug them using the dev instance's factory machinery
  • you save your changes back to Git.

Even if you use the “Publish” option to save ARM templates to Git, you probably don't care about the resources published to the factory any more. You won't run them there, so there's nothing to monitor, and you don't need to be able to inspect them. I've greyed them out in the diagram above.

The resources you build in the development environment and save to Git are deployed to other factory instances for testing or production use. You can either:

  • deploy ARM templates, written to the Git repo's adf_publish branch by publishing from dev, or
  • deploy JSON resource definitions, written to the Git repo when saving from dev. These can be deployed using tools like PowerShell or a DevOps pipeline task.

Either way, you're deploying definitions stored in Git, using some external tooling to do so.
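As an illustration of the first option, here is a minimal sketch that assembles an Azure CLI command to deploy the published ARM template into another factory. It assumes the adf_publish branch has been checked out locally, that the template and its parameter file sit in a folder named after the development factory, and that the template exposes a `factoryName` parameter. Check those assumptions against your own repository. The function name and its arguments are hypothetical.

```python
from pathlib import Path
from typing import List


def build_deploy_command(repo_root: str, source_factory: str,
                         target_factory: str, resource_group: str) -> List[str]:
    """Build (but do not run) an `az deployment group create` command."""
    # Assumed layout: <repo_root>/<source_factory>/ARMTemplateForFactory.json
    template_dir = Path(repo_root) / source_factory
    return [
        "az", "deployment", "group", "create",
        "--resource-group", resource_group,
        "--mode", "Incremental",
        "--template-file", str(template_dir / "ARMTemplateForFactory.json"),
        # Parameter values saved alongside the template when publishing from dev.
        "--parameters", f"@{template_dir / 'ARMTemplateParametersForFactory.json'}",
        # Given last so that it overrides the factory name in the parameter file.
        "--parameters", f"factoryName={target_factory}",
    ]


command = build_deploy_command("adf_publish_checkout", "adf-dev",
                               "adf-test", "rg-data-test")
print(" ".join(command))
```

To run the deployment you could pass the list to `subprocess.run(command, check=True)` from a session that is already logged in to Azure. In practice other parameters, such as connection strings, usually need overriding per environment as well.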

The non-development ADF instances are not Git-enabled, by design – the whole point of the Git repo attached to your development environment is that it's the single authoritative repository of ADF resource definitions. This means that when you connect to one of the non-development instances using the ADF UX, you are once again able to see published pipeline definitions.

In the same way that the ADF UX is purely an IDE in development, in other factory instances it's now purely a management tool. You use it to monitor behaviour and performance, and while you can inspect published pipeline definitions with it, you should avoid editing them directly.2)

The ADF UX is a combined IDE and management tool which is always connected to an ADF instance. The combination of roles can be confusing, but in a multi-instance development workflow using Git they naturally become separated, depending on the type of instance you're connected to:

  • when connected to a development ADF instance, the ADF UX is pure IDE, providing a graphical authoring experience, debugging tools and integrated version control
  • when connected to other instances – production, test etc – the ADF UX is a service management tool, supporting activities for outcome verification, fault diagnosis and performance monitoring.

2)
You may wish to protect non-development environments from direct editing by restricting contributor-level access to them.