Why automate ADF pipeline testing?

Since writing my series on automated testing Azure Data Factory pipelines, I've had a few questions along the lines of “why bother?”. One reader commented that “a typical ADF developer tests their pipeline doing debug runs”. This is exactly how I develop pipelines: make a change, run the change, repeat until the pipeline does what I want. “Why bother with more testing?” is a good question!

One careful owner. Credit: @CheshireRCU

In the UK, most motor vehicles more than three years old are required to pass an annual MOT test, whose purpose is to ensure that a vehicle is safe to drive and fit to be on the road. The reason your car is tested annually is that things change – just because it was safe to drive 12 months ago doesn't mean it's safe now.

An MOT test certificate means simply that your car was safe to drive when it was tested – if you lose a wheel on the way home from the test centre it won't mean much. The same is true of ADF pipeline tests: a passed test only means that the pipeline was working, under the specific test scenario, at the moment when the test was run. When things change, the test's previous pass result may no longer mean anything.

ADF pipelines don't exist in isolation – they use a variety of resources inside and outside your data factory, including other pipelines. Even if a given pipeline changes infrequently, things around it are likely to change much more often. In this situation, asking “does it work now?” at development time probably won't be enough – you have to keep asking “does it still work?”. Testing like this is naturally repetitive, and repetitive tasks are time-consuming and error-prone.

As time goes on, you'll add more and more pipelines to your data factory. These new pipelines will need repeated testing too, and you still need to keep testing your existing pipelines. The number of tests to be run grows much more often than it shrinks. Repetitive tasks in high numbers are classic candidates for automation.

Test automation often appears in DevOps toolchains, used in support of rapid development and delivery cycles. Automated testing is essential for accelerating delivery because it enables you to make changes quickly – if something gets broken, your test suite should tell you.

In longer delivery cycles you might imagine that lower-frequency testing is sufficient, but I'd argue that the reverse is true. Longer delays between development and deployment increase the opportunity for conflicting changes to be introduced into your data factory. Automated testing can help protect you against this.

You might end a long delivery cycle with one big manual test of everything before deployment, but so long after the original development it can be hard to remember exactly how a pipeline should behave. Automated tests are a great way of documenting intent – what the pipeline is intended to achieve, at a level above the detail of its implementation – while also reducing the burden of pre-deployment testing.

Whenever I verify pipeline behaviour during development, or fix a bug, I want to make sure that the behaviour doesn't change, that the bug stays fixed. In a continually evolving data factory, the only way to assure this is to test, test and retest. A key motivator for me in test automation is repeatability – not just making sure that I can test something in exactly the same way, but ensuring that it actually gets done.

If you found this article useful, please share it!