Get Metadata recursively in Azure Data Factory


Azure Data Factory's Get Metadata activity returns metadata properties for a specified dataset. In the case of a blob storage or data lake folder, this can include a childItems array – the list of files and folders contained in the required folder. If you want all the files contained at any level of a nested folder subtree, Get Metadata won't help you – it doesn't support recursive tree traversal. In this post I try to build an alternative using just ADF.

Spoiler alert: The performance of the approach I describe here is terrible! I'm sharing this post because it was an interesting problem to try to solve, and it highlights a number of other ADF features 😊. I take a look at a better/actual solution to the problem in another blog post.

Here's a pipeline containing a single Get Metadata activity. The activity is using a blob storage dataset called “StorageMetadata” which requires a “FolderPath” parameter – I've provided the value /Path/To/Root. The path represents a folder in the dataset's blob storage container, and the “Child Items” argument in the field list asks Get Metadata to return a list of the files and folders it contains.

The folder at /Path/To/Root contains a collection of files and nested folders, but when I run the pipeline, the activity output shows only its direct contents – the folders Dir1 and Dir2, and file FileA.

The files and folders beneath Dir1 and Dir2 are not reported – Get Metadata did not descend into those subfolders.

Factoid #1: ADF's Get Metadata activity does not support recursive folder traversal. (OK, so you already knew that).

Here's an idea: follow the Get Metadata activity with a ForEach activity, and use that to iterate over the output childItems array. If an element has type “Folder”, use a nested Get Metadata activity to get the child folder's own childItems collection.

This suggestion has a few problems. First, it only descends one level down – you can see that my file tree has a total of three levels below /Path/To/Root, so I want to be able to step through the nested childItems and go down one more level. Iterating over nested child items is a problem, because:

Factoid #2: You can't nest ADF's ForEach activities.

I also want to be able to handle arbitrary tree depths – even if it were possible, hard-coding nested loops is not going to solve that problem. A workaround for nesting ForEach loops is to implement nesting in separate pipelines, but that's only half the problem – I want to see all the files in the subtree as a single output result, and I can't get anything back from a pipeline execution.

Factoid #3: ADF doesn't allow you to return results from pipeline executions.

In any case, for direct recursion I'd want the pipeline to call itself for subfolders of the current folder, but:

Factoid #4: You can't use ADF's Execute Pipeline activity to call its own containing pipeline.

You could maybe work around this too, but nested calls to the same pipeline feel risky. You don't want to end up with some runaway call stack that may only terminate when you crash into some hard resource limits 🙄.

An alternative to attempting a direct recursive traversal is to take an iterative approach, using a queue implemented in ADF as an Array variable. Here's the idea:

  • create a queue of one item – the root folder path – then start stepping through it
  • whenever a folder path is encountered in the queue, use a Get Metadata activity to get the folder's childItems and add them to the queue too
  • keep going until the end of the queue – i.e. when every file and folder in the tree has been “visited”.
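Before getting into ADF's quirks, the queue idea itself is easy to state. Here's a Python sketch of the traversal – a simulation, not ADF code – where the hypothetical list_folder() helper plays the role of Get Metadata's childItems output (local names plus a type of "File" or "Folder"), and for the moment the queue holds full folder paths:

```python
# Iterative traversal of a folder tree using a queue -- the same idea
# the pipeline implements. list_folder() is a stand-in for the Get
# Metadata activity: it returns a folder's direct children as
# {"name": ..., "type": ...} dicts.

def traverse(root, list_folder):
    queue = [root]                       # start with just the root folder path
    files = []
    while queue:
        folder = queue.pop(0)            # take the next folder to visit
        for child in list_folder(folder):
            path = folder + "/" + child["name"]
            if child["type"] == "Folder":
                queue.append(path)       # visit this subfolder later
            else:
                files.append(path)       # record the full file path
    return files
```

Running this against a small mock tree shaped like the example folder returns the full path of every file at every level – exactly the output the pipeline is after.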

Now I'll have to use the Until activity to iterate over the array – I can't use ForEach any more, because the array will change during the activity's lifetime.

Factoid #5: ADF's ForEach activity iterates over a JSON array copied to it at the start of its execution – you can't modify that array afterwards. Subsequent modification of an array variable doesn't change the array copied to ForEach.

There's another problem here. I can start with an array containing /Path/To/Root, but what I append to the array will be the Get Metadata activity's childItems – also an array. What I really need to do is join the arrays, which I can do using a Set variable activity and an ADF pipeline join expression. To make this a bit more fiddly:

Factoid #6: The Set variable activity doesn't support in-place variable updates.

So I can't set Queue = @join(Queue, childItems) – that isn't valid pipeline expression syntax anyway; I'm using pseudocode for readability. In fact, I can't even reference the queue variable in the expression that updates it. The workaround here is to save the changed queue in a different variable, then copy it into the queue variable using a second Set variable activity.
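In Python miniature, the two-step workaround looks like this (the + join is illustrative pseudocode standing in for whatever expression the pipeline uses to combine the arrays):

```python
# The Set variable workaround in miniature: the expression that computes
# the new queue can't reference Queue itself, so it's staged in _tmpQueue.

queue = ["/Path/To/Root"]
child_items = [{"name": "Dir1", "type": "Folder"},
               {"name": "Dir2", "type": "Folder"},
               {"name": "FileA", "type": "File"}]

# First Set variable activity: _tmpQueue = Queue joined with childItems
tmp_queue = queue + child_items

# Second Set variable activity: Queue = _tmpQueue
queue = tmp_queue
```

Note what the joined queue now contains – a bare string followed by childItems objects – which is exactly the inconsistency described next.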

childItems is an array of JSON objects, but /Path/To/Root is a string – as I've described it, the joined array's elements would be inconsistent:

[ /Path/To/Root, {"name":"Dir1","type":"Folder"}, {"name":"Dir2","type":"Folder"}, {"name":"FileA","type":"File"} ]

This is inconvenient, but easy to fix by creating a childItems-like object for /Path/To/Root. I've given the path object a type of “Path” so it's easy to recognise.

[ {"name":"/Path/To/Root","type":"Path"}, {"name":"Dir1","type":"Folder"}, {"name":"Dir2","type":"Folder"}, {"name":"FileA","type":"File"} ]

What's more serious is that the new “Folder” type elements don't contain full paths – just the local name of a subfolder.

Factoid #7: Get Metadata's childItems array includes file/folder local names, not full paths.

To get the child items of Dir1, I need to pass its full path to the Get Metadata activity. In this example the full path is

  • /Path/To/Root (stored in the head of the queue), then /, then
  • Dir1 – the local name of the child folder.

The path prefix won't always be at the head of the queue, but this array suggests the shape of a solution: make sure that the queue is always made up of Path Child Child Child … subsequences. Each Child is a direct child of the most recent Path element in the queue.

By using the Until activity I can step through the array one element at a time, processing each one like this:

  • If it's a full folder path
    • use Set variable to store the path in a variable
    • use Get Metadata to return its childItems
    • insert the child items into the queue, just behind their parent path – these are all children of the stored path.
  • If it's a file's local name, prepend the stored path and add the file path to an array of output files.
  • If it's a folder's local name, prepend the stored path and add the folder path to the end of the queue. (I can't just call Get Metadata again and insert more children, because that would break up the sequence of children belonging to the original path).
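Taken together, the three cases above can be sketched in Python. Again this is a simulation of the pipeline's logic rather than ADF code – process_queue() stands in for the Until loop, get_child_items() for the Get Metadata activity – but it shows how the Path/File/Folder cases cooperate to keep the queue in Path-then-children order:

```python
# A Python simulation of the Until loop. get_child_items() is a stand-in
# for Get Metadata: given a full folder path, it returns local names with
# a type of "File" or "Folder", like the childItems array.

def process_queue(root, get_child_items):
    queue = [{"name": root, "type": "Path"}]   # initialise with the root path
    file_paths = []                            # the output file list
    current_folder = None                      # the CurrentFolderPath variable
    while queue:
        item = queue.pop(0)                    # always process the head
        if item["type"] == "Path":
            current_folder = item["name"]      # store the full path...
            children = get_child_items(current_folder)
            queue = children + queue           # ...and insert its children
                                               # just behind it in the queue
        elif item["type"] == "File":
            # prepend the stored path to the file's local name
            file_paths.append(current_folder + "/" + item["name"])
        else:  # "Folder": requeue it at the back as a full "Path" element
            queue.append({"name": current_folder + "/" + item["name"],
                          "type": "Path"})
    return file_paths
```

Because a Path element's children are inserted directly behind it, they are all consumed while CurrentFolderPath still holds their parent's path; folders are sent to the back of the queue as fresh Path elements, so the Path-then-children shape is preserved.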

I can handle the three options (path/file/folder) using a Switch activity – which an Until activity can contain.

Factoid #8: ADF's iteration activities (Until and ForEach) can't be nested, but they can contain conditional activities (Switch and If Condition).

The revised pipeline uses four variables:

  • “Queue” is the queue
  • “CurrentFolderPath” stores the latest path encountered in the queue
  • “FilePaths” is an array to collect the output file list
  • “_tmpQueue” is a variable used to hold queue modifications before copying them back to the “Queue” variable.

The first Set variable activity takes the /Path/To/Root string and initialises the queue with a single object: {"name":"/Path/To/Root","type":"Path"}. (I've added a second Set variable activity just to do something with the output file array so I can get a look at it).

The Until activity uses a Switch activity to process the head of the queue, then moves on. You could use a variable to monitor the current item in the queue, but I'm removing the head instead (so the current item is always array element zero).

The Switch activity's “Path” case sets the new value of CurrentFolderPath, then retrieves its children using Get Metadata. Two Set variable activities are required again – one to insert the children in the queue, one to manage the queue variable switcheroo.

The other two switch cases are straightforward:

  • “Default” (for files) adds the file path to the output array using an Append variable activity
  • “Folder” creates a corresponding “Path” element and adds it to the back of the queue. Creating the element references the front of the queue, so it can't also set the queue variable – a second Set variable activity is needed again.

Here's the good news: the output of the “Inspect output” Set variable activity. (Don't be distracted by the variable name – the final activity copied the collected “FilePaths” array to “_tmpQueue”, just as a convenient way to get it into the output).

The result correctly contains the full paths to the four files in my nested folder tree. So it's possible to implement a recursive filesystem traversal natively in ADF, even without direct recursion or nestable iterators.

That's the end of the good news: to get there, this took 1 minute 41 secs and 62 pipeline activity runs! For four files. This is not the way to solve this problem 🤣.

A better way around it might be to take advantage of ADF's capability for external service interaction – perhaps by deploying an Azure Function that can do the traversal and return the results to ADF. But that's another post…

AzureGuy, 2020/11/05 19:30
This is exactly what I need, but without seeing the expressions of each activity it's extremely hard to follow and replicate. It would be helpful if you added in the steps and expressions for all the activities.
Hoop, 2020/11/26 15:57
This is something I've been struggling to get my head around thank you for posting. Do you have a template you can share?
Tom, 2021/02/17 06:02
We NEED to do this!

Why doesn't the MS "ETL Tool" do what is so simple in DataStage or Informatica or any ksh?
Richard, 2021/02/23 22:43
Thanks for the comments -- I now have another post about how to do this using an Azure Function, link at the top :) . Richard
Dipesh Shrestha, 2021/05/03 08:49
Hy, could you please provide me link to the pipeline or github of this particular pipeline. I wanted to know something how you did. I could understand by your code.
Thank You!
maremttu sivarajah, 2021/05/09 16:29
Hi I create the pipeline based on the your idea but one doubt how to manage the queue variable switcheroo.please give the expression
Anita Garg, 2021/06/08 07:46

I'm not able to understand it completely. Can anybody please help me with the complete pipeline.
Abdul Mannan Mughal, 2021/06/10 12:49

Can you please create a template for this solution as it would be very helpful. Currently one cannot understand what is happening in the pipeline.
Mark, 2021/09/23 22:43
Thanks for the explanation, could you share the json for the template?
Ed Z, 2021/09/29 16:40
I was thinking about Azure Function (C#) that would return json response with list of files with full path.
Ed Z, 2021/09/30 04:37
Another nice way is using REST API: However it has limit up to 5000 entries.
Liyasker J, 2022/02/10 10:55
Hi, This is very complex i agreed but the step what u have provided is not having transparency, so if u go step by step instruction with configuration of each activity it will be really helpful
Achates, 2022/04/19 03:58
Great idea! I followed the same and successfully got all files. I even can use the similar way to read manifest file of CDM to get list of entities, although a bit more complex. Now the only thing not good is the performance. In my case, it ran overall more than 800 activities, and it took more than half hour for a list with 108 entities.
Lee, 2022/04/26 10:22
As requested for more than a year: This needs more information!!!
gfna, 2022/05/31 10:27
i am extremely happy i stumbled upon this blog, because i was about to do something similar as a POC but now i dont have to since it is pretty much insane :D
Joe, 2022/07/25 12:49
Hi, Please could this post be updated with more detail? I'm having trouble replicating this.
Mye, 2022/10/08 14:39
It is difficult to follow and implement those steps. It would be great if you share template or any video for this to implement in ADF.
karthik, 2022/10/18 03:58
I want to load the csv source data which is different folder structure and each folder like table, for example account,property ..etc each folder having multiple files, I have load the csv data to on premise SQL DB, so that i needs to load data dynamically, if future new folder coming, i have load the new tables, so that how do we handle this scenario. kindly help me.