Get Metadata recursively in Azure Data Factory
Updated
23-Feb-2021
Azure Data Factory's Get Metadata activity returns metadata properties for a specified dataset. In the case of a blob storage or data lake folder, this can include a childItems array – the list of files and folders contained in the required folder. If you want all the files contained at any level of a nested folder subtree, Get Metadata won't help you – it doesn't support recursive tree traversal. In this post I try to build an alternative using just ADF.
Spoiler alert: The performance of the approach I describe here is terrible! I'm sharing this post because it was an interesting problem to try to solve, and it highlights a number of other ADF features 😊. I take a look at a better/actual solution to the problem in another blog post.
The problem
Here's a pipeline containing a single Get Metadata activity. The activity is using a blob storage dataset called “StorageMetadata” which requires a “FolderPath” parameter – I've provided the value /Path/To/Root. The path represents a folder in the dataset's blob storage container, and the “Child Items” argument in the field list asks Get Metadata to return a list of the files and folders it contains.
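For reference, here's a sketch of what that activity looks like in the pipeline's JSON definition – the activity name is mine, but “StorageMetadata” and “FolderPath” are the dataset and parameter described above:

{ "name": "Get Folder Metadata",
  "type": "GetMetadata",
  "typeProperties": {
    "dataset": {
      "referenceName": "StorageMetadata",
      "type": "DatasetReference",
      "parameters": { "FolderPath": "/Path/To/Root" }
    },
    "fieldList": [ "childItems" ]
  }
}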
The folder at /Path/To/Root contains a collection of files and nested folders, but when I run the pipeline, the activity output shows only its direct contents – the folders Dir1 and Dir2, and file FileA. The files and folders beneath Dir1 and Dir2 are not reported – Get Metadata did not descend into those subfolders.
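The raw activity output contains the childItems array directly – each element carries just a name and a type, along these lines:

{ "childItems":
  [ {"name":"Dir1","type":"Folder"}
  , {"name":"Dir2","type":"Folder"}
  , {"name":"FileA","type":"File"}
  ]
}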
Factoid #1: ADF's Get Metadata activity does not support recursive folder traversal. This is a limitation of the activity. (OK, so you already knew that).
Roll-your-own recursive traversal
Here's an idea: follow the Get Metadata activity with a ForEach activity, and use that to iterate over the output childItems array. If an element has type “Folder”, use a nested Get Metadata activity to get the child folder's own childItems collection.
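As a sketch, the ForEach could iterate over @activity('Get Folder Metadata').output.childItems, with an If Condition picking out folders and a nested Get Metadata activity parameterised with the child's path. (The activity name and the pipeline parameter “FolderPath” are assumptions of mine here.)

If Condition expression:         @equals(item().type, 'Folder')
Nested Get Metadata FolderPath:  @concat(pipeline().parameters.FolderPath, '/', item().name)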
This suggestion has a few problems. First, it only descends one level down – you can see that my file tree has a total of three levels below /Path/To/Root, so I want to be able to step through the nested childItems and go down one more level. Iterating over nested child items is a problem, because:
Factoid #2: You can't nest ADF's ForEach activities.
I also want to be able to handle arbitrary tree depths – even if it were possible, hard-coding nested loops is not going to solve that problem. A workaround for nesting ForEach loops is to implement nesting in separate pipelines, but that's only half the problem – I want to see all the files in the subtree as a single output result, and I can't get anything back from a pipeline execution.
Factoid #3: ADF doesn't allow you to return results from pipeline executions.
In any case, for direct recursion I'd want the pipeline to call itself for subfolders of the current folder, but:
Factoid #4: You can't use ADF's Execute Pipeline activity to call its own containing pipeline.
You could maybe work around this too, but nested calls to the same pipeline feel risky. You don't want to end up with some runaway call stack that may only terminate when you crash into some hard resource limits 🙄.
Roll-your-own iterative traversal
An alternative to attempting a direct recursive traversal is to take an iterative approach, using a queue implemented in ADF as an Array variable. Here's the idea:
- create a queue of one item – the root folder path – then start stepping through it
- whenever a folder path is encountered in the queue, use a Get Metadata activity to get the folder's childItems and add them to the queue too
- keep going until the end of the queue – i.e. when every file and folder in the tree has been “visited”.
Now I'll have to use the Until activity to iterate over the array – I can't use ForEach any more, because the array will change during the activity's lifetime.
Factoid #5: ADF's ForEach activity iterates over a JSON array copied to it at the start of its execution – you can't modify that array afterwards. Subsequent modification of an array variable doesn't change the array copied to ForEach.
There's another problem here. I can start with an array containing /Path/To/Root, but what I append to the array will be the Get Metadata activity's childItems – also an array. What I really need to do is join the arrays, which I can do using a Set variable activity and an ADF pipeline join expression. To make this a bit more fiddly:
Factoid #6: The Set variable activity doesn't support in-place variable updates.
So I can't set Queue = @join(Queue, childItems). In fact, I can't even reference the queue variable in the expression that updates it. The workaround here is to save the changed queue in a different variable, then copy it into the queue variable using a second Set variable activity.
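Something along these lines is needed – two Set variable activities, building the combined array in a temporary variable first. (I'm using the variable names that appear later in the pipeline; union() is one ADF expression function that combines arrays, though note it also removes duplicates, and the Get Metadata activity name is illustrative.)

Set _tmpQueue:  @union(variables('Queue'), activity('Get Folder Metadata').output.childItems)
Set Queue:      @variables('_tmpQueue')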
childItems is an array of JSON objects, but /Path/To/Root is a string – as I've described it, the joined array's elements would be inconsistent:
[ /Path/To/Root
, {"name":"Dir1","type":"Folder"}
, {"name":"Dir2","type":"Folder"}
, {"name":"FileA","type":"File"}
]
This is inconvenient, but easy to fix by creating a childItems-like object for /Path/To/Root. I've given the path object a type of “Path” so it's easy to recognise.
[ {"name":"/Path/To/Root","type":"Path"}
, {"name":"Dir1","type":"Folder"}
, {"name":"Dir2","type":"Folder"}
, {"name":"FileA","type":"File"}
]
What's more serious is that the new “Folder” type elements don't contain full paths – just the local name of a subfolder.
Factoid #7: Get Metadata's childItems array includes file/folder local names, not full paths.
To get the child items of Dir1, I need to pass its full path to the Get Metadata activity. In this example the full path is /Path/To/Root (stored in the head of the queue), then /, then Dir1 – the local name of the child folder.
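As a sketch, with the prefix stored in a variable (the pipeline below calls it “CurrentFolderPath”) and the child's local name sitting at the head of the queue, the full path is just a concat expression:

@concat(variables('CurrentFolderPath'), '/', variables('Queue')[0].name)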
The path prefix won't always be at the head of the queue, but this array suggests the shape of a solution: make sure that the queue is always made up of Path → Child → Child → Child … subsequences. Each Child is a direct child of the most recent Path element in the queue.
The solution
By using the Until activity I can step through the array one element at a time, processing each one like this:
- If it's a full folder path:
  - use Set variable to store the path in a variable
  - use Get Metadata to return its childItems
  - insert the child items into the queue, just behind their parent path – these are all children of the stored path.
- If it's a file's local name, prepend the stored path and add the file path to an array of output files.
- If it's a folder's local name, prepend the stored path and add the folder path to the end of the queue. (I can't just call Get Metadata again and insert more children, because that would break up the sequence of children belonging to the original path).
I can handle the three options (path/file/folder) using a Switch activity – which an Until activity can contain.
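A sketch of the Switch's “on” expression – it inspects the type of the queue's head element, with “Path” and “Folder” as explicit cases and the Default case catching files:

@variables('Queue')[0].type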
Factoid #8: ADF's iteration activities (Until and ForEach) can't be nested, but they can contain conditional activities (Switch and If Condition).
The revised pipeline uses four variables:
- “Queue” is the queue
- “CurrentFolderPath” stores the latest path encountered in the queue
- “FilePaths” is an array to collect the output file list
- “_tmpQueue” is a variable used to hold queue modifications before copying them back to the “Queue” variable.
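In the pipeline's JSON definition the variable declarations look something like this sketch:

"variables": {
  "Queue":             { "type": "Array" },
  "CurrentFolderPath": { "type": "String" },
  "FilePaths":         { "type": "Array" },
  "_tmpQueue":         { "type": "Array" }
}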
The first Set variable activity takes the /Path/To/Root string and initialises the queue with a single object: {"name":"/Path/To/Root","type":"Path"}. (I've added the other one just to do something with the output file array so I can get a look at it).
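Assuming the root path is supplied as a pipeline parameter – I'm calling it “FolderPath” here – the initialisation expression could be a sketch like:

@json(concat('[{"name":"', pipeline().parameters.FolderPath, '","type":"Path"}]'))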
The Until activity uses a Switch activity to process the head of the queue, then moves on. You could use a variable to monitor the current item in the queue, but I'm removing the head instead (so the current item is always array element zero).
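A sketch of the Until activity's termination expression, and of the head removal – which, thanks to Factoid #6, also needs the two-variable switcheroo:

Until expression:  @empty(variables('Queue'))
Set _tmpQueue:     @skip(variables('Queue'), 1)
Set Queue:         @variables('_tmpQueue')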
The Switch activity's “Path” case sets the new CurrentFolderPath value, then retrieves its children using Get Metadata. Two Set variable activities are required again – one to insert the children in the queue, one to manage the queue variable switcheroo.
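A sketch of those two activities – the children are spliced in immediately behind the parent path at the head of the queue. (The Get Metadata activity name is mine, and union() removes duplicate elements, so treat this as an approximation rather than a drop-in expression.)

Set _tmpQueue:  @union(take(variables('Queue'), 1), activity('Get Folder Metadata').output.childItems, skip(variables('Queue'), 1))
Set Queue:      @variables('_tmpQueue')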
The other two switch cases are straightforward:
- “Default” (for files) adds the file path to the output array using an Append variable activity
- “Folder” creates a corresponding “Path” element and adds it to the back of the queue. Creating the element references the front of the queue, so the same expression can't also set the queue variable – a second Set variable is needed again (sketched below).
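A sketch of the “Folder” case – the first Set variable activity builds the new “Path” element from the stored prefix plus the head of the queue and appends it, then the second copies the result back (the same union() duplicate-removal caveat applies):

Set _tmpQueue:  @union(variables('Queue'), json(concat('[{"name":"', variables('CurrentFolderPath'), '/', variables('Queue')[0].name, '","type":"Path"}]')))
Set Queue:      @variables('_tmpQueue')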
Results
Here's the good news: the output of the “Inspect output” Set variable activity. (Don't be distracted by the variable name – the final activity copied the collected “FilePaths” array to “_tmpQueue”, just as a convenient way to get it into the output).
The result correctly contains the full paths to the four files in my nested folder tree. So it's possible to implement a recursive filesystem traversal natively in ADF, even without direct recursion or nestable iterators.
That's the end of the good news: to get there, this took 1 minute 41 secs and 62 pipeline activity runs! For four files. This is not the way to solve this problem 🤣.
A better way around it might be to take advantage of ADF's capability for external service interaction – perhaps by deploying an Azure Function that can do the traversal and return the results to ADF. But that's another post…
If you found this article useful or interesting, please share it – and thanks for reading!
Discussion
Why doesn't the MS "ETL Tool" do what is so simple in DataStage or Informatica or any ksh?
Thank You!
I'm not able to understand it completely. Can anybody please help me with the complete pipeline.
Can you please create a template for this solution as it would be very helpful. Currently one cannot understand what is happening in the pipeline.
I want to load CSV source data that is organised in a folder structure where each folder represents a table – for example account, property, etc. – and each folder contains multiple files. I need to load the CSV data into an on-premises SQL DB dynamically, so that if a new folder appears in future its table gets loaded too. How do we handle this scenario? Kindly help me.