More Get Metadata in ADF
Updated 23-Feb-2021
Last year I wrote a post on getting metadata recursively in Azure Data Factory. This is something that comes up every so often, because ADF's own Get Metadata activity doesn't do it – you can list the immediate contents of a folder, but that's it. The post presented a pure-ADF approach using array variables to manage a queue of entries, and its performance was terrible!
Since then, a few people have asked me to share the implementation, but I've been reluctant to do so because I don't want to be held responsible for this atrocity getting into circulation 😉. So… In this post I'll talk about a better approach to solving the problem, using an Azure Function. If at the end you still want to see the full horror of the pipeline implementation, you can download it along with the function app source code.
Why an Azure Function?
An Azure Function is a nice way to implement recursive traversal, because support for doing so is baked into the .NET libraries for Azure Storage. What's more, you can call a function from ADF using the Azure Function activity, so it's a convenient abstraction inside a pipeline. A JSON payload returned from an Azure Function can be used directly in a pipeline, just like the output of other ADF activities (including Get Metadata).
The function
Here's my function code (this isn't absolutely everything, but you can find the full definition in the source code for this post).
    public static class GetMetadata
    {
        [FunctionName("GetMetadata")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "post", Route = null)] HttpRequest req,
            ILogger log)
        {
            string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
            dynamic data = JsonConvert.DeserializeObject(requestBody);
            string storageAccount = data.storageAccount;
            string container = data.container;
            string folderPath = data.folderPath;  // with no leading slash

            var uri = $"https://{storageAccount}.blob.core.windows.net/";
            var serviceClient = new BlobServiceClient(new Uri(uri), new DefaultAzureCredential());
            var containerClient = serviceClient.GetBlobContainerClient(container);

            var files = new List<BlobFilePath>();
            await GetFiles(containerClient, folderPath, files);

            return new OkObjectResult(new { childItems = files });  // assumed shape; see full source
        }

        private static async Task GetFiles(BlobContainerClient client, string path, List<BlobFilePath> files)
        {
            var pages = client.GetBlobsByHierarchyAsync(prefix: path, delimiter: "/").AsPages(default);
            await foreach (var page in pages)
            {
                foreach (var item in page.Values)
                {
                    if (item.IsPrefix)
                        await GetFiles(client, item.Prefix, files);
                    else
                        files.Add(new BlobFilePath(item.Blob.Name));  // assumed constructor; see full source
                }
            }
        }
    }
The entry point for the function is the Run method on line 4. Lines 8-12 read some required information out of the web request JSON – the function needs a JSON request body that looks like this:
    {
      "storageAccount": "myStorageAccountName",
      "container": "myContainerName",
      "folderPath": "Path/To/Root"
    }
Here I'm requesting the contents of the Path/To/Root folder of the myContainerName container, in the storage account called myStorageAccountName.
The folderPath has no leading slash path separator – if you add one, you'll get no files back!
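If you can't rely on callers to leave the slash off, one option is to tidy the path before using it as a blob prefix. This is only a sketch: FolderPaths and NormalisePrefix are names I've made up for illustration, and they aren't part of the function app in the download.

```csharp
using System;

public static class FolderPaths
{
    // Hypothetical helper (not in the original source): removes surrounding
    // whitespace and any leading slashes so the value can be used as a blob prefix.
    public static string NormalisePrefix(string folderPath)
    {
        if (string.IsNullOrWhiteSpace(folderPath))
            return string.Empty;   // nothing supplied, so no prefix at all

        return folderPath.Trim().TrimStart('/');
    }
}
```

In Run, that would mean passing FolderPaths.NormalisePrefix(folderPath) to GetFiles rather than the raw value from the request.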
Authorisation
Lines 14-16 are about connecting to the storage account. I'm using a DefaultAzureCredential, which supports a number of different ways to authorise a connection. The easiest approach is probably to enable the function app's system-assigned identity, then to make that identity a member of the storage account's Storage Blob Data Reader role.
Alternatively – still using the DefaultAzureCredential – you could create an Azure AD app registration with a client secret, and configure the client ID and secret values in the function app's application settings as AZURE_CLIENT_ID and AZURE_CLIENT_SECRET respectively (the credential also needs the tenant ID, in AZURE_TENANT_ID). You could also choose to connect using the storage account's connection string, but you'll want to involve a key vault to do that securely and will need to write more code.
Recursion
Once connected, line 19 of the Run method calls the private GetFiles method to make the “get metadata” call. Line 32 of GetFiles contains the recursive call to get files from a child folder – this is where C# is doing what ADF won't 😀. Line 34 collects up a found file entry and adds it to a list of BlobFilePath objects – this is just for convenience, so that the results serialise easily to some nice JSON for return.
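It may help to see why recursion is needed at all. In a standard blob storage account (one without a hierarchical namespace), "folders" are just prefixes of blob names, and a listing with a delimiter returns one level at a time. The sketch below mimics that behaviour over an in-memory list of names, so you can follow the shape of the recursion without a storage account. HierarchyDemo, ListLevel and the sample names in the usage note are made up for illustration; the real function uses GetBlobsByHierarchyAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HierarchyDemo
{
    // Mimics one level of a delimiter-based listing: yields blobs found directly
    // under 'prefix', plus each distinct virtual folder (reported once).
    public static IEnumerable<(string Name, bool IsPrefix)> ListLevel(
        IEnumerable<string> blobNames, string prefix, char delimiter = '/')
    {
        var seen = new HashSet<string>();
        foreach (var name in blobNames.Where(n => n.StartsWith(prefix, StringComparison.Ordinal)))
        {
            int cut = name.IndexOf(delimiter, prefix.Length);
            if (cut < 0)
            {
                yield return (name, false);              // a blob at this level
            }
            else
            {
                var folder = name.Substring(0, cut + 1); // e.g. "Root/Sub/"
                if (seen.Add(folder))
                    yield return (folder, true);         // a virtual folder
            }
        }
    }

    // Same shape as the function's GetFiles: recurse into prefixes, collect blobs.
    public static void GetFiles(IEnumerable<string> blobNames, string path, List<string> files)
    {
        foreach (var item in ListLevel(blobNames, path))
        {
            if (item.IsPrefix)
                GetFiles(blobNames, item.Name, files);
            else
                files.Add(item.Name);
        }
    }
}
```

Calling HierarchyDemo.GetFiles with the names Root/a.txt, Root/Sub/b.txt and Other/d.txt and the path Root collects the first two and ignores the third. A path of /Root matches nothing, because no blob name starts with a slash.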
Try it out
The “GetMetadata.sln” Visual Studio solution in the source code for this post contains the full function definition. Here's how to try it out:
- Create a Function App resource in the Azure portal. Choose the .NET runtime stack and create it in the same region as your ADF instance.
- In the Function App's portal blade, use the Identity page to switch its System assigned identity On.
- In your storage account's portal blade, use the Access Control (IAM) blade to assign the function app's identity to the “Storage Blob Data Reader” role. The app's identity has the same name as the function app itself. It sometimes takes a few minutes for permission changes to take effect.
- Download “GetMetadata.sln” from my GitHub repo, build the solution, then publish the “GetMetadata” project to your function app.
- In the Azure Data Factory UX, create an Azure Function linked service to connect to your function app. You'll need to provide a function key – you can find function host keys on the App keys page of the function app's Azure portal blade.
Now you're all set to Get Metadata! The screenshot shows my ADF Azure Function activity setup – I've overlaid the contents of the Body expression so you can see that I'm using the same JSON I described earlier, but with a pipeline parameter for the folder path.
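Before running the pipeline, it can be useful to call the function directly to check that the key and the role assignment work. Here's a minimal sketch that builds the request. FunctionRequest and Build are hypothetical names, and I'm assuming the app is reachable at the default <name>.azurewebsites.net hostname with the default api route prefix; check the function's URL in the portal if yours differs.

```csharp
using System;
using System.Net.Http;
using System.Text;

public static class FunctionRequest
{
    // Builds a POST to the GetMetadata function, passing the key in the
    // x-functions-key header and the JSON payload as the request body.
    public static HttpRequestMessage Build(string functionAppName, string functionKey, string jsonBody)
    {
        var request = new HttpRequestMessage(
            HttpMethod.Post,
            $"https://{functionAppName}.azurewebsites.net/api/GetMetadata"); // assumed default hostname and route

        request.Headers.Add("x-functions-key", functionKey);
        request.Content = new StringContent(jsonBody, Encoding.UTF8, "application/json");
        return request;
    }
}
```

Sending it is then a matter of awaiting new HttpClient().SendAsync(request) and reading the response body, which should contain the childItems array.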
Results
Here's the output from running the Azure Function activity, set up as above. I'm using the same test file hierarchy as last time; it's in the download too.
For consistency, I've made the function output resemble ADF's own Get Metadata – the output is a JSON array called childItems, and each element has a name and a type. The two main differences are:
- each element has a folderName, identifying the file's folder
- the executionDuration was 7 seconds, which is a significant improvement over the ADF-only solution from my earlier post.
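If you want to consume the function's response outside ADF, in a test for example, the shape described above maps onto a couple of small classes. This is a sketch under assumptions: the class names are mine, and I'm assuming type is a string, as it is in ADF's own Get Metadata output.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical types mirroring the response shape described in the post.
public class ChildItem
{
    public string Name { get; set; }
    public string Type { get; set; }
    public string FolderName { get; set; }
}

public class MetadataResponse
{
    public List<ChildItem> ChildItems { get; set; }
}

public static class MetadataParser
{
    // Case-insensitive matching lets "childItems" and "folderName" in the JSON
    // bind to the PascalCase properties above.
    public static MetadataResponse Parse(string json)
    {
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        return JsonSerializer.Deserialize<MetadataResponse>(json, options);
    }
}
```

MetadataParser.Parse(responseBody).ChildItems then gives you one entry per file, each carrying its folder.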
If you want to try this out, you can find the VS solution, the sample folder tree and – I'm sorry 😉 – my original pure ADF solution on GitHub.
Thanks for reading. If you found this post interesting or useful, please share it!