Better version control for Power BI datasets

This is Part 2 of a series about creating a professional developer experience for working with Power BI. If you haven't already seen the first, you may prefer to start there.

In the previous post, I outlined a possible workflow for Power BI development, and implemented an Azure DevOps pipeline to show how steps in such a workflow could be automated. To build the pipeline I stored an entire .pbix report file – data and all – in version control, which is a problem for at least two reasons:

  • storing large report files in a version control system won't scale well
  • datasets may contain confidential or sensitive data which must be kept out of version control.

In this post I'll look at separating a report's dataset from its visuals, version controlling the standalone dataset (without data), and deploying the dataset automatically to Power BI.

Want a preview of where we're going? Take a look at the video at the end 😊.

A Power BI dataset is an example of a tabular model. A tabular model is an analytical data store – a kind of database technology different from relational databases such as SQL Server. Tabular models are optimised for analytic query workloads, and tabular model technology underpins a number of Microsoft analytics services, including Power BI, Azure Analysis Services and SQL Server Analysis Services.

You may be used to developing Power BI reports which import data from external sources, for example Excel files, text files, SQL Server or Dataverse. Even in these cases, the Power BI report stores imported data in an internal tabular model database.

Separating a Power BI report from its tabular model has a number of advantages, not just limited to version control:

  • A Power BI report which uses a separate dataset contains no data at all. This is sometimes called a thin report because it contains visual features only, often resulting in a very small .pbix file.

  • A standalone tabular model definition can be stored in a text-based format. This also contains no data, and has additional benefits for version control.

  • A standalone tabular model can be shared by multiple reports. Sharing datasets helps you ensure that measures and other business concepts defined in the model can be reproduced consistently in every report, and has a bunch of other advantages. For this reason they are sometimes referred to as golden datasets.

Tabular Model Scripting Language (TMSL, pronounced “TIM-zel”) is the most common format for persisting tabular models as text. TMSL files have the extension .bim – this is the top of the .bim file representing the model I used for the sample report in my previous post:

TMSL is implemented using JSON, which means that it's (sort of) human-readable – as we shall see later, this makes it possible to compare different versions of a model and to understand differences using the text alone.

Working with datasets in Power BI Desktop means that the tabular model definition continues to be stored in a binary PBIX file (not text) – this prevents you from taking advantage of many of Git's version control features. I obtained the .bim file shown above by extracting it from the sample report's PBIX file using pbi-tools, but this isn't a convenient way to manage dataset definitions on an ongoing basis – the workflow is bumpy, so the developer experience is poor.
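For reference, the extraction itself is a one-liner – something like this, where the report file name is a placeholder and the exact output layout depends on your pbi-tools version and its model serialisation settings:

# Extract the contents of a .pbix file (report pages, tabular model definition and other
# metadata) into a folder of text files. The file name is a placeholder – use your own report.
pbi-tools extract '.\AdventureWorksLT.pbix'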

You can create and edit TMSL tabular models in a number of ways, including Visual Studio (with the Microsoft Analysis Services Projects extension) and Tabular Editor, which comes in free (v2) and commercial (v3) editions.

If you want to try before you buy, Tabular Editor v2 (TE2) is free and can be downloaded by choosing “Tabular Editor 2.x (free)” from the dropdown at the right of the download button.

Use the portable version (usually a download called TabularEditor.Portable.zip) if you'd prefer to test the software without installing it.

This is my model's .bim file opened in Tabular Editor 2.17.3:

Once I've created a standalone dataset and published it to the Power BI service, I can use it in any Power BI report:

  • When creating a new Power BI report, I choose “Power BI datasets” from Power BI Desktop's “Get data” dropdown. I can then select a published dataset from the available list.

  • To bind my original sample report to a newly-standalone dataset, I edit the report in Power BI Desktop, then delete all objects from its model view. After model objects are deleted, I can bind the report to a published dataset as above, via the “Get data” dropdown.

When a report is bound to a standalone dataset, notice that the Data view icon disappears from the Power BI Desktop sidebar. This is because the report no longer contains any data itself – it is using the standalone dataset instead.

I talk more about binding reports to datasets later in the series.
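As a preview, rebinding doesn't have to happen in Power BI Desktop at all: the Power BI REST API exposes a report Rebind endpoint. Here's a minimal sketch using the Power BI PowerShell cmdlets – the three IDs are placeholders, and you'd need to sign in first (for example with Connect-PowerBIServiceAccount):

# Rebind an existing (thin) report to a different published dataset via the Power BI REST API.
# The three IDs below are placeholders – substitute your own workspace, report and dataset IDs.
$workspaceId = '<workspace guid>'
$reportId    = '<report guid>'
$datasetId   = '<dataset guid>'

# POST /groups/{workspaceId}/reports/{reportId}/Rebind, with the target dataset in the body
$body = @{ datasetId = $datasetId } | ConvertTo-Json -Compress
Invoke-PowerBIRestMethod -Url "groups/$workspaceId/reports/$reportId/Rebind" -Method Post -Body $body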

Using a tool like Visual Studio or Tabular Editor, I can make changes to a standalone tabular model, saving them back to the model's .bim file as I go. To bring the model under version control, I add the file to my Git repository. At this point, other advantages of maintaining the dataset as a .bim file start to appear 😊.

Because a model .bim file is text (TMSL) rather than binary (PBIX), it becomes possible to make meaningful comparisons between different versions. The screenshot below shows a GitHub compare between versions of a PBIX file:

As the commit message suggests, the latest commit adds a measure description to the PBIX file. It isn't possible to verify the change, because the PBIX file format is binary.

In contrast, this screenshot shows a comparison of the same change, this time made to a TMSL file using Tabular Editor:

In this case, the exact nature of the modification is visible, confirming not only that a description was added, but also the text of the new value.

When changes are pushed to a Git service such as GitHub or Azure Repos, those changes will be automatically merged into the target file(s) if Git can work out how to do so. In the case of a text file – where a few lines have been updated or inserted – this is usually possible. In the case of a binary file such as a PBIX, Git can never perform an automatic merge.

The ability to merge different text changes into different parts of the same .bim file removes a longstanding bottleneck in Power BI dataset development: developers can work on the same model, independently, at the same time. This requires less developer coordination, less waiting around, and provides an overall better developer experience.

However, even with a text-based model definition, merge conflicts can occur when Git receives multiple changes from different developers made to the same part of the .bim file. There are two reasons TMSL merge conflicts occur:

  1. Developers don't talk to one another, and try to make changes to the same object in a model (e.g. the same measure). This is a failure of developer communication – a process problem!

  2. When a modified model is serialised (saved back to the .bim file) by Tabular Editor or Visual Studio, the order in which the model's objects are stored is changed. This creates a large number of unmergeable file changes, even if the actual change to the model is very small.

Avoiding merge conflicts of the second type is critical to a smooth, parallel development workflow. The good news is that a number of solutions to the problem have been proposed 😃.

Save-to-Folder in Tabular Editor

Specifically with the aim of enabling parallel development, you can configure Tabular Editor to “save to folder”. Save-to-folder breaks the model into separate files, defining each object in a separate file. This screenshot of File Explorer shows the result of saving the example tabular model using the save-to-folder option:

Version controlling this file structure sidesteps the reordering problem, avoiding many merge conflicts, but the model must be converted back into TMSL (a single .bim file) before it can be deployed. From a workflow perspective this is less appealing.
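The conversion can be scripted using Tabular Editor 2's command line. Treat the following as a sketch rather than a recipe – the install path and folder names are assumptions, and you should check the -B/-BUILD switch against the TE2 command-line documentation for your version:

# Reassemble a save-to-folder model structure into a single Model.bim ready for deployment.
# Paths are assumptions: TE2's default install location, a .\Model source folder and a build output folder.
$tabularEditor = 'C:\Program Files (x86)\Tabular Editor\TabularEditor.exe'
& $tabularEditor '.\Model' -B '.\build\Model.bim'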

Tabular Model Definition Language (TMDL)

The public preview of Tabular Model Definition Language (TMDL, pronounced “TIM-del”) was announced by Microsoft in April 2023. TMDL provides an alternative serialisation language for tabular models, with a number of advantages:

  • Like Tabular Editor's “save to folder”, model objects are saved in separate files, grouped into folders.
  • Objects are no longer persisted in JSON, but with a YAML-like syntax which is easier to read.

This is the TMSL representation of my model's “Amount by Due Date” measure:

{
	"name": "Amount by Due Date",
	"description": "Sales amount broken down by payment due date.",
	"expression": [
		"CALCULATE(",
		"  SUM(Sales[Amount])",
		", USERELATIONSHIP(",
		"    Sales[Due Date]",
		"  ,'Date'[Date]",
		"  )",
		")"
	],
	"annotations": [
		{
			"name": "PBI_FormatHint",
			"value": "{\"isGeneralNumber\":true}"
		}
	]
}

Notice in the annotations array that quotes must be escaped, and that multiline expressions are stored as arrays of JSON strings (which cannot be copied directly into a DAX expression editor).
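If you ever need to recover the DAX from the TMSL, joining the array back together is simple enough. Here's a quick sketch, assuming a Model.bim in the current folder and the usual model → tables → measures structure:

# Pull a single measure's DAX expression out of a TMSL (.bim) file.
# Assumes the standard structure: { "model": { "tables": [ { "measures": [ ... ] } ] } }
$model = Get-Content '.\Model.bim' -Raw -Encoding UTF8 | ConvertFrom-Json

$measure = $model.model.tables.measures | Where-Object { $_.name -eq 'Amount by Due Date' }

# A multiline expression is stored as an array of strings – join it back into copy-pasteable DAX
$measure.expression -join "`r`n"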

In contrast, the TMDL file representing this object would look something like this:

/// Sales amount broken down by payment due date.
measure 'Amount by Due Date' = 
    CALCULATE(
      SUM(Sales[Amount])
    , USERELATIONSHIP(
        Sales[Due Date]
      , 'Date'[Date]
      )
    )
 
    annotation PBI_FormatHint = {"isGeneralNumber":true}

Microsoft support is currently limited to provision of a serialisation/deserialisation API, but the intention is that TMDL will gradually be integrated into Microsoft tools. In the meantime, API support has already allowed third-party tools such as pbi-tools and Tabular Editor to begin adding support for the language.

Tabular Editor 2.18.0 introduced TMDL support for TE2, but integration into TE3 will be delayed until the TMDL API is more stable and complete. For the same reason, I won't be using TMDL in this post, but I'll demonstrate integrating it into this workflow in a later post.

Enforce object order on commit

A direct approach to eliminating merge conflicts caused by model serialisation is to ensure that the objects in a model's .bim file always appear in the same order, before the file is committed to Git. This can be achieved using a Git pre-commit hook – I will be using this approach here.

A detailed description is available in a separate article, and the GitHub repo accompanying this post (link at the end) includes code to support hook implementation.
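To give a flavour of the approach, the hook boils down to re-serialising Model.bim with a stable object order before each commit. The sketch below is illustrative only – the file path, sort keys and formatting are assumptions, and the repo contains the implementation I actually use:

# Illustrative pre-commit normalisation: rewrite Model.bim with objects in a stable (alphabetical)
# order, so that re-serialisation by different tools doesn't produce spurious, unmergeable diffs.
$bimPath = '.\Model.bim'
$model = Get-Content $bimPath -Raw -Encoding UTF8 | ConvertFrom-Json

foreach ($table in $model.model.tables) {
    # Sort each table's columns and measures by name
    if ($table.PSObject.Properties['columns'])  { $table.columns  = @($table.columns  | Sort-Object name) }
    if ($table.PSObject.Properties['measures']) { $table.measures = @($table.measures | Sort-Object name) }
}

# Sort the tables themselves, then write the normalised model back to disk
$model.model.tables = @($model.model.tables | Sort-Object name)
ConvertTo-Json $model -Depth 100 | Set-Content $bimPath -Encoding UTF8

# The Git pre-commit hook runs this script and then re-stages the file:
#   git add Model.bim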

Still with me after the merge conflict rabbit hole 🐇? Great! So what we have now is:

  • a tabular model definition which is text-based, contains no data and is in version control
  • a way for multiple developers to work on it safely and in parallel.

Let's deploy this thing! Like last time, I'm going to automate a basic workflow, in preparation for something more powerful:

  1. A developer makes changes to a tabular model.

  2. The dev saves the modified model into version control, causing it to be published automatically as a Power BI dataset.

To publish the dataset I'll be using the Power BI workspace's XMLA endpoint. This is a Power BI Premium feature – a lower-cost option available for development and testing purposes is to use an A SKU.

To make deployments, XMLA endpoint availability must be set to Read/Write for your capacity (in Admin portal → Capacity settings → Power BI workloads → XMLA Endpoint):

Here's the YAML definition for my Azure DevOps pipeline – it's similar to the last one, but:

  • It uses a Windows build agent, instead of an Ubuntu agent (specified by the windows-latest pool image). I'll explain why in a moment.
  • It calls a different PowerShell script – Deploy-PbiDataset.ps1 – with appropriate parameter values.
  • It has some small differences in the trigger & variable definitions – those are just to match the code location for this post.
trigger:
  branches:
    include:
      - main
  paths:
    include:
      - powerbi-pro-devex-series/02-Datasets/Model.bim

variables:
  folderPath: $(System.DefaultWorkingDirectory)/powerbi-pro-devex-series/02-Datasets

pool:
  vmImage: windows-latest

steps:

- task: PowerShell@2
  displayName: Publish Power BI dataset
  inputs:      
    targetType: filePath
    filePath: $(folderPath)/Deploy-PbiDataset.ps1
    arguments: >
      -DatasetName "AdventureWorksLT"
      -WorkspaceName "Datasets"
      -BimFilePath "$(folderPath)/Model.bim"
      -RefreshDataset $true
    failOnStderr: true
  env:
    AZURE_TENANT_ID: $(AzureTenantId)
    AZURE_CLIENT_ID: $(AzureClientId)
    AZURE_CLIENT_SECRET: $(AzureClientSecret)

Once I've created an Azure DevOps pipeline to use this pipeline definition file, the pipeline will be triggered whenever a change is made to the specified Model.bim file in the GitHub repository's main branch.

The pipeline calls the PowerShell script file Deploy-PbiDataset.ps1. Here's the script contained in the file:

  1. param(
  2.     [Parameter(Mandatory = $true)]
  3.     [ValidateNotNullOrEmpty()]
  4.     [string] $DatasetName,
  5. 
  6.     [Parameter(Mandatory = $true)]
  7.     [ValidateNotNullOrEmpty()]
  8.     [string] $WorkspaceName,
  9. 
  10.     [Parameter(Mandatory = $true)]
  11.     [ValidateNotNullOrEmpty()]
  12.     [string] $BimFilePath,
  13. 
  14.     [Parameter(Mandatory = $False)]
  15.     [ValidateNotNullOrEmpty()]
  16.     [bool] $RefreshDataset
  17. )
  18. 
  19. 
  20. $scriptFolder = $MyInvocation.MyCommand.Path | Split-Path
  21. Import-Module $scriptFolder\PbiDeployment\PbiDeployment.psm1 -Force
  22. 
  23. # Connect to Power BI
  24. $credential = Use-Pbi -WithModules @('Workspaces', 'Data')
  25. 
  26. # prepare model
  27. $model = Get-Content $BimFilePath -Encoding UTF8
  28. $cmd = '{"createOrReplace":{"object":{"database":""}, "database":' + $model + '}}' | ConvertFrom-Json
  29. $cmd.createOrReplace.object.database = $DatasetName
  30. $cmd.createOrReplace.database.name = $DatasetName
  31. $query = ConvertTo-Json $cmd -Depth 100 -Compress
  32. 
  33. # deploy model
  34. $endpoint = "powerbi://api.powerbi.com/v1.0/myorg/$WorkspaceName"
  35. $result = Invoke-ASCmd -Server $endpoint -Query $query -Credential $credential -ServicePrincipal -TenantId $Env:AZURE_TENANT_ID
  36. Write-Output $result
  37. if ($result -like "*error*") {
  38.     throw "Error deploying dataset."
  39. }
  40. 
  41. # refresh dataset
  42. if ($RefreshDataset) {
  43.     $workspaceId = (Get-PowerBIWorkspace -Name $WorkspaceName).Id
  44.     $datasetId = (Get-PowerBIDataset -WorkspaceId "$workspaceId" | Where-Object {$_.Name -eq "$DatasetName"}).Id
  45.     $url = "/groups/$workspaceId/datasets/$datasetId/refreshes"
  46.     $body = @{notifyOption = "NoNotification"} | ConvertTo-Json -Compress
  47.     Invoke-PowerBIRestMethod -Url $url -Method POST -Body $body | Out-Null
  48. }
Like the last script, it begins by importing PowerShell modules for Power BI and connecting to the Power BI service. For this example I've moved those tasks into a separate module file, PbiDeployment.psm1 (imported on line 21). The module function Use-Pbi – called on line 24 – is responsible for importing modules and authenticating against Power BI.
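PbiDeployment.psm1 is included in the repo, but as a rough sketch (not the module's exact code), Use-Pbi does something along these lines, assuming the MicrosoftPowerBIMgmt modules are available on the agent and the service principal details arrive in the AZURE_* environment variables set by the pipeline:

function Use-Pbi {
    # Simplified sketch of a helper that imports the Power BI management modules a script needs
    # and signs in as a service principal. Assumes the MicrosoftPowerBIMgmt.* modules are installed.
    param([string[]] $WithModules = @())

    foreach ($suffix in @('Profile') + $WithModules) {
        Import-Module "MicrosoftPowerBIMgmt.$suffix" -ErrorAction Stop
    }

    # Build a credential from the service principal's client ID and secret
    $secret = ConvertTo-SecureString $Env:AZURE_CLIENT_SECRET -AsPlainText -Force
    $credential = New-Object System.Management.Automation.PSCredential($Env:AZURE_CLIENT_ID, $secret)

    # Sign in to the Power BI service, then return the credential for reuse (e.g. by Invoke-ASCmd)
    Connect-PowerBIServiceAccount -ServicePrincipal -Credential $credential -Tenant $Env:AZURE_TENANT_ID | Out-Null
    return $credential
}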

Lines 27-31 prepare the model's TMSL representation for deployment by:

  • reading the model from its .bim file
  • embedding it in a TMSL createOrReplace command, then converting the command into a PowerShell object
  • setting database name properties in the command object (lines 29 & 30)
  • converting the object back into JSON.

The model is deployed to the workspace's XMLA endpoint on line 35, using the Invoke-ASCmd cmdlet from the SqlServer module for PowerShell. The cmdlet is not present in the module version for PowerShell Core, which is why the pipeline has to use a Windows build agent.1)

Invoke-ASCmd returns an XML result describing the outcome of the operation. Even if the operation fails, the cmdlet returns a success response – lines 37-39 inspect the returned XML for error messages, and throw a “real” error if any are found. This ensures that if dataset deployment fails, the deployment pipeline run also fails.

Finally, if the value of the script's RefreshDataset parameter is $true, the script uses the Power BI API to refresh the dataset from its sources.

By design, deploying a Power BI dataset like this excludes data source credentials – after initial deployment, you must set them in the published dataset before its data can be refreshed.2) Once set, stored credentials are retained in the published dataset, even after further deployments.

Now I have everything I need to deploy dataset updates automatically 😃. In the video below I update the model .bim file and sync the changes to GitHub, causing the Azure DevOps pipeline to publish the dataset automatically into Power BI. I'm still making changes directly to the main branch – I'll improve on that in a future post.

In this article I demonstrated an approach to managing standalone Power BI datasets as .bim files, outside a PBIX file. This has a number of advantages:

  • Model definitions stored as text files can be version controlled more meaningfully.
  • A consistently-ordered TMSL file eliminates most merge conflicts, enabling parallel model development. As it matures, TMDL is likely to become the better approach, because TMDL files are naturally less susceptible to merge conflicts and are easier to read.
  • A standalone model can be shared between multiple reports.
  • PBIX report files using shared datasets no longer contain their own data, keeping it out of version control.
  • Next up: In the next post, I describe a more sophisticated workflow for report development – supporting report environments for testing, UAT and production – and build a new deployment pipeline to support it.

  • Code: The code for the series is available on GitHub. The files specific to this article are in the powerbi-pro-devex-series/02-Datasets folder.

  • Share: If you found this article useful, please share it!


1)
The Windows build agent includes the SqlServer module by default, so no Install-Module call is required in the script.
2)
You may be able to set credentials programmatically using the Gateways - Update Datasource API – I have not done so here because the ability to do this varies, depending on how your data sources are configured.