Better version control for Power BI datasets
This is Part 2 of a series about creating a professional developer experience for working with Power BI. If you haven't already seen the first, you may prefer to start there.
In the previous post, I outlined a possible workflow for Power BI development and implemented an Azure DevOps pipeline to show how steps in such a workflow could be automated. To build the pipeline I stored an entire .pbix report file – data and all – in version control, which is a problem for at least two reasons:
- storing large report files in a version control system won't scale well
- datasets may contain confidential or sensitive data which must be kept out of version control.
In this post I'll look at separating a report's dataset from its visuals, version controlling the standalone dataset (without data), and deploying the dataset automatically to Power BI.
Want a preview of where we're going? Take a look at the video at the end 😊.
Tabular models
A Power BI dataset is an example of a tabular model. A tabular model is an analytical data store – a kind of database technology distinct from relational databases such as SQL Server. Tabular models are optimised for analytic query workloads, and tabular model technology underpins a number of Microsoft analytics services:
You may be used to developing Power BI reports which import data from external sources, for example Excel files, text files, SQL Server or Dataverse. Even in these cases, the Power BI report stores imported data in an internal tabular model database.
Separating a Power BI report from its tabular model has a number of advantages, not just limited to version control:
- A Power BI report which uses a separate dataset contains no data at all. This is sometimes called a thin report because it contains visual features only, often resulting in a very small .pbix file.
- A standalone tabular model definition can be stored in a text-based format. This also contains no data, and has additional benefits for version control.
- A standalone tabular model can be shared by multiple reports. Sharing datasets helps you ensure that measures and other business concepts defined in the model are reproduced consistently in every report, and has a bunch of other advantages. For this reason shared datasets are sometimes referred to as golden datasets.
Tabular model scripting language (TMSL)
Tabular model scripting language (TMSL, pronounced “TIM-zel”) is the most common format for persisting tabular models as text. TMSL files have the extension .bim – this is the top of the .bim file representing the model I used for the sample report in my previous post:
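In outline, the top of a .bim file looks something like this – an illustrative sketch rather than the actual sample model, with representative property values only:

```json
{
  "name": "AdventureWorksLT",
  "compatibilityLevel": 1550,
  "model": {
    "culture": "en-GB",
    "defaultPowerBIDataSourceVersion": "powerBI_V3",
    "tables": [],
    "relationships": [],
    "annotations": []
  }
}
```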
TMSL is implemented using JSON, which means that it's (sort of) human-readable – as we shall see later, this makes it possible to compare different versions of a model and to understand differences using the text alone.
Developing standalone datasets
Working with datasets in Power BI Desktop means that the tabular model definition continues to be stored in a PBIX binary file (not text) – this prevents you from taking advantage of other features of version control in Git. I obtained the .bim file shown above by extracting it from the sample report's PBIX file using pbi-tools, but this isn't a convenient way to manage dataset definitions on an ongoing basis – the workflow is bumpy, so the developer experience is poor.
You can create and edit TMSL tabular models in a number of ways:
- using a text editor (in principle, but not a great experience 😂)
- using Visual Studio's Analysis Services extension
- using a third-party tool such as Tabular Editor.
If you want to try before you buy, Tabular Editor v2 (TE2) is free and can be downloaded by choosing “Tabular Editor 2.x (free)” from the dropdown at the right of the download button.
Use the portable version (usually a download called TabularEditor.Portable.zip) if you'd prefer to test the software without installing it.
This is my model's .bim file opened in Tabular Editor 2.17.3:
Using standalone datasets in Power BI reports
Once I've created a standalone dataset and published it to the Power BI service, I can use it in any Power BI report:
When creating a new Power BI report, I choose “Power BI datasets” from Power BI Desktop's “Get data” dropdown. I can then select a published dataset from the available list.
To bind my original sample report to the newly standalone dataset, I edit the report in Power BI Desktop, then delete all objects from its model view. After the model objects are deleted, I can bind the report to a published dataset as above, via the “Get data” dropdown.
When a report is bound to a standalone dataset, notice that the Data view icon disappears from the Power BI Desktop sidebar. This is because the report no longer contains any data itself – it is using the standalone dataset instead.
I talk more about binding reports to datasets later in the series.
TMSL & version control
Using a tool like Visual Studio or Tabular Editor, I can make changes to a standalone tabular model, saving them back to the model's .bim file as I go. To bring the model under version control, I add the file to my Git repository. At this point, other advantages of maintaining the dataset as a .bim file start to appear 😊.
Inspecting version history
Because a model's .bim file is text (TMSL) rather than binary (PBIX), it becomes possible to make meaningful comparisons between different versions. The screenshot below shows a GitHub compare between versions of a PBIX file:
As the commit message suggests, the latest commit adds a measure description to the PBIX file. It isn't possible to verify the change, because the PBIX file format is binary.
In contrast, this screenshot shows a comparison of the same change, this time made to a TMSL file using Tabular Editor:
In this case, the exact nature of the modification is visible, confirming not only that a description was added, but also the text of the new value.
Parallel development & merge conflicts
When changes are pushed to a Git service such as GitHub or Azure Repos, those changes will be automatically merged into the target file(s) if Git can work out how to do so. In the case of a text file – where a few lines have been updated or inserted – this is usually possible. In the case of a binary file such as a PBIX, Git can never perform an automatic merge.
The ability to merge different text changes into different parts of the same .bim file removes a longstanding bottleneck in Power BI dataset development: developers can work on the same model, independently, at the same time. This means less developer coordination, less waiting around, and an overall better developer experience.
However, even with a text-based model definition, merge conflicts can occur when Git receives changes from different developers to the same part of the .bim file. TMSL merge conflicts occur for two reasons:

1. Developers don't talk to one another, and try to make changes to the same object in a model (e.g. the same measure). When this happens, it's the result of failures in developer communication – this is a process problem!
2. When a modified model is serialised (saved back to the .bim file) by Tabular Editor or Visual Studio, the order in which the model's objects are stored changes. This creates a large number of unmergeable file changes, even if the actual change to the model is very small.
Avoiding merge conflicts of the second type is critical to a smooth, parallel development workflow. The good news is that a number of solutions to the problem have been proposed 😃.
Save-to-Folder in Tabular Editor
Specifically with the aim of enabling parallel development, you can configure Tabular Editor to “save to folder”. Save-to-folder breaks the model up, defining each object in a separate file. This screenshot of File Explorer shows the result of saving the example tabular model using the save-to-folder option:
Version controlling this file structure sidesteps the reordering problem and so avoids merge conflicts, but the model must be converted back to TMSL (a single .bim file) by hand before it can be deployed. From a workflow perspective this is less appealing.
Tabular Model Definition Language (TMDL)
The public preview of Tabular Model Definition Language (TMDL, pronounced “TIM-del”) was announced by Microsoft in April 2023. TMDL provides an alternative serialisation language for tabular models, with a number of advantages:
- Like Tabular Editor's “save to folder”, model objects are saved in separate files, grouped into folders.
- Objects are no longer persisted in JSON, but with a YAML-like syntax which is easier to read.
- TMDL is supported by Microsoft.
This is the TMSL representation of my model's “Amount by Due Date” measure:
{ "name": "Amount by Due Date", "description": "Sales amount broken down by payment due date.", "expression": [ "CALCULATE(", " SUM(Sales[Amount])", ", USERELATIONSHIP(", " Sales[Due Date]", " ,'Date'[Date]", " )", ")" ], "annotations": [ { "name": "PBI_FormatHint", "value": "{\"isGeneralNumber\":true}" } ] }
Notice in the annotations array that quotes must be escaped, and that multiline expressions are stored as arrays of JSON strings (which cannot be copied directly into a DAX expression editor).
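As a small illustration of that inconvenience, here's a hedged PowerShell sketch (not part of the deployment script) that pulls a measure's expression out of a .bim file and joins the array back into a single DAX string – it assumes the measure lives in the Sales table's measures collection, as in the sample model:

```powershell
# Read the model and find the measure by name
$model   = Get-Content "Model.bim" -Raw -Encoding UTF8 | ConvertFrom-Json
$measure = $model.model.tables |
    ForEach-Object { $_.measures } |
    Where-Object  { $_.name -eq "Amount by Due Date" }

# TMSL stores multiline expressions as an array of strings; join them back into copy-pasteable DAX
$dax = $measure.expression -join [Environment]::NewLine
Write-Output $dax
```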
In contrast, the TMDL file representing this object would look something like this:
```
/// Sales amount broken down by payment due date.
measure 'Amount by Due Date' =
    CALCULATE(
        SUM(Sales[Amount])
    ,   USERELATIONSHIP(
            Sales[Due Date]
            ,'Date'[Date]
        )
    )

    annotation PBI_FormatHint = {"isGeneralNumber":true}
```
Microsoft support is currently limited to provision of a serialisation/deserialisation API, but the intention is that TMDL will gradually be integrated into Microsoft tools. In the meantime, API support has already allowed third-party tools such as pbi-tools and Tabular Editor to begin adding support for the language.
Tabular Editor 2.18.0 introduced TMDL support to TE2, but integration into TE3 is being delayed until the TMDL API is more stable and complete. For the same reason, I won't be using TMDL in this post, but will demonstrate integrating it into this workflow in a later post.
Enforce object order on commit
A direct approach to eliminating merge conflicts caused by model serialisation is to ensure that the objects in a model's .bim file always appear in the same order, before the file is committed to Git. This can be achieved using a Git pre-commit hook – I will be using this approach here.
A detailed description is available in a separate article, and the GitHub repo accompanying this post (link at the end) includes code to support hook implementation.
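Purely as a hedged sketch of the idea – the script name and choice of collections here are illustrative, not the code from the repo – a pre-commit step might reorder the model's collections alphabetically before the file is staged:

```powershell
# Illustrative Sort-Bim.ps1 (hypothetical name): stabilise object order in a .bim file
# so that re-serialisation by different tools doesn't produce spurious diffs.
param([string] $BimFilePath = "Model.bim")

$model = Get-Content $BimFilePath -Raw -Encoding UTF8 | ConvertFrom-Json

# Sort top-level tables, then the measures and columns within each table, by name
$model.model.tables = @($model.model.tables | Sort-Object name)
foreach ($table in $model.model.tables) {
    if ($table.PSObject.Properties['measures']) { $table.measures = @($table.measures | Sort-Object name) }
    if ($table.PSObject.Properties['columns'])  { $table.columns  = @($table.columns  | Sort-Object name) }
}

$model | ConvertTo-Json -Depth 100 | Set-Content $BimFilePath -Encoding UTF8
```

A Git pre-commit hook would run a script like this against the changed .bim file and re-stage it before the commit completes.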
Automated deployment
Still with me after the merge conflict rabbit hole 🐇? Great! So what we have now is:
- a tabular model definition which is text-based, contains no data and is in version control
- a way for multiple developers to work on it safely and in parallel.
Let's deploy this thing! Like last time, I'm going to automate a basic workflow, in preparation for something more powerful:
1. A developer makes changes to a tabular model.
2. The dev saves the modified model into version control, causing it to be published automatically as a Power BI dataset.
To publish the dataset I'll be using the Power BI workspace's XMLA endpoint. This is a Power BI Premium feature – a lower-cost option available for development and testing purposes is to use an A SKU.
To make deployments, XMLA endpoint availability must be set to Read/Write for your capacity in the Power BI admin portal (Admin portal → Capacity settings → Power BI workloads → XMLA Endpoint):
Pipeline definition
Here's the YAML definition for my Azure DevOps pipeline – it's similar to the last one, but:

- It uses a Windows build agent instead of an Ubuntu agent (specified by the windows-latest pool image). I'll explain why in a moment.
- It calls a different PowerShell script – Deploy-PbiDataset.ps1 – with appropriate parameter values.
- It has some small differences in the trigger & variable definitions – those are just to match the code location for this post.
```yaml
trigger:
  branches:
    include:
      - main
  paths:
    include:
      - powerbi-pro-devex-series/02-Datasets/Model.bim

variables:
  folderPath: $(System.DefaultWorkingDirectory)/powerbi-pro-devex-series/02-Datasets

pool:
  vmImage: windows-latest

steps:
  - task: PowerShell@2
    displayName: Publish Power BI dataset
    inputs:
      targetType: filePath
      filePath: $(folderPath)/Deploy-PbiDataset.ps1
      arguments: >
        -DatasetName "AdventureWorksLT"
        -WorkspaceName "Datasets"
        -BimFilePath "$(folderPath)/Model.bim"
        -RefreshDataset $true
      failOnStderr: true
    env:
      AZURE_TENANT_ID: $(AzureTenantId)
      AZURE_CLIENT_ID: $(AzureClientId)
      AZURE_CLIENT_SECRET: $(AzureClientSecret)
```
Once I've created an Azure DevOps pipeline to use this pipeline definition file, the pipeline will be triggered whenever a change is made to the specified Model.bim file in the GitHub repository's main branch.
PowerShell script
The pipeline calls the PowerShell script file Deploy-PbiDataset.ps1. Here's the script contained in the file:
```powershell
param(
    [Parameter(Mandatory = $true)]
    [ValidateNotNullOrEmpty()]
    [string]
    $DatasetName,
    [Parameter(Mandatory = $true)]
    [ValidateNotNullOrEmpty()]
    [string]
    $WorkspaceName,
    [Parameter(Mandatory = $true)]
    [ValidateNotNullOrEmpty()]
    [string]
    $BimFilePath,
    [Parameter(Mandatory = $False)]
    [ValidateNotNullOrEmpty()]
    [bool]
    $RefreshDataset
)

$scriptFolder = $MyInvocation.MyCommand.Path | Split-Path

Import-Module $scriptFolder\PbiDeployment\PbiDeployment.psm1 -Force

# Connect to Power BI
$credential = Use-Pbi -WithModules @('Workspaces', 'Data')

# prepare model
$model = Get-Content $BimFilePath -Encoding UTF8
$cmd = '{"createOrReplace":{"object":{"database":""}, "database":' + $model + '}}' | ConvertFrom-Json
$cmd.createOrReplace.object.database = $DatasetName
$cmd.createOrReplace.database.name = $DatasetName
$query = ConvertTo-Json $cmd -Depth 100 -Compress

# deploy model
$endpoint = "powerbi://api.powerbi.com/v1.0/myorg/$WorkspaceName"
$result = Invoke-ASCmd -Server $endpoint -Query $query -Credential $credential -ServicePrincipal -TenantId $Env:AZURE_TENANT_ID
Write-Output $result
if($result -like "*error*") {
    throw "Error deploying dataset."
}

# refresh dataset
if ($RefreshDataset) {
    $workspaceId = (Get-PowerBIWorkspace -Name $WorkspaceName).Id
    $datasetId = (Get-PowerBIDataset -WorkspaceId "$workspaceId" | Where-Object {$_.Name -eq "$DatasetName"}).Id
    $url = "/groups/$workspaceId/datasets/$datasetId/refreshes"
    $body = @{notifyOption = "NoNotification"} | ConvertTo-Json -Compress
    Invoke-PowerBIRestMethod -Url $url -Method POST -Body $body | Out-Null
}
```
Like the last script, it begins by importing PowerShell modules for Power BI and connecting to the Power BI service. For this example I've moved those tasks into a separate module file, PbiDeployment.psm1 (imported on line 22). The module function Use-Pbi – called on line 25 – is responsible for importing modules and authenticating against Power BI.
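As a hedged sketch only – the real module may differ – Use-Pbi could look something like this, reading the service principal details from the environment variables the pipeline passes in:

```powershell
function Use-Pbi {
    param([string[]] $WithModules)

    # Import the requested MicrosoftPowerBIMgmt sub-modules (e.g. Workspaces, Data)
    foreach ($moduleName in $WithModules) {
        Import-Module "MicrosoftPowerBIMgmt.$moduleName"
    }

    # Build a credential from the service principal details supplied by the pipeline
    $secret     = ConvertTo-SecureString $Env:AZURE_CLIENT_SECRET -AsPlainText -Force
    $credential = New-Object System.Management.Automation.PSCredential($Env:AZURE_CLIENT_ID, $secret)

    # Sign in to the Power BI service as the service principal
    Connect-PowerBIServiceAccount -ServicePrincipal -Credential $credential -Tenant $Env:AZURE_TENANT_ID | Out-Null

    # Return the credential so the caller can reuse it (e.g. with Invoke-ASCmd)
    return $credential
}
```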
Lines 27-32 prepare the model's TMSL representation for deployment (the sketch after this list shows the rough shape of the result) by:

- reading the model from its .bim file
- embedding it in a TMSL createOrReplace command, then converting the command into a PowerShell object
- setting database name properties in the command object (lines 30 & 31)
- converting the object back into JSON.
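The result is a createOrReplace command with the full model embedded in it. Roughly – with the model contents elided here for brevity – the command sent to the endpoint is shaped like this:

```json
{
  "createOrReplace": {
    "object": {
      "database": "AdventureWorksLT"
    },
    "database": {
      "name": "AdventureWorksLT",
      "compatibilityLevel": 1550,
      "model": { }
    }
  }
}
```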
The model is deployed to the workspace's XMLA endpoint on line 36, using the Invoke-ASCmd cmdlet from the SqlServer module for PowerShell. The cmdlet is not present in the module version for PowerShell Core, which is why the pipeline has to use a Windows build agent.1)
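If the SqlServer module isn't already present where the script runs, making it available might look like this – a sketch only; the post's PbiDeployment module may handle this differently:

```powershell
# Make sure the SqlServer module (which provides Invoke-ASCmd) is available before deployment
if (-not (Get-Module -ListAvailable -Name SqlServer)) {
    Install-Module -Name SqlServer -Scope CurrentUser -Force
}
Import-Module SqlServer
```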
Invoke-ASCmd returns an XML result describing the outcome of the operation. Even if the operation fails, the cmdlet returns a success response – lines 38-40 inspect the returned XML for error messages, and throw a “real” error if any are found. This ensures that if dataset deployment fails, the deployment pipeline run also fails.
Finally, if the value of the script's RefreshDataset parameter is $true, the script uses the Power BI API to refresh the dataset from its sources.
By design, deploying a Power BI dataset like this excludes data source credentials – after initial deployment, you must set them in the published dataset before its data can be refreshed.2) Once set, stored credentials are retained in the published dataset, even after further deployments.
Modify the sample dataset
Now I have everything I need to deploy dataset updates automatically 😃. In the video below I update the model's .bim file and sync the changes to GitHub, causing the Azure DevOps pipeline to publish the dataset automatically into Power BI. I'm still making changes directly to the main branch – I'll improve on that in a future post.
Summary
In this article I demonstrated an approach to managing standalone Power BI datasets as .bim files, outside a PBIX file. This has a number of advantages:
- Model definitions stored as text files can be version controlled more meaningfully.
- A consistently-ordered TMSL file eliminates most merge conflicts, enabling parallel model development. As it matures, TMDL is likely to become the better approach, because TMDL files are naturally less susceptible to merge conflicts and are easier to read.
- A standalone model can be shared between multiple reports.
- PBIX report files using shared datasets no longer contain their own data, keeping it out of version control.
Next up: In the next post, I describe a more sophisticated workflow for report development – supporting report environments for testing, UAT and production – and build a new deployment pipeline to support it.
Code: The code for the series is available on GitHub. The files specific to this article are in the powerbi-pro-devex-series/02-Datasets folder.

Share: If you found this article useful, please share it!
Install-Module call is required in the script.