Enforcing rules at scale with pre-commit-dbt

Benoit Perigaud · 13 min read

Editor's note — since the creation of this post, the package pre-commit-dbt's ownership has moved to another team and it has been renamed to dbt-checkpoint. A redirect has been set up, meaning that the code example below will still work. It is also possible to replace repo: https://github.com/offbi/pre-commit-dbt with repo: https://github.com/dbt-checkpoint/dbt-checkpoint in your .pre-commit-config.yaml file.

At dbt Labs, we have best practices we like to follow for the development of dbt projects. One of them, for example, is that all models should have at least unique and not_null tests on their primary key. But how can we enforce rules like this?
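For illustration, here is what that convention looks like in a model's properties file (the model and column names below are placeholders):

version: 2

models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null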

That question becomes difficult to answer in large dbt projects. Developers might not follow the same conventions, they might not be aware of past decisions, and reviewing pull requests in git can become more complex. When dbt projects have hundreds of models, it's hard to know which models have no tests defined and which aren't following your conventions.

One potential solution is to leverage the open-source package pre-commit-dbt, created by dbt community members, that can be used to automatically run tests before committing files to git or as part of CI steps. In this article, I'll walk you through the strategy I use to implement this package and enforce rules at scale.

What are pre-commit and pre-commit-dbt?

pre-commit is a framework that can be used to automatically run tests before committing files to git, leveraging git hooks.

In our case, we will use the ability of pre-commit to run automated tests, but I will also explain below how to use the flags --all-files or --files to run the same tests on a predefined list of dbt models.

pre-commit-dbt, on the other hand, defines dbt-specific tests and actions (called hooks) for the pre-commit framework.

There are currently over 20 tests that have been created, but here are 2 examples that we will leverage:

  • check-model-has-tests: Checks that the model has a minimum number of tests.
  • check-model-has-properties-file: Checks that the model has a properties file (also called a schema file).

Implementing pre-commit-dbt & adding tests

Let’s take the example of a project with more than 300 models. Dozens of people have committed to the project and a PR review process is in place, but with multiple models in the same PR, tracking whether tests have been added is not easy, and we know that not all models are tested today even though they should be.

To remediate this, let’s follow these 4 steps:

Flow graph that describes the steps in defining a strategy for implementing this package on a dbt project with more than 300 models.

1. Define our rules and create a pre-commit config file

pre-commit-dbt provides a range of tests that can be run on the models of our dbt project. In the case of a project which has existed for some time and which might not be entirely following best practices, I would recommend selecting a small subset of tests, fixing the project and adding more tests later on.

In our example, let’s just start by saying that we want:

  • all our models to have been added to a YAML file
  • all our models to include some tests

To configure pre-commit, we have to create a file called .pre-commit-config.yaml at the root of our dbt project and add the following content:

repos:
- repo: https://github.com/offbi/pre-commit-dbt
  rev: v1.0.0
  hooks:
  - id: dbt-docs-generate
  - id: check-model-has-properties-file
    name: Check that all models are listed in a YAML file
    files: ^models/
  - id: check-model-has-tests
    name: Check that all models have tests
    files: ^models/

A few notes about this file:

  • We start with a dbt-docs-generate hook to ensure that the command dbt docs generate is run before checking our models. This is required because pre-commit-dbt reads from the artifacts catalog.json and manifest.json, and those files are generated when the documentation is built.
  • The files argument is a regular expression. ^models/ applies the test to all models whose path starts with models, i.e. all the models of our project. If we wanted to run the test only on the models in the mart folder, we could write ^models/mart. There are a few handy online tools to define and test regular expressions; one of them is https://regex101.com/
  • We could also provide a value for the exclude parameter if we wanted to exclude specific files or folders
  • We could have added a parameter to the hook check-model-has-tests to specify how many tests each model is supposed to have (see here, and the sketch after this list)
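For instance, a stricter version of the check-model-has-tests hook could look like the sketch below; the minimum of 2 tests and the excluded staging folder are illustrative values to adapt to your project, and the args syntax follows the package's documentation:

- id: check-model-has-tests
  name: Check that all models have at least 2 tests
  files: ^models/
  exclude: ^models/staging/
  args: ["--test-cnt", "2", "--"]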

Now that we have defined our configuration file, the next step will depend on whether we are using dbt via dbt Cloud or dbt Core via the CLI.

If we are using dbt Cloud, let’s jump to step 4, where we will set up pre-commit-dbt as part of the CI process, otherwise, with dbt Core we can go to step 2.

2. Understand the scope of the changes required

With our configuration file created, we can now set up our computer to do some checks locally.

Activating a Python virtual environment

If we are using dbt in a Python virtual environment, let’s activate this environment. If not, we should really create a Python virtual environment (more info here) and activate it before installing pre-commit.
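For example, creating and activating a virtual environment with Python's built-in venv module could look like this (the environment name dbt-venv is arbitrary):

# Create the virtual environment once, then activate it in each new shell
python -m venv dbt-venv
source dbt-venv/bin/activate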

We could technically skip this step, but we might then end up with issues on our computer where different Python packages conflict with each other.

Installing and running pre-commit

Once in the Python virtual environment, installing pre-commit is as straightforward as running python -m pip install pre-commit.

A normal next step after installing pre-commit is to run a pre-commit install to install the git hooks and run tests automatically, but in our case, let’s wait a bit! We will cover this in step 4.

Instead, we can do a pre-commit run --all-files, which will run all the tests defined in our configuration file on all the files in our dbt project.

Animation showing the output in the Terminal after running the above commands

In my case, I can see that my model called customers.sql has not been added to any YAML file and has no test defined.

In the case of a large project, the number of issues might be much bigger. If we use zsh as our shell, we can leverage wildcard expansion and run pre-commit run --files models/mart/* to run all the checks only on the models stored under mart.

3. Decide what needs to be fixed immediately

Once we have the list of models that either don’t exist in the YAML files or that don’t have any test defined, we can decide if we want to fix all of them at once or not.

What we will see in step 4 is that even if not all models are fixed at once, the CI step and the git hooks can lead to better project hygiene, forcing every model that is being modified to be tested.

In my example above, with just one model to fix, it is easy to create a PR with the changes, but if hundreds of models show up, you might decide to only fix the most important ones at first (your mart for example) and fix the other ones later on.

4. Make it part of the periodic checks

The last step of our flow is to make those pre-commit checks part of the day-to-day activities, running on the dbt models that are newly created or modified. That way, even if we don’t fix all our models at once, tests will need to be added for the PR to be merged whenever a model gets modified.

Adding periodic pre-commit checks can be done in 2 different ways: through CI (Continuous Integration) actions, or as git hooks when running dbt locally.

a) Adding pre-commit-dbt to the CI flow (works for dbt Cloud and dbt Core users)

The example below will assume GitHub actions as the CI engine but similar behavior could be achieved in any other CI tool.

As described before, we need to run a dbt docs generate in order to create updated JSON artifacts used in the pre-commit hooks.

For that reason, we will need our CI step to execute this command, which will require setting up a profiles.yml file providing dbt the information to connect to the data warehouse. Profiles files will be different for each data warehouse (example here).

In our case, let’s create a file called profiles.yml at the root of our dbt project, with the following information:

jaffle_shop:
  target: ci
  outputs:
    ci:
      type: postgres
      host: <your_host>
      user: <user>
      password: "{{ env_var('DB_PASSWORD') }}"
      port: 5432
      dbname: <database>
      schema: ci
      threads: 4

We don’t want to save the password of our user in a clear-text file. For that purpose, we use the ability to read it from an environment variable. The next step is to save the value of our password as a secret in GitHub. In our GitHub repository, under Settings > Security > Secrets > Actions, let’s create a secret called DB_PASSWORD to store our sensitive password.
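To verify locally that dbt can resolve the environment variable, we could export it in our shell before running the docs command (the password value below is a placeholder):

# Make the password and the profiles directory visible to dbt, then build the artifacts
export DB_PASSWORD='<your_password>'
export DBT_PROFILES_DIR=.
dbt docs generate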

Finally, we can create a new YAML file to define our GitHub action, e.g. .github/workflows/pre_commit_checks.yml. The name is not important, but this file must be saved under the folder .github/workflows/ (create it if it doesn’t exist yet):

name: pre-commit-check

on:
  pull_request:
    branches:
      - main

jobs:
  pre-commit-pip:
    name: Install pre-commit via pip
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: ['ubuntu-latest']
        python-version: [3.8]

    # Set environment variables used throughout workflow
    env:
      DBT_PROFILES_DIR: .
      DB_PASSWORD: ${{ secrets.DB_PASSWORD }}

    steps:

      - name: Checkout branch
        uses: actions/checkout@v3

      # Using bash and pip to install dbt and pre-commit
      # Update the dbt installation command to include the adapter you need
      - name: Install dbt and pre-commit
        shell: bash -l {0}
        run: |
          python -m pip install dbt-postgres pre-commit

      # This action will output all the files that are being created and modified in our PR
      - name: Get changed files
        id: get_file_changes
        uses: trilom/file-changes-action@v1.2.4
        with:
          output: ' '

      # Transforming the output of get_file_changes to a string we can use for our next step
      # We want to take into account both new files and files that have been modified
      - name: Get changed .sql files in /models to lint
        id: get_files_to_lint
        shell: bash -l {0}
        run: |
          # Set the command in the $() brackets as an output to use in later steps
          echo "::set-output name=files::$(
          # Issue where grep regular expressions don't work as expected on the
          # GitHub Actions shell, check dbt/models/ folder
          echo \
          $(echo ${{ steps.get_file_changes.outputs.files_modified }} |
          tr -s ' ' '\n' |
          grep -E '^models.*[.]sql$' |
          tr -s '\n' ' ') \
          $(echo ${{ steps.get_file_changes.outputs.files_added }} |
          tr -s ' ' '\n' |
          grep -E '^models.*[.]sql$' |
          tr -s '\n' ' ')
          )"

      # Finally run pre-commit
      - name: Run pre-commit
        shell: bash -l {0}
        run: |
          pre-commit run --files ${{ steps.get_files_to_lint.outputs.files }}

The code is documented and should be self-explanatory. In a nutshell, we perform the following steps:

  • Mention that this action and all the steps should be run for every PR to main
  • Retrieve the code from our PR
  • Install dbt and pre-commit
  • Identify the files modified in our PR and format them as a list of models separated by spaces
  • Execute a pre-commit run --files on the models we just modified or created

Once we push those changes to a custom branch of our repo and create a PR to main, we see the following:

The GitHub action is running:

Screenshot of a GitHub action executing on a PR that is running the pre-commit-check test

The step fails because I missed some tests and it tells me what model is failing:

Screenshot of the errors logs for the failed pre-commit-check test on the PR shown previously

The result of the check is also shown in the PR directly:

Screenshot of the failed test shown directly in the PR "checks" interface

With that information, I could now go back to dbt, document my model customers and push those new changes to my repo for another check to be performed.

We could set up rules that prevent any change from being merged if the GitHub action fails. Alternatively, this action step can be defined as merely informational.

b) Installing the pre-commit git hooks (for dbt Core users)

If we develop locally with the dbt Core CLI, we can also execute pre-commit install to install the git hooks. This means that every time we want to commit code in git, the pre-commit hooks will run and will prevent us from committing if any step fails.

If we want to commit code without performing all the steps of the pre-commit hooks, we can use the environment variable SKIP or the git flag --no-verify, as described in the documentation. (e.g. we might want to skip the automatic dbt docs generate locally to prevent it from running at every commit and rely on running it manually from time to time)
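For example, skipping only the dbt-docs-generate hook for one commit, or bypassing all hooks entirely, could look like this (the commit message is illustrative):

# Skip a single hook by its id for this commit only
SKIP=dbt-docs-generate git commit -m "Add customers model"

# Skip all pre-commit hooks for this commit
git commit --no-verify -m "Add customers model"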

And if we install the hooks and realize that we don’t want them anymore, we can run pre-commit uninstall (or simply delete the folder .git/hooks/).

c) So tests in CI, or git hooks locally?

Those two configurations are not exclusive but complementary.

  • Having the hooks set up locally ensures that all our models follow our conventions even before pushing them to our repo, ensuring that the CI step will pass correctly.
  • And the GitHub action as a CI test is a great safeguard for people using the dbt Cloud IDE, or for local developers who either have not installed the hooks or tried to push changes with the --no-verify flag.

A meme that reads "Why not both?"

Taking this solution further

We now have a process to ensure that the rules we set around testing our dbt models are enforced through automated steps.

What next? Going back to the diagram at the beginning of this post, we can now think of new rules we want to automate and modify our pre-commit file and GitHub actions to increase the quality of our dbt project.

It is important, though, to keep a good balance: enough rules and automation to ensure a project of good quality, but not so many that they take time away from more value-added work and slow down the overall analytics development process.

A diagram that adds additional steps to the original diagram shown in the beginning of the article. This solution includes a cyclical step for continuously adding new rules and leveraging a SQL linter.

  • We could, for example, add SQLFluff as a SQL linter to show us which SQL code is not following the rules we have defined
  • Or we could add more pre-commit-dbt checks, like check-model-name-contract, to make sure that all our model names follow the correct naming convention
  • Or we could add a check on our YAML files to verify that they are all indented correctly (a configuration sketch for these checks follows this list)
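As an illustration, those additional checks could be added to our .pre-commit-config.yaml along the lines of the sketch below; the rev values are examples to pin to current releases, and the naming pattern passed to check-model-name-contract is a placeholder to adapt to your own convention:

repos:
# Lint SQL files with SQLFluff
- repo: https://github.com/sqlfluff/sqlfluff
  rev: 1.4.5
  hooks:
  - id: sqlfluff-lint
# Enforce a model naming convention with pre-commit-dbt
- repo: https://github.com/offbi/pre-commit-dbt
  rev: v1.0.0
  hooks:
  - id: check-model-name-contract
    args: ["--pattern", "(stg_|int_|fct_|dim_).*"]
    files: ^models/
# Check that YAML files are well formed and consistently indented
- repo: https://github.com/adrienverge/yamllint
  rev: v1.29.0
  hooks:
  - id: yamllint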

If you have any questions about this process or want to share how you are using pre-commit and CI to increase the quality of your dbt project, feel free to jump into dbt Slack and post in #i-made-this or #dbt-deployment-and-orchestration!
