Building dbt packages

dbt Core


    Creating packages is an advanced use of dbt. If you're new to the tool, we recommend that you first use the product for your own analytics before attempting to create a package for others.


    Creating a package assumes a strong understanding of dbt — in particular, models, macros, packages, and Jinja.

    Assess whether a package is the right solution

    Packages typically contain either:

    • macros that solve a problem shared across many dbt projects, or
    • models for a common dataset (for example, data loaded by a particular EL tool).

    Packages are not a good fit for sharing models that contain business-specific logic — for example, marketing attribution or monthly recurring revenue calculations. Instead, consider sharing a blog post and a link to a sample repo, rather than bundling this code as a package (here's our blog post on marketing attribution as an example).

    Create your new project

    Using the command line for package development

    We tend to use the command line interface for package development. The development workflow often involves installing a local copy of your package in another dbt project — at present dbt Cloud is not designed for this workflow.

    1. Use the dbt init command to create a new dbt project, which will be your package:
    $ dbt init [package_name]
    2. Create a public GitHub¹ repo, named dbt-<package-name>, e.g. dbt-mailchimp. Follow the GitHub instructions to link this to the dbt project you just created.
    3. Update the name: of the project in dbt_project.yml to your package name, e.g. mailchimp.
    4. Define the allowed dbt versions by using the require-dbt-version config.
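    Steps 3 and 4 might look like this in dbt_project.yml — a minimal sketch, reusing the mailchimp name from the example above (the version bounds are illustrative):

    ```yaml
    # dbt_project.yml
    name: mailchimp
    version: 0.1.0
    config-version: 2

    # Restrict which dbt versions can install this package
    require-dbt-version: [">=1.0.0", "<2.0.0"]
    ```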

    ¹Currently, our package registry only supports packages that are hosted in GitHub.

    Develop your package

    We recommend that first-time package authors first develop macros and models for use in their own dbt project. Once your new package is created, you can get to work on moving them across, implementing some additional package-specific design patterns along the way.

    When working on your package, we often find it useful to install a local copy of the package in another dbt project — this workflow is described here.

    Follow best practices

    Modeling packages only

    Follow our dbt coding conventions, our article on how we structure our dbt projects, and our best practices when building your dbt project.

    This is where it comes in especially handy to have worked on your own dbt project previously.

    Make the location of raw data configurable

    Modeling packages only

    Not every user of your package is going to store their Mailchimp data in a schema named mailchimp. As such, you'll need to make the location of raw data configurable.

    We recommend using sources and variables to achieve this. Check out this package for an example — notably, the README includes instructions on how to override the default schema from a dbt_project.yml file.
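    A minimal sketch of this pattern, assuming hypothetical source, variable, and table names:

    ```yaml
    # models/src_mailchimp.yml — schema and database are overridable via vars
    version: 2

    sources:
      - name: mailchimp
        schema: "{{ var('mailchimp_schema', 'mailchimp') }}"
        database: "{{ var('mailchimp_database', target.database) }}"
        tables:
          - name: campaigns
    ```

    A user of the package can then point it at their own raw data from their dbt_project.yml:

    ```yaml
    vars:
      mailchimp_schema: raw_mailchimp
    ```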

    Install upstream packages from hub.getdbt.com

    If your package relies on another package (for example, you use some of the cross-database macros from dbt-utils), we recommend you install the package from hub.getdbt.com, specifying a version range like so:

    packages:
      - package: dbt-labs/dbt_utils
        version: [">=0.6.5", "<0.7.0"]

    When packages are installed from hub.getdbt.com, dbt is able to handle duplicate dependencies.

    Implement cross-database compatibility

    Many SQL functions are specific to a particular database. For example, the function name and order of arguments to calculate the difference between two dates varies between Redshift, Snowflake and BigQuery, and no similar function exists on Postgres!

    If you wish to support multiple warehouses, we have a number of tricks up our sleeve:

    • We've written a number of macros that compile to valid SQL snippets on each of the original four adapters. Where possible, leverage these macros.
    • If you need to implement cross-database compatibility for one of your macros, use the adapter.dispatch macro to achieve this. Check out the cross-database macros in dbt-utils for examples.
    • If you're working on a modeling package, you may find that you need to write different models for each warehouse (for example, if the EL tool you are working with stores data differently on each warehouse). In this case, you can write different versions of each model, and use the enabled config, in combination with target.type, to enable the correct models — check out this package as an example.
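    As a sketch of the adapter.dispatch pattern — the macro and package names here are hypothetical, and only the day datepart is implemented for brevity:

    ```sql
    {# macros/my_datediff.sql — dispatch to an adapter-specific implementation #}
    {% macro my_datediff(first_date, second_date, datepart) %}
        {{ return(adapter.dispatch('my_datediff', 'my_package')(first_date, second_date, datepart)) }}
    {% endmacro %}

    {# Default implementation, used by any adapter without an override #}
    {% macro default__my_datediff(first_date, second_date, datepart) %}
        datediff({{ datepart }}, {{ first_date }}, {{ second_date }})
    {% endmacro %}

    {# Postgres has no datediff(), so provide an adapter-specific version #}
    {% macro postgres__my_datediff(first_date, second_date, datepart) %}
        {# a real implementation would branch on datepart; day differences shown here #}
        (({{ second_date }})::date - ({{ first_date }})::date)
    {% endmacro %}
    ```

    Users of the package (and other macros within it) call my_datediff, and dbt resolves the adapter-specific implementation at compile time.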

    If your package has only been written to work for one data warehouse, make sure you document this in your package README.

    Use specific model names

    Modeling packages only

    Many datasets have a concept of a "user" or "account" or "session". To make sure things are unambiguous in dbt, prefix all of your models with [package_name]_. For example, mailchimp_campaigns.sql is a good name for a model, whereas campaigns.sql is not.

    Default to views

    Modeling packages only

    dbt makes it possible for users of your package to override your model materialization settings. In general, default to materializing models as views instead of tables.

    The major exception to this is when working with data sources that benefit from incremental modeling (for example, web page views). Implementing incremental logic on behalf of your end users is likely to be helpful in this case.
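    Defaults like these can be set once in the package's dbt_project.yml — the folder names below are illustrative:

    ```yaml
    # dbt_project.yml of the package
    models:
      mailchimp:
        +materialized: view
        # exception: high-volume event data benefits from incremental modeling
        page_views:
          +materialized: incremental
    ```

    Because these are just defaults, end users can still override the materializations from their own project.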

    Test and document your package

    It's critical that you test your models and sources. This will give your end users confidence that your package is actually working on top of their dataset as intended.

    Further, adding documentation via descriptions will help communicate your package to end users, and benefit their stakeholders that use the outputs of this package.
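    For example, a properties file in the package might test and document a model like this (model and column names are illustrative):

    ```yaml
    # models/mailchimp.yml
    version: 2

    models:
      - name: mailchimp_campaigns
        description: One record per Mailchimp campaign.
        columns:
          - name: campaign_id
            description: Primary key of a campaign.
            tests:
              - unique
              - not_null
    ```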

    Include useful GitHub artifacts

    Over time, we've developed a set of useful GitHub artifacts that make administering our packages easier for us. In particular, we ensure that we include:

    • A useful README, that has:
      • Installation instructions that refer to the latest version of the package on hub.getdbt.com, and include any configurations required (example)
      • Usage examples for any macros (example)
      • Descriptions of the main models included in the package (example)
    • GitHub templates, including PR templates and issue templates (example)

    Add integration tests


    We recommend that you implement integration tests to confirm that the package works as expected — this is an even more advanced step, so you may find that you build up to this.

    This pattern can be seen in most of our packages, including the audit-helper and snowplow packages.

    As a rough guide:

    1. Create a subdirectory named integration_tests
    2. In this subdirectory, create a new dbt project — you can use the dbt init command to do this. However, our preferred method is to copy the files from an existing integration_tests project, like the ones here (removing the contents of the macros, models and tests folders since they are project-specific)
    3. Install the package in the integration_tests subdirectory by using the local syntax, and then running dbt deps:
    - local: ../ # this means "one directory above the current directory"
    4. Add resources to the package (seeds, models, tests) so that you can successfully run your project, and compare the output with what you expect. The exact approach here will vary depending on your package. In general you will find that you need to:
      • Add mock data via a seed with a few sample (anonymized) records. Configure the integration_tests project to point to the seeds instead of raw data tables.
      • Add more seeds that represent the expected output of your models, and use the dbt_utils.equality test to confirm that the output of your package matches the expected output.
    5. Confirm that you can run dbt run and dbt test from your command line successfully.
    6. (Optional) Use a CI tool, like CircleCI or GitHub Actions, to automate running your dbt project when you open a new Pull Request. For inspiration, check out one of our CircleCI configs, which runs tests against our four main warehouses. Note: this is an advanced step — if you are going down this path, you may find it useful to say hi on dbt Slack.
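    The seed-comparison step above might look like this in the integration_tests project (the model and seed names are hypothetical):

    ```yaml
    # integration_tests/models/mailchimp.yml
    version: 2

    models:
      - name: mailchimp_campaigns
        tests:
          # compare the package's output against a seed of expected results
          - dbt_utils.equality:
              compare_model: ref('expected_mailchimp_campaigns')
    ```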

    Deploy the docs for your package


    A dbt docs site can help a prospective user of your package understand the code you've written. As such, we recommend that you deploy the site generated by dbt docs generate and link to the deployed site from your package.

    The easiest way we've found to do this is to use GitHub Pages.

    1. On a new git branch, run dbt docs generate. If you have integration tests set up (above), use the integration-test project to do this.
    2. Move the following files into a directory named docs (example): catalog.json, index.html, manifest.json, run_results.json.
    3. Merge these changes into the main branch
    4. Enable GitHub pages on the repo in the settings tab, and point it to the “docs” subdirectory
    5. GitHub should then deploy the docs at https://<org-name>.github.io/<repo-name>

    Release your package

    Create a new release once you are ready for others to use your work! Be sure to use semantic versioning when naming your release.

    In particular, if new changes will cause errors for users of earlier versions of the package, be sure to use at least a minor release (e.g. go from 0.1.1 to 0.2.0).

    The release notes should contain an overview of the changes introduced in the new version. Be sure to call out any changes that break the existing interface!

    Add the package to hub.getdbt.com

    Our package registry, hub.getdbt.com, gets updated by the hubcap script. To add your package to the registry, create a PR on the hubcap repository to include it in the hub.json file.