
Refactor an existing mart

A new approach

We've covered the basics; now it's time to dig into the fun and messy part: how do we refactor an existing mart in dbt into semantic models and metrics?

Let's look at the differences we can observe in how we might approach this with MetricFlow supercharging dbt versus how we work without a Semantic Layer. These differences can then inform our structure.

  • 🍊 In dbt, we tend to create highly denormalized datasets that bring everything you want around a certain entity or process into a single table.
  • 💜 The problem is that this limits the dimensionality available to MetricFlow. The more we pre-compute and 'freeze' into place, the less flexible our data is.
  • 🚰 In MetricFlow, we ideally want highly normalized, star schema-like data that then allows MetricFlow to shine as a denormalization engine.
  • ∞ Another way to think about this: instead of moving down a list of requested priorities and trying to pre-make as many combinations of our marts as possible (increasing lines of code and complexity), we can let MetricFlow present every possible combination without specifically coding it.
  • 🏗️ To resolve these approaches optimally, we'll need to shift some fundamental aspects of our modeling strategy.

Refactor steps outlined

We recommend an incremental implementation process that looks something like this:

  1. 👉 Identify an important output (for example, a revenue chart on a dashboard) and the mart model(s) that supply it.
  2. 🔍 Examine all the entities that are components of this mart (for instance, an orders mart may include customers, shipping, and product data).
  3. 🛠️ Build semantic models and metrics for all the required components.
  4. 👯 Create a clone of the output on top of the Semantic Layer.
  5. 💻 Audit to ensure you get accurate outputs.
  6. 💎 Use mf list dimensions --metrics [metric_name] to check that your refactoring is increasing dimensionality (flexibility) — see the example after this list.
  7. 👉 Identify any other outputs that point to the mart and move them to the Semantic Layer.
  8. ✌️ Put a deprecation plan in place for the mart.

You would then continue this process on other outputs and marts, moving down a list of priorities. Each subsequent model will be faster and easier, as you'll reuse many of the same components that have already been semantically modeled.
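To make step 6 concrete: once a metric exists, MetricFlow can list every dimension it can reach for that metric. A quick sketch of the check, using the revenue metric we build later in this guide:

    mf list dimensions --metrics revenue

If the refactor is adding flexibility, rerunning this after each pass should show a growing list of reachable dimensions (for example, is_food_item and is_drink_item once the order_items semantic model below is in place).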

Let's make a revenue metric

So far, we've been working in a new file pointing at a staging model to simplify things as we build new mental models for MetricFlow. In reality, unless you're implementing MetricFlow in a green-field dbt project, you're probably going to have some refactoring to do. So let's get into that in detail.

  1. 📚 Per the steps above, we've identified our target; now we need to identify all the components we need. These will be all the 'import' CTEs at the top of our mart. Looking at orders and order_items, the likely models to generate revenue, we see we'll need: orders, order_items, products, locations, and supplies.

  2. πŸ—ΊοΈ We'll next make semantic models for all of these. Let's walk through a straightforward conversion first with locations.

  3. ⛓️ We'll first want to decide whether we need to do any joining to get this into the shape we want for our semantic model. The biggest determinants are two factors:

    • πŸ“ Does this semantic model contain measures?
    • πŸ•₯ Does this semantic model have a primary timestamp?
    • πŸ«‚ If a semantic model has measures but no timestamp (for example, supplies in the example project, which has static costs of supplies), you'll likely want to sacrifice some normalization and join it on to another model that has a primary timestamp to allow for metric aggregation.
  4. 🔄 If we don't need any joins, we'll just go straight to the staging model for our semantic model's ref. Locations does have a tax_rate measure, but it also has an opened_at timestamp, so we can go straight to the staging model here.

  5. 🥇 We specify our primary entity (based on location_id), our dimensions (one categorical, location_name, and one primary time dimension, opened_at), and lastly our measures, in this case just average_tax_rate (we sketch a metric on that measure after the code).

    semantic_models:
      - name: locations
        description: |
          Location dimension table. The grain of the table is one row per location.
        model: ref('stg_locations')
        # Measures need an aggregation time dimension; we assume opened_at as the default here.
        defaults:
          agg_time_dimension: opened_at
        entities:
          - name: location
            type: primary
            expr: location_id
        dimensions:
          - name: location_name
            type: categorical
          - name: opened_at
            expr: date_trunc('day', opened_at)
            type: time
            type_params:
              time_granularity: day
        measures:
          - name: average_tax_rate
            description: Average tax rate.
            expr: tax_rate
            agg: avg
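Though we don't need it for the revenue work, this semantic model is already enough to support a metric. A minimal sketch, assuming we simply want to expose the average_tax_rate measure directly (the metric name and label here are our own, not from the example project):

    metrics:
      - name: average_tax_rate
        description: Average tax rate across locations.
        type: simple
        label: Average Tax Rate
        type_params:
          # a simple metric points at a single measure on a semantic model
          measure: average_tax_rate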

Semantic and logical interaction

Now, let's tackle a thornier situation. Products and supplies both have dimensions and measures but no time dimension. Products has a one-to-one relationship with order_items, enriching that table, which is itself just a mapping table of products to orders. Additionally, products has a one-to-many relationship with supplies. The high-level ERD looks like the diagram below.

So, to calculate, for instance, the cost of ingredients and supplies for a given order, we'll need to do some joining and aggregating, but again we lack a time dimension for products and supplies. This is the signal that we'll need to build a logical mart and point our semantic model at that.

tip

dbt 🧡 MetricFlow. This is where integrating your semantic definitions into your dbt project really starts to pay dividends. The interaction between the logical and semantic layers is so dynamic that you either need to house them in one codebase or facilitate a lot of cross-project communication and dependency management.

  1. 🎯 To start, let's aim at building a table at the order_items grain. We can aggregate supply costs up to that grain, map over the fields we want from products (such as price), and bring the ordered_at timestamp we need over from the orders table. We'll write the following code in models/marts/order_items.sql.

    {{
        config(
            materialized = 'table',
        )
    }}

    with

    order_items as (

        select * from {{ ref('stg_order_items') }}

    ),

    orders as (

        select * from {{ ref('stg_orders') }}

    ),

    products as (

        select * from {{ ref('stg_products') }}

    ),

    supplies as (

        select * from {{ ref('stg_supplies') }}

    ),

    -- supplies has multiple rows per product, so aggregate
    -- costs up to one row per product before joining
    order_supplies_summary as (

        select
            product_id,
            sum(supply_cost) as supply_cost

        from supplies

        group by 1

    ),

    joined as (

        select
            order_items.*,
            products.product_price,
            order_supplies_summary.supply_cost,
            products.is_food_item,
            products.is_drink_item,
            orders.ordered_at

        from order_items

        left join orders on order_items.order_id = orders.order_id

        left join products on order_items.product_id = products.product_id

        left join order_supplies_summary
            on order_items.product_id = order_supplies_summary.product_id

    )

    select * from joined
  2. πŸ—οΈ Now we've got a table that looks more like what we want to feed into MetricFlow. Next, we'll build a semantic model on top of this new mart in models/marts/order_items.yml. Again, we'll identify our entities, then dimensions, then measures.

    semantic_models:
      # The name of the semantic model
      - name: order_items
        defaults:
          agg_time_dimension: ordered_at
        description: |
          Items contained in each order. The grain of the table is one row per order item.
        model: ref('order_items')
        entities:
          - name: order_item
            type: primary
            expr: order_item_id
          - name: order_id
            type: foreign
            expr: order_id
          - name: product
            type: foreign
            expr: product_id
        dimensions:
          - name: ordered_at
            expr: date_trunc('day', ordered_at)
            type: time
            type_params:
              time_granularity: day
          - name: is_food_item
            type: categorical
          - name: is_drink_item
            type: categorical
        measures:
          - name: revenue
            description: The revenue generated for each order item. Revenue is calculated as a sum of revenue associated with each product in an order.
            agg: sum
            expr: product_price
          - name: food_revenue
            description: The revenue generated for each food order item, calculated as a sum of the price of food products in an order.
            agg: sum
            expr: case when is_food_item = 1 then product_price else 0 end
          - name: drink_revenue
            description: The revenue generated for each drink order item, calculated as a sum of the price of drink products in an order.
            agg: sum
            expr: case when is_drink_item = 1 then product_price else 0 end
          - name: median_revenue
            description: The median revenue generated for each order item.
            agg: median
            expr: product_price
  3. πŸ“ Finally, Let's build a simple revenue metric on top of our semantic model now.

    metrics:
      - name: revenue
        description: Sum of the product revenue for each order item. Excludes tax.
        type: simple
        label: Revenue
        type_params:
          measure: revenue
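The other measures on the semantic model can be exposed the same way. For example, a sketch of a food_revenue metric following the identical pattern (the label here is our own choice):

    metrics:
      - name: food_revenue
        description: Sum of the product revenue for food items in each order. Excludes tax.
        type: simple
        label: Food Revenue
        type_params:
          measure: food_revenue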

Checking our work

  • πŸ” We always will start our auditing with a dbt parse && mf validate-configs to ensure our code works before we examine its output.
  • πŸ‘― If we're working there, we'll move to trying out an mf query that replicates the logic of the output we're trying to refactor.
  • πŸ’Έ For our example we want to audit monthly revenue, to do that we'd run the query below. You can read more about the MetricFlow CLI.
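Before running that query, the first gate from the bullets above can be condensed into one shell step:

    dbt parse && mf validate-configs

dbt parse regenerates the project's semantic manifest, and mf validate-configs then checks the semantic models and metrics defined on top of it; only once both pass is it worth comparing query output.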

Example query

mf query --metrics revenue --group-by metric_time__month

Example query results

✔ Success 🦄 - query completed after 1.02 seconds
| METRIC_TIME__MONTH | REVENUE |
|:---------------------|----------:|
| 2016-09-01 00:00:00 | 17032.00 |
| 2016-10-01 00:00:00 | 20684.00 |
| 2016-11-01 00:00:00 | 26338.00 |
| 2016-12-01 00:00:00 | 10685.00 |
  • Try introducing some other dimensions from the semantic models into the group-by arguments to get a feel for this command, as in the sketch below.
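For instance, assuming the order_items semantic model above, we can slice revenue by is_food_item. In mf query, dimensions are qualified by their semantic model's primary entity, so the dimension is referenced as order_item__is_food_item:

    mf query --metrics revenue --group-by metric_time__month,order_item__is_food_item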

An alternate approach

If you don't have the capacity to refactor some of your marts, they can still benefit from the Semantic Layer. The process above is about maximizing dimensionality for the long term; in the short term, making your marts available to MetricFlow as-is still unlocks greatly increased functionality. For an example of this quicker approach, check out the customers SQL and YAML files on the main branch, which show a typical denormalized dbt mart being hooked into MetricFlow. A rough sketch of that pattern follows.
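We don't reproduce those files here, but the shape of the pattern, sketched with hypothetical column names for a wide customers mart, is that the mart itself becomes the model, its key becomes the primary entity, and its pre-computed columns become dimensions and measures:

    semantic_models:
      - name: customers
        description: |
          Customer grain mart. The grain of the table is one row per customer.
        model: ref('customers') # a typical wide, denormalized mart
        defaults:
          agg_time_dimension: first_ordered_at
        entities:
          - name: customer
            type: primary
            expr: customer_id
        dimensions:
          # hypothetical pre-computed columns on the mart
          - name: customer_type
            type: categorical
          - name: first_ordered_at
            type: time
            type_params:
              time_granularity: day
        measures:
          - name: lifetime_spend
            description: Pre-aggregated lifetime spend per customer.
            agg: sum

The tradeoff is exactly the one described at the top of this page: everything 'frozen' into the mart is fixed in place, so MetricFlow can only group and filter by the columns the mart already exposes.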
