Skip to main content

Dimensions

Dimensions is a way to group or filter information based on categories or time. It's like a special label that helps organize and analyze data.

In a data platform, dimensions is part of a larger structure called a semantic model. It's created along with other elements like entities and measures, and used to add more details to your data that can't be easily added up or combined. In SQL, dimensions is typically included in the group by clause of your SQL query.

All dimensions require a name, type and in some cases, an expr parameter. The name for your dimension must be unique to the semantic model and can not be the same as an existing entity or measure within that same model.

ParameterDescriptionType
nameRefers to the name of the group that will be visible to the user in downstream tools. It can also serve as an alias if the column name or SQL query reference is different and provided in the expr parameter.

Dimension names should be unique within a semantic model, but they can be non-unique across different models as MetricFlow uses joins to identify the right dimension.
Required
typeSpecifies the type of group created in the semantic model. There are three types:

- Categorical: Group rows in a table by categories like geography, color, and so on.
- Time: Point to a date field in the data platform. Must be of type TIMESTAMP or equivalent in the data platform engine.
- Slowly-changing dimensions: Analyze metrics over time and slice them by groups that change over time, like sales trends by a customer's country.
Required
type_paramsSpecific type params such as if the time is primary or used as a partitionRequired
descriptionA clear description of the dimensionOptional
exprDefines the underlying column or SQL query for a dimension. If no expr is specified, MetricFlow will use the column with the same name as the group. You can use column name itself to input a SQL expression.Optional

Refer to the following for the complete specification for dimensions:

dimensions:
- name: name of the group that will be visible to the user in downstream tools
type: Categorical or Time
type_params: specific type params such as if the time is primary or used as a partition
description: same as always
expr: the column name or expression. If not provided the default is the dimension name

Refer to the following example to see how dimensions are used in a semantic model:

semantic_models:
- name: transactions
description: A record for every transaction that takes place. Carts are considered multiple transactions for each SKU.
model: {{ ref("fact_transactions") }}
defaults:
agg_time_dimension: metric_time
# --- entities ---
entities:
...
# --- measures ---
measures:
...
# --- dimensions ---
dimensions:
- name: metric_time
type: time
expr: date_trunc('day', ts)
- name: is_bulk_transaction
type: categorical
expr: case when quantity > 10 then true else false end

MetricFlow requires that all dimensions have a primary entity. This is to guarantee unique dimension names. If your data source doesn't have a primary entity, you need to assign the entity a name using the primary_entity: entity_name key. It doesn't necessarily have to map to a column in that table and assigning the name doesn't affect query generation.

semantic_model:
name: bookings_monthly_source
description: bookings_monthly_source
defaults:
agg_time_dimension: ds
model: ref('bookings_monthly_source')
measures:
- name: bookings_monthly
agg: sum
create_metric: true
primary_entity: booking_id

Dimensions types

This section further explains the dimension definitions, along with examples. Dimensions have the following types:

Categorical

Categorical is used to group metrics by different categories such as product type, color, or geographical area. They can refer to existing columns in your dbt model or be calculated using a SQL expression with the expr parameter. An example of a category dimension is is_bulk_transaction, which is a group created by applying a case statement to the underlying column quantity. This allows users to group or filter the data based on bulk transactions.

dimensions: 
- name: is_bulk_transaction
type: categorical
expr: case when quantity > 10 then true else false end

Time

use datetime data type if using BigQuery

To use BigQuery as your data platform, time dimensions columns need to be in the datetime data type. If they are stored in another type, you can cast them to datetime using the expr property. Time dimensions are used to group metrics by different levels of time, such as day, week, month, quarter, and year. MetricFlow supports these granularities, which can be specified using the time_granularity parameter.

Time has additional parameters specified under the type_params section. When you query one or more metrics in MetricFlow using the CLI, the default time dimension for a single metric is the aggregation time dimension, which you can refer to as metric_time or use the dimensions' name.

You can use multiple time groups in separate metrics. For example, the users_created metric uses created_at, and the users_deleted metric uses deleted_at:

# dbt Cloud users
dbt sl query --metrics users_created,users_deleted --group-by metric_time__year --order-by metric_time__year

# dbt Core users
mf query --metrics users_created,users_deleted --group-by metric_time__year --order-by metric_time__year

You can set is_partition for time or categorical dimensions to define specific time spans. Additionally, use the type_params section to set time_granularity to adjust aggregation detail (like daily, weekly, and so on):

Use is_partition: True to show that a dimension exists over a specific time window. For example, a date-partitioned dimensional table. When you query metrics from different tables, the dbt Semantic Layer uses this parameter to ensure that the correct dimensional values are joined to measures.

You can also use is_partition for categorical dimensions as well.

MetricFlow enables metric aggregation during query time. For example, you can aggregate the messages_per_month measure. If you originally had a time_granularity for the time dimensions metric_time, you can specify a yearly granularity for aggregation in your query:

# dbt Cloud users
dbt sl query --metrics messages_per_month --group-by metric_time__year --order-by metric_time__year

# dbt Core users
mf query --metrics messages_per_month --group-by metric_time__year --order metric_time__year
dimensions: 
- name: created_at
type: time
expr: date_trunc('day', ts_created) #ts_created is the underlying column name from the table
is_partition: True
type_params:
time_granularity: day
- name: deleted_at
type: time
expr: date_trunc('day', ts_deleted) #ts_deleted is the underlying column name from the table
is_partition: True
type_params:
time_granularity: day

measures:
- name: users_deleted
expr: 1
agg: sum
agg_time_dimension: deleted_at
- name: users_created
expr: 1
agg: sum

SCD Type II

caution

Currently, there are limitations in supporting SCD's.

MetricFlow supports joins against dimensions values in a semantic model built on top of an SCD Type II table (slowly changing dimension) Type II table. This is useful when you need a particular metric sliced by a group that changes over time, such as the historical trends of sales by a customer's country.

As their name suggests SCD Type II are groups that change values at a coarser time granularity. This results in a range of valid rows with different dimensions values for a given metric or measure. MetricFlow associates the metric with the first (minimum) available dimensions value within a coarser time window, such as month. By default, MetricFlow uses the group that is valid at the beginning of the time granularity.

The following basic structure of an SCD Type II data platform table is supported:

entity_keydimensions_1dimensions_2...dimensions_xvalid_fromvalid_to
  • entity_key (required): An entity_key (or some sort of identifier) must be present
  • valid_from (required): A timestamp indicating the start of a changing dimensions value must be present
  • valid_to (required): A timestamp indicating the end of a changing dimensions value must be present

Note: The SCD dimensions table must have valid_to and valid_from columns.

This is an example of SQL code that shows how a sample metric called num_events is joined with versioned dimensions data (stored in a table called scd_dimensions) using a primary key made up of the entity_key and timestamp columns.

select metric_time, dimensions_1, sum(1) as num_events
from events a
left outer join scd_dimensions b
on
a.entity_key = b.entity_key
and a.metric_time >= b.valid_from
and (a.metric_time < b. valid_to or b.valid_to is null)
group by 1, 2

This example shows how to create slowly changing dimensions (SCD) using a semantic model. The SCD table contains information about sales persons' tier and the time length of that tier. Suppose you have the underlying SCD table:

sales_person_idtierstart_dateend_date
11112019-02-032020-01-05
11122020-01-052048-01-01
22222020-03-052048-01-01
33322020-08-192021-10-22
33332021-10-222048-01-01

The validity_params include two important arguments is_start and is_end. These specify the columns in the SCD table that mark the start and end dates (or timestamps) for each tier or dimension. Additionally, the entity is tagged as natural to differentiate it from a primary entity. In a primary entity, each entity value has one row. In contrast, a natural entity has one row for each combination of entity value and its validity period.

semantic_models:
- name: sales_person_tiers
description: SCD Type II table of tiers for salespeople
model: {{ref(sales_person_tiers)}}
defaults:
agg_time_dimension: tier_start

dimensions:
- name: tier_start
type: time
expr: start_date
type_params:
time_granularity: day
validity_params:
is_start: True
- name: tier_end
type: time
expr: end_date
type_params:
time_granularity: day
validity_params:
is_end: True
- name: tier
type: categorical

primary_entity: sales_person

entities:
- name: sales_person
type: natural
expr: sales_person_id

The following code represents a separate semantic model that holds a fact table for transactions:

semantic_models: 
- name: transactions
description: |
Each row represents one transaction.
There is a transaction, product, sales_person, and customer id for
every transaction. There is only one transaction id per
transaction. The `metric_time` or date is reflected in UTC.
model: {{ ref(fact_transactions) }}
defaults:
agg_time_dimension: metric_time

entities:
- name: transaction_id
type: primary
- name: customer
type: foreign
expr: customer_id
- name: product
type: foreign
expr: product_id
- name: sales_person
type: foreign
expr: sales_person_id

measures:
- name: transactions
expr: 1
agg: sum
- name: gross_sales
expr: sales_price
agg: sum
- name: sales_persons_with_a_sale
expr: sales_person_id
agg: count_distinct

dimensions:
- name: metric_time
type: time
is_partition: true
type_params:
time_granularity: day
- name: sales_geo
type: categorical

You can now access the metrics in the transactions semantic model organized by the slowly changing dimension of tier.

In the sales tier example, For instance, if a salesperson was Tier 1 from 2022-03-01 to 2022-03-12, and gets promoted to Tier 2 from 2022-03-12 onwards, all transactions from March would be categorized under Tier 1 since the dimensions value of Tier 1 comes earlier (and is the default starting point), even though the salesperson was promoted to Tier 2 on 2022-03-12.

0