Building semantic models
How to build a semantic model
A semantic model is the Semantic Layer equivalent to a logical layer model (what historically has just been called a 'model' in dbt land). Just as configurations for models are defined on the models:
YAML key, configurations for semantic models are housed under semantic models:
. A key difference is that while a logical model consists of configuration and SQL or Python code, a semantic model is defined purely via YAML. Rather than encoding a specific dataset, a semantic model describes relationships and expressions that let your end users select and refine their own datasets dynamically and reliably.
- ⚙️ Semantic models are comprised of three components:
- 🫂 entities: these describe the relationships between various semantic models (think ids)
- 🔪 dimensions: these are the columns you want to slice, dice, group, and filter by (think timestamps, categories, booleans).
- 📏 measures: these are the quantitative values you want to aggregate
- 🪣 We define columns as being an entity, dimension, or measure. Columns will typically fit into one of these 3 buckets, or if they're a complex aggregation expression, they might constitute a metric.
Defining orders
Let's zoom in on how we might define an orders semantic model.
- 📗 We define it as a YAML dictionary in the
semantic_models
list. - 📑 It will have a name, entities list, dimensions list, and measures list.
- ⏬ We recommend defining them in this order consistently as a style best practice.
semantic_models:
- name: orders
entities: ... # we'll define these later
dimensions: ... # we'll define these later
measures: ... # we'll define these later
- Next we'll point to the corresponding logical model by supplying a
ref
in themodel:
property, and adescription
for documentation.
semantic_models:
- name: orders
description: |
Model containing order data. The grain of the table is the order id.
model: ref('stg_orders')
entities: ...
dimensions: ...
measures: ...
Establishing our entities
- 🫂 Entities are the objects and concepts in our data that have dimensions and measures. You can think of them as the nouns of our project, the spines of our queries that we may want to aggregate by, or simply the join keys.
- 🔀 Entities help MetricFlow understand how various semantic models relate to one another.
- ⛓️ Unlike many other semantic layers, in MetricFlow we do not need to describe joins explicitly, instead the relationships are implicitly described by entities.
- 1️⃣ Each semantic model should have one primary entity defined for itself, and any number of foreign entities for other semantic models it may join to.
- 🫂 Entities require a name and type
- 🔑 Types available are primary, foreign, unique or natural — we'll be focused on the first two for now, but you can read more about unique and natural keys.
Entities in action
If we look at an example staging model for orders, we see that it has 3 id columns, so we'll need three entities.
renamed as (
select
---------- ids
id as order_id,
store_id as location_id,
customer as customer_id,
---------- properties
(order_total / 100.0) as order_total,
(tax_paid / 100.0) as tax_paid,
---------- timestamps
ordered_at
from source
- 👉 We add them with a
name
,type
, and optionalexpr
(expression). The expression can be any valid SQL expression on your platform. - 📛 If you don't add an expression, MetricFlow will assume the name is equal to the column name in the underlying logical model.
- 👍 Our best practices pattern is to, whenever possible, provide a
name
that is the singular form of the subject or grain of the table, and useexpr
to specify the precise column name (with_id
etc). This will let us write more readable metrics on top of these semantic models. For example, we'll uselocation
instead oflocation_id
.
semantic_models:
- name: orders
...
entities:
# we use the column for the name here because order is a reserved word in SQL
- name: order_id
type: primary
- name: location
type: foreign
expr: location_id
- name: customer
type: foreign
expr: customer_id
dimensions:
...
measures:
...
Defining our dimensions
- 🧮 Dimensions are the columns that we want to filter and group by, the adjectives of our project. They come in three types:
- categorical
- time
- slowly changing dimensions — these are covered in the documentation, and a little more complex. To focus on building your mental models of MetricFlow's fundamentals, we won't be using SCDs in this guide.
- ➕ We're not limited to existing columns, we can use the
expr
property to add simple computations in our dimensions. - 📛 Categorical dimensions are the simplest, they simply require a
name
andtype
(type being categorical). If thename
property matches the name of the dimension column, that's it, you're done. If you want or need to use aname
other than the column name, or do some filtering or computation, you can supply an optionalexpr
property to evaluate for the dimension.
Dimensions in action
- 👀 Let's look at our staging model again and see what fields we have available.
select
---------- ids -> entities
id as order_id,
store_id as location_id,
customer as customer_id,
---------- numerics -> measures
(order_total / 100.0) as order_total,
(tax_paid / 100.0) as tax_paid,
---------- timestamps -> dimensions
ordered_at
from source
- ⏰ For now the only dimension to add is a time dimension:
ordered_at
. - 🕰️ At least one primary time dimension is required for any semantic models that have measures.
- 1️⃣ We denote this with the
is_primary
property, or if there is only a one-time dimension supplied it is primary by default. Below we only haveordered_at
as a timestamp so we don't need to specify anything except the minimum granularity we're bucketing to (in this case, day). By this we mean that we're not going to be looking at orders at a finer granularity than a day.
dimensions:
- name: ordered_at
expr: date_trunc('day', ordered_at)
type: time
type_params:
time_granularity: day
Dimensional models. You may have some models that do not contain measures, just dimensional data that enriches other facts. That's totally fine, a semantic model does not require dimensions or measures, it just needs a primary entity, and if you do have measures, a primary time dimension.
We'll discuss an alternate situation, dimensional tables that have static numeric values like supply costs or tax rates but no time dimensions, later in the Guide.
- 🔢 We can also make a dimension out of a numeric column that would typically be a measure.
- 🪣 Using
expr
we can create buckets of values that we label for our dimension. We'll add one of these in for labeling 'large orders' as any order totals over $50.
dimensions:
- name: ordered_at
expr: date_trunc('day', ordered_at)
type: time
type_params:
time_granularity: day
- name: is_large_order
type: categorical
expr: case when order_total > 50 then true else false end
Making our measures
- 📏 Measures are the final component of a semantic model. They describe the numeric values that we want to aggregate.
- 🧱 Measures form the building blocks of metrics, with entities and dimensions helping us combine, group, and filter those metrics correctly.
- 🏃 You can think of them as something like the verbs of a semantic model.
Measures in action
- 👀 Let's look at our staging model one last time and see what fields we want to measure.
select
---------- ids -> entities
id as order_id,
store_id as location_id,
customer as customer_id,
---------- numerics -> measures
(order_total / 100.0) as order_total,
(tax_paid / 100.0) as tax_paid,
---------- timestamps -> dimensions
ordered_at
from source
- ➕ Here
order_total
andtax paid
are the columns we want as measures. - 📝 We can describe them via the code below, specifying a name, description, aggregation, and expression.
- 👍 As before MetricFlow will default to the name being the name of a column when no expression is supplied.
- 🧮 Many different aggregations are available to us. Here we just want sums.
measures:
- name: order_total
description: The total amount for each order including taxes.
agg: sum
- name: tax_paid
description: The total tax paid on each order.
agg: sum
- 🆕 We can also create new measures using expressions, for instance adding a count of individual orders as below.
- name: order_count
description: The count of individual orders.
expr: 1
agg: sum
Reviewing our work
Our completed code will look like this, our first semantic model!
semantic_models:
- name: orders
defaults:
agg_time_dimension: ordered_at
description: |
Order fact table. This table is at the order grain with one row per order.
model: ref('stg_orders')
entities:
- name: order_id
type: primary
- name: location
type: foreign
expr: location_id
- name: customer
type: foreign
expr: customer_id
dimensions:
- name: ordered_at
expr: date_trunc('day', ordered_at)
# use date_trunc(ordered_at, DAY) if using BigQuery
type: time
type_params:
time_granularity: day
- name: is_large_order
type: categorical
expr: case when order_total > 50 then true else false end
measures:
- name: order_total
description: The total revenue for each order.
agg: sum
- name: order_count
description: The count of individual orders.
expr: 1
agg: sum
- name: tax_paid
description: The total tax paid on each order.
agg: sum
Next steps
Let's review the basics of semantic models:
- 🧱 Consist off entities, dimensions, and measures.
- 🫂 Describe the semantics and relationships of objects in the warehouse.
- 1️⃣ Correspond to a single logical model in your dbt project.
Next up, let's use our new semantic model to build a metric!