dbt run
executes the compiled sql model files against the current target
database
dbt run --full-refresh
will treat incremental models as table modelsdbt select --select my_selected_model
to only run that tabledbt build
will run models, test tests, snapshot snapshots, seed seedsdbt test
will test your modelsdbt docs generate
to create documents and dbt docs serve
to start a webserver on port 8080
dbt (data build tool) is a transformation workflow that helps modularize and centralize your analytics code. Collaborate on data models, version them, test, and document. dbt compiles and runs yours analytics code against your data platform, enabling a single source of truth for metrics, insights, and business definitions.
select
statement, a Python DataFrame, and dbt handles the materialization (default to views)ETL (Extract Transform Load) - more traditional process
ELT (Extract, Load, Transform)
Data can be transformed directly in the database - no need to extract and load repeatedly
Introduces the idea of an analytics engineer
Owns the Transformation of raw data up to the BI Layer
Modern Data Team:
dbt_project.yml
Every project needs a dbt_project.yml
file (how dbt knows a directory is a dbt project).
We use the current working directory or you can set the --project-dir
flag. An example file looks like:
[name](project-configs/name): string
[config-version](project-configs/config-version): 2
[version](project-configs/version): version
[profile](project-configs/profile): profilename
[model-paths](project-configs/model-paths): [directorypath]
[seed-paths](project-configs/seed-paths): [directorypath]
[test-paths](project-configs/test-paths): [directorypath]
[analysis-paths](project-configs/analysis-paths): [directorypath]
[macro-paths](project-configs/macro-paths): [directorypath]
[snapshot-paths](project-configs/snapshot-paths): [directorypath]
[docs-paths](project-configs/docs-paths): [directorypath]
[asset-paths](project-configs/asset-paths): [directorypath]
[target-path](project-configs/target-path): directorypath
[log-path](project-configs/log-path): directorypath
[packages-install-path](project-configs/packages-install-path): directorypath
[clean-targets](project-configs/clean-targets): [directorypath]
[query-comment](project-configs/query-comment): string
[require-dbt-version](project-configs/require-dbt-version): version-range | [version-range]
[quoting](project-configs/quoting):
database: true | false
schema: true | false
identifier: true | false
models:
[<model-configs>](model-configs)
seeds:
[<seed-configs>](seed-configs)
snapshots:
[<snapshot-configs>](snapshot-configs)
sources:
[<source-configs>](source-configs)
tests:
[<test-configs>](test-configs)
vars:
[<variables>](/docs/build/project-variables)
[on-run-start](project-configs/on-run-start-on-run-end): sql-statement | [sql-statement]
[on-run-end](project-configs/on-run-start-on-run-end): sql-statement | [sql-statement]
[dispatch](project-configs/dispatch-config):
- macro_namespace: packagename
search_order: [packagename]
.dbtignore
Specify files that should be entirely ignored by dbt (similar to a .gitignore
)
# .dbtignore
# ignore individual .py files
not-a-dbt-model.py
another-non-dbt-model.py
# ignore all .py files
**.py
# ignore all .py files with "codegen" in the filename
*codegen*.py
Normally we create:
In dbt, modeling are just SQL Select Statements.
models
directory.dim_customers.sql
See how we have a few CTEs (Common Table Expression)
with customers as (
select
id as customer_id,
first_name,
last_name
from raw.jaffle_shop.customers
),
orders as (
select
id as order_id,
user_id as customer_id,
order_date,
status
from raw.jaffle_shop.orders
),
customer_orders as (
select
customer_id,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date,
count(order_id) as number_of_orders
from orders
group by 1
),
final as (
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from customers
left join customer_orders using (customer_id)
)
select * from final
with a config block
You can select to only materialize specific models and its downstream models with:
dbt run --select dim_mycustomers+
https://docs.getdbt.com/docs/build/materializations
There’s four types of materializations:
create table as
statement)You can specify in a dbt_project.yml
or directly inside the model sql files.
# The following dbt_project.yml configures a project that looks like this:
# .
# └── models
# ├── csvs
# │ ├── employees.sql
# │ └── goals.sql
# └── events
# ├── stg_event_log.sql
# └── stg_event_sessions.sql
name: my_project
version: 1.0.0
config-version: 2
models:
my_project:
events:
# materialize all models in models/events as tables
+materialized: table
csvs:
# this is redundant, and does not need to be set
+materialized: view
e.g. inside the models/events/stg_event_log.sql
SELECT *
FROM ...
For a car, we don’t have a bunch of metal and bang that out into a car. We have individual parts and pieces we build/order, then put them together into a car. The same idea is for Models, where each model is reusable (e.g. can be used for other downstream models) How do we apply it?
In directory, we have:
models
stg_customers.sql
dim_customers.sql
And then we have stg_customers.sql
as:
with customers as (
select
id as customer_id,
first_name,
last_name
from raw.jaffle_shop.customers
)
SELECT * FROM customers;
Let’s also create a stg_orders.sql
:
with orders as (
select
id as order_id,
`user_id` as customer_id,
order_date,
status
from raw.jaffle_shop.orders
)
SELECT * FROM orders
Now we can go back to dim_customers.sql
and add in our models:
with customers as (
select * from
),
orders as (
select * from
)
...
In the past, we normalized data, optimizing for reducing data redundancy using:
Due to storage being cheap and compute being stronger, we approach a denormalized modeling for more agile analytics, ad-hoc analytics (i.e. optimize for readability)
DBT recommends the following 5 Models.
Other dirs might include:
Marts are where we deliver our final models
Example structure:
models
staging
jaffle_shop
stg_customers.sql
stg_orders.sql
marts
core
dim_customers.sql
finance
marketing
You can reference a table directly, but is tedious for swapping out when table names change.
Instead, you can configure the source once in a .yml
file.
In your models, you can just reference `` to get a direct reference
of the yml file you created earlier.
Sources appear as green nodes in the GUI lineage.
models/staging/jaffle_shop/src_jaffle_shop.yml
We’ll create a src_
file.
version: 2
sources:
- name: jaffle_shop
database: raw
schema: jaffle_shop
tables:
- name: customers
- name: orders
models/staging/jaffle_shop/stg_customers.sql
select
id as customer_id,
first_name,
last_name
from
models/staging/jaffle_shop/stg_orders.sql
select
id as order_id,
user_id as customer_id,
order_date,
status
from
Other examples:
with
source as (
select * from
),
...
You can setup a freshness warn-after
and/or error_after
.
E.g. models/staging/jaffle_shop/src_jaffle_shop.yml
Warn after 12 hours, error after 24 hours
version: 2
sources:
- name: jaffle_shop
database: raw
schema: jaffle_shop
tables:
- name: customers
- name: orders
loaded_at_field: _etl_loaded_at
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
Run dbt source freshness
to check freshness
Testing in dbt lets you run tests in development. You can schedule tests to run in production. There’s two types of tests:
There’s four types of generic tests out of the box (but you can also write your own custom generic tests):
You can run tests with dbt test
To run generic tests only, run dbt test --select test_type:generic
models/staging/jaffle_shop/stg_jaffle_shop.yml
Run with: dbt test --select stg_customers
version: 2
models:
- name: stg_customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: stg_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values:
- completed
- shipped
- returned
- return_pending
- placed
To run singular tests only, run dbt test --select test_type:singular
tests/assert_positive_total_for_payments.sql
We have a .sql file in the tests directory
-- Refunds have a negative amount, so the total amount should always be >=0.
-- Therefore return records where this isn't true to make the test fail.
select
order_id,
sum(amount) as total_amount
from
group by 1
having not (total_amount >= 0)
We can test the source data as well, similar to testing our models. You can put this in the models or the custom tests dir.
models/staging/jaffle_shop/src_jaffle_shop.yml
version: 2
sources:
- name: jaffle_shop
database: raw
schema: jaffle_shop
tables:
- name: customers
columns:
- name: id
tests:
- unique
- not_null
- name: orders
columns:
- name: id
tests:
- unique
- not_null
loaded_at_field: _etl_loaded_at
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
If you fail a test, downstream marts don’t build.
dbt docs generate
to create documentation
models/staging/jaffle_shop/stg_jaffle_shop.yml
version: 2
models:
- name: stg_customers
description: Staged customer data from our jaffle shop app.
columns:
- name: customer_id
description: The primary key for customers.
tests:
- unique
- not_null
- name: stg_orders
description: Staged order data from our jaffle shop app.
columns:
- name: order_id
description: Primary key for orders.
tests:
- unique
- not_null
- name: status
description: ""
tests:
- accepted_values:
values:
- completed
- shipped
- returned
- placed
- return_pending
- name: customer_id
description: Foreign key to stg_customers.customer_id.
tests:
- relationships:
to: ref('stg_customers')
field: customer_id
models/staging/jaffle_shop/jaffle_shop.md
For docs order_status:
One of the following values:
| status | definition |
|----------------|--------------------------------------------------|
| placed | Order placed, not yet shipped |
| shipped | Order has been shipped, not yet been delivered |
| completed | Order has been received by customers |
| return pending | Customer indicated they want to return this item |
| returned | Item has been returned |