William Liu

DBT

Cheatsheet Commands

Summary

dbt (data build tool) is a transformation workflow that helps modularize and centralize your analytics code. Collaborate on data models, version them, test them, and document them. dbt compiles and runs your analytics code against your data platform, enabling a single source of truth for metrics, insights, and business definitions.

Advantages:

DBT Fundamentals

Traditional Data Teams

ETL vs ELT

ETL (Extract, Transform, Load) - the more traditional process

ELT (Extract, Load, Transform)

Data can be transformed directly in the database - no need to extract and load repeatedly

Introduces the idea of an analytics engineer

Analytics Engineer

Owns the Transformation of raw data up to the BI Layer

Modern Data Team:

Configs

dbt_project.yml

Every project needs a dbt_project.yml file (this is how dbt knows a directory is a dbt project). dbt uses the current working directory by default, or you can point it elsewhere with the --project-dir flag. An example file looks like:

[name](project-configs/name): string

[config-version](project-configs/config-version): 2
[version](project-configs/version): version

[profile](project-configs/profile): profilename

[model-paths](project-configs/model-paths): [directorypath]
[seed-paths](project-configs/seed-paths): [directorypath]
[test-paths](project-configs/test-paths): [directorypath]
[analysis-paths](project-configs/analysis-paths): [directorypath]
[macro-paths](project-configs/macro-paths): [directorypath]
[snapshot-paths](project-configs/snapshot-paths): [directorypath]
[docs-paths](project-configs/docs-paths): [directorypath]
[asset-paths](project-configs/asset-paths): [directorypath]

[target-path](project-configs/target-path): directorypath
[log-path](project-configs/log-path): directorypath
[packages-install-path](project-configs/packages-install-path): directorypath

[clean-targets](project-configs/clean-targets): [directorypath]

[query-comment](project-configs/query-comment): string

[require-dbt-version](project-configs/require-dbt-version): version-range | [version-range]

[quoting](project-configs/quoting):
  database: true | false
  schema: true | false
  identifier: true | false

models:
  [<model-configs>](model-configs)

seeds:
  [<seed-configs>](seed-configs)

snapshots:
  [<snapshot-configs>](snapshot-configs)

sources:
  [<source-configs>](source-configs)

tests:
  [<test-configs>](test-configs)

vars:
  [<variables>](/docs/build/project-variables)

[on-run-start](project-configs/on-run-start-on-run-end): sql-statement | [sql-statement]
[on-run-end](project-configs/on-run-start-on-run-end): sql-statement | [sql-statement]

[dispatch](project-configs/dispatch-config):
  - macro_namespace: packagename
    search_order: [packagename]
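As an aside, the directories listed under clean-targets are what dbt clean deletes. A minimal sketch, assuming the default target-path and packages-install-path directory names:

# dbt_project.yml (excerpt)
clean-targets:
  - target        # compiled SQL and run artifacts
  - dbt_packages  # installed packages

Running dbt clean from the project root then removes those directories.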

.dbtignore

Specify files that should be entirely ignored by dbt (similar to a .gitignore)

# .dbtignore

# ignore individual .py files
not-a-dbt-model.py
another-non-dbt-model.py

# ignore all .py files
**.py

# ignore all .py files with "codegen" in the filename
*codegen*.py

Modeling

Normally we create:

In dbt, models are just SQL select statements.

Model Example

dim_customers.sql

Note how we use a few CTEs (Common Table Expressions):

with customers as (
  select
    id as customer_id,
    first_name,
    last_name
  from raw.jaffle_shop.customers
),
orders as (
  select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
  from raw.jaffle_shop.orders
),
customer_orders as (
  select
    customer_id,
    min(order_date) as first_order_date,
    max(order_date) as most_recent_order_date,
    count(order_id) as number_of_orders
  from orders
  group by 1
),
final as (
  select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order_date,
    customer_orders.most_recent_order_date,
    coalesce(customer_orders.number_of_orders, 0) as number_of_orders
  from customers
  left join customer_orders using (customer_id)
)

select * from final

You can also set configs directly inside the model with a config block at the top, e.g.:

{{ config(
    materialized='table'
) }}

Selecting

You can run (materialize) only specific models and their downstream models with:

dbt run --select dim_mycustomers+
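A few more node selection examples (standard dbt graph operators; the model and tag names here are placeholders):

dbt run --select +dim_mycustomers                # dim_mycustomers plus all upstream models
dbt run --select +dim_mycustomers+               # upstream and downstream
dbt run --select models/staging/jaffle_shop      # everything in that directory
dbt run --select tag:nightly                     # models tagged "nightly"
dbt run --exclude stg_orders                     # everything except stg_orders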

Materializations

https://docs.getdbt.com/docs/build/materializations

There are four main types of materializations: table, view, incremental, and ephemeral.

You can specify the materialization in dbt_project.yml or directly inside the model .sql files.

# The following dbt_project.yml configures a project that looks like this:
# .
# └── models
#     ├── csvs
#     │   ├── employees.sql
#     │   └── goals.sql
#     └── events
#         ├── stg_event_log.sql
#         └── stg_event_sessions.sql

name: my_project
version: 1.0.0
config-version: 2

models:
  my_project:
    events:
      # materialize all models in models/events as tables
      +materialized: table
    csvs:
      # this is redundant, and does not need to be set
      +materialized: view

e.g. inside models/events/stg_event_log.sql, the same config can be set in the model itself:

{{ config(
    materialized='table'
) }}

SELECT *
FROM ...
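The incremental materialization deserves its own sketch. This is a minimal, hypothetical model (the source and column names like event_time are made up): on the first run it builds the full table, and on later runs it only processes rows newer than what is already there.

{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    event_time,
    payload
from {{ source('app', 'event_log') }}

{% if is_incremental() %}
  -- only process rows that arrived after the latest row already in this table
  where event_time > (select max(event_time) from {{ this }})
{% endif %}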

Modularity

For a car, we don’t take a pile of metal and bang it into a car. We build or order individual parts and pieces, then put them together into a car. The same idea applies to models: each model is reusable (e.g. it can feed other downstream models). How do we apply this?

In our project, we have:

models
  stg_customers.sql
  dim_customers.sql

And then we have stg_customers.sql as:

with customers as (

  select
    id as customer_id,
    first_name,
    last_name

  from raw.jaffle_shop.customers
)

SELECT * FROM customers

Let’s also create a stg_orders.sql:

with orders as (

  select
    id as order_id,
    user_id as customer_id,
    order_date,
    status

  from raw.jaffle_shop.orders

)

SELECT * FROM orders

Now we can go back to dim_customers.sql and reference our staging models with ref():



with customers as (
  select * from {{ ref('stg_customers') }}
),

orders as (
  select * from {{ ref('stg_orders') }}

)
...

Traditional Modeling

In the past, we normalized data, optimizing to reduce data redundancy, using techniques such as:

Because storage is now cheap and compute is strong, we take a more denormalized approach to modeling, optimizing for agile and ad-hoc analytics (i.e. readability).

Model Naming Conventions

dbt recommends the following five model prefixes: sources (src), staging (stg), intermediate (int), fact (fct), and dimension (dim).

Other dirs might include:

Marts are where we deliver our final models

Example structure:

models
  staging
    jaffle_shop
      stg_customers.sql
      stg_orders.sql
  marts
    core
      dim_customers.sql
    finance
    marketing

Sources

You can reference a table directly, but that's tedious to swap out when table names change. Instead, configure the source once in a .yml file. In your models you then reference it with {{ source('source_name', 'table_name') }}, which resolves to the table defined in that yml file. Sources appear as green nodes in the lineage graph.

Example

models/staging/jaffle_shop/src_jaffle_shop.yml

We’ll create a src_ file.

version: 2

sources:
  - name: jaffle_shop
    database: raw
    schema: jaffle_shop
    tables:
      - name: customers
      - name: orders

models/staging/jaffle_shop/stg_customers.sql

select
    id as customer_id,
    first_name,
    last_name
from {{ source('jaffle_shop', 'customers') }}

models/staging/jaffle_shop/stg_orders.sql

select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
from {{ source('jaffle_shop', 'orders') }}

Other examples:

with
source as (
  select * from {{ source('jaffle_shop', 'orders') }}
),
...

Freshness

You can set up a freshness check with warn_after and/or error_after.

E.g. models/staging/jaffle_shop/src_jaffle_shop.yml

Warn after 12 hours, error after 24 hours

version: 2

sources:
  - name: jaffle_shop
    database: raw
    schema: jaffle_shop
    tables:
      - name: customers
      - name: orders
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

Run dbt source freshness to check freshness
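If you only want to check some of your sources rather than all of them, the freshness command supports node selection too, e.g.:

dbt source freshness --select source:jaffle_shop           # one source
dbt source freshness --select source:jaffle_shop.orders    # one source table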

Testing

Testing in dbt lets you run tests in development, and you can schedule tests to run in production. There are two types of tests: generic and singular.

There are four generic tests out of the box: unique, not_null, accepted_values, and relationships. You can also write your own custom generic tests (a sketch of one follows below).

You can run tests with dbt test
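Since custom generic tests were mentioned above, here is a minimal sketch of one. The test name (is_positive) and file path are my own; dbt picks up generic tests defined with a test block under tests/generic (or in macros).

tests/generic/is_positive.sql

{% test is_positive(model, column_name) %}

-- fail on any row where the column is negative
select *
from {{ model }}
where {{ column_name }} < 0

{% endtest %}

It can then be used in a .yml file like any built-in generic test, e.g. adding - is_positive under a column's tests.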

Example Generic Tests

To run generic tests only, run dbt test --select test_type:generic

models/staging/jaffle_shop/stg_jaffle_shop.yml

Run with: dbt test --select stg_customers

version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values:
                - completed
                - shipped
                - returned
                - return_pending
                - placed

Example Singular Test

To run singular tests only, run dbt test --select test_type:singular

tests/assert_positive_total_for_payments.sql

We have a .sql file in the tests directory

-- Refunds have a negative amount, so the total amount should always be >=0.
-- Therefore return records where this isn't true to make the test fail.
select
    order_id,
    sum(amount) as total_amount
from {{ ref('stg_payments') }}
group by 1
having not (total_amount >= 0)

Testing source data

We can test source data as well, similar to testing our models. Generic tests on sources go in the source .yml file, while singular tests still live in the tests directory.

models/staging/jaffle_shop/src_jaffle_shop.yml

version: 2

sources:
  - name: jaffle_shop
    database: raw
    schema: jaffle_shop
    tables:
      - name: customers
        columns:
          - name: id
            tests:
              - unique
              - not_null

      - name: orders
        columns:
          - name: id
            tests:
              - unique
              - not_null
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
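To run only the tests defined on sources, you can use the source: selection method, e.g.:

dbt test --select source:jaffle_shop           # all tests on the jaffle_shop source
dbt test --select source:jaffle_shop.orders    # tests on one source table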

Test workflow

If a test fails, downstream models (e.g. your marts) don't build.
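This is the behavior you get from dbt build, which runs and tests resources in DAG order, so a failed test on a staging model causes its dependent marts to be skipped. For example:

dbt build                          # seed + run + test + snapshot, in DAG order
dbt build --select +dim_customers  # build dim_customers and everything upstream of it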

Documentation

Run dbt docs generate to create documentation (and dbt docs serve to browse it locally).

Example Documentation

models/staging/jaffle_shop/stg_jaffle_shop.yml

version: 2

models:
  - name: stg_customers
    description: Staged customer data from our jaffle shop app.
    columns:
      - name: customer_id
        description: The primary key for customers.
        tests:
          - unique
          - not_null

  - name: stg_orders
    description: Staged order data from our jaffle shop app.
    columns:
      - name: order_id
        description: Primary key for orders.
        tests:
          - unique
          - not_null
      - name: status
        description: ""
        tests:
          - accepted_values:
              values:
                - completed
                - shipped
                - returned
                - placed
                - return_pending
      - name: customer_id
        description: Foreign key to stg_customers.customer_id.
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id

models/staging/jaffle_shop/jaffle_shop.md

This defines a docs block for order_status (referenced from the yml above via {{ doc('order_status') }}):

{% docs order_status %}

One of the following values:

| status         | definition                                       |
|----------------|--------------------------------------------------|
| placed         | Order placed, not yet shipped                    |
| shipped        | Order has been shipped, not yet been delivered   |
| completed      | Order has been received by customers             |
| return_pending | Customer indicated they want to return this item |
| returned       | Item has been returned                           |

{% enddocs %}