William Liu

DBT

Cheatsheet Commands

Summary

dbt (data build tool) is a transformation workflow that helps modularize and centralize your analytics code. Collaborate on data models, version them, test them, and document them. dbt compiles and runs your analytics code against your data platform, enabling a single source of truth for metrics, insights, and business definitions.

Advantages:

DBT Fundamentals

Traditional Data Teams

ETL vs ELT

ETL (Extract, Transform, Load) - the more traditional process

ELT (Extract, Load, Transform)

Data can be transformed directly in the database - no need to extract and load repeatedly

Introduces the idea of an analytics engineer

Analytics Engineer

Owns the Transformation of raw data up to the BI Layer

Modern Data Team:

Configs

dbt_project.yml

Every project needs a dbt_project.yml file (this is how dbt knows a directory is a dbt project). dbt uses the current working directory by default, or you can point it elsewhere with the --project-dir flag. An example file looks like:

[name](project-configs/name): string

[config-version](project-configs/config-version): 2
[version](project-configs/version): version

[profile](project-configs/profile): profilename

[model-paths](project-configs/model-paths): [directorypath]
[seed-paths](project-configs/seed-paths): [directorypath]
[test-paths](project-configs/test-paths): [directorypath]
[analysis-paths](project-configs/analysis-paths): [directorypath]
[macro-paths](project-configs/macro-paths): [directorypath]
[snapshot-paths](project-configs/snapshot-paths): [directorypath]
[docs-paths](project-configs/docs-paths): [directorypath]
[asset-paths](project-configs/asset-paths): [directorypath]

[target-path](project-configs/target-path): directorypath
[log-path](project-configs/log-path): directorypath
[packages-install-path](project-configs/packages-install-path): directorypath

[clean-targets](project-configs/clean-targets): [directorypath]

[query-comment](project-configs/query-comment): string

[require-dbt-version](project-configs/require-dbt-version): version-range | [version-range]

[quoting](project-configs/quoting):
  database: true | false
  schema: true | false
  identifier: true | false

models:
  [<model-configs>](model-configs)

seeds:
  [<seed-configs>](seed-configs)

snapshots:
  [<snapshot-configs>](snapshot-configs)

sources:
  [<source-configs>](source-configs)

tests:
  [<test-configs>](test-configs)

vars:
  [<variables>](/docs/build/project-variables)

[on-run-start](project-configs/on-run-start-on-run-end): sql-statement | [sql-statement]
[on-run-end](project-configs/on-run-start-on-run-end): sql-statement | [sql-statement]

[dispatch](project-configs/dispatch-config):
  - macro_namespace: packagename
    search_order: [packagename]
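As an aside, the directories listed under clean-targets are what dbt clean deletes. A minimal sketch, assuming the default target-path and packages-install-path directory names:

# dbt_project.yml (excerpt)
clean-targets:
  - target        # compiled SQL and run artifacts
  - dbt_packages  # installed packages

Running dbt clean from the project root then removes those directories.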

.dbtignore

Specify files that should be entirely ignored by dbt (similar to a .gitignore)

# .dbtignore

# ignore individual .py files
not-a-dbt-model.py
another-non-dbt-model.py

# ignore all .py files
**.py

# ignore all .py files with "codegen" in the filename
*codegen*.py

Modeling

Normally we create:

In dbt, models are just SQL select statements.

Model Example

dim_customers.sql

Note how we use a few CTEs (Common Table Expressions):

with customers as (
  select
    id as customer_id,
    first_name,
    last_name
  from raw.jaffle_shop.customers
),
orders as (
  select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
  from raw.jaffle_shop.orders
),
customer_orders as (
  select
    customer_id,
    min(order_date) as first_order_date,
    max(order_date) as most_recent_order_date,
    count(order_id) as number_of_orders
  from orders
  group by 1
),
final as (
  select
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    customer_orders.first_order_date,
    customer_orders.most_recent_order_date,
    coalesce(customer_orders.number_of_orders, 0) as number_of_orders
  from customers
  left join customer_orders using (customer_id)
)

select * from final

You can also set configs directly inside the model with a config block at the top, e.g.:

{{ config(
    materialized='table'
) }}

Selecting

You can run (materialize) only specific models and their downstream models with:

dbt run --select dim_mycustomers+
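A few more node selection examples (standard dbt graph operators; the model and tag names here are placeholders):

dbt run --select +dim_mycustomers                # dim_mycustomers plus all upstream models
dbt run --select +dim_mycustomers+               # upstream and downstream
dbt run --select models/staging/jaffle_shop      # everything in that directory
dbt run --select tag:nightly                     # models tagged "nightly"
dbt run --exclude stg_orders                     # everything except stg_orders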

Materializations

https://docs.getdbt.com/docs/build/materializations

There are four main types of materializations: table, view, incremental, and ephemeral.

You can specify the materialization in dbt_project.yml or directly inside the model .sql files.

# The following dbt_project.yml configures a project that looks like this:
# .
# └── models
#     ├── csvs
#     │   ├── employees.sql
#     │   └── goals.sql
#     └── events
#         ├── stg_event_log.sql
#         └── stg_event_sessions.sql

name: my_project
version: 1.0.0
config-version: 2

models:
  my_project:
    events:
      # materialize all models in models/events as tables
      +materialized: table
    csvs:
      # this is redundant, and does not need to be set
      +materialized: view

e.g. inside models/events/stg_event_log.sql, the same config can be set in the model itself:

{{ config(
    materialized='table'
) }}

SELECT *
FROM ...
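The incremental materialization deserves its own sketch. This is a minimal, hypothetical model (the source and column names like event_time are made up): on the first run it builds the full table, and on later runs it only processes rows newer than what is already there.

{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    event_time,
    payload
from {{ source('app', 'event_log') }}

{% if is_incremental() %}
  -- only process rows that arrived after the latest row already in this table
  where event_time > (select max(event_time) from {{ this }})
{% endif %}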

Modularity

For a car, we don’t take a pile of metal and bang it into a car. We build or order individual parts and pieces, then put them together into a car. The same idea applies to models: each model is reusable (e.g. it can feed other downstream models). How do we apply this?

In our project, we have:

models
  stg_customers.sql
  dim_customers.sql

And then we have stg_customers.sql as:

with customers as (

  select
    id as customer_id,
    first_name,
    last_name

  from raw.jaffle_shop.customers
)

SELECT * FROM customers

Let’s also create a stg_orders.sql:

with orders as (

  select
    id as order_id,
    user_id as customer_id,
    order_date,
    status

  from raw.jaffle_shop.orders

)

SELECT * FROM orders

Now we can go back to dim_customers.sql and reference our staging models with ref():



with customers as (
  select * from {{ ref('stg_customers') }}
),

orders as (
  select * from {{ ref('stg_orders') }}

)
...

Traditional Modeling

In the past, we normalized data, optimizing to reduce data redundancy, using techniques such as:

Because storage is now cheap and compute is strong, we take a more denormalized approach to modeling, optimizing for agile and ad-hoc analytics (i.e. readability).

Model Naming Conventions

dbt recommends the following five model prefixes: sources (src), staging (stg), intermediate (int), fact (fct), and dimension (dim).

Other dirs might include:

Marts are where we deliver our final models

Example structure:

models
  staging
    jaffle_shop
      stg_customers.sql
      stg_orders.sql
  marts
    core
      dim_customers.sql
    finance
    marketing

Sources

You can reference a table directly, but that's tedious to swap out when table names change. Instead, configure the source once in a .yml file. In your models you then reference it with {{ source('source_name', 'table_name') }}, which resolves to the table defined in that yml file. Sources appear as green nodes in the lineage graph.

Example

models/staging/jaffle_shop/src_jaffle_shop.yml

We’ll create a src_ file.

version: 2

sources:
  - name: jaffle_shop
    database: raw
    schema: jaffle_shop
    tables:
      - name: customers
      - name: orders

models/staging/jaffle_shop/stg_customers.sql

select
    id as customer_id,
    first_name,
    last_name
from {{ source('jaffle_shop', 'customers') }}

models/staging/jaffle_shop/stg_orders.sql

select
    id as order_id,
    user_id as customer_id,
    order_date,
    status
from {{ source('jaffle_shop', 'orders') }}

Other examples:

with
source as (
  select * from {{ source('jaffle_shop', 'orders') }}
),
...

Freshness

You can set up a freshness check with warn_after and/or error_after.

E.g. models/staging/jaffle_shop/src_jaffle_shop.yml

Warn after 12 hours, error after 24 hours

version: 2

sources:
  - name: jaffle_shop
    database: raw
    schema: jaffle_shop
    tables:
      - name: customers
      - name: orders
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}

Run dbt source freshness to check freshness
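If you only want to check some of your sources rather than all of them, the freshness command supports node selection too, e.g.:

dbt source freshness --select source:jaffle_shop           # one source
dbt source freshness --select source:jaffle_shop.orders    # one source table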

Testing

Testing in dbt lets you run tests in development, and you can schedule tests to run in production. There are two types of tests: generic and singular.

There are four generic tests out of the box: unique, not_null, accepted_values, and relationships. You can also write your own custom generic tests (a sketch of one follows below).

You can run tests with dbt test
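Since custom generic tests were mentioned above, here is a minimal sketch of one. The test name (is_positive) and file path are my own; dbt picks up generic tests defined with a test block under tests/generic (or in macros).

tests/generic/is_positive.sql

{% test is_positive(model, column_name) %}

-- fail on any row where the column is negative
select *
from {{ model }}
where {{ column_name }} < 0

{% endtest %}

It can then be used in a .yml file like any built-in generic test, e.g. adding - is_positive under a column's tests.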

Example Generic Tests

To run generic tests only, run dbt test --select test_type:generic

models/staging/jaffle_shop/stg_jaffle_shop.yml

Run with: dbt test --select stg_customers

version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values:
                - completed
                - shipped
                - returned
                - return_pending
                - placed

Example Singular Test

To run singular tests only, run dbt test --select test_type:singular

tests/assert_positive_total_for_payments.sql

We have a .sql file in the tests directory

-- Refunds have a negative amount, so the total amount should always be >=0.
-- Therefore return records where this isn't true to make the test fail.
select
    order_id,
    sum(amount) as total_amount
from {{ ref('stg_payments') }}
group by 1
having not (total_amount >= 0)

Testing source data

We can test source data as well, similar to testing our models. Generic tests on sources go in the source .yml file, while singular tests still live in the tests directory.

models/staging/jaffle_shop/src_jaffle_shop.yml

version: 2

sources:
  - name: jaffle_shop
    database: raw
    schema: jaffle_shop
    tables:
      - name: customers
        columns:
          - name: id
            tests:
              - unique
              - not_null

      - name: orders
        columns:
          - name: id
            tests:
              - unique
              - not_null
        loaded_at_field: _etl_loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
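To run only the tests defined on sources, you can use the source: selection method, e.g.:

dbt test --select source:jaffle_shop           # all tests on the jaffle_shop source
dbt test --select source:jaffle_shop.orders    # tests on one source table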

Test workflow

If a test fails, downstream models (e.g. your marts) don't build.
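This is the behavior you get from dbt build, which runs and tests resources in DAG order, so a failed test on a staging model causes its dependent marts to be skipped. For example:

dbt build                          # seed + run + test + snapshot, in DAG order
dbt build --select +dim_customers  # build dim_customers and everything upstream of it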

Documentation

Run dbt docs generate to create documentation (and dbt docs serve to browse it locally).

Example Documentation

models/staging/jaffle_shop/stg_jaffle_shop.yml

version: 2

models:
  - name: stg_customers
    description: Staged customer data from our jaffle shop app.
    columns:
      - name: customer_id
        description: The primary key for customers.
        tests:
          - unique
          - not_null

  - name: stg_orders
    description: Staged order data from our jaffle shop app.
    columns:
      - name: order_id
        description: Primary key for orders.
        tests:
          - unique
          - not_null
      - name: status
        description: ""
        tests:
          - accepted_values:
              values:
                - completed
                - shipped
                - returned
                - placed
                - return_pending
      - name: customer_id
        description: Foreign key to stg_customers.customer_id.
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id

models/staging/jaffle_shop/jaffle_shop.md

This defines a docs block for order_status (referenced from the yml above via {{ doc('order_status') }}):

{% docs order_status %}

One of the following values:

| status         | definition                                       |
|----------------|--------------------------------------------------|
| placed         | Order placed, not yet shipped                    |
| shipped        | Order has been shipped, not yet been delivered   |
| completed      | Order has been received by customers             |
| return_pending | Customer indicated they want to return this item |
| returned       | Item has been returned                           |

{% enddocs %}