William Liu

Data Engineering

Overview

1.) Dimensional Data Modeling

2.) Slowly Changing Dimensions and Idempotent Queries in Iceberg

3.) Conceptual Data Modeling

1.) Dimensional Data Modeling

Dimensions are attributes of an entity (e.g. user’s birthday, user’s favorite food)

Dimensions are either:

- Fixed (e.g. a user's birthday), or
- Slowly changing over time (e.g. a user's favorite food)

Knowing your consumer

There are different types of consumers. Meet people where they are (not everyone wants to learn).

OLTP vs Master data vs OLAP

There are three ways you can model your data:

- OLTP: optimized for low-latency, low-volume transactional queries (e.g. app databases)
- Master Data: the complete, deduped definition of an entity; sits between the other two
- OLAP: optimized for large-volume analytical queries (e.g. GROUP BYs across many rows)


Cumulative Table Design

You want to track dimensions over time to hold onto history (excluding users who ask to be deleted)

Cumulative Table vs Snapshot Dimensions

Daily Snapshot (name, is_active)

2023-01-01, Will, False
2023-01-02, Will, True
2023-01-03, Will, True
2023-01-04, Will, True
2023-01-05, Will, False

Rolling Snapshots (name, is_active_array)

2023-01-01, Will, [False]
2023-01-02, Will, [True, False]
2023-01-03, Will, [True, True, False]
2023-01-04, Will, [True, True, True, False]
2023-01-05, Will, [False, True, True, True, False]

You normally don't want this array to grow forever; limit it to a window (e.g. the last 30 days for an is_monthly_active flag)

Cumulative Table Design

  1. Yesterday + Today ->

  2. FULL OUTER JOIN:
     - COALESCE ids and unchanging dimensions
     - Compute cumulation metrics (e.g. days since x)
     - Combine arrays and changing values

  3. -> Cumulated Output

With the array, you can check whether a user was active in the last 90 days without scanning 90 daily partitions; see the sketch below.
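A minimal sketch of that pattern in Trino-style SQL (to match the DDL used later); the users_cumulated / users_daily tables and all column names here are hypothetical:

-- Yesterday's cumulated output + today's snapshot -> new cumulated output.
-- Table and column names are hypothetical.
WITH yesterday AS (
    SELECT * FROM users_cumulated WHERE snapshot_date = DATE '2023-01-04'
),
today AS (
    SELECT * FROM users_daily WHERE snapshot_date = DATE '2023-01-05'
)
SELECT
    -- COALESCE ids and unchanging dimensions
    COALESCE(t.user_id, y.user_id) AS user_id,
    -- Combine arrays: prepend today's value, cap the history at 30 days
    SLICE(
        ARRAY[COALESCE(t.is_active, FALSE)]
            || COALESCE(y.is_active_array, CAST(ARRAY[] AS ARRAY(BOOLEAN))),
        1, 30
    ) AS is_active_array,
    -- Compute cumulation metrics (e.g. days since last active)
    CASE WHEN t.is_active THEN 0
         ELSE COALESCE(y.days_since_active, 0) + 1
    END AS days_since_active,
    DATE '2023-01-05' AS snapshot_date
FROM today t
FULL OUTER JOIN yesterday y ON t.user_id = y.user_id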

Strengths:

- History lives in one row, so questions like "active in the last 90 days" don't require scanning the full history
- Enables "transition analysis" (see below) without self-joins

Drawbacks:

- Backfills must run sequentially, since each day depends on the previous day's cumulated output
- Handling deletion requests (e.g. users who ask to be deleted) means rewriting history

Can do "transition analysis" since all the history is rammed into one row, e.g. definitions of active status (sketch below):

- New: active today, no history yesterday
- Retained: active yesterday and today
- Churned: active yesterday, inactive today
- Resurrected: inactive yesterday, active today
- Stale: inactive yesterday and today
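A sketch of that transition analysis off the cumulative table (same hypothetical names as above; Trino arrays are 1-indexed, and element_at returns NULL past the end of the array):

-- is_active_array[1] is today; element_at(..., 2) is yesterday
-- (NULL when the history is only one day long, i.e. a new user).
SELECT
    user_id,
    CASE
        WHEN is_active_array[1]
             AND NOT COALESCE(element_at(is_active_array, 2), FALSE)
            THEN 'new or resurrected'
        WHEN NOT is_active_array[1]
             AND COALESCE(element_at(is_active_array, 2), FALSE)
            THEN 'churned'
        WHEN is_active_array[1] THEN 'retained'
        ELSE 'stale'
    END AS activity_state
FROM users_cumulated
WHERE snapshot_date = DATE '2023-01-05'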

Compactness vs Usability Tradeoff

When would you use each type of table?

Struct vs Array vs Map

- Struct: fixed set of named fields; each field can have its own type
- Array: ordered values of a single type; good for time-ordered history (like is_active_array)
- Map: flexible set of keys, but all values share one type

Temporal Cardinality Explosions of Dimensions

Badness of Denormalized Temporal Dimensions

If you explode it out and need to join other dimensions, Spark shuffle will ruin your compression!

Run-length Encoding Compression

Probably the most important compression technique in big data right now

When a value repeats, the format stores it once along with a run length ("this value repeats 5 times") instead of storing every copy. Long runs of duplicate values compress down to almost nothing in Parquet.

After a join, Spark (or any distributed compute engine) may shuffle the rows and destroy the sorted order that created those runs. In the end, keeping the data at the listing level with an array of nights is more efficient, since downstream data engineers or users can join that compact data themselves. If your downstream consumers produce datasets from the exploded form, the shuffle will wreck the compression.
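A sketch of the compact form, with a hypothetical exploded listing_nights table; ARRAY_AGG collapses it back to one row per listing so any joins happen on the compact rows:

-- Collapse exploded listing/night rows into one row per listing
-- (listing_nights and its columns are hypothetical).
SELECT
    listing_id,
    ARRAY_AGG(night ORDER BY night) AS nights
FROM listing_nights
GROUP BY listing_id

If the exploded form must be written out anyway, sorting on the low-cardinality columns right before the write restores the long runs that run-length encoding needs.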

2.) Slowly Changing Dimensions and Idempotent Queries in Iceberg

Create a Slowly Changing Dimension table that is idempotent (same result no matter how many times you execute it).

Idempotent Pipelines are critical

Idempotent means your pipelines produce the same results regardless of when they're run.

When a pipeline is not idempotent, it's difficult to work with because:

- Backfills produce different results than the original runs, so old and restated data silently disagree
- Bugs are hard to reproduce, and unit tests can't replicate production behavior

What can make a pipeline not idempotent (see the sketch below):

- INSERT INTO without clearing the target first (appends duplicates on re-runs); prefer MERGE or overwriting the target window
- Unbounded date filters (WHERE date > x with no end bound), so results depend on when the query runs
- Running before all input partitions have landed
- Depending on the "latest" state of another table instead of a dated snapshot
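A sketch of the first two pitfalls and one fix, using hypothetical daily_metrics / events tables (DELETE-then-INSERT is one Trino/Iceberg-friendly way to overwrite a day; MERGE works too):

-- Non-idempotent: INSERT INTO appends duplicates on re-runs, and the
-- unbounded date filter returns more rows the later the query runs.
INSERT INTO daily_metrics
SELECT event_date, COUNT(*) AS event_count
FROM events
WHERE event_date > DATE '2023-01-01'
GROUP BY event_date;

-- Idempotent: bounded window, and the target day is cleared first,
-- so re-running for the same day always produces the same rows.
DELETE FROM daily_metrics WHERE event_date = DATE '2023-01-05';

INSERT INTO daily_metrics
SELECT event_date, COUNT(*) AS event_count
FROM events
WHERE event_date = DATE '2023-01-05'
GROUP BY event_date;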

Should you model as Slowly Changing Dimensions?

Why do dimensions change?

How can you model Dimensions that change?

Types of Slowly Changing Dimensions:

- Type 0: the value never changes (e.g. birthday)
- Type 1: overwrite with the latest value; no history kept
- Type 2: one row per value with start/end dates (and often a current flag); full history kept
- Type 3: keep only "original" and "current" columns; intermediate history is lost

See below for analytical workloads:

Summary:

- Types 0 and 2 are idempotent-friendly: history is never overwritten, so backfills reproduce the same results
- Types 1 and 3 overwrite history, so pipelines built on them are not idempotent

For transactional workloads, the other SCD types are still useful (they favor low latency). The recommendations above are just for analytical workloads.

SCD Type 2 Loading

You have a couple options:

- Rebuild the entire SCD table in one query over the full history (the window-function approach shown below)
- Load it incrementally, processing only the new day on top of the previous SCD output (cumulative-style)

Code

CREATE TABLE williamliu.nba_player_scd (
    player_name VARCHAR,
    is_active BOOLEAN,
    start_season INTEGER,
    end_season INTEGER,
    current_season INTEGER
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY ['current_season']
)

LAG function

Load data that has a slowly changing dimension: track user activity changes using the LAG function. LAG is used to implement Type 2 SCDs, which track historical changes by adding a new row for each change. LAG lets you access data from the previous row within the same result set, based on a defined order.

LAG(column_name, offset, default_value) OVER (PARTITION BY partition_column ORDER BY order_column)

For example, you want to track when a Customer dimension changes its Address.

----- Original Table
CustomerID | Name       | Address     | UpdateDate
1          | John Doe   | 123 Elm St  | 2024-01-01
1          | John Doe   | 456 Oak St  | 2024-06-01
2          | Jane Smith | 789 Pine St | 2024-03-01
-----

WITH Changes AS (
    SELECT
        CustomerID,
        Name,
        Address,
        UpdateDate,
        LAG(Address) OVER (PARTITION BY CustomerID ORDER BY UpdateDate) AS PreviousAddress
    FROM
        Customer
)
SELECT
    CustomerID,
    Name,
    Address,
    UpdateDate,
    CASE
        WHEN Address != PreviousAddress OR PreviousAddress IS NULL THEN 'New'
        ELSE 'No Change'
    END AS ChangeFlag
FROM
    Changes;

----- New Result
CustomerID | Name       | Address     | UpdateDate | ChangeFlag
1          | John Doe   | 123 Elm St  | 2024-01-01 | New
1          | John Doe   | 456 Oak St  | 2024-06-01 | New
2          | Jane Smith | 789 Pine St | 2024-03-01 | New
-----

Note: with a Type 2 SCD, you want Start and End dates.

WITH Changes AS (
    SELECT
        CustomerID,
        Name,
        Address,
        UpdateDate,
        LAG(Address) OVER (PARTITION BY CustomerID ORDER BY UpdateDate) AS PreviousAddress
    FROM
        Customer
)
SELECT
    CustomerID,
    Name,
    Address,
    UpdateDate AS StartDate,
    -- each version is valid from its own UpdateDate until the next change
    -- (a NULL EndDate means it is still current)
    LEAD(UpdateDate) OVER (PARTITION BY CustomerID ORDER BY UpdateDate) AS EndDate
FROM
    Changes
WHERE
    Address != PreviousAddress OR PreviousAddress IS NULL;


----- Type 2 SCD with 1.) Start and End Date and 2.) current flag
WITH Changes AS (
    SELECT
        CustomerID,
        Name,
        Address,
        UpdateDate,
        LAG(Address) OVER (PARTITION BY CustomerID ORDER BY UpdateDate) AS PreviousAddress,
        ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY UpdateDate DESC) AS RowNumber
    FROM
        Customer
)
SELECT
    CustomerID,
    Name,
    Address,
    UpdateDate AS StartDate,
    LEAD(UpdateDate) OVER (PARTITION BY CustomerID ORDER BY UpdateDate) AS EndDate,
    CASE WHEN RowNumber = 1 THEN 'Y' ELSE 'N' END AS CurrentFlag
FROM
    Changes;

----- With Current Flag
CustomerID | Name       | Address     | StartDate  | EndDate    | CurrentFlag
1          | John Doe   | 123 Elm St  | 2024-01-01 | 2024-06-01 | N
1          | John Doe   | 456 Oak St  | 2024-06-01 | NULL       | Y
2          | Jane Smith | 789 Pine St | 2024-03-01 | NULL       | Y
-----

Example LAG function:

SELECT
  player_name,
  is_active,
  LAG(is_active, 1) OVER (PARTITION BY player_name ORDER BY current_season) AS is_active_last_season,
  current_season
FROM williamliu.nba_players

-- PLAYER_NAME  IS_ACTIVE   IS_ACTIVE_LAST_SEASON   CURRENT_SEASON
-- Antoine Carr    true    null    1996
-- Antoine Carr    true    true    1997
-- Antonio McDyess true    null    1996
WITH lagged AS (
SELECT
  player_name,
  CASE WHEN is_active THEN 1 ELSE 0 END AS is_active,
  CASE WHEN LAG(is_active, 1) OVER (PARTITION BY player_name ORDER BY current_season) THEN 1 ELSE 0 END AS is_active_last_season,
  current_season
FROM williamliu.nba_players
)

SELECT
  *,
  CASE WHEN is_active <> is_active_last_season THEN 1 ELSE 0 END AS did_change
FROM lagged

----
--PLAYER_NAME IS_ACTIVE   IS_ACTIVE_LAST_SEASON   CURRENT_SEASON  DID_CHANGE
--Ben Wallace 1   0   1996    1
--Ben Wallace 1   1   1997    0
--Boban Marjanovic    1   0   2020    1

Rolling Sum

WITH lagged AS (
  SELECT
    player_name,
    CASE WHEN is_active THEN 1 ELSE 0 END AS is_active,
    CASE WHEN LAG(is_active, 1) OVER (
      PARTITION BY player_name ORDER BY current_season)
    THEN 1 ELSE 0 END AS is_active_last_season,
    current_season
  FROM williamliu.nba_players
),
streaked AS (
  SELECT
    *,
    SUM(
      CASE WHEN is_active <> is_active_last_season THEN 1 ELSE 0 END
    ) OVER (
      PARTITION BY
        player_name
      ORDER BY
        current_season
    ) AS streak_identifier
  FROM lagged
)
SELECT
  player_name,
  streak_identifier,
  MAX(is_active) AS is_active,
  MIN(current_season) AS start_season,
  MAX(current_season) AS end_season
FROM streaked
GROUP BY player_name, streak_identifier
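
To load the williamliu.nba_player_scd table created earlier, the streak query's output maps directly onto its columns. A sketch, assuming the query above has been saved as a view named nba_player_streaks (hypothetical) and that 2023 is the season the snapshot is taken as of:

INSERT INTO williamliu.nba_player_scd
SELECT
  player_name,
  is_active = 1 AS is_active,   -- the streak query emits 0/1
  start_season,
  end_season,
  2023 AS current_season        -- the table's partition column
FROM nba_player_streaks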


3.) Conceptual Data Modeling

Understand stakeholders' true requirements and use that to choose the right solution. In order to write code that lasts, you need to get clarity from stakeholders.

What is Conceptual Data Modeling

How to effectively brainstorm a pipeline

When pushing back be mindful

We design systems for needs, not requirements

Systems with constraints are much more likely to be successfully delivered. You need to be able to determine what you are and are not doing.

You Aren't Gonna Need It (YAGNI): the less scalable solution (with constraints) was preferred because it required much less maintenance.

Be intentional about what the system can or cannot answer. Your data system can’t answer everything. Everything is a tradeoff (e.g. IP address is out of scope, would need to build a different system). You can’t answer everyone’s questions with the same system.

Brainstorm isn’t always right

Work backwards from Metrics

Who should brainstorm with you?

The pain when this goes wrong

Conceptual Data Modeling through Lucidchart

For example, living your best life has buckets:

Questions:

Creating Dimensions from your Data Sources:

How to analyze:

Think about what you're NOT doing. It's not a "no", just taken out of scope for now.