William Liu

Designing Data Intensive Applications

Chapter 1: Reliable, Scalable, and Maintainable Applications

Most applications today are data-intensive rather than compute-intensive. A data-intensive application is typically built from standard building blocks that provide commonly needed functionality, such as storing data so it can be found again later (databases), remembering the results of expensive operations (caches), allowing users to search data by keyword or filter (search indexes), sending messages to other processes to be handled asynchronously (stream processing), and periodically crunching large amounts of accumulated data (batch processing).

While these are the basic tasks, there are many different types of databases with different characteristics. In order to determine the right types of systems, we need to look at the fundamentals of what it means for a system to be reliable, scalable, and maintainable.

Types of Data Systems

Traditionally, data systems have been relational databases and message queues. Both store data for some time, but they have very different access patterns, which broadly means very different performance characteristics and implementations.

With newer tools like Redis (a datastore that can also be a message queue) and Apache Kafka (message-queue with database-like durability guarantees), the traditional categories have been blurred.

We also don't just use a single tool for all our tasks anymore. If we have additional pieces like an application-managed caching layer (e.g. Memcached) or a full-text search server (e.g. Elasticsearch or Solr), we create application code whose job is to stitch these systems together (i.e. application code is responsible for keeping those caches and indexes in sync with the main database). The service's interface or API hides the details from other parts of the system (e.g. the cache will be correctly invalidated or updated on writes, but those details don't need to be seen).
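
As a rough illustration of the kind of glue code involved, here is a minimal sketch of keeping a cache in sync with the main database on writes. The cache and db objects are hypothetical client interfaces, not any particular Memcached or database API:

# Minimal sketch of application-managed caching (cache/db are hypothetical clients).
def get_user(user_id, cache, db):
    user = cache.get(f"user:{user_id}")        # try the cache first
    if user is None:
        user = db.query_user(user_id)          # fall back to the main database
        cache.set(f"user:{user_id}", user)     # populate the cache for next time
    return user

def update_user(user_id, fields, cache, db):
    db.update_user(user_id, fields)            # write to the main database first
    cache.delete(f"user:{user_id}")            # invalidate so readers don't see stale data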

Reliability

Reliability roughly means continuing to work correctly, even when things go wrong. It can be broken down into a few expectations, including: the application performs the function the user expected; it can tolerate the user making mistakes or using the software in unexpected ways; its performance is good enough for the required use case under the expected load and data volume; and it prevents unauthorized access and abuse.

Not Reliable

So what happens when things go wrong? Some general definitions: a fault is when one component of the system deviates from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. Systems that anticipate faults and can cope with them are called fault-tolerant or resilient.

Some places like Netflix run a Chaos Monkey that deliberately introduces faults, so the system's fault tolerance is continually exercised and tested. Some faults we want to prevent rather than cure, because there is no cure (e.g. a data breach).

Hardware Faults

Hardware fails, whether it's a hard disk crashing, RAM becoming faulty, the power grid having a blackout, or a network cable being unplugged. Hard disks have a mean time to failure (MTTF) of about 10 to 50 years, so on a storage cluster with 10,000 disks we should expect, on average, about one disk to die per day.
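
A quick back-of-the-envelope check of that failure rate, using 30 years as a representative MTTF:

# Expected disk failures per day for a cluster, assuming failures are independent.
mttf_years = 30
disks = 10_000
failures_per_day = disks / (mttf_years * 365)
print(round(failures_per_day, 2))   # ~0.91, i.e. roughly one disk failure per day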

We can add redundancy to individual hardware components so that when one fails, another takes its place. Having redundant machines also has operational advantages: a server can be rebooted to apply patches without downtime (i.e. a rolling upgrade, patching one node at a time).

Software Errors

Another class of faults is a systematic error within the system. These might be a software bug that causes every instance of an application server to crash when given a particular bad input, or a runaway process that uses up some shared resource like CPU time, memory, disk space, or network bandwidth. These issues are harder to catch and there is no quick solution: you have to carefully think through the assumptions and interactions in the system, isolate processes, and measure and monitor the system's behavior in production.

Human Errors

Humans are unreliable, even with the best intentions. We instead need to approach our system designs with this in mind: design abstractions and APIs that make it easy to do the right thing; decouple the places where people make the most mistakes from the places where mistakes can cause failures (e.g. sandbox environments); test thoroughly at all levels; make it quick and easy to recover from errors (fast rollbacks, gradual rollouts); and set up detailed, clear monitoring.

Scalability

Scalability is a system's ability to cope with increased load. It is not a plain "X is scalable" or "Y does not scale"; instead, we need to describe how a system copes with growth and what resources (computing power, memory, network bandwidth, etc.) we can add to handle it.

Load

Load can be described with a few numbers called load parameters; these load parameters depend on the system architecture. You might have requests per second on a web server, ratio of reads to writes in a database, or the hit rate on a cache.

Twitter Load Example

Twitter’s two main operations are posting a tweet and reading from their home timeline.

The issue Twitter faced was fan-out: each user follows many people and each user is followed by many people. A couple of ways to handle this problem are to:

1 Posting a tweet inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up the people they follow, find all their tweets, and merge them (sorted by time).

Code Example:

SELECT tweets.*, users.* FROM tweets
  JOIN users ON tweets.sender_id = users.id
  JOIN follows ON follows.followee_id = users.id
  WHERE follows.follower_id = current_user

2 Maintain a cache for each user’s home timeline (a mailbox of tweets for each recipient user). When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. Reads to the home timeline are then cheap.
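
A minimal sketch of this second approach (fan-out on write), assuming a hypothetical get_follower_ids helper and in-process timeline caches rather than Twitter's actual infrastructure:

# Fan-out on write: push a new tweet into every follower's home timeline cache.
from collections import defaultdict, deque

timelines = defaultdict(lambda: deque(maxlen=100))   # per-user home timeline cache

def post_tweet(author_id, tweet, get_follower_ids):
    for follower_id in get_follower_ids(author_id):  # one cache write per follower
        timelines[follower_id].appendleft(tweet)

def home_timeline(user_id):
    return list(timelines[user_id])                  # reads are now a cheap cache lookup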

The tradeoffs: with the first approach, the system struggled to keep up with the load of home timeline queries because of the joins. The second approach makes reads much faster, but requires more work at write time. There was also an edge case with the second approach caused by users with very high follower counts (e.g. Justin Bieber). Since he has over 30 million followers, a single tweet results in over 30 million writes to home timelines. This is a good example of why we should look beyond the average load parameter (followers per user) and pay attention to the extreme cases, since they have an outsized influence. In the end, Twitter uses a hybrid approach depending on whether an account has a very large number of followers or not.

Describing Performance

Let's look at what happens when load increases in a system, in two ways:

  1. When you increase a load parameter and keep the same system resources (CPU, memory, bandwidth), how is the performance of the system affected?
  2. When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?

Latency and Response Time

Response Time is what the client sees: besides the actual time to process the request (the service time), it includes network delays and queueing delays.

Latency is the duration that a request is waiting to be handled, during which it is latent, awaiting service.

In reality, when clients make requests over and over, we'll get slightly different response times each time. That means we can't treat response time as a single number, but as a distribution of values that we can measure. You'll sometimes see the average (mean) response time reported, but it is not a very good metric since it hides the delays experienced by your outliers.

Instead, we want to use percentiles, e.g. the 50th, 95th, 99th, and 99.9th percentiles. You want to pay special attention to the high percentiles (tail latencies), because the customers with the slowest requests are usually the ones with the most data in their accounts (and thus often the most valuable customers). Percentiles are used in service level objectives (SLOs) and service level agreements (SLAs) that define the expected performance and availability of a service.
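
For example, computing p50/p95/p99/p99.9 from a window of measured response times with a simple sort-based approach (a real system might use something like t-digest or HdrHistogram instead):

# Naive percentile calculation over a window of response times (in milliseconds).
def percentile(sorted_times, p):
    # nearest-rank style index into the sorted list
    idx = min(len(sorted_times) - 1, int(p / 100 * len(sorted_times)))
    return sorted_times[idx]

times = sorted([22, 35, 31, 28, 120, 30, 28, 450, 33, 27])
for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(times, p)} ms")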

Since a server can only process a small number of things in parallel (e.g. limited by number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests (aka head-of-line blocking).

If you have a microservice architecture where a single end-user request fans out into multiple backend calls, the end user has to wait for the slowest of those calls to complete. Even if only a small percentage of backend calls are slow, you can end up with very high end-user wait times; this is known as tail latency amplification.
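
A worked example of tail latency amplification: if a single backend call is slow only 1% of the time, a user request that fans out to many backend calls is slow far more often:

# Probability that at least one of N parallel backend calls hits the slow 1%.
p_slow = 0.01
for n_calls in (1, 10, 100):
    p_request_slow = 1 - (1 - p_slow) ** n_calls
    print(n_calls, round(p_request_slow * 100, 1), "% of user requests affected")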

To monitor this, you can keep the response times of all requests within a rolling time window and sort that list every minute, or you can compute good approximations of the percentiles with algorithms like forward decay, t-digest, or HdrHistogram.

Coping with Load

Now that we can describe and measure load, we can try to architect for an appropriate level of load. An architecture that copes with, say, 2 or 3 times the current load is unlikely to also cope with 10 times that load, so you may need to rethink the architecture at every order-of-magnitude increase.

When coping with load, you can scale up (aka vertical scaling) by moving to a more powerful machine, or scale out (aka horizontal scaling, shared-nothing architecture) by spreading the load across multiple smaller machines. A single machine is often simpler, but high-end machines can get very expensive.

Some systems are elastic, meaning they automatically add computing resources when they detect a load increase, while other systems require manual intervention to add resources.

Distributing a stateless service across multiple machines is fairly straightforward, but taking a stateful data system from a single machine to a distributed setup can introduce a lot of additional complexity. You should consider the problems you are trying to solve before rearchitecting your system; the problem(s) might be the volume of reads, the volume of writes, the volume of data to store, the complexity of the data, the response time requirements, the access patterns, or some mix.

For example, a system designed to handle 100,000 requests per second, each 1 kB in size, looks very different from a system designed for 3 requests per minute, each 2 GB in size, even though the two systems have the same data throughput (the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size).
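
Both of those systems actually have roughly the same data throughput, which is easy to check:

# Same throughput, very different architectures.
small_requests = 100_000 * 1 * 1024            # 100k req/s * 1 kB
large_requests = 3 * 2 * 1024**3 / 60          # 3 req/min * 2 GB
print(round(small_requests / 1e6), round(large_requests / 1e6))   # both roughly 100 MB/s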

Maintainability

The majority of the cost of software isn't in its initial development, but in its ongoing maintenance: fixing bugs, investigating failures, adapting the existing system to new use cases, and adding new features. We should design software to minimize pain during maintenance by looking at three design principles in particular: operability, simplicity, and evolvability.

Operability

“Good operations can often work around the limitations of bad or incomplete software, but good software cannot run reliably with bad operations”. A good operations team keeps a software system running by:

Simplicity

Small projects have simple code, but as projects get larger, they become complex and difficult to understand. Complexity slows down everyone who needs to work on the system, increases the cost of maintenance, and increases the number of bugs introduced when changes are made. We want to remove accidental complexity, which is complexity that is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation. The usual tool for managing complexity is abstraction: hiding implementation details behind a clean, simple-to-understand facade.

Evolvability

Make change easy. Your system requirements will probably change frequently as new business requirements and new use cases emerge.

Chapter 2: Data Models and Query Languages

Data models are one of the most important parts of developing software, because they affect not only how the software is written, but also how we think about the problem that we are solving. We usually layer one data model on top of another, with each layer asking: how is it represented in terms of the next-lower layer?

  1. As an application developer, you look at the real world (e.g. people, organizations, actions) and model it in terms of objects or data structures with APIs that manipulate those data structures.
  2. When you want to store those data structures, you express them in a general-purpose data model such as JSON or XML documents, tables in a relational database, or a graph model.
  3. The engineers who built the database software decided how to represent that JSON/XML/relational/graph data in terms of bytes in memory, on disk, or on a network. The representation allows the data to be queried, searched, manipulated, and processed in various ways.
  4. On lower levels, hardware engineers represent bytes in terms of electrical currents, pulses of light, etc.

The idea is that each layer hides complexity of the layers below it by providing a clean data model.

Relational Model vs Document Model

The best-known data model today is probably that of SQL, based on the relational model: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).

NoSQL

NoSQL isn't a particular technology; the name is usually interpreted as Not Only SQL. The driving forces behind NoSQL adoption include a need for greater scalability than relational databases easily achieve (very large datasets or very high write throughput), a preference for free and open source software, specialized query operations that are not well supported by the relational model, and frustration with the restrictiveness of relational schemas (a desire for a more dynamic and expressive data model).

Since SQL and NoSQL solve different problems, these two kinds of technologies can be used together (aka polyglot persistence).

Object-Relational Mismatch

Most application development is done in object-oriented programming languages, while data is stored using the SQL data model in relational tables. The awkward translation layer needed between objects in application code and tables/rows/columns is called the impedance mismatch.

Resume Example

Let’s look at how a resume might be expressed in a relational schema.

One-to-One Relationships

For a resume, we have one-to-one fields like a user and their name.

One-to-Many Relationships

For a resume, we also have one-to-many relationships: people have had more than one job in their career, and more than one period of education or piece of contact information. We can express this in a few ways:

If we go with the JSON or XML representation, we have better locality than the multi-table schema: when we access the data, we do not need to perform multiple queries and joins across tables; instead, one query fetches all the information in one place.
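
A sketch of what that self-contained document might look like, written here as a Python dict standing in for the JSON a document database would store (field names are illustrative, loosely following the book's resume example):

# One-to-many resume data as a single self-contained document.
resume = {
    "user_id": 251,
    "first_name": "Bill",
    "last_name": "Gates",
    "region_id": "us:91",                # ID, not free text (see next section)
    "industry_id": 131,
    "positions": [                       # one-to-many: jobs
        {"job_title": "Co-chair", "organization": "Bill & Melinda Gates Foundation"},
        {"job_title": "Co-founder, Chairman", "organization": "Microsoft"},
    ],
    "education": [                       # one-to-many: education
        {"school_name": "Harvard University", "start": 1973, "end": 1975},
    ],
    "contact_info": {"twitter": "@BillGates"},
}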

Many-to-One and Many-to-Many Relationships

In a resume, free-text fields make sense to store as plain-text strings. However, for fields like region_id and industry_id, we want to use an ID instead of plain text like "Greater Seattle Area" and "Technology". Users can then choose from a drop-down list, which helps with:

You could technically store the text, but using an ID helps because the ID has no meaning to humans, so it never needs to change; anything that is meaningful to humans may need to change in the future. Removing this kind of duplicated information is the key idea behind normalization in databases.

Many-to-one relationships (many people live in one region, many people work in one industry) don't fit nicely into the document model, unlike the relational model, where it's normal to refer to rows in other tables by ID and join on them. Document databases don't need joins for one-to-many tree structures (e.g. education, jobs), and their support for joins is often weak. So what do we do? We can shift the join from the database to the application code (making multiple queries and combining the results ourselves), but then the database is no longer solving the problem for us.

The issue that document databases encounter with many-to-one relationships also applies to many-to-many relationships. If the data has many of these kinds of joins, a relational model is a better fit. However, if we want schema flexibility and better performance due to locality for one-to-many data, the document database might work better.

In a relational model you can take a document-like structure (i.e. a tree of one-to-many relationships) and split it into multiple tables (a technique sometimes called shredding), but this can lead to cumbersome schemas and unnecessarily complicated application code.

If data is very highly interconnected, the document model is not the way to go, the relational model is okay, and graph models are the most natural.

Relationship Summary

The main idea is that you want simple application code, and to accomplish this, we pick the database based on the types of relationships that exist between data items. Usually: self-contained, tree-structured one-to-many data suits the document model; many-to-one and many-to-many relationships suit the relational model; and highly interconnected data suits a graph model.

Schemas

Relational and document databases handle schemas a bit differently. Most document databases, and the JSON support in relational databases, do not enforce any schema on the data in documents. XML support in relational databases usually comes with optional schema validation.

schema-on-read vs schema-on-write

To an application developer, schema-on-read vs schema-on-write is comparable to type checking in programming languages: schema-on-read is analogous to dynamic (runtime) type checking, while schema-on-write is analogous to static (compile-time) type checking.
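
To make the comparison concrete, here is a minimal schema-on-read sketch, assuming documents that originally stored a single name field, which newer code wants split into first_name/last_name:

# Schema-on-read: the application handles old and new document shapes at read time,
# instead of running a migration up front (schema-on-write would ALTER TABLE and backfill).
def first_name(user_doc):
    if "first_name" in user_doc:                    # newer documents already have the field
        return user_doc["first_name"]
    return user_doc.get("name", "").split(" ")[0]   # older documents: derive it on read

print(first_name({"name": "Ada Lovelace"}))                         # old shape -> "Ada"
print(first_name({"first_name": "Grace", "last_name": "Hopper"}))   # new shape -> "Grace"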

So how does this affect things?

Schema changes are usually pretty quick (except for MySQL, which copies the entire table on ALTER TABLE, which can mean minutes or even hours of downtime for large tables).

So when is it good to have a schema? Schema-on-write is helpful when all records can be expected to have the same structure, while schema-on-read is advantageous when the items in the collection don't all have the same structure (e.g. heterogeneous data, or data whose structure is determined by external systems that may change).

Imperative vs Declarative Languages

We have imperative languages (most programming languages, like Python) that tell the computer to perform certain operations in a certain order; we can step through the code line by line.

On the opposite end, we have declarative languages (like SQL, CSS, or relational algebra), where you specify the pattern of the data you want but not HOW to achieve that goal. This style is more concise and easier to work with, and it hides the implementation details of the database engine, which allows for automatic optimizations and usually parallel execution.
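
For contrast, here is a "give me all the sharks" query expressed imperatively; a declarative SQL version of a similar query appears in the MapReduce section below:

# Imperative style: we spell out HOW to walk the data, in what order, step by step.
def get_sharks(animals):
    sharks = []
    for animal in animals:          # explicit iteration
        if animal["family"] == "Sharks":
            sharks.append(animal)   # explicit accumulation
    return sharks

# A declarative query (e.g. SELECT * FROM animals WHERE family = 'Sharks') only states
# the pattern of data wanted and lets the query optimizer decide how to execute it.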

MapReduce

MapReduce sits somewhere between an imperative and a declarative language: you write snippets of code that are run repeatedly by a processing framework. The code consists of a map function (aka collect) and a reduce function (aka fold, inject). An example might be counting how many sharks were sighted per month.

In a relational database like Postgres, this might look like:

SELECT date_trunc('month', observation_timestamp) AS observation_month,
SUM(num_animals) AS total_animals
FROM observations
WHERE family = 'Sharks'
GROUP BY observation_month;

Using MongoDB's MapReduce feature, the same query looks like:

db.observations.mapReduce(
    function map() {
        var year = this.observationTimestamp.getFullYear();
        var month = this.observationTimestamp.getMonth() + 1;
        emit(year + "-" + month, this.numAnimals);
    },
    function reduce(key, values) {
        return Array.sum(values);
    },
    {
        query: { family: "Sharks" },
        out: "monthlySharkReport"
    }
);

We group by a key (in this case, the year and month combined as a key like ‘2017-10’) and emit the value (number of animals in that observation).

The map and reduce functions must be pure functions, meaning they ONLY use the data passed to them as input and cannot perform additional database queries or have any side effects. This allows the database to run the functions anywhere, in any order, and rerun them on failure in a distributed execution environment on a cluster of machines.
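
The same aggregation written as plain Python with pure map and reduce steps, just to illustrate the shape of the computation (not MongoDB's actual execution model):

# Pure map and reduce steps for the monthly shark report.
from collections import defaultdict

def map_observation(obs):                      # pure: depends only on its input
    ts = obs["observationTimestamp"]
    return (f"{ts.year}-{ts.month}", obs["numAnimals"])

def reduce_counts(values):                     # pure: just sums the emitted values
    return sum(values)

def monthly_shark_report(observations):
    groups = defaultdict(list)
    for obs in observations:
        if obs["family"] == "Sharks":          # equivalent of the `query` filter
            key, value = map_observation(obs)
            groups[key].append(value)
    return {key: reduce_counts(vals) for key, vals in groups.items()}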

Graph Databases

If many-to-many relationships are very common in your data, then consider a graph database. A graph has two kinds of objects: vertices (aka nodes, entities) and edges (aka relationships, arcs). Examples are social graphs (vertices are people, edges indicate who knows whom), the web graph (vertices are web pages, edges are links), and road or rail networks (vertices are junctions, edges are the roads or railway lines between them).

Graphs are not limited to homogeneous data (where vertices and edges are all of the same kind, e.g. web pages linking to other web pages). A single graph can link people to locations, events, checkins, comments, etc.

We can structure and query graph data using the property graph model (e.g. used by Neo4J, Titan, and InfiniteGraph) or the triple-store model (e.g. used by Datomic and AllegroGraph).

I don’t have much experience with Graph databases, but I think the idea is that they are the opposite of document databases. In graph databases, anything is potentially related to everything. In document databases, the target use case is that data is self-contained documents and relationships between one document and another are rare.

Chapter 3: Storage and Retrieval

A database stores data, and when you ask for it later, it returns that data. Knowing how the storage engine works internally helps you pick the right type of engine, because there is a big difference between storage engines optimized for transactional workloads and those optimized for analytics. Having covered relational and NoSQL databases, we'll now look at log-structured and page-oriented storage engines.

First Principles Database

Let's look at the simplest possible database, built from two bash functions that form a key-value store: db_set key value stores key and value in the database, and db_get key returns the most recent value for that key. The idea is that we have a text file where each line is a key-value pair separated by a comma (ignoring escaping issues). When db_set is called again for an existing key, the old value is not overwritten; db_get simply takes the last occurrence of the key (via tail -n 1).

#!/bin/bash

db_set () {
    echo "$1,$2" >> database 
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

Example Usage

db_set 42 '{"name": "San Francisco", "attractions": ["Golden Gate Bridge"]}'

db_get 42
{"name": "San Francisco", "attractions": ["Golden Gate Bridge"]}

Performance

The db_set performance is good because appending to a file is very efficient. Many databases do something similar internally using a log, an append-only sequence of records.

The db_get performance is bad with a large number of records because we have O(n) lookup costs; if we double our records, we double our search times.

Index

If we want to efficiently find the value for a specific key in a database, we need an additional data structure called an index, with the general idea of keeping additional metadata on the side, acting as a signpost to help you locate the data you want. If you want to search data in different ways, you can have different indexes on different parts of the data.

The benefit of an index is that it speeds up some read queries, at the cost of adding overhead on writes, since every write must also update the indexes.

There are various types of indexes; the ones covered below are hash indexes, SSTables and LSM-trees, and B-trees.

Hash Indexes

Hash indexes are indexes for key-value data. Key-value stores are basically like the dictionary data type in programming languages and are implemented as a hash map/hash table.

The idea is that you have an in-memory hash map where every key is mapped to a byte offset in a data file (location where the value can be found).
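
A minimal sketch of that idea, an append-only data file plus an in-memory dict of byte offsets (nothing Bitcask-specific, just the concept; keys and values containing commas or newlines would break it, as in the bash example above):

# Append-only log with an in-memory hash index of key -> byte offset.
class HashIndexedLog:
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}                         # key -> offset of the latest value

    def set(self, key, value):
        self.f.seek(0, 2)                       # jump to the end of the file
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset                # remember where the latest value lives

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)
        line = self.f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]            # everything after the first comma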

An example storage engine with hash indexes is Bitcask (default storage engine in Riak). Bitcask has high performance reads and writes, but the requirement is that all keys fit in RAM since the hash map is kept in memory.

When to use? If the value for each key is updated very frequently and all the keys fit in memory, this storage engine is a good fit. A key might be the URL of a video and the value the number of times it has been played (incremented on every view): there are a lot of writes, but not too many distinct keys.

In our example we are only ever appending to a file, so we would eventually run out of disk space. We can break the log into segments: when a segment reaches a certain size, we close it and direct subsequent writes to a new segment. This allows compaction of the segments, meaning we throw away duplicate keys in the log and keep only the most recent value for each key. Since compaction makes segments smaller, we can also merge several segments together (into new segments).

Each segment has its own in-memory hash table mapping keys to file offsets. To find the value for a key, we check the most recent segment’s hash map (and if not present, then check the second-most recent segment, etc).
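
A sketch of compaction and merging in the same spirit: walk the segments from oldest to newest so the most recent value for each key wins, then write a new, smaller segment:

# Compact and merge log segments: keep only the most recent value per key.
def compact_segments(segment_paths, output_path):
    latest = {}
    for path in segment_paths:                  # oldest segment first, newest last
        with open(path) as segment:
            for line in segment:
                key, value = line.rstrip("\n").split(",", 1)
                latest[key] = value             # later occurrences overwrite earlier ones
    with open(output_path, "w") as out:
        for key, value in latest.items():
            out.write(f"{key},{value}\n")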

SSTables and LSM-Trees

Remember that each log-structured storage segment is a sequence of key-value pairs. These pairs appear in the order that they were written and values later in the log take precedence over values for the same key earlier in the log.

A Sorted String Table (aka SSTable) is a segment in which the sequence of key-value pairs is sorted by key. This has several advantages over segments with hash indexes: merging segments is simple and efficient (similar to mergesort), you no longer need to keep an index of all the keys in memory (a sparse in-memory index is enough), and ranges of key-value pairs can be compressed before being written to disk.

Maintaining sorted order for SSTables

So the issue is: once your data is sorted, how do you keep it sorted as new writes come in? Maintain incoming writes in an in-memory balanced tree data structure such as a red-black or AVL tree (this in-memory tree is called the memtable); with these structures you can insert keys in any order and read them back in sorted order, and when the memtable gets too big, you write it out to disk as an SSTable file.

This works well except when the database crashes, since the most recent writes are in the memtable and have not been written to disk yet. To solve this problem, we keep a separate log on disk to which every write is immediately appended; it is used only to restore the memtable after a crash and is discarded every time the memtable is written out to disk.
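
A minimal sketch of that write path, an in-memory memtable backed by an append-only recovery log, flushed to a sorted SSTable file when it gets big (using a plain dict sorted at flush time rather than a real red-black tree, and omitting reads and crash recovery):

# Memtable + write-ahead-style log, flushed to sorted SSTable segments.
import os

class MiniLSM:
    def __init__(self, dirname, memtable_limit=1000):
        self.dir = dirname
        os.makedirs(dirname, exist_ok=True)
        self.limit = memtable_limit
        self.memtable = {}
        self.segments = 0
        self.wal = open(os.path.join(dirname, "wal.log"), "a")

    def set(self, key, value):
        self.wal.write(f"{key},{value}\n")      # crash recovery: append to the log first
        self.wal.flush()
        self.memtable[key] = value              # then update the in-memory table
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        path = os.path.join(self.dir, f"segment-{self.segments}.sst")
        with open(path, "w") as sst:
            for key in sorted(self.memtable):   # SSTable: keys written in sorted order
                sst.write(f"{key},{self.memtable[key]}\n")
        self.segments += 1
        self.memtable = {}
        self.wal.truncate(0)                    # the log is no longer needed
        self.wal.seek(0)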

We can build Log-Structured Merge-Trees (aka LSM-trees) out of SSTables. Storage engines built on this concept include LevelDB, RocksDB, Cassandra, and HBase.

SSTables and Full Text Searches

Lucene (used by Elasticsearch and Solr) uses a similar method for storing its term dictionary. A full-text index is much more complex than a key-value index, but it is based on a similar idea: given a word in a search query, find all documents that mention the word. The key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list).
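
The core idea in miniature, a term dictionary mapping each word to a postings list of document IDs (real Lucene stores this dictionary in SSTable-like files and does far more):

# Tiny inverted index: term -> sorted postings list of document IDs.
from collections import defaultdict

def build_index(docs):                          # docs: {doc_id: text}
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

index = build_index({1: "the quick brown fox", 2: "the lazy dog", 3: "quick dog"})
print(index["quick"])                           # -> [1, 3]
print(index["dog"])                             # -> [2, 3]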

SSTables and Storage Engine Optimizations

B-Trees

The most widely used indexing structure is the B-Tree. They are the standard index implementation in almost all relational databases and many nonrelational databases.

B-trees keep a sorted structure on disk (as opposed to in memory). The database is broken down into fixed-size blocks (aka pages), traditionally about 4 KB in size, and reads and writes happen one page at a time.

B-trees modify pages in place, so to be resilient to crashes they need to maintain a write-ahead log (an append-only file to which every modification is written before being applied to the pages of the tree itself).

Comparing B-Trees and LSM-Trees

As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads (reads on an LSM-tree may have to check several SSTables at different stages of compaction).

Data Warehousing

In a database, a transaction is a group of reads and writes that form a logical unit. Transaction processing means allowing clients to make low-latency reads and writes, as opposed to batch processing jobs, which only run periodically (e.g. once a day).

OLTP - Transactions based

The basic access pattern for most businesses was to look up a small number of records by some key, using an index. Records are inserted or updated based on the user's input; because these applications are interactive, this access pattern became known as online transaction processing (OLTP).

For OLTP, there are two main schools of thought for storage: the log-structured school (e.g. SSTables and LSM-trees), which only ever appends to files and deletes obsolete files but never updates a file in place, and the update-in-place school (e.g. B-trees), which treats the disk as a set of fixed-size pages that can be overwritten.

OLAP - Analytics based

You can also use databases for data analytics, which has a much different access pattern than transactions. Usually an analytic query scans over a large number of records, reads a few columns per record, and calculates aggregate statistics (e.g. count, sum, average) instead of returning raw data back.

OLTP vs OLAP

PROPERTY                    OLTP                                            OLAP
Main read pattern           Small number of records per query, uses key     Aggregate over large number of records
Main write pattern          Low-latency writes from user input              Bulk import (ETL) or event stream
Primarily used by           End user/customer, via web app                  Internal Business Analyst
What data represents        Latest state of data (current point in time)    History of events that happened over time
Dataset size                GB to TB                                        TB to PB 
Bottleneck                  Usually disk seek time                          Usually disk bandwidth

When you run analytics on a separate database, that database is called your data warehouse; it is kept separate from your OLTP operations. The process of getting data into the warehouse is known as Extract-Transform-Load (ETL). Why use a separate data warehouse rather than running analytics directly on the OLTP database? Because the indexing algorithms that work well for OLTP are not very good at answering analytical queries.

Data Models

There's a wide range of data models used in transaction processing, but far fewer data models used for analytics.

Data models for analytics usually use a star schema (aka dimensional modeling).

A variation of the star schema is the snowflake schema, where dimensions are further broken down into subdimensions.

Star Schema

At the center of the schema is the fact table. Each row of the fact table represents an event that occurred at a particular time; a row might represent a page view or click (for analyzing website traffic) or a customer's purchase of a product (for retail sales).

The name star schema comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables.

Fact Tables

Usually facts are captured as individual events, allowing for maximum flexibility of analysis later.

fact_sales table

date_key | product_sk | store_sk | promotion_sk | customer_sk | quantity | net_price | discount_price   |
140102   | 31         | 3        | NULL         | NULL        | 1        | 2.49      | 2.49             |
140102   | 69         | 5        | 19           | NULL        | 3        | 14.99     | 9.99             |
140102   | 74         | 3        | 23           | 191         | 1        | 4.99      | 3.89             |
140102   | 33         | 8        | NULL         | 235         | 4        | 0.99      | 0.99             |

Most of your data is in fact tables. Some of the columns in the fact table are attributes, but other columns are foreign key references to other tables (dimension tables).

Dimension Tables

Dimension tables represent the who, what, where, when, how, and why of the event.

dim_store table

store_sk | state | city
1        | WA    | Seattle
2        | CA    | San Francisco
3        | CA    | Palo Alto

dim_product table

product_sk  |  sku      | description   | brand     | category     |
30          | OK4012    | Bananas       | Freshmax  | Fresh Fruit  |
31          | KA9511    | Fish food     | Aquatech  | Pet supplies |
32          | AB1234    | Croissant     | Dealicious| Bakery       |

dim_date table

date_key    | year  |   month   | day   | weekday   | is_holiday
140101      | 2014  |   jan     | 1     | wed       | yes
140102      | 2014  |   jan     | 2     | thu       | no
140103      | 2014  |   jan     | 3     | fri       | no

dim_customer table

customer_sk |   name    | date_of_birth
190         | Alice     | 1979-03-29    
191         | Bob       | 1961-09-02
192         | Cecil     | 1991-12-13

dim_promotion table

promotion_sk    |   name        |   ad_type     | coupon_type
18              | New Year Sale | Poster        | NULL
19              | Aquarium deal | Direct mail   | Leaflet
20              | Coffee Bundle | In-store sign | NULL

Snowflake Schema

Snowflake schema is where dimensions are further broken down into subdimensions. The idea behind the snowflake schema is that the dimensions are normalized into multiple related tables (instead of star schema’s dimensions being denormalized with each dimension represented by a single table).

Snowflake schemas are more normalized than star schemas; the principle behind snowflaking is to normalize the dimension tables by moving low-cardinality attributes out into separate tables. Star schemas are often preferred anyway because they are simpler for analysts to work with.

Row vs Column oriented storage

Once your fact tables become large (trillions of rows, hundreds of columns wide), you'll find that a typical data warehouse query only accesses about 5 columns at a time.

In most OLTP databases, data is laid out in a row-oriented fashion. Even with indexes, you’ll still be loading a lot of extra information with a row-oriented storage engine.

The idea behind column-oriented storage is that you don't store all the values from one row together; instead, you store all the values from each column together. An example of a column-oriented storage format is Parquet. Column-oriented storage also compresses well, with bitmap encoding being particularly effective in data warehousing; when most of the bits in the bitmaps are zeros, we say they are sparse (and they compress very well). We get faster reads, but writes are more difficult: if you want to insert a row in the middle of a sorted table, you most likely have to rewrite all the column files.
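
A small sketch of bitmap encoding for a single column (think of the product_sk column from the fact table later in these notes): one bitmap per distinct value, with a 1 wherever a row has that value:

# Bitmap encoding of a single column with few distinct values.
column = [69, 69, 69, 74, 31, 31, 31, 74, 69, 74]    # e.g. product_sk values, row by row

bitmaps = {value: [1 if v == value else 0 for v in column]
           for value in set(column)}

print(bitmaps[74])        # -> [0, 0, 0, 1, 0, 0, 0, 1, 0, 1]
# A query like WHERE product_sk IN (31, 74) becomes a bitwise OR of two bitmaps;
# sparse bitmaps (mostly zeros) compress well, e.g. with run-length encoding.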

Aggregation with Materialized Views

Another data warehouse optimization is materialized aggregates: instead of crunching through the raw data every time, we cache some of the counts or sums that queries use most often by creating a materialized view. In a relational data model it is often defined like a standard (virtual) view, a table-like object whose contents are the results of some query. The difference is that a materialized view is an actual copy of the query results, written to disk.

Data Cube (aka OLAP cube)

A common special case of a materialized view is known as a data cube or OLAP cube. It is a grid of aggregates grouped by different dimensions.

Imagine that each fact has foreign keys to only two dimension tables (date and product). Each cell contains the aggregate (e.g. SUM) of an attribute (e.g. net_price) of all facts with that date-product combination. You can then apply the same aggregate along each row or column to get a summary reduced by one dimension (e.g. sales by product regardless of date, or sales by date regardless of product).

In general, you often have more than two dimensions. With, say, five dimensions (date, product, store, promotion, and customer), it's harder to visualize, but the idea is still the same: each cell contains the sales for a particular date-product-store-promotion-customer combination.
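
A toy two-dimensional data cube over a few fact rows like the ones above (date x product, summing net_price), just to show the shape of the precomputed grid:

# Two-dimensional data cube: SUM(net_price) grouped by (date_key, product_sk).
from collections import defaultdict

facts = [  # (date_key, product_sk, net_price) -- a few illustrative rows
    (140102, 31, 2.49), (140102, 69, 14.99), (140102, 74, 4.99), (140103, 31, 2.49),
]

cube = defaultdict(float)
for date_key, product_sk, net_price in facts:
    cube[(date_key, product_sk)] += net_price

# Summing along one dimension collapses the cube, e.g. total sales per date:
sales_by_date = defaultdict(float)
for (date_key, _), total in cube.items():
    sales_by_date[date_key] += total
print({k: round(v, 2) for k, v in sales_by_date.items()})   # -> {140102: 22.47, 140103: 2.49}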

The advantage of a materialized data cube is that certain queries become very fast because they have effectively been precomputed. The disadvantage is that a data cube doesn't have the same flexibility as querying the raw data.

Most data warehouses therefore keep as much raw data as possible and use aggregates such as data cubes only as a performance boost for certain queries.