William Liu

High Level Data Processing Concepts

High Level Goal: Lower latency for data processing to get more timely data

Definitions:

  1. Bounded data = a data set that is finite in size (e.g. a daily log dump).
  2. Unbounded data = a data set that is (at least theoretically) infinite in size and never complete (e.g. a continuous stream of events).

Architectures

### Lambda Architecture

You run a streaming system alongside a batch system, both performing the same calculation; it’s essentially dual-mode execution.

Pros:

  1. You get low-latency, possibly approximate results from the streaming pipeline, which the batch pipeline later overwrites with correct ones.

Cons:

  1. You have to build, maintain, and debug two versions of the same pipeline, and then somehow merge the two sets of results at the end.
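The serving side of this dual-pipeline setup can be sketched in a few lines; `merge_views`, the view names, and the page-view numbers are all hypothetical, not from any specific framework:

```python
# Hypothetical sketch of a Lambda-style serving layer: the same page-view
# counts are computed by a slow-but-correct batch job and a fast-but-
# approximate streaming job, and queries merge the two views.

def merge_views(batch_view, speed_view):
    """Combine per-key counts: batch results cover old data; the speed
    layer fills in counts the batch job hasn't processed yet."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

batch_view = {"/home": 1000, "/about": 40}   # as of the last batch run
speed_view = {"/home": 12, "/signup": 3}     # events since that run

print(merge_views(batch_view, speed_view))
# {'/home': 1012, '/about': 40, '/signup': 3}
```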

### Kappa Architecture

Run a single pipeline using a well-designed system that’s built for the job at hand.

Requirements

You need two things:

  1. Correctness = consistent storage. Streaming systems need a way to checkpoint persistent state over time, and that state must remain consistent in light of machine failures. Strong consistency is required for exactly-once processing, which is required for correctness, which in turn is required for any system that wants to meet or exceed the capabilities of batch systems.
  2. Tools for Reasoning about Time = We need good tools for working with unbounded, unordered data of varying event-time skew.

Event Time vs Processing Time

Not all use cases care about event times (if yours doesn’t, life is easier). Most use cases do care (e.g. fraud detection, user behavior analysis). In an ideal world, event time and processing time would be equal, but that’s not real life; instead you get a skew between the two.

Windowing

To work with the infinite nature of unbounded data sets, systems typically provide some notion of ‘windowing’ the incoming data, i.e. chopping the data set up into finite pieces along temporal boundaries. We have a couple of options:

Use the processing time

Do not use the processing time if you care about correctness or if you care about the context of the event time. Your event time data may end up in the wrong processing time window (e.g. due to the lag of distributed systems), throwing correctness out the window.
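A small sketch of why this goes wrong; the timestamps and window width are made up:

```python
from datetime import datetime, timedelta

# Sketch: assigning events to five-minute windows by *processing* time.
# A delayed event (network lag, offline device) lands in a window that
# doesn't reflect when it actually happened.

def window_start(ts, width_minutes=5):
    """Truncate a timestamp down to its five-minute window boundary."""
    return ts.replace(minute=ts.minute - ts.minute % width_minutes,
                      second=0, microsecond=0)

event_time = datetime(2024, 1, 1, 11, 59)            # when it happened
processing_time = event_time + timedelta(minutes=3)  # when it arrived

print(window_start(event_time))       # 2024-01-01 11:55:00
print(window_start(processing_time))  # 2024-01-01 12:00:00 -- wrong window
```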

Use the event time

Since the data is unbounded, it can be unordered and have variable event-time skew, which creates a completeness problem for event-time windows: how do you know when you’ve seen all the data for a given window?

Data Processing Patterns

Let’s look at the core types of usage patterns across bounded and unbounded data.

Bounded Data

Processing bounded data is straightforward. We take a set of data and run it through a data processing engine (typically batch, e.g. MapReduce). We end up with a useful output.

Unbounded Data

Unbounded Data (Batch)

Batch systems have been used to process unbounded data sets. You slice up the unbounded data into a collection of separate bounded data sets that are appropriate for batch processing. Usually these events can be written into directory and file hierarchies with names that specify the window they correspond to.

Usually you’ll still encounter a completeness problem. What if a batch fails? What if there’s a delay in one batch? This may mean delaying processing until you’re sure all events have been collected.

Depending on the data you’re processing (e.g. session data, defined as a period of activity for a user), you might end up with sessions that are split across batches. You can try to reduce splits by increasing your batch size, but that increases latency. You can add logic to stitch up sessions from previous runs, but that means more complexity.
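The slicing approach amounts to routing each event into a path named after the window it belongs to; the directory layout below is illustrative:

```python
from datetime import datetime

# Sketch of slicing unbounded data for batch processing: each event is
# written under a directory named for its hour-long window, so a batch
# job can later process one directory per window.

def path_for_event(event_time, base="/data/events"):
    return f"{base}/{event_time:%Y/%m/%d/%H}/part.log"

print(path_for_event(datetime(2024, 1, 1, 13, 42)))
# /data/events/2024/01/01/13/part.log
```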

Unbounded Data (Streaming)

Streaming systems are built for Unbounded Data. You should also consider streaming systems for data that is:

  1. Highly unordered with respect to event times
  2. Of varying event-time skew (you can’t assume a fixed bound on how late data will arrive)

There are a few approaches you can use when dealing with data that has the above characteristics (see ‘Approaches’).

Approaches

Time-agnostic

Time-agnostic processing is used in cases where time is irrelevant (i.e. all relevant logic is data driven).

An example of time-agnostic processing is filtering: you look at web traffic logs for a specific domain, and let a record through only if it’s a domain you’re interested in. The processing doesn’t care that the data is unbounded or unordered, and event-time skew is irrelevant.
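A minimal sketch of such a filter, assuming simple `METHOD URL STATUS` log lines; the `INTERESTING` domain set and log contents are made up:

```python
from urllib.parse import urlparse

# Time-agnostic filtering sketch: keep only log lines whose URL is on a
# domain we care about. Nothing here depends on event order or time.

INTERESTING = {"example.com"}

def keep(log_line):
    url = log_line.split()[1]            # assumes "METHOD URL STATUS" lines
    return urlparse(url).netloc in INTERESTING

logs = [
    "GET http://example.com/a 200",
    "GET http://other.org/b 200",
]
print([line for line in logs if keep(line)])
# ['GET http://example.com/a 200']
```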

Another example of time-agnostic processing is an inner join (aka hash join), where you join two unbounded data sources. If you only care about the result of a join once an element from both sources has arrived, there’s no temporal element to the logic. You buffer a value when it’s seen in one source, then emit the joined record once the matching value from the other source arrives.

If we thought about an outer join, we now get into the data completeness problem (once you’ve seen one side of the join, how do you know whether the other side is ever going to arrive or not?). The answer is that you don’t; you have to introduce some notion of a timeout, which introduces an element of time.
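The buffering logic described above can be sketched like this; `StreamJoin` and its method names are made up:

```python
from collections import defaultdict

# Sketch of a streaming inner (hash) join: buffer each element until its
# key has been seen on both sources, then emit the joined pair. An outer
# join would additionally need a timeout to give up on a missing side,
# which is exactly where time creeps back in.

class StreamJoin:
    def __init__(self):
        self.left = defaultdict(list)
        self.right = defaultdict(list)

    def on_left(self, key, value):
        self.left[key].append(value)
        return [(value, r) for r in self.right[key]]

    def on_right(self, key, value):
        self.right[key].append(value)
        return [(l, value) for l in self.left[key]]

join = StreamJoin()
print(join.on_left("user1", "click"))     # [] -- right side not seen yet
print(join.on_right("user1", "purchase"))
# [('click', 'purchase')]
```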

Approximation

Approximation is where we take an unbounded data source and provide output that, if you squint at it, looks more or less like what you were hoping for (e.g. approximate top-N, approximate distinct counts).

Pros: They are low overhead and designed for unbounded data.

Cons: Only a limited set of them exists, the algorithms themselves are complicated, and the approximate nature of the results limits their utility.
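As a concrete instance, here is a minimal Count-Min sketch, one of the classic streaming approximation algorithms; the width/depth parameters and hashing scheme are illustrative choices:

```python
import hashlib

# Minimal Count-Min sketch: estimated per-key counts using
# O(width * depth) memory no matter how many distinct keys stream past.
# Estimates can only over-count (on hash collisions), never under-count.

class CountMin:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        for row, col in self._slots(key):
            self.table[row][col] += 1

    def estimate(self, key):
        return min(self.table[row][col] for row, col in self._slots(key))

cm = CountMin()
for _ in range(5):
    cm.add("a")
cm.add("b")
print(cm.estimate("a"))  # true count is 5; the estimate is always >= that
```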

Windowing (Overview)

Windowing means taking a data source (either unbounded or bounded) and chopping it along temporal boundaries into finite chunks for processing. There are a lot of windowing strategies, including:

Fixed Windows - windows of a fixed length (e.g. hourly), applied uniformly across the data.

Sliding Windows - a generalization of fixed windows; a fixed length and a fixed period. If the period is less than the length, the windows overlap.

Dynamic Windows - the session data from the above is an example of dynamic windows; the lengths cannot be defined a priori, since they depend on the data itself.
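The three window shapes can be sketched as pure functions over integer event timestamps (in seconds); all widths, periods, and gaps below are illustrative:

```python
# Window-assignment sketches: given an event timestamp, which window(s)
# does it belong to?

def fixed_window(ts, width=60):
    """The single fixed window of `width` seconds containing ts."""
    start = ts - ts % width
    return (start, start + width)

def sliding_windows(ts, width=60, period=30):
    """Every window of `width` seconds, starting every `period` seconds,
    that contains ts. Overlapping when period < width."""
    first = ts - ts % period
    starts = range(first, ts - width, -period)
    return [(s, s + width) for s in starts]

def session_windows(timestamps, gap=60):
    """Group sorted timestamps into sessions separated by more than `gap`
    seconds of silence; session lengths are data-driven."""
    sessions, current = [], [timestamps[0]]
    for ts in timestamps[1:]:
        if ts - current[-1] > gap:
            sessions.append((current[0], current[-1]))
            current = []
        current.append(ts)
    sessions.append((current[0], current[-1]))
    return sessions

print(fixed_window(125))                   # (120, 180)
print(sliding_windows(125))                # [(120, 180), (90, 150)]
print(session_windows([0, 30, 200, 220]))  # [(0, 30), (200, 220)]
```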

Windowing by Processing Time

The system essentially buffers up incoming data into windows until some amount of processing time has passed. E.g. with five-minute fixed windows, the system would buffer up data for five minutes of processing time, after which it would treat all the data observed in those five minutes as a window and send it downstream for processing.

Pros:

  1. It’s simple (just buffer and emit), and judging window completeness is trivial; the system knows exactly when all the input for a window has been seen.

Cons:

  1. If the data has event times attached, it must arrive in event-time order for the processing-time windows to reflect when events actually happened, and that rarely holds in practice.

An example is where a mobile app gathers usage statistics for later processing. The recorded data won’t be uploaded until the device comes online again, which means data might arrive with a skew of minutes to weeks or more.
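The buffering behavior can be sketched with an injected clock so that the five minutes pass instantly; the class and its names are hypothetical:

```python
# Sketch of processing-time windowing: buffer whatever arrives until five
# minutes of wall-clock time have passed, then emit the buffer as one
# window. The clock is injected so the example runs instantly.

class ProcessingTimeWindow:
    def __init__(self, clock, width=300):
        self.clock, self.width = clock, width
        self.window_start = clock()
        self.buffer = []

    def observe(self, element):
        """Buffer an element; return the previous window if it just closed."""
        emitted = None
        if self.clock() - self.window_start >= self.width:
            emitted = self.buffer
            self.buffer = []
            self.window_start = self.clock()
        self.buffer.append(element)
        return emitted

now = [0]
win = ProcessingTimeWindow(lambda: now[0])
win.observe("a")
win.observe("b")
now[0] = 301                 # five minutes of processing time pass
print(win.observe("c"))      # ['a', 'b'] -- the closed window is emitted
```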

Windowing by Event Time

Windowing by Event Time is the gold standard of windowing. Most data processing systems lack native support for it (although any system with a decent consistency model, like Hadoop or Spark Streaming, could be used to build a windowing system). Event-time windows have a few drawbacks due to the fact that windows often live longer (in processing time) than the actual length of the window itself, since data can keep arriving late.

Watermarks

A watermark is a notion of input completeness with respect to event times. A watermark with a value of time X makes the statement “all input data with event times less than X have been observed”. Watermarks act as a metric of progress when observing an unbounded data source with no known end.

Watermarks help answer the first half of the question “When in processing time are results materialized?”. Watermarks are temporal notions of input completeness in the event-time domain (i.e. the way the system measures progress and completeness relative to the event times of the records being processed in a stream of events).

You can look for two types of watermarks (perfect or heuristic):

  1. Perfect watermarks = built with perfect knowledge of the input, so the watermark is never wrong and no data arrives late.
  2. Heuristic watermarks = an educated estimate of input completeness; the estimate can be wrong, which means some data can arrive late (behind the watermark).

Windows are materialized as the watermark passes the end of the window.
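A sketch of that materialization rule, assuming a perfect watermark; the class and method names are hypothetical:

```python
from collections import defaultdict

# Sketch of watermark-driven materialization: as the watermark (a claim
# that all data with earlier event times has been observed) advances,
# every event-time window that ends at or before it is emitted.

class EventTimeWindows:
    def __init__(self, width=60):
        self.width = width
        self.windows = defaultdict(list)   # window start -> buffered values

    def observe(self, event_time, value):
        start = event_time - event_time % self.width
        self.windows[start].append(value)

    def advance_watermark(self, watermark):
        """Emit every window whose end has been passed by the watermark."""
        done = [s for s in self.windows if s + self.width <= watermark]
        return {s: self.windows.pop(s) for s in sorted(done)}

w = EventTimeWindows()
w.observe(10, "a")
w.observe(70, "b")
w.observe(50, "c")               # out of order, still lands in [0, 60)
print(w.advance_watermark(60))   # {0: ['a', 'c']}
print(w.advance_watermark(130))  # {60: ['b']}
```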

Triggers

A trigger is a mechanism for declaring when the output for a window should be materialized relative to some external signal.

Triggers help answer the second half of the question “When in processing time are results materialized?”. Triggers declare when output for a window should happen in processing time.
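One simple trigger policy, firing every N elements, can be sketched as follows; real systems offer many more signals (watermark progress, processing-time timers, combinations of these), and the names here are made up:

```python
# Sketch of a count-based trigger: materialize the window's contents
# every N elements, rather than waiting for the watermark.

class CountTrigger:
    def __init__(self, every=3):
        self.every = every
        self.pane = []

    def observe(self, element):
        """Buffer an element; fire (return current contents) every N."""
        self.pane.append(element)
        if len(self.pane) % self.every == 0:
            return list(self.pane)
        return None

t = CountTrigger(every=3)
fires = [t.observe(x) for x in [1, 2, 3, 4, 5, 6]]
print(fires)
# [None, None, [1, 2, 3], None, None, [1, 2, 3, 4, 5, 6]]
```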

Accumulation

An accumulation mode specifies the relationship between multiple results that are observed for the same window. These results might be completely disjoint (i.e. distinct, non-overlapping pieces of information, with no accumulation of previous results within the same window), or later results might overlap with and build upon earlier ones.
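The two common modes can be sketched side by side; the `Pane` class and mode names are illustrative:

```python
# Sketch of two accumulation modes for repeated firings of one window:
# "accumulating" panes include everything seen so far, while "discarding"
# panes contain only what arrived since the last firing.

class Pane:
    def __init__(self, mode="accumulating"):
        self.mode = mode
        self.values = []

    def observe(self, value):
        self.values.append(value)

    def fire(self):
        result = list(self.values)
        if self.mode == "discarding":
            self.values = []        # later firings start from scratch
        return result

acc, disc = Pane("accumulating"), Pane("discarding")
for p in (acc, disc):
    p.observe(1)
    p.observe(2)
print(acc.fire(), disc.fire())      # [1, 2] [1, 2]   -- first firing agrees
for p in (acc, disc):
    p.observe(3)
print(acc.fire(), disc.fire())      # [1, 2, 3] [3]   -- second firing differs
```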

Questions

We want to answer these questions about the data: