William Liu

Data Exchange Formats (Avro, Thrift, Protocol Buffers)

There are a lot of Data Exchange formats, each with its own use-case. Some include:

Things to consider:

E.g. JSON vs Binary; Binary is very fast time and space, but hard to develop with because it is error prone

If you’re looking for a cross-language serialization of data using a schema (and requires schema evolution), but not sure which one to use, then here’s a comparison.

schema evolution

If/When you change a schema, you’ll have producers and consumers with different versions at the same time. Schema evolution allows your producers and consumers to continue to work across schemas. Some concepts for schema evolution involve forward and backward compatibility.

schema evolution scenarios

Scenarios for forward and backward compatibility are:

Scenario: No change in fields - producer (client) sends a message to a consumer (server) - all good e.g. MyMsg MyMsg user_id: 123 user_id: 123 amount: 1000 amount: 1000

Scenario: Added field - producer (old client) sends an old message to a consumer (new server); new server recognizes that the field is not set, and implements default behavior for out-of-date requests - all good e.g. MyMsg MyMsg user_id: 123 user_id: 123 amount: 1000 amount: 1000 time: 15

Scenario: Added field - producer (new client) sends a new message to a consumer (old server); old server simply ignores it and processes as normal - all good e.g. MyMsg MyMsg user_id: 123 user_id: 123 amount: 1000 amount: 1000 time: 15

JSON

Slow (e.g. store field names over and over) and verbose, but very easy to use. Does not have features like differentiating integers from floating point (i.e. inconsistent types). Flexible format is good and bad. Not good if you want a schema and some documentation. No graceful schema evolution.

Example

{
    "userName": "Will",
    "favoriteNumber": 13,
    "interests": ["programming", "climbing"]
}

Protocol Buffer (Protobuf)

Designed by Google and open-sourced since 2008, officially supports generated code in: C++, Java, Python, and Objective-C The proto3 language version also works with Dart, Go, Ruby, and C#

Example

syntax = "proto3"

message Person {
    string query = 1;
    int32 page_number = 2;
    int32 result_per_age = 3;
}

So how does this work? We define IDL Rules:

Message Types can be:

Things to watch out for:

Avro

Things to watch out for:

Thrift

Open sourced by Facebook in 2007. Thrift makes RPC a first class citizen (unlike Protobuf).

Bigger than Avro or Protocol Buffers, but is not just a data serialization library; it’s an entire RPC framework. Instead of using a single binary encoding (like Avro and Protobuf), Thrift has a whole variety of different serialization formats (called protocols).

Typical Operation Model

The typical model of Thrift/Protobuf is:

gRPC

In gRPC, a client application can directly call methods on a server application on a different machine as if it was a local object, making it easier to create distributed applications and services.

The idea is to define a service, specify the methods that can be called remotely with their parameters and return types. On the server side, the server implements this interface and runs a gRPC server to handle client calls. On the client side, the client has a stub that provides the same methods as the server.

Basically, a gRPC client (e.g. say Ruby Client) makes a Proto Request to a gRPC Server, then gets a Proto Response.

Interface Definition Language (IDL)