
Protocol buffers may be the next data serialization game changer

From the messages you send to the programs you design, data serialization underpins reliable communication and lets an object's state persist across independent architectures. JSON is a common everyday example. However, it becomes expensive when the volume of data is huge. Protobuf is a strong solution to such problems, and in this article we look at what data serialization is, the standard methods, and how Protobuf addresses their drawbacks.

Karthik Kamalakannan / 05 July, 2018

Raw data on its own is as good as no data at all. The only way to make sense of data is to put it into a legible format such as a container, file format or data structure. However, as the size of the data increases, the memory overhead becomes a performance bottleneck; in such cases an efficient mechanism is to store the data as a map of byte arrays. When using byte arrays, an optimized serialization mechanism is critical to actually reducing memory consumption. The byte array is opaque to the core system.

Serialization is the conversion of an object to a series of bytes so that the object can be easily saved to persistent storage or streamed across a communication link. The byte stream can then be deserialized - converted into a replica of the original object. We need a serialization scheme which is deterministic across executions of a function, across platforms, and across versions of the serialization framework.
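As a minimal sketch of that round trip, the snippet below uses Python's standard json module rather than Protobuf. The Account dataclass and the serialize/deserialize helpers are hypothetical names chosen only to mirror the account example used later in this article.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Account:
    # Hypothetical record, used only for illustration.
    public_key: str
    acc_no: int
    acc_name: str


def serialize(account: Account) -> bytes:
    """Convert the object into a series of bytes (UTF-8 encoded JSON)."""
    return json.dumps(asdict(account)).encode("utf-8")


def deserialize(data: bytes) -> Account:
    """Rebuild a replica of the original object from the byte stream."""
    return Account(**json.loads(data.decode("utf-8")))


original = Account(public_key="3123", acc_no=42, acc_name="alice")
payload = serialize(original)    # bytes, ready to be stored or streamed
replica = deserialize(payload)   # equal to the original, but a separate object
```

The replica compares equal to the original yet is a distinct object, which is what lets the bytes travel to another process or machine.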

Data structures that don't enforce an ordered serialization (e.g. sets, maps, dicts) should be avoided. The requirement is to consistently produce the same byte array across space and time. This is essential when byte arrays are interlinked to form a tree-like structure.
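The sketch below illustrates why ordering matters. It assumes the "interlinking" is done by hashing each byte array, which is only one possible reading. The canonical_bytes helper is a hypothetical name, and JSON with sorted keys stands in for whatever deterministic scheme a real system would use.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so that equal
    records always produce identical bytes."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Same content, different insertion order.
a = {"acc_no": 42, "acc_name": "alice"}
b = {"acc_name": "alice", "acc_no": 42}

# Naive serialization: insertion order leaks into the bytes.
same_naive = json.dumps(a).encode("utf-8") == json.dumps(b).encode("utf-8")

# Canonical serialization: the hashes of both records agree.
digest_a = hashlib.sha256(canonical_bytes(a)).hexdigest()
digest_b = hashlib.sha256(canonical_bytes(b)).hexdigest()
same_canonical = digest_a == digest_b
```

Here same_naive is False while same_canonical is True: two nodes holding the same data only hash to the same value when the serialization is deterministic.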

So, what has Protocol buffer got to do with this?

"In the simplest sense, Protocol buffers (Protobufs) are a way to encode structured data in an efficient yet extensible format."

Protobufs are Google's language-independent, platform-independent method of serializing structured data. You define how you want your data to be structured once, then you can use the specially generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

Great, how to use them?

We specify how the information should be structured when serialized by defining protocol buffer message types in .proto files. Each protocol buffer message is a logical record of information containing a set of name-value pairs. A basic example of a .proto file for an account is shown below.

syntax = "proto3";

// A single account record.
message Account
{
    string public_key = 1;
    int32 acc_no = 2;
    string acc_name = 3;
    string acc_email = 4;
}

// A collection of Account records.
message AccountContainer
{
    repeated Account entries = 1;
}

The format is visibly simple: each message has one or more uniquely named fields. Each field has a name, a field number and a value type. The value type can be an integer, floating-point, boolean, string, bytes or even another message type, which lets you build a hierarchy of data.

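Field rules like required and optional can be used for data validation, while repeated can be used for a collection of similar data [Note: required is only available in the older proto2 syntax; the proto3 syntax used in the example above dropped it, although it still appears in the proto2 documentation]. The number assigned to each field identifies that field in the serialized binary data, so it must be unique within a message and should not change once the message type is in use.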
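To make the role of the field number concrete, here is a minimal sketch of how the binary wire format combines a field number and a wire type into a tag, followed by the value. It covers only non-negative integers and strings. The helper names are hypothetical, and this is not the code that the Protobuf compiler generates.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Tag (field number with wire type 0) followed by the varint value."""
    tag = (field_number << 3) | 0
    return encode_varint(tag) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Tag (field number with wire type 2), byte length, then UTF-8 bytes."""
    raw = text.encode("utf-8")
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(raw)) + raw


# In the Account message, acc_no is field 2 and acc_name is field 3.
acc_no_bytes = encode_int_field(2, 150)          # b"\x10\x96\x01"
acc_name_bytes = encode_string_field(3, "bob")   # b"\x1a\x03bob"
```

Note that the field name never appears in the output; only the number does, which is why renumbering a field breaks compatibility with data that is already serialized.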

Once defined, the .proto file must be compiled with the protocol buffer compiler (protoc) for the preferred language to generate the data access classes. These include functions to serialize the structure to raw bytes and to parse it back. For example, if the language chosen is Python, the compiler generates account_pb2.py, which is imported into the application wherever retrieval or serialization is needed. The process of serializing looks somewhat like this:

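import account_pb2  # generated from the .proto file by protoc

# acc_name, acc_no, acc_email and address are supplied by the application.
container = account_pb2.AccountContainer()
account = container.entries.add()  # append a new Account to the container
account.public_key = "3123"
account.acc_name = acc_name
account.acc_no = acc_no
account.acc_email = acc_email

# Store the serialized bytes under the entry's address.
state_entries_send = {}
state_entries_send[address] = container.SerializeToString()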

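Similarly, the data is retrieved by:

entry = someFunction()  # returns an object whose .data holds the stored bytes
container = account_pb2.AccountContainer()
container.ParseFromString(entry.data)  # fills container.entries from the bytes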
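The real parser in the Protobuf library handles far more, but the core idea of reading a message back can be sketched with the standard library alone. The functions below are hypothetical and handle only varint and length-delimited fields. The sample bytes are hand-built under the assumption that public_key is field 1 and acc_no is field 2, as in the Account message.

```python
from typing import Dict, Tuple, Union


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def parse_fields(data: bytes) -> Dict[int, Union[int, bytes]]:
    """Map each field number to its decoded value."""
    fields: Dict[int, Union[int, bytes]] = {}
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited (strings, bytes, messages)
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        fields[field_number] = value
    return fields


# An Account with public_key="3123" (field 1) and acc_no=150 (field 2).
raw = b"\x0a\x043123\x10\x96\x01"
parsed = parse_fields(raw)
```

The parser recovers field numbers and raw values only; mapping field 1 back to the name public_key requires the .proto definition, which is exactly what the generated classes provide.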

But why not just use JSON or XML?

Protocol buffers have many advantages over JSON or XML for serializing structured data.

About 9% smaller than JSON.
Easy to use and can define complex logic.
About 5 times faster.
Clearer structural definition.
Generated data is easier to use programmatically.
Highly scalable even with massive data.
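The figures above are the article's own. The sketch below is neither Protobuf nor a benchmark; it only illustrates, with a hand-rolled binary layout built on Python's struct module, where the size savings of a binary format come from: field names and punctuation are not repeated in every record. The record contents and the pack_string helper are hypothetical.

```python
import json
import struct

record = {
    "public_key": "3123",
    "acc_no": 42,
    "acc_name": "alice",
    "acc_email": "alice@example.com",
}

# Text encoding: every record carries its field names and punctuation.
json_bytes = json.dumps(record).encode("utf-8")


def pack_string(text: str) -> bytes:
    """One-byte length prefix (so at most 255 bytes) followed by UTF-8 data."""
    raw = text.encode("utf-8")
    return struct.pack("<B", len(raw)) + raw


# Binary encoding: the field order is fixed by the schema, so no names are sent.
binary_bytes = (
    pack_string(record["public_key"])
    + struct.pack("<i", record["acc_no"])  # 4-byte little-endian integer
    + pack_string(record["acc_name"])
    + pack_string(record["acc_email"])
)

sizes = (len(json_bytes), len(binary_bytes))
```

For this record the binary form takes 33 bytes, well under the JSON form; the actual ratio for Protobuf depends on the data and should be measured on a real workload.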

Excited to get started?

Before embedding this in your latest project, learn how it stores and exchanges data. Understand the options and decide whether a versioned schema works to your benefit. More often than not, Protobuf will pave the way for easier and more efficient data and memory usage.

To begin, download the Protocol Buffer package for your preferred language, or use the complete package, which includes languages like Python, Java, and C++. Refer to the documentation to build and install the packages.

Refer to the official tutorial for an intuitive approach to the recommended usage and implementation. Separate tutorials are available for each of the languages mentioned above. An example of its usage can be found in the implementation of Hyperledger Sawtooth, where data communication and storage in a decentralized setting rely on it for efficiency and high scalability.

Last updated: January 23rd, 2024 at 1:50:36 PM GMT+0
