Current Data Serialization Formats May Be a Waste of Money

- Programming, Business


Storing data. Transmitting data. Processing data. These fundamental topics of computer science are often overlooked nowadays thanks to the historical exponential growth of processing power, storage availability and bandwidth capabilities, along with the myriad of existing solutions that tackle them. So much so that we tend to assume these solutions are properly adapted for today's needs.

Specifically, we're going to look at the cloud computing costs of data serialization, and question whether current data serialization technologies are adapted for them. (Spoiler: They're probably not.)

The money problem

Let's consider a scenario where we would like to offer a service that would send and receive data over the Internet. We would have to deal with the following expenses:

As such, we would like to minimize the total sum of these costs over the lifetime of the service. In addition, we would also like to minimize these same costs for our consumers to give ourselves a competitive advantage.

Picking optimal data serialization formats is therefore critical to achieving this objective, because it will have an impact on all of these costs.

As for implementation and maintenance costs, once a data serialization format becomes popular, many people will already have done the groundwork, so these costs are not considered here.

Current technologies

Human-readable formats

CSV, XML, JSON, YAML... those are all great data serialization formats because anyone can read and modify them using a simple text editor. In terms of compactness, however, they are pretty terrible because they are verbose by design.

Let's say, for example, that you would like to represent an object with 5 boolean properties. Writing the values alone takes multiple bytes for each "True" or "False", plus the delimiters between them, even though the information itself fits in 5 bits. Similarly, if the names of the properties must be included in the format, even more bytes are consumed writing them.
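To make the gap concrete, here is a minimal sketch that compares a JSON encoding of five booleans against packing them into the bits of a single byte. The property names and the `pack_flags` / `unpack_flags` helpers are invented for illustration, and the packed form assumes both sides agree on the property order in advance.

```python
import json

# Hypothetical object with five boolean properties (names are illustrative).
flags = {
    "active": True,
    "verified": False,
    "admin": False,
    "subscribed": True,
    "archived": False,
}

# Text encoding: property names, delimiters, and literal true/false values.
json_bytes = json.dumps(flags).encode("utf-8")


def pack_flags(values):
    """Pack up to 8 booleans into one byte, first value in the lowest bit."""
    packed = 0
    for position, value in enumerate(values):
        if value:
            packed |= 1 << position
    return bytes([packed])


def unpack_flags(data, count):
    """Recover `count` booleans from a byte produced by pack_flags."""
    return [bool((data[0] >> position) & 1) for position in range(count)]


# Binary encoding: one bit per property, in a fixed agreed-upon order.
packed_bytes = pack_flags(flags.values())

print(len(json_bytes), "bytes as JSON text")
print(len(packed_bytes), "byte as packed bits")
```

The packed form is no longer readable in a text editor and carries no property names, which is exactly the trade-off between human-readable and binary formats.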

As such, not only does it take a bunch of space, but it also requires parsing text to deserialize the data, which is not very efficient. Removing some of the optional padding may help, but doing so has its limits.
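As a rough illustration of how far removing optional whitespace goes, the sketch below serializes the same made-up record three ways with Python's standard `json` module. The record contents are an assumption chosen only for the comparison.

```python
import json

# Illustrative record; the field names and values are made up.
record = {"id": 1, "tags": ["a", "b"], "active": True}

pretty = json.dumps(record, indent=4)                  # indented for readability
default = json.dumps(record)                           # spaces after "," and ":"
compact = json.dumps(record, separators=(",", ":"))    # no optional whitespace

for label, text in (("indented", pretty), ("default", default), ("compact", compact)):
    print(label, len(text.encode("utf-8")), "bytes")
```

Even the most compact form still spells out every property name and every value as text, which is where the limit mentioned above comes from.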

Data compression

One quick fix in terms of bandwidth and storage consumption is to apply data compression over text data. However, general-purpose compression knows nothing about the structure of the data, so the results are generally not optimal. Also, while compression saves bandwidth and storage, it requires additional processing power, although the net result is usually worth it in terms of raw expenses.
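The trade-off can be observed with a small sketch using Python's standard `zlib` module. The generated records are an assumption made for the example; real data will compress differently, and the measured time depends on the machine.

```python
import json
import time
import zlib

# Illustrative dataset: 1000 small, repetitive records (made up for the example).
records = [{"id": i, "active": i % 2 == 0, "name": f"user{i}"} for i in range(1000)]
raw = json.dumps(records, separators=(",", ":")).encode("utf-8")

# Compression costs processing time in exchange for fewer bytes.
start = time.perf_counter()
compressed = zlib.compress(raw, 6)  # 6 is a mid-range compression level
elapsed = time.perf_counter() - start

print(len(raw), "bytes uncompressed")
print(len(compressed), "bytes compressed")
print(f"{elapsed * 1000:.3f} ms spent compressing")

# The receiver pays a processing cost too, to get the original bytes back.
restored = zlib.decompress(compressed)
```

Repetitive text such as repeated property names compresses well, but the receiver still has to decompress and then parse the text, so the processing cost is paid twice.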

As for the existing data compression algorithms themselves, some common issues include:

Protocol Buffers

As the above issues created a need for pure binary data serialization, Protocol Buffers rose to fill it. While not the only binary serialization solution, it became popular thanks to its open-source nature, its versatile data encoding, its powerful object definition, and the possibility of extending it using gRPC to define full web services. However, the encoding of Protocol Buffers is a bit strange, which may lead to some unexpected issues. For example:

As such, it's not a surprise that Protocol Buffers became popular, as each potential issue also has related advantages. Still, there is room for potential improvements.

Future technologies?

Based on the above, here are ideas that I could identify as potential optimizations for the original objective of minimizing costs:

This is far from an exhaustive list, and I do not know if these ideas could lead to a significantly better solution than those that currently exist, but I believe they are certainly worth consideration for future designs and prototypes.

Disclaimer: I originally wrote this article back in 2020-10-12 at the request of Steeve Leblanc as an independent analysis of his data encoding invention, but he asked me to refrain from publishing it at the time due to a pending patent application. As this is no longer an issue, I have released the above article in its exact original wording. Note that since then, he has founded TS-Alpha, a company I have acquired shares in, and later joined as a full-time employee in order to help him realize said future technologies.
