Current Data Serialization Formats May Be a Waste of Money
- Programming, Business
Storing data. Transmitting data. Processing data. These fundamental topics of computer science are often overlooked nowadays thanks to the historical exponential growth of processing power, storage availability and bandwidth capabilities, along with a myriad of existing solutions to tackle them. So much so that we tend to assume these technologies are properly adapted to today's needs.
Specifically, we're going to look at the cloud computing costs of data serialization, and question whether current data serialization technologies are adapted for them. (Spoiler: They're probably not.)
The money problem
Let's consider a scenario where we would like to offer a service that sends and receives data over the Internet. We would have to deal with the following expenses:
- Implementation and maintenance costs
- Processing power for data serialization and deserialization
- Bandwidth and storage consumption
As such, we would like to minimize the total sum of these costs over the lifetime of the service. In addition, we would also like to minimize these same costs for our consumers to give ourselves a competitive advantage.
Picking optimal data serialization formats is therefore critical to achieving this objective, because it will have an impact on all of these costs.
For implementation and maintenance, we also have to consider that once a data serialization format becomes popular, plenty of people will have already done the base work in existing libraries, so that portion of the cost will not be considered here.
Current technologies
Human-readable formats
CSV, XML, JSON, YAML... those are all great data serialization formats because anyone can read and modify them using a simple text editor. In terms of compactness, however, they are pretty terrible, because they are verbose by design.
Let's say, for example, that you would like to represent an object with 5 boolean properties. Writing out the values alone requires multiple bytes for each "true" or "false" literal, plus delimiters between them. Similarly, if the property names must be included in the format, that's even more bytes consumed to write them.
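To make that overhead concrete, here is a minimal Python sketch (the property names are made up for illustration) comparing a JSON representation of 5 booleans to the single byte the same information actually requires:

```python
import json

# Hypothetical object with 5 boolean properties (names made up for illustration).
flags = {"alpha": True, "beta": False, "gamma": True, "delta": True, "epsilon": False}

as_json = json.dumps(flags)
print(as_json)                        # {"alpha": true, "beta": false, ...}
print(len(as_json.encode("utf-8")))   # dozens of bytes spent on names, literals and delimiters

# The same information fits in a single byte if each boolean maps to one bit.
packed = 0
for i, value in enumerate(flags.values()):
    packed |= int(value) << i
print(bytes([packed]))                # 1 byte total
```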
As such, not only does this take a bunch of space, but it also requires parsing text to deserialize the data, which is not very efficient. Removing some of the optional whitespace may help, but doing so has its limits.
Data compression
One quick fix in terms of bandwidth and storage consumption is to apply data compression to the text data. However, these algorithms are generic, and their results are generally not optimal. Also, while they may save on bandwidth and storage, they require additional processing power, although the net result is usually worth it in terms of raw expenses.
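As an illustration of that trade-off, here is a minimal Python sketch (with a made-up payload) that compresses repetitive JSON text with gzip; the output shrinks considerably, but only at the cost of extra processing on both ends:

```python
import gzip
import json

# Made-up payload: a thousand similar records, as text data often looks in practice.
records = [{"id": i, "active": True, "name": "example"} for i in range(1000)]
text = json.dumps(records).encode("utf-8")

compressed = gzip.compress(text)
print(len(text), len(compressed))           # bandwidth and storage saved...
print(gzip.decompress(compressed) == text)  # ...but an extra pass is now needed on both ends
```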
As for the existing data compression algorithms themselves, some common issues include:
- Byte as the smallest component
- Upper size limit
- Equivalent values written differently (illustrated in the sketch after this list)
- Limited predefined dictionary
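The third point can be shown with a minimal Python sketch: two JSON documents carrying the same value but spelled differently (here, keys in a different order) compress to different byte sequences, because a generic compressor only sees bytes, not the data they represent:

```python
import json
import zlib

# Two semantically equivalent records that a text format is allowed to spell differently.
a = json.dumps({"id": 1, "active": True})
b = json.dumps({"active": True, "id": 1})

# A generic compressor only sees bytes, so it cannot tell these carry the same value.
print(zlib.compress(a.encode()) == zlib.compress(b.encode()))  # False
```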
Protocol Buffers
As the above issues created a need for pure binary data serialization, Protocol Buffers emerged to fill it. While not the only binary serialization solution, it became popular thanks to its open-source nature, its versatile data encoding, its powerful object definitions, and the possibility of extending it with gRPC to define full web services. However, the encoding of Protocol Buffers is a bit strange, which may lead to some unexpected issues. For example:
- Defining data requires transforming the definition into an API using an external tool, then embedding that API in the main code, which may be problematic for compatibility and maintenance. This is especially a problem when having to deal with consumers stuck on legacy systems.
- Data types do not match between definition (scalar value types) and serialization (wire types), probably to simplify the conversion to common variable types in popular programming languages.
- Integers may be serialized longer than necessary, due to a base-128 encoding whose digits are bytes (see the sketch after this list). This issue also affects the encoding of the data type and field ID.
- Strings are encoded as UTF-8, even when a better encoding may exist. This is especially true if strings do not require the full range of Unicode characters, or even ASCII characters.
- Repeated values or simple patterns are not compressed. While this may be partially mitigated by implementing data compression over the serialized data, this will likely not be done optimally.
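To illustrate the integer point, here is a minimal Python sketch of the base-128 varint scheme that Protocol Buffers uses (7 data bits per byte, with the high bit as a continuation flag):

```python
def encode_varint(value: int) -> bytes:
    """Base-128 varint: 7 data bits per byte, high bit set on all bytes but the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

# 300 only needs 9 bits of information, yet the encoding spends 2 full bytes on it.
print(encode_varint(300).hex())  # ac02

# A negative int32/int64 field is sign-extended to 64 bits, so it always takes 10 bytes.
print(len(encode_varint(-1 & 0xFFFFFFFFFFFFFFFF)))  # 10
```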
As such, it's not a surprise that Protocol Buffers became popular, as each of these potential issues also comes with related advantages. Still, there is room for potential improvements.
Future technologies?
Based on the above, here are ideas that I could identify as potential optimizations for the original objective of minimizing costs:
- Concatenate data at the bit level instead of the byte level (see the sketch after this list)
- Use a data compression algorithm that is specifically designed for the serialized data format
- Define a data serialization negotiation algorithm for simpler implementation and maintenance
- Allow dynamic data serialization within the same stream
- Use artificial intelligence to improve optimization of data compression
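As a rough idea of the first item, here is a minimal, purely illustrative Python sketch (not an actual format) of a writer that concatenates fields at the bit level, so that a boolean costs one bit instead of one byte:

```python
class BitWriter:
    """Toy bit-level writer: fields are appended back to back with no per-field byte padding."""

    def __init__(self):
        self.bits = []

    def write(self, value: int, width: int):
        # Append the value as `width` bits, most significant bit first.
        for i in reversed(range(width)):
            self.bits.append((value >> i) & 1)

    def to_bytes(self) -> bytes:
        # Pad only once, at the very end of the stream.
        padded = self.bits + [0] * (-len(self.bits) % 8)
        return bytes(
            int("".join(map(str, padded[i:i + 8])), 2)
            for i in range(0, len(padded), 8)
        )

w = BitWriter()
w.write(1, 1)        # a boolean in a single bit
w.write(5, 3)        # a small enum in 3 bits
w.write(300, 9)      # a 9-bit integer, with no rounding up to whole bytes per field
print(w.to_bytes())  # 13 bits of payload fit in 2 bytes
```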
This is far from an exhaustive list, and I do not know whether these ideas could lead to a significantly better solution than those that currently exist, but I believe they are certainly worth considering for future designs and prototypes.
Disclaimer: I originally wrote this article on 2020-10-12 at the request of Steeve Leblanc as an independent analysis of his data encoding invention, but he asked me to refrain from publishing it at the time due to a pending patent application. As this is no longer an issue, I have released the above article in its exact original wording. Note that since then, he has founded TS-Alpha, a company in which I have acquired shares and which I later joined as a full-time employee in order to help him realize said future technologies.