Minifying JSON Text Beyond Whitespace

- Programming, Mathematics

Stream of zeros and ones in space

JSON is a common data serialization format to transmit information over the Internet. However, as I mentioned in a previous article, it's far from optimal. Nevertheless, due to business requirements, producing data in this format may be necessary.

I won't go into the details as to how one could structure JSON text differently to contain the same information more efficiently, as there is no general solution. Instead, I want to focus on the following problem: how can JSON text be formatted more compactly in a way that JSON parsers would still interpret it the same way? This is a useful question to ponder on, since it may reduce data consumption costs.

The thing is, it's possible to parse JSON text in arbitrary ways. As such, it's not possible to consider whether different JSON texts are equivalent or not without setting up certain assumptions in the way parsers should operate. Here are the assumptions that we are going to work with:

Interestingly, I have not witnessed a single JSON minification algorithm that implements all of the strategies I discovered, despite trying countless programs that are supposed to minify JSON. As such, I wanted to document them here.

With that said, let's get to it!

Whitespace

This is the obvious one. As whitespace surrounding JSON text and its structural characters is optional, it can be safely removed outside of strings.

Easy, but we can do more.

Strings

There are multiple ways to encode Unicode code points within a JSON string. Namely, each code point can be represented in at least one of the following ways:

It's important to note here that the only two-character escape representation that can be unescaped is the solidus character, which only requires 1 byte to encode unescaped. Because of this and the above, it's always better to represent a code point unescaped when possible, with the two-character escape representation preferred over the six-character one otherwise.

One very important point to note not explicit in the JSON grammar is that unpaired surrogates cannot be safely unescaped. While JSON supports ill-formed code unit sequences as strings, no Unicode encoding form allows it (including UTF-8), so special consideration should be taken to handle this case properly.

It's also worth noting that names in JSON objects are strings as well, so all of the above apply in this case as well.

Numbers

As each character in a number can only be encoded in 1 byte, only the number of characters needed to write the entire number needs to be taken into account.

There are a few simple ways numbers can be directly simplified by simply deleting unneeded characters. Specifically:

In addition, rational numbers can be freely converted between their decimal expansion and scientific notation, and there are infinite ways to write a number in scientific notation depending on the chosen exponent. Consider the following mathematical transformations, which may reduce the length of a number:

With such transformations, you have a wide range of equal numbers with various character lengths to choose from. Therefore, numbers may potentially be further reduced with a bit of mathematics by modifying the exponent and balancing the rest of the number accordingly for equality.

While there are potentially many optimal representations for the same number with these transformations, I have determined that only the following ones need to be taken into consideration to achieve minimal length when not including the unneeded characters described previously:

The reason is because one of these representations is guaranteed to minimize the absolute value of the exponent without leading or trailing zeros elsewhere, the first representation checks the case with no exponent, and the last representation checks the case with the smallest amount of characters possible before the exponent.

Numbers equal to 0 are trivial to minify, since the decimal expansion 0 is always optimal in this case. For every other case, I recommend converting the number to scientific notation with no fraction part and no trailing zeros first, then use the resulting integer component and exponent as function parameters to calculate the relative length of each of these representations, in order to pick one that is optimal among them. Just be warned that this calculation should be performed carefully since these parameters may be of arbitrary size.

As a final note, if we were to also assume that parsers should not be more precise than their IEEE 754 binary64 representation for historical reasons, then a conversion to that format and back could be performed to determine the proper precision to keep. This would limit the length of the integer component and fraction part to a total of 17 decimal digits at most. However, this may not be a safe assumption, and the temporary conversion attempt may overflow or underflow, so I recommend against this particular simplification.

Related content I wrote

Floating mathematical formulas

A Technical Introducition to MathML Core for Writing Mathematics on the Web

- Programming, Mathematics

Thanks to recent efforts, all major web browsers currently support MathML Core, a subset of MathML focused on important presentation markup, to support mathematics on the web. As of this writing, the MathML Core specifications are still not finalized, but given its strong origins and support, it can…

Fireworks

The New Open Source Video Game Randomizer List Is Now Live

- Video Games, Programming

Time to update your bookmarks! After a few months of work behind the scenes, the new open source version of The BIG List of Video Game Randomizer is now live for your enjoyment, with dark mode support and a brand new UI for better readability! The new URL is: https://randomizers.debigare.com/ (The…

Open treasure chest with a question mark in it

The Future of the Video Game Randomizer List

- Video Games, Programming, Anecdotes

It's hard to believe that it's been almost 8 years since I first posted on the ROMhacking.net forums a list of video game randomizers that I found online, and that it would evolve into the massive project it has become today, with almost 900 entries currently being listed. It's always a strange…

Dice stacked in a triangle shape, with their face numbers matching their row position

I Designed the Perfect Gambling Game, But...

- Mathematics, Business, Game Design

Back in 2006-07-08, during the 13th Canadian Undergraduate Mathematics Conference at McGill University, I presented a gambling game I designed with the novel property of being both advantageous to players and the house, and that despite this proprety, that pretty much nobody in their right mind…

Stream of concatenated JSON objects

Current Data Serialization Formats May Be a Waste of Money

- Programming, Business

Storing data. Transmitting data. Processing data. These fundamental topics of computer science are often overlooked nowadays thanks to the historical exponential growth of processing power, storage availability and bandwidth capabilities, along with a myriad of existing solutions to tackle them. So…

See all of my articles