Minifying JSON Text Beyond Whitespace
- Programming, Mathematics
JSON is a common data serialization format to transmit information over the Internet. However, as I mentioned in a previous article, it's far from optimal. Nevertheless, due to business requirements, producing data in this format may be necessary.
I won't go into the details as to how one could structure JSON text differently to contain the same information more efficiently, as there is no general solution. Instead, I want to focus on the following problem: how can JSON text be formatted more compactly in a way that JSON parsers would still interpret it the same way? This is a useful question to ponder on, since it may reduce data consumption costs.
The thing is, it's possible to parse JSON text in arbitrary ways. As such, it's not possible to consider whether different JSON texts are equivalent or not without setting up certain assumptions in the way parsers should operate. Here are the assumptions that we are going to work with:
- JSON text should be encoded in UTF-8. This is a safe assumption as this is required for JSON text exchanged between systems that are not part of a closed ecosystem.
- There should be no special meaning in the way JSON text is laid out. This is a safe assumption because JSON allows insignificant whitespace.
- There should be no special meaning in the way Unicode code points in JSON strings are escaped or not. This is a safe assumption as several code points can only be encoded in a very specific way, and because strings interpreted that way are Unicode strings.
- JSON numbers should exclusively represent rational numbers in their decimal expansion or scientific notation. This is a safe assumption because they match how numbers are normally written and interpreted by humans, because other potential properties (such as significant digits or a special meaning for negative zero) cannot be deduced from their formatting alone, and because this interpretation is compatible with systems traditionally converting JSON numbers to IEEE 754 binary64.
Interestingly, I have not witnessed a single JSON minification algorithm that implements all of the strategies I discovered, despite trying countless programs that are supposed to minify JSON. As such, I wanted to document them here.
With that said, let's get to it!
Whitespace
This is the obvious one. As whitespace surrounding JSON text and its structural characters is optional, it can be safely removed outside of strings.
Easy, but we can do more.
Strings
There are multiple ways to encode Unicode code points within a JSON string. Namely, each code point can be represented in at least one of the following ways:
- Unescaped, which requires 1 to 4 bytes to encode (in UTF-8) depending on the code point
- Two-character escape representation, which always requires 2 bytes to encode
- Six-character escape representation, which requires 6 or 12 bytes to encode depending on whether a surrogate pair is required or not
It's important to note here that the only two-character escape representation that can be unescaped is the solidus character, which only requires 1 byte to encode unescaped. Because of this and the above, it's always better to represent a code point unescaped when possible, with the two-character escape representation preferred over the six-character one otherwise.
One very important point to note not explicit in the JSON grammar is that unpaired surrogates cannot be safely unescaped. While JSON supports ill-formed code unit sequences as strings, no Unicode encoding form allows it (including UTF-8), so special consideration should be taken to handle this case properly.
It's also worth noting that names in JSON objects are strings as well, so all of the above apply in this case as well.
Numbers
As each character in a number can only be encoded in 1 byte, only the number of characters needed to write the entire number needs to be taken into account.
There are a few simple ways numbers can be directly simplified by simply deleting unneeded characters. Specifically:
- The plus sign is always redundant when it may appear and can be omitted. (Example:
1e+4
≡1e4
.) - The minus sign is not meaningful for a zero value. (Example:
-0e5
≡0e5
.). This also applies to the exponent part. (Example:1e-0
≡1e0
.) - Leading zeros past the letter E in the exponent part may be removed. (Example:
1e004
≡1e4
.) This may get rid of the entire exponent if applicable. (Example:1e0
≡1
.) - Trailing zeros in the fraction part are unnecessary. (Example:
1.20e4
≡1.2e4
.) This may also get rid of the period if applicable. (Example:1.00e4
≡1e4
.)
In addition, rational numbers can be freely converted between their decimal expansion and scientific notation, and there are infinite ways to write a number in scientific notation depending on the chosen exponent. Consider the following mathematical transformations, which may reduce the length of a number:
- Converting a number from their decimal expansion to scientific notation allows elimination of trailing zeros. (Example:
1000
≡1e3
) - A trailing zero in the integer component can be removed by increasing the exponent by 1 if there is no fraction part. (Example:
10e3
≡1e4
.) - Moving the period left or right by one digit also respectively increases or decreases the exponent by 1. (Examples:
12.3e-10
≡1.23e-9
, and1.23e10
≡12.3e9
.) - Moving the period all the way to the right by decrementing the exponent gets rid of said period. (Examples:
1.5e9
≡15e8
, and1.5e-8
≡15e-9
.)
With such transformations, you have a wide range of equal numbers with various character lengths to choose from. Therefore, numbers may potentially be further reduced with a bit of mathematics by modifying the exponent and balancing the rest of the number accordingly for equality.
While there are potentially many optimal representations for the same number with these transformations, I have determined that only the following ones need to be taken into consideration to achieve minimal length when not including the unneeded characters described previously:
- Decimal expansion (such as
0.0012345678
) - Normalized scientific notation (such as
1.2345678e-3
) - Scientific notation with no fraction part and no trailing zeros in the integer component (such as
12345678e-10
)
The reason is because one of these representations is guaranteed to minimize the absolute value of the exponent without leading or trailing zeros elsewhere, the first representation checks the case with no exponent, and the last representation checks the case with the smallest amount of characters possible before the exponent.
Numbers equal to 0 are trivial to minify, since the decimal expansion 0
is always optimal in this case. For every other case, I recommend converting the number to scientific notation with no fraction part and no trailing zeros first, then use the resulting integer component and exponent as function parameters to calculate the relative length of each of these representations, in order to pick one that is optimal among them. Just be warned that this calculation should be performed carefully since these parameters may be of arbitrary size.
As a final note, if we were to also assume that parsers should not be more precise than their IEEE 754 binary64 representation for historical reasons, then a conversion to that format and back could be performed to determine the proper precision to keep. This would limit the length of the integer component and fraction part to a total of 17 decimal digits at most. However, this may not be a safe assumption, and the temporary conversion attempt may overflow or underflow, so I recommend against this particular simplification.
Related content I wrote
A Technical Introducition to MathML Core for Writing Mathematics on the Web
- Programming, Mathematics
Thanks to recent efforts, all major web browsers currently support MathML Core, a subset of MathML focused on important presentation markup, to support mathematics on the web. As of this writing, the MathML Core specifications are still not finalized, but given its strong origins and support, it can…
The New Open Source Video Game Randomizer List Is Now Live
- Video Games, Programming
Time to update your bookmarks! After a few months of work behind the scenes, the new open source version of The BIG List of Video Game Randomizer is now live for your enjoyment, with dark mode support and a brand new UI for better readability! The new URL is: https://randomizers.debigare.com/ (The…
The Future of the Video Game Randomizer List
- Video Games, Programming, Anecdotes
It's hard to believe that it's been almost 8 years since I first posted on the ROMhacking.net forums a list of video game randomizers that I found online, and that it would evolve into the massive project it has become today, with almost 900 entries currently being listed. It's always a strange…
I Designed the Perfect Gambling Game, But...
- Mathematics, Business, Game Design
Back in 2006-07-08, during the 13th Canadian Undergraduate Mathematics Conference at McGill University, I presented a gambling game I designed with the novel property of being both advantageous to players and the house, and that despite this proprety, that pretty much nobody in their right mind…
Current Data Serialization Formats May Be a Waste of Money
- Programming, Business
Storing data. Transmitting data. Processing data. These fundamental topics of computer science are often overlooked nowadays thanks to the historical exponential growth of processing power, storage availability and bandwidth capabilities, along with a myriad of existing solutions to tackle them. So…