This Text Is Not Corrupted
- Quality Assurance, Programming
In both my personal and professional life, I often have to deal with corrupted text data. In almost every case, the cause is that the text was not parsed using the correct character encoding. In other words, the way the characters were saved on the computer differs from the way the computer attempts to read them.
But the effects of the corruption may seem odd to non-technical people. Sometimes the text becomes complete gibberish, but with a noticeable pattern. Sometimes most of the text remains readable and the problem is limited to a few special characters. Sometimes it causes the decoding program to crash. Why is that?
A little history
Almost all modern character encodings are based on ASCII, an encoding containing 128 different characters, each associated with a unique 7-digit binary number. Storing them on a computer was then simple: save each character as a group of 7 bits, done.
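To make this concrete, here is a minimal sketch in Python (the example language I'll use throughout this post) showing how each ASCII character maps to a number that fits in 7 bits:

```python
# Each ASCII character corresponds to a code point below 128,
# so its binary representation fits in 7 bits.
for char in "Hi!":
    code = ord(char)  # the character's ASCII code point
    print(char, code, format(code, "07b"))
# H 72 1001000
# i 105 1101001
# ! 33 0100001
```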
As computers evolved to architectures with a byte length of 8 bits, storing each character on 8 bits with a leading 0 was natural and led to faster calculations. It also became natural to use the extra bit to extend ASCII and double the number of supported characters to 256.
This is when a problem started to emerge, as people wanted to extend ASCII in their own way. All of a sudden, countless 8-bit character encodings appeared. Still, each region and platform lived happily with its own character encoding, and the world continued to turn... that is, until the Internet became widely available to the public and all hell broke loose for internationalization.
Of course solutions have been implemented since then, but this problem still persists today, and when the character encoding used for a document is unknown or incorrect, things can get pretty nasty.
Let's take a look at some common offenders.
ISO 8859-1 vs Windows-1252
ISO 8859-1 was probably the most common character encoding used for Western European languages, as it matches ASCII exactly for its first 128 characters and adds 128 additional characters that strike a very good compromise between functionality, compatibility and usage frequency per language.
But as time went on, some key missing characters became problematic. In particular, the newly created euro sign (€) character, introduced to support the euro currency, had to be put somewhere, but there was no room for it. Something had to change.
Microsoft's primary solution to this problem was Windows-1252, which replaced the non-printable C1 control characters from ISO 8859-1 with printable characters. This worked fine in the Microsoft ecosystem, but not so much elsewhere, since other software could misinterpret those bytes as C1 control characters and fail when processing them.
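Here is a small sketch of the mismatch, using Python's cp1252 and iso-8859-1 codecs: the euro sign, stored as a Windows-1252 byte, turns into an invisible control character when decoded as ISO 8859-1.

```python
# Windows-1252 stores the euro sign at byte 0x80...
data = "€".encode("cp1252")
print(data)  # b'\x80'

# ...but ISO 8859-1 maps that same byte to the non-printable C1
# control character U+0080, so the euro silently disappears.
print(repr(data.decode("iso-8859-1")))  # '\x80'
```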
ISO 8859-1 vs ISO 8859-15
Unix operating systems' primary solution to the euro problem was ISO 8859-15, which replaced 8 characters from the last 128 characters of ISO 8859-1. In particular, the euro sign character replaced the generic currency sign (¤) character.
So even though ISO 8859-1 and ISO 8859-15 are very similar, they are not fully compatible, which may lead to subtle bugs that are rarely reported but have a very negative impact on users.
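As a quick illustration, here is a Python sketch where the exact same byte decodes to a different currency symbol depending on which of the two encodings is chosen:

```python
# ISO 8859-15 reassigned byte 0xA4 from the generic currency sign to the euro.
data = "€".encode("iso-8859-15")
print(data)                        # b'\xa4'
print(data.decode("iso-8859-1"))   # '¤', the subtly wrong symbol
print(data.decode("iso-8859-15"))  # '€', what was intended
```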
Fortunately, due to its historical context, ISO 8859-15 is not very prevalent on the public web today. Still, I have seen it being used in live databases designed years prior.
ASCII vs Shift JIS
Japan was one of the pioneers of modern computing, so it's no surprise that a few Japanese character encodings are still present today.
But Japanese has three alphabets: hiragana, katakana and kanji, and none of them is based on the Latin alphabet. Also, kanji alone contains thousands of characters, and that's only counting those officially recognized by the Japanese government for reading newspapers. And yet, Japanese people still needed to write things in English for compatibility purposes.
A common encoding used to support this was Shift JIS, which matches almost all of ASCII for the first 128 characters and encodes the troublesome kanji alphabet on 2 bytes instead of just 1.
I say "almost" because exactly 2 characters are different when comparing ASCII with Shift JIS. The problem is that one of them is the backslash (\) character, which was replaced by the yen sign (¥) character.
This is kind of a big deal. Not only is the backslash character used in the Windows operating system as the directory separator in file system paths, but it's also used in many popular computer languages as an escape character, both of which may crash applications when improperly parsed.
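Here is a short Python sketch of the Shift JIS behavior. One caveat worth hedging on: Python's shift_jis codec maps byte 0x5C to the regular backslash, so the yen-sign substitution usually manifests at the font and rendering level on Japanese systems rather than in the decoding step itself.

```python
# ASCII letters keep their single-byte values in Shift JIS...
print("path".encode("shift_jis"))  # b'path'

# ...while each kanji takes 2 bytes.
print("漢字".encode("shift_jis"))  # b'\x8a\xbf\x8e\x9a'

# Byte 0x5C decodes to a backslash here, but Japanese fonts and
# systems traditionally *display* that byte as a yen sign (¥),
# which is how 'C:\Users' can appear as 'C:¥Users'.
print(b"C:\\Users".decode("shift_jis"))
```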
Note that some popular Korean character encodings may also be affected by this backslash issue, with the backslash replaced by the won sign (₩) character. For example, a colleague of mine was once unable to work for a day because of a bug that another colleague had introduced in a Python script. The bug caused the script to crash if Windows's language was set to Korean. Even though I quickly realized what the problem was, by the time I fixed the issue the company had lost about 1 man-day of work, so be wary of this.
ISO 8859-1 vs UTF-8
Unicode is the current solution to unify character encodings. It is designed to support more than a million unique characters, and the list of supported characters is continuously expanded by the Unicode Consortium, a non-profit organization. For compatibility purposes, the first 256 characters of Unicode exactly match ISO 8859-1.
But Unicode is not a character encoding itself. To encode such a large set of characters, the Unicode Consortium proposes a few Unicode transformation formats (UTF).
A common one that is relevant here is UTF-8, which has a variable byte length per character and is fully compatible with ASCII, since it encodes the first 128 characters as a single byte, making it a popular encoding choice nowadays.
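A short Python sketch makes that variable byte length visible:

```python
# The further a character is from ASCII, the more bytes UTF-8 spends on it.
for char in ["A", "é", "€", "😀"]:
    print(char, len(char.encode("utf-8")), "byte(s)")
# A 1 byte(s)
# é 2 byte(s)
# € 3 byte(s)
# 😀 4 byte(s)
```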
Countless possibilities
As you can see, there are many subtle differences between character encodings that may have a major impact. Keep in mind that just because a character encoding looks right doesn't mean that it is right.
I hope you learned something new! If so, I invite you to subscribe to my blog if you want to learn more and read follow-up articles on other software-related subjects. Until next time!
Bonus content!
Finding the proper character encoding
If you ever stumble on a web page or text document with corrupted characters, try changing the character encoding. All good web browsers and text editors should have that feature.
When in doubt, try UTF-8 first, which also decodes ASCII documents correctly. If most of the text is readable but some characters are combined into a single corrupted one, try Windows-1252 next, which also decodes ISO 8859-1 documents that do not contain the extremely rare C1 control characters. If there are still a few corrupted characters left, then you should try to guess which characters are expected instead of the corrupted ones and find an appropriate character encoding based on those guesses and the document's origin.
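If you'd rather script that triage, here is a rough Python sketch of the same try-in-order strategy. The file name mystery.txt and the candidate list are placeholder assumptions to adapt to your situation, and keep in mind that a successful decode only proves the bytes are valid in that encoding, not that the text is what its author wrote.

```python
def guess_decode(raw: bytes) -> tuple[str, str]:
    """Try a few likely encodings in order and return (text, encoding)."""
    for encoding in ("utf-8", "cp1252", "iso-8859-15", "shift_jis"):
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # ISO 8859-1 assigns a character to every possible byte,
    # so it never fails; use it as a last resort.
    return raw.decode("iso-8859-1"), "iso-8859-1"

with open("mystery.txt", "rb") as f:
    text, encoding = guess_decode(f.read())
print("Decoded as", encoding)
```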
Google, why?
Just as I was finishing this post, I discovered that the Google Chrome browser removed the option to manually change the character encoding of a web page in version 55. I am completely baffled by this decision, as declaring character encoding information for web content is not a requirement, and in some cases not even possible. If you're using this browser, beware of this limitation.