The last post about the UTF-8 decoder got me thinking about how Unicode has changed and improved over time.
Originally it was a two-byte encoding, a logical step up from one-byte ASCII. The Unicode team was saying two bytes were enough for everyone. This was a great solution since all the existing encodings, single-byte and multi-byte, could be replaced by Unicode.
Early adopters like Microsoft embraced this and spent a lot of effort promoting it. They split their text interfaces in two: one for single-byte text and one for Unicode.
One of the early issues to overcome was storing a Unicode string natively on different processors, which store words (two-byte units) in different orders. The Mac's native byte order was the opposite of the PC's. The byte order mark (BOM) was introduced so you could detect the order and read the stored string correctly. (Note: since UTF-8 is a byte stream, it doesn't need a BOM.)
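As a minimal sketch of how that detection works (the helper name is mine; the BOM constants come from Python's codecs module):

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Pick the right UTF-16 byte order by inspecting the BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to the "utf-16" codec's platform default.
    return data.decode("utf-16")

# 'hi' stored big-endian with a BOM: b'\xfe\xff\x00h\x00i'
print(decode_utf16(b"\xfe\xff\x00h\x00i"))  # -> hi
```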
Once the Unicode team decided that two bytes were not enough for their needs, the surrogate pair workaround was introduced to support more characters while staying backward compatible; a sketch of the arithmetic follows the reference below. (Note: UTF-8 doesn't use surrogate pairs.)
- Unicode: UTF-16 surrogate pairs — see section 7.5, UTF-16 surrogate pairs
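Here is the surrogate-pair arithmetic in Python (the constants are from the Unicode standard; the helper name is my own):

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000       # 20 bits of payload
    high = 0xD800 + (offset >> 10)      # high surrogate: top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # low surrogate: bottom 10 bits
    return high, low

# U+1F600 (grinning face) becomes the pair D83D DE00.
print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```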
It wasn't until later that the UTF-8 encoding was invented, and what a good idea that was. The web world has adopted it, as have most new applications.
UTF-8 has the advantage that you can treat it as a byte string in many cases. Most existing code doesn't need to change. Only when you need a Unicode text feature do you need to decode the characters.
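For example (a small Python sketch of my own), ASCII-delimited operations work directly on raw UTF-8 bytes, because the bytes of a multi-byte UTF-8 character never include an ASCII value:

```python
# A UTF-8 byte string containing non-ASCII text.
line = "naïve,café,日本".encode("utf-8")

# Splitting on an ASCII delimiter works on the undecoded bytes.
fields = line.split(b",")
print(fields)  # [b'na\xc3\xafve', b'caf\xc3\xa9', b'\xe6\x97\xa5\xe6\x9c\xac']

# Decode only when you actually need text-level features.
print([f.decode("utf-8") for f in fields])  # ['naïve', 'café', '日本']
```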
Python
In Python 2 the basic string type was used for both text strings and byte strings.
When they updated to 3.0, they decided to separate text strings from byte strings and treat all text strings as Unicode. As Victor Stinner put it:
“Python 3.0 mostly uses UTF-8 everywhere, but it was not a deliberate choice and it caused many issues when the locale encoding was not UTF-8.”
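A quick illustration of the split (standard Python 3 behavior):

```python
text = "café"                 # str: a sequence of Unicode code points
data = text.encode("utf-8")   # bytes: an explicit encoding step

print(len(text))  # 4 code points
print(len(data))  # 5 bytes ('é' takes two bytes in UTF-8)

# Mixing the two types is now a hard error instead of the
# silent mojibake bugs Python 2 allowed:
try:
    text + data
except TypeError as e:
    print(e)  # can only concatenate str (not "bytes") to str
```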
They were of the opinion that being able to loop over and index code points, the way you can with ASCII, was important. But they didn't commit to one representation: a build flag chose whether strings were stored wide (32-bit) or narrow (16-bit).
32-bit characters work since Unicode guarantees code points will never exceed U+10FFFF. But 32 bits waste a lot of space for most text strings, so they developed ways to store smaller strings more compactly.
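That compact storage is visible in CPython today. A small sketch (exact byte counts vary by Python version and platform):

```python
import sys

# CPython picks the narrowest width that fits every character in
# the string: 1 byte/char for the Latin-1 range, 2 for the rest
# of the Basic Multilingual Plane, 4 beyond it.
for ch in "aé日😀":
    s = ch * 100
    print(f"{ch!r}: {sys.getsizeof(s)} bytes for 100 chars")
```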
Both James and Victor explain well how Python's Unicode support works and has evolved:
- painful-history-python-filesystem-encoding.html — Victor Stinner
- How Python does Unicode — James Bennett
Rust
Rust is a newer language, and its text strings are UTF-8. It ensures that text strings are always correctly encoded.
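The guarantee is the same one you get from an eager decode in Python (a rough analogy of my own; Rust enforces it at the type level when constructing a string):

```python
# Invalid UTF-8 is rejected up front, so a text string can never
# hold a bad encoding -- the guarantee Rust bakes into its String type.
bad = b"\xff\xfehello"
try:
    bad.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xff ...
```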
Nim
Nim strings are also UTF-8 encoded; however, they allow invalidly encoded strings. I think this is the sweet spot. You get performance, since you don't have to validate as much, and more importantly you can use strings as byte strings as well as text strings. The bulk of the code is the same for each. When you need Unicode features, you can validate and pull in the special Unicode handling code.
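In Python terms the pattern looks something like this (a sketch of my own; the function names are hypothetical):

```python
def count_lines(data: bytes) -> int:
    """Byte-level work: no validation or decoding needed."""
    return data.count(b"\n")

def length_in_chars(data: bytes) -> int:
    """Text-level work: validate only at this boundary."""
    text = data.decode("utf-8")   # raises if the bytes are invalid
    return len(text)

blob = "first line\nsecond line: naïve café\n".encode("utf-8")
print(count_lines(blob))      # cheap, no validation
print(length_in_chars(blob))  # validated on demand
```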
I believe UTF-8 is the One True Encoding.