I wrote the UTF-8 decoder for statictea because I found a bug in Nim’s unicode validator. I am amazed at how many common programs have bugs in their decoders.
I documented what I found in the utf8tests project. You can see the test results and the decoder in it:
For statictea, instead of referencing the decoder from the utf8tests project, I just copied the utf8decoder module into statictea. When the external module is updated, the statictea build process reports that the module has changed and that the copy should be updated.
What’s cool about the decoder:
- it’s fast and small
- it passes all the tests
- it returns byte sequences for valid and invalid cases
- it can start in the middle of a string; there is no need to start at the beginning
- it is easy to build high-level functions with it
The decoder is only a few lines of code and it is table-driven.
The decoder self-corrects and synchronizes if you start in the middle of a character byte sequence. The first return will be invalid, but the following return sequences will be valid.
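To make the resynchronizing behavior concrete, here is a minimal Python sketch. It is not the statictea decoder (which is written in Nim); it is a stand-in that assumes the well-formed byte ranges from the Unicode Standard as its table. The names `LEAD_TABLE`, `lookup` and `decode_sequences` are hypothetical. Unlike the real decoder, this sketch reports each stray continuation byte as its own invalid sequence.

```python
from typing import Iterator, Optional, Tuple

# Hypothetical table for this sketch:
# (first lead byte, last lead byte, sequence length, min 2nd byte, max 2nd byte)
# The ranges are the well-formed UTF-8 byte sequences from the Unicode Standard.
LEAD_TABLE = [
    (0x00, 0x7F, 1, 0x00, 0x00),  # ASCII, no continuation bytes
    (0xC2, 0xDF, 2, 0x80, 0xBF),
    (0xE0, 0xE0, 3, 0xA0, 0xBF),  # excludes overlong 3-byte forms
    (0xE1, 0xEC, 3, 0x80, 0xBF),
    (0xED, 0xED, 3, 0x80, 0x9F),  # excludes surrogates
    (0xEE, 0xEF, 3, 0x80, 0xBF),
    (0xF0, 0xF0, 4, 0x90, 0xBF),  # excludes overlong 4-byte forms
    (0xF1, 0xF3, 4, 0x80, 0xBF),
    (0xF4, 0xF4, 4, 0x80, 0x8F),  # excludes values above U+10FFFF
]


def lookup(lead: int) -> Optional[Tuple[int, int, int]]:
    """Return (length, min 2nd byte, max 2nd byte), or None for a bad lead byte."""
    for first, last, length, second_lo, second_hi in LEAD_TABLE:
        if first <= lead <= last:
            return length, second_lo, second_hi
    return None


def decode_sequences(data: bytes, start: int = 0) -> Iterator[Tuple[bool, bytes]]:
    """Yield (is_valid, byte_sequence) pairs, beginning at any byte offset."""
    pos = start
    while pos < len(data):
        entry = lookup(data[pos])
        if entry is None:
            # Stray continuation byte or impossible lead byte: report it alone.
            yield False, data[pos:pos + 1]
            pos += 1
            continue
        length, second_lo, second_hi = entry
        end = pos + 1
        valid = True
        for i in range(1, length):
            lo, hi = (second_lo, second_hi) if i == 1 else (0x80, 0xBF)
            if end >= len(data) or not (lo <= data[end] <= hi):
                valid = False  # stop before the offending byte so it is re-examined
                break
            end += 1
        yield valid, data[pos:end]
        pos = end


# Start on the second byte of "é" (0xC3 0xA9): one invalid return, then valid ones.
print(list(decode_sequences("héllo".encode("utf-8"), start=2)))
# [(False, b'\xa9'), (True, b'l'), (True, b'l'), (True, b'o')]
```

Because an invalid sequence never consumes the byte that broke it, the next call always begins on a byte that can be classified from the table, which is what lets the sketch recover without scanning back to the start of the string.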
The utf8decoder module contains the useful functions yieldUtf8Chars, validateUtf8String, sanitizeUtf8, and utf8CharString, all built around the decoder.
It’s important that your decoder passes the tests. For example, Microsoft introduced a security issue because their decoder accepted overlong encodings, i.e. it allowed multiple encodings of the same character. An attacker exploited this by putting a multi-byte encoding of the slash separator in a file path to access private files.
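The overlong problem is easy to reproduce. The Python sketch below is an illustration only, not code from statictea or from the affected Microsoft product. It assumes the bytes 0xC0 0xAF as the example, a well-known overlong two-byte form of the slash, and the helper names `naive_decode_two_bytes` and `is_valid_utf8` are hypothetical.

```python
def naive_decode_two_bytes(b0: int, b1: int) -> str:
    """Unsafe: combine the payload bits of a 2-byte sequence with no range checks."""
    return chr(((b0 & 0x1F) << 6) | (b1 & 0x3F))


def is_valid_utf8(data: bytes) -> bool:
    """Use Python's strict built-in decoder, which rejects overlong forms."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


overlong_slash = bytes([0xC0, 0xAF])

# A decoder that skips the range checks turns the overlong form into a plain slash.
print(naive_decode_two_bytes(*overlong_slash) == "/")  # True

# A conforming decoder treats the same bytes as invalid.
print(is_valid_utf8(overlong_slash))  # False
print(is_valid_utf8(b"/"))            # True
```

A path filter that looks for the single byte 0x2F before decoding never sees a slash in the overlong form, yet the permissive decoder produces one afterwards. That mismatch is what makes accepting overlong sequences a security problem.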