{"id":225,"date":"2024-10-20T12:54:40","date_gmt":"2024-10-20T19:54:40","guid":{"rendered":"https:\/\/flenniken.net\/blog\/?p=225"},"modified":"2024-10-20T12:54:40","modified_gmt":"2024-10-20T19:54:40","slug":"utf-8-decoder","status":"publish","type":"post","link":"https:\/\/flenniken.net\/blog\/utf-8-decoder\/","title":{"rendered":"UTF-8 Decoder"},"content":{"rendered":"\n<p>I wrote the UTF-8 decoder for statictea because I found a bug in Nim\u2019s unicode validator. I am amazed at how many common programs have bugs in their decoders.<\/p>\n\n\n\n<p>I documented what I found in the utf8tests project. You can see the test results and the decoder there:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/github.com\/flenniken\/utf8tests\">utf8tests<\/a><\/li>\n<\/ul>\n\n\n\n<p>For statictea, instead of referencing the decoder from the utf8tests project, I just copied the utf8decoder module into statictea. When the external module is updated, the statictea build process will report that the module changed and that the copy should be updated.<\/p>\n\n\n\n<p>What\u2019s cool about the decoder:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>it\u2019s fast and small<\/li>\n\n\n\n<li>it passes all the tests<\/li>\n\n\n\n<li>it returns byte sequences for valid and invalid cases<\/li>\n\n\n\n<li>it can start in the middle of a string; there is no need to start at the beginning<\/li>\n\n\n\n<li>it is easy to build high-level functions with it<\/li>\n<\/ul>\n\n\n\n<p>The decoder is only a few lines of code and it is table-driven.<\/p>\n\n\n\n<p>The decoder self-corrects and synchronizes if you start in the middle of a character byte sequence. The first sequence it returns will be invalid, but the sequences that follow will be valid.<\/p>\n\n\n\n<p>The utf8decoder module contains the useful functions yieldUtf8Chars, validateUtf8String, sanitizeUtf8 and utf8CharString, all built around the decoder.<\/p>\n\n\n\n<p>It\u2019s important that your decoder passes the tests. 
For example, Microsoft introduced a security issue because their decoder accepted overlong encodings, i.e. it allowed multiple encodings for the same character. An attacker exploited this by encoding the slash path separator as a multi-byte sequence in a file path to access private files.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I wrote the UTF-8 decoder for statictea because I found a bug in Nim\u2019s unicode validator. I am amazed at how many common programs have bugs in their decoders. I documented what I found in the utf8tests project. You can see &hellip; <a href=\"https:\/\/flenniken.net\/blog\/utf-8-decoder\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[27,26],"class_list":["post-225","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-nim","tag-statictea"],"_links":{"self":[{"href":"https:\/\/flenniken.net\/blog\/wp-json\/wp\/v2\/posts\/225","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/flenniken.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/flenniken.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/flenniken.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/flenniken.net\/blog\/wp-json\/wp\/v2\/comments?post=225"}],"version-history":[{"count":2,"href":"https:\/\/flenniken.net\/blog\/wp-json\/wp\/v2\/posts\/225\/revisions"}],"predecessor-version":[{"id":227,"href":"https:\/\/flenniken.net\/blog\/wp-json\/wp\/v2\/posts\/225\/revisions\/227"}],"wp:attachment":[{"href":"https:\/\/flenniken.net\/blog\/wp-json\/wp\/v2\/media?parent=225"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/flenniken.net\/blog\/wp-json\/wp\/v2\/categories?post=225"},{"taxonomy":"post_tag","embeddable
":true,"href":"https:\/\/flenniken.net\/blog\/wp-json\/wp\/v2\/tags?post=225"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}