In the Google Closure UTF-8 to byte array tests is the string
\u0000\u007F\u0080\u07FF\u0800\uFFFF
which is supposed to be converted to the array
[0x00, 0x7F, 0xC2, 0x80, 0xDF, 0xBF, 0xE0, 0xA0, 0x80, 0xEF, 0xBF, 0xBF]
I've tried a few other JavaScript and TypeScript UTF-8-to-byte array implementations and they claim that the UTF-8 string is invalid.
The string appears to cover the boundary values where the encoding goes from 1 byte to 2 bytes to 3 bytes.
Is Google correct, or are the other libraries?
Google is correct.
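One way to check this independently is to encode the test string by hand. The following is a minimal sketch, not the Closure implementation: `encodeUtf8` is a hypothetical helper name, and it assumes the input contains no surrogate code units, which holds for this particular string.

```typescript
// Minimal sketch: encode a string of BMP code points as UTF-8 bytes.
// encodeUtf8 is a hypothetical helper for illustration only; surrogate
// pairs (code points above U+FFFF) are deliberately out of scope.
function encodeUtf8(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c >= 0xd800 && c <= 0xdfff) {
      throw new Error("surrogates are not handled by this sketch");
    }
    if (c <= 0x7f) {
      // U+0000..U+007F: one byte, 0xxxxxxx
      out.push(c);
    } else if (c <= 0x7ff) {
      // U+0080..U+07FF: two bytes, 110xxxxx 10xxxxxx
      out.push(0xc0 | (c >> 6), 0x80 | (c & 0x3f));
    } else {
      // U+0800..U+FFFF: three bytes, 1110xxxx 10xxxxxx 10xxxxxx
      out.push(0xe0 | (c >> 12), 0x80 | ((c >> 6) & 0x3f), 0x80 | (c & 0x3f));
    }
  }
  return out;
}

const bytes = encodeUtf8("\u0000\u007F\u0080\u07FF\u0800\uFFFF");
const hex = bytes
  .map((b) => (b < 16 ? "0" : "") + b.toString(16).toUpperCase())
  .join(" ");
console.log(hex); // prints "00 7F C2 80 DF BF E0 A0 80 EF BF BF"
```

Each pair of adjacent characters in the test string sits on either side of a length boundary (U+007F/U+0080 and U+07FF/U+0800), so the output exercises all three branches. The built-in `TextEncoder` should produce the same bytes for this string, since it contains no lone surrogates.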
The string '\u0000\u007F\u0080\u07FF\u0800\uFFFF' represents Unicode codepoints U+0000 U+007F U+0080 U+07FF U+0800 U+FFFF.
The literal translation of those codepoints to UTF-8 is indeed bytes 00 7F C2 80 DF BF E0 A0 80 EF BF BF, just as Google says.
Note that U+FFFF is a non-character codepoint, per the Unicode standard:
In particular: