I'm trying to build a token for Simplified Chinese Identifiers.
Simplified Chinese Identifiers are defined in the specification as follows:
simplified-Chinese-identifier = first-sChinese-identifier-character *subsequent-sChinese-identifier-character
first-sChinese-identifier-character = (first-Latin-identifier-character / CP936-initial-character)
subsequent-sChinese-identifier-character = (subsequent-Latin-identifier-character / CP936-subsequent-character)
CP936-initial-character = < character ranges specified in section 3.3.5.1.3>
CP936-subsequent-character = < character ranges specified in section 3.3.5.1.3>
Here are UNICODE-BESTFIT and Windows Codepage 936.
What I did was, for instance, to look up %xA3C1 in the page and take its corresponding code, which is 0xff21. In this way I found the corresponding codes for %xA3C1, %xA3DA, %xA3E1, %xA3FA, %xA1A2, %xA1AA, %xA1AC, %xA1AD, %xA1B2, %xA1E6, %xA1E8, %xA1EF, %xA2B1, %xA2FC, %xA4A1 and %xFE4F, and built CP936-initial-character as follows:
let cP936_initial_character = [%sedlex.regexp? 0xff21 .. 0xff3a | 0xff41 .. 0xff5a | 0x3001 .. 0x2014 | 0x2016 .. 0x2026 | 0x3014 .. 0x2103 | 0x00a4 .. 0x2605 | 0x2488 .. 0x216b | 0x3041 .. 0xfa29]
However, the problem is that some of the ranges look odd: for example, 0x00a4 .. 0x2605 and 0x2488 .. 0x216b are not in a sensible order, and 0x3041 .. 0xfa29 looks too large.
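The odd ranges can be reproduced with a small sketch in Python. It assumes that Python's built-in cp936 codec agrees with the best-fit table for these code points, and the helper names (cp936_to_unicode, codepoints_in_range) are made up for illustration. It decodes every code inside one specification range instead of only the two endpoints, which suggests that consecutive CP936 codes do not map to consecutive (or even ascending) Unicode code points:

```python
def cp936_to_unicode(code):
    """Decode one double-byte CP936 code; return None if it is unmapped."""
    try:
        text = code.to_bytes(2, "big").decode("cp936")
    except UnicodeDecodeError:
        return None
    # Only accept codes that decode to exactly one character.
    return ord(text) if len(text) == 1 else None


def codepoints_in_range(lo, hi):
    """Unicode code points for every mapped CP936 code in lo..hi (inclusive)."""
    result = []
    for code in range(lo, hi + 1):
        cp = cp936_to_unicode(code)
        if cp is not None:
            result.append(cp)
    return result


# The endpoints match what was looked up by hand ...
print(hex(cp936_to_unicode(0xA1E8)), hex(cp936_to_unicode(0xA1EF)))
# ... but the codes in between are scattered over Unicode.
print([hex(cp) for cp in codepoints_in_range(0xA1E8, 0xA1EF)])
```

So mapping only the endpoints of a CP936 range does not seem to describe the same set of characters as the range itself.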
Does anyone know what's the correct way to build this token?


Follow WindowsBestFit/readme.txt; in particular, the description of multibyte mapping records in the WCTABLE section (the WCTABLE tag marks the start of the Unicode UTF-16 (WideChar) to "MultiByte" bytes…).

The following, partially commented, Python 3 script (sorry, I don't speak VBA):

- reads the bestfit936.txt file line by line,
- parses the WCTABLE section and builds an array of Unicode codepoints whose gb2312 codepoint (codepage 936) matches the rules for Simplified Chinese Identifiers and whose Unicode category is a letter ('Ll', 'Lu', 'Lo') (see variable init_chars_16),
- takes init_chars_16 and creates the corresponding array of characters (variable init_chars_utf16),
- groups init_chars_16 into the longest consecutive chains (variable init_chars_groups), and
- prints the result (supply argument 1 to print all characters). There are 15477 applicable codepoints in (unfortunately) 1977 consecutive ranges.

This is done only for CP936-initial-character; however, the same could be applied to CP936-subsequent-character as well (supply argument 2, see also Usage below and output examples).

Output:
.\SO\68766804.py

Output:
.\SO\68766804.py 2
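The script itself is not reproduced here, so what follows is only a minimal sketch of the steps listed above, not the original code. Several things are assumptions: that WCTABLE records have the form `0x<unicode> 0x<cp936 bytes> ;comment`, that the CP936 codes from the question pair up into the ranges shown in INITIAL_RANGES, and that each range is a plain numeric interval. The function names and the inline SAMPLE are made up for illustration; the sample is hand-written in the assumed layout, not copied from bestfit936.txt.

```python
import unicodedata

# Assumed pairing of the CP936 codes listed in the question; verify against
# section 3.3.5.1.3 of the specification before relying on it.
INITIAL_RANGES = [
    (0xA3C1, 0xA3DA), (0xA3E1, 0xA3FA), (0xA1A2, 0xA1AA), (0xA1AC, 0xA1AD),
    (0xA1B2, 0xA1E6), (0xA1E8, 0xA1EF), (0xA2B1, 0xA2FC), (0xA4A1, 0xFE4F),
]


def parse_wctable(lines):
    """Yield (unicode_codepoint, cp936_code) pairs from the WCTABLE section."""
    in_table = False
    for line in lines:
        fields = line.split(";", 1)[0].split()  # drop trailing comment
        if not fields:
            continue
        if fields[0] == "WCTABLE":
            in_table = True
        elif in_table and fields[0].startswith("0x") and len(fields) >= 2:
            yield int(fields[0], 16), int(fields[1], 16)
        elif in_table:
            break  # any other tag ends the section


def is_initial(unicode_cp, cp936_code):
    """True if the CP936 code is in an assumed range and the character is a letter."""
    in_spec = any(lo <= cp936_code <= hi for lo, hi in INITIAL_RANGES)
    return in_spec and unicodedata.category(chr(unicode_cp)) in {"Ll", "Lu", "Lo"}


def group_consecutive(codepoints):
    """Merge code points into the longest runs of consecutive values."""
    ranges = []
    for cp in sorted(set(codepoints)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))  # start a new run
    return ranges


# Hand-written sample in the assumed record layout.
SAMPLE = """\
WCTABLE 5
0x00a4\t0xa1e8\t;not a letter (category Sc)
0x3041\t0xa4a1
0x3042\t0xa4a2
0x3043\t0xa4a3
0xff21\t0xa3c1
ENDCODEPAGE
"""

init_chars = [u for u, mb in parse_wctable(SAMPLE.splitlines()) if is_initial(u, mb)]
ranges = group_consecutive(init_chars)
# Print the ranges in the same shape as the sedlex expression in the question.
print(" | ".join(f"0x{lo:04x} .. 0x{hi:04x}" for lo, hi in ranges))
```

For the sample this prints `0x3041 .. 0x3043 | 0xff21 .. 0xff21`. Run against the real table, the same grouping would presumably yield the long list of ranges mentioned above, which could then be pasted into the sedlex definition.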