Background
I am writing a DFA based regex parser, for performance reasons, I need to use a dictionary [Unicode.Scalar : State] to map the next states. Now I need a bunch of special unicode values to represent special character expressions like ., \w, \d...
My Question
Which of the unicode values are safe to use for this purpose?
I was using U+0000 for ., but I need more now. I checked the unicode documentation, the Noncharacters seems promising, but in swift, those are considered invalid unicode. For example, the following code gives me a compiler error Invalid unicode scalar.
let c = "\u{FDD0}"
If you insist on using
Unicode.Scalar, nothing.Unicode.Scalaris designed to represent all valid characters in Unicode, including not-assigned-yet code points. So, it cannot represent noncharacters nor dangling surrogates ("\u{DC00}"also causes error).And in Swift,
Stringcan contain all validUnicode.Scalars includingU+0000. For example"\u{0000}"(=="\0") is a valid String and its length (count) is 1. With usingU+0000as a meta-character, your code would not work with valid Swift Strings.Consider using
[UInt32: State]instead of[Unicode.Scalar: State]. Unicode uses only0x0000...0x10FFFF(including noncharacters), so using values greater than0x10FFFFis safe (in your meaning).Also getting
valueproperty ofUnicode.Scalartakes very small cost and its cost can be ignored in optimized code. I'm not sure usingDictionaryis really a good way to handle your requirements, but[UInt32: State]would be as efficient as[Unicode.Scalar: State].