How to check if a UTF-8 string starts with an 'a'


I have a UTF-8 string given as a null-terminated const char*. I would like to know if the first letter of this string is an a by itself. The following code

bool f(const char* s) {
  return s[0] == 'a';
}

is wrong, as the first letter (grapheme cluster) of the string might be à, made from 2 Unicode scalar values: a (U+0061) followed by a combining grave accent (U+0300). So this very simple question seems extremely difficult to answer, unless you know how grapheme clusters are made.
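To make the failure concrete, here is a minimal sketch in C that applies the same byte test to both spellings of à. The byte sequences are the standard UTF-8 encodings; the names starts_with_a_byte, precomposed, and decomposed are made up for this illustration.

```c
#include <stdbool.h>

/* Same test as f() above: looks only at the first byte (code unit). */
bool starts_with_a_byte(const char *s) {
    return s[0] == 'a';
}

/* Two canonically equivalent spellings of "à": */
const char precomposed[] = "\xc3\xa0";   /* U+00E0 LATIN SMALL LETTER A WITH GRAVE */
const char decomposed[]  = "a\xcc\x80";  /* U+0061 'a' + U+0300 COMBINING GRAVE ACCENT */
```

The test accepts decomposed, although its first user-perceived letter is à, and rejects precomposed.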

Still, many libraries parse UTF-8 files (YAML files, for instance) and therefore should be able to answer this kind of question. But these libraries don't seem to depend upon a Unicode library.

So my questions are:

  • How to write code that checks if a string starts with the letter a?

  • Assuming that there is no simple answer to the first question, how do parsers (such as YAML parsers) manage to parse files without being able to answer this kind of question?


Josh Lee On BEST ANSWER

It simply doesn't matter.

Consider: Is this string valid JSON?

"̀"

(That's the byte sequence 22 cc 80 22.)

You seem to be arguing that it is not, since a JSON string should start with " (QUOTATION MARK) but this one starts with the grapheme cluster (QUOTATION MARK + COMBINING GRAVE ACCENT).

The only reasonable response is that you're thinking at the wrong level: Text serialization is defined in terms of code points. Grapheme clusters are only considered for processing natural language and editing text.

And this certainly is considered valid JSON.

>>> import json
>>> json.loads(bytes.fromhex('22cc8022'))
'̀'
Nicol Bolas On

How to write a code that checks if a string starts with the letter a?

There is no simple answer to this. To answer this question, you would need to test the Unicode Canonical_Combining_Class (ccc) property of the code point that follows the a. If it's non-zero, then it is a combining character.

Of course, C has no API for doing so.
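A rough sketch of what such a test could look like without a Unicode library follows. It decodes the code point after a leading 'a' and checks it against a hand-written list of combining-mark blocks. This only approximates a real ccc lookup in the Unicode Character Database: a few characters in these blocks have ccc 0, and combining marks exist outside them. The helper names decode_utf8, in_combining_block, and starts_with_bare_a are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the NUL-terminated UTF-8 string s.
   Returns 0xFFFD for a malformed or truncated sequence.
   Sketch only: overlong forms and surrogates are not rejected. */
uint32_t decode_utf8(const char *s) {
    const unsigned char *p = (const unsigned char *)s;
    uint32_t cp;
    int extra;
    if (p[0] < 0x80)                { return p[0]; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else                            { return 0xFFFD; }
    for (int i = 1; i <= extra; i++) {
        /* A NUL fails this check, so we never read past the terminator. */
        if ((p[i] & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    return cp;
}

/* Approximation: blocks made up (almost entirely) of combining marks. */
bool in_combining_block(uint32_t cp) {
    static const uint32_t ranges[][2] = {
        {0x0300, 0x036F},  /* Combining Diacritical Marks */
        {0x1AB0, 0x1AFF},  /* Combining Diacritical Marks Extended */
        {0x1DC0, 0x1DFF},  /* Combining Diacritical Marks Supplement */
        {0x20D0, 0x20FF},  /* Combining Diacritical Marks for Symbols */
        {0xFE20, 0xFE2F},  /* Combining Half Marks */
    };
    for (size_t i = 0; i < sizeof ranges / sizeof ranges[0]; i++) {
        if (cp >= ranges[i][0] && cp <= ranges[i][1]) return true;
    }
    return false;
}

/* True if s starts with 'a' and the next code point is not in a block above. */
bool starts_with_bare_a(const char *s) {
    return s[0] == 'a' && !in_combining_block(decode_utf8(s + 1));
}
```

With this sketch, "abc" and "a" are accepted, while both the decomposed "a\xcc\x80" and the precomposed "\xc3\xa0" are rejected. A production version would look up the real ccc (or the grapheme cluster break property) from the Unicode data files, or use a library such as ICU.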

How do parsers (such as YAML parsers) manage to parse files without being able to answer this kind of question.

This is not a question they need to answer. Why? Because they never ask it.

If YAML is reading a key, then it reads up until the name terminating character (like :). A Unicode combining character cannot combine through such a character, and the YAML specification doesn't care if there's a combining character on the other side of the :. If it sees a :, then it knows that it has reached the end of the name, and everything before that is a key.

If it's reading a text string, then it similarly keeps reading until it reads a terminating character or character sequence.

Parsing text with most text formats is based on regular expression matching (or something similar) against some terminating condition. That is, a string would be any of some set of characters (alternatively, all characters except for some set), up to the terminus character(s).
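Why a byte-level scan is safe can be sketched in a few lines of C. In UTF-8, every byte of a multi-byte sequence has its high bit set, so a byte equal to ':' (0x3A) is always a real U+003A and never part of another character. The function key_length is a hypothetical illustration, not how any particular YAML parser is written, and it ignores quoting, escapes, and the rest of the YAML grammar.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length in bytes of the key before the first ':' in line,
   or (size_t)-1 if there is no ':'. No Unicode knowledge is needed:
   combining marks on either side of ':' are just more bytes >= 0x80. */
size_t key_length(const char *line) {
    const char *colon = strchr(line, ':');
    return colon ? (size_t)(colon - line) : (size_t)-1;
}
```

For the line "a\xcc\x80: 1" (key à in decomposed form) this gives 3, and for "k:\xcc\x80" (a combining accent right after the colon) it gives 1. The scan never asks where grapheme clusters begin or end.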

user803422 On

Here is code that checks whether a UTF-8 string starts with the letter 'a', counting the precomposed accented forms of a as well:

#include <stdbool.h>
#include <string.h>

bool f(const char* s) {

        if (s[0] == 'a') return true;

        if (strlen(s) >= 2 && s[0] == '\xc3') {
                char s1 = s[1];
                if (s1 == '\x80') return true; // LATIN CAPITAL LETTER A WITH GRAVE
                if (s1 == '\x81') return true; // LATIN CAPITAL LETTER A WITH ACUTE
                if (s1 == '\x82') return true; // LATIN CAPITAL LETTER A WITH CIRCUMFLEX
                if (s1 == '\x83') return true; // LATIN CAPITAL LETTER A WITH TILDE
                if (s1 == '\x84') return true; // LATIN CAPITAL LETTER A WITH DIAERESIS
                if (s1 == '\x85') return true; // LATIN CAPITAL LETTER A WITH RING ABOVE

                if (s1 == '\xa0') return true; // LATIN SMALL LETTER A WITH GRAVE
                if (s1 == '\xa1') return true; // LATIN SMALL LETTER A WITH ACUTE
                if (s1 == '\xa2') return true; // LATIN SMALL LETTER A WITH CIRCUMFLEX
                if (s1 == '\xa3') return true; // LATIN SMALL LETTER A WITH TILDE
                if (s1 == '\xa4') return true; // LATIN SMALL LETTER A WITH DIAERESIS
                if (s1 == '\xa5') return true; // LATIN SMALL LETTER A WITH RING ABOVE
        }
        return false;
}
R.. GitHub STOP HELPING ICE On

s[0] == 'a' is the correct test for whether the first character is a. If a string contains a decomposed version of à, that would be two characters, a and the combining grave. Up until Apple decided to enforce NFD everywhere, this was basically a non-issue, because people who wanted à to be treated as a character/letter by itself would enter it as one, and people who wanted it as an a with a mark attached would enter it as two. Yes, this goes against the Unicode intent of canonical equivalence, but the Unicode intent of canonical equivalence largely goes against user expectation and intent (not to mention existing text & text processing models).

If you really want to check that the first character is an a and is not followed by any combining marks, this should work:

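#include <limits.h>  /* MB_LEN_MAX */
#include <wchar.h>   /* mbrtowc; wcwidth is POSIX and may need _XOPEN_SOURCE */

/* Assumes setlocale(LC_CTYPE, "") has already selected a UTF-8 locale. */
wchar_t tmp = 0;
size_t n = mbrtowc(&tmp, s+1, MB_LEN_MAX, &(mbstate_t){0});
if (n != (size_t)-1 && n != (size_t)-2 && tmp && wcwidth(tmp)==0) {
    /* character following 'a' is a combining mark */
}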

This depends on the POSIX wcwidth function, but you can find portable versions of it or write your own based on the Unicode tables (really you could write a simpler function that only checks for combining status, not also the East Asian Width property).

To answer your second question about parsers, they don't have any reason to know or care about the issue you're concerned about. File formats like YAML, JSON, etc. are not subject to canonical equivalence (at least not at the parsing level; the content stored in the file, which applications will interpret, might be subject to it). A string that is a different sequence of Unicode characters, even if it would be canonically equivalent, is a different string that compares not-equal.
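As a small illustration of that last point, a byte-wise comparison in C treats the two spellings of à as different strings. The helper same_string is hypothetical; it performs no normalization at all, which is the behaviour described above for comparison at the format level.

```c
#include <stdbool.h>
#include <string.h>

/* Compares code units only; canonically equivalent strings are not equal. */
bool same_string(const char *a, const char *b) {
    return strcmp(a, b) == 0;
}
```

Here same_string("\xc3\xa0", "a\xcc\x80") is false even though both arguments render as à. An application that wants them to match would normalize both sides (for example to NFC) with a Unicode library before comparing.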