Using ctype.h with UTF-8 characters


I'm studying C from Kernighan and Ritchie (1988), which uses ASCII for character manipulation. In chapter 2, they start using the header file ctype.h. From searching the internet and reading the comments in the ctype.h file, I understand that it is meant for ASCII, so it makes sense that it doesn't work so well with other character encodings such as UTF-8.

I was printing the value of iscntrl() for the values 0-32 and 127-159 (decimal), expecting it to return 0 or 1, but instead it returns 0 and 32.

Why doesn't it return 0 or 1? And is there a ctype.h for UTF-8?

2

There are 2 best solutions below

4
KamilCuk On

Why doesn't it return 0 or 1?

The is* functions from ctype.h return zero if the character does not meet the condition and non-zero otherwise. Any non-zero value will do, and 32 is non-zero.

According to the cppreference page for iscntrl, it should return non-zero for 0-31 and 127 in ASCII.
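As a rough check of that range, the sketch below counts how many codes in 0-127 iscntrl() reports as control characters, testing only for zero versus non-zero. The helper name count_ascii_cntrl is made up for this example, and it assumes the program runs in the default "C" locale.

```c
#include <ctype.h>

/* Hypothetical helper: counts the codes 0..127 that iscntrl() classifies
   as control characters. Only truthiness is tested, never "== 1". */
int count_ascii_cntrl(void) {
    int count = 0;
    for (int c = 0; c <= 127; c++) {
        if (iscntrl(c)) {   /* non-zero means "is a control character" */
            count++;
        }
    }
    return count;
}
```

In the "C" locale this should count 33 characters: codes 0-31 plus 127.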

Is there a ctype.h for UTF-8?

The short answer is no. UTF-8 is a multibyte encoding. The functions from ctype.h are for single-byte narrow characters.

The standard way, when you have a string containing multibyte characters (in the C programming sense), is to first set the appropriate locale for your environment and then convert the characters to wide characters with mbtowc. You can then use the isw* functions from wctype.h to identify the character category.
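A minimal sketch of that sequence might look like the following. The helper names init_locale, first_wide_char and first_is_alpha are invented for this example, and whether a UTF-8 string decodes correctly depends on the environment actually providing a UTF-8 locale.

```c
#include <locale.h>   /* setlocale */
#include <stdlib.h>   /* mbtowc */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wchar_t, wint_t */
#include <wctype.h>   /* iswalpha */

/* Hypothetical helper: call once at startup to adopt the user's locale
   (for example a UTF-8 one). Returns 1 on success, 0 on failure. */
int init_locale(void) {
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: decodes the first multibyte character of s into *out
   using the current locale. Returns the number of bytes it occupied,
   0 for an empty string, or -1 if the bytes are invalid in this locale. */
int first_wide_char(const char *s, wchar_t *out) {
    mbtowc(NULL, NULL, 0);                  /* reset the conversion state */
    return mbtowc(out, s, strlen(s) + 1);
}

/* Returns 1 if the first character of s is alphabetic, 0 if it is not,
   and -1 if s is empty or cannot be decoded. */
int first_is_alpha(const char *s) {
    wchar_t wc;
    if (first_wide_char(s, &wc) <= 0) {
        return -1;
    }
    return iswalpha((wint_t)wc) != 0;
}
```

A program would call init_locale() once and then, for instance, first_is_alpha("é"); without a UTF-8 locale the non-ASCII bytes are typically rejected and the function returns -1.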

Why 32, though?

For a tiny speed gain that mattered 50 years ago. Converting the result of a bitwise test to 0/1 takes an additional operation. Nowadays it makes no practical difference, which is why modern programming languages have bool. 50 years ago, C had no bool, and individual operations mattered much more.

It is usually implemented like the following. There is a big table that maps every character to a single byte of flags.

unsigned char map[256] = { 0, 0, _ISCTRL_FLAG | _ISUPPER_FLAG, /* ..... etc for 256 bytes .... */ };

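All iscntrl then does is check bitwise whether the flag bit is set in the map entry for the character.

enum {
  _ISCTRL_FLAG = 32   /* one bit per character class */
};
static inline
int iscntrl(unsigned char c) {
   /* yields _ISCTRL_FLAG (32) if the bit is set, 0 otherwise */
   return map[c] & _ISCTRL_FLAG;
}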

Because the result of & is equal to _ISCTRL_FLAG when the bit is set, the result is 32 or 0.
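To make the idea concrete, here is a self-contained sketch of a table-driven classifier. It is not taken from any real libc: the names CNTRL_FLAG, class_table, init_class_table and my_iscntrl are invented, the table is filled at run time instead of being a static initializer, and only the ASCII control codes are marked.

```c
/* Hypothetical flag value, chosen to match the 32 observed in the question. */
enum { CNTRL_FLAG = 32 };

/* One byte of class flags per possible unsigned char value. */
static unsigned char class_table[256];

/* Marks codes 0..31 and 127 as control characters; everything else stays 0. */
void init_class_table(void) {
    for (int c = 0; c < 32; c++) {
        class_table[c] = CNTRL_FLAG;
    }
    class_table[127] = CNTRL_FLAG;
}

/* Returns CNTRL_FLAG (32) for control characters and 0 otherwise:
   the masked bit is returned as-is, without normalizing it to 1. */
int my_iscntrl(int c) {
    return class_table[(unsigned char)c] & CNTRL_FLAG;
}
```

After init_class_table(), my_iscntrl('\n') gives 32 and my_iscntrl('A') gives 0, which matches the 0-or-32 pattern from the question.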

1
Mark Reed On

In C, any nonzero value is considered true, so many Boolean functions return whatever is easiest to return rather than arranging to return 1. So if you expect a Boolean result, just use it in a Boolean context; don't check for equality to 1.

However, you can coerce it to 0 or 1 by negating it twice with !!.
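For example, a small wrapper (the name iscntrl01 is made up here) could normalize the result like this:

```c
#include <ctype.h>

/* Hypothetical wrapper: returns exactly 1 for control characters and
   exactly 0 otherwise, whatever non-zero value iscntrl() itself uses. */
int iscntrl01(int c) {
    /* The first ! maps non-zero to 0 and zero to 1; the second flips it back. */
    return !!iscntrl((unsigned char)c);
}
```

With this, iscntrl01('\n') is 1 and iscntrl01('A') is 0, even on an implementation where iscntrl('\n') returns 32.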

As for Unicode, C doesn't really support it. It has regular one-byte chars and wide wchar_ts, but the latter aren't required to hold Unicode or to use any specific encoding. You can find libraries that will check properties against the Unicode Character Database, but there aren't any ctype-like functions for Unicode in the standard library.