Problem: using getchar() for Umlauts/Umlaute (Ö, Ä, Ü, ß)


I want to run a simple program:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
int c;

printf("Please enter a letter: ");

while ((c = getchar()) !='.')
    printf("The letter is: %c", c);

return 0;
}

But when I print text with printf(), the output looks like this (with "a" as example input):

Please enter a letter: a
The letter is: aThe letter is:

And when I enter an umlaut, for example "ü", I get this:

Please enter a letter: ü
The letter is: �The letter is: �The letter is:

I thought, I can use getchar() for Umlaute/Umlauts?! It seems like printf() cant handle it. But I don't know what to do. When I use putchar() I will get the umlaut. Or is it not possible to use umlaute/umlauts in Clang? I know, that there is a set of signs which are permitted for sourcecode in C.

What am I doing wrong?


Answer by KamilCuk:

When dealing with anything more than the basic English alphabet, you have to move to wide characters. There is a high chance that ü takes more than one byte - it just does not "fit" into char.

Also, check for EOF.

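#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>
int main(void) {
    // Adopt the environment's locale so that wide-character input is decoded correctly.
    setlocale(LC_ALL, "");
    printf("Please enter a letter: ");
    wint_t c;  // wint_t can hold any wide character as well as WEOF
    while ((c = getwchar()) != WEOF && c != L'.')
        printf("The letter is: %lc\n", c);
    return 0;
}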
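
To make the "more than one byte" point concrete, here is a small sketch that dumps the bytes of "ü" in hex. It assumes UTF-8, where "ü" (U+00FC) is encoded as the two bytes C3 BC; the bytes are spelled out with escapes so the result does not depend on how the source file is saved. The names `u_umlaut_utf8` and `format_bytes_hex` are made up for this illustration.

```c
#include <stdio.h>
#include <string.h>

/* "ü" (U+00FC) written as its two UTF-8 bytes (assumption: UTF-8 encoding). */
const char u_umlaut_utf8[] = "\xC3\xBC";

/* Writes the bytes of s into out as two-digit hex values separated by
   spaces and returns the number of bytes in s. Output is silently cut
   short if out is too small. */
size_t format_bytes_hex(const char *s, char *out, size_t out_size)
{
    size_t n = strlen(s);
    size_t used = 0;

    if (out_size > 0)
        out[0] = '\0';
    /* Each step needs room for a separator, two hex digits and the NUL. */
    for (size_t i = 0; i < n && used + 3 < out_size; i++) {
        used += (size_t)snprintf(out + used, out_size - used, "%s%02X",
                                 i ? " " : "",
                                 (unsigned)(unsigned char)s[i]);
    }
    return n;
}
```

Calling `format_bytes_hex(u_umlaut_utf8, buf, sizeof buf)` with a 16-byte `buf` returns 2 and leaves "C3 BC" in `buf`: one visible letter, two `char` units.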

Answer by John Bollinger:

The basic source and execution character sets do not contain any letters with diacritical marks. Any such characters in the execution character set are extended characters, which might or might not be multibyte characters. If the execution character set is encoded in UTF-8 (very common) then all characters with diacritical marks will be multibyte characters, but that is not the only alternative.

getchar() reads one char-sized unit and, on success, returns its value as an unsigned char converted to int. Reading a multibyte character this way takes multiple calls, one per byte. Your example program does not account for that, but consider this alternative:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    printf("Please enter some letters: ");

    while (1) {
        int c = getchar();

        if (c < 0 || c == '\n') {
            break;
        }
        // Using printf here for parallelism with the original example:
        printf("%c", c);
    }
    putchar('\n');

    return 0;
}

I think you will find that it echoes one line of input accurately, diacritical marks and all. And you will note that it uses the same I/O functions that the original example does.
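
As a side illustration of why the result of getchar() belongs in an int: bytes of a multibyte character come back as values from 128 to 255, which stay distinct from the negative EOF. The sketch below uses getc() on an arbitrary stream (getchar() is equivalent to getc(stdin)); the function name `count_high_bytes` is invented for this example.

```c
#include <stdio.h>

/* Counts the bytes on `in` whose value is 0x80 or above, that is, bytes
   that cannot be plain ASCII. In UTF-8, every byte of a multibyte
   character falls into this range. */
size_t count_high_bytes(FILE *in)
{
    size_t count = 0;
    int c; /* int, not char: it must hold 0..UCHAR_MAX and also EOF */

    while ((c = getc(in)) != EOF) {
        if (c >= 0x80)
            count++;
    }
    return count;
}
```

For the UTF-8 input "aü" followed by a newline, this counts 2: the two bytes that together make up the "ü".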

I thought, I can use getchar() for Umlaute/Umlauts?!

Character handling is more complicated than novices tend to appreciate. If you have not already done so, you should read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). It's getting a bit old, but it's still relevant.

In any case, yes, you can use getchar() to input characters with umlauts if the execution character set supports such characters at all, but not necessarily at a rate of one (whole) character per getchar() call. It may take multiple calls to get all the bytes of such a character.
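
If you do want one whole character per iteration, you have to collect the bytes yourself. The following is a minimal sketch that assumes the input is UTF-8; it derives the sequence length from the first byte and then reads the continuation bytes. The helper names are hypothetical, and the validation is deliberately incomplete (it does not reject overlong encodings or surrogates, for example).

```c
#include <stdio.h>

/* Number of bytes in a UTF-8 sequence starting with `lead`, or 0 if
   `lead` cannot start a sequence (for example a continuation byte). */
int utf8_sequence_length(unsigned char lead)
{
    if (lead < 0x80)           return 1; /* 0xxxxxxx: ASCII      */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx             */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx             */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx             */
    return 0;
}

/* Reads one UTF-8 encoded character from `in` into `buf` (room for at
   least 5 bytes) and NUL-terminates it. Returns the number of bytes
   stored, 0 at end of input, or -1 on a malformed or truncated sequence. */
int read_utf8_char(FILE *in, char buf[5])
{
    int c = getc(in);
    if (c == EOF)
        return 0;

    int len = utf8_sequence_length((unsigned char)c);
    if (len == 0)
        return -1;

    buf[0] = (char)c;
    for (int i = 1; i < len; i++) {
        c = getc(in); /* one more call per continuation byte */
        if (c == EOF || (c & 0xC0) != 0x80)
            return -1;
        buf[i] = (char)c;
    }
    buf[len] = '\0';
    return len;
}
```

A loop such as `while (read_utf8_char(stdin, buf) > 0) printf("The letter is: %s\n", buf);` would then print each umlaut intact, provided the terminal actually sends and displays UTF-8.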

It seems like printf() cant handle it.

printf() prints the data you present to it. If you ask it to print only the first byte of a multibyte character, then that's what it will do. If you ask it to print all the bytes of a multibyte character, then that's what it will do.

When I use putchar() I will get the umlaut.

Substituting putchar(c) for printf("The letter is: %c", c) gives you something much more analogous to my example, above, than to the one presented in the question. It makes a big difference if you insert a bunch of other characters (The letter is: ) between the bytes of a multibyte character.
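
The difference can be reproduced without a keyboard. This sketch, again assuming "ü" arrives as the UTF-8 bytes C3 BC, writes the same input once the way the question's loop does (label before every byte) and once with the bytes kept together. Both function names are invented for the illustration.

```c
#include <stdio.h>

/* Mimics the loop from the question: every single byte of `input`
   gets its own "The letter is: " prefix. */
void print_per_byte(FILE *out, const char *input)
{
    for (const char *p = input; *p != '\0'; p++)
        fprintf(out, "The letter is: %c", *p);
}

/* Prints the label once and then all bytes of `input` back to back,
   so the bytes of a multibyte character stay adjacent. */
void print_whole(FILE *out, const char *input)
{
    fprintf(out, "The letter is: %s", input);
}
```

For the input "\xC3\xBC", `print_per_byte` emits the label, the byte C3, the label again, and the byte BC. Neither byte is a valid UTF-8 character on its own, which is why a terminal shows a replacement mark for each. `print_whole` emits C3 BC consecutively, and a UTF-8 terminal renders that as "ü".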

Or is it not possible to use umlaute/umlauts in Clang?

There is nothing in Clang or any conforming C implementation that would prevent echoing input bytes directly to the standard output. What effect that has depends on external factors, especially the terminal configuration, but no, Clang does not have any inherent problem with umlauts.

I know, that there is a set of signs which are permitted for sourcecode in C.

Well, there is a set of characters (the basic source character set) that all C implementations are required to accept in C source. In practice, substantially all C implementations accept more than that, and almost every C program depends on that. Some implementations, under some circumstances, will even accept characters with umlauts in C source.

C does have wide streams, which operate in units of type wchar_t, and a set of I/O functions for operating on them. These were intended to ease I/O involving characters whose encoded values are too large for char to handle. And they do, somewhat, but that's not necessary for handling multibyte characters, and it's not certain to be sufficient, either.