I am trying to read non-printable characters from a text file, print out the characters' ASCII code, and finally write these non-printable characters into an output file.
However, I have noticed that for every non-printable character I read, there is always an extra character in front of the one I actually want.
For example, the character I want to read is "§". And when I print out its ASCII code in my program, instead of printing just "167", it prints out "194 167".
I looked it up in the debugger and saw "Â§" in the char array. But I don't have "Â" anywhere in my input file. [screenshot of debugger]
And after I write the non-printable character into my output file, I have noticed that it is also "Â§", not just "§".
There is an extra character being attached to every single non-printable character I read. Why is this happening? How do I get rid of it?
Thanks!
Code as follows:
case 1:
    mode = 1;
    FILE *fp;
    fp = fopen("input2.txt", "r");
    int charCount = 0;
    while (!feof(fp)) {
        original_message[charCount] = fgetc(fp);
        charCount++;
    }
    original_message[charCount - 1] = '\0';
    fclose(fp);
    k = strlen(original_message); // split the original message into k input symbols
    printf("k: \n%lld\n", k);
    printf("ASCII code:\n");
    for (int i = 0; i < k; i++)
    {
        ASCII = original_message[i];
        printf("%d ", ASCII);
    }
C's getchar (and getc and fgetc) functions are designed to read individual bytes. They won't directly handle "wide" or "multibyte" characters such as occur in the UTF-8 encoding of Unicode.

But there are other functions which are specifically designed to deal with those extended characters. In particular, if you wish, you can replace your call to fgetc(fp) with fgetwc(fp), and then you should be able to start reading characters like § as themselves.

You will have to #include <wchar.h> to get the prototype for fgetwc. And you may have to add the call setlocale(LC_CTYPE, "") (declared in <locale.h>) at the top of your program to synchronize your program's character set "locale" with that of your operating system.
Not your original code, but I wrote this little program:
When I type "A", it prints A 65. When I type "§", it prints § 167. When I type "Ƶ", it prints Ƶ 437. When I type "†", it prints † 8224.

Now, with all that said, reading wide characters using functions like fgetwc isn't the only or necessarily even the best way of dealing with extended characters. In your case, it carries a number of additional consequences:

- Your original_message array is going to have to be an array of wchar_t, not an array of char.
- Your original_message array isn't going to be an ordinary C string; it's a "wide character string". So you can't call strlen on it; you're going to have to call wcslen.
- You won't be able to print the string using %s, or its characters using %c. You'll have to remember to use %ls or %lc.

So although you can convert your entire program to use "wide" strings and "w" functions everywhere, it's a ton of work. In many cases, and despite anomalies like the one you asked about, it's much easier to use UTF-8 everywhere, since it tends to Just Work. In particular, as long as you don't have to pick a string apart and work with its individual characters, or compute the on-screen display length of a string (in "characters") using strlen, you can just use plain C strings everywhere, and let the magic of UTF-8 sequences take care of any non-ASCII characters your users happen to enter.