Why is ICU's ucnv_getNextUChar not setting error codes?


Here's some code to demonstrate my problem:

#include <unicode/ucnv.h>
#include <stdio.h>

UConverter * converter;
void test_char(char inchar){
    const char * inbuf=&inchar;
    UErrorCode err=U_ZERO_ERROR;
    UChar32 c = ucnv_getNextUChar(
        converter,
        &inbuf,
        inbuf+1,
        &err
    );

    printf("%x %s\n", c, u_errorName(err));
}
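int main(){
    UErrorCode err=U_ZERO_ERROR;
    converter = ucnv_open("cp932", &err);
    if (U_FAILURE(err)) { /* bail out if the converter could not be opened */
        fprintf(stderr, "ucnv_open failed: %s\n", u_errorName(err));
        return 1;
    }
    test_char(0x41); /*A*/
    test_char(0xB1); /*ア*/
    test_char(0xE1); /*Should be U_TRUNCATED_CHAR_FOUND*/
    test_char(0xF1); /*Should be U_INVALID_CHAR_FOUND*/
    ucnv_close(converter); /* release the converter */
    return 0;
}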

This code prints:

41 U_ZERO_ERROR
ff71 U_ZERO_ERROR
1a U_ZERO_ERROR
1a U_ZERO_ERROR

Why do the invalid bytes return U_ZERO_ERROR when there is clearly an error? Why does the function return the substitute control code (U+001A) instead? Isn't SUB a valid Shift-JIS character? How do I distinguish a genuine SUB in the input from an invalid Shift-JIS sequence?

Answer by oshaboy:

I found the answer in the fine print of the ucnv.h documentation:

When a converter encounters an illegal, irregular, invalid or unmappable character its default behavior is to use a substitution character to replace the bad byte sequence. This behavior can be changed by using ucnv_setFromUCallBack() or ucnv_setToUCallBack() on the converter. The header ucnv_err.h defines many other callback actions that can be used instead of a character substitution.

Therefore, if you want actual error codes, you need to replace the converter's to-Unicode callback with one that stops instead of substituting:

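/* Call this once after ucnv_open() succeeds. If err already holds a
   failure code, the call does nothing. */
ucnv_setToUCallBack(
    converter,
    UCNV_TO_U_CALLBACK_STOP, /* report the error instead of substituting */
    NULL,                    /* no callback context */
    NULL,                    /* previous callback is not needed */
    NULL,                    /* previous context is not needed */
    &err
);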