Working with NSString: manipulating special characters


I want to print an NSString byte by byte:

NSString *path = @"/User/user/ǢǨǣ/";
const char *byte_array = [path UTF8String];

for (unsigned char *it = byte_array; *it != 0; ++it) {
   NSLog(@"char: %c \t hex: %02x \n", *it, *it);
}

Producing:

 Ç - FFFFFFC7
 ¢ - FFFFFFA2
 Ç - FFFFFFC7
 ¨ - FFFFFFA8
 Ç - FFFFFFC7
 £ - FFFFFFA3

The output should be C7 A2 for Ǣ, C7 A8 for Ǩ, and C7 A3 for ǣ. I think the "FFFFFF" in front of every "byte" affects my code. I'm wondering if there is any way of manipulating paths with special characters in them.

There is 1 best solution below

Rob

The output looks as though each byte is a signed char that is promoted to a signed int (sign-extending 0xC7 to 0xFFFFFFC7) and then displayed as a 32-bit hex value. While I am unable to reproduce the exact behavior you describe (without manually casting to a signed integer), looking at the headers, UTF8String is defined as:

@property (nullable, readonly) const char *UTF8String NS_RETURNS_INNER_POINTER; // Convenience to return null-terminated UTF8 representation

And you defined your pointer as a plain char *, too:

const char *byte_array = [path UTF8String];

Note, neither of those is unsigned char *, just char *, whose signedness is left to the platform. (FWIW, that seems exceedingly curious to me, as I always think of a “byte” as an unsigned char, i.e., a uint8_t.)
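The sign extension described above can be reproduced in plain C, which compiles unchanged as Objective-C. This is only a minimal sketch: the helper names format_byte_signed and format_byte_unsigned are hypothetical, and the "ffffffc7" result assumes a platform with a 32-bit int.

```c
#include <stdio.h>

/* Formats a byte held in a signed char. In the conversion below the value is
   first promoted to int (sign-extended), so -57 becomes 0xFFFFFFC7 when int
   is 32 bits wide. The cast to unsigned int is written out so that %x
   receives the type it expects. */
static void format_byte_signed(char *out, size_t n, signed char c) {
    snprintf(out, n, "%02x", (unsigned int)c);
}

/* Converting to unsigned char first keeps only the low 8 bits, so the same
   byte is formatted as two hex digits. */
static void format_byte_unsigned(char *out, size_t n, signed char c) {
    snprintf(out, n, "%02x", (unsigned int)(unsigned char)c);
}
```

For the byte 0xC7 stored as (signed char)-57, the first helper produces ffffffc7 and the second produces c7. Applied to the loop in the question, that suggests casting each element to unsigned char (or uint8_t) before handing it to NSLog.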

I personally would use NSData and uint8_t to avoid ambiguity:

// Normalize to the precomposed form so each character has one predictable encoding.
NSString *path = [@"/User/user/ǢǨǣ" precomposedStringWithCanonicalMapping];

NSData *data = [path dataUsingEncoding:NSUTF8StringEncoding];
[data enumerateByteRangesUsingBlock:^(const void * _Nonnull bytes, NSRange byteRange, BOOL * _Nonnull stop) {
    const uint8_t *buffer = bytes;   // uint8_t is unsigned, so no sign extension occurs
    for (NSUInteger i = 0; i < byteRange.length; i++) {
        NSLog(@"%02x", buffer[i]);
    }
}];

And those last three characters came out as:

c7
a2
c7
a8
c7
a3

As an aside, when I cut and pasted your code snippet, rather than receiving c7a2 for Ǣ, I received c386 (Æ) followed by cc84 (the “combining macron”, i.e., the “combining” rendition of ¯). I do not know whether Stack Overflow or your editor introduced that, but this is a common problem when looking at hexadecimal representations of UTF-8 characters, as there are multiple possible representations of the same character. If you are really looking at hex UTF-8 representations, you may want to standardize this with, for example, precomposedStringWithCanonicalMapping, as shown above.
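To see why the two byte sequences differ, it can help to decode them into code points. The following is a rough sketch in plain C (also valid Objective-C), not a production decoder: decode_utf8 and describe_code_points are hypothetical helper names, and the code assumes well-formed, NUL-terminated UTF-8 and performs no validation.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from well-formed UTF-8 at s into *cp and returns
   the number of bytes consumed. Malformed input is NOT detected. */
static size_t decode_utf8(const unsigned char *s, uint32_t *cp) {
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12) |
              ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12) |
          ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Writes the code points of utf8 into out as "U+XXXX U+XXXX ...". */
static void describe_code_points(char *out, size_t n, const char *utf8) {
    const unsigned char *s = (const unsigned char *)utf8;
    size_t used = 0;
    if (n == 0) return;
    out[0] = '\0';
    while (*s != 0 && used < n) {
        uint32_t cp;
        s += decode_utf8(s, &cp);
        int w = snprintf(out + used, n - used, "%sU+%04X",
                         used ? " " : "", (unsigned int)cp);
        if (w < 0) break;
        used += (size_t)w;   /* if truncated, used exceeds n and the loop ends */
    }
}
```

Passing "\xC7\xA2" gives U+01E2 (the single precomposed Ǣ), while "\xC3\x86\xCC\x84" gives U+00C6 U+0304 (Æ followed by the combining macron). Both render as the same character, which is why normalizing before comparing hex dumps is worthwhile.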