Why does my std::string not contain the UTF-8 string correctly?


EDIT: I am editing my question to provide everyone with clearer information on my issues through code. I've also changed my input string from Japanese to a Greek string so kindly take note. Thank you very much!


I have this wstring input below:

wstring command = L"Σὲ γνωρίζω ἀπὸ τὴν κόψη";

This is the existing code (take note: I did not create this code) that converted the std::wstring to std::string:

string wstring2string(const wstring& str) 
{
   string str2(str.length(), L' ');
   std::copy(str.begin(), str.end(), str2.begin());
   return str2;
}

After this function, the value in the string became like this:

£r ³½ÉÁw¶É 


This function works well with non-UTF-8 and non-Unicode texts. I just can't wrap my head around why it can't work with UTF-8 texts, too.

Answer by user17732522:

This is the existing code (take note: I did not create this code) that converted the std::wstring to std::string.

The function simply copies each code unit from the original string to the output string, implicitly converting each value from wchar_t to char according to the integer conversion rules for those types. In practice, that means every code unit is truncated to its lowest byte and all other information is lost. The function does not take any encoding into account. It is completely broken.
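To make the truncation concrete, here is a small sketch that reproduces it for individual code units. The helper names truncate_code_unit and truncate_all are hypothetical and only illustrate what std::copy does in the question's function, assuming an 8-bit char. Σ is U+03A3 and ὲ is U+1F72, so their low bytes are 0xA3 and 0x72. A debugger that shows bytes in a Latin-1-like code page (an assumption about the asker's setup) would display those as £ and r, which matches the start of the garbled output.

```cpp
#include <string>

// Mimics what the implicit wchar_t -> char conversion does to one code unit:
// only the lowest 8 bits of the value survive.
unsigned char truncate_code_unit(wchar_t c)
{
    return static_cast<unsigned char>(c);
}

// Applies the same truncation to a whole string, producing exactly one
// output byte per input code unit, like the question's wstring2string.
std::string truncate_all(const std::wstring& in)
{
    std::string out;
    out.reserve(in.size());
    for (wchar_t c : in) {
        out.push_back(static_cast<char>(truncate_code_unit(c)));
    }
    return out;
}
```

Calling truncate_all(L"\u03A3\u1F72") yields two bytes, 0xA3 and 0x72, whereas the correct UTF-8 encoding of those two characters needs five bytes.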

To convert from std::wstring to std::string, you first need to know how the input and output are intended to be encoded (e.g. the system's wide and narrow execution character set encodings), and then you should use a Unicode library that offers transcoding between those two encodings.
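As an illustration of what "transcoding" involves, here is a minimal hand-written sketch, not a replacement for a proper library. It assumes the std::wstring holds UTF-16 where wchar_t is 16 bits wide (as on Windows) and UTF-32 otherwise (as on typical Linux systems), and that the desired output is UTF-8. The names append_utf8 and wide_to_utf8 are hypothetical, and the code does no validation of unpaired surrogates or out-of-range values.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends the UTF-8 encoding (1 to 4 bytes) of one code point to `out`.
void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Assumption: input is UTF-16 if wchar_t is 16 bits wide, UTF-32 otherwise.
std::string wide_to_utf8(const std::wstring& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(in[i]);
        // With 16-bit wchar_t, combine a high/low surrogate pair into one code point.
        if (sizeof(wchar_t) == 2 && cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t lo = static_cast<std::uint32_t>(in[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Unlike the one-byte-per-code-unit copy in the question, this produces a variable number of bytes per character: Σ (U+03A3) becomes the two bytes 0xCE 0xA3 and ὲ (U+1F72) becomes the three bytes 0xE1 0xBD 0xB2.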

The C++ standard library does have facilities for this (https://en.cppreference.com/w/cpp/locale/wstring_convert), but they have been deprecated since C++17 because of security and specification problems, so they should be avoided or at least used with care.

If you want to convert from the native wide character set encoding to the current C locale's narrow multibyte encoding, you can also use https://en.cppreference.com/w/cpp/string/multibyte/wcsrtombs, but then you must also make sure that the correct locale is set.
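A minimal sketch of the wcsrtombs approach might look like the following. The function name narrow_with_locale is hypothetical, and the sketch assumes the caller has already selected a suitable locale, for example with std::setlocale(LC_ALL, ""). Which locale names exist and which encodings they imply is platform dependent. In the default "C" locale, non-ASCII characters will typically fail to convert.

```cpp
#include <clocale>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

// Converts a wide string to the narrow multibyte encoding of the current
// C locale. Conversion stops at the first embedded null character.
std::string narrow_with_locale(const std::wstring& in)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* src = in.c_str();

    // First pass: a null destination only reports the number of bytes needed
    // (excluding the terminating null).
    std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("character not representable in the current locale");
    }

    // Second pass: perform the conversion into a buffer with room for the terminator.
    std::vector<char> buf(len + 1);
    src = in.c_str();
    state = std::mbstate_t();
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);
    return std::string(buf.data(), len);
}
```

The result depends entirely on the active locale: the same input can yield UTF-8 bytes under a UTF-8 locale, bytes in a legacy code page under another locale, or an error if the locale cannot represent the characters.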

A very complete Unicode solution is ICU, but for what you are asking here you only need a tiny part of it.

On POSIX systems there is iconv.

You can find third party libraries as well.