Disadvantages of using `std::wstring` for Unicode in cross-platform code?


Situation

I have a large existing Win32 C++ code-base, and I want to make it portable so that it compiles and runs on both Windows (MSVC) and Linux (GCC).

For a new project I would try to go UTF-8 Everywhere, but this existing code-base already stores and processes its text in std::wstring as UTF-16.
So I expect less upheaval, and less risk of breaking existing behavior on Windows, if I keep it that way and work with it.

Plan

So this is what text handling would look like once the code-base is cross-platform:

  • Use std::wstring for storing text in memory, and operate on it using standard library functionality that accepts std::wstring/wchar_t.
  • On Windows, where wchar_t is 16 bits wide, this means UTF-16 (2 bytes per code unit).
  • On Linux, where wchar_t is 32 bits wide, this means UTF-32 (4 bytes per code unit).
  • At the program's input/output boundaries, where text must be converted to/from other encodings, use #ifdefs to do the correct thing on each platform (see the sketch after this list).
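For illustration, a minimal sketch of what one such boundary helper could look like, assuming UTF-8 as the external encoding (the helper name `to_utf8` is hypothetical, and a real code-base would also need the reverse direction and error handling). The Windows branch uses the Win32 `WideCharToMultiByte` API; the Linux branch uses `std::wstring_convert`, which is deprecated since C++17 but still shipped by GCC:

```cpp
#include <string>

#ifdef _WIN32
  #include <windows.h>

  // UTF-16 std::wstring -> UTF-8 std::string via the Win32 API.
  std::string to_utf8(const std::wstring& w) {
      if (w.empty()) return {};
      int len = ::WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(),
                                      nullptr, 0, nullptr, nullptr);
      std::string out(len, '\0');
      ::WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(),
                            &out[0], len, nullptr, nullptr);
      return out;
  }
#else
  #include <codecvt>  // deprecated in C++17, but still available in GCC
  #include <locale>

  // UTF-32 std::wstring -> UTF-8 std::string via std::wstring_convert.
  std::string to_utf8(const std::wstring& w) {
      std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
      return conv.to_bytes(w);
  }
#endif
```

On Linux, iconv or ICU would be sturdier long-term replacements for the deprecated `std::wstring_convert` machinery.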

Question

What are the downsides/problems of this approach?

Already considered

Problems I already considered:

  • Higher memory usage compared to UTF-8 for mostly-ASCII text: roughly 2x on Windows and 4x on Linux.

  • Per-code-unit processing like std::towlower will behave differently on the two platforms for Unicode codepoints outside the Basic Multilingual Plane: those occupy two UTF-16 code units (a surrogate pair) on Windows, but a single UTF-32 code unit on Linux (see the first sketch below).

  • Some std::wstring-accepting overloads used by the current code-base, such as the std::ifstream(std::wstring, ...) constructor, are actually Microsoft-specific extensions not available on Linux/GCC, so extra platform-specific handling will be necessary in those places (see the second sketch below).
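To make the per-code-unit point concrete, here is a small self-contained sketch (the choice of U+1F600 is arbitrary; any codepoint above U+FFFF behaves the same way):

```cpp
#include <iostream>
#include <string>

int main() {
    // U+1F600 lies outside the Basic Multilingual Plane.
    std::wstring s = L"\U0001F600";

    // Windows (UTF-16): size() == 2, a surrogate pair.
    // Linux   (UTF-32): size() == 1, a single code unit.
    std::cout << s.size() << '\n';

    // Any loop that applies std::towlower (or similar) per element
    // therefore visits surrogate halves on Windows but whole code
    // points on Linux.
}
```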
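For the stream-constructor point: if C++17 is an option, one way to avoid the #ifdef entirely is to route wide paths through `std::filesystem::path`, which both platforms can construct from a std::wstring and which std::ifstream accepts portably (the wrapper name `open_for_reading` is my own):

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Opens a file for reading from a wide path without any #ifdef:
// std::filesystem::path handles the platform-specific conversion.
std::ifstream open_for_reading(const std::wstring& path) {
    return std::ifstream(std::filesystem::path(path));
}
```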

But aside from that?
