Situation
I have a large existing Win32 C++ code-base, and I want to make it portable so that it compiles and runs on both Windows (MSVC) and Linux (GCC).
For a new project I would try to go UTF-8 Everywhere, but this existing code-base already stores and processes its text in std::wstring as UTF-16.
So I expect to cause less upheaval, and have less risk of breaking existing behavior on Windows, if I keep it that way and try to work with it.
Plan
So this is what text handling would look like once the code-base is cross-platform:
- Use std::wstring for storing text in memory, and operate on it using standard library functionality that accepts std::wstring/wchar_t.
  - On Windows, this means UTF-16 (with 2 bytes per code unit).
  - On Linux, this means UTF-32 (with 4 bytes per code unit).
- At the program's input/output boundaries, where text must be converted to/from other encodings, have #ifdefs to do the correct thing on each platform (see the conversion sketch after this list).
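To make the last point concrete, here is a minimal sketch of what such a boundary helper might look like, assuming UTF-8 as the external encoding. The names Utf8ToWide/WideToUtf8 are made up for illustration, and error handling is omitted.

    #include <string>

    #ifdef _WIN32
      #include <windows.h>

      std::wstring Utf8ToWide(const std::string& utf8)
      {
          if (utf8.empty()) return std::wstring();
          int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), nullptr, 0);
          std::wstring wide(len, L'\0');
          MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
          return wide;   // UTF-16 in a 2-byte wchar_t
      }

      std::string WideToUtf8(const std::wstring& wide)
      {
          if (wide.empty()) return std::string();
          int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), nullptr, 0, nullptr, nullptr);
          std::string utf8(len, '\0');
          WideCharToMultiByte(CP_UTF8, 0, wide.data(), (int)wide.size(), &utf8[0], len, nullptr, nullptr);
          return utf8;
      }
    #else
      #include <codecvt>   // deprecated since C++17, but still shipped by libstdc++
      #include <locale>

      std::wstring Utf8ToWide(const std::string& utf8)
      {
          std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
          return conv.from_bytes(utf8);   // UTF-32 in a 4-byte wchar_t
      }

      std::string WideToUtf8(const std::wstring& wide)
      {
          std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
          return conv.to_bytes(wide);
      }
    #endif

The intent is that all platform-specific conversion lives behind this one pair of functions, so the rest of the code only ever sees std::wstring.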
Question
What are the downsides/problems of this approach?
Already considered
Problems I already considered:
- Higher memory usage compared to UTF-8.
- Per-code-unit processing like std::tolower will behave differently on the two platforms if there are Unicode code points outside the Basic Multilingual Plane (e.g., U+1F600 is a single wchar_t on Linux but a surrogate pair of two wchar_t units on Windows).
- Some std::wstring-accepting overloads used by the current code-base, such as the std::ifstream(std::wstring, ...) constructor, are actually Microsoft-specific extensions and not available on Linux/GCC, so extra platform-specific #ifdefs will be necessary in those places (see the sketch after this list).
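For the ifstream case, I expect the #ifdef to look roughly like this (a hypothetical OpenForReading wrapper, reusing the assumed WideToUtf8 helper from the earlier sketch):

    #include <fstream>
    #include <string>

    std::string WideToUtf8(const std::wstring& wide);  // from the earlier sketch

    // Hypothetical wrapper around the MSVC-only wide-path constructor.
    std::ifstream OpenForReading(const std::wstring& path)
    {
    #ifdef _WIN32
        // MSVC extension: the wide-string overload keeps non-ASCII paths intact.
        return std::ifstream(path, std::ios::binary);
    #else
        // On Linux, file APIs take narrow (byte) paths; convert at the boundary.
        return std::ifstream(WideToUtf8(path), std::ios::binary);
    #endif
    }

If both toolchains are recent enough for C++17, std::filesystem::path might avoid some of these #ifdefs, since it can be constructed from std::wstring and std::ifstream accepts a path on both platforms.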
But aside from that?