Sunday, 27 October 2019

MSVC UTF8 string encoding uses incorrect code points

I'm trying to write the character "Ā" (U+0100, https://www.fileformat.info/info/unicode/char/0100/index.htm) into a C++11 UTF-8 string literal (using the u8 prefix).

const char *const utf8 = u8"Ā";
const char *const utf8_2 = u8"\u0100";
const char *const chars = "Ā";

const int utf8_len = strlen(utf8);
const int utf8_2_len = strlen(utf8_2);
const int chars_len = strlen(chars);
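Since a debugger or console may render the same bytes differently depending on its own code page, it can help to look at the raw bytes instead of the displayed text. The following is a minimal sketch of such a check; `hex_bytes` is a hypothetical helper name and is not part of the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper (not from the original post): formats the bytes of a
// NUL-terminated string as space-separated uppercase hex pairs.
std::string hex_bytes(const char *s) {
    assert(s != nullptr);
    std::string out;
    char buf[4];
    const std::size_t n = std::strlen(s);
    for (std::size_t i = 0; i < n; ++i) {
        // Cast through unsigned char so bytes >= 0x80 are not sign-extended.
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned char>(s[i]));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Calling `hex_bytes(utf8)`, `hex_bytes(utf8_2)` and `hex_bytes(chars)` and printing the results would show exactly which bytes each compiler emitted for each literal.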

Running this under MSVC (16.2.4) results in:

utf8_len == 5
utf8_2_len == 2
chars_len == 2

Where:

utf8 == "Ä€"
utf8_2 == "Ä€"
chars == "Ä€"

The source file is set to UTF8 (without BOM).

Trying the same with Clang and GCC works as expected:

https://godbolt.org/z/PNZFCa

Does anyone know why this behaviour is occurring? Why is the u8 prefixed Unicode character being encoded as 5 bytes (when it should be 2)?
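One possible explanation, offered here as an assumption rather than a confirmed diagnosis: a compiler that sees no BOM may decode the source file using a legacy code page such as Windows-1252 instead of UTF-8. Under that assumption, the two source bytes of "Ā" (C4 80) would be read as the two characters "Ä" (U+00C4) and "€" (U+20AC), and the u8 prefix would then encode each of those as UTF-8, giving 2 + 3 = 5 bytes. The sketch below only reproduces that arithmetic; `encode_utf8`, `expected_bytes` and `double_encoded_bytes` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// What the literal should contain: U+0100 encoded once.
std::string expected_bytes() {
    return encode_utf8(0x0100);
}

// Assumed failure mode: bytes C4 80 misread as Windows-1252 characters
// U+00C4 and U+20AC, each of which is then encoded as UTF-8.
std::string double_encoded_bytes() {
    return encode_utf8(0x00C4) + encode_utf8(0x20AC);
}
```

If this is what is happening, telling MSVC that the source is UTF-8 (for example with the `/utf-8` compiler option, or by saving the file with a BOM) would be the thing to try.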
