Sunday, August 28, 2016

std::codecvt_utf8_utf16 doesn't convert UTF-8 to big-endian UTF-16

I converted a UTF-8 encoded string to UTF-16 using std::wstring_convert and std::codecvt_utf8_utf16.

Here is the sample code I tested:

#include <codecvt>   // std::codecvt_utf8_utf16
#include <cstddef>   // size_t
#include <cstdint>   // uint8_t
#include <cstdio>    // printf
#include <locale>    // std::wstring_convert
#include <string>

// Convert a UTF-8 byte string to a sequence of UTF-16 code units.
std::u16string UTF8ToWide(const std::string& utf_str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    return converter.from_bytes(utf_str);
}

// Print the raw in-memory bytes of a buffer as two-digit hex values.
void DisplayBytes(const void* data, size_t len)
{
    const uint8_t* src = static_cast<const uint8_t*>(data);
    for (size_t i = 0; i < len; ++i) {
        printf("%.2x ", src[i]);
    }
}

// the content is:"你好 hello chinese test 中文测试"
std::string utf8_s = "\xe4\xbd\xa0\xe5\xa5\xbd hello chinese test \xe4\xb8\xad\xe6\x96\x87\xe6\xb5\x8b\xe8\xaf\x95";

int main()
{
    auto ss = UTF8ToWide(utf8_s);
    DisplayBytes(ss.data(), ss.size() * sizeof(decltype(ss)::value_type));
    return 0;
}

According to the reference manual, the default std::codecvt_mode argument of the codecvt_utf8_utf16 facet is big-endian.

However, the test program displays the following bytes:

60 4f 7d 59 20 00 68 00 65 00 6c 00 6c 00 6f 00 20 00 63 00 68 00 69 00 6e 00 65 00 73 00 65 00 20 00 74 00 65 00 73 00 74 00 20 00 2d 4e 87 65 4b 6d d5 8b

which is little-endian.
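To check that reading of the dump, the first two bytes (60 4f) can be combined both ways and compared with the first character of the test string, whose UTF-8 bytes e4 bd a0 decode to U+4F60. This is only a small sketch; the helper names FromLittleEndian and FromBigEndian are made up for illustration and are not part of the test program.

```cpp
#include <cstdint>

// Hypothetical helper: build one UTF-16 code unit from two bytes,
// treating the first byte as the least significant one.
constexpr std::uint16_t FromLittleEndian(std::uint8_t first, std::uint8_t second)
{
    return static_cast<std::uint16_t>(first | (second << 8));
}

// Hypothetical helper: build one UTF-16 code unit from two bytes,
// treating the first byte as the most significant one.
constexpr std::uint16_t FromBigEndian(std::uint8_t first, std::uint8_t second)
{
    return static_cast<std::uint16_t>((first << 8) | second);
}

// Only the little-endian reading of "60 4f" yields U+4F60.
static_assert(FromLittleEndian(0x60, 0x4f) == 0x4f60, "little-endian reading");
static_assert(FromBigEndian(0x60, 0x4f) == 0x604f, "big-endian reading");
```

The same holds for the ASCII part of the dump: "20 00" is U+0020 (a space) only when the first byte is the low byte.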

I ran the test code with Visual Studio 2013 and with clang, and got the same result in both cases.

So why doesn't the big-endian mode of codecvt_utf8_utf16 have any effect on this conversion?
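One thing worth keeping in mind here: from_bytes returns a std::u16string, and DisplayBytes dumps the in-memory bytes of its char16_t elements, which are stored in the host's byte order. If the goal is a big-endian byte sequence (an assumption about what is wanted), one possible approach is to serialize each code unit explicitly, high byte first. The sketch below uses a hypothetical helper, ToBigEndianBytes, that is not part of the original program.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize UTF-16 code units as bytes with the most
// significant byte first, independent of the host's byte order.
std::vector<std::uint8_t> ToBigEndianBytes(const std::u16string& text)
{
    std::vector<std::uint8_t> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<std::uint8_t>(unit >> 8));    // high byte
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xff));  // low byte
    }
    return bytes;
}
```

Passing the result of UTF8ToWide(utf8_s) to this helper and printing the returned vector with DisplayBytes would show each pair of bytes with the high byte first, on any host.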
