Monday, 25 September 2017

Unicode: Character Count Shows More Characters in Notepad++ or MS Word

I have to validate a Unicode (UTF-8) string for invalid characters such as '^', and I also have to show the position of the invalid character in the input string. I have written the code below:

#include <string>
#include <vector>
using namespace std;

// Converts a byte offset in a UTF-8 string into a 1-based character position.
size_t getPositionByNumberOfChars(const string& input, size_t positionInBytes)
{
    size_t charPosition = 0;
    size_t bytesFromFront = 0;
    for (auto it = input.begin(), end = input.end(); it != end; ++it)
    {
        if (bytesFromFront == positionInBytes)  // Stop once the target byte offset is reached.
            break;
        bytesFromFront++;
        if ((*it & 0xc0) != 0x80)
            charPosition += 1;                  // Not a continuation byte: an ASCII char or the lead byte of a multi-byte UTF-8 char.
    }
    return charPosition + 1;
}

void validateInputString(const string& inputData)
{
    vector<string> invalidUTF8Chars = {/*Some invalid chars*/};
    for (const auto& i : invalidUTF8Chars)
    {
        size_t pos = inputData.find(i, 0);
        if (pos != string::npos)
        {
            size_t charPosition = getPositionByNumberOfChars(inputData, pos);
            .
            .
            .
        }   
    }
}

This gives the correct output for most input, but for some languages charPosition is incorrect. Example: ಭಾರ^^^^ತಭಾರತಕರ್ನಾಟಕ

For the above input, the first invalid character (^) is at position 3, but charPosition comes out as 4. Also, Notepad++ and MS Word both report a total count of 19, but there are only 14 characters (ಭಾ|ರ|^|^|^|^|ತ|ಭಾ|ರ|ತ|ಕ|ರ್ನಾ|ಟ|ಕ). Am I missing something here?
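
A likely explanation for the 19 vs. 14 mismatch is that the function (and apparently the editors) count Unicode code points, while the 14 count is of user-perceived characters (grapheme clusters), where a vowel sign such as ಾ or a virama-joined consonant belongs to the preceding letter. The sketch below illustrates that difference only. `Counts`, `countUpTo`, `isKannadaMark` and `isKannadaConsonant` are hypothetical names, the input is assumed to be valid UTF-8, and the clustering rule is a hand-written Kannada-only simplification rather than the Unicode segmentation algorithm. For real use, a Unicode library such as ICU's `BreakIterator` is the safer route, and its treatment of conjuncts may differ from this rule.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct Counts {
    std::size_t codePoints;  // what Notepad++ and MS Word appear to report
    std::size_t clusters;    // approximate user-perceived characters
};

// Hypothetical helper: Kannada dependent signs that attach to the preceding
// letter. Simplified ranges, not the full Unicode grapheme-break data.
static bool isKannadaMark(char32_t cp)
{
    return (cp >= 0x0C81 && cp <= 0x0C83) || cp == 0x0CBC ||
           (cp >= 0x0CBE && cp <= 0x0CCD) || cp == 0x0CD5 || cp == 0x0CD6;
}

// Hypothetical helper: Kannada consonants KA (U+0C95) through HA (U+0CB9).
static bool isKannadaConsonant(char32_t cp)
{
    return cp >= 0x0C95 && cp <= 0x0CB9;
}

// Counts code points and (approximate) clusters that start before byteLimit.
// Assumes utf8 holds valid UTF-8; no validation is attempted.
Counts countUpTo(const std::string& utf8, std::size_t byteLimit)
{
    assert(byteLimit <= utf8.size());
    Counts n{0, 0};
    bool afterVirama = false;
    for (std::size_t i = 0; i < byteLimit;) {
        // Sequence length follows from the lead byte.
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep the payload bits of the lead byte, then append 6 bits per continuation byte.
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len && i + k < utf8.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        i += len;
        ++n.codePoints;
        // Simplified rule: marks, and consonants right after a virama, join the
        // previous cluster; everything else starts a new one.
        const bool joins = isKannadaMark(cp) || (afterVirama && isKannadaConsonant(cp));
        if (!joins || n.clusters == 0)
            ++n.clusters;
        afterVirama = (cp == 0x0CCD);
    }
    return n;
}
```

With the example string, the first ^ sits at byte offset 9 (three Kannada code points of 3 bytes each precede it). `countUpTo(input, 9)` then gives 3 code points and 2 clusters, so the 1-based positions are 4 and 3 respectively, and `countUpTo(input, input.size())` gives 19 code points and 14 clusters under this simplified rule.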
