I have to validate a Unicode string for invalid characters such as '^', and I also have to show the position of the invalid character in the input string. I have written the code below:
size_t getPositionByNumberOfChars(const string& input, size_t positionInBytes)
{
    size_t charPosition = 0;
    size_t bytesFromFront = 0;
    for (auto it = input.begin(), end = input.end(); it != end; ++it)
    {
        if (bytesFromFront == positionInBytes) // Stop once the target byte offset is reached.
            break;
        bytesFromFront++;
        if ((*it & 0xc0) != 0x80)
            charPosition += 1; // Not a continuation byte: an ASCII char or the lead byte of a UTF-8 sequence.
    }
    return charPosition + 1; // 1-based position.
}
void validateInputString(const string& inputData)
{
    vector<string> invalidUTF8Chars = {/*Some invalid chars*/};
    for (const auto& i : invalidUTF8Chars)
    {
        size_t pos = inputData.find(i, 0);
        if (pos != string::npos)
        {
            size_t charPosition = getPositionByNumberOfChars(inputData, pos);
            .
            .
            .
        }
    }
}
This gives the correct output in most cases, but for some languages charPosition is incorrect. Example: ಭಾರ^^^^ತಭಾರತಕರ್ನಾಟಕ
For this input, the first invalid character (^) is at position 3, but the variable charPosition
comes out as 4. Also, Notepad++ and MS Word report a total count of 19, but there are only 14 characters (ಭಾ|ರ|^|^|^|^|ತ|ಭಾ|ರ|ತ|ಕ|ರ್ನಾ|ಟ|ಕ). Am I missing something here?
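To make the discrepancy concrete, here is a minimal sketch that counts the sample string two ways: by code points (which is what the check for non-continuation bytes above amounts to) and by a rough approximation of user-perceived characters. All function names here (decodeUtf8, isKannadaMark, isKannadaConsonant, countClusters, clusterPositionOfByte) are hypothetical, and the code point ranges are a hand-picked simplification for Kannada only. It assumes well-formed UTF-8 and is not an implementation of the Unicode segmentation rules; a library such as ICU's BreakIterator is likely the more robust route.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 into code points. Assumption: the input is well-formed UTF-8;
// malformed sequences are not detected.
std::vector<char32_t> decodeUtf8(const std::string& s)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();)
    {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        // Sequence length from the lead byte: 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx.
        const std::size_t len = b < 0x80 ? 1 : (b >> 5) == 0x6 ? 2 : (b >> 4) == 0xE ? 3 : 4;
        char32_t cp = len == 1 ? b : b & (0xFF >> (len + 1));
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Simplified, Kannada-only list of signs that attach to the preceding letter
// (candrabindu/anusvara/visarga, nukta, dependent vowel signs, virama, length marks).
bool isKannadaMark(char32_t cp)
{
    return (cp >= 0x0C81 && cp <= 0x0C83) || cp == 0x0CBC ||
           (cp >= 0x0CBE && cp <= 0x0CCD) || cp == 0x0CD5 || cp == 0x0CD6;
}

bool isKannadaConsonant(char32_t cp)
{
    return cp >= 0x0C95 && cp <= 0x0CB9;
}

// Count approximate user-perceived characters: a code point joins the previous
// cluster if it is a mark, or if it is a consonant directly after a virama (U+0CCD).
std::size_t countClusters(const std::string& s)
{
    const std::vector<char32_t> cps = decodeUtf8(s);
    std::size_t clusters = 0;
    for (std::size_t i = 0; i < cps.size(); ++i)
    {
        const bool joinsPrevious =
            i > 0 && (isKannadaMark(cps[i]) ||
                      (cps[i - 1] == 0x0CCD && isKannadaConsonant(cps[i])));
        if (!joinsPrevious)
            ++clusters;
    }
    return clusters;
}

// 1-based position, in approximate user-perceived characters, of the character
// that starts at byte offset bytePos (for example the result of string::find).
std::size_t clusterPositionOfByte(const std::string& s, std::size_t bytePos)
{
    return countClusters(s.substr(0, bytePos)) + 1;
}
```

With the sample string, decodeUtf8 yields 19 code points, countClusters yields 14, and clusterPositionOfByte for the first '^' (byte offset 9) yields 3. That would match both the 19 reported by Notepad++ and MS Word and the 14 counted by hand, under the assumptions above.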