lundi 26 juin 2017

std::regex and ignoring flags

After learning basic rules,I specialized my focus on std::regex, creating two console apps: 1.renrem and 2.bfind.
And I decided to create some convenient functions to deal with regex in as easy as possible plus all with std; named RFC ( = regex function collection )

There are several strange things that always make me surprise, but this one ruined all my attempt and those two console apps.

One of the important functions is count_match that counts number of match inside a string. Here is the full code:

unsigned int count_match( const std::string& user_string, const std::string& user_pattern, const std::string& flags = "o" ){

    const bool flags_has_i = flags.find( "i" ) < flags.size();
    const bool flags_has_g = flags.find( "g" ) < flags.size();

    std::regex::flag_type regex_flag                  = flags_has_i ? std::regex_constants::icase         : std::regex_constants::ECMAScript;
//    std::regex_constants::match_flag_type search_flag = flags_has_g ? std::regex_constants::match_default : std::regex_constants::format_first_only;
    std::regex rx( user_pattern, regex_flag );
    std::match_results< std::string::const_iterator > mr;

    unsigned int counter = 0;
    std::string temp = user_string;
    while( std::regex_search( temp, mr, rx ) ){
        temp = mr.suffix().str();
        ++counter;
    }

    if( flags_has_g ){
        return counter;
    } else {
        if( counter >= 1 ) return 1;
        else               return 0;
    }

}  

First of all, as you can see, the line for search_flag was commented because it is ignored by std::regex_search and I do not know why? since -- the exact flag is accepted for std::regex_repalce. So std::regex_search ignores the format_first_only but std::regex_replace accepts it. Let's it goes.

The main problem is here that the icase flag is also ignored when the pattern is character class -> []. In fact when the pattern is only capital letter or small letter: [A-Z] or [a-z]

Supposing this string s = "ONE TWO THREE four five six seven"

the output for std

std::cout << count_match( s, "[A-Z]+" ) << '\n';          // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n';     // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n';    // 3 => Global match plus insensitive  

whereas for the exact and laugauge and with boost the output is:

std::cout << count_match( s, "[A-Z]+" ) << '\n';          // 1 => First match
std::cout << count_match( s, "[A-Z]+", "g" ) << '\n';     // 3 => Global match
std::cout << count_match( s, "[A-Z]+", "gi" ) << '\n';    // 7 => Global match plus insensitive  


I know about regex flavors PCRE; or ECMAScript 262 that c++ uses it, But I have no ides why a simple flag, is ignored for the only search function that c++ has? Since std::regex_iterator and std::regex_token_iterator are also use this function internally.

And shortly, I can not use those two my apps and RFC with std library because if this!

So if someone knows according to which rule it is maybe a valid rude in ECMAScript 262 or perhaps if I am wrong anywhere please tell me. Thanks.


tested with

gcc version 6.3.0 20170519 (Ubuntu/Linaro 6.3.0-18ubuntu2~16.04)
clang version 3.8.0-2ubuntu4  

code:

perl -le '++$c while $ARGV[0] =~ m/[A-Z]+/g; print $c ;' "ONE TWO THREE four five six seven" // 3
perl -le '++$c while $ARGV[0] =~ m/[A-Z]+/gi; print $c ;' "ONE TWO THREE four five six seven" // 7  

code:

uint count_match( ref const (char[]) user_string, const (char[]) user_pattern, const (char[]) flags ){

    const bool flag_has_g = flags.indexOf( "g" ) != -1;

    Regex!( char ) rx = regex( user_pattern, flags );
    uint counter = 0;
    foreach( mr; matchAll( user_string, rx ) ){
        ++counter;
    }

    if( flag_has_g ){
        return counter;
    } else {
        if( counter >= 1 ) return 1;
        else               return 0;
    }
} 

the output:

writeln( count_match( s, "[A-Z]+", "g" ) );  // 3
writeln( count_match( s, "[A-Z]+", "gi" ) ); // 7  

code:

var s = "ONE TWO THREE four five six seven";

var rx1 = new RegExp( "[A-Z]+" , "g" );
var rx2 = new RegExp( "[A-Z]+" , "gi" );

var counter = 0;
while( rx1.exec( s ) ){
   ++counter;
}
document.write( counter + "<br>" ); // 3

counter = 0;
while( rx2.exec( s ) ){
   ++counter;
}
document.write( counter ); // 7

Okay. After testing with gcc 7.1.0 it turned out that with version below 6.3.0 the output is: 1 3 3 and but with 7.1.0 the output is 1 3 7 here is the link.

Also with this version of clang the output is correct. Here is the link. thanks to igor-tandetnik user

Aucun commentaire:

Enregistrer un commentaire