Sunday, 30 April 2017

Why is constructing std::string from an mmapped file blocking threads?

I'm trying to read and process multiple files in parallel. Right now I have two files and two parsing functions, which I call in two threads.

In both functions I construct strings from parts of the file (reading the CSV headers). The first function:

void csv_parse_items_file(const char* file, size_t fsize,
    //void(*deal)(const string&, const size_t&, const int&),
    size_t arrstart_counter = 0) {
    size_t idx = 0;
    int line = 0;
    size_t last_idx = 0;
    int counter = 0;

    cout<<"items_header before loop, thread_id="+std::to_string(thread_index())<<endl;
    map<string, int> headers;
    {
        int counter = 0;
        while (file[idx] && file[idx] != '\n') {
            if (file[idx] == '\t' || file[idx] == '\n') {
                string key(file, last_idx, idx - last_idx);
                headers[key] = counter++;
                last_idx = idx + 1;
            }

            ++idx;
        }
    }
    cout<<"items_header after loop, thread_id="+std::to_string(thread_index())<<endl;
... then the processing continues in a loop

The second function:

void csv_parse_users_file(const char* file, size_t fsize,
    //void(*deal)(const string&, const size_t&, const int&),
    size_t arrstart_counter = 0) {
    size_t idx = 0;
    int line = 0;
    size_t last_idx = 0;
    int counter = 0;
    map<string, int> headers;
    {
        int counter = 0;
        while (file[idx] && file[idx] != '\n') {
            if (file[idx] == '\t' || file[idx] == '\n') {
                string key(file, last_idx, idx - last_idx);
                headers[key] = counter++;
                last_idx = idx + 1;
            }

            ++idx;
        }
    }

When I run with this configuration, the output is:

users_mapped 86431022
items_mapped237179072
1497021
1306055
items_header before loop, thread_id=0
processed 100000users thread_id:1 
processed 200000users thread_id:1 
processed 300000users thread_id:1 
processed 400000users thread_id:1 
processed 500000users thread_id:1 
processed 600000users thread_id:1 
processed 700000users thread_id:1 
processed 800000users thread_id:1 
processed 900000users thread_id:1 
processed 1000000users thread_id:1 
processed 1100000users thread_id:1 
processed 1200000users thread_id:1 
processed 1300000users thread_id:1 
processed 1400000users thread_id:1 
finished_processing_users:1497020
0x700008d52c80 finished
items_header after loop, thread_id=0
processed 100000items, thread_id:0 
processed 200000items, thread_id:0 
processed 300000items, thread_id:0 
processed 400000items, thread_id:0 
processed 500000items, thread_id:0 
processed 600000items, thread_id:0 
processed 700000items, thread_id:0 
processed 800000items, thread_id:0 
processed 900000items, thread_id:0 
processed 1000000items, thread_id:0 
processed 1100000items, thread_id:0 
processed 1200000items, thread_id:0 
processed 1300000items, thread_id:0 
finished_processing_items:1306054

Now, if I comment out the line string key(file, last_idx, idx - last_idx); in the first function, it starts like this:

void csv_parse_items_file(const char* file, size_t fsize,
    //void(*deal)(const string&, const size_t&, const int&),
    size_t arrstart_counter = 0) {
    size_t idx = 0;
    int line = 0;
    size_t last_idx = 0;
    int counter = 0;

    cout<<"items_header before loop, thread_id="+std::to_string(thread_index())<<endl;
    map<string, int> headers;
    {
        int counter = 0;
        while (file[idx] && file[idx] != '\n') {
            if (file[idx] == '\t' || file[idx] == '\n') {
                //string key(file, last_idx, idx - last_idx);
                headers["ok"] = counter++;
                last_idx = idx + 1;
            }

The output is:

users_mapped 86431022
items_mapped237179072
1497021
1306055
items_header before loop, thread_id=0
items_header after loop, thread_id=0
processed 100000users thread_id:1 
processed 200000users thread_id:1 
processed 300000users thread_id:1 
processed 100000items, thread_id:0 
processed 400000users thread_id:1 
processed 500000users thread_id:1 
processed 200000items, thread_id:0 
processed 600000users thread_id:1 
processed 700000users thread_id:1 
processed 300000items, thread_id:0 
processed 800000users thread_id:1 
processed 900000users thread_id:1 
processed 1000000users thread_id:1 
processed 400000items, thread_id:0 
processed 1100000users thread_id:1 
processed 1200000users thread_id:1 
processed 500000items, thread_id:0 
processed 1300000users thread_id:1 
processed 1400000users thread_id:1 
finished_processing_users:1497020
0x700001870c80 finished
processed 600000items, thread_id:0 
processed 700000items, thread_id:0 
processed 800000items, thread_id:0 
processed 900000items, thread_id:0 
processed 1000000items, thread_id:0 
processed 1100000items, thread_id:0 
processed 1200000items, thread_id:0 
processed 1300000items, thread_id:0 
finished_processing_items:1306054

The header of each file is less than 1000 characters, which is tiny compared with the sizes of the files (86431022 and 237179072).

$ g++ -v
Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 8.0.0 (clang-800.0.42.1)
Target: x86_64-apple-darwin16.1.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

Compiled with g++ -pthread -c -g -std=c++11

The files are mapped with mmap(NULL, size_, PROT_READ, MAP_PRIVATE, fd_, 0);
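For context, here is a minimal sketch of how such a read-only mapping might be set up on a POSIX system. The MappedFile struct and the map_file_readonly / unmap_file helpers are hypothetical names used for illustration only; they are not the code from the actual program. The sketch keeps the mapped size next to the pointer, since a mapping is a byte range with an explicit length rather than a NUL-terminated C string.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical holder for a read-only mapping (illustration only).
struct MappedFile {
    const char* data;  // start of the mapped bytes, or nullptr on failure
    std::size_t size;  // number of valid bytes in the mapping
};

// Maps `path` read-only with the same flags as in the post.
// Returns {nullptr, 0} if the file cannot be opened, is empty, or mmap fails.
MappedFile map_file_readonly(const char* path) {
    assert(path != nullptr);
    MappedFile result = {nullptr, 0};
    int fd = open(path, O_RDONLY);
    if (fd < 0) return result;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        std::size_t size = static_cast<std::size_t>(st.st_size);
        void* p = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            result.data = static_cast<const char*>(p);
            result.size = size;
        }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
    return result;
}

// Releases a mapping obtained from map_file_readonly.
void unmap_file(const MappedFile& f) {
    if (f.data) munmap(const_cast<char*>(f.data), f.size);
}
```

With a holder like this, a parsing loop can stop at idx < size instead of relying on finding a '\0' byte somewhere in the mapped data.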

I can't figure out why constructing the strings in both threads, from two different mmapped files and with no shared variables other than cout, causes one thread to wait for the other. Are there any locks involved when constructing a std::string?
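One thing worth checking, offered as a possible explanation rather than a confirmed diagnosis: under -std=c++11, std::string has no constructor taking (const char*, size_type pos, size_type n). A call of the form string key(file, last_idx, idx - last_idx) with a const char* therefore resolves to the substring constructor std::string(const std::string&, size_type, size_type), which first converts file into a temporary std::string. That conversion scans and copies everything up to the first '\0', which for a large mapped file could be close to the whole file, once per header column. The sketch below contrasts the two forms; key_via_temporary and key_direct are hypothetical helper names, not functions from the program above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Same call shape as in the question. In C++11 this picks
// std::string(const std::string&, size_type pos, size_type n), so `data`
// is first converted to a temporary std::string (a strlen plus a copy of
// the whole NUL-terminated buffer) before the substring is taken.
std::string key_via_temporary(const char* data, std::size_t pos, std::size_t len) {
    return std::string(data, pos, len);
}

// Pointer-plus-length form: std::string(const char*, size_type n) copies
// exactly `len` characters starting at data + pos and never reads past them.
std::string key_direct(const char* data, std::size_t pos, std::size_t len) {
    assert(data != nullptr);
    return std::string(data + pos, len);
}
```

Both helpers return the same key, but only the second one does work proportional to the key length. If this is what is happening, the threads would not be blocking each other; the thread parsing the larger file would simply spend a long time in the header loop, and writing string key(file + last_idx, idx - last_idx); would avoid the cost.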
