lundi 30 août 2021

parallel push_back for vector of vector

I have a large text file around 13G which the content is an edge list of a graph. Each line has two integers uandv represent the endpoint of an edge. I want to read it to a vector of vector as an adjency vector of the graph.

Then It comes to folowing code.

const int N = 3328557;
vector<vector<int> >adj{N};

int main() {
    FILE * pFile;
    pFile = fopen("path/to/edge/list", "r");

    int u, v;
    while (fscanf(pFile, "%d%d", &u, &v) == 2) {
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    
    fclose(pFile);
}

It consumes about 7min. After some analysis, I find adj[u].push_back(v) and adj[v].push_back(u) consumes most time because of random address.

Then I use a two dimension array as cache. Once it's filled, I copy all the value to vector and clear it.

const int N = 3328557;
const int threshold = 100;

vector<vector<int> >adj{N};
int ln[N];
int cache[N][threshold];

void write2vec(int node) {
    for (int i = 0; i < ln[node]; i++)
        adj[node].push_back(cache[node][i]);
    ln[node] = 0;
}

int main() {
    FILE * pFile;
    pFile = fopen("path/to/edge/list", "r");

    int u, v;
    while (fscanf(pFile, "%d%d", &u, &v) == 2) {
        cache[u][ln[u]++] = v;
        if (ln[u] == threshold)
            write2vec(u);
        cache[v][ln[v]++] = u;
        if (ln[v] == threshold)
            write2vec(v);
    }
    
    fclose(pFile);
}

This time it consumes 5.5 min. It's still too long. Then I think the two push_back in the first code can be parallelized. But I don't know how to do. And does anyone has other idea?

Thanks.

Aucun commentaire:

Enregistrer un commentaire