I have a large text file, around 13 GB, whose content is the edge list of a graph. Each line has two integers u and v representing the endpoints of an edge. I want to read it into a vector of vectors as an adjacency list of the graph.
That led to the following code.
#include <cstdio>
#include <vector>
using std::vector;

const int N = 3328557;
vector<vector<int>> adj(N);

int main() {
    FILE *pFile = fopen("path/to/edge/list", "r");
    if (!pFile) return 1;
    int u, v;
    while (fscanf(pFile, "%d%d", &u, &v) == 2) {
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    fclose(pFile);
}
It takes about 7 minutes. After some profiling, I found that adj[u].push_back(v) and adj[v].push_back(u) consume most of the time because of random memory access.
Then I used a two-dimensional array as a cache. Once a node's row is filled, I copy all of its values to the vector and clear the row.
#include <cstdio>
#include <vector>
using std::vector;

const int N = 3328557;
const int threshold = 100;
vector<vector<int>> adj(N);
int ln[N];                // number of cached neighbors per node
int cache[N][threshold];  // per-node staging buffer

void write2vec(int node) {
    for (int i = 0; i < ln[node]; i++)
        adj[node].push_back(cache[node][i]);
    ln[node] = 0;
}

int main() {
    FILE *pFile = fopen("path/to/edge/list", "r");
    if (!pFile) return 1;
    int u, v;
    while (fscanf(pFile, "%d%d", &u, &v) == 2) {
        cache[u][ln[u]++] = v;
        if (ln[u] == threshold)
            write2vec(u);
        cache[v][ln[v]++] = u;
        if (ln[v] == threshold)
            write2vec(v);
    }
    // Flush the neighbors still sitting in the cache at EOF.
    for (int i = 0; i < N; i++)
        if (ln[i] > 0)
            write2vec(i);
    fclose(pFile);
}
This time it takes 5.5 minutes, which is still too long. Then I thought the two push_back calls in the first code could be parallelized, but I don't know how to do that. Does anyone have other ideas?
Thanks.