I have a large text file, around 13 GB, whose content is an edge list of a graph. Each line has two integers u and v representing the endpoints of an edge. I want to read it into a vector of vectors as an adjacency list of the graph. That led me to the following code.
#include <cstdio>
#include <vector>
using namespace std;

const int N = 3328557;
vector<vector<int>> adj(N);

int main() {
    FILE* pFile = fopen("path/to/edge/list", "r");
    int u, v;
    while (fscanf(pFile, "%d%d", &u, &v) == 2) {
        adj[u].push_back(v);
        adj[v].push_back(u);
    }
    fclose(pFile);
}
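For a file this size, fscanf itself can be a significant cost. One common alternative (a sketch of my own, not part of the original code) is to read large chunks with fread and parse the integers by hand; the buffer size and parser below are illustrative assumptions:

```cpp
#include <cstdio>
#include <vector>
using namespace std;

// Parse non-negative integers from a FILE* by reading large chunks
// with fread and scanning them by hand. Returns the integers in order.
static vector<int> read_all_ints(FILE* f) {
    vector<int> out;
    const size_t BUF = 1 << 20;        // 1 MiB chunks (illustrative choice)
    vector<char> buf(BUF);
    int cur = 0;
    bool in_num = false;
    size_t n;
    while ((n = fread(buf.data(), 1, BUF, f)) > 0) {
        for (size_t i = 0; i < n; i++) {
            char c = buf[i];
            if (c >= '0' && c <= '9') {        // extend the current number
                cur = cur * 10 + (c - '0');
                in_num = true;
            } else if (in_num) {               // delimiter ends the number
                out.push_back(cur);
                cur = 0;
                in_num = false;
            }
        }
    }
    if (in_num) out.push_back(cur);            // number at end of file
    return out;
}
```

The returned integers would then be consumed pairwise as (u, v). This only handles non-negative node ids, which matches the edge-list format above.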
It takes about 7 minutes. After some profiling, I found that adj[u].push_back(v) and adj[v].push_back(u) consume most of the time because of random memory access.
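Part of that cost may also be reallocation as each inner vector grows. One possible mitigation (a sketch, assuming the edges can be scanned twice or held in memory) is to count degrees first and reserve capacity, so each push_back appends into preallocated storage:

```cpp
#include <vector>
#include <utility>
using namespace std;

// Count each node's degree in a first pass, reserve exactly that much
// space, then fill the adjacency lists in a second pass.
vector<vector<int>> build_adj(int n, const vector<pair<int, int>>& edges) {
    vector<int> deg(n, 0);
    for (const auto& e : edges) {          // pass 1: degree counting
        deg[e.first]++;
        deg[e.second]++;
    }
    vector<vector<int>> adj(n);
    for (int i = 0; i < n; i++)
        adj[i].reserve(deg[i]);            // no reallocation later
    for (const auto& e : edges) {          // pass 2: fill
        adj[e.first].push_back(e.second);
        adj[e.second].push_back(e.first);
    }
    return adj;
}
```

The trade-off is a second scan of the edges in exchange for eliminating all vector reallocations and the copying they cause.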
Then I tried using a two-dimensional array as a cache: once a node's row is filled, I copy all of its values into the vector and clear the row.
#include <cstdio>
#include <vector>
using namespace std;

const int N = 3328557;
const int threshold = 100;
vector<vector<int>> adj(N);
int ln[N];                  // number of cached entries per node
int cache[N][threshold];    // roughly 1.3 GB of static storage

void write2vec(int node) {
    for (int i = 0; i < ln[node]; i++)
        adj[node].push_back(cache[node][i]);
    ln[node] = 0;
}

int main() {
    FILE* pFile = fopen("path/to/edge/list", "r");
    int u, v;
    while (fscanf(pFile, "%d%d", &u, &v) == 2) {
        cache[u][ln[u]++] = v;
        if (ln[u] == threshold)
            write2vec(u);
        cache[v][ln[v]++] = u;
        if (ln[v] == threshold)
            write2vec(v);
    }
    for (int i = 0; i < N; i++)   // flush entries still sitting in the cache
        write2vec(i);
    fclose(pFile);
}
This time it takes 5.5 minutes, which is still too long. I then thought the two push_back calls in the first version could be parallelized, but I don't know how to do that. Does anyone have other ideas? Thanks.
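On the parallelization idea: letting two threads call push_back on the same adj concurrently is a data race. One safe scheme (a sketch, assuming the edges are buffered in memory and adj is pre-sized) is to partition nodes by parity, so each vector is only ever touched by one thread:

```cpp
#include <thread>
#include <vector>
#include <utility>
using namespace std;

// Both threads scan the same edge buffer, but thread t only appends to
// adj[x] when x % 2 == t, so no vector is written by two threads.
void fill_parallel(vector<vector<int>>& adj,
                   const vector<pair<int, int>>& edges) {
    auto worker = [&](int t) {
        for (const auto& e : edges) {
            if (e.first % 2 == t)
                adj[e.first].push_back(e.second);
            if (e.second % 2 == t)
                adj[e.second].push_back(e.first);
        }
    };
    thread a(worker, 0), b(worker, 1);
    a.join();
    b.join();
}
```

Each thread reads the whole edge array twice as much work on the scan, but the expensive random-access appends are split between cores, and the result is deterministic because every adjacency list is filled by exactly one thread in edge order.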