Problem
I have some code that I need to optimize for work. Given two datasets, I need to compare every element in one dataset with every element in the other. The elements are string vectors that look like this: {"AB", "BB", "AB", "AA", "AB", ...}, where each entry takes one of three possible values: AB, BB, or AA. So for example, one dataset would be something like:
AB AA BB BB AA AB
AB AA AA AA BB AB
AA AA AB BB BB BB
while the other dataset might be
BB AB AB AA AB AB
AA AA BB BB BB BB
Note: The vector length will be the same within and between datasets. In this case, it's length 6.
So the first dataset contains three vectors and the second contains two, for a total of 6 comparisons. This example is 3 vs 2 vectors; my real problem will have something like 1.3M vs 6,000.
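To make one comparison concrete: matching the first vector of each dataset above position by position (AB AA BB BB AA AB against BB AB AB AA AB AB), only the last position agrees, so the match fraction computed in the code below is 1/6 ≈ 0.17, well under the 0.8 threshold I care about.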
Reproducible Example
The following code creates the datasets at the desired sizes, similar to how they'll show up in my real code. The first part of the main function simply generates the datasets; this part doesn't need to be optimized, because in practice the data will be read in from a file. I'm generating it here for the sake of this question. The part that needs to be optimized is the nested for loop in the latter part of the main function.
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>
// Takes in a 2D string vector by reference, and fills it to the required size with AA, AB, or BB
void make_genotype_data(int numRows, int numCols, std::vector<std::vector<std::string>>& geno) {
    std::string vals[3] = {"AA", "AB", "BB"};
    for (int i = 0; i < numRows; i++) {
        std::vector<std::string> markers;
        for (int j = 0; j < numCols; j++) {
            int randIndex = rand() % 3;
            markers.push_back(vals[randIndex]);
        }
        geno.push_back(markers);
    }
}
int main(int argc, char **argv) {
    // Timing calculation
    using timepoint = std::chrono::time_point<std::chrono::high_resolution_clock>;
    auto print_exec_time = [](timepoint start, timepoint stop) {
        auto duration_s = std::chrono::duration_cast<std::chrono::seconds>(stop - start);
        std::cout << duration_s.count() << " s\n";
    };
    // Create the data
    auto start = std::chrono::high_resolution_clock::now();
    int numMarkers = 100;
    std::vector<std::vector<std::string>> old_genotypes;
    std::vector<std::vector<std::string>> new_genotypes;
    make_genotype_data(50, numMarkers, old_genotypes);
    make_genotype_data(6000, numMarkers, new_genotypes);
    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << "*****************" << std::endl;
    std::cout << "Total time for creating data" << std::endl;
    print_exec_time(start, stop);
    std::cout << "*****************" << std::endl;
    int nCols = old_genotypes[0].size();
    float threshold = 0.8;
    // Compare old_genotypes with new_genotypes
    start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < old_genotypes.size(); i++) {
        auto og = old_genotypes[i];
        for (size_t j = 0; j < new_genotypes.size(); j++) {
            auto ng = new_genotypes[j];
            int numComparisons = 0;
            int numMatches = 0;
            for (int k = 0; k < nCols; k++) {
                // skip positions where either entry is "--"
                if (ng[k] != "--" && og[k] != "--") {
                    if (ng[k] == og[k]) {
                        numMatches++;
                    }
                    numComparisons++;
                }
            }
            float similarity = (float) numMatches / numComparisons;
            if (similarity >= threshold) {
                std::cout << i << " from old_genotypes and " << j << " from new_genotypes have high similarity: " << similarity << std::endl;
            }
        }
    }
    stop = std::chrono::high_resolution_clock::now();
    std::cout << "*****************" << std::endl;
    std::cout << "Total time for comparison" << std::endl;
    print_exec_time(start, stop);
    std::cout << "*****************" << std::endl;
}
On 6,000 vs 5,000 it takes about 4 minutes, so 6,000 vs 1.3M would take about 17 hours. It's quite slow, and I have no idea what I can do to improve the speed of the nested for loop. I'm a bit new to C++, so I don't know too many of the intricacies, but I can follow along with the jargon.
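Spelling that extrapolation out: 6,000 × 5,000 = 30M pairs took about 240 s, and 6,000 × 1.3M = 7.8B pairs is 260× as many, so roughly 260 × 240 s ≈ 62,000 s ≈ 17 hours.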
I'd really appreciate some pointers (pun intended :D) to help me optimize this. I am willing to try parallelization by breaking one of the datasets up into chunks and feeding each chunk to a core to compare against the second dataset (though I don't know how to parallelize in C++), but I only want to explore parallelization after taking the serial version as far as it can go (it will help with the parallelized version anyway).
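To make the chunking idea concrete, here is a rough, untested sketch of the shape I have in mind using std::thread (the compare_range / compare_parallel names and the per-thread result collection are just placeholders, not part of my real code): each thread takes a contiguous slice of old_genotypes, compares it against all of new_genotypes, and records its matches separately so the output doesn't interleave.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <tuple>
#include <vector>

// Compare rows [begin, end) of old_genotypes against every row of new_genotypes
// and record (i, j, similarity) for pairs at or above the threshold.
void compare_range(const std::vector<std::vector<std::string>>& old_genotypes,
                   const std::vector<std::vector<std::string>>& new_genotypes,
                   std::size_t begin, std::size_t end, float threshold,
                   std::vector<std::tuple<std::size_t, std::size_t, float>>& hits) {
    std::size_t nCols = old_genotypes[0].size();
    for (std::size_t i = begin; i < end; i++) {
        const auto& og = old_genotypes[i];
        for (std::size_t j = 0; j < new_genotypes.size(); j++) {
            const auto& ng = new_genotypes[j];
            int numComparisons = 0, numMatches = 0;
            for (std::size_t k = 0; k < nCols; k++) {
                // skip positions where either entry is "--"
                if (ng[k] != "--" && og[k] != "--") {
                    if (ng[k] == og[k]) numMatches++;
                    numComparisons++;
                }
            }
            float similarity = (float) numMatches / numComparisons;
            if (similarity >= threshold) hits.emplace_back(i, j, similarity);
        }
    }
}

// Split old_genotypes into one contiguous chunk per hardware thread and
// merge the per-thread results at the end.
std::vector<std::tuple<std::size_t, std::size_t, float>>
compare_parallel(const std::vector<std::vector<std::string>>& old_genotypes,
                 const std::vector<std::vector<std::string>>& new_genotypes,
                 float threshold) {
    std::size_t nThreads = std::thread::hardware_concurrency();
    if (nThreads == 0) nThreads = 4;  // fallback if the core count is unknown
    std::size_t chunk = (old_genotypes.size() + nThreads - 1) / nThreads;
    std::vector<std::thread> workers;
    std::vector<std::vector<std::tuple<std::size_t, std::size_t, float>>> partial(nThreads);
    for (std::size_t t = 0; t < nThreads; t++) {
        std::size_t b = t * chunk;
        std::size_t e = std::min(b + chunk, old_genotypes.size());
        if (b >= e) break;
        workers.emplace_back(compare_range, std::cref(old_genotypes),
                             std::cref(new_genotypes), b, e, threshold,
                             std::ref(partial[t]));
    }
    for (auto& w : workers) w.join();
    std::vector<std::tuple<std::size_t, std::size_t, float>> all;
    for (const auto& p : partial) all.insert(all.end(), p.begin(), p.end());
    return all;
}

The idea would be to call compare_parallel(old_genotypes, new_genotypes, threshold) in place of the nested loop in main and print the returned (i, j, similarity) triples afterwards.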