I need to traverse a vector, read each element, and map it to its value modulo a divisor. Modulo is fast when the divisor is a power of two, so I need to choose between mod and mod_power2 at runtime. A rough outline follows. Please assume that I am using templates to visit the vector.
The bit manipulation tricks were taken from https://graphics.stanford.edu/~seander/bithacks.html
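#include <cstddef>
#include <vector>

// True when v is a positive power of two.
static inline constexpr bool if_power2(int v) {
    return v && !(v & (v - 1));
}

// Fast path: only valid when num_partitions is a power of two.
// Matches val % num_partitions only for non-negative val.
static inline constexpr int mod_power2(int val, int num_partitions) {
    return val & (num_partitions - 1);
}

// General path: works for any non-zero divisor.
static inline constexpr int mod(int val, int num_partitions) {
    return val % num_partitions;
}

// Calls func(index, element) for every element of data.
template<typename Func>
void visit(const std::vector<int> &data, Func &&func) {
    for (std::size_t i = 0; i < data.size(); i++) {
        func(i, data[i]);
    }
}

// Version 1: branch once, then visit with a lambda that calls a fixed function.
void run1(const std::vector<int> &v1, int num_partitions, std::vector<int> &v2) {
    if (if_power2(num_partitions)) {
        visit(v1,
              [&](std::size_t i, int v) {
                  v2[i] = mod_power2(v, num_partitions);
              });
    } else {
        visit(v1,
              [&](std::size_t i, int v) {
                  v2[i] = mod(v, num_partitions);
              });
    }
}

// Version 2: pick a function pointer once, then call through it in the loop.
void run2(const std::vector<int> &v1, int num_partitions, std::vector<int> &v2) {
    const auto part = if_power2(num_partitions) ? mod_power2 : mod;
    visit(v1, [&](std::size_t i, int v) {
        v2[i] = part(v, num_partitions);
    });
}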
My question is about run1 vs. run2. I prefer run2 because it is easier to read and has no code duplication. But when I check both in godbolt (https://godbolt.org/z/3ov59rb5s), AFAIU, run1 is inlined better than run2.
So, is there a better way to write a run function without compromising on performance?
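One direction that might give both properties is sketched below. It has not been benchmarked. The loop body is written once in a helper template, and the runtime check only selects which instantiation to call, so the callable inside the loop has a concrete type rather than being a function pointer. The names ModPower2, ModGeneric, run_with and run3 are hypothetical and do not appear in the code above. The sketch also assumes non-negative values, since val & (n - 1) and val % n disagree for negative val.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

static inline constexpr bool if_power2(int v) {
    return v && !(v & (v - 1));
}

template <typename Func>
void visit(const std::vector<int> &data, Func &&func) {
    for (std::size_t i = 0; i < data.size(); i++) {
        func(i, data[i]);
    }
}

// Hypothetical stateless functors: each has its own type, so a call through
// one of them is a direct call that the compiler can see and inline.
struct ModPower2 {
    constexpr int operator()(int val, int n) const { return val & (n - 1); }
};
struct ModGeneric {
    constexpr int operator()(int val, int n) const { return val % n; }
};

// The loop body lives here exactly once; Part is fixed at compile time.
template <typename Part>
void run_with(const std::vector<int> &v1, int num_partitions,
              std::vector<int> &v2, Part part) {
    visit(v1, [&](std::size_t i, int v) {
        v2[i] = part(v, num_partitions);
    });
}

// The runtime branch happens once, outside the loop.
// v2 is assumed to be at least as large as v1.
void run3(const std::vector<int> &v1, int num_partitions,
          std::vector<int> &v2) {
    if (if_power2(num_partitions)) {
        run_with(v1, num_partitions, v2, ModPower2{});
    } else {
        run_with(v1, num_partitions, v2, ModGeneric{});
    }
}
```

Whether this actually matches run1 in the generated code would still need to be checked in godbolt for the compiler and flags in use.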