Monday, May 22, 2017

Distributed algorithm for calculation of Pearson cross correlation matrix partitioned by time and key

What would be a suitable algorithm for computing a Pearson cross-correlation matrix in a distributed environment where my data is partitioned by id (say, 1-4) and time (say, Jan-Dec) across different nodes?

For example:

Node A({id1, Jan}, {id2, Jan}), Node B({id3, Jan}, {id4, Jan}),
Node C({id1, Feb}, {id2, Feb}), Node A({id1, March}, {id2, March})

In other words, the Jan data for all ids is not on a single node.

Since Pearson correlation is a pairwise computation, I'm looking for a strategy that avoids shipping large amounts of data from one node to another. Transferring small intermediate results between nodes is fine. How should I partition my data by id and time so that I can efficiently calculate the cross-correlation matrix across multiple ids?

The language of choice is C++.
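To make the "small intermediate result" idea concrete, here is a minimal sketch of one possible approach, not a definitive answer. It assumes the data can be arranged so that each node holds all ids for the time points it owns (that is, partitioned by time rather than by id). Under that assumption each node only needs to keep a count, a vector of per-id sums, and a matrix of pairwise product sums, and these can be added together across nodes. The type PartialStats and its member names are hypothetical and chosen only for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-node accumulator (illustrative name, not from the post).
// Assumes every time point seen by a node carries a value for all k ids.
struct PartialStats {
    std::size_t k;              // number of ids
    std::size_t n = 0;          // number of time points accumulated so far
    std::vector<double> sum;    // sum[i]         = sum over t of x_i(t)
    std::vector<double> cross;  // cross[i*k + j] = sum over t of x_i(t) * x_j(t)

    explicit PartialStats(std::size_t ids)
        : k(ids), sum(ids, 0.0), cross(ids * ids, 0.0) {}

    // Add one time point; row[i] is the value of id i at that time.
    void add(const std::vector<double>& row) {
        ++n;
        for (std::size_t i = 0; i < k; ++i) {
            sum[i] += row[i];
            for (std::size_t j = 0; j < k; ++j)
                cross[i * k + j] += row[i] * row[j];
        }
    }

    // Fold in the statistics of another node that covers different time points.
    // Only O(k^2) numbers need to travel between nodes, regardless of n.
    void merge(const PartialStats& other) {
        n += other.n;
        for (std::size_t i = 0; i < k; ++i) sum[i] += other.sum[i];
        for (std::size_t i = 0; i < k * k; ++i) cross[i] += other.cross[i];
    }

    // Pearson correlation matrix, row-major k x k, computed from the sums.
    // Entries involving a constant (zero-variance) series are set to NaN.
    std::vector<double> correlation() const {
        std::vector<double> r(k * k, std::numeric_limits<double>::quiet_NaN());
        const double N = static_cast<double>(n);
        for (std::size_t i = 0; i < k; ++i) {
            const double var_i = N * cross[i * k + i] - sum[i] * sum[i];
            for (std::size_t j = 0; j < k; ++j) {
                const double var_j = N * cross[j * k + j] - sum[j] * sum[j];
                const double cov = N * cross[i * k + j] - sum[i] * sum[j];
                if (var_i > 0.0 && var_j > 0.0)
                    r[i * k + j] = cov / std::sqrt(var_i * var_j);
            }
        }
        return r;
    }
};
```

Each node would call add() for its local time points, the nodes would then combine their PartialStats with merge() (for example through a reduction step), and correlation() would be evaluated once on the merged result. Note that this raw-sum formula can lose precision when the means are large relative to the variances, so centering the data or using a numerically stabler update may be preferable in practice. If the data has to stay partitioned by id, the cross-product term for a pair of ids on different nodes cannot be formed locally, and one of the two series would have to be shipped.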
