Monday, March 13, 2017

A high-performance similarity cache for metric spaces

The Problem:

First of all: this is a high-performance application, so execution time is the most important aspect. I have a back-end system which computes some expensive function:

template<typename C, typename R>
R backEndFunction (C &code){
  ...
}

Here code is a 1xN vector belonging to a metric space. Notice that since N is large, computing the distance between two codes is expensive (this problem is known as the Curse of Dimensionality).
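The distance computation itself is abstracted behind cc::Distance<C,D> in the cache class below, of which only the compute member is visible in this post. Just to fix ideas, here is a hypothetical Euclidean implementation; the interface sketched here is an assumption, not the actual cc:: library:

#include <cmath>
#include <cstddef>
#include <vector>

// Assumed shape of the distance interface: only compute(a, b) is known
// from how it is used in Query() further down. Purely illustrative.
template <typename C, typename D>
struct DistanceInterface {
    virtual ~DistanceInterface() = default;
    virtual D compute(const C &a, const C &b) = 0;
};

// Example: Euclidean distance between two 1xN codes stored as vectors.
struct EuclideanDistance : DistanceInterface<std::vector<float>, float> {
    float compute(const std::vector<float> &a, const std::vector<float> &b) override {
        float sum = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            const float diff = a[i] - b[i];
            sum += diff * diff;
        }
        return std::sqrt(sum);
    }
};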

I'm designing a "similarity cache" (following an LRU policy), which sits between the user who submits query codes and the back-end system. Each cached element is a pair (actually a triple, as we'll see later) of type (C,R) containing the cached code and the associated result.

It works like this, given a query code by the user:

  1. Compute the distance between the query and each cached code and find the minimum distance. Notice that if the code is the same, the distance is 0.
  2. If the minimum distance is lower than a given threshold, it's a cache hit: move the hit element to the front of the cache and return its associated result.
  3. Otherwise, call the back-end function, insert the new pair at the front of the cache, pop the last element, and return the computed result.

Cache design:

First of all, this cache is supposed to contain, say, 10k elements. This is important, because if we had, say, 100k elements we would have to solve step 1 with a Nearest Neighbor algorithm (e.g. LSH). Instead, with so few elements, a parallel brute-force approach is still feasible.

The parallel section has been implemented using OpenMP. Since OpenMP's parallel for can't iterate over non-random-access structures, our cache can't be a simple std::queue or similar.
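Concretely, OpenMP's parallel for only handles loops in canonical form (an integer index with a trip count known up front), which in practice means indexing into a random-access container. A toy illustration of the kind of loop it can split across threads (illustrative only, not part of the cache):

#include <cstddef>
#include <vector>

int main() {
    std::vector<float> values(10000, 1.0f);
    std::vector<float> out(values.size());
    // Canonical-form loop over an index: this is what OpenMP can parallelize.
    // A std::list or std::queue offers no such random access.
    #pragma omp parallel for
    for (std::size_t i = 0; i < values.size(); ++i)
        out[i] = values[i] * 2.0f; // stand-in for the expensive distance computation
}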

So, this is my solution; it involves two data structures:

  1. std::vector<CacheElem> values, where CacheElem is a triple (code, result, listElem) and listElem is an iterator to an element of the second structure (below).
  2. std::list<size_t> lru to implement the LRU policy (you don't say?), where each element of lru stores the index of the corresponding element in values.

So, if the i-th element of lru holds the index j, then values[j].listElem is an iterator back to that same i-th element of lru (a small sanity-check sketch of this invariant follows the steps below). When the cache receives a query code:

  1. Compute, in parallel, the minimum distance between the query and all elements in values.
  2. If there is a cache hit, use the stored iterator listElem to reach the corresponding element in lru and move it to the front of the list.
  3. If there is a cache miss, compute the query result (using the back end), take the index stored in the last element of lru (the one to be evicted), pop that last element, push the index to the front of lru, and overwrite the corresponding slot of values with the query code, the result and lru.begin().

Obviously, if the cache isn't full, the whole "pop the last element" part isn't necessary.
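As promised above, here is a small sanity check of the cross-referencing invariant between values and lru. It is a hypothetical debug helper, not part of the class below; it only makes the invariant explicit:

#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical debug helper: every node of lru stores an index j into
// values, and values[j].listElem must be an iterator back to that node.
template <typename Values>
void checkLruInvariant(const std::list<std::size_t> &lru, const Values &values) {
    for (auto it = lru.begin(); it != lru.end(); ++it)
        assert(values[*it].listElem == it);
}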

The code (first version, not tested yet):

#include <functional>
#include <iostream>
#include <limits>
#include <list>
#include <memory>
#include <vector>
#include <omp.h>
// plus the header providing cc::Distance<C,D>

/**
 *  C = code type
 *  R = result type
 *  D = distance type (e.g. float for Euclidean)
 */
template <typename C, typename R, typename D>
class Cache {

typedef std::shared_ptr<cc::Distance<C,D>> DistancePtr;

public:
    Cache(const DistancePtr distance, const std::function<R(C)> &backEnd, const size_t size = 10000, const float threshold = 0);
    R Query(const C &query);
    void PrintCache();
private:
    struct Compare{
        Compare(D val = std::numeric_limits<D>::max(), size_t index = 0) : val(val), index(index) {}
        D val;
        size_t index;
    };
    #pragma omp declare reduction(minimum : struct Compare : omp_out = omp_in.val < omp_out.val ? omp_in : omp_out) initializer (omp_priv=Compare())

    struct CacheElem{
        CacheElem(const C &code, const R &result, std::list<size_t>::iterator listElem) : code(code), result(result), listElem(listElem) {}
        C code;
        R result;
        std::list<size_t>::iterator listElem; //iterator to the corresponding element in lru
    };
    DistancePtr distance;
    std::function<R(C)> backEnd;
    std::vector<CacheElem> values;
    std::list<size_t> lru;
    float threshold;
    size_t size;
};


template <typename C, typename R, typename D>
Cache<C,R,D>::Cache(const DistancePtr distance, const std::function<R(C)> &backEnd, const size_t size, const float threshold)
: distance(distance), backEnd(backEnd), threshold(threshold), size(size) {
    values.reserve(size);
    std::cout<<"CACHE SETUP: size="<<size<<" treshold="<<treshold<<std::endl;
}

template <typename C, typename R, typename D>
void Cache<C,R,D>::PrintCache() {
    std::cout<<"LRU: ";
    for(std::list<size_t>::iterator it=lru.begin(); it != lru.end(); ++it)
        std::cout<<*it<<" ";
    std::cout<<std::endl;
    std::cout<<"VALUES: ";
    for(size_t i=0; i<values.size(); i++)
        std::cout<<"("<<values[i].code<<","<<values[i].result<<","<<*(values[i].listElem)<<")";
    std::cout<<std::endl;
}

template <typename C, typename R, typename D>
R Cache<C,R,D>::Query(const C &query){
    PrintCache();
    Compare min;
    R result;
    std::cout<<"query="<<query<<std::endl;
    //Find the cached element with min distance
    #pragma omp parallel for reduction(minimum:min)
    for(size_t i=0; i<values.size(); i++){
        D d = distance->compute(query, values[i].code);
        #pragma omp critical
        {
        std::cout<<omp_get_thread_num()<<" min="<<min.val<<" distance("<<query<<" "<<values[i].code<<")= "<<d;
        if(d < min.val){
            std::cout<<" NEW MIN!";
            min.val = d;
            min.index = i;
        }
        std::cout<<std::endl;
        }
    }
    std::cout<<"min.val="<<min.val<<std::endl;
    //Cache hit
    if(!lru.empty() && min.val < threshold){
        std::cout<<"cache hit with index="<<min.index<<" result="<<values[min.index].result<<" distance="<<min.val<<std::endl;
        CacheElem hitElem = values[min.index];
        //take the hit element to top of the queue
        if( hitElem.listElem  != lru.begin() )
            lru.splice( lru.begin(), lru, hitElem.listElem, std::next( hitElem.listElem ) );
        result = hitElem.result;
    }
    //cache miss
    else {
        result = backEnd(query);
        std::cout<<"cache miss backend="<<result;
        //Cache reached max capacity
        if(lru.size() == size){
            //the last item (the one to be evicted) stores its corresponding index into values
            size_t lastIndex = lru.back();
            //remove last element
            lru.pop_back();
            //insert new element in the list
            lru.push_front(lastIndex);
            //insert new element in the value vector, replacing the old one
            values[lastIndex] = CacheElem(query, result, lru.begin());
            std::cout<<" index to replace="<<lastIndex;
        }
        else{
            lru.push_front(values.size()); //since we are about to insert a new element, its index will be the current size()
            values.push_back(CacheElem(query, result, lru.begin()));
        }
        std::cout<<std::endl;
    }
    PrintCache();
    std::cout<<"-------------------------------------"<<std::endl;
    return result;
}
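
For completeness, here is a minimal usage sketch. The scalar code type, the back end and the AbsDistance functor are hypothetical placeholders (in particular, the exact interface of cc::Distance isn't shown above, so the override below is an assumption); it only illustrates how the cache is wired together:

#include <cmath>
#include <memory>

// Hypothetical cc::Distance<double,float> implementation for scalar codes;
// only the compute() member is assumed from how Query() uses it.
struct AbsDistance : cc::Distance<double, float> {
    float compute(const double &a, const double &b) { return std::abs(a - b); }
};

double expensiveBackEnd(double code) {
    return code * code; // stand-in for the real expensive computation
}

int main() {
    auto dist = std::make_shared<AbsDistance>();
    Cache<double, double, float> cache(dist, expensiveBackEnd, /*size=*/3, /*threshold=*/0.5f);
    cache.Query(1.0);  // miss: calls the back end, caches (1.0 -> 1.0)
    cache.Query(1.2);  // hit: distance 0.2 < threshold 0.5, returns the cached 1.0
    cache.Query(10.0); // miss: cache not yet full, just appended
}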

My questions:

  1. Knowing that for each query I have to inspect all the cached elements, do you think there is a more performant solution than mine, especially regarding the choice of std::vector<CacheElem> values and std::list<size_t> lru?
  2. Assuming this is the best overall design: std::list isn't great for high-performance applications, but for this problem I didn't find a better option. Do you know of any queue-like structure that lets you move an arbitrary element to the front and also supports pop_back and push_front (as needed here)?
