Saturday, 20 January 2018

Unexpected results with word2vec algorithm

I implemented word2vec in C++. I found the original implementation hard to read, so I figured I'd re-implement it using the conveniences of C++ (std::map, std::vector, etc.).

This is the method that gets called for every training sample. l1 is the index of the first word, l2 is the index of the second word, label indicates whether the sample is positive or negative, and neu1e accumulates the gradient:

void train(int l1, int l2, double label, std::vector<double>& neu1e)
{
    // Calculate the dot product between the input word's weights (in
    // syn0) and the output word's weights (in syn1neg).
    auto f = 0.0;

    for (int c = 0; c < m__numberOfFeatures; c++)
        f += syn0[l1][c] * syn1neg[l2][c];

    // This block does two things:
    //   1. Calculates the output of the network for this training
    //      pair by applying the sigmoid activation directly (the
    //      original code uses a precomputed expTable here).
    //   2. Calculates the error at the output, stored in 'g', by
    //      subtracting the network output from the desired output
    //      and multiplying the result by the learning rate.
    auto z = 1.0 / (1.0 + exp(-f));
    auto g = m_learningRate * (label - z);

    // Multiply the error by the output layer weights.
    // (I think this is the gradient calculation?)
    // Accumulate these gradients over all of the negative samples.
    for (int c = 0; c < m__numberOfFeatures; c++)
        neu1e[c] += g * syn1neg[l2][c];

    // Update the output layer weights by multiplying the output error
    // by the hidden layer weights.
    for (int c = 0; c < m__numberOfFeatures; c++)
        syn1neg[l2][c] += g * syn0[l1][c];
}
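
On the question in the code comment ("I think this is the gradient calculation?"): the post does not state the loss function, so the following assumes the usual logistic log-likelihood reading of negative sampling. The symbols are shorthand introduced here, not names from the code: v is syn0[l1], u is syn1neg[l2], y is label and \alpha is m_learningRate. Under that assumption, the two update terms are the gradients of the log-likelihood \ell with respect to u and v, scaled by the learning rate:

```latex
\begin{aligned}
f &= v \cdot u, \qquad z = \sigma(f) = \frac{1}{1 + e^{-f}} \\
\ell &= y \log z + (1 - y) \log (1 - z) \\
\frac{\partial \ell}{\partial f} &= y - z \\
\frac{\partial \ell}{\partial u} &= (y - z)\, v, \qquad
\frac{\partial \ell}{\partial v} = (y - z)\, u \\
g &= \alpha\,(y - z)
\end{aligned}
```

So syn1neg[l2] += g * syn0[l1] is a gradient ascent step on u, and neu1e += g * syn1neg[l2] accumulates the corresponding step for v.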

This method gets called by

void train(const std::string& s0, const std::string& s1, bool isPositive, std::vector<double>& neu1e)
{
    auto l1 = m_wordIDs.find(s0) != m_wordIDs.end() ? m_wordIDs[s0] : -1;
    auto l2 = m_wordIDs.find(s1) != m_wordIDs.end() ? m_wordIDs[s1] : -1;
    if (l1 == -1 || l2 == -1)
        return;

    train(l1, l2, isPositive ? 1 : 0, neu1e);
}

which in turn gets called by the main training method.
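
The main training method is not shown in this post. As a rough sketch of what such a caller usually has to do, modelled on the original word2vec C code rather than on the repository linked below: the accumulator is reset for each input word, filled by one positive pair plus its negative samples, and only then added to the input word's vector. The names trainPair and trainStep, and the use of free-standing vectors instead of the syn0 and syn1neg members, are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// One (input, output) update, mirroring the per-sample method in the post.
// 'in' plays the role of syn0[l1], 'out' the role of syn1neg[l2].
void trainPair(const Vec& in, Vec& out, double label, double lr, Vec& neu1e)
{
    double f = 0.0;
    for (std::size_t c = 0; c < in.size(); c++)
        f += in[c] * out[c];

    const double g = lr * (label - 1.0 / (1.0 + std::exp(-f)));

    for (std::size_t c = 0; c < in.size(); c++) {
        neu1e[c] += g * out[c];  // gradient for the input vector, applied later
        out[c]   += g * in[c];   // output vector is updated immediately
    }
}

// Hypothetical outer step: one positive pair plus its negative samples.
void trainStep(Vec& in, Vec& positive, std::vector<Vec>& negatives, double lr)
{
    Vec neu1e(in.size(), 0.0);   // fresh accumulator for every step

    trainPair(in, positive, 1.0, lr, neu1e);
    for (Vec& negative : negatives)
        trainPair(in, negative, 0.0, lr, neu1e);

    for (std::size_t c = 0; c < in.size(); c++)
        in[c] += neu1e[c];       // apply the accumulated gradient to the input vector
}
```

If the caller never adds neu1e back into syn0, or never clears it between steps, the input vectors either stay at their initial values or drift with stale gradients, so those two points may be worth checking first.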

Full code can be found at

https://github.com/jorisschellekens/ml/tree/master/word2vec

With complete example at

https://github.com/jorisschellekens/ml/blob/master/main/example_8.hpp

When I run this algorithm, the top 10 words 'closest' to father are:

father
Khan
Shah
forgetful
Miami
rash
symptoms
Funeral
Indianapolis
impressed

That seems weird. Is something wrong with my algorithm?
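
For reference, the post does not say how 'closest' is measured. A minimal sketch, assuming cosine similarity over the input vectors (the measure the original word2vec distance tool uses), could look like this. The function names cosine and closest, and the parallel words/vectors layout, are hypothetical and not taken from the linked repository.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Cosine similarity between two vectors of equal length.
double cosine(const std::vector<double>& a, const std::vector<double>& b)
{
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (std::size_t c = 0; c < a.size(); c++) {
        dot += a[c] * b[c];
        na  += a[c] * a[c];
        nb  += b[c] * b[c];
    }
    if (na == 0.0 || nb == 0.0)
        return 0.0;              // treat a zero vector as similar to nothing
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

// Returns the k words whose vectors are most similar to vectors[query].
// The query word itself is included, so it normally comes first.
std::vector<std::string> closest(std::size_t query,
                                 const std::vector<std::string>& words,
                                 const std::vector<std::vector<double>>& vectors,
                                 std::size_t k)
{
    std::vector<std::pair<double, std::size_t>> scored;
    for (std::size_t i = 0; i < vectors.size(); i++)
        scored.emplace_back(cosine(vectors[query], vectors[i]), i);

    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      [](const auto& x, const auto& y) { return x.first > y.first; });

    std::vector<std::string> result;
    for (std::size_t i = 0; i < k; i++)
        result.push_back(words[scored[i].second]);
    return result;
}
```

Ranking by raw dot product or Euclidean distance instead can favour words with unusually large or small vectors, which is one possible source of odd-looking neighbours.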
