Practical Experience with Word2Vec

I have been playing with word2vec on and off since it was released in 2013. It is a neural network that scans vast amounts of text and produces a vector representation of each word that can do some useful things. There are quite a few intros posted online – like this one [http://radimrehurek.com/2014/12/making-sense-of-word2vec/]. If you are not familiar with word2vec, go read one of those first and then come back. What I want to talk about here are some of the practical issues I came across when trying to get up and running.

Which code?

The canonical implementation of word2vec is the C code released by Tomas Mikolov [https://code.google.com/p/word2vec/], the original author of word2vec. Gensim by Radim Rehurek [http://radimrehurek.com/gensim/index.html] is a fantastic alternative: Python makes 'playing with' these models very easy, and Gensim is a robust and surprisingly fast Python implementation. (There are also a number of other implementations, but I don't have any experience with them.)

Using a pretrained model

The easiest way to start working with word2vec is just to load up a pretrained model. The original authors released a model (the Google News vectors) trained on around 100 billion words! It is readily accessible and of high quality. Loading it up is pretty easy; be warned, however, that it requires at least 4GB of RAM.
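
For concreteness, here is a minimal sketch of loading the Google News vectors with a reasonably recent version of Gensim. The file path is illustrative (use wherever you saved the download); older Gensim versions expose the same loader as Word2Vec.load_word2vec_format.

    # A minimal sketch of loading the pretrained Google News vectors with Gensim.
    # Assumes GoogleNews-vectors-negative300.bin has already been downloaded;
    # the path is illustrative.
    from gensim.models import KeyedVectors

    # binary=True because the Google News file is in the binary word2vec format.
    # Expect this to need roughly 4GB of RAM, as noted above.
    vectors = KeyedVectors.load_word2vec_format(
        'GoogleNews-vectors-negative300.bin', binary=True)

    print(vectors.most_similar('king', topn=5))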

Understanding vectors

In some intro material, word vectors are explained with the king - man + woman example, which goes something like this: imagine that the vector for king has one co-ordinate correlated with 'royalty' and another correlated with 'maleness'. So by subtracting 'maleness' and adding 'femaleness' we can end up with a word like queen.[REF TODO] That is OK as a starting point if you are not used to thinking about word vectors. However, be warned: the co-ordinates of each vector do not actually work like this. I spent a long time thinking this way, and it leads to questions like 'can we find words that are half male?', 'can we find words that are equal parts male and female?', 'what do words with a negative male co-ordinate represent?' As far as I can tell these are all meaningless questions. In reality, 'maleness' is in some way distributed across all of the co-ordinates of the vector.
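
To make this concrete: the analogy works on whole vectors, not on any single co-ordinate. A quick check (re-using the `vectors` object loaded in the sketch above) shows that the difference between two related words is spread across essentially all 300 dimensions:

    import numpy as np

    # The difference between 'king' and 'queen' is not confined to one
    # 'maleness' co-ordinate; it is spread across nearly all 300 dimensions.
    diff = vectors['king'] - vectors['queen']
    print(np.count_nonzero(np.abs(diff) > 0.01), 'of', diff.shape[0], 'co-ordinates differ')

    # The analogy itself still works at the level of whole vectors:
    print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))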

The curse of dimensionality

Word vectors live in high-dimensional spaces; for word2vec, 300-500 dimensions are not uncommon. These sorts of spaces are very hard to reason about (see especially 'The Curse of Dimensionality' [http://en.wikipedia.org/wiki/Curse_of_dimensionality]). I can't really give much practical advice other than to say it is pretty normal for your head to hurt when trying to think about them.
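
One toy illustration of how intuition breaks down (plain NumPy, nothing to do with word2vec specifically): in 300 dimensions, randomly chosen directions are almost always nearly orthogonal to each other, which is quite unlike the 2D or 3D picture most of us carry around.

    import numpy as np

    # Random 300-dimensional unit vectors: pairwise cosine similarities
    # cluster tightly around zero (roughly +/- 1/sqrt(300), about 0.06).
    rng = np.random.default_rng(0)
    vecs = rng.standard_normal((1000, 300))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

    sims = vecs @ vecs.T
    off_diag = sims[~np.eye(1000, dtype=bool)]
    print('mean:', off_diag.mean(), 'std:', off_diag.std())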

Training your own model

When you want to start training your own model (rather than using a pretrained one), there are a few things to watch for. Firstly, it is important to appreciate that there is some inherent randomness in training, so no two models will be exactly the same. This makes reproducibility, and comparing different runs, a little bit difficult. There seem to be some efforts within Gensim to manage this, but I haven't really looked into that aspect of it. There are a number of critical things to decide before training a model (a minimal code sketch appears below):

  • What corpus will you use? That is, what data will you train your model on?
  • What algorithm will you use? (word2vec actually implements a couple of different algorithms, e.g. CBOW and skipgram)
  • What parameters will you use when training?
  • How will you evaluate your model?

Mikolov released 'big-data.sh' late last year (2014), which gives a good baseline for training your own model. My comments all relate to training a general-purpose English language model.
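
To show where those decisions end up in code, here is a minimal Gensim training sketch. The corpus file name is hypothetical (one sentence per line), and the parameter values are illustrative rather than recommendations. This assumes a recent Gensim (4.x); older versions name vector_size and epochs as size and iter.

    # A minimal Gensim training sketch. 'my_corpus.txt' is a hypothetical
    # plain-text file with one sentence per line; parameter values are
    # illustrative only.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence('my_corpus.txt')  # streamed, not loaded into RAM

    model = Word2Vec(
        sentences,
        sg=1,             # 1 = skipgram, 0 = CBOW
        hs=0,             # 1 = hierarchical softmax, 0 = negative sampling
        negative=5,       # number of negative samples (when hs=0)
        vector_size=300,  # dimensionality of the word vectors
        window=5,         # context window size
        min_count=5,      # ignore words rarer than this
        workers=4,        # parallel training threads
        epochs=5,         # passes over the corpus
    )

    model.save('my_model.w2v')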

Corpus

Word2vec really relies on a large amount of text, and it took me a while to appreciate just how large that needs to be. If you don't have a large amount of data, the word2vec tools support multiple passes (iterations) over your data to try and improve the quality of the model (see the sketch after the list below).

  • text8 (from here [http://mattmahoney.net/dc/textdata]) is a good starting point. It is not really big enough for anything serious, but training is quick and gives OK results, which makes it good for prototyping, sanity checks and so on.
  • text9 contains about 120M words. It is big enough that you start to get some meaningful results. Training a model on it with Gensim takes at least 8GB of RAM, and takes 6-12 hours on a 2013 quad-core i5-3580.
  • Mikolov's big-data corpus contains about 8.5B words and requires at least 11GB of RAM to train with Gensim. It took over 4 days to train on my i5 using the CBOW algorithm (with 3 iterations) and almost 8 days using the skipgram algorithm.
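
For text8/text9-style files (one enormous line of space-separated tokens), Gensim ships a helper that streams the file in chunks, and the number of passes is just a parameter. A sketch, assuming text9 has been downloaded to the working directory:

    # Streaming a text8/text9-style corpus and making multiple passes over it.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import Text8Corpus

    corpus = Text8Corpus('text9')  # yields chunks of tokens without loading the whole file

    # 10 passes over roughly 120M words, per the rule of thumb discussed below.
    model = Word2Vec(corpus, sg=1, vector_size=300, workers=4, epochs=10)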

Iterations and Corpus Size

One of the basic premises of word2vec seems to be that data beats algorithms, in the sense that a simpler algorithm trained on more data is better than a more sophisticated algorithm that takes longer to train. So the choice of algorithm needs to be weighed against training time. For example, when training a model on the text9 data set, I switched on both the 'skipgram' and 'hierarchical softmax' options (with 1 iteration) and performance on the 'questions-words' analogy task improved significantly. However, training time also increased significantly. If I simply used the skipgram algorithm with more iterations, the performance was better again for the same total training time.

As a rule of thumb (which admittedly is pretty arbitrary), I would suggest that you need at least 1 billion words to train a reasonable model (again, this is for general-purpose English only). In the case of the text9 corpus (about 120M words), this suggests that around 10 iterations over the data are needed. (I did not find much improvement beyond 10 iterations with text9.) Mikolov suggests 3 iterations of 8.5B words each in his big-data script.
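
As a sketch of the evaluation side: recent Gensim versions can score a model against the questions-words analogy file from the original word2vec release (older versions exposed this as an accuracy() method on the model instead). This re-uses the model saved in the earlier training sketch.

    # Scoring a trained model on the 'questions-words' analogy task.
    # Assumes 'questions-words.txt' (from the original word2vec release)
    # is in the working directory.
    from gensim.models import Word2Vec

    model = Word2Vec.load('my_model.w2v')
    score, sections = model.wv.evaluate_word_analogies('questions-words.txt')
    print('overall analogy accuracy: %.3f' % score)

    # Per-section breakdown (capital-common-countries, family, and so on).
    for section in sections:
        total = len(section['correct']) + len(section['incorrect'])
        if total:
            print(section['section'], len(section['correct']) / total)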
