Practical Experience with Word2Vec

I have been playing with word2vec on and off since it was released in 2013. It is a neural network that scans vast amounts of text and produces a vector representation of each word that can do some useful things. There are quite a few intros posted online – like here [http://radimrehurek.com/2014/12/making-sense-of-word2vec/]. If you are not familiar with word2vec – go read one of these first then come back. What I wanted to talk about here was just some of the practical issues I came across when trying to get up and running.

Which code?

The canonical implementation of word2vec is the C code released by Tomas Mikolov [https://code.google.com/p/word2vec/] – the original author of word2vec. Gensim by Radim Rehurek [http://radimrehurek.com/gensim/index.html] is a fantastic alternative: Python makes ‘playing with’ these models very easy, and Gensim is a robust and surprisingly fast Python implementation. (There are also a number of other implementations, but I don’t have any experience with them.)

Using a pretrained model

The easiest way to start working with word2vec is just to load up a pretrained model. The original authors released a model trained on around 100 billion words! It is readily accessible and of high quality. Loading it up is pretty easy – however, be warned that it requires at least 4GB of RAM.
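As a rough sketch, loading the pretrained GoogleNews vectors with Gensim looks something like the following. The loader name has moved around between Gensim versions (older releases exposed it as `Word2Vec.load_word2vec_format`, newer ones as `KeyedVectors.load_word2vec_format`), and the file path here is just wherever you downloaded the model to – adjust both for your setup.

```python
# Sketch: loading the pretrained GoogleNews word2vec model with Gensim.
# Assumes Gensim is installed and the ~1.5GB model file has already been
# downloaded; the loader's location varies between Gensim versions.

def load_pretrained(path="GoogleNews-vectors-negative300.bin.gz"):
    """Load a word2vec model stored in the original C binary format."""
    # Import inside the function so this sketch parses without Gensim installed.
    from gensim.models import KeyedVectors
    # binary=True because the GoogleNews file is in the binary C format.
    return KeyedVectors.load_word2vec_format(path, binary=True)

# Usage (needs at least 4GB of RAM, as noted above):
# vectors = load_pretrained()
# print(vectors.most_similar("king", topn=5))
```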

Understanding vectors

Some intro material explains word vectors with the king − man + woman example, which goes something like this: imagine that the vector for king has one co-ordinate correlated with ‘royalty’ and another correlated with ‘maleness’. By subtracting ‘maleness’ and adding ‘femaleness’ we can end up with a word like queen.[REF TODO] That is OK as a starting point if you are not used to thinking about word vectors. However, be warned – the co-ordinates of each vector do not actually work like this. I spent a long time thinking they did, and it leads to questions like ‘can we find words that are half male?’, ‘can we find words that are equal parts male and female?’, ‘what do words with a negative male co-ordinate represent?’ As far as I can tell these are all meaningless questions. In reality ‘maleness’ is somehow distributed across all of the co-ordinates of the vector.
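To make the arithmetic concrete, here is a toy sketch using made-up 4-dimensional vectors (real models use hundreds of dimensions, and – as just noted – no single co-ordinate actually means ‘maleness’). The numbers are invented purely for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 4-dimensional vectors, purely for illustration.
vectors = {
    "king":  [0.8, 0.7, 0.1, 0.3],
    "man":   [0.1, 0.9, 0.2, 0.1],
    "woman": [0.1, 0.1, 0.9, 0.2],
    "queen": [0.8, 0.0, 0.8, 0.4],
    "apple": [0.0, 0.2, 0.1, 0.9],
}

# king - man + woman, component by component.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Nearest remaining word by cosine similarity (excluding the inputs).
best = max((w for w in vectors if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # -> queen
```

This is exactly what the `most_similar(positive=[...], negative=[...])` style of query in word2vec tools does, just over hundreds of dimensions and millions of words.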

The curse of dimensionality

Words are vectors in high-dimensional spaces – for word2vec, 300–500 dimensions are not uncommon. These sorts of spaces are very hard to reason about (see especially ‘The Curse of Dimensionality’ [http://en.wikipedia.org/wiki/Curse_of_dimensionality]). I can’t really give much practical advice other than to say it is pretty normal for your head to hurt when trying to think about them.

Training your own model

When you want to start training your own model (rather than using a pretrained one), there are a few things to watch for. Firstly, it is important to appreciate that there is some inherent randomness in the models – no two models will be the same. This makes reproducibility, and comparing different iterations, a little difficult. There seem to be some efforts within gensim to manage this – but I haven’t really looked into that aspect of it. There are a number of critical things to decide when training a model:

  • What corpus will you use? That is, what data will you train your model on?
  • What algorithm will you use? (word2vec actually implements a couple of different algorithms.)
  • What parameters will you use when training?
  • How will you evaluate your model?

Mikolov released ‘big-data.sh’ late last year (2014), which gives a good baseline for training your own model. My comments below all relate to training a general-purpose English language model.

Corpus

Word2vec really relies on a large amount of text, and it took me a while to appreciate just how large that needs to be. If you don’t have a large amount of data, the word2vec tool supports multiple passes (iterations) over your data to try to improve the quality of the model.

  • text8 (from here [http://mattmahoney.net/dc/textdata]) is a good starting point – not really big enough for anything serious, but training is quick and gives OK results, which makes it good for prototyping, sanity checks and so on.
  • text9 contains about 120M words. It is big enough that you start to get some meaningful results. Training this model (with Gensim) takes at least 8GB of RAM; a model can be trained in 6–12 hours on a 2013 quad-core i5-3580.
  • Mikolov’s big-data corpus contains about 8.5B words. It requires at least 11GB of RAM to build with Gensim. It took over 4 days to train on my i5 using the CBOW algorithm (with 3 iterations), and almost 8 days using the skipgram algorithm.

Iterations and Corpus Size

One of the basic premises of word2vec seems to be that data beats algorithms – in the sense that a simpler algorithm trained on more data beats a more sophisticated algorithm that takes longer to train. So, in a sense, the choice of algorithm needs to be weighed against training time. For example, when training a model on the text9 data set, I combined the ‘skipgram’ algorithm with ‘hierarchical softmax’ (with 1 iteration), and performance on the ‘questions-words’ analogy task improved significantly. However, training time also increased significantly. If I simply used skipgram with more iterations, the performance was better again for the same training time. As a rule of thumb (which is admittedly pretty arbitrary), I would suggest you need at least 1 billion words to train a reasonable model (again – this is for general-purpose English only). In the case of the text9 corpus, at about 100M words, this suggests at least 10 iterations over the data are needed. (I did not find much improvement beyond 10 iterations with text9.) Mikolov suggests 3 iterations over 8.5B words in his big-data script.
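In Gensim, the algorithm and iteration choices above boil down to a few constructor arguments. A minimal sketch follows – note that the parameter names have changed between Gensim releases (older versions used `size` and `iter` where newer ones use `vector_size` and `epochs`), so treat this as illustrative rather than exact:

```python
# Sketch: training a skipgram model with several passes over the corpus.
# Parameter names here follow Gensim 4.x; older releases used `size`/`iter`.

def train_skipgram(sentences, dims=300, iterations=10, threads=4):
    """sentences: any iterable of token lists, e.g. Gensim's Text8Corpus."""
    # Import inside the function so this sketch parses without Gensim installed.
    from gensim.models import Word2Vec
    return Word2Vec(
        sentences,
        vector_size=dims,   # dimensionality of the word vectors
        sg=1,               # 1 = skipgram, 0 = CBOW
        hs=0, negative=5,   # negative sampling; set hs=1 for hierarchical softmax
        epochs=iterations,  # number of passes over the corpus
        workers=threads,    # parallel training threads
    )

# Usage, e.g. with the text8/text9 corpora mentioned above:
# from gensim.models.word2vec import Text8Corpus
# model = train_skipgram(Text8Corpus("text8"), iterations=10)
```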

Python – pip and virtualenv on Ubuntu

TL;DR

If you are developing in Python on Ubuntu, ensure that you use the Ubuntu-provided pip and virtualenv:

sudo apt-get install python-pip

sudo apt-get install python-virtualenv

sudo pip install virtualenvwrapper  (assuming you like virtualenvwrapper – the Ubuntu-packaged virtualenvwrapper is too horribly out of date to use)


 

So I have just spent two evenings battling with one particular wrinkle of Python on Ubuntu that I found a little hard to resolve – even after copious googling – so I wanted to write up my case here in case it helps others.

So this is related to doing Python (2) development on Ubuntu. I have done some Python development previously on Ubuntu and had followed setup guides such as this one: http://docs.python-guide.org/en/latest/starting/install/linux/ 

This worked fine for me. virtualenv and pip all seemed to work fine and life was good. However I had already fallen into one trap and I was about to fall into another.

The second, more minor, trap I was about to fall into was accidentally running ‘pip install’ when not inside a virtualenv. This installs into the global scope rather than the virtualenv, which just ends up being messy. Even worse is running ‘sudo pip install’, which installs things differently again. So the MUST DO from all of this is to add the following to my .bashrc:

export PIP_REQUIRE_VIRTUALENV=true

This essentially means that if you try to run pip when not inside a virtualenv, you get an error message instead – which helps keep your environments much cleaner. This seems particularly important on Ubuntu, where Python is used by the system for so many things, so keeping your project’s dependencies clean is that much harder.
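The export above can be paired with an escape hatch for the rare occasions when you genuinely want a global install. The `gpip` function name here is just my suggestion, not a standard tool:

```shell
# In ~/.bashrc: refuse pip installs outside a virtualenv...
export PIP_REQUIRE_VIRTUALENV=true

# ...but keep an explicit escape hatch for deliberate global installs.
# (The name `gpip` is arbitrary – any name will do.)
gpip() {
    PIP_REQUIRE_VIRTUALENV=false pip "$@"
}
```

Having to type `gpip` makes a global install a conscious decision rather than an accident.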


But the really tricky one was that the Ubuntu Python and the ‘official’ Python put their packages in different locations (/usr/lib/python2.7 versus /usr/local/lib/python2.7/dist-packages). Mostly this did not cause me any issues – until I hit the scenario where I wanted to install, and develop on, a library alongside a virtualenv.

Following standard practice I forked/cloned the library from git and ran:

python setup.py develop

(This was in a directory outside my main project folder, since I planned to use this library in other projects.) However, I was not able to ‘import project’ within my virtualenv – it couldn’t be found. I could have manually added paths to my PYTHONPATH to resolve this, but it “felt wrong”. After much digging, the reason boiled down to the fact that my ‘Ubuntu provided’ Python had installed a link to the library in /usr/local/lib/python2.7 – while my ‘pip installed’ virtualenv was expecting it in /usr/lib/python2.7.
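When an import can’t be found, the quickest diagnostic is to ask Python where it is actually looking, and where a module actually resolves from. Here the stdlib `json` module stands in for whatever library you are trying to import:

```python
# Diagnose "module not found" path confusion: print where Python searches
# and where a given module actually resolves from.
import sys
import json  # stand-in for the library you are trying to import

# The directories Python searches, in order:
for p in sys.path:
    print(p)

# Where the module was actually loaded from. Whether this path is inside
# or outside your virtualenv tells you immediately which install "won".
print(json.__file__)
```

Run this both inside and outside the virtualenv and compare the output – the mismatch between /usr/lib and /usr/local/lib shows up straight away.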

Once I realised this, there was a bunch of pain involved in removing everything I had ‘sudo pip installed’ or ‘pip installed’ outside of the virtualenv, and then installing the Ubuntu versions (see the TL;DR above).

Information junk food

It seems to be the ‘new normal’ to bemoan how easy it is to overdo passive consumption of information on the internet. Facebook, Twitter, random blog posts such as this one – all consume our time but don’t always add a lot of value. This post is the most recent one that I came across. All true – but almost a truism. Even the comments over on Hacker News were pretty predictable. One of the ‘predictable’ comments compared this sort of low-quality information with junk food. Most of these articles seem to end with general exhortations to ‘watch what we read’.

To take the ‘junk food’ analogy further: the invention of junk food (low-quality, easy-to-eat food) eventually led to the creation of nutrition labels, which at least make the nutritional content of the food visible to people. So what would ‘nutrition labels’ on our information look like? Is this a useful analogy?

Moving house

I recently moved house in the real world. Always a hassle – but a welcome one at least. I am not sure whether I was inspired by the physical move or not, but I finally got around to moving my blog from Tumblr to WordPress. I had really enjoyed Tumblr, but I was finding that I wanted to post more ‘categorised’ content. Specifically, I had a bunch of tech stuff that I had posted and wanted to continue posting – and also a bunch of sailing-related stuff that I wanted to post. WordPress seemed a bit better suited to this, so (assuming my domain migrates correctly over the next few minutes) http://edmiston.id.au is now hosted on wordpress.com.

TechEmpower’s Web Framework Benchmarks

I have been following the ‘Web Framework Benchmarks’ run by TechEmpower over the last couple of months. Last year I spent a bit of time playing with some Python frameworks, so I was interested to see how they stacked up.

Web benchmarks

Many people loathe benchmarks – because it is so hard to be equitable between the candidates, and even then it is very hard to interpret their significance for a given application. (In other words, the results can be highly dependent upon the specifics of the tests, and your particular application’s requirements may vary significantly from the benchmark.)

There is a place, though, for benchmarks that provide a level (if somewhat subjective) playing field. Further, the best way to run a benchmark is in the open, so that every framework community can give it their best shot and see how they go. It seemed to me that the TechEmpower guys were approaching this in the right spirit. It has been very instructive to see the same relatively simple tasks coded in different languages and frameworks.

Python Frameworks

I am a part-time hacker. Each year I tend to find a project to spend a bit of time on… two years ago it was a Rails/Ruby project, and last year it was a Flask/Python one. This year I was toying with the idea of trying out either a functional language (Scala/Haskell/Clojure) or trying Go. And then I saw this benchmark! At first the results of the JVM-based languages (e.g. Clojure) looked really promising, and Go showed flashes of greatness. I was tempted to jump on the Go bandwagon – but I felt that the Python frameworks were not being given much attention, so instead I spent a bit of time on them.

SQLAlchemy sucks cycles

I am pretty comfortable working in SQL, so I had always been in two minds about the value of an ORM. Looking at the TechEmpower benchmarks, one thing that jumps out is that SQLAlchemy’s ORM does incur a performance penalty. So I guess the moral is: if you want a well documented, well rounded, fully featured ORM for Python, use SQLAlchemy; but if you are comfortable in SQL, just use that instead. (Disclaimer: the non-ORM test for Bottle still uses SQLAlchemy to provide connection pooling. It really does seem to be the ORM layer of SQLAlchemy, and not the Core, that incurs the hit in these tests.)

PyPy on Amazon EC2

In the most recent round 5 tests, though, a community contribution added PyPy support for the Flask tests. For tests running on Amazon EC2, PyPy produced some pretty significant – in some cases stellar – performance improvements. Unfortunately this did not seem to be the case on dedicated i7 hardware – further proving the ‘horses for courses’ nature of these benchmarks and the need to always run your own tests on your own setup.

First appearances are deceiving

The very first appearance of Python in these tests was a Django test, and it appeared 4th from the bottom of the list, showing only 5.9% of the throughput of the best performer (single db query on EC2). At that point it was easy to write off Python as a dynamic, interpreted language that was always going to be slow and was ‘yesterday’s news’. However, in the most recent (round 5) tests, the best-performing Python framework is at 32.8% on the same test – an improvement of more than five-fold! Optimisation can suffer from diminishing returns, but at least for Python the ‘obvious’ optimisations can make a big difference. In almost all tests, Python ended up within a factor of 2 of the best Go-based framework (revel and webgo were included in round 5).

Bottle surprises

I remember seeing Bottle when I first came across Flask. It looked very neat, but it didn’t have the mindshare of Flask, so initially I went with Flask. However, in these benchmarks Bottle showed a clear performance advantage. I will definitely be looking into it further.

Python FTW

So I entered this thinking that Python would struggle to compete with the likes of Go and Java on these tests, and that the exercise might be the ‘nail in the coffin’ for Python. And I am not claiming that Python is anywhere close to beating these platforms. However, the performance gap between them is not an order of magnitude – typically a factor of 2 slower, not the 10 or 20 times it seemed in the early tests. Against this you have to trade off the maturity of existing libraries.

Having said that, there are definitely many ‘tricks’ to know to get the most performance out of Python web apps – from database connections, to choice of JSON library, to choice of web server and Python interpreter. From the discussion I have seen, the same is true for all the platforms/languages – which is why it has been so useful and interesting to watch the results over multiple rounds as each test gets optimised. Given the maturity of Python, some of these choices can be pretty confusing, though. There is a load of information available, but not all of it is up to date, and what was the best choice 3 years ago is probably not the best choice today.

Conclusion

Ultimately these benchmarks have reinforced that almost any of these languages (Python, Go, Scala, PHP, Clojure, Java etc.) is capable of producing ‘performant enough’ applications. I guess the main point of this post is that I was surprised how competitive Python was.

Sailing

In addition to my interest in tech stuff, I am also an avid sailor. Since December I have been sailing a Spiral. This is a fun class – a lot like a Laser (if you know what they are), except a tad smaller.

So I have learnt a lot over the past couple of months – and since there isn’t a lot of content on line about Spiral sailing, I am going to add some here.

I met with some success at the recent NSW State Titles (3rd overall, 2 heat wins and a 3rd). Before that I had a pretty ordinary showing at the National Titles. I mention this simply to set the context for any comment I may make.

I plan/hope to post on a couple of different topics:

  • Boat: The Spiral is a one design class – but there are still plenty of things that can slow you down, and some areas for innovation.
  • Fitness: As a single hander – power to weight ratio is critical. I will post about my approach to fitness
  • Technique: Some specific aspects of boat handling and technique that I have learnt
  • Mind games: My weakness… but I will add some comments on this too.

Privacy vs Desire

There seems to be a huge uproar about the internet’s assault on my privacy.

Most recently there was a bit of a hubbub about Google changing their privacy policy, and a rush to ‘opt out’ to prevent the big G from sharing data between the various services I use – in order to protect my privacy.

At the very same time, the internet is stealing hours and hours of precious time – checking for news, reading Facebook and Twitter and so on. And this theft is in many cases deliberate. (I posted some thoughts about ‘desire’ and the internet recently.)

Why is there no uproar about the assault on my time? It seems to me that a loss of privacy that may cause ‘potential harm’ is of little consequence compared to the ‘actual time’ I am losing. So if giving Google increased access to my personal data allows them to make their services more useful and relevant for me, it seems like a good trade.

I was looking for a quote to close this post. A recently famous Steve Jobs quote came to mind, but a far older one is more appropriate:

“If time be of all things the most precious, wasting time must be the greatest prodigality.” – Benjamin Franklin

Spiral – Boat Setup

Some of the things I have learnt about boat setup… starting from the back:

  • Rudder: The centre bolt needs to be regularly checked to make sure that it is tight enough that the rudder can’t wobble in the rudder box. As a guide, the rudder should be able to stay in any position you leave it without it falling down when you wheel the boat into the water.
  • Traveller: The hardest one for me to get right. It will be the subject of a post in its own right
  • Mainsheet: Class rules allow a 1:1 mainsheet at the outboard end of the boom. Most sailors are currently using 2:1 purchase. The 1:1 does take a little more muscle (actually 50% more… since typical purchase is 3:1 and I recommend 2:1)… however the reward is a lot less mainsheet to pull in which makes all of the mark roundings, and other tactical maneuvers much easier.
  • Outhaul: I use a 6:1 purchase and make sure everything is low friction enough that it can be easily pulled on even when the boom is under heavy vang load. The 6:1 purchase makes pulling it on easier, but also makes setting it more accurate. The outhaul seems quite critical on the Spiral
  • Centrecase: Make sure that the centreboard can’t fall out of the centrecase if you capsize (when it is not tied in). If needed, pack out the centrecase with seat belt webbing.
  • Vang and cunningham: Make sure they work – most seem to have this right.
  • Toe straps: They should be quite tight. If you are very very fit (see my post on fitness) then it might make sense to make them a bit longer, but tighter toestraps give more ‘connection’ to the boat. And longer toe straps don’t often allow people to lean out further – rather it just allows them to have a ‘slouched’ leaning style… which is not good!
  • Stiff, fair, light hulls and boards. (Make sure that the hull and boards are on weight and polished and stiff.)
  • Sails: Spiral sails are actually pretty hardy. The main thing to look for is that, when the sail is set up on the beach with no cunningham and just modest mainsheet tension, it has a nice aerofoil shape. Badly made sails will show non-aerofoil shapes. Old, tired sails will need a bit of cunningham tension to look right.

Angular Adventures – Part 1

I have spent some time recently working with Angular.js and thought I would record some thoughts. In particular, I was working on an RSS-reader type application with ‘infinite scrolling’ functionality. So, in dot points:

  • It was very quick to become somewhat productive. The ‘phonecat’ example app and online documentation are thorough. The documentation itself is pretty dense, so it is best to work through the tutorial first.
  • In my experience the two-way data binding was both the greatest strength and the ultimate downfall of Angular.js in my application. Two-way data binding meant getting a functional application up was extremely quick… a lot of typing was saved coupling models to views. On the other hand, for a predominantly ‘read’ application, there seemed to be some performance degradation from maintaining the two-way data binding.
  • I did manage to get an interesting ‘infinite scrolling’ implementation which I hope to share in a later post
  • The $resource service made working with a json RESTful api a breeze

So overall, for a dynamic application involving both reading and writing of dynamic data, Angular.js could well make your life a lot easier. For predominantly ‘read-only’ applications it may not be the right fit.

Advertising

There seems to be a debate about advertising as a business model. I have been thinking about this on and off for a long time, but was inspired to write this post by Fred Wilson’s recent post. My comment is to ask a question: why is it that content publishers get to sell access to my eyeballs to advertisers? I understand that for ‘traditional mass media’ there is no practical way for me as a consumer to sell that access directly to an advertiser… but in the enlightened internet age, why can’t I have a direct relationship with the advertisers and decide who gets access to my eyeballs? Why does a content publisher get this exclusive right?