
Practical Experience with Word2Vec

I have been playing with word2vec on and off since it was released in 2013. It is a neural network that scans vast amounts of text and produces a vector representation of each word that can do some useful things. There are quite a few intros posted online – like here [http://radimrehurek.com/2014/12/making-sense-of-word2vec/]. If you are not familiar with word2vec – go read one of these first then come back. What I wanted to talk about here was just some of the practical issues I came across when trying to get up and running.

Which code?

The canonical implementation of word2vec is the C code released by Tomas Mikolov [https://code.google.com/p/word2vec/] – the original author of word2vec. Gensim by Radim Rehurek [http://radimrehurek.com/gensim/index.html] is a fantastic alternative: Python makes ‘playing with’ these models very easy, and Gensim is a robust and surprisingly fast Python implementation. (There are also a number of other implementations, but I don’t have any experience with them.)

Using a pretrained model

The easiest way to start working with word2vec is just to load up a pretrained model. The original authors released a model trained on around 100 billion words! It is readily accessible and of high quality. Loading it up is pretty easy – however be warned that it requires at least 4GB of RAM.
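For reference, here is roughly what loading it looks like with Gensim. This is just a sketch: it assumes you have already downloaded the Google News binary released alongside the original word2vec code, and the exact class exposing load_word2vec_format has moved between Gensim versions (it used to live on Word2Vec itself).

import gensim

# Load the pretrained Google News vectors.
# The filename assumes the standard name of the (unzipped) download.
model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

print(model['king'][:10])  # the first 10 co-ordinates of the 300-dimensional vector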

Understanding vectors

In some intro material, word vectors are explained with the king-man+woman example, which goes something like this: imagine that the vector for king has one co-ordinate correlated with ‘royalty’ and another correlated with ‘maleness’. So by subtracting ‘maleness’ and adding ‘femaleness’ we can end up with a word like queen.[REF TODO] That is OK as a starting point if you are not used to thinking about word vectors. However, be warned – the co-ordinates of each vector do not actually work like this. I spent a long time stuck in this way of thinking. It leads to questions like ‘can we find words that are half male?’, ‘can we find words that are equal parts male and female?’, ‘what do words with a negative male co-ordinate represent?’ As far as I can tell these are all meaningless questions. In reality, ‘maleness’ is in some way distributed across all of the co-ordinates of the vector.
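For what it’s worth, the analogy arithmetic itself is easy to reproduce with Gensim’s most_similar, which adds the ‘positive’ vectors, subtracts the ‘negative’ ones and returns the nearest words by cosine similarity. A quick sketch, assuming the pretrained model from above is loaded as model:

# king - man + woman: expect 'queen' near the top of the list.
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))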

The curse of dimensionality

Words are vectors in high-dimensional spaces. For word2vec, 300-500 dimensions are not uncommon. These sorts of spaces are very hard to reason about (see especially ‘The Curse of Dimensionality’ [http://en.wikipedia.org/wiki/Curse_of_dimensionality]). I can’t really give much practical advice other than to say it is pretty normal for your head to hurt when trying to think about it.

Training your own model

When you want to start training your own model (rather than using a pretrained one), there are a few things to watch for. Firstly, it is important to appreciate that there is some inherent randomness in the models, so no two models will be the same. This makes reproducibility and comparing different training runs a little bit difficult. There seem to be some efforts within gensim to manage this, but I haven’t really looked into that aspect of it. There are a number of critical things to decide when training a model:

  • What corpus will you use? That is, what data will you train your model on?
  • What algorithm will you use? (word2vec actually implements a couple of different algorithms.)
  • What parameters will you use when training?
  • How will you evaluate your model?

Mikolov released ‘big-data.sh’ late last year (2014), which gives a good baseline for training your own model. My comments below are all related to training a general-purpose English language model.
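To make those decisions concrete, here is a minimal Gensim training sketch. The corpus file, dimensionality and other parameter values are illustrative assumptions only, and some of the parameter names (e.g. size, iter) have been renamed in more recent Gensim releases.

import gensim

# LineSentence streams a plain-text file with one (pre-tokenised) sentence
# per line, so the whole corpus never has to sit in memory at once.
sentences = gensim.models.word2vec.LineSentence('my_corpus.txt')  # hypothetical corpus file

model = gensim.models.Word2Vec(
    sentences,
    size=300,      # dimensionality of the word vectors
    sg=1,          # 1 = skipgram, 0 = CBOW
    window=5,      # context window either side of the target word
    min_count=5,   # ignore words that appear fewer than 5 times
    iter=10,       # passes over the corpus (see 'Iterations and Corpus Size' below)
    workers=4,     # parallel training threads
)

model.save('my_model.w2v')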

Corpus

Word2vec really relies on a large amount of text. It took me a while to appreciate just how large this needs to be. If you don’t have a large amount of data, then the word2vec tools support multiple passes or iterations over your data to try to improve the quality of the model.

  • text8 – (from here [http://mattmahoney.net/dc/textdata]) is a good starting point – not really big enough for anything serious, but training is quite quick and gives OK results, which makes it good for prototyping, sanity checks and so on (see the sketch after this list)
  • text9 contains about 120M words. It is big enough that you start to get some meaningful results. Training this model (with Gensim) takes at least 8G of RAM. A model can be trained in 6-12 hours on a 2013 quad core i5-3580
  • Mikolov’s big data corpus contains about 8.5B words. It requires at least 11G RAM to build with Gensim. It took over 4 days to train on my i5 using the CBOW algorithm (and 3 iterations) and almost 8 days to train using the skipgram algorithm
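As a quick prototyping example with text8, Gensim ships a Text8Corpus helper that splits the single long line of the download into fixed-length ‘sentences’. Again just a sketch (with the same caveats about Gensim versions), assuming the unzipped file is sitting in the working directory:

import gensim

# Stream the text8 file (one long line of space-separated tokens).
sentences = gensim.models.word2vec.Text8Corpus('text8')

# Small, quick model - fine for sanity checks, not for serious use.
model = gensim.models.Word2Vec(sentences, size=200, workers=4)

print(model.most_similar('paris', topn=5))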

Iterations and Corpus Size

One of the basic premises of word2vec seems to be that data beats algorithms – in the sense that a simpler algorithm trained on more data is better than a more sophisticated algorithm that takes longer to train. So in a sense the choice of algorithm needs to be weighed against training time. For example, when training a model on the text9 data set, I used the ‘skipgram’ and ‘hierarchical softmax’ options together (with 1 iteration) and performance on the ‘questions-words’ analogy task improved significantly. However, training time also increased significantly. If I simply used the skipgram algorithm with more iterations, the performance was even better again for the same training time. As a rule of thumb (which is admittedly pretty arbitrary), I would suggest that you need at least 1 billion words of training data to get a reasonable model (again, this is for general-purpose English only). In the case of the text9 corpus with about 100M words, this suggests at least 10 iterations over the data are needed. (I did not find much improvement beyond 10 iterations with text9.) Mikolov suggests 3 iterations over 8.5B words in his big-data script.
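On the evaluation side, Gensim can score a model against the ‘questions-words’ analogy file that ships with Mikolov’s original C distribution. A sketch only, assuming a model trained as above: the file path is an assumption about where you put it, and the method has been renamed in later Gensim releases.

# Score the model section-by-section on the word2vec analogy task.
results = model.accuracy('questions-words.txt')  # path to the file from the C distribution
for section in results:
    correct, incorrect = len(section['correct']), len(section['incorrect'])
    total = correct + incorrect
    if total:
        print(section['section'], float(correct) / total)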

Python – pip and virtualenv on Ubuntu

TL;DR;

If you are developing in Python on Ubuntu – ensure that you use the Ubuntu-provided pip and virtualenv:

sudo apt-get install python-pip

sudo apt-get install python-virtualenv

sudo pip install virtualenvwrapper  (assuming you like virtualenvwrapper – the Ubuntu-packaged virtualenvwrapper is too horribly out of date to use)


 

So I have just spent two evenings battling with one particular wrinkle of Python on Ubuntu that I found a little hard to resolve – even after copious googling – so I wanted to write up my case here in case it helps others.

So this is related to doing Python (2) development on Ubuntu. I have done some Python development previously on Ubuntu and had followed setup guides such as this one: http://docs.python-guide.org/en/latest/starting/install/linux/ 

This worked fine for me: virtualenv and pip both seemed to work and life was good. However, I had already fallen into one trap and I was about to fall into another.

The second more minor trap I was about to fall into was accidentally running ‘pip install’ when not inside a virtualenv. This installs into the global scope – rather than the virtualenv – which just ends up being messy. Even worse is running ‘sudo pip install’ which ends up installing things differently again. So the MUST DO from all of this is to add the following to my .bashrc:

export PIP_REQUIRE_VIRTUALENV=true

This essentially means that if you try to run pip when not inside your virtualenv you get an error message. This helps keep your environments much cleaner. It seems particularly important on Ubuntu, where Python is used by the system for so many things, which makes keeping your project’s dependencies clean that much harder.


But the real tricky one was that the Ubuntu Python and the ‘official Python’ put their packages in different locations (/usr/lib/python2.7 versus /usr/local/lib/python2.7/dist-packages). Mostly this did not cause me any issues, until I came across the scenario where I wanted to install – and develop on – a library and use it from within a virtualenv.

Following standard practice I forked/cloned the library from git and ran:

python setup.py develop

(This was in a directory outside my main project folder since I planned to use this library in other projects.) However, I was not able to import the library within my virtualenv – it couldn’t be found. I could have manually added paths to my PYTHONPATH to resolve this, but it “felt wrong”. After much digging, the reason essentially boiled down to the fact that my ‘Ubuntu provided’ python had installed a link to the library in /usr/local/lib/Python – when my ‘pip installed’ virtualenv was expecting it to be in /usr/lib/Python.
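A quick way to see which of these locations a given interpreter actually searches is to print sys.path from both inside and outside the virtualenv and compare where the site-packages / dist-packages entries point:

import sys

# Run this once inside the virtualenv and once outside,
# and compare the site-packages / dist-packages entries.
for path in sys.path:
    print(path)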

Once I realised this there was a bunch of pain involved in removing the stuff that I had ‘sudo pip installed’ or ‘pip installed’ outside of the virtualenv, then installing the Ubuntu versions (see TL;DR;).

Information junk food

It seems to be the ‘new normal’ to bemoan how easy it is to overdo passive consumption of information on the internet. Facebook, Twitter, random blog posts such as this one – all consume our time but don’t always add a lot of value. This post is the most recent one that I came across. All true – but almost a truism. Even the comments over on Hacker News were pretty predictable. One of the ‘predictable’ comments compared this sort of low quality information with junk food. Most of these articles seem to end with general exhortations to ‘watch what we read’.

To take the ‘junk food’ analogy further, the invention of ‘junk food’ (low quality – easy to eat food) eventually led to the creation of ‘Nutrition labels’ that at least make it visible to people what the nutritional content of the food is. So what would ‘Nutrition labels’ on our ‘information’ look like? Is this a useful analogy?

Angular Adventures – Part 1

I have spent some time recently working with Angular.js and thought I would record some thoughts. In particular I was working on an RSS Reader type application with ‘infinite scrolling’ functionality. So in dot points:

  • It was very quick to become somewhat productive. The ‘phonecat’ example app and online documentation are thorough. The documentation itself is pretty dense – so it is best to work through the tutorial first.
  • In my experience the two way data binding was both the greatest strength and ultimate downfall of Angular.js in my application. Two way data binding meant getting a functional application up was extremely quick… a lot of typing was saved coupling models to views. On the other hand, for a predominantly ‘read’ application, there seemed to be some performance degradation with maintaining this two way data binding.
  • I did manage to get an interesting ‘infinite scrolling’ implementation which I hope to share in a later post
  • The $resource service made working with a json RESTful api a breeze

So overall, for a dynamic application involving both reading and writing of dynamic data, Angular.js could well make your life a lot easier. For predominantly ‘read only’ applications it may not be the right fit.

Advertising

There seems to be a debate about advertising as a business model. I have been thinking about this on and off for a long time, but was inspired to write this post by Fred Wilson’s recent post. My comment is to ask a question: why is it that content publishers get to sell access to my eyeballs to advertisers? I understand that for ‘traditional mass media’ there is no practical way for me as a consumer to sell that access directly to an advertiser….. but in the enlightened internet age, why can’t I have a direct relationship with the advertisers and decide who gets access to my eyeballs… why does a content publisher get this exclusive right?

Did a thought occur to you?

A thought just occurred to me….. That is a surprisingly pleasant experience… When a thought occurs I consider if I should tweet it…. sometimes I do. Sometimes though the thought is hard to express in 140 characters… maybe I should blog about it…. maybe I should flesh it out some more…. and then it normally dies – blog post never written… So today I am starting on a concerted mission to blog more regularly – but in a shorter, less well formed format…. At least then it might get done!

Adventures with screen dpi and text editors

Just a quick note in case anyone else finds themselves in the same position.

I have just recently set up the excellent Sublime Text 2. One of the appeals of using this editor was sharing the one set-up across my Windows desktop, my Macbook Pro and the various Ubuntu virtual machines that I run in Virtualbox.

All was going swimmingly until I tried to set a convenient font size. The font size that looked right on the Linux guest (on the Macbook Pro) looked too small on the Macbook. To cut a long story short – I found no great answer. 

In this configuration the difference was caused by a different dpi setting on the two systems. The native Mac was working with 72dpi, and the Ubuntu virtualbox was working with 96dpi. There didn’t seem to be any way to change the dpi setting for Ubuntu. The closest I got was changing the text scaling factor like so:

 gsettings set org.gnome.desktop.interface text-scaling-factor X

I tried setting X to 0.75 (72dpi / 96dpi). This made all of the system fonts smaller. So then I was able to set larger system fonts on the Ubuntu guest that looked about the same size as they did on the Macbook Pro host – for most applications (e.g. Terminal and gEdit). However, for some unknown reason, Sublime Text 2 did not seem to recognise this change in any way. (I am guessing this is because I did not actually change the dpi in the Ubuntu vm – just the text scaling… and Sublime Text must be referencing the dpi in some way.)

So where I am at now is setting the text-scaling-factor back to 1 (so that within Ubuntu, font sizes with Sublime and other apps are aligned) – and having to manually change (increase) font size whenever I switch to editing within OS X.

So if anyone reads this after spending an hour or two looking for a solution…. feel consoled that you are not alone.

And if you happen to know a better answer – I would really like to know it.

Controlling myself

Lots of people seem to worry about being profiled by their use of the ‘net. ‘Big data’ is coming to analyse your every click, tweet and post for someone else’s benefit. This is captured in the ‘if you are not paying, you are the product’ soundbite that reverberates around the internet.

Some people obviously go to great extremes to maintain their anonymity online – but it seems pretty hard to do this without losing some value. (I am not going to get a huge amount of value from Facebook if I don’t have an account – and log in occasionally.) 

So if we must leave a trail of data behind us on the internet, I wonder if it is possible to leave a trail that is so full of noise that it is essentially useless for profiling? In engineer-speak – can enough artificial ‘noise’ be added to my online activities so as to make the signal difficult for these algorithms to extract? In the case of Google Search, for example, would a computer issuing fake queries on a massive range of topics be able to limit Google’s ability to identify the ‘real’ queries?

Probably just a hare-brained idea with no practicality – but it is sort of interesting to imagine what would happen. This thinking has been inspired by the recent Twitter vs app.net dichotomy that has already had too much said about it… The choice that is being portrayed is between an organisation that will make decisions in the best interests of advertisers, vs an organisation that has promised to do it differently. But if I always retained control of my data – then I would not have to rely on promises…

Twitter Competition

In all the hoopla about Twitter’s continual clamp down on the use of its API (including by such people as Dalton Caldwell, Fred Wilson and I am sure many others) there has been much debate about ad supported vs paid services. There has also been some debate about developing competitive alternatives to Twitter, including Dalton Caldwell’s app.net and proposals to build upon existing open protocols (such as by Dave Winer).

My less than 2c contribution to this discussion is to wonder aloud about the market dynamics of ‘social media’/‘user generated’ content sites. What is it about these ‘new media’ that seems to promote a ‘winner takes all’ dynamic? How many ‘Twitter copycats’ are there? Yet ‘conventional media’ supported many competitors – mainstream daily print newspapers, say, had many players vying for ad dollars and customer dollars simultaneously. In this analogy I don’t see Facebook and Twitter as competitors… Twitter is like the daily news compared to the ‘weekly glossy’ of Facebook.

The relatively high switching cost (setting up accounts, finding people to follow and so on) compared to traditional media (buying a different newspaper, or flicking channels on a remote) seems to be pretty key to me. All of which brings to mind an example that will be familiar to anyone with the slightest notion of Eric Ries and the Lean Startup ‘thing’ – and that example is IM. Eric often talks about his experience building on top of existing IM networks, only to discover that his customers actually preferred to have multiple networks managed by an ‘all-in-one’ client.

Will we get such all-in-one clients? It is quite likely that these will be on Twitter’s hit-list in terms of any API crackdown. But without this, it would seem we will be limited to ‘one provider per media’. And replacing the incumbents is a lot bigger ask than simply providing a competitive alternative.

Google Search Settings

I found something very useful for me today that I only just discovered: Google Search settings.

On the Google search results page (not the home page) there is a little gear icon in the top right that lets you change some search settings. There are a few handy ones in there – but there was one real eye-opener for me: “Where Results Open”. When this is selected – if you click on a google search result it opens in a new window/tab – instead of the current one. This might sound trivial – but it will save me countless Right-click…. Open as New Tab operations!!