My first Machine Learning project: Using Naive Bayes to classify tweets

01 Feb 2017, by Pang Yan Han

Source code: https://gist.github.com/yanhan/d9061c9575d14228d2a9ecc9519a55aa

Before we go into the main content of this post, I have a confession to make - it’s pretty damned hard to get started with Machine Learning if you’re not doing it on the job and if you don’t happen to have some problem to solve (and have the appropriate data for it). One of the main hurdles I’ve faced is that after picking up some fundamentals, Machine Learning knowledge is like this hammer but there aren’t many suitable nails in sight, probably because of my lack of experience; a lot of problems also seem very, very tough. This post is my latest attempt to build up my skills in it - through some hands-on experience with “real-world” data sets. Hopefully this trend keeps up and I’ll have many more projects under my belt over the course of the year.

Inspiration

I recently finished reading John Foreman’s Data Smart and got quite inspired to try out some of the techniques in it. I spent an entire afternoon brainstorming project ideas. One of them: why not do a variant of what’s covered in Data Smart chapter 3, which uses Naive Bayes to classify tweets? Except that we’ll be using a programming language instead of Excel (thank goodness).

The problem

Alright, we’ve decided on the algorithm - Naive Bayes. To make things simple, we will build a binary class classifier, just like in the book. So, what kind of tweets? The best kind of tweets are probably those involving some word with ambiguous meaning. This word may or may not be used as a hashtag in the tweets. The word ‘react’ was one of the first which came to my mind - after all, it is used in conventional speech / writing and is the name of a popular JavaScript framework, so the same word stands for two very distinct things/meanings. A human reading the tweets should not have much issue labelling the data but more importantly, these 2 attributes of the word ‘react’ should garner a sufficiently large number of tweets for both classes.

Obtaining the data

I wish I could say that I downloaded a ready-made data set or used the Twitter API, but in reality I ran a search on Twitter and manually copied and pasted tweets into 2 files: one for tweets about the React framework and the other for tweets that are not about it. For each class, I collected 250 tweets, and all or most of them contain the word ‘react’ or ‘React’. This was a painful and laborious process that took a few hours to complete. In particular, most tweets about the React JS framework do not contain the word ‘react’ or ‘React’ but contain ‘ReactJS’ (possibly in a different case), so it took a long time to find suitable tweets.

For each class, I shuffled the 250 tweets and partitioned them into a training set of 180 tweets (72% of the data) and a test set of 70 tweets (28% of the data).
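The shuffle-and-partition step can be sketched roughly as follows. This is a minimal sketch and not the code from the gist: the function name `split_tweets`, the fixed seed and the placeholder tweets are all assumptions made for illustration.

```python
import random


def split_tweets(tweets, n_train=180, seed=0):
    """Shuffle a list of tweets and split it into (train, test)."""
    shuffled = list(tweets)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder data standing in for the 250 hand-collected tweets of one class
tweets = [f"tweet {i}" for i in range(250)]
train, test = split_tweets(tweets)
print(len(train), len(test))  # 180 70
```

The same function would be called once per class, so that both the training set and the test set stay balanced.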

A tiny bit of exploratory data analysis and thoughts before starting

This is a toy problem and it is extremely easy to classify tweets about React JS correctly - in fact we don’t need machine learning to do it. While gathering the data, I noticed that most tweets about React JS contain the word ReactJS in some case variant or use the hashtag #ReactJS. They typically don’t use the word react, and even when they do, it is almost always spelt React with an uppercase R. In tweets not about the React JS framework, on the other hand, we don’t see the uppercase React - it is always react. Ok, to make things more fun, we lowercase everything.

On the #ReactJS hashtag - by summing the token counts, we see that other than the token react, the next most commonly occurring token in the training set is #reactjs at 115 counts, followed by #javascript at 61 counts. Again, to make things more fun, we discard these two tokens.
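Summing token counts like this takes only a few lines. The sketch below is an illustration under assumptions: the helper name `count_tokens` and the two sample tweets are made up, and plain whitespace splitting stands in for the fuller preprocessing described later.

```python
from collections import Counter


def count_tokens(tweets):
    """Lowercase each tweet, split on whitespace and tally the tokens."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts


# Made-up tweets, for illustration only
sample = [
    "Learning React today #ReactJS #JavaScript",
    "New React release #reactjs",
]
counts = count_tokens(sample)
print(counts.most_common(3))

# Drop the two hashtags that would make the problem too easy
for banned in ("#reactjs", "#javascript"):
    counts.pop(banned, None)
```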

The hand-rolled classifiers

There are 2 implementations: one hand-coded Naive Bayes implementation and one using scikit-learn.

For the hand-coded implementation, we follow most of the implementation in Data Smart. For each tweet, we first replace every occurrence of one of the characters . : ? ! ; , that is followed by a space with a single space. Then we split on whitespace and reject all tokens of three characters or fewer, along with the tokens #reactjs and #javascript. We train 2 classifiers - one for recognizing tweets about React JS and the other for recognizing tweets not about React JS. Instead of the smoothing method covered in the book (which adds 1 to the count for every token), we use the additive smoothing covered on Wikipedia and in this Cross Validated question:


$$P(token | class) = \frac{x_i + \alpha}{N + \alpha * |V|} $$

where \(x_i\) is the token count in the class, \(\alpha = 1\), \( N \) is the sum of all token counts in the given class, and \( |V| \) is the size of the vocabulary in the entire training set (regardless of class). This is identical to the formula in the Cross Validated question, except that it doesn’t contain an additional \( + 1 \) in the denominator.

To handle tokens which are not present in the training set but present in the test set, we default to using this probability:


$$P(unseen\ token | class) = \frac{1}{N + |V|}$$
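As a small worked example of the two formulas above, here is a sketch with toy counts. The function name `token_log_prob` and the numbers are assumptions for illustration; they are not taken from the real data set.

```python
import math


def token_log_prob(token, class_counts, vocab_size, alpha=1.0):
    """log P(token | class) with additive smoothing.

    class_counts maps token -> count within one class. Tokens missing
    from it get a count of 0, which gives the unseen-token formula.
    """
    n = sum(class_counts.values())  # N: total token count in the class
    x_i = class_counts.get(token, 0)
    return math.log((x_i + alpha) / (n + alpha * vocab_size))


# Toy numbers: N = 4 for this class, |V| = 5 over both classes
react_counts = {"component": 3, "redux": 1}
vocab_size = 5
seen = token_log_prob("component", react_counts, vocab_size)  # log(4 / 9)
unseen = token_log_prob("weather", react_counts, vocab_size)  # log(1 / 9)
```

Working in log space avoids multiplying many tiny probabilities together, which is why the classifiers sum log probabilities instead.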

When passed a tweet, each classifier computes the sum of the log likelihood of each token in the tweet. We pass the tweet to both classifiers and compare the two log probabilities - the higher one wins and we say that the tweet belongs to that class.
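Putting the pieces together, the whole hand-rolled approach might look something like the sketch below. It is a simplified reconstruction from the description in this post, not the gist itself: the function names, the label strings and the toy training tweets are all assumptions.

```python
import math
import re
from collections import Counter

BANNED = {"#reactjs", "#javascript"}


def tokenize(tweet):
    """Lowercase, turn '. : ? ! ; ,' followed by a space into a space, split."""
    text = re.sub(r"[.:?!;,] ", " ", tweet.lower())
    return [t for t in text.split() if len(t) > 3 and t not in BANNED]


def train(tweets):
    """Return the token counts for one class."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


def log_likelihood(tweet, counts, vocab_size, alpha=1.0):
    """Sum of smoothed log P(token | class) over the tweet's tokens."""
    n = sum(counts.values())
    return sum(
        math.log((counts[t] + alpha) / (n + alpha * vocab_size))
        for t in tokenize(tweet)
    )


def classify(tweet, react_counts, other_counts):
    """Score the tweet under both classes; the higher log likelihood wins."""
    vocab_size = len(set(react_counts) | set(other_counts))
    react = log_likelihood(tweet, react_counts, vocab_size)
    other = log_likelihood(tweet, other_counts, vocab_size)
    return "reactjs" if react > other else "not reactjs"


# Toy training data, for illustration only
react_counts = train(["react component state with redux", "react hooks and redux"])
other_counts = train(["people react badly sometimes", "how people react under stress"])
label = classify("redux hooks tutorial", react_counts, other_counts)
print(label)  # reactjs
```

Since both classes have the same number of training tweets, the class priors are equal and are left out of the comparison in this sketch.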

The confusion matrix is as follows:

Looks pretty good.

The scikit-learn classifier

Originally, I didn’t want to do this because this is just a toy project but, since I’m doing it already, might as well figure out how to implement Naive Bayes using scikit-learn on something simple like this.

After reading through some of the awesome scikit-learn documentation, I figured out how to implement something similar to our hand-rolled Naive Bayes classifier using scikit-learn. Instead of using CountVectorizer, I decided to use HashingVectorizer. The hardest part of this implementation was figuring out how to use the HashingVectorizer. Eventually, it came down to this:

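import sklearn.feature_extraction.text
from sklearn.feature_extraction.text import HashingVectorizer

# _BANNED_TOKENS (defined elsewhere) holds the extra tokens to discard on top
# of the built-in English stop words.
stop_words = list(
    set(
        sklearn.feature_extraction.text.ENGLISH_STOP_WORDS
    ).union(_BANNED_TOKENS)
)
vectorizer = HashingVectorizer(
    stop_words=stop_words,
    # Keep only tokens made of at least 4 word characters.
    token_pattern=r"""\b\w\w\w\w+\b""",
    # No normalization: we want raw counts.
    norm=None,
    # Newer scikit-learn releases dropped this flag in favour of alternate_sign=False.
    non_negative=True,
)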

We needed to add the stopwords #reactjs and #javascript to scikit-learn’s built-in list of English stopwords. A little googling yielded this useful answer. By default, HashingVectorizer uses the u'(?u)\b\w\w+\b' regex, which captures any token of length 2 or more, but we only want tokens of length 4 or more, so we used token_pattern=r"""\b\w\w\w\w+\b""" to override that setting. We also didn’t want any normalization (the default is 'l2') and we didn’t want any negative values, just raw counts, hence non_negative=True.
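To see what this token pattern actually extracts, we can try it with Python’s re module directly. This is only a sketch: the helper name `tokens` and the sample strings are made up, and the lowercasing mimics HashingVectorizer’s default behaviour.

```python
import re

# Same pattern as the one passed to HashingVectorizer. In Python 3, str
# patterns are Unicode-aware by default, so the (?u) flag is not needed.
TOKEN_PATTERN = re.compile(r"\b\w\w\w\w+\b")


def tokens(text):
    """Lowercase the text and return every run of 4 or more word characters."""
    return TOKEN_PATTERN.findall(text.lower())


print(tokens("How do you react to #ReactJS?"))
# ['react', 'reactjs']
print(tokens("you're at http://example.com/store"))
# ['http', 'example', 'store']
```

Two things stand out: # is not a word character, so a hashtag loses its leading #, and punctuation inside URLs and contractions acts as a separator, so they get chopped into pieces.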

We used the HashingVectorizer.transform method to transform all the training tweets into a sparse matrix, and fed that to sklearn.naive_bayes.MultinomialNB via its fit method. To evaluate this model on our test data, we need to use the HashingVectorizer to transform the test data, then pass them to the trained Naive Bayes model.
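The transform, fit and predict steps can be sketched end to end as below. This is a minimal sketch on made-up tweets, not the code from the gist. It also assumes a recent scikit-learn release, where alternate_sign=False takes the place of the non_negative=True setting used above.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

vectorizer = HashingVectorizer(
    stop_words=list(ENGLISH_STOP_WORDS),
    token_pattern=r"\b\w\w\w\w+\b",
    norm=None,
    alternate_sign=False,  # keep raw, non-negative counts
)

# Made-up tweets; label 1 = about React JS, 0 = not about React JS
train_tweets = [
    "learning react hooks with redux components",
    "react native tutorial with redux",
    "people react badly when angry",
    "how people react under pressure",
]
train_labels = [1, 1, 0, 0]

# HashingVectorizer is stateless, so transform() works without fit()
X_train = vectorizer.transform(train_tweets)
model = MultinomialNB()  # alpha=1.0 by default, i.e. additive smoothing
model.fit(X_train, train_labels)

X_test = vectorizer.transform(["redux components tutorial"])
predicted = model.predict(X_test)
print(predicted)
```

Because the vectorizer hashes tokens instead of learning a vocabulary, the same object can transform training and test tweets without any shared state.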

The confusion matrix for the scikit-learn Multinomial Naive Bayes model:

While this classifier performs better than our hand-rolled classifier for tweets about ReactJS, it performs worse for tweets that are not about ReactJS.

Analysis

Some statistics of our models

The number 1895 refers to the size of the vocabulary for the entire training set (both tweets about React JS and not about React JS). The log probability for an unseen token is higher for the React JS model as compared to the non React JS model. The pitfall is that a tweet which consists entirely of unseen tokens will favor the React JS model - most likely that tweet is not about React JS, since most tweets about React JS will have some variant of the #reactjs hashtag which will exist in the training set.

For the MultinomialNB model, the numbers are -13.86442357 for a non React JS tweet and -13.86489007 for a React JS tweet. A much smaller discrepancy that slightly favors the non React JS class.
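One plausible reason for the much smaller discrepancy is the size of the denominator. With \( \alpha = 1 \), an unseen token gets probability \( 1 / (N + |V|) \), and HashingVectorizer hashes into 2**20 columns by default, so \( |V| \) dwarfs any difference in \( N \) between the classes. The sketch below illustrates this; the per-class token totals are hypothetical and are not the real values from our models.

```python
import math


def unseen_log_prob(total_tokens, n_features, alpha=1.0):
    """log P(unseen token | class) = log(alpha / (N + alpha * |V|))."""
    return math.log(alpha / (total_tokens + alpha * n_features))


# Hypothetical per-class token totals (N), chosen only for illustration
n_small, n_large = 1600, 1900

# Hand-rolled model: |V| is the training vocabulary size of 1895
gap_hand = unseen_log_prob(n_small, 1895) - unseen_log_prob(n_large, 1895)

# Hashed features: |V| is the number of hash columns, 2**20 by default
gap_hashed = unseen_log_prob(n_small, 2 ** 20) - unseen_log_prob(n_large, 2 ** 20)

print(f"hand rolled gap: {gap_hand:.4f}, hashed gap: {gap_hashed:.6f}")
```

In both cases the class with fewer counted tokens gets the higher log probability for an unseen token, but with 2**20 columns the gap all but disappears.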

False negatives for our hand rolled Naive Bayes model:

  1. Why I like Vue over React - http://buff.ly/2jWKpzJ #vuejs #reactjs
  2. The power of React JS… http://catchoftheday.wesbos.com/store/lazy-glamorous-knives …
  3. these “google trends” charts can show almost whatever you want. people search for “React” rather than “React.js” http://image.prntscr.com/image/42ff33c9b0da42ef8084f45a647ccc00.png …

For the first tweet, the only tokens that are counted are like, over, react, http://buff.ly/2jWKpzJ and #vuejs. Among those, only #vuejs has predictive power, since Vue.js is a JavaScript framework. However, the token #vuejs does not occur in the training set. So it is understandable that this tweet is misclassified.

It is a similar story for the second tweet, with only JS... being the token with predictive power. However, that doesn’t occur in the training set.

For the final tweet, the only token with predictive power is "react.js" (notice the quotes), but the presence of the quotes probably screwed things up.

False negatives for the sklearn MultinomialNB model:

  1. these “google trends” charts can show almost whatever you want. people search for “React” rather than “React.js” http://image.prntscr.com/image/42ff33c9b0da42ef8084f45a647ccc00.png …

Which is pretty curious, since the vectorizer correctly tokenizes "React.js" into the token react.js.

False positives for hand rolled Naive Bayes model:

  1. i may not react but trust me i saw it
  2. Star Wars superfans (and one feisty Rebel soldier) react to #RogueOne. http://strw.rs/60138rHgt
  3. Ants react to their infection by climbing up plants and sinking their mandibles into plant tissue
  4. @Khlil10x react to the NFL games
  5. Why ppl wud insult my father? Ppl wud react if I wud use my fathers name an absolute truth 2 impose n violate others basic rights.

False positives for sklearn MultinomialNB model:

  1. How everyone should react
  2. i may not react but trust me i saw it
  3. Star Wars superfans (and one feisty Rebel soldier) react to #RogueOne. http://strw.rs/60138rHgt
  4. At this point, no one is believing 20% mexico surcharge is real. We’d have to abrogate NAFTA in total to do that. That why no market react.
  5. This is amazing. Jack Eichel’s Dad & other Sabres parents in Nashville react to Eichel’s game-winning goal in OT.
  6. Things to Consider: read the article & not just the headline before you react & retweet. Less exciting, but better for all.
  7. Ants react to their infection by climbing up plants and sinking their mandibles into plant tissue
  8. If you’re a #veteran, how do you react when someone calls you a hero? Here’s my perspective on it http://ow.ly/EKQE308ex4i
  9. @Khlil10x react to the NFL games
  10. Why ppl wud insult my father? Ppl wud react if I wud use my fathers name an absolute truth 2 impose n violate others basic rights.

Notice that all the false positives for the Naive Bayes model are also false positives for the MultinomialNB model. If there’s one thing I can say straight off the bat about these false positives, it is this - they are about pretty different things - there is a tweet about star wars, a tweet about ants, etc. This is a stark contrast to the tweets about ReactJS, which are well, about ReactJS. But all this is just guesswork - let’s look at more concrete stuff.

Analysis of false negatives for hand rolled Naive Bayes model

For the tweet Why I like Vue over React - http://buff.ly/2jWKpzJ #vuejs #reactjs

The main culprits are the tokens over and like. over is a stop word for scikit learn and its removal would have helped things slightly, but like is not a stop word and its relatively high log probability (compared to the other tokens which do not appear in the vocabulary) means that this tweet will still be incorrectly classified. Note that this tweet is not a false negative for the MultinomialNB model - some painstaking investigation into the log probabilities of the features of the MultinomialNB model revealed that the reactjs token greatly favored the React JS model and the react and http tokens slightly favored the React JS model; the other log probabilities are about even except for the token like favoring the non React JS model. Log probabilities given by the MultinomialNB model as follows:

For the tweet The power of React JS... http://catchoftheday.wesbos.com/store/lazy-glamorous-knives …

The culprit is the token power, which appears twice in the training set for non reactjs tweets but does not appear in the training set for reactjs tweets. Note that this tweet is correctly classified by the MultinomialNB model - most of the log probabilities for the tokens are pretty even and the difference makers for the React JS model are the tokens http, wesbos (there was an occurrence of @wesbos in the training set and after the processing by HashingVectorizer it became wesbos) and store (appeared once in training set) - notice that these are all tokens created as a result of removing all the punctuation in the URL and treating it just like whitespace. It seems that in a bag of words model, there may be more value in chopping up URLs in addition to, or perhaps even instead of, leaving them as they are. Log probabilities given by the MultinomialNB as follows:

For the tweet these "google trends" charts can show almost whatever you want. people search for "React" rather than "React.js" http://image.prntscr.com/image/42ff33c9b0da42ef8084f45a647ccc00.png …

This is a rather long tweet so I am only showing some of the more interesting tokens here. First, the errors we made from tokenization - trends", "react", "react.js", "google. In a larger data set, these might have affected results more, because with correct tokenization they would have been unified with the proper tokens (trends, react, react.js and google). In particular, the token react.js will favor the React JS model.

There are 5 stop words here, namely: rather, show, whatever, these, than. These would have been removed by a scikit-learn model using the default English stop words. Their presence in classification favors the non React JS model, which contains more of these tokens with the exception of these.

Finally, the tokens people and want, which are neither stop words nor incorrect tokens. Especially people - it greatly favors the non React JS model.

This tweet is also incorrectly classified by the MultinomialNB model. The HashingVectorizer will produce the following tokens: google, trends, charts, want, people, search, react, http, image, prntscr, 42ff33c9b0da42ef8084f45a647ccc00. Of these, react, http and image give a slight edge to the React JS model, but it is not sufficient to counter the effect of the tokens want and especially people. The table below shows the log probabilities of these tokens given by the MultinomialNB model:

Analysis of false positives for hand rolled Naive Bayes model

  1. i may not react but trust me i saw it
  2. Star Wars superfans (and one feisty Rebel soldier) react to #RogueOne. http://strw.rs/60138rHgt
  3. Ants react to their infection by climbing up plants and sinking their mandibles into plant tissue
  4. @Khlil10x react to the NFL games
  5. Why ppl wud insult my father? Ppl wud react if I wud use my fathers name an absolute truth 2 impose n violate others basic rights.

Tweet: i may not react but trust me i saw it

Our hand rolled model sees the tokens trust and react, both of which slightly favor the React JS model - it is a very similar story for the MultinomialNB model. It seems that very short tweets like this one and tweets which contain many stop words (only 2 out of 10 tokens here are meaningful) can be easily misclassified by a Naive Bayes model.

Tweet: Star Wars superfans (and one feisty Rebel soldier) react to #RogueOne. http://strw.rs/60138rHgt

More interesting log probabilities for hand rolled Naive Bayes model:

More interesting log probabilities for the MultinomialNB model:

Once again we see the flaw of a bag of words model - just because a token appears more frequently for documents of a given class doesn’t mean that a new document containing that token belongs to that class.

Tweet: Ants react to their infection by climbing up plants and sinking their mandibles into plant tissue

More interesting log probabilities for our hand rolled model:

Other than the above tokens, all the other tokens were unseen by both models. Recall that the log probability for an unseen token is about -8.1645 for the React JS model and -8.25088 for the non React JS model - these numbers add up to favor the React JS model, even if we remove all the stop words.

For the MultinomialNB model, it is the token react that made the big difference. All the other tokens are unseen in the training set, which gives a slight edge to the non React JS class but that was not enough to counter the effect of the react token.

Tweet: @Khlil10x react to the NFL games

Another very short tweet.

Our hand rolled model sees the tokens @khlil10x, react and games. The only seen token is react, which favors the React JS model. Since log probs for unseen tokens are higher for the React JS model, everything favors the React JS model.

For the MultinomialNB model, it is again the token react that favored the React JS model. The other tokens Khlil10x and games are unseen and give a very very slight edge to the non React JS model, which is not enough to change the outcome.

Tweet: Why ppl wud insult my father? Ppl wud react if I wud use my fathers name an absolute truth 2 impose n violate others basic rights.

More interesting log probabilities for hand rolled model:

Even though insult and others gave an edge to the non React JS model, it wasn’t sufficient to offset the effects of the basic and react tokens along with 9 other unseen tokens.

More interesting log probabilities for MultinomialNB model:

It is a similar situation for the MultinomialNB model with react and basic tilting the scales towards the React JS class.

False positives for the MultinomialNB model that are not false positives for the hand rolled Naive Bayes model

The bigger question is, why does the MultinomialNB model perform worse than our hand rolled Naive Bayes model? The following tweets were correctly classified as negatives by our hand rolled Naive Bayes model but incorrectly classified by the MultinomialNB model:

  1. How everyone should react
  2. At this point, no one is believing 20% mexico surcharge is real. We’d have to abrogate NAFTA in total to do that. That why no market react.
  3. This is amazing. Jack Eichel’s Dad & other Sabres parents in Nashville react to Eichel’s game-winning goal in OT.
  4. Things to Consider: read the article & not just the headline before you react & retweet. Less exciting, but better for all.
  5. If you’re a #veteran, how do you react when someone calls you a hero? Here’s my perspective on it http://ow.ly/EKQE308ex4i

Tweet: How everyone should react

After HashingVectorizer processes this tweet, the only remaining token is react, which favors the React JS class. How, everyone and should are stop words.

Stop words that will be considered by our hand rolled model: everyone, should

Log probabilities of the stop words for hand rolled classifiers:

So it was indeed the log probabilities of stop words that influenced the outcome for our hand rolled classifiers.

Tweet: At this point, no one is believing 20% mexico surcharge is real. We'd have to abrogate NAFTA in total to do that. That why no market react.

More interesting log probabilities:

Stop words that will be considered by our hand rolled model: this, have, that

Log probabilities of stop words for hand rolled classifiers:

that appeared twice in this tweet and hence its log probability was counted twice. Notice the react. ending with a fullstop. Stop words definitely influenced the outcome for our hand rolled models.

Tweet: This is amazing. Jack Eichel's Dad & other Sabres parents in Nashville react to Eichel's game-winning goal in OT.

More interesting log probabilities:

Stop words that will be considered by our hand rolled model: this, other

Log probabilities of stop words for hand rolled classifiers:

Again, stop words do influence the outcome in this case.

Tweet: Things to Consider: read the article & not just the headline before you react & retweet. Less exciting, but better for all.

More interesting log probabilities:

Stop words that will be considered by our hand rolled model: before, less

Log probabilities of stop words for hand rolled classifiers:

Tweet: If you're a #veteran, how do you react when someone calls you a hero? Here's my perspective on it http://ow.ly/EKQE308ex4i

More interesting log probabilities:

Stop words that will be considered by our hand rolled model: when, someone

Log probabilities of stop words for hand rolled classifiers:

This is a more interesting case. While the stop word when tilts the scales towards the non React JS class, the token you're does so as well - this token does not exist in the vocabulary of the MultinomialNB classifier (it would have been chopped up into you and re and both discarded). Our hand rolled classifier also does not chop up the URL and so http isn’t a token for our hand rolled models.

From these 5 examples, we see that stop word removal by the HashingVectorizer introduced more false positives.

Conclusion

The Naive Bayes model works pretty well for a small data set and simple classification task like this one. For classification tasks involving text, it is probably a good starting point due to the simplicity of its implementation. As such, it makes a good baseline against which we can evaluate more sophisticated methods.

It is also pretty instructive to open up the model and figure out the reasons for misclassifications.
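For a hand-rolled model, “opening up the model” mostly means comparing per-token log probabilities between the two classes. A sketch of such a helper follows. The name `token_contributions` and the toy counts are assumptions for illustration, not the code or data behind the analysis above.

```python
import math
from collections import Counter


def token_contributions(tokens, react_counts, other_counts, alpha=1.0):
    """Return (token, log P(t|react) - log P(t|other)) pairs, largest gap first.

    A positive value pushes the tweet towards the React JS class,
    a negative value towards the non React JS class.
    """
    vocab_size = len(set(react_counts) | set(other_counts))
    n_react = sum(react_counts.values())
    n_other = sum(other_counts.values())
    rows = []
    for t in tokens:
        lp_react = math.log((react_counts[t] + alpha) / (n_react + alpha * vocab_size))
        lp_other = math.log((other_counts[t] + alpha) / (n_other + alpha * vocab_size))
        rows.append((t, lp_react - lp_other))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)


# Toy counts, for illustration only
react_counts = Counter({"react": 5, "component": 3, "basic": 1})
other_counts = Counter({"react": 4, "people": 6, "insult": 2})
rows = token_contributions(["react", "people", "basic"], react_counts, other_counts)
for token, diff in rows:
    print(f"{token:10s} {diff:+.3f}")
```

Sorting by the absolute difference puts the tokens that swing the decision the most at the top, whichever class they favor.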

We summarize our findings here:

  1. short tweets are prone to misclassification, as they are very likely to consist mostly of stop words. The few remaining tokens then get “more weightage”
  2. tweets containing a lot of unseen tokens are also prone to misclassification, for the same reason as 1
  3. a bag of words model only takes word frequencies into account and can be tricked. Context and other language structures would help in classification, but a bag of words model ignores them entirely
  4. stop words affect classification, as seen from our analysis of the false positives for the MultinomialNB model.

Afterthoughts

Even though this is a toy problem, we learnt how to do the following:

  1. Hand roll Naive Bayes and at least have this one method to deal with text data
  2. Use scikit-learn’s HashingVectorizer with some customizations
  3. Inspect the Naive Bayes model to figure out the reasons for misclassifications

And we got our hands dirty with machine learning outside of a course-like environment, without following instructions in a book / tutorial.

This also happens to be our first blog post in quite a while. To those reading this, Happy New Year folks!
