Over the past several years, there has been an explosion of interest in natural language processing (NLP) and text analysis. It’s important to understand the direct applications of a successful NLP and text analysis algorithm that could enhance life. However, a starting point for achieving these applications is understanding the similarity between words. Understanding the similarity and dissimilarity between words could move the math closer to the applications below.
Text Mining and Doctor’s Offices
While at my graduation from Boston University, I was talking with a gentleman I met in line. He mentioned how he was applying natural language processing and text analysis directly in a doctor’s office. The doctor would turn on the software, it would listen as the patient explained their symptoms, and the software would then make a statistical guess as to what the issue was, along with a confidence score.
A call center could use natural language processing and text analysis so that when users call in with an issue, they could explain it over the phone and be redirected, given an explanation, or routed to the most common solution to their problem. How neat would this be? Interestingly enough, maintaining traditional call centers, and even automated ones, can cost millions of dollars; this could save companies millions.
By implementing the Word2Vec algorithm, like words group together and unlike words do not.
The Word2Vec algorithm was introduced in 2013 by a team at Google led by Tomas Mikolov. The team developed an algorithm to convert words into vectors, making it possible to use vector math on them. The Word2Vec implementation here is done in R using a library called wordVectors.
Raw Data Training Corpus: raw data
Original Word2Vec C Code: Word2Vec
Caveats: Due to the processing power of my computer and my patience, only 17,005,207 words were used. When training for production use, billions of words would normally be trained on.
No filters or cleansing was used in this data set. All words were tokenized by a space.
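As a sketch of what this training step looks like in R with the wordVectors package, the outline below shows one common workflow. File names, vector dimensions, and window/iteration settings are illustrative assumptions, not the exact parameters used for this post:

```r
# One-time setup: wordVectors is installed from GitHub.
# devtools::install_github("bmschmidt/wordVectors")
library(wordVectors)

# Lowercase and tokenize the raw corpus; "corpus.txt" is a placeholder
# for the raw-data file linked above.
prep_word2vec(origin = "corpus.txt", destination = "corpus_clean.txt",
              lowercase = TRUE)

# Train a Word2Vec model; 100-dimensional vectors is a common default,
# and the window/iter values here are illustrative.
model <- train_word2vec("corpus_clean.txt", "corpus_vectors.bin",
                        vectors = 100, window = 12,
                        threads = 4, iter = 5)

# Later sessions can reload the trained vectors without retraining.
model <- read.vectors("corpus_vectors.bin")
```

Training writes the vectors to the `.bin` file, so the expensive step only has to run once per corpus.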
Closest Words to Each Other
The following graph is a graphical representation of the Word2Vec algorithm in action. A few clusters are outlined below in colored boxes:
- Red – Letters of the alphabet
- Blue – Nationalities
- Yellow – Christian Religion
- Brown – Computers
- Green – Directions
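A cluster plot along these lines can be produced with the package’s built-in plot method, which projects the most frequent word vectors into two dimensions (t-SNE by default) so that similar words land near each other. This is a sketch, not necessarily the exact code behind the figure:

```r
library(wordVectors)

# Load a previously trained model; the path is a placeholder.
model <- read.vectors("corpus_vectors.bin")

# The default plot method reduces the highest-frequency word vectors
# to two dimensions, so clusters like letters, nationalities, and
# directions become visible on the page.
plot(model)
```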
Words to Describe Woman – Man and Good – Bad
Since the words are represented as vectors, we can perform some vector math. The graph below shows all words similar to good minus bad and to woman minus man, which lets us look at a specific category. Before observing the graph, know that the closer a score is to 1, the more similar the word; the closer to -1, the less similar. Let’s observe the graph below:
- On the far left is man, at about -0.2; since we chose woman minus man, the model scores this correctly
- At the top is good, meaning one popular descriptor along both the woman minus man and good minus bad axes is good
- On the right are the words desire, happy, pretty, loves, sleep, feelings, love, feel, knows, which lean more toward woman
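The vector arithmetic behind this graph can be sketched with wordVectors’ indexing and `closest_to`. The axis words match the post, but the exact call used to build the figure is an assumption:

```r
library(wordVectors)

# Load the trained model; the path is a placeholder.
model <- read.vectors("corpus_vectors.bin")

# Build the two contrast axes described above by subtracting
# one word vector from another.
gender_axis  <- model[["woman"]] - model[["man"]]
quality_axis <- model[["good"]]  - model[["bad"]]

# List the words most similar to each axis. Cosine similarity runs
# from -1 (least similar) to 1 (most similar), matching the graph's scale.
closest_to(model, gender_axis, n = 10)
closest_to(model, quality_axis, n = 10)
```

`closest_to` returns a data frame of words with their cosine similarity to the query vector, which is exactly the quantity plotted on each axis of the graph.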
From the examples above, we can conclude that the Word2Vec algorithm does cluster like words together. In fact, the algorithm allows for filtering word clusters to refine the model we are trying to build.