21-04-2017

Today I’m going to jump right into Text Mining and, in doing so, cover some key aspects of Machine Learning. Are you ready to calculate distance in a hyper-dimensional space? What if I tell you it’s a lot easier than it may seem? Bear with me; I can assure you that with some context, everyone is capable of it.

Text Mining is roughly equivalent to Text Analytics, as it aims to turn text into data that is ready for analytics. Some common techniques that you may be familiar with are clustering and classification.

Simply put, Text Mining is often used to answer questions like:

  1. Which texts are similar?
  2. What category is this text?
  3. Which terms summarize the text well?

First, some context

Before diving deeper, a bit of context on the terminology might help to avoid confusion. A single instance of text, whether it is a book or a tweet, is commonly called a document. The total collection of documents is called the corpus.

To measure similarity between documents, some abstraction is needed. A common abstraction method is to turn each document into a list of word counts. This is sometimes called a bag of words; the counts for all documents together form what is technically called a term document matrix (TDM).
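As a rough sketch of this abstraction, the Python snippet below counts the words in each document and lines the counts up against a shared list of terms. The function name `term_document_matrix` and the two sample sentences are made up for illustration, and splitting on whitespace is a simplifying assumption; real tokenization also has to deal with punctuation and the like.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (terms, matrix), where matrix[name] holds the word counts
    of that document in the same order as the sorted list of terms."""
    # Bag of words per document: lowercase, split on whitespace (an assumption).
    counts = {name: Counter(text.lower().split()) for name, text in documents.items()}
    # Every distinct word in the corpus becomes one term (one dimension).
    terms = sorted(set().union(*counts.values()))
    matrix = {name: [bag[term] for term in terms] for name, bag in counts.items()}
    return terms, matrix


# Hypothetical two-document corpus, purely for illustration.
corpus = {"doc1": "the cat chased the mouse", "doc2": "the dog chased the cat"}
terms, tdm = term_document_matrix(corpus)
print(terms)  # ['cat', 'chased', 'dog', 'mouse', 'the']
print(tdm)    # {'doc1': [1, 1, 0, 1, 2], 'doc2': [1, 1, 1, 0, 2]}
```

Each document ends up as a list of numbers of equal length, which is what makes the distance calculations further on possible.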

Machine Learning Multiple Dimensions

Take the admittedly contrived example of two documents, each containing only a single word:

Document A: ‘cat’  

Document B: ‘dog’

The TDM of the corpus (all documents) would be:

    Term    Document A    Document B
    cat         1             0
    dog         0             1

Now this might seem trivial, but this representation is basically a coordinate system: each word is an axis and each document is a point. You can use it to easily calculate the distance between two objects.

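With Document A at (1, 0) and Document B at (0, 1) on the (cat, dog) axes, the Euclidean, or straight line, distance between these documents then becomes:

    d(A, B) = √((1 − 0)² + (0 − 1)²) = √2 ≈ 1.41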

This extends to any length of text. Extending each document with a single word:

Document A: ‘The cat’  

Document B: ‘The dog’

The TDM of the corpus (all documents) would be:

    Term    Document A    Document B
    the         1             1
    cat         1             0
    dog         0             1

This leads to three dimensions.

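With Document A at (1, 1, 0) and Document B at (1, 0, 1) on the (the, cat, dog) axes, the distance between these documents then becomes:

    d(A, B) = √((1 − 1)² + (1 − 0)² + (0 − 1)²) = √2 ≈ 1.41

The shared word ‘the’ contributes nothing, so the distance is the same as before.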

Moving on to more realistic examples, actual documents containing hundreds or thousands of distinct words result in a hyperdimensional space where our intuition, like our plotting options, fails us.

Luckily the formula works just fine! Take the four-dimensional distance between two points p and q, for instance:

    d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + (p₃ − q₃)² + (p₄ − q₄)²)
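The same calculation can be sketched in a few lines of Python that work for any number of dimensions. The function name `euclidean_distance` is my own choice, and the four-dimensional counts in the last call are made up for illustration.

```python
import math


def euclidean_distance(p, q):
    """Straight line distance between two points with the same number of dimensions."""
    if len(p) != len(q):
        raise ValueError("both points need the same number of dimensions")
    # Sum the squared difference per dimension, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Two dimensions: 'cat' versus 'dog' on the (cat, dog) axes.
print(euclidean_distance([1, 0], [0, 1]))
# Three dimensions: 'The cat' versus 'The dog' on the (the, cat, dog) axes.
print(euclidean_distance([1, 1, 0], [1, 0, 1]))
# Four dimensions: hypothetical word counts.
print(euclidean_distance([2, 1, 0, 3], [1, 0, 1, 1]))
```

The first two calls both give the square root of 2, and the third gives the square root of 7. Nothing in the function changes as dimensions are added, which is exactly why thousands of words are no harder than two.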


The idea of multiple dimensions might seem weird or unwieldy, but it is at the core of many areas of Machine Learning. Regression, Classification, and Clustering all operate in spaces like this.

For Text Mining, this practically means we now have a way to cluster documents and find the document that is most similar. If we have sets of already classified documents, we can now use this to estimate the classification, for instance to predict if a document is a news or sports article.
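To make the classification idea a bit more concrete, here is a minimal sketch that labels a new document with the category of its closest already classified neighbour. This nearest-neighbour approach is only one simple way to use distance for classification, and the terms, word counts, and the helper name `nearest_label` are all hypothetical.

```python
import math


def euclidean_distance(p, q):
    """Straight line distance between two equally long count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_label(vector, labelled_vectors):
    """Return the label of the classified document closest to `vector`."""
    _, label = min(labelled_vectors, key=lambda pair: euclidean_distance(vector, pair[0]))
    return label


# Hypothetical word counts for the terms ['goal', 'match', 'election', 'minister'].
labelled = [
    ([3, 2, 0, 0], "sports"),
    ([4, 1, 0, 1], "sports"),
    ([0, 0, 2, 3], "news"),
    ([0, 1, 3, 1], "news"),
]

new_document = [2, 1, 0, 0]
print(nearest_label(new_document, labelled))  # sports
```

The new document sits closest to the first sports article, so it is estimated to be a sports article as well.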

However, one clear issue remains with the method so far: a word like ‘the’ does not carry much information about a document, nor is it specific to any one text. If one document contains ‘the’ 300 times and another 700 times, the difference might seem relevant using the above method, even though it is likely the result of document length and not the actual subject matter.

Term Frequency-Inverse Document Frequency

To account for this, a slightly more sophisticated method called Term Frequency-Inverse Document Frequency (TF-IDF) is normally used. Skipping the calculation for now, TF-IDF takes the word counts as before but then scales them down according to how many documents in the corpus contain the word. For instance, if all documents contain the word ‘the’, its TF-IDF value would be zero, and the frequency of the word would not affect the distance between documents at all.
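For the curious, a minimal sketch of one common textbook variant is shown below: the word count multiplied by the logarithm of the number of documents divided by the number of documents containing the word. The function name `tf_idf` and the sample documents are assumptions for illustration, and library implementations often use smoothed variants, so their numbers may differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using count * log(N / df)."""
    counts = [Counter(text.lower().split()) for text in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for bag in counts:
        doc_freq.update(bag.keys())
    return [
        {term: count * math.log(n_docs / doc_freq[term]) for term, count in bag.items()}
        for bag in counts
    ]


# Hypothetical corpus in which every document contains 'the'.
weights = tf_idf(["the cat", "the dog", "the the bird"])
for doc_weights in weights:
    print(doc_weights)
```

Because ‘the’ appears in all three documents, the logarithm of 3 / 3 is zero and its weight vanishes everywhere, no matter how often it occurs. The words ‘cat’, ‘dog’, and ‘bird’ each appear in only one document and keep a positive weight.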

Last Words

So, in conclusion, much of Text Mining and ML revolves around making a sensible abstraction aimed at solving a specific problem. Although the methods are often genuinely complex, this does not mean that there is no way of getting an idea of what is going on.

Stay tuned for my next post, where I will continue on the Text Mining path and look at all the options we have for applying Text Mining with SAP HANA. If you are already looking to apply ML or Text Mining, please reach out; we are always eager to help.