Topic Modelling: Deriving insight from large volumes of unstructured text data through unsupervised machine learning

The rise of social networks has led to an increase in unstructured data available for analysis, with a large proportion of this data being in text format such as tweets, blog posts, and Facebook posts. This data has a wide range of applications. For example, it is often used in marketing to understand people’s opinions on a new product or campaign, or to learn more about the target market for a particular brand.

When dealing with large volumes of unstructured text data, it can be difficult to extract useful information efficiently and effectively. There is almost always too much data to read through manually, so a method is needed that will extract the relevant information from the data and summarise it in a useful way.

Topic modelling is one such method. It is a technique that automatically identifies topics (groups of commonly co-occurring words) within a set of documents (e.g. tweets, blog posts, emails).

An effective topic model should output a number of very distinct groups of related words, which are easily identifiable as belonging to the same subject. For example, if the topic model were trained on thousands of tweets related to diet, one group of words might include “gluten”, “glutenfree”, “coeliac”, “intolerance”, which would correspond to a “gluten free diet” topic. Another group of words might be “vegan”, “dairyfree”, “meatfree”, which would represent a “vegan diet” topic.

Latent Dirichlet Allocation (LDA) is one of the most popular approaches for topic modelling, and is what will be discussed here.

The first step is to collect and prepare the documents to be analysed. The text within the documents should be cleaned so that the words that define each topic make sense, and would be relevant only to that topic. Usernames, URLs, symbols and common words (e.g. and, or, I, a, etc.) should all be removed before running the model.
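The cleaning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the exact pipeline used for this post: the function name `clean_tweet`, the regular expressions and the very short stop-word list are all assumptions made for the example, and a real project would use a much fuller stop-word list.

```python
import re

# A tiny illustrative stop-word list (an assumption); real lists are far longer.
STOP_WORDS = {"a", "an", "and", "or", "i", "the", "to", "of", "is", "it", "my", "for", "in", "so"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lower-case a tweet, strip URLs, usernames and symbols, and drop stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove URLs first, while they are still intact
    text = USER_RE.sub(" ", text)         # remove @usernames
    text = NON_LETTER_RE.sub(" ", text)   # remove '#', digits and punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]


tweet = "@sam I love the new #glutenfree bread! https://example.com/recipe"
print(clean_tweet(tweet))  # ['love', 'new', 'glutenfree', 'bread']
```

The order of the substitutions matters here: URLs and usernames are removed before punctuation is stripped, otherwise fragments such as "https" or "sam" would survive as ordinary words.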

These cleaned documents are then passed to the topic model. The model iterates through all of the words in each document and identifies words that occur together frequently. Every document is iterated over until the model becomes internally consistent (i.e. it does not change how words are allocated to topics during subsequent iterations).
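To make the iterative process more concrete, here is a toy collapsed Gibbs sampler, which is one common way of fitting an LDA model. It is a sketch for illustration only: the function name `gibbs_lda`, the hyperparameter values and the sample documents are assumptions, and Gensim (used later in this post) fits LDA with a different inference method, online variational Bayes. For simplicity the sketch runs a fixed number of sweeps rather than checking whether the allocations have stopped changing.

```python
import random
from collections import defaultdict


def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenised documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic k
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times word w is assigned to topic k
    topic_total = [0] * num_topics                              # total words assigned to topic k
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Weight of each topic for this word, given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word


# Hypothetical cleaned documents, echoing the diet example above.
docs = [
    ["gluten", "glutenfree", "coeliac", "intolerance"],
    ["glutenfree", "coeliac", "gluten"],
    ["vegan", "dairyfree", "meatfree"],
    ["meatfree", "vegan", "dairyfree", "vegan"],
]
doc_topic, topic_word = gibbs_lda(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    top_words = sorted(counts, key=counts.get, reverse=True)[:3]
    print(k, top_words)
```

Each word's topic is resampled in proportion to how often that topic already appears in the same document and how often the word is already assigned to that topic, which is what pulls co-occurring words into the same topic over repeated sweeps.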

The model outputs lists of frequently co-occurring words in the documents, along with the probability of each word belonging to that list. Each of these lists represents a topic. These topics can be visualised in a way that shows their relative sizes and how distinct they are from one another. This can be helpful in determining the overlap between topics, which may indicate if any of them should actually be merged into a single topic, and which topics are the most common within the documents. However, most of the interpretation of these lists of words into meaningful topics is a manual process and can be difficult if the words in the list are too common, or do not seem to be strongly related to one another.
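One way to put a number on how distinct two topics are is to compare their word-probability lists directly. The sketch below uses the Jensen-Shannon divergence, which is one common choice for this kind of comparison. The function name `js_divergence` and the topic probabilities are invented for illustration and are not output from a real model.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two word-probability dicts."""
    total = 0.0
    for w in set(p) | set(q):
        pw, qw = p.get(w, 0.0), q.get(w, 0.0)
        mw = 0.5 * (pw + qw)  # probability of w under the average of the two topics
        if pw > 0:
            total += 0.5 * pw * math.log2(pw / mw)
        if qw > 0:
            total += 0.5 * qw * math.log2(qw / mw)
    return total


# Hypothetical topic-word probabilities, invented for illustration.
gluten = {"gluten": 0.4, "glutenfree": 0.3, "coeliac": 0.2, "intolerance": 0.1}
vegan = {"vegan": 0.5, "dairyfree": 0.3, "meatfree": 0.2}
allergy = {"gluten": 0.3, "intolerance": 0.3, "dairyfree": 0.2, "coeliac": 0.2}

print(js_divergence(gluten, vegan))    # topics with no words in common
print(js_divergence(gluten, allergy))  # topics that share several words
```

With base-2 logarithms the value lies between 0 (identical topics) and 1 (topics with no words in common), so a low value between two topics is a hint that they might be candidates for merging.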

In addition to summarising groups of documents, topic models can be useful for finding similarity between documents, or finding the relevancy of a document to a particular subject.
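As a rough illustration of the similarity use case: once a model has assigned each document a set of topic proportions, two documents can be compared through those proportions instead of their raw words. The sketch below uses cosine similarity, which is one of several reasonable measures. The function name and the topic proportions are hypothetical values chosen for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length topic-proportion vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical topic proportions for three documents over three topics.
doc_a = [0.80, 0.15, 0.05]
doc_b = [0.70, 0.20, 0.10]
doc_c = [0.05, 0.10, 0.85]

print(cosine_similarity(doc_a, doc_b))  # both documents are dominated by the first topic
print(cosine_similarity(doc_a, doc_c))  # documents dominated by different topics
```

The relevancy of a document to a particular subject can be read off in a similar way: it is simply the proportion the model assigns to the corresponding topic, e.g. `doc_a[0]` for the first topic.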

Topic modelling can be very powerful, but the technique has some potential issues. Firstly, it is computationally expensive. With a very large number of documents the model can take a long time to run, and it might not be possible to run it on a typical laptop at all (e.g. 1 million documents with 1000 topics and 500 iterations can take around 40 hours). These computational limitations can be eased by parallelisation (e.g. running the model on multiple processors at once).

In addition, the number of topics must be specified before the model can be run. This can be difficult to do, especially when the content of the documents is unknown. A technique called the Hierarchical Dirichlet Process (HDP) infers an appropriate number of topics from the documents themselves, and can be used when the desired number of topics is not known in advance.

We will now look at an example with real data, using an LDA model to find the general topics within the food conversation on Twitter.

Over 20,000 tweets were collected from people in the UK talking about food and eating. After following the methodology described above using Python 3.5 and the Gensim LDA model, 10 topics were found in the data.

These topics were visualised using a package called LDAvis, which shows the intertopic distance (i.e. how similar or distinct the topics are), and the relative sizes of the topics. The figure below shows the output of the topic model using this visualisation, and it shows the words most strongly associated with each topic. The occurrence of specific words within each topic can also be visualised by selecting a word instead of a topic.


Words related to eating (e.g. eat, eating, eats, ate) were removed from the documents before analysis, because they were among the search terms used to collect the data. Had they been left in, at least one of them would likely appear in every topic, making the topics harder to interpret.
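Removing the search terms is a small extra filtering step on the tokenised documents. A minimal sketch, assuming the documents are already lists of lower-case words (the function name `drop_search_terms` is hypothetical):

```python
# The search terms listed above; the full list used for collection may be longer.
SEARCH_TERMS = {"eat", "eating", "eats", "ate"}


def drop_search_terms(tokens, search_terms=SEARCH_TERMS):
    """Return the tokens with the collection search terms removed."""
    return [tok for tok in tokens if tok not in search_terms]


print(drop_search_terms(["ate", "pizza", "breakfast"]))  # ['pizza', 'breakfast']
```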

Topic 1 is the largest topic in the data, and consists of people tweeting about what they had to eat the previous day, or what they will eat today. The closest topic to this is topic 6, which is people talking about what they will eat today. Instagram is one of the most relevant words for topic 6, suggesting that this topic could also be interpreted as people sharing photos of their food.

Topic 2 is “eating too much”, and is close to topics 3 and 7 which are “hunger” and “need to stop eating” respectively.

Topics 4 (“what people feel like eating”) and 5 (“haven’t eaten for a while”) are very close together in the intertopic distance map, which suggests overlap between people skipping meals and people craving different foods. Pizza, for example, is one of the words with the most overlap between these topics.

Topic 9 is “weight loss/health”, and breakfast is the most relevant meal within this topic. There is also a topic around eating with family or friends (topic 10), and again, breakfast is the most relevant meal within this topic. This can be a useful insight. For example, companies that produce breakfast foods could use it to drive social engagement online by sharing recipes for healthy, family-friendly breakfasts.

The final topic, number 8, is YouTube videos of food. Animals also feature heavily in this topic, suggesting that a large number of videos about food that are shared on Twitter are about dogs or cats eating.

This example shows how topic modelling can be valuable in helping to understand the themes in the data, how people talk about a particular subject and how the different topics within the documents are related to each other.
