Objective – To predict the outcome of the EU referendum

Hypothesis – Social media is a viable source for predicting the outcome of the EU referendum


The EU referendum has been one of the most discussed and controversial topics in recent years. It has been over 40 years since Britain joined the European Economic Community (EEC, as the EU was then known) under Prime Minister Edward Heath. Two years later, Britain held a referendum on whether to stay in or leave the EEC, and about 67% voted to stay.

Now, 41 years later, we have another referendum, but this time we have access to data that was unavailable in 1975. Social media is currently used by around 59% of the UK population, which is around 38 million people [1]. Some social networks, such as Facebook and Twitter, have APIs that can be used to analyse what people are talking about. This seemed like the right time to test whether social media is a viable source for predicting the result of the EU referendum.


The approach was to collect Twitter data containing the word ‘eureferendum’. To do this, an AWS server (free tier) was used to host a Python script that listened to the Twitter API. The output from the script was stored as a text file on the server. A number of attributes of each tweet were captured, such as the timestamp, the place of the tweet and the coordinates. Unfortunately, coordinates are only attached to a tweet if the user has agreed to this (by changing their settings) before sending it. As a result, few tweets arrived with coordinates, and we were unable to compile a heat map of the voting preference in each area.
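
As a rough illustration of the storage step, the sketch below picks the attributes mentioned above out of one tweet and appends them to a text file, one JSON record per line. It is not the original script: the function names and the one-record-per-line layout are assumptions, and the field names (created_at, text, place, coordinates) follow the tweet format of the Twitter API v1.1.

```python
import json


def extract_attributes(tweet):
    """Pick out the fields the write-up mentions from one tweet payload (a dict)."""
    # "place" and "coordinates" are null unless the user has enabled location sharing.
    place = tweet.get("place") or {}
    geo = tweet.get("coordinates") or {}
    return {
        "created_at": tweet.get("created_at"),
        "text": tweet.get("text"),
        "place": place.get("full_name"),
        # GeoJSON order is [longitude, latitude]; None when no location was shared.
        "coordinates": geo.get("coordinates"),
    }


def append_tweet(path, tweet):
    """Append one tweet to a text file as a single JSON line (assumed layout)."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(extract_attributes(tweet)) + "\n")
```

Writing one record per line keeps the file appendable while the collector is running and easy to read back line by line afterwards.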


Figure 1



How frequent were the tweets?

Below is the total number of tweets received each day from the 8th of May to the 23rd of June. During this time close to 250,000 tweets were collected in total. In the initial phase around 1,000 tweets a day were received, and the rate of tweets associated with the EU referendum then increased dramatically.

Figure 2
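
A daily count like this can be produced by grouping the stored timestamps by calendar date. The sketch below assumes the timestamps are in the created_at format used by the Twitter API v1.1 (for example "Sun May 08 10:00:00 +0000 2016"); the function name is illustrative.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. "Sun May 08 10:00:00 +0000 2016".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_day(timestamps):
    """Count tweets per calendar date, returned in date order."""
    counts = Counter()
    for stamp in timestamps:
        day = datetime.strptime(stamp, TWITTER_TIME_FORMAT).date()
        counts[day] += 1
    return dict(sorted(counts.items()))
```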


There are unexpected troughs in Figure 2, where the number of tweets dropped sharply. This is because the server was not powerful enough: it had only 1 GB of RAM, and when a large influx of tweets arrived at once, the Python script crashed. The script had to be restarted before tweets could be collected again, so a number of tweets were lost.
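
One way to shorten such gaps is to restart the collector automatically rather than by hand. The sketch below is a hypothetical wrapper, not part of the original setup: collect stands for whatever function runs the listener, and the restart limit and delay are arbitrary.

```python
import time


def run_with_restart(collect, max_restarts=5, delay_seconds=10, sleep=time.sleep):
    """Run collect(); if it raises, wait and run it again, up to max_restarts times.

    Returns the number of restarts that were needed.
    """
    restarts = 0
    while True:
        try:
            collect()
            return restarts
        except Exception as error:
            if restarts >= max_restarts:
                raise  # give up and surface the last error
            restarts += 1
            print(f"collector stopped ({error!r}); restart {restarts} of {max_restarts}")
            sleep(delay_seconds)
```

This only helps when the failure surfaces as a Python exception. If the operating system kills the whole process for using too much memory, the restart would have to come from outside the script, for example from a process supervisor.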



Figure 3 below shows the voting preferences expressed in the tweets. Each tweet was checked to see whether it was talking about staying or leaving, and the relevant count was incremented. Many tweets expressed no opinion either way, and their percentage is also shown below. This could indicate that those users were not going to vote.


Figure 3
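
The write-up does not say how a tweet was judged to be about staying or leaving. A minimal sketch of one possible approach is simple keyword matching; the term lists below are assumptions chosen for illustration, not the ones actually used.

```python
import re
from collections import Counter

# Hypothetical term lists; the original analysis may have used different ones.
LEAVE_TERMS = {"voteleave", "leaveeu", "takecontrol"}
REMAIN_TERMS = {"voteremain", "strongerin", "remain"}


def classify(text):
    """Label a tweet 'leave', 'remain' or 'no opinion' by keyword matching."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    mentions_leave = bool(words & LEAVE_TERMS)
    mentions_remain = bool(words & REMAIN_TERMS)
    if mentions_leave and not mentions_remain:
        return "leave"
    if mentions_remain and not mentions_leave:
        return "remain"
    # Tweets with neither side's terms, or with both, are treated as undecided.
    return "no opinion"


def preference_counts(texts):
    """Tally the labels over a collection of tweet texts."""
    return Counter(classify(text) for text in texts)
```

Keyword matching is crude: it cannot detect sarcasm or a tweet that quotes the other side, so the counts should be read as a rough signal rather than a measurement.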



Figure 4 below shows how the preference expressed in tweets changed from the first day of collection until the day of the vote. As time went on, the percentage of tweets in favour of leaving stayed fairly constant, hovering around the 55% mark. There was no sudden change in preference, and from the beginning it leaned towards leave. This was an early indicator that the UK public who use Twitter wanted to leave the EU.


Figure 4
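
A daily percentage of this kind can be computed once each tweet has a date and a label. The sketch below assumes the labels are the strings "leave", "remain" and "no opinion", and that the percentage is taken over tweets that expressed a preference. The write-up does not state which denominator was used, so that choice is an assumption.

```python
from collections import defaultdict


def daily_leave_share(labelled):
    """Return {day: percentage of 'leave' among 'leave' + 'remain' tweets}.

    labelled is an iterable of (day, label) pairs. Days without any
    decided tweets are left out of the result.
    """
    totals = defaultdict(lambda: {"leave": 0, "remain": 0})
    for day, label in labelled:
        if label in ("leave", "remain"):
            totals[day][label] += 1
    shares = {}
    for day in sorted(totals):
        decided = totals[day]["leave"] + totals[day]["remain"]
        shares[day] = 100.0 * totals[day]["leave"] / decided
    return shares
```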


Below is a word cloud for all of the tweets. Numbers, dates, punctuation and certain words were removed first, including “eureferendum”, since it was the keyword used to collect the tweets. After cleaning the data, the most common words in the tweets were “brexit”, “vote” and “voteleave”. Again, this is a strong indicator that the UK public was leaning towards voting to leave the EU.


Figure 5
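
The cleaning and counting behind a word cloud can be sketched as follows. This is an illustration under assumptions: the exclusion list is hypothetical apart from "eureferendum", and numbers, numeric dates and punctuation are dropped simply by keeping alphabetic tokens only.

```python
import re
from collections import Counter

# Hypothetical exclusion list; only "eureferendum" is named in the write-up.
EXCLUDED_WORDS = {"eureferendum", "rt", "the", "to", "and", "of", "in", "is", "for", "on"}

URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[a-z]+")


def word_frequencies(texts, excluded=EXCLUDED_WORDS):
    """Count words across tweets after removing links, digits and punctuation."""
    counts = Counter()
    for text in texts:
        without_urls = URL_PATTERN.sub(" ", text.lower())
        # Keeping only runs of letters discards numbers, dates and punctuation.
        words = WORD_PATTERN.findall(without_urls)
        counts.update(w for w in words if len(w) > 1 and w not in excluded)
    return counts
```

The resulting counts can be ranked with most_common() or passed to a word-cloud renderer, which sizes each word by its frequency.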

Figure 6


Actual Results

Below are the actual results of the referendum, compared with the results that were predicted. The prediction was close to the actual result, both for the vote and for the turnout.


Actual results








Our Results









After analysing the collected data and comparing it with the actual results, we can say with confidence that Twitter, as a social media platform, is a viable source of big data for predicting the outcome of the EU referendum.



If this were to be done again on a similar topic, perhaps to give more evidence that Twitter can be used as a viable source of big data, then collecting more data that contained the place and/or coordinates would enhance our results. Had the ages of the tweeters been captured, it would have been possible to see whether Twitter is used mainly by the younger generation, and to show the voting preference by age group. This is something to consider capturing next time.

One problem was that not as much data was collected as could have been: the AWS server, as mentioned earlier, had only 1 GB of RAM, and this was not enough. Moving to a larger AWS instance, or perhaps looking at Google Cloud Platform, would resolve this.


Thanks to Kieran White, who helped start this with me but was unable to contribute further due to other commitments.