Deriving sharp insights from unstructured texts: analytical approaches

Qixuan Yang
February 18, 2020

As exciting as textual data collection in the previous post sounds, readers may well ask, “So what?” This is a pressing question that raises concerns about the quality of the insights from the vast amount of (unstructured) data. We present three relevant use cases that current techniques can already solve: automatic classification of citizens’ complaints and requests, analysis of open-ended survey questions, and extraction of persons and entities related to a political issue.

1. Classification of citizens’ complaints

As digital infrastructure makes sending emails or online complaints to administrative agencies more convenient than ever, the expectation of a quick and accurate response increases as well. Thus, rapid classification of emails or complaints by subject is desirable. Traditional methods such as self-labeling or keyword-based categorization are severely inaccurate, and tuning them does not improve accuracy significantly.

Supervised machine learning for classification offers an answer. For instance, we build a training dataset that contains the content of complaints together with their labels, e.g., categories of political issues or administrative processes assigned on the basis of human knowledge. Then, we use statistical measures to guide the machine to learn the typical text features of each label. Features can be typical words, word combinations, or sometimes even characters.
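
To make the idea concrete, here is a minimal sketch of such a classifier. It assumes a naive Bayes model with single words as features, which is only one of many possible choices and not necessarily what a production system would use. The training texts, the labels, and the function names are hypothetical and serve illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (complaint text, human-assigned label).
TRAINING_DATA = [
    ("the street light on my road is broken", "infrastructure"),
    ("potholes on the main road damage cars", "infrastructure"),
    ("my passport application is delayed", "administration"),
    ("the registration office lost my application form", "administration"),
]


def tokenize(text):
    """Lowercase and split on whitespace; the words are the features."""
    return text.lower().split()


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior: share of training documents with this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAINING_DATA)
print(predict("my application form is still delayed", model))
print(predict("a broken light on the road", model))
```

With only four training examples the model is a toy, but the mechanics are the same at scale: the more labeled complaints it sees, the better its word statistics per label become.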

In addition, sentiment analysis can capture the intrinsic “policy emotions” beneath the content and further distinguish the degrees of negativity of complaints. The combination of these two NLP techniques provides government agencies with a quick and accurate way to route messages to the right department, reducing the workload for human employees and potentially increasing citizens’ satisfaction. The following example is a product of our team at Capgemini that classifies emails and scores their sentiment for the private sector. A similar logic can apply to the public sector as well.

Note: Email Sentiment Analysis, Asset of Capgemini Invent, AI Garage. Developed by Srithar Jeyaraman, Sanchit Malhotra, Kaustav Chattopadhyay, Arpit Rawal, Dheeraj Tingloo, Ravi Mrityunjay, and Deepak Kumar (2019).

2. Open-ended questions in surveys for measuring policy preferences

Surveys have been widely used in evidence-based governance. In Germany, chancellors have been conducting surveys since the 1950s. However, traditional survey questionnaire design has faced numerous revisionist critiques, and measurement through a multiple-choice format does not always reflect citizens’ political preferences. Open-ended questions, in turn, provide more abundant information but are perceived as hard to analyze systematically.

Unsupervised learning through topic modeling, a clustering algorithm, can mitigate these concerns. In general, the algorithm looks for words that co-occur across multiple documents. If, for instance, “parental leave” and “child” were spotted jointly among thousands of answers, the machine would consider them to belong to the same latent cluster. Through iterative training, we can easily summarize which aspects of specific policies are most salient to respondents and even calculate the proportion of those concerns.
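
The co-occurrence intuition can be sketched in a few lines. Note that the snippet below is a deliberately simplified stand-in and not a full probabilistic topic model: it merely counts how often two words appear in the same answer and groups words that are linked by frequent pairs. The survey answers, the threshold `min_count`, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical open-ended survey answers, already reduced to keywords.
ANSWERS = [
    "parental leave child daycare",
    "child parental leave allowance",
    "pension retirement age",
    "retirement pension contribution",
    "parental leave child",
]


def cooccurrence_counts(documents):
    """Count in how many documents each pair of distinct words appears together."""
    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        pair_counts.update(combinations(words, 2))
    return pair_counts


def cluster_words(pair_counts, min_count=2):
    """Group words linked by pairs that co-occur in at least min_count documents."""
    neighbors = {}
    for (first, second), count in pair_counts.items():
        if count >= min_count:
            neighbors.setdefault(first, set()).add(second)
            neighbors.setdefault(second, set()).add(first)
    clusters, seen = [], set()
    for word in sorted(neighbors):
        if word in seen:
            continue
        # Collect every word reachable from this one via frequent pairs.
        cluster, stack = set(), [word]
        while stack:
            current = stack.pop()
            if current not in cluster:
                cluster.add(current)
                stack.extend(neighbors[current] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters


pairs = cooccurrence_counts(ANSWERS)
for cluster in cluster_words(pairs):
    print(cluster)
```

In this toy data, the family-policy words and the pension words end up in separate groups because they never appear together often enough. A real topic model additionally estimates how strongly each answer belongs to each topic, which is what allows the proportions mentioned above to be calculated.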

The following example shows one application of topic modeling. Based on a sample of online complaints of Chinese citizens, the machine learns to cluster words in a way that humans can also understand. For instance, among those complaints labeled “medicine,” a majority of texts contain the typical related features.

Note: Clustering of typical words about a topic based on Chinese citizens’ online complaints. Own research. Visualization by Qixuan Yang.

3. Determining relevant political actors and their positions

The last example involves extracting relevant persons and entities around a policy. Imagine that after collecting pertinent policy papers and news about carbon taxation, we want to provide an overview of who is saying what. Here, named entity recognition (NER), another widely used technique in NLP, can be used to extract the relevant parties in a policy discussion. In this case, NER draws on existing or local dictionaries of names of persons and organizations and identifies them in the text through either rule-based matching or a statistical procedure.
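
As an illustration of the rule-based variant, the following sketch looks up names from a small local dictionary in a piece of text. The dictionary entries, the sample sentence, and the function name are hypothetical, and a statistical NER model would work differently by predicting entities from context rather than from a fixed list.

```python
import re

# Hypothetical local dictionary mapping known names to entity types.
ENTITY_DICTIONARY = {
    "Jane Doe": "PERSON",
    "John Smith": "PERSON",
    "Ministry of Finance": "ORGANIZATION",
    "Green Energy Association": "ORGANIZATION",
}


def extract_entities(text, dictionary):
    """Return (name, type, start offset) for every dictionary name found in text."""
    found = []
    for name, entity_type in dictionary.items():
        # Word boundaries prevent matches inside longer words.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            found.append((name, entity_type, match.start()))
    # Order the hits by their position in the text.
    return sorted(found, key=lambda item: item[2])


text = (
    "Jane Doe of the Ministry of Finance defended the carbon tax, "
    "while the Green Energy Association called it too modest."
)
entities = extract_entities(text, ENTITY_DICTIONARY)
for name, entity_type, start in entities:
    print(name, entity_type, start)
```

A dictionary lookup like this is transparent and easy to maintain, but it only finds names that are already listed, which is why statistical procedures are often used alongside it.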

NER can be further combined with other data science applications, such as network analysis or any of the classification or clustering algorithms described above. By combining these techniques, we can detect not only the relevant parties but also their positions and their relations with each other.
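
One simple way to connect NER with network analysis is a co-mention network, sketched below under the assumption that two entities are related whenever they appear in the same document. The entity lists are hypothetical NER output, and the function names are made up for this illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical NER output: the entities detected in each document.
ENTITIES_PER_DOCUMENT = [
    ["Jane Doe", "Ministry of Finance"],
    ["Jane Doe", "Green Energy Association", "Ministry of Finance"],
    ["Green Energy Association", "John Smith"],
]


def co_mention_edges(entity_lists):
    """Weight each pair of entities by the number of documents mentioning both."""
    edges = Counter()
    for entities in entity_lists:
        edges.update(combinations(sorted(set(entities)), 2))
    return edges


def degree(edges):
    """Count how many distinct partners each entity is linked to."""
    degrees = Counter()
    for first, second in edges:
        degrees[first] += 1
        degrees[second] += 1
    return degrees


edges = co_mention_edges(ENTITIES_PER_DOCUMENT)
degrees = degree(edges)
for entity, partners in degrees.most_common():
    print(entity, partners)
```

Entities with many partners sit at the center of the discussion. Determining their positions would require an additional step, for example classifying the sentences in which they are mentioned.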

In a nutshell, NLP makes it possible to analyze large amounts of unstructured text data systematically. The aforementioned techniques are not only interesting in the example scenarios but also generalizable to many other contexts. Classification and clustering can be used to analyze social media posts and discussions. NER, when combined with network analysis and clustering algorithms, can be used to trace disinformation campaigns, that is, how fake news or false information spreads across persons, accounts, and organizations, and with what content.

As readers may have noticed, the techniques for data collection and analysis are already established, yet still developing. These “soft” assets, however, can only reach their full functionality when hardware conditions are met. In the next blog post of the series, we introduce a pipeline perspective that is essential for NLP to work in the public sector.