The public sector in the age of AI: How to turn a pile of papers into analytical assets?

Qixuan Yang
February 11, 2020

Beyond traditional applications, such as chatbots or interactive websites, cutting-edge data science technologies can already deliver relevant and scalable products and services to the public sector: social network analysis for online public opinion, email classification for targeted, responsive governance, facial recognition at border controls, among others.

Nevertheless, we intuitively equate the public sector with paperwork and documents – a picture of inefficiency. Almost everyone complains about the bureaucracy behind that image. However, the ever-growing field of natural language processing (NLP) may be able to turn this image around, either by reducing manual text generation or by exploiting the large amounts of text data the public sector produces.

The primary purpose of this series is to show how NLP, especially text analysis, can contribute to the public sector. This first post focuses on text mining techniques for information gathering, which lay the groundwork for text-based analytics that elevate evidence-based decision-making to the next level.

1. Text mining: Beyond copy-paste

Imagine that you are reporting to political leaders on the latest news regarding social welfare reform, a controversy that requires quick mapping of the media landscape and the right reaction to public opinion. Conventionally, snippets of news are collected by reading “important” newspapers or clicking through “well-known” websites and social media accounts. In essence, this is still a copy-paste procedure based on digital content. Such manual processes, however, notoriously lack comprehensiveness, and are slow and error-prone.

It is not enough to turn a government of paper into a government of jammed Word documents. Text-mining techniques aim to solve the problem of information gathering and storage in a transparent and systematic way. Below, we briefly show some examples and their uses.

1.1. Amassing web-based content

Web scraping or crawling, aided by rule-based NLP techniques, is an established way to speed up and scale information gathering on the internet. In essence, we can exploit the structure of the HTML or use an application programming interface (API) provided by the host to automatically download the information in the desired format.
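As a minimal sketch of the idea, the snippet below extracts links and titles from an HTML fragment using only Python’s standard library. The page content is a hypothetical example inlined as a string; in practice it would be downloaded with an HTTP client (or obtained via an API), and dedicated libraries such as BeautifulSoup would typically replace the hand-rolled parser.

```python
from html.parser import HTMLParser

# Hypothetical press-release listing; in a real pipeline this HTML
# would be fetched from the agency's website.
PAGE = """
<ul class="press">
  <li><a href="/2020/reform">Welfare reform announced</a></li>
  <li><a href="/2020/budget">Budget debate continues</a></li>
</ul>
"""

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self._href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None and data.strip():
            self.links.append((self._href, data.strip()))
            self._href = None

parser = LinkExtractor()
parser.feed(PAGE)
# parser.links now holds structured records ready to be stored in a dataset.
```

The same pattern scales from one page to thousands: loop over archive URLs, feed each page through the parser, and append the structured records to a table.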

I was once involved in a research project[1] for which I was able to automatically scrape parliamentary questions from the German Bundestag’s archives from the 1970s onwards and store the documents in a structured dataset within days. The dataset contained precise information including titles, content, keywords, authors, dates, and endorsing parties. Basic knowledge of regular expressions (RegEx) comes in handy when separating the questioning and answering actors from a single string.
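To illustrate the RegEx step, the sketch below pulls the questioning and answering actors out of a single metadata string. The string format here is invented for illustration; the actual archive format differs, but the technique of using named capture groups is the same.

```python
import re

# Hypothetical metadata line as it might appear in a scraped archive.
record = ("Kleine Anfrage der Fraktion DIE LINKE, "
          "beantwortet durch das Bundesministerium der Finanzen")

# Named groups make the extracted fields self-documenting.
pattern = re.compile(
    r"der Fraktion (?P<questioner>.+?), "
    r"beantwortet durch (?P<answerer>.+)$"
)

m = pattern.search(record)
questioner = m.group("questioner")
answerer = m.group("answerer")
```

Applied row by row, such patterns turn free-text metadata into clean columns of a dataset.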

For the public sector, this form of information collection provides a more comprehensive (i.e., less biased) overview of the media landscape. Moreover, the automatic pipeline offers consistent, transparent, and scalable insights for further analysis. It also cuts the costs associated with manual work.

1.2. Converting handwritten documents and speeches

Existing optical character recognition (OCR) techniques can easily detect and transcribe handwritten text or scanned documents into digitized characters. Various machine learning and deep learning algorithms achieve high accuracy. These techniques are particularly interesting for government agencies that want to automate the digital storage of letters or handwritten documents, such as tax declarations or work permits.
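At its core, OCR is a classification problem: map an image of a character to a label. The toy sketch below makes that idea concrete with tiny hand-coded binary grids and nearest-neighbour matching against labelled templates; real systems instead train deep networks on large datasets such as MNIST, but the input-to-label mapping is the same.

```python
# Each "image" is a 3x3 binary grid, flattened into a tuple.
# The templates and the scanned sample are invented for illustration.
TEMPLATES = {
    "1": (0, 1, 0,
          0, 1, 0,
          0, 1, 0),
    "7": (1, 1, 1,
          0, 0, 1,
          0, 0, 1),
}

def hamming(a, b):
    """Number of pixels on which two grids disagree."""
    return sum(x != y for x, y in zip(a, b))

def recognise(image):
    """Return the label of the closest template."""
    return min(TEMPLATES, key=lambda lbl: hamming(TEMPLATES[lbl], image))

scanned = (0, 1, 0,
           0, 1, 0,
           0, 1, 1)  # a slightly noisy "1"
```

Here `recognise(scanned)` returns `"1"`, since the noisy grid differs from the “1” template by one pixel but from the “7” template by five.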

Speech recognition systems based on neural networks can now quickly facilitate the transcription of, for instance, politicians’ speeches into a readable format. Both business and open-source solutions are well developed.

Example of handwritten numbers that could be successfully transformed into machine-readable digits. Visualization by Joseph Steppan, CC BY-SA 4.0/ LeCun, Cortes and Burges, MNIST Dataset

As exciting as textual data collection sounds, readers may well ask, “So what?” This is a pressing question that raises concerns about the quality of the insights drawn from vast amounts of (unstructured) data. In the next article of this blog series, we will examine the analytical potential unlocked by advances in machine learning research.

Qixuan Yang, Senior Data Scientist at the AI Garage in Capgemini Invent in Germany, is interested in the fusion of social science and data science. In particular, he focuses on text analysis in the realm of natural language processing. With a strong background in interdisciplinary research, Qixuan aims to provide neat business solutions backed by statistical learning approaches, as well as evidence-based consulting for better decision-making. You can get in touch here.

[1] CAP, Comparative Agendas Project,