There’s a lot of excitement at Capgemini’s Applied Innovation Exchange in London this week. Today is Day 3 of the two-week Defra design sprint aimed at making better use of flood data.

The Environment Agency (EA) is part of the Defra family and is responsible for analysing flood risk in the UK. The EA wants to improve its ability to get the right data and insights to users in a number of easily digestible ways. The team hopes to prove the concept that, on the right platform, data science can provide the answers needed.

I spoke to Graham Jackson from our Data Science team to understand more about the data science and DevOps involved. Here’s what he told me:


“We’re looking at three types of ESRI shapefile datasets, and because the EA’s data ethos is ‘open by default’, you can see a lot of the data we’re using yourself. We’re using open map data from here (e.g. flood risk areas), open boundary lines data from here (e.g. counties, regions, etc.), as well as property data derived from Ordnance Survey (OS) AddressBase® Premium (not open). The property data is a key dataset; it contains UK addresses and property types (house, flat, commercial, residential, etc.).”

Hardware and platform

“We’ll be processing the data and getting insights using a Virtual Private Cloud (VPC) in AWS. The VPC has a Linux box (t2.micro) for data processing, an Amazon Relational Database Service (RDS) instance running PostgreSQL to do the heavy lifting, and a Qlik Sense server instance to give us visualisation options. Importantly, the RDS instance has the PostGIS extension installed so we get to use all its great spatial functions!”


“After a bit of pre-processing (checking ESRI projections are consistent, etc.), we loaded the data into the cloud and sense-checked it using QGIS. For example, here are flood warning areas for England:

“And here you can see property level data:

“We’ve also started on a few simple queries: essentially counting points in polygons, working out proportions, etc. For example:
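
To illustrate the idea of counting points in polygons and working out proportions, here is a minimal sketch in plain Python. It is not the team’s query: in the database this kind of test would normally be done with a PostGIS spatial predicate such as ST_Contains, whereas this sketch uses a simple ray-casting test. The function names, the rectangular "flood area" and the property coordinates are all made up for illustration.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: True if the point lies inside the polygon ring."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_in_polygon(points: Sequence[Point], polygon: Sequence[Point]) -> int:
    """Count how many points fall inside the polygon."""
    return sum(point_in_polygon(p, polygon) for p in points)


def proportion_in_polygon(points: Sequence[Point], polygon: Sequence[Point]) -> float:
    """Share of the points that fall inside the polygon (0.0 if there are none)."""
    if not points:
        return 0.0
    return count_points_in_polygon(points, polygon) / len(points)


# Hypothetical data: one rectangular flood area and four property locations.
flood_area = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
properties = [(1.0, 1.0), (3.5, 2.5), (5.0, 1.0), (2.0, 4.0)]

at_risk = count_points_in_polygon(properties, flood_area)
share = proportion_in_polygon(properties, flood_area)
print(f"{at_risk} of {len(properties)} properties at risk ({share:.0%})")
```

A spatial database does the same job far more efficiently on real data, because it can use spatial indexes and handles projections, holes and multi-part polygons, none of which this sketch attempts.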

What’s next?

“Over the next 7 days we will be working closely with Defra and the Environment Agency to build a pipeline where users can query data on the cloud and get quick responses to questions like ‘how many properties are at risk of flooding in MP X’s area?’ or ‘how many golf courses are at risk of flooding in MP X’s area?’”
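
A question like the ones above could, for instance, be turned into a parameterised spatial query. The sketch below is an assumption about how that might look, not the pipeline the team is building: the table names (properties, constituencies, flood_risk_areas), the column names and the function build_at_risk_query are hypothetical. ST_Contains and ST_Intersects are standard PostGIS functions, and the %s placeholders follow the style used by Python drivers such as psycopg2.

```python
from typing import Tuple

# Hypothetical schema: every table and column name here is an assumption.
AT_RISK_SQL = """
SELECT COUNT(*) AS at_risk
FROM properties AS p
JOIN constituencies AS c
  ON ST_Contains(c.geom, p.geom)
WHERE c.name = %s
  AND p.property_type = %s
  AND EXISTS (
      SELECT 1
      FROM flood_risk_areas AS f
      WHERE ST_Intersects(f.geom, p.geom)
  )
"""


def build_at_risk_query(constituency: str, property_type: str) -> Tuple[str, Tuple[str, str]]:
    """Return the SQL text and its parameters for one user question."""
    if not constituency.strip() or not property_type.strip():
        raise ValueError("constituency and property_type must be non-empty")
    # Values travel separately from the SQL text, so the driver can escape them.
    return AT_RISK_SQL, (constituency, property_type)


sql, params = build_at_risk_query("Example Constituency", "golf course")
print(sql)
print(params)
```

The EXISTS clause means a property is counted once even if it overlaps several flood risk polygons. With a live connection, the pair would be passed to the driver as cursor.execute(sql, params).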

Why are we doing this?

“Three reasons:

1. This is a great opportunity to test innovative ideas in the data science space.

2. Leveraging a cloud platform to provide scalable, centralised, consistent queries will provide a step change in productivity and efficiency.

3. The design sprint methodology is a powerful mechanism for driving incisive and rapid business change.”

To see previous sprintnotes, click here and here.