Major actors in different industries own data about phenomena and places, as well as about the way the two are correlated: they know where their customers live; they know where a transaction occurred; they know where their competitors are, or where they should be to intercept and meet a particular demand. This data is generally structured and trustworthy, but its use, even though quite structured itself, is usually limited to a few reactive activities.

On the other hand, since the dawn of the Prosumer and the rise of the discussion around the Digital Footprint, the potential of User Generated Content has come to light. Indeed, we have clarified how most of the information we, as Data Scientists, have access to contains a powerful layer: “where” that particular action or interaction occurred.

This data, created by users through a more or less conscious use of technology, is generally unstructured and not immediately trustworthy, but it is increasingly used to infer social behaviours. In spite of the ethical implications that the misuse and abuse of this information could have, its capability to generate insights is unquestionable.

Leveraging structured and unstructured data together is something we are quite used to. Doing so while taking the variable “Space” into account is still a chimera. In reality, Data Science still seems uninterested in Geographic Information, and this is a mistake.

Figure 1: City ‘shaped’ by Twitter interactions in two different months (Naples, Italy, March (top) and June (bottom), 2013)

Indeed, even if data is neutral, as it should be by nature, the real-life processes behind its ‘creation’ are not, and they never will be. The social interactions that pre-exist the transactional data we analyse to evaluate the performance of our stores, the Twitter threads we query to understand how our product is perceived and how our campaign is received, and the customer journeys we optimise all happen in a complex environment: a place.

Even though the processes we analyse through our algorithms tend to be executed, and to leave their traces, in virtual dimensions, they still refer to physical places. This reality made of places, proximity, and physical networks has too often been compressed in our analyses, reduced, at best, to a flattened space. But a complex set of dimensions exists, made up of everything we know about the distribution of phenomena and objects around us: from streets to trees, from pollution to the number of people using the Tube. It is all out there, but we are not using it. At least not as we should.

At the same time, the old saying ‘birds of a feather flock together’ is still tremendously contemporary. Two people living in the same neighbourhood are still more likely to have similar characteristics than two people chosen at random. In the same way, two similarly segmented neighbourhoods are more likely to share the same socio-demographic characteristics than two neighbourhoods chosen at random. Finally, people commuting along the same trajectories and patterns will likely share the same needs and will be looking for the same services, at the same time and in the same place.

What for? Again, why should we care about the reality outside the window?

What if we want to forecast the expected sales value of a certain product, say a sweetened frozen food, assuming:

–       The area of interest is London

–       The product is sold in different convenience stores

–       The product has quite a strong seasonality, with peaks in the warmest months

–       We are not interested in weekend sales value

By accessing transactional data we can analyse the quantities in which this and similar products have been sold over the last few years. We could use this information to start informing our understanding of the phenomenon and the statistical model to implement. The results could be satisfying, and we could be able to predict how many units of the product will be sold in the coming months, and the related sales value.
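As a minimal sketch of that first, purely transactional step, we could build a seasonal baseline: the expected units for a calendar month are the average of that month across the historical years. All figures and column names below are invented for illustration; a real model would of course be richer.

```python
# Seasonal baseline from (hypothetical) monthly sales history.
import pandas as pd

# Three years of monthly unit sales for a product with a summer peak (made up).
history = pd.DataFrame({
    "month": list(range(1, 13)) * 3,
    "units": [120, 130, 180, 260, 340, 420, 480, 450, 300, 200, 140, 125,
              125, 138, 190, 270, 355, 440, 500, 470, 310, 210, 150, 130,
              130, 145, 200, 285, 370, 455, 520, 490, 325, 220, 155, 135],
})

# Expected units for each calendar month: the mean of that month across years.
monthly_profile = history.groupby("month")["units"].mean()

def forecast_units(month: int) -> float:
    """Forecast unit sales for a given calendar month (1-12)."""
    return float(monthly_profile.loc[month])

print(round(forecast_units(7)))  # July, the seasonal peak
```

A baseline like this captures the seasonality but, as the text goes on to argue, says nothing about where the units will sell, or to whom.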

But what if we want more? What if we want to know where that product is going to be sold the most, and who will be buying it, and when? What if we want to know where we can meet our customers for real? Transactional data could be good enough for that. Generally, however, it is not.

So what can we do?

We could start by analysing the geo-demographic layer, studying the social structure of the places where the product has been sold. We can use existing demographic segmentations, or create our own if we are interested in a particular audience or target. At this stage we already have a clear idea of what our structured data can tell us, and of the story that can be built around it.
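In practice, this enrichment step is often a join between store-level sales and a small-area segmentation. The sketch below is hypothetical: the area codes, segment labels, and figures are all invented, and a real segmentation (such as an Output Area classification) would have many more groups.

```python
# Hypothetical sketch: enrich store-level sales with a geo-demographic
# segment for each store's small area (codes and segments are made up).
import pandas as pd

sales = pd.DataFrame({
    "store_id": ["s1", "s2", "s3"],
    "output_area": ["E00000001", "E00000002", "E00000003"],
    "units_sold": [320, 150, 410],
})

# A segmentation assigns each small area to a socio-demographic group;
# the labels here are illustrative only.
segments = pd.DataFrame({
    "output_area": ["E00000001", "E00000002", "E00000003"],
    "segment": ["young urban", "suburban families", "young urban"],
})

# Join on the area code, then compare sales across segments.
enriched = sales.merge(segments, on="output_area", how="left")
avg_by_segment = enriched.groupby("segment")["units_sold"].mean()
print(avg_by_segment)
```

The comparison of average sales across segments is the first hint of the “story” the structured data can tell about who buys the product.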

But this may still not be enough to answer our questions. Our model now performs better, but still not well: the socio-demographic composition of the area analysed turns out, in fact, not to be representative of our target.

Going deeper and scraping social media, we can collect all the mentions of the product and the related brand, analysing their content and differentiating good from bad mentions; we can then use explicit geographic information, or information contained in the associated metadata, to geolocate them (that is, to locate the events in geographic space).
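The geolocation logic described above can be sketched as a simple cascade: prefer explicit coordinates attached by the device, and fall back to place names found in the associated metadata. The field names and the tiny gazetteer below are assumptions for illustration, not any platform's real API.

```python
# Illustrative sketch of geolocating social-media mentions.
from typing import Optional, Tuple

GAZETTEER = {  # place name -> (lat, lon); entries are assumptions
    "london": (51.5074, -0.1278),
    "naples": (40.8518, 14.2681),
}

def geolocate(mention: dict) -> Optional[Tuple[float, float]]:
    """Return (lat, lon) for a mention, or None if it cannot be located."""
    # 1. Explicit geographic information attached to the mention.
    if mention.get("coordinates"):
        return tuple(mention["coordinates"])
    # 2. Fall back to place names found in the associated metadata.
    text = (mention.get("place") or mention.get("user_location") or "").lower()
    for name, coords in GAZETTEER.items():
        if name in text:
            return coords
    return None

print(geolocate({"coordinates": (51.52, -0.08)}))
print(geolocate({"user_location": "London, UK"}))
```

In practice the fallback is far noisier than this sketch suggests, which is one reason user-generated data is “not immediately trustworthy”.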

Enriching our model with this unstructured information allows us to understand in more depth what we are analysing, exposing the limits of hypotheses formulated without considering the geographic dimension. But, again, this may still not be enough: in reality, the people talking about the product in the area of interest differ, in terms of socio-economic characteristics, from the people living in that same area.

What are we missing?

The answer is Proximity!

We are missing the complexity of space and the simplicity of the phenomenon.

Sales volumes in London, as in other major cities, are driven (sometimes mainly) by its gravitational power, and consequently by its capability to attract flows from the outside towards the centre. Commuters and their behaviour were what we needed to analyse; proximity was the filter we needed to take into account.

In a warm month, our sweetened frozen food will sell significantly more when, among other less geographical but still significant variables, a higher number of convenience stores is located close to a park! It is obvious, isn’t it? This seems to be particularly true in the southern area, both East and West (map below).
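The proximity feature behind that finding can be sketched as a distance computation: for each store, the great-circle distance to the nearest park, turned into a binary “close to a park” variable for the model. The store and park coordinates below are illustrative, not real locations of any retailer.

```python
# Sketch of a park-proximity feature using the haversine formula.
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

# Hypothetical parks and convenience stores (coordinates are made up).
parks = {"Hyde Park": (51.5073, -0.1657), "Victoria Park": (51.5362, -0.0350)}
stores = {"store_a": (51.5080, -0.1700), "store_b": (51.5500, -0.2400)}

def nearest_park_km(store_coords):
    """Distance from a store to its nearest park, in km."""
    return min(haversine_km(store_coords, p) for p in parks.values())

# A simple binary "within 1 km of a park" feature for the sales model.
features = {s: nearest_park_km(c) < 1.0 for s, c in stores.items()}
print(features)
```

At city scale one would compute this with a spatial index rather than a brute-force minimum, but the feature itself is exactly this simple, which is the point: the geography is easy to encode once we decide to look at it.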

Figure 2: Forecasted Sales Value by Output Area (Parks and other leisure areas defined by low population density)

This is the reason why we need to care about the ‘where’ behind the data we analyse: what we are analysing is happening for real, somewhere out there… maybe in a park on a warm summer day.


User Generated Content: digital content created by users online 

Prosumer: in this context, a user who simultaneously creates and consumes information, mainly UGC

Digital Footprint: data created while using the Internet, with or without the owner’s knowledge, mainly as a consequence of the use of Social Media.



I am a senior Data Scientist working in the Capgemini UK Data Science team. We are currently recruiting Data Scientists and other roles.