“People are surprised when they load data into incompassionate algorithms and don’t see the face of God.” (quote from an anonymous analyst)
* actually there are only seven
More surprisingly, you can stream a Twitter feed into your Spark cluster and, with a few lines of code, see insights that appear like an apparition. The predictive power is as frightening as it is incredible. This article is not a witch-hunt seeking to target modern-day soothsayers, nor is it an excuse for every Data Scientist to add ‘oracle’ (the ancient Greek kind, not the ancient database kind) to their LinkedIn profile. Instead, it highlights some of the reasons why Data Scientists (and those who commission or manage their work) should stop and think about their latest project in ethical terms as opposed to just technical ones.
1. Users, and some operators, give data and analysis an inflated level of objectivity
Users need to embrace uncertainty and risk, using Data Science as a tool to reduce uncertainty (to aid decision-making) rather than claiming that it generates absolute conclusions.
2. Models hide the truth (SVM, Neural Network)
Ever tried to derive the impact of input variables on prediction results from a neural network? Don’t bother: it can’t be done exactly. There are many methods that attempt to approximate the inner workings of these models, but these meta-models are themselves predictive and therefore inexact.
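One such approximate method is permutation importance: shuffle a single input column and see how much the model's error grows. The sketch below is a minimal, standard-library illustration under stated assumptions. `black_box` is a hypothetical stand-in for an opaque model, and the data is randomly generated toy data. The scores are estimates that depend on the data and the shuffle seed, which is exactly the inexactness described above.

```python
import random

def black_box(row):
    """Hypothetical stand-in for an opaque model; pretend we cannot read this body."""
    x0, x1, x2 = row
    return 3.0 * x0 + 0.5 * x1 * x1  # x2 is silently ignored

def mean_sq_error(model, rows, targets):
    """Mean squared error of the model's predictions against the targets."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, column, seed=0):
    """Increase in error when one input column is shuffled across rows."""
    baseline = mean_sq_error(model, rows, targets)
    shuffled_values = [r[column] for r in rows]
    random.Random(seed).shuffle(shuffled_values)
    # Rebuild each row with only the chosen column replaced
    permuted = [
        r[:column] + (v,) + r[column + 1:]
        for r, v in zip(rows, shuffled_values)
    ]
    return mean_sq_error(model, permuted, targets) - baseline

# Toy data: three inputs drawn uniformly from [-1, 1]
rng = random.Random(1)
rows = [tuple(rng.uniform(-1, 1) for _ in range(3)) for _ in range(200)]
targets = [black_box(r) for r in rows]  # toy targets, so the baseline error is zero

# One score per input column; a larger score suggests a more influential input
scores = [permutation_importance(black_box, rows, targets, c) for c in range(3)]
```

A different seed or sample would give different scores, so they are best read as a ranking hint rather than a measurement of how the model actually works.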
3. Data hides the truth
Big Data is great…right? Big Data provides a wealth of opportunity to discover insights, but it also means that interrogating your model inputs is far more challenging and time-consuming. For example, when you build predictive models there will be certain inputs you choose not to use for legal, moral or technical reasons. But with so much data, how can you be certain that your remaining inputs aren’t highly correlated with these excluded inputs?
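A first, rough check for such proxies is to correlate each remaining input with the excluded one. The sketch below is a minimal illustration using only the standard library. The column names, the toy values and the 0.8 threshold are all hypothetical, and Pearson correlation only detects linear relationships, so a clean result does not prove that no proxy exists.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def flag_proxies(excluded, remaining, threshold=0.8):
    """Names of remaining inputs whose absolute correlation with the
    excluded input meets or exceeds the threshold."""
    return sorted(
        name for name, values in remaining.items()
        if abs(pearson(excluded, values)) >= threshold
    )

# Hypothetical toy data: 'age' was excluded from the model
age = [22, 35, 47, 51, 63, 29]
remaining_inputs = {
    "years_as_customer": [1, 12, 22, 27, 40, 5],
    "basket_size": [3, 9, 2, 7, 4, 8],
}

suspects = flag_proxies(age, remaining_inputs)
```

In this toy data `years_as_customer` tracks `age` closely and is flagged, while `basket_size` is not. With thousands of inputs the same loop still runs, but combinations of inputs can act as a proxy even when no single input does.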
4. Data Scientists hide the truth
Rarely on purpose, but decisions and assumptions are made at every step in your analysis. The big ones are recorded and surfaced to stakeholders, but it’s impossible to ensure that every assumption you make is fully scrutinised. It gets worse: everyone who has touched the data, hardware or software has also made assumptions. An obvious example is that many databases use a default date of 01/01/1901, so you look out for this in your analysis and transform it in some way to avoid problems. But consider what other assumptions underlie your analysis that you’ve inherited from others and that others will inherit from you.
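The default-date example can be handled explicitly, and counted, rather than silently. The sketch below is a minimal illustration: the sentinel is the 01/01/1901 value mentioned above, the `birth_dates` list is hypothetical, and real systems may use other placeholder values that would need adding to the set.

```python
from datetime import date

# Assumption: 01/01/1901 is the placeholder named in the text; other systems differ
SENTINEL_DATES = {date(1901, 1, 1)}

def clean_dates(values, sentinels=SENTINEL_DATES):
    """Replace placeholder dates with None and report how many were replaced."""
    cleaned = [None if v in sentinels else v for v in values]
    replaced = sum(1 for v in values if v in sentinels)
    return cleaned, replaced

# Hypothetical column containing two placeholder values
birth_dates = [
    date(1984, 3, 12),
    date(1901, 1, 1),
    date(1990, 7, 30),
    date(1901, 1, 1),
]

cleaned, replaced = clean_dates(birth_dates)
```

Returning the count alongside the cleaned values makes the assumption visible, so it can be recorded for whoever inherits the analysis.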
5. Users hide the truth
How often do you see your carefully recorded assumptions and caveats missing from the final presentation, or whole sections of analysis removed or ‘enhanced’? Even if none of the above has occurred, how carefully do readers consume a report or presentation, as opposed to skimming through it or going straight to the Executive Summary? (Did any of you read the * note when you read the title?)
6. Models explain how to maintain the status quo but don’t address the question of whether it should be maintained
Most predictive models used by Data Scientists are based upon historic patterns and the assumption that they’ll persist throughout the period you wish to predict. You can add clever corrections to account for expected future events or influencers. But by relying on such models you may be reinforcing or causing undesirable behaviour. For example, you may be able to identify the most effective method for persuading people to buy doughnuts, but in doing so are you adversely affecting the health of people who may have consumed fewer doughnuts in the absence of your highly effective analysis and the marketing campaign based upon it?
7. Science is the first casualty when running short of time (shortly followed by documentation and testing)
We’ve all heard the saying ‘Correlation is not causation’, or something similar, and we all know it to be correct. But why is it correct?
In modern times (since the Renaissance) the Scientific Method uses Deductive Reasoning (or ‘Deduction’) whereby a general hypothesis is formed and then experiments are used to validate or refute the hypothesis using specific results.
Data Science rarely operates this way. Most commonly it uses Inductive Reasoning, whereby a few specific results are used to generate a general hypothesis that is then applied to a far greater set of results.
An extreme solution could be to use all data, not just the subset that you can acquire and process. A more pragmatic solution is to use Abductive Reasoning, whereby you use a subset of your data to generate a hypothesis and then test it using independent data. This helps ensure that you never generate highly general conclusions from the tiny window of data through which you view the world.
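The generate-then-test approach described above can be sketched as a simple holdout split. This is a minimal illustration under stated assumptions: the data is synthetic (roughly y = 2x + 1 plus noise), the "hypothesis" is a straight-line fit, and the 30% holdout fraction and the function names are illustrative choices, not a prescribed method.

```python
import random

def split(rows, holdout_fraction=0.3, seed=0):
    """Shuffle a copy of the rows and split them into (explore, holdout)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_line(rows):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(rows, slope, intercept):
    """Average absolute gap between the fitted line and the observed y values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in rows) / len(rows)

# Synthetic data: y is roughly 2x + 1 with uniform noise in [-1, 1]
rng = random.Random(42)
data = [(x, 2 * x + 1 + rng.uniform(-1, 1)) for x in range(50)]

explore, holdout = split(data)

# Generate the hypothesis from one subset only
slope, intercept = fit_line(explore)

# Test it on rows that played no part in forming it
holdout_error = mean_abs_error(holdout, slope, intercept)
```

If the holdout error is much worse than the error on the exploration subset, the hypothesis probably reflects quirks of that subset. The holdout only counts as independent if it is never used to tune the hypothesis.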
Many of the causes listed above are intrinsic to Data Science, but they’re not always identified because people are not looking out for them. Even the most intelligent algorithms struggle to apply any kind of judgment that is informed by cultural context. Often this context varies geographically, so the problem is exacerbated by the ability to gather data quickly via the Internet with little or no constraint on the distance between collector and source.
Long-term, Data Science training should focus more on whether you should conduct an analysis, as opposed to just how to conduct it (i.e. the ethical dimension).
Short-term, individuals and projects should make time to consider the ethical dimension. This may be as simple as thinking about it on the train home or explicitly allocating tasks in your next sprint to read around the subject for a few hours. This article will hopefully give you food for thought, but there is plenty of freely accessible material out there. One such example is the Modelers’ Hippocratic Oath written by Emanuel Derman in his paper A defense of one's own life:
- I will remember that I didn’t make the world, and it doesn’t satisfy my equations.
- Though I will use models boldly to estimate value, I will not be overly impressed by mathematics.
- I will never sacrifice reality for elegance without explaining why I have done so.
- Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.
- I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension.
I'm a senior Data Scientist working in the Capgemini UK Data Science team and here is an example of our work. We're currently recruiting highly moral Data Scientists and slightly less moral Big Data Engineers.