AI techniques like Machine Learning (ML) can unravel deeper insights from data sets than traditional statistical techniques. Big Data both requires and enables these new methods: mass storage and high-performance computing now give us access to very large amounts of data.
The first two V’s of Big Data, Volume and Variety, have to be met before Machine Learning can work. For instance: with large amounts of data about visits to a web shop, you can classify visitor types and predict how they will use the website. This way you can create product recommendations for your visitors, even first-time visitors.
When using a relatively simple ML technique like decision trees, every decision node needs at least ten occurrences in the training and test data. With tens of thousands of decision nodes, this can easily require millions of data records to cover the complete model. Collecting these vast quantities of data can be challenging.
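The arithmetic behind that claim can be sketched in a few lines. This is a back-of-envelope estimate using the rule of thumb above (at least ten occurrences per decision node); the node count is an illustrative assumption, not a property of any particular model.

```python
MIN_RECORDS_PER_NODE = 10  # rule of thumb: ten occurrences per decision node

def records_needed(decision_nodes: int) -> int:
    """Lower bound on the training/test records needed to cover every node."""
    return decision_nodes * MIN_RECORDS_PER_NODE

# With a (hypothetical) model of 100,000 decision nodes, the bare minimum
# already reaches a million records — before accounting for class imbalance
# or train/test splits, which push the real requirement higher.
print(records_needed(100_000))  # 1000000
```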
Garbage In, Garbage Out
But to fully exploit the possibilities of Artificial Intelligence, and its promise of truthful and correct predictions and advice, we also need data of the right quality. The old computing adage still holds: “Garbage In, Garbage Out.” But why is this becoming more problematic with AI and Machine Learning? With traditional data analysis, when bad data is discovered in our set, we can exclude it and start over. This is cumbersome but manageable.
“Bad data consists of missing data, outliers, skewed value distributions, redundancy of information, and features not well explicated.” (John Paul Mueller, Luca Massaron)
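That traditional exclude-and-start-over approach can be sketched for two of the bad-data kinds the quote lists: missing values and outliers. The toy readings and the two-standard-deviation cutoff are illustrative assumptions; real cleaning rules depend on the data and the domain.

```python
from statistics import mean, stdev

# Toy data set, e.g. room-temperature readings with one missing value
# and one implausible spike (values are invented for illustration).
raw = [21.3, 20.8, None, 22.1, 19.9, 87.0, 21.5]

# Step 1: exclude missing data.
present = [x for x in raw if x is not None]

# Step 2: exclude outliers more than two standard deviations from the
# mean (a loose cutoff, chosen here because the sample is tiny).
mu, sigma = mean(present), stdev(present)
clean = [x for x in present if abs(x - mu) <= 2 * sigma]

print(clean)  # [21.3, 20.8, 22.1, 19.9, 21.5]
```

On a small table this is easy; the article’s point is that once the data has been absorbed into a trained model, there is no equivalent of this simple filter-and-rerun step.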
But such data cleaning cannot be done at scale. With Big Data and Machine Learning, bad data cannot be detected or pulled out of the system that easily. Artificial Intelligence techniques draw conclusions from large masses of data, which may or may not include garbage data. At a certain point, it becomes impossible to determine on which data elements these predictions are based. In this way, Artificial Intelligence becomes a black-box technology: you don’t know where it draws its conclusions from. “Unlearning” something is nearly impossible: remove one part and the entire model ceases to work. Just like a brain! When bad data is detected, you are usually required to restart the whole learning process from the beginning, which is time- and cost-intensive.
Questions to be asked
So how do we establish whether our data is of good quality? This is called Veracity, another V of Big Data. Veracity refers to the trustworthiness of the data. Without digging too deep into the subject, there are some basic questions we can ask about the data we’re going to use.
Why & Who: Data from a reputable source typically implies better accuracy than a random online poll. Data is sometimes collected, or even fabricated, to serve an agenda. We should establish the credibility of the data source and the purpose for which the data was collected. Dare to ask whether the data is biased because it was gathered to prove a political, business, ethnic, or ideological point of view.
Where: Almost all data is geographically or culturally biased. Consumer data collected in the United States may not be representative of consumers in Asia. And the cultural differences within Asia are also huge. When we objectively measure data, like temperatures, the interpretation of that data can differ: what is classified as cold or warm? And of course, temperature readings from Paris are not very useful for weather predictions in Mumbai.
When: Validity is also one of the V’s of Big Data. Most data is linked to time in some way: it might be a time series, or a snapshot from a specific period. Out-of-date data should be omitted. But when an AI system is used over a longer time span, data can become old or obsolete during the process. “Machine unlearning” will be needed to get rid of data that is no longer valid.
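Omitting out-of-date data before (re)training can be as simple as a rolling validity window. This is a minimal sketch; the 90-day window and the record layout are assumptions for illustration, and real retention rules depend on the domain.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=90)  # assumed validity window for this sketch

def still_valid(record_time: datetime, now: datetime) -> bool:
    """A record is valid if it falls inside the rolling window."""
    return now - record_time <= MAX_AGE

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    (datetime(2024, 5, 20, tzinfo=timezone.utc), "fresh"),   # 12 days old
    (datetime(2023, 1, 5, tzinfo=timezone.utc), "stale"),    # well past the window
]
kept = [label for ts, label in records if still_valid(ts, now)]
print(kept)  # ['fresh']
```

Filtering the input this way is the easy half; removing the influence of stale data already baked into a trained model is the “machine unlearning” problem the paragraph refers to.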
How: It’s worth getting to know the gist of how the data of interest was collected. Domain knowledge is essential here. For instance, when collecting consumer data, we can fall back on the decades-old methods of market research. Answers to an ill-constructed questionnaire will certainly yield poor-quality data.
What: Ultimately, you want to know what your data is about, but before you can do that, you should know what surrounds the numbers. Sometimes humans can detect bad data or outliers because the data looks illogical. We should investigate where strange-looking data comes from. But Artificial Intelligence doesn’t have this form of common sense; it tends to take all data as true.
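Since the model itself will not question implausible values, that human common sense has to be encoded as explicit domain rules before the data goes in. The plausibility bounds below are an illustrative assumption for air-temperature data in °C (roughly the recorded extremes on Earth), not a general standard.

```python
PLAUSIBLE_RANGE = (-90.0, 60.0)  # assumed plausible air temperatures, °C

def looks_illogical(temperature_c: float) -> bool:
    """Flag a reading a domain expert would immediately distrust."""
    low, high = PLAUSIBLE_RANGE
    return not (low <= temperature_c <= high)

# 999.9 is the kind of sensor error value a human spots instantly.
readings = [18.4, 21.0, 999.9, -3.2]
suspect = [t for t in readings if looks_illogical(t)]
print(suspect)  # [999.9]
```

Writing such rules requires exactly the domain expertise the article argues for: only someone who knows the subject can say which values are illogical.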
Taking care of your data
To answer these questions in an orderly manner, you need to organize research into the quality of your data. Data quality procedures should of course be in place and used, but more is to be done. Establishing the veracity of data is part of the process of data (and content) curation.
“Curation is the end-to-end process of creating good data through the identification and formation of resources with long-term value. (…) The goal of data curation in the enterprise is twofold: to ensure compliance and that data can be retrieved for future research or reuse.” (Mary Ann Richardson)
I strongly believe data curation should be expanded beyond the description above. Like a museum curator who establishes whether an exhibit is genuine or fake, a data curator should do the same for the data. This requires not only data analytics skills, but also domain expertise about the subject from which the data stems.
When you want to use data for Machine Learning and Artificial Intelligence, you have to go beyond the standard criteria of data quality. Yes, these criteria, like availability, usability and reliability, are still valid. But we should also take veracity into account: is the data truthful? And you need methods and roles to establish the truthfulness of your data.
Special thanks to my colleague Marijn Markus for his valuable input.