In my opinion, the fundamental flaw in Data Quality is that we, the users, are not specific enough. Too often we assume that what we think is right is right. Unfortunately, quality is subjective. And confusion about meaning doesn’t make it any easier.
I have to get myself a new car. Economy dictates that after its first four years a car needs to be traded in. So I have to go through a selection process and find myself the best quality car.
For me, my current car ticks all my boxes: safe, reliable and comfortable. The price is very good and I cannot think of any accessories I would have to add. Quality!
Friends and co-workers have a different opinion: it’s a boring car. No style. Bad lines. No pedigree. Not fast enough. Not cool.
Other friends will add: no room for 3 children, their toys, the cribs and no DVD player.
Clearly, their idea of what defines quality in a car is different from mine.
Well, as long as we don’t have to share our cars, we’re all fine and we’re all happy.
With data, it’s similar. Even though the company creates only one set of data, we all find our little ways of picking up that data and then applying our own tweaks to it to make it ‘ours’. We move it into different databases, but before we do that, we twist the data to fit its next role. Consequently, we end up with loads of spreadsheets and Access databases.
And that is bad.
So why doesn’t the company data fit all our needs? Why is it necessary for so many people to derive their own sets? Simple: they have different needs. Different processes need data in a different manner: filters, aggregations and classifications, all tuned for specific processes. Like with the cars: young people need to look cool, young parents need to carry loads, and me, I just need to get to work.
It seems strange, though. Why, for example, is a simple attribute such as car colour a point of heated discussion between two departments? Let’s have a look: my car of choice comes in a variety of colours, two of which are Aurora Blue Mica and Stormy Blue Mica. In other words: Blue. Which is exactly the point of discussion. There are literally tens of thousands of different colour descriptions in the automotive world. A risk application would not be able to derive any useful statistics out of that. It would need about 10 values: the rainbow colours, white, black and grey. You can make nice pie-charts with 10 values. But Purchase would need to be very specific: which kind of Blue do we want? You don’t want the dealer to deliver Aurora when you really wanted Stormy (or dolphin, or ocean, or pacific, or …).
So here we have a conflict of interest: both departments want Colour but they both have quite different expectations of what Colour is and different expectations of whether the field Colour is populated correctly.
On closer examination we see that the two departments are actually looking for two different kinds of car colour description: the actual car colour and the car base-colour. And once you’ve discovered that you’re looking for two different things, life becomes much easier.
Here’s another one: the price of the car. For me: bottom line number: that’s the price of my car. What do I have to pay? But in actual fact, there are many things that make up the price: catalogue price, VAT, registration tax, delivery fee, discount, options, accessories, you name it. And each and every department will have its own take on what the definition of the price of the car entails. A friend of mine even thinks the most important part of the price is the discount.
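To make the point concrete, here is a minimal sketch of how one car can carry several legitimate "prices". The component names come from the list above; the class `CarPrice`, the two definitions `list_price` and `bottom_line`, and all the figures are made up for illustration and are not anybody's official definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarPrice:
    """Price components named in the text; amounts in whole currency units."""
    catalogue_price: int
    options: int
    accessories: int
    discount: int
    vat: int
    registration_tax: int
    delivery_fee: int

    def list_price(self) -> int:
        # Hypothetical "catalogue" view: what the brochure says, before any deal.
        return self.catalogue_price + self.options + self.accessories

    def bottom_line(self) -> int:
        # The buyer's view: everything that has to be paid, after the discount.
        return (self.list_price() - self.discount + self.vat
                + self.registration_tax + self.delivery_fee)


# Made-up figures, purely for illustration.
price = CarPrice(catalogue_price=20000, options=1500, accessories=500,
                 discount=1000, vat=4410, registration_tax=600, delivery_fee=400)

print(price.list_price())   # 22000
print(price.bottom_line())  # 26410
print(price.discount)       # 1000, the only number my friend cares about
```

Three people asking for "the price" of this one car could each get a different, perfectly correct number. Giving each definition its own name is what stops the argument.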
All too often we do not take the time to discuss the meaning we apply when we say things like price, status, contract, service. We all stick to our own definition and rarely share it with others. It seems we would rather spend time discussing why we have the right values and they do not.
There is a name for terms like these: homonym. Same spelling, same sound but different meaning. Homonyms are more dangerous in the world of data than synonyms. Yet they are hardly ever recognized, because their meanings are often so much alike that the differences are hard to spot.
An example of a 3-fold homonym in a single sentence:
Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Or to you and me: Bison from Buffalo (NY) who are intimidated by other bison from Buffalo also happen to intimidate bison from Buffalo. So, we have Buffalo as a city, buffalo as an animal and buffalo as another word for to intimidate, to confuse, to bully.
Although this may seem a far-fetched example, within companies we do tend to apply different meanings to our counts of Customers, our totals for Invested amounts and our numbers of incidents. There are many standards around, such as the IAS, and still we manage to find different meanings.
To break through the cycle, as a first step towards Data Quality, we need to sit down and clearly define what we mean by each and every data item we use in our processes. Even the ones we think are clear by their name. This is tedious and will certainly create resistance but it is necessary. Do not take anything for granted. Any external standard (such as IAS) may help, but be sure that the users agree to the meaning or inform you of their version.
With the proper definitions, you can start finding out if, how and when the expected data can be delivered by the applications. For car colour, for instance, you can add a field to the application and have the users fill it in, or you can add derivation rules to the original field.
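A derivation rule for the car colour could look something like the sketch below. It assumes the base-colour list is the ten values mentioned earlier (the rainbow colours plus white, black and grey) and that the actual colour is a free-text description. The function `derive_base_colour` and the `ALIASES` table are hypothetical, and a real rule set would need to be agreed with the users.

```python
from typing import Optional

# Assumed base-colour vocabulary: the rainbow colours plus white, black and grey.
BASE_COLOURS = ("red", "orange", "yellow", "green", "blue",
                "indigo", "violet", "white", "black", "grey")

# Hypothetical lookup for marketing names that do not contain a base colour.
ALIASES = {"dolphin": "blue", "ocean": "blue", "pacific": "blue"}


def derive_base_colour(actual_colour: str) -> Optional[str]:
    """Derive the base colour from the actual colour description, if possible."""
    words = actual_colour.lower().split()
    # Rule 1: the description literally contains a base colour.
    for word in words:
        if word in BASE_COLOURS:
            return word
    # Rule 2: fall back to the alias table.
    for word in words:
        if word in ALIASES:
            return ALIASES[word]
    # No rule applies: leave it empty so someone can fill it in by hand.
    return None


print(derive_base_colour("Aurora Blue Mica"))  # blue
print(derive_base_colour("Stormy Blue Mica"))  # blue
print(derive_base_colour("Pacific"))           # blue
print(derive_base_colour("Mystery Pearl"))     # None
```

Purchase keeps its exact description, the risk application gets its ten values, and the descriptions that fall through the rules show exactly where the definition still needs work.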
Most importantly, though, you will have created a proper understanding of the data and their usage in the company’s processes.
So before we start measuring the quality of the data, let’s see what it is supposed to be.
Now, I still need to select a colour for my car. I think I’ll go for blue. Either one. I don’t really care. I will be sitting inside and won’t see the colour anyway. Colour is way down on my list of Quality items.