Health problems often warrant a visit to the doctor’s office, leading to tests to help the physician determine the issue. These tests measure results against a pre-determined range of acceptable numbers. If the results are higher or lower than what’s considered normal, the doctor prescribes treatment.
But it does not end there. Ongoing checkups ensure that the numbers remain in the safe range. The fact is that humans are intricate beings who require regular healthcare.
With the advent of big data and data lakes, enterprises have become extremely intricate from a data standpoint. A data lifecycle spans various verticals: sources, lakes, a centralized repository and targets. The "health" of its core, i.e., the quality of the data, determines the well-being of the entire data landscape. Quality is what turns raw data into usable information and knowledge. Unless the core is healthy, even abundant information becomes useless, especially for the downstream systems that consume that data. "Unhealthy" data can lead to inaccurate decisions, which negatively affect business growth.
Hence, a diagnosis and treatment plan for the data’s overall good health becomes a basic necessity.
Enterprise data is prone to error: questionable data may flow in from source systems, and flawed extract, transform and load (ETL) processes may corrupt it further. Real-time access to data via web-based technologies adds to the chaos. Given this multi-fold scope of possible data corruption, multiple user groups share responsibility for the diagnosis.
STEP 1: IDENTIFICATION AND SELECTION
Businesses need to assess the data quality and raise red flags in case of any issues. High-level dimensions for assessing data quality are accuracy, completeness, timeliness and consistency—the importance of each dimension being business-dependent. There is no one-size-fits-all model, so businesses play a primary role in identifying suspect data.
Next, the technology group takes the identified data and, using inputs from the business, selects the potentially unhealthy subset. It is important that the business and technology groups agree on the outcome of this step.
The driving force behind this activity is business knowledge, not technology (tools/utilities).
STEP 2: PROFILING AND ANALYSIS
Profiling is the process of examining data available in either source systems or the warehouse and then gathering statistics and information about it to arrive at potential data issues.
Profiling is executed on the identified and selected suspect data to run various kinds of analysis, such as value, statistical and frequency analysis.
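As a minimal sketch of what such profiling might look like in practice, the snippet below gathers null counts, distinct counts and value frequencies for one column. The records, column names and values are illustrative assumptions, not data from the article.

```python
from collections import Counter

# Hypothetical customer records; column names and values are invented
# purely for illustration.
rows = [
    {"customer_id": "C001", "country": "US", "age": 34},
    {"customer_id": "C002", "country": "US", "age": None},
    {"customer_id": "C003", "country": "us", "age": 29},
    {"customer_id": "C003", "country": "UK", "age": 51},
]

def profile_column(rows, column):
    """Gather basic statistics for one column: null count, distinct
    values and value frequencies -- raw material for spotting suspect data."""
    values = [r[column] for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "frequencies": Counter(non_null),
    }

stats = profile_column(rows, "country")
# Frequency analysis surfaces the inconsistent casing ("US" vs. "us"),
# and profiling customer_id would reveal the duplicate key "C003".
```

The profiling output alone is just statistics; as the next paragraph notes, it takes manual analysis by business and technical teams to decide which anomalies are genuine issues.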
Manual analysis of the profiling results is crucial to uncovering genuine data issues and their causes, and it needs to be conducted by both the business and technical teams.
Causes identified during diagnosis need to be studied in detail, their root causes traced, and potential fixes formulated and applied across verticals.
STEP 3: IDENTIFY AND FIX CAUSES
Consider data flowing in from different source systems that feed a single data store. At the end of the day, that nominal master data store is full of duplicates and bad master data overall. Profiling reveals the duplicates, and analysis traces them back to the source systems supplying such data.
One way to deal with this issue is to have a separate master data store managed in-house and a cross-reference that maps source information to such master data. Processes would need to be defined to get the incoming information correctly mapped to the master data in order to avoid duplicates and orphan records.
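A minimal sketch of that cross-reference approach is shown below, assuming each source system keys records by its own local ID. All names, the matching rule and the data are illustrative assumptions.

```python
master = {}    # master_id -> golden record (in-house master data store)
xref = {}      # (source_system, source_key) -> master_id (cross-reference)
_seen = {}     # match key -> master_id
_next_id = [1]

def match_key(record):
    # Deliberately simplistic matching rule for this sketch:
    # normalized name plus normalized email.
    return (record["name"].strip().lower(), record["email"].strip().lower())

def ingest(source_system, source_key, record):
    """Map an incoming source record to the in-house master store,
    creating a new master entry only when no match exists, so the
    store accumulates neither duplicates nor orphan records."""
    src = (source_system, source_key)
    if src in xref:                      # already cross-referenced
        return xref[src]
    key = match_key(record)
    master_id = _seen.get(key)
    if master_id is None:                # genuinely new entity
        master_id = f"M{_next_id[0]:04d}"
        _next_id[0] += 1
        master[master_id] = record
        _seen[key] = master_id
    xref[src] = master_id                # every source row maps somewhere
    return master_id

a = ingest("CRM", "42", {"name": "Ada Lovelace", "email": "ada@example.com"})
b = ingest("ERP", "X9", {"name": " ada lovelace ", "email": "ADA@example.com"})
# Both source rows resolve to one master record instead of duplicating it.
```

In a real implementation the matching rule, survivorship logic and stewardship workflow are where most of the people-and-process effort lands, which is the point of the next line.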
Such a fix involves people, processes and technology spanning verticals.
STEP 4: MONITOR AND CONTROL
Treatment mandates regular monitoring. With fixes in place, ongoing monitoring is imperative to verify that the fixes worked and to ensure that the data remains healthy.
Monitoring needs data quality rules to be defined and established. These rules are checks that validate various information scenarios against a pre-defined safe range and allow discovery of issues in an automated fashion. The scope of such rules is vast.
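One hedged sketch of such automated rules: each rule pairs a name with a predicate encoding a pre-defined safe range, and records failing any rule are flagged. Field names and thresholds below are invented for illustration.

```python
# Each rule is (name, predicate); a record fails a rule when the
# predicate returns False. Thresholds are illustrative assumptions.
rules = [
    ("age_in_safe_range", lambda r: r["age"] is None or 0 <= r["age"] <= 120),
    ("id_present",        lambda r: bool(r.get("customer_id"))),
    ("country_is_code",   lambda r: r["country"].isupper() and len(r["country"]) == 2),
]

def run_rules(rows, rules):
    """Validate every record against every rule, returning failures as
    (rule_name, record) pairs so issues surface automatically."""
    return [(name, r) for r in rows for name, check in rules if not check(r)]

rows = [
    {"customer_id": "C001", "country": "US", "age": 34},
    {"customer_id": "",     "country": "us", "age": 150},
]
failures = run_rules(rows, rules)
# The second record trips all three rules; the first passes cleanly.
```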
Data quality scorecards allow summarized and detailed results to be presented to a broad audience, namely the governance body, data stewards and end users.
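A scorecard can be as simple as rolling per-rule tallies up into pass percentages, keeping the raw counts available for drill-down. This sketch assumes rule results have already been tallied elsewhere as (passed, total) counts; the rule names and numbers are illustrative.

```python
# Hypothetical tallies: rule name -> (records passed, records checked).
results = {
    "age_in_safe_range": (980, 1000),
    "id_present":        (1000, 1000),
    "country_is_code":   (940, 1000),
}

def scorecard(results):
    """Summarize rule tallies as pass percentages for the governance
    body and data stewards; the raw counts remain for detailed review."""
    return {rule: round(100.0 * passed / total, 1)
            for rule, (passed, total) in results.items()}

summary = scorecard(results)
# e.g. {"age_in_safe_range": 98.0, "id_present": 100.0, "country_is_code": 94.0}
```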
Routine checkups help prevent health issues for humans, and they can do the same for an enterprise’s data.
Prevention is always better than needing a cure, and proactive measures help businesses reap great benefits from quality data in the long run.
Taher Borsadwala, living life one datum at a time (or at least trying to)…