In my previous blog posts (part I, part II, and part III) about the Open Business Data Lake Conceptual Framework (O-BDL), I introduced its background, concept, characteristics, and platform capabilities. In part IV, part V, and part VI, I compared a Data Lake with other data processing platforms, described how an O-BDL should work, and defined possible business scenarios that can make use of an O-BDL. In this part, I’ll describe the O-BDL in more detail with regard to the O-BDL data and data ingestion concepts.
In part V, I introduced the following process diagram (applying The ArchiMate® Enterprise Architecture Modeling Language) to describe how an O-BDL should work:
Based on this process, the O-BDL data and data ingestion concepts are described in more detail.
The Data concept within an O-BDL
The data concept within an O-BDL is shown in the following diagram:
As the diagram shows, data can be either structured, semi-structured, or unstructured (database tables, binary raw data from sensors, images and videos from cameras, tweets, documents/files, etc.). Metadata is data describing data and represents a key input for information governance, especially data quality, data confidentiality, and discovery. Examples of metadata are:
- source, target, date of ingestion, file size, and tags attached to a photo or video
- data classifications, rules, and policies, along with business glossary definitions.
An event is a specific structured data item that has a date and time of occurrence. An event can contain additional data items, especially semi-structured or unstructured data. A stream represents a flow, or succession, of ordered events. The order of recorded events in the stream does not necessarily reflect the order of occurrence in real life. When a stream is considered or processed, its events should be treated as immutable.
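The event and stream notions above could be sketched as follows (the class and field names are illustrative assumptions, not part of the O-BDL specification). A frozen dataclass models event immutability, and a simple list models an ordered stream whose recording order may differ from occurrence order:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass(frozen=True)  # frozen: events are immutable once recorded
class Event:
    occurred_at: str                       # date/time of occurrence (ISO 8601)
    recorded_at: str                       # date/time of recording in the stream
    payload: dict = field(default_factory=dict)  # optional semi-/unstructured data

# A stream is an ordered succession of recorded events; the recording
# order need not match the order of occurrence in real life.
stream = [
    Event(occurred_at="2024-01-01T10:00:05Z", recorded_at="2024-01-01T10:00:07Z"),
    Event(occurred_at="2024-01-01T10:00:01Z", recorded_at="2024-01-01T10:00:08Z"),
]

# When processing, a consumer may re-order by occurrence time:
by_occurrence = sorted(stream, key=lambda e: e.occurred_at)
```

Using `frozen=True` makes Python raise an error on any attempt to modify a recorded event, which mirrors the immutability requirement stated above.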
An O-BDL favors two types of analysis streams:
- Batch streams can consume very large data sets but can potentially take time (hours).
- Real-time streams can deliver insights very quickly (sub-second latency) but they can’t leverage all kinds of analytics.
Insights are data items that typically represent the added value of an O-BDL. They are produced by successive distillation steps that execute analytics in an O-BDL. Real-time insights are particular insights produced with very low latency by real-time analyses that consume events or streams of data, augmented by data stored in the O-BDL.
The Data Ingestion concept within an O-BDL
The data ingestion concept is shown in the following diagram:
Batch ingestion is the most common way of acquiring data within an O-BDL, i.e., of creating new data sets. It consists of acquiring a large number of data items that previously existed elsewhere in the IT landscape. Loading 30 years of customers’ orders in a few hours is an example of batch ingestion. Implementations of an O-BDL should be designed to execute multiple sustained batch ingestions at the same time.
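A minimal sketch of batch ingestion, under the assumption that the pre-existing data arrives as CSV (the function and chunk size are illustrative, not prescribed by the O-BDL framework). Yielding fixed-size chunks lets several sustained ingestions run side by side without holding an entire 30-year history in memory:

```python
import csv
import io

def batch_ingest(csv_text: str, chunk_size: int = 1000):
    """Acquire pre-existing records in bulk, yielding fixed-size chunks
    so that multiple sustained batch ingestions can run concurrently."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final, possibly partial, chunk
        yield chunk

# Example: three historical order records ingested in chunks of two.
orders = "order_id,amount\n1,10.5\n2,7.0\n3,3.2\n"
chunks = list(batch_ingest(orders, chunk_size=2))
```

In a real O-BDL implementation the source would be a database export or file drop rather than an in-memory string, but the chunked, streaming-read pattern is the same.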
Real-time ingestion is dedicated to processing streams or events, which are structured and generally small data items. An O-BDL is designed to execute multiple sustained real-time ingestions at high velocity (thousands of values/events per second).
Micro-batch ingestion implements a “bridge” between real-time and batch analyses. It turns streams of events into data sets that can be analyzed as historical data for very long timeframes.
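One common way to implement this bridge is tumbling windows: events are grouped into fixed time intervals, and each completed window becomes a data set for batch analysis. A minimal sketch, assuming events carry an epoch-seconds timestamp (the window size and function name are illustrative):

```python
from collections import defaultdict

def micro_batch(events, window_seconds=60):
    """Group (timestamp, value) events into fixed tumbling windows,
    turning a stream into data sets analyzable as historical data."""
    windows = defaultdict(list)
    for ts, value in events:
        window_start = ts - ts % window_seconds  # floor to window boundary
        windows[window_start].append(value)
    return dict(windows)

# Four events spread over roughly two minutes fall into three windows.
events = [(0, "a"), (30, "b"), (61, "c"), (125, "d")]
batches = micro_batch(events, window_seconds=60)
# batches: {0: ["a", "b"], 60: ["c"], 120: ["d"]}
```

Each resulting window can then be appended to long-term storage, where batch analyses consume it alongside data covering very long timeframes.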
The ingestion of metadata can be done in multiple ways, depending on the nature of the data and on whether the process is automated. The simplest way is to extract metadata from the data automatically, creating the metadata at the same time the data is ingested. In some cases, metadata extraction consists of several processing steps, some of which are performed asynchronously from the ingestion of the data itself, following a metadata enrichment process implemented as a distillation step. Metadata enrichment can also happen through the action components and real-time analysis.
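The simple, synchronous case above can be sketched as follows: technical metadata (source, ingestion date, size) is derived automatically at the moment of ingestion, leaving fields that asynchronous enrichment steps fill in later. The field names are illustrative assumptions, echoing the metadata examples given earlier:

```python
import hashlib
from datetime import datetime, timezone

def ingest_with_metadata(name: str, data: bytes, source: str) -> dict:
    """Derive technical metadata automatically while the data item is
    ingested; enrichment steps may add further fields asynchronously."""
    return {
        "name": name,
        "source": source,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(data),
        "checksum": hashlib.sha256(data).hexdigest(),
        "tags": [],            # filled later by enrichment (distillation) steps
        "classification": None  # e.g. confidentiality level, assigned later
    }

meta = ingest_with_metadata("orders.csv", b"order_id,amount\n", source="erp")
```

A separate, asynchronous enrichment job would then update `tags` and `classification` according to the governance rules and business glossary definitions mentioned earlier.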
In the next (eighth) blog post I’ll elaborate on the O-BDL data processing concept.