What’s the difference between a Kinder egg and metadata?

Capgemini

22 Apr 2022

For one, you start off by “guessing” whether you will get a figurine or just an assembly toy, but for the other, “guessing” cannot give you a clear view. Let’s explain this better.

We’ve all shaken Kinder eggs to “guess” what’s inside and raise our chance of getting one of the figurines. I know of people who have even put them on the vegetable scale to increase their odds (many years ago, the figurines weighed more than the assembly toys!). But in the end, it all boiled down to guesswork.

But when we deal with data, we don’t want to guess. We want to get the clearest view of “what’s inside.” And with the Data Mesh approach, in which we create Data Products to reuse existing data, this is more important than ever. But how do we do that?

First, let’s take another look at what a Data Product is. In the last post, I already sketched the Data Product. It has these main elements:

Ideally, everything is code-based, including the infrastructure setup (infrastructure as code) and the data transformation steps.

The required infrastructure for that data product is connected to it. Depending on the type and usage of the product, it can vary.

Usually there are two main types:

The input and output ports are the connectors to the data sources (input) and to the data consumers (output). Depending on the type of data product, the data sources, and the use cases, these can be regular API endpoints such as JDBC, S3, Kafka, MQTT, HTTPS, and so on.

But the more important port (at least for this post) is the Control Port. Why? Because through this specific endpoint we get information from, and control, the defined data product.

So let’s take an example:

As a business user, in order to reuse existing data, I want to get the following information from a Data Product within my organization:

In what domain is that Data Product?
Who has created it?
What information / fields are available?
When was it last updated?
What is its data quality?
What data sources is this Data Product connected to?
What connection options do I have to work with the Data Product?

This is where the Control Port comes in, as it provides a consistent view of the Data Product.

Example endpoints are:

metadata {name, owner, source systems}
lineage information
address of ports {input, output}
schema {input, output}, with a link to the company ontology
audit
metrics, e.g.:
    last updated timestamp
    loading frequency
    loading time (how long did it take to load data)
    data volume
    data quality metrics {e.g. # of unmapped records}
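
To make this list more tangible, here is a minimal Python sketch of what a Control Port response could look like. It is only an illustration: the class names, field names, example product, and port addresses are all assumptions made for this sketch, not a schema prescribed by Data Mesh or by any specific tool.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Metadata:
    name: str
    owner: str
    domain: str
    source_systems: list[str]


@dataclass
class Metrics:
    last_updated: str            # ISO 8601 timestamp of the last load
    loading_frequency: str       # e.g. "daily"
    loading_time_seconds: float  # how long the last load took
    data_volume_rows: int
    unmapped_records: int        # one possible data quality metric


@dataclass
class ControlPortDescriptor:
    """Hypothetical, consistent view of one Data Product."""
    metadata: Metadata
    input_ports: list[str]
    output_ports: list[str]
    output_schema: dict[str, str]  # field name -> type
    metrics: Metrics

    def to_json(self) -> str:
        # asdict() converts nested dataclasses into plain dictionaries.
        return json.dumps(asdict(self), indent=2)


# Made-up example values, purely for illustration.
descriptor = ControlPortDescriptor(
    metadata=Metadata(
        name="orders",
        owner="sales-data-team",
        domain="sales",
        source_systems=["crm", "erp"],
    ),
    input_ports=["kafka://example-broker/orders-raw"],
    output_ports=["s3://example-bucket/orders/"],
    output_schema={"order_id": "string", "amount": "decimal"},
    metrics=Metrics(
        last_updated="2022-04-21T06:00:00+00:00",
        loading_frequency="daily",
        loading_time_seconds=420.0,
        data_volume_rows=48000,
        unmapped_records=12,
    ),
)

print(descriptor.to_json())
```

With a structure like this, each of the business user's questions above maps to one field: the domain and creator to the metadata, the available fields to the schema, and the last update and data quality to the metrics.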

As you can see in the image above, the Control Port collects the information from a wide variety of sources like the Data Catalog, the Data Quality Tool, the Data Lineage Tool, or the DataOps tool of the Platform Foundation. It acts like an endpoint federation, since, for example, the Data Catalog tool of choice usually provides:

a nice integration to publish the data product into the catalog
an API to get information out of the catalog

The same applies (or should be applied) to the other components of the Data Trust layer.
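
The federation idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the three `fetch_from_*` functions are hypothetical stand-ins for calls to the real catalog, quality, and lineage tool APIs, and the `ControlPort` class is an invented name, not part of any product.

```python
from typing import Any, Callable, Dict

Backend = Callable[[str], Dict[str, Any]]


# Hypothetical stand-ins for the APIs of the Data Trust layer tools.
def fetch_from_catalog(product: str) -> Dict[str, Any]:
    return {"name": product, "owner": "sales-data-team", "domain": "sales"}


def fetch_from_quality_tool(product: str) -> Dict[str, Any]:
    return {"unmapped_records": 12}


def fetch_from_lineage_tool(product: str) -> Dict[str, Any]:
    return {"upstream": ["crm", "erp"]}


class ControlPort:
    """Federates several backend tools into one view of a Data Product."""

    def __init__(self, product: str, backends: Dict[str, Backend]) -> None:
        self._product = product
        self._backends = backends

    def describe(self) -> Dict[str, Any]:
        view: Dict[str, Any] = {}
        for section, fetch in self._backends.items():
            try:
                view[section] = fetch(self._product)
            except Exception as exc:
                # One unavailable tool should not hide the other sections.
                view[section] = {"error": str(exc)}
        return view


control_port = ControlPort(
    "orders",
    {
        "metadata": fetch_from_catalog,
        "quality": fetch_from_quality_tool,
        "lineage": fetch_from_lineage_tool,
    },
)
print(control_port.describe())
```

The consumer only ever talks to the Control Port; which tool answers which part of the question stays an implementation detail of the platform.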

So let’s take the metrics as an example, as these are quite commonly required by the business in order to evaluate further usage of an existing Data Product. When the business has easy access to the metrics of a Data Product, it can easily evaluate how much value this Data Product brings to its use case(s). The example metrics mentioned above give a good overview of typical demands and have their basis in the SRE Workbook.
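
As a small illustration of how such metrics could be evaluated automatically, the following Python sketch checks freshness and the share of unmapped records. The function names, the tolerance factor, and the example values are assumptions made for this sketch; they are not taken from the SRE Workbook or from any tool.

```python
from datetime import datetime, timedelta, timezone


def is_fresh(
    last_updated: datetime,
    loading_frequency: timedelta,
    now: datetime,
    tolerance: float = 1.5,
) -> bool:
    """True if the last load is no older than the expected interval.

    The tolerance factor (an assumption) allows for loads that run late.
    """
    return now - last_updated <= loading_frequency * tolerance


def unmapped_ratio(unmapped_records: int, data_volume: int) -> float:
    """Share of records that could not be mapped (0.0 for an empty product)."""
    if data_volume == 0:
        return 0.0
    return unmapped_records / data_volume


# Made-up example values, purely for illustration.
now = datetime(2022, 4, 22, 12, 0, tzinfo=timezone.utc)
last_updated = datetime(2022, 4, 21, 6, 0, tzinfo=timezone.utc)

fresh = is_fresh(last_updated, timedelta(days=1), now)
ratio = unmapped_ratio(unmapped_records=12, data_volume=48000)
print(fresh, ratio)
```

Here the last load is 30 hours old, which is within the 36 hours allowed for a daily load with a tolerance of 1.5, so the product counts as fresh. Which thresholds are acceptable depends on the use case.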

But how does the user get this information? Do they have to query the API?

Well, this information has to be available as API endpoints to allow automatic processing. But for business users there has to be a proper front-end that incorporates these endpoints and allows searching, along with clear visual exploration of and interaction with the underlying information. This can be a custom front-end communicating with the underlying API or an off-the-shelf tool. (Due to the broad landscape of these tools on the market, a separate post is required to capture that topic.)

So, if we go back to the initial question: Yes, most of us have found ways to increase the chance of getting a rare mini figure as the surprise in a Kinder egg. And we were creative with it. But nowadays, when we talk about finding the right Data Product(s) for our use cases, no shaking or scale is needed. In a proper Data Mesh based architecture, all the relevant information is available at the Control Port and consumable via an API AND a proper front-end.

If you’d like to explore this topic or have any queries, please reach out.

About the speaker:

Arne Roßmann
Head of AI & Data Engineering Germany,
Chief Architect – Insights & Data at Capgemini
