The missing links in your data platform

Capgemini
March 30, 2022

Data platform (data lake/data warehouse) initiatives have been around for quite a long time now. However, they have not always lived up to the promise of transforming an enterprise into a data-driven organization. The premise of data platforms has been to build a data repository of an organization’s digital footprint and to gather insights by asking questions of the data that were previously impossible to answer. The problem enterprises face with this approach is the inability to understand the connectedness of data, its quality, its history, and the business rules associated with different digital footprints.

The data lake/data warehouse approach has recently been looked at afresh with an alternative approach founded on domain-driven design principles, termed the data mesh (Dehghani, 2019). The approach identifies a core problem in the data engineering and data consumption pipeline: the inability of technology-focused teams to understand data domains, and the need to treat data as a product. The knowledge that domain experts hold is both explicit and implicit. The general non-availability of this knowledge to the organization hampers the goal of data democratization, a fundamental pillar of a data-driven enterprise. Of course, data mesh thinking additionally brings to the fore both organizational aspects and data product thinking in data platform architectures and ways of working.

So, how can we try to solve this problem? Enter the semantic data mesh.

The semantic data mesh augments the capabilities and promise of a data platform with an envelope of context and meaning. It provides the missing links in the data platform and extends the data mesh pattern to cover both the data continuum and data quantisation: the knowledge of how different data elements (objects, domains, or ensembles) relate to each other. As a solution, the semantic data mesh presents an architecture that augments the data platform (built on-premises or in the cloud) with knowledge graphs, through the construction of ontologies. But let’s take a step back and understand all these new terms.

In the world of business intelligence and data warehousing, we have grown quite accustomed to the established principles of dimensional modelling, cubes, or the data vault. Terms such as knowledge graph and ontology may be quite unfamiliar to a unit dealing with conventional approaches to analytics.

The term knowledge graph was introduced by Google in a blog post in 2012 (Singhal, 2012), with a corresponding patent filed earlier, in 2005. For Google, the knowledge graph was a way to enable searching not for strings or words, but for things: words with meaning. As exemplified in the blog, the string ‘Taj Mahal’ may refer to a landmark, one of the seven wonders of the world, or a local restaurant where one might be looking to spend the evening.
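
To make the idea concrete, here is a minimal sketch of “things, not strings” in Python, using the open-source rdflib library (our choice of tooling for illustration, not Google’s): the same surface string labels several distinct nodes, each with its own type.

    # Illustrative only: the URIs and types below are invented for this sketch.
    from rdflib import Graph, Literal, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()

    for thing, kind in [
        (EX.TajMahal_Monument, EX.Landmark),      # the wonder of the world
        (EX.TajMahal_Restaurant, EX.Restaurant),  # the place to spend an evening
    ]:
        g.add((thing, RDF.type, kind))                    # what the thing is
        g.add((thing, RDFS.label, Literal("Taj Mahal")))  # the shared string

    # A string search finds every match; the graph keeps the things apart.
    things = {s for s, _, _ in g.triples((None, RDFS.label, Literal("Taj Mahal")))}
    print(len(things))  # -> 2 distinct things behind one string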

There is no formal definition of a knowledge graph, only a common understanding of what it represents: a way for machines to read and represent information about the world (Kejriwal et al., 2021). However, a good starting point for understanding knowledge graphs may be the definition from Stanford University, which states:

“A knowledge graph is a directed labeled graph in which the labels have well-defined meanings. A directed labeled graph consists of nodes, edges, and labels. Anything can act as a node, for example, people, company, computer, etc. An edge connects a pair of nodes and captures the relationship of interest between them, for example, friendship relationship between two people, customer relationship between a company and person, or a network connection between two computers. The labels capture the meaning of the relationship, for example, the friendship relationship between two people.”
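
A minimal sketch of this definition in Python, using the networkx library (an assumed choice; any graph library would serve), shows all three ingredients: nodes, edges, and labels.

    import networkx as nx

    # A directed labeled graph: anything can act as a node, and every edge
    # carries a label that captures the meaning of the relationship.
    kg = nx.DiGraph()
    kg.add_edge("Alice", "Bob", label="friendOf")             # person to person
    kg.add_edge("AcmeCorp", "Alice", label="hasCustomer")     # company to person
    kg.add_edge("laptop01", "server01", label="connectedTo")  # computer to computer

    for subject, obj, attrs in kg.edges(data=True):
        print(subject, attrs["label"], obj)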

The second term we use in this blog is semantic, which refers to meaning in language or logic. The term has been widely used in describing the semantic web (Berners-Lee et al., 2001), which uses distributed graphs with formal semantics for knowledge representation and reasoning on the internet. The purpose of the semantic web is similar to that stated for knowledge graphs: giving software agents the capability to retrieve content from the web and understand its meaning (as opposed to merely finding documents).

Knowledge graphs are powered by a different kind of data model, founded on graph data models (Angles & Gutierrez, 2008), as opposed to entity-relationship models founded on relational theory (a relation, mathematically, is a table); the sketch after the list below contrasts the two. Database technologies that support graph data models are commonly missing from the technology stack behind most data platform implementations, which usually comprises:

  • Distributed file stores e.g. HDFS, AWS S3, Azure Storage Account, Google Cloud Storage
  • Distributed computing frameworks e.g. Databricks, Apache Spark
  • Relational data stores for information marts
  • Data catalogs
  • Data pipeline solutions
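
To make the contrast concrete, here is a hedged sketch (the namespace and property names are invented for the example) of how a single relational row becomes a handful of triples in a graph store, queried with SPARQL rather than SQL.

    from rdflib import Graph, Literal, Namespace, RDF

    EX = Namespace("http://example.org/retail/")
    g = Graph()

    # Relational view, one row:  customers(id=42, name='Alice', segment='premium')
    # Graph view, three triples about one node:
    g.add((EX.customer42, RDF.type, EX.Customer))
    g.add((EX.customer42, EX.name, Literal("Alice")))
    g.add((EX.customer42, EX.segment, Literal("premium")))

    # SPARQL is to a graph store roughly what SQL is to a relational one.
    query = """
        PREFIX ex: <http://example.org/retail/>
        SELECT ?name
        WHERE { ?c a ex:Customer ; ex:name ?name ; ex:segment "premium" . }
    """
    for row in g.query(query):
        print(row.name)  # -> Alice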

As you can see, the stack lacks a key element: a way to capture the meaning of data elements within the data lake. This missing link is the enterprise knowledge graph that powers the semantic data mesh. Enterprise knowledge graphs build on the principles of open-world knowledge graphs but leverage an organization’s internal data assets, which originate in the applications that support different business processes (Hogan et al., 2021).

An enterprise knowledge graph (EKG) provides a robust method to capture knowledge across a range of information representations, including structural metadata, business metadata, business logic, data classifications, and information security. An EKG comprises the following conceptual layers (Galkin et al., 2016), sketched in code after the list:

  • A schema layer that provides a machine-readable description of knowledge
  • An instance layer that provides data elements for the elements of the schema
  • A metadata layer that provides information about information sources
  • A coherence layer that provides links to other knowledge sources
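
A minimal, illustrative sketch of these four layers, again in Python with rdflib (the URIs, the use of the PROV vocabulary, and the schema.org link are our assumptions, not a prescribed EKG vocabulary):

    from rdflib import Graph, Literal, Namespace, RDF, RDFS
    from rdflib.namespace import OWL, PROV

    EX = Namespace("http://example.org/ekg/")
    SCHEMA = Namespace("https://schema.org/")
    g = Graph()

    # 1. Schema layer: a machine-readable description of the knowledge.
    g.add((EX.Customer, RDF.type, RDFS.Class))
    g.add((EX.name, RDF.type, RDF.Property))
    g.add((EX.name, RDFS.domain, EX.Customer))

    # 2. Instance layer: data elements for the elements of the schema.
    g.add((EX.customer42, RDF.type, EX.Customer))
    g.add((EX.customer42, EX.name, Literal("Alice")))

    # 3. Metadata layer: information about the information source.
    g.add((EX.customer42, PROV.wasDerivedFrom, EX.crmSystem))

    # 4. Coherence layer: links out to other knowledge sources.
    g.add((EX.Customer, OWL.equivalentClass, SCHEMA.Person))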

For those more familiar with relational database logical models, the schema in a graph may be thought of as the table and column definitions, while data instances may be thought of as the rows within those tables. Understanding the nomenclatural differences between the relational and graph worlds is crucial for coherent communication between teams coming from these two disciplines.

Of course, the EKG goes beyond the scope of a semantic data mesh and analytical use cases. However, since enterprises across industry verticals have already embarked on constructing data lakes, either on-premises or in the cloud, it is easier to initiate construction of an EKG on top of the data lake than to take a logical, federated approach. We see the semantic data mesh as a good starting point for enterprises to start an EKG project and to use this experience to continue towards a more comprehensive EKG initiative.

The same approaches that have been used to link open data on the Web are equally applicable to linking enterprise data. The linked data concept refers to a set of practices for structured data connectedness on the Web (Bizer et al., 2009). In the linked open data domain, the same semantic web principles that were used for the web of documents were applied to data. Recent research has shown that it is possible to extend the same approach within the enterprise, leading to the development of ideas around linked enterprise data (Novak & Tjoa, 2019). The business value linked enterprise data generates is an efficient data management infrastructure that allows different stakeholders to find information faster.
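
As a small, hedged illustration (the internal URIs and the property are invented; the DBpedia URI is a real linked open data resource), linking an internal entity to the open web can be as simple as adding one triple:

    from rdflib import Graph, Namespace, URIRef

    ACME = Namespace("http://data.acme.example/asset/")
    g = Graph()

    g.add((
        ACME.store_agra_001,                              # an internal thing
        ACME.nearLandmark,                                # an internal property
        URIRef("http://dbpedia.org/resource/Taj_Mahal"),  # an open-web thing
    ))
    # Anyone resolving the DBpedia URI gains everything the open web already
    # knows about the landmark, with no local duplication of that knowledge.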

Within the scope of a semantic data mesh, enterprises can reap immediate benefits: improved data discoverability, a lower dependency on domain experts (by giving data teams a better understanding of data and its lineage), and a platform for exploratory data analysis by business analysts through visual tools for graph mining (Ziegler et al., 2020). Increasingly, the knowledge graph and its underlying graph databases are seen as fundamental components of the modern data platform and a foundation for AI solutions (Zou, 2020).

Authors


Aniruddha Khadkikar PhD
Managing Delivery Architect,
Insights & Data, Capgemini Sverige

Robert Engels PhD
CTO, Insights & Data,
Capgemini Norge