
Key tenets of enterprise-scale data mastery through data mesh architecture

Aniruddha Khadkikar / Magnus Carlsson
8 Mar 2023

Over the years we have seen an evolution of approaches towards enterprise data mastery, driven primarily by the ambition to make informed, data-driven decisions. Early architectures applied what we now commonly refer to as the data warehouse pattern, wherein data was extracted, transformed, and loaded into a data warehouse. The warehouse in turn fed downstream data stores that served subsets of data and applied both security and business rules to make the data ready for consumption by the business. The pattern, however, led to transactional pipelines in which transformation logic was applied before ingestion for every integration, resulting in an explosion of complex, interwoven pipelines. Complex logic was applied to both sets of pipelines: one for populating the warehouse and the other for feeding data to downstream data marts. Problems were further amplified as technical teams struggled to understand data and build the warehouses without active business participation and ownership, which in turn hurt data quality.
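
To make the transform-before-load characteristic concrete, here is a minimal sketch of such an ETL step. The file names, schema, and business rule are illustrative assumptions, not taken from any real system:

```python
# Minimal ETL sketch: transformation is applied *before* loading into the
# warehouse, so every new integration needs its own bespoke pipeline.
import sqlite3
import pandas as pd

# Extract from a hypothetical source export (file name and columns are illustrative).
orders = pd.read_csv("orders_export.csv")

# Transform up front: business rules are baked into the pipeline itself.
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["net_amount"] = orders["gross_amount"] - orders["discount"]

# Load the already-conformed data into a warehouse table.
with sqlite3.connect("warehouse.db") as conn:
    orders.to_sql("fact_orders", conn, if_exists="append", index=False)
```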

The next phase of architectural evolution saw the emergence of lakes, a metaphor for the central accumulation of data fed by multiple data streams. In the first phase of lake architectures, a simple pattern of wide-scale ingestion was seen as a robust foundation for enabling data analysis on demand. The pattern, termed a data lake, was further supported by the emergence of distributed compute and storage in the form of Hadoop and its ecosystem, and by the development of Apache Spark, work on which was initiated at Berkeley's AMPLab. The pattern advocated lazy transformation, popularly termed ELT (extract-load-transform).
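
The contrast with ETL is the ordering: load raw data first, transform later. A minimal PySpark sketch of that ELT flow might look as follows; the paths and column names are assumptions for illustration:

```python
# Minimal ELT sketch with PySpark: land the raw data first, defer
# transformation until a consumer actually needs it ("lazy" transformation).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-sketch").getOrCreate()

# Extract + Load: ingest source files into the lake untouched.
raw = spark.read.option("header", True).csv("landing/orders/*.csv")
raw.write.mode("append").parquet("lake/raw/orders")

# Transform (later, on demand): schema and business rules are applied at read time.
orders = spark.read.parquet("lake/raw/orders")
cleaned = (orders
           .withColumn("order_date", F.to_date("order_date"))
           .filter(F.col("gross_amount").isNotNull()))
```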

The benefits of this approach were not readily realized, as it soon became clear that schema-on-read and discoverability of data by users were not easy. Although data accumulated rapidly, the limited usage of this ingested data hindered organizations on their journey to become data-driven. Data mastery, therefore, remained elusive.

To overcome the shortcomings of the data lake pattern, Databricks, the company that sprang from the initial research around Apache Spark, advocated a combination of the data lake and data warehouse patterns and termed it a lakehouse (data lake + data warehouse). The pattern recommends a three-tier incremental enrichment of data, termed a medallion architecture (probably inspired by Olympic medals), in which each layer has a distinct purpose. The bronze layer is inspired by the data lake pattern and serves as a long-term store of raw data, whereas the silver and gold layers are analogous to the data warehouse and data marts of the data warehouse pattern.
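
A minimal sketch of that bronze/silver/gold promotion is shown below. The dataset, paths, and rules are illustrative assumptions; on Databricks these would typically be Delta tables rather than plain Parquet files:

```python
# Medallion sketch: the same dataset is promoted through bronze -> silver -> gold.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: raw, long-term store, loaded as-is (the data lake layer).
bronze = spark.read.json("landing/sensor_events/*.json")
bronze.write.mode("append").parquet("lake/bronze/sensor_events")

# Silver: cleaned and conformed (analogous to the data warehouse layer).
silver = (spark.read.parquet("lake/bronze/sensor_events")
          .dropDuplicates(["event_id"])
          .withColumn("event_time", F.to_timestamp("event_time")))
silver.write.mode("overwrite").parquet("lake/silver/sensor_events")

# Gold: aggregated, consumption-ready (analogous to a data mart).
gold = (silver.groupBy("device_id")
        .agg(F.avg("temperature").alias("avg_temperature")))
gold.write.mode("overwrite").parquet("lake/gold/device_temperature")
```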

The problem with the above approaches was that they were driven by technology and engineering perspectives, rather than treating information and domain models as first-class citizens. The active involvement of business in the construction of these data foundations was missing, leading to expensive projects, rework, and at times an innate distrust of the data served through these foundations. Influenced by the microservices approach to decomposing software architectures, and by the move from large, difficult-to-evolve monolithic systems towards distributed architectures, Dehghani (Data Mesh, O'Reilly, 2022) proposed an architecture pattern termed the data mesh.

A key point to remember is that the data mesh approach is a new way of building the same types of solutions as before. With earlier approaches, however, evolution was very difficult because the solution was essentially monolithic, with complex interdependencies that prevented extensibility and scale. The data mesh approach should therefore be seen as a natural progression in architectural evolution, as evidenced in other software systems.

The most significant change here is the move towards thinking information-first rather than about topologies, transformations, and data pipelines. This represents a tectonic shift in mindset, and it is what adopting the data mesh pattern requires. It also requires organizations to embolden themselves to change and to rethink objectively. Putting domain modeling and information at the core enables a key missing block: the missing piece in the jigsaw puzzle of a data-driven enterprise.

The value of the data mesh architecture pattern is that it enables data activation and lays a foundation for citizen data nerds (analysts, scientists, executives, managers, employees). Reaching this data mastery at enterprise scale requires the adoption of a few tenets; without them, one should not expect the benefits of the data mesh pattern to be realized. These core tenets are a good way to begin the journey without getting bogged down by organizational change and the different roles needed. Adopting the three tenets below lays a solid foundation for the other data mesh principles, which revolve around processes and ways of working.

Tenet 1: Treat everything as a product

A fundamental change is to unlearn old approaches and adopt a new one: treating data as a product and building a portfolio of analytics products. The data product is a logical construct, but it necessitates a new mindset in which conversations are less about technological integrations and more about products. This also requires business participation. Rather than thinking only in terms of use cases (for machine learning, for example), one needs to build a portfolio of analytics products anchored to business capabilities. Without this product mindset and capability-map anchoring, the data-driven ambitions of an enterprise cannot scale in a structured way. The product mindset should further extend to frameworks that act as accelerators for agile value delivery.
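
One way to make the product mindset tangible is to give each data product an explicit descriptor: an owner, a capability anchor, and service levels. The sketch below is hypothetical; every field name is an illustrative assumption, not a standard:

```python
# Hypothetical sketch of a data product descriptor: the product, not the
# pipeline, becomes the unit of conversation. All field names are illustrative.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str                      # product name, owned by a domain team
    business_capability: str       # anchor in the enterprise capability map
    owner: str                     # accountable business owner, not just IT
    output_ports: list = field(default_factory=list)  # how consumers access it
    sla_freshness_hours: int = 24  # explicit, product-style service level

orders_product = DataProduct(
    name="customer-orders",
    business_capability="Order Management",
    owner="sales-domain-team",
    output_ports=["lake/gold/customer_orders", "api/v1/customer-orders"],
)
```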

Tenet 2: Develop platform thinking for enterprise-scale data mesh

The self-serve data platform covers not just solution components delivered as capabilities, but also central management of organizational policies, security frameworks, information-classification frameworks, data governance, and DevOps services. The self-serve data platform is delivered as a service to lines of business to help in the construction of the data mesh. Also in scope of platform thinking are enterprise-wide governance, policies, and data security, which guide lines of business on building the data mesh in a compliant way.
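
A sketch of what "delivered as a service" could mean in practice follows: a domain team requests a workspace through a self-serve call, and central policies come baked in. Everything here is a hypothetical illustration; the function, policy names, and return shape are assumptions:

```python
# Hypothetical self-serve provisioning sketch: central policies (security,
# classification, governance) are applied automatically to every request.
CENTRAL_POLICIES = {
    "encryption_at_rest": True,
    "classification_levels": ["public", "internal", "confidential"],
    "retention_days_default": 365,
}

def provision_data_product_workspace(domain: str, product: str,
                                     classification: str) -> dict:
    """Create a workspace for a domain team with enterprise policies baked in."""
    if classification not in CENTRAL_POLICIES["classification_levels"]:
        raise ValueError(f"Unknown classification: {classification}")
    return {
        "storage_path": f"lake/{domain}/{product}",
        "policies": CENTRAL_POLICIES,
        "classification": classification,
    }

workspace = provision_data_product_workspace("sales", "customer-orders", "internal")
```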

Tenet 3: Domain models together with semantics using graphs and ontologies

Key to data mesh success is an improved understanding of data to enable self-service analysis. Data activation will largely determine whether citizen data nerds (analysts, scientists, executives) can reach the data they need to foster a data-driven culture. Domain modeling is the first step, not the final one as in earlier approaches, where the semantic layer came last. While business glossaries are useful for describing terms and their definitions, they fall short in helping with data activation because they document neither connectedness nor a framework for reasoning capabilities. Semantics is better addressed through the introduction of ontologies and knowledge graphs (Khadkikar & Engels, 2022) that leverage industry standards such as FIBO for financial services, IEC 61400-25 for wind power plants, SAREF for smart applications, and ISO 15926 for industrial automation systems.
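
The sketch below illustrates the difference between a glossary and a graph: terms are not merely defined but connected, so they can be traversed and queried. It uses the rdflib library; the namespace and triples are illustrative, not drawn from any of the standards named above:

```python
# Minimal knowledge-graph sketch with rdflib: relationships between terms are
# explicit, which a flat business glossary cannot express.
from rdflib import Graph, Namespace, RDF, RDFS, Literal

EX = Namespace("http://example.org/enterprise#")
g = Graph()

# Classes and a relationship connecting them.
g.add((EX.Customer, RDF.type, RDFS.Class))
g.add((EX.Order, RDF.type, RDFS.Class))
g.add((EX.placedBy, RDFS.domain, EX.Order))
g.add((EX.placedBy, RDFS.range, EX.Customer))

# Instance data linked through the ontology.
g.add((EX.order42, RDF.type, EX.Order))
g.add((EX.order42, EX.placedBy, EX.alice))
g.add((EX.alice, RDFS.label, Literal("Alice")))

# SPARQL query over the connections: which customers placed orders?
results = g.query("""
    PREFIX ex: <http://example.org/enterprise#>
    SELECT ?customer WHERE { ?order ex:placedBy ?customer . }
""")
for row in results:
    print(row.customer)
```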

Early adoption of these tenets sets the data mesh transformation journey in the right direction. As in any change management exercise, the largest obstacle is changing mental models and convictions, and this is equally true of re-orienting oneself towards the data mesh architecture approach.

Author

Aniruddha Khadkikar

PhD & Senior Architect, CoE – Insights and Data, Sweden.
Enterprises continue to struggle to get value out of their data using analytics. Aniruddha helps build robust data and analytics foundations, employing principles of data mesh architecture and combining them with the power of knowledge graphs for improved understandability and explainability of data: the necessary ingredients for data activation.

Magnus Carlsson

VP & Head of CoE – Insights and Data, Sweden, Capgemini
All businesses and organizations can be run smarter and more efficiently with the help of data, analytics, and AI. Many of the most pressing challenges we face today can be solved using data. Magnus is passionate about solving real-world problems and developing new businesses based on data and the latest technology.
