Big Data Architectures – From Lambda Architecture to Streaming ETL Architecture

The term “Lambda Architecture” was first coined by Nathan Marz, who was a Big Data engineer working for Twitter at the time. The architecture enables real-time data pipelines with low-latency reads and high-frequency updates. It was well received by the Big Data community and led to the book “Big Data – Principles and best practices of scalable real-time data systems” by Nathan Marz and James Warren. The Lambda Architecture has three main components (see Figure 1):

  • The Batch Layer creates views with a batch-oriented framework that processes the entire, immutable data set.
  • The Speed Layer creates incremental views with a real-time data processing framework, based only on the most recent data.
  • The Serving Layer provides a unified view, built from the views of the Batch and Speed Layers, that can be queried by applications.

NB: an immutable data set means that every (subsequent) version of a record is stored in a distributed data store; no logic is implemented to modify or overwrite an existing record with a newer version.
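
To make this concrete, here is a minimal sketch in plain Python (all names are hypothetical, not taken from the article) of such an append-only master data set: every write appends a new, timestamped version of a record, and nothing is ever updated in place.

```python
from dataclasses import dataclass, field
from time import time

@dataclass
class MasterDataset:
    # Append-only log of record versions; older versions are never touched.
    records: list = field(default_factory=list)

    def append(self, key, value):
        # Each write adds a new, timestamped version instead of updating in place.
        self.records.append({"key": key, "value": value, "ts": time()})

store = MasterDataset()
store.append("user-42", {"email": "old@example.com"})
store.append("user-42", {"email": "new@example.com"})  # a newer version, not an update
```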

Figure 1: Lambda Architecture Main Components

The Batch Layer runs periodic batches, every few hours or once a day, over the whole immutable data set to update its views. This introduces a technical challenge: the Speed Layer views must be discarded once the Batch Layer has updated its views with the most recent data (see Figure 2).

Figure 2: Update Process of Batch and Speed Layer Views
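
The sketch below (plain Python, hypothetical names) illustrates this update cycle: the batch view is recomputed from scratch over the whole master data set, the speed view is updated incrementally from recent events, the serving layer merges both to answer queries, and the speed view is discarded once a batch run has absorbed the same data.

```python
from collections import defaultdict

def batch_view(master_records):
    # Recomputed from scratch over the entire master data set (here: counts per key).
    view = defaultdict(int)
    for rec in master_records:
        view[rec["key"]] += 1
    return dict(view)

def update_speed_view(speed_view, new_record):
    # Incremental update based only on the most recent data.
    speed_view[new_record["key"]] = speed_view.get(new_record["key"], 0) + 1
    return speed_view

def serving_layer_query(key, batch, speed):
    # The unified view seen by applications: batch result plus recent increments.
    return batch.get(key, 0) + speed.get(key, 0)

master, speed = [], {}
master.append({"key": "page-1"})                     # data covered by the last batch run
batch = batch_view(master)
speed = update_speed_view(speed, {"key": "page-1"})  # new event arriving after that run
print(serving_layer_query("page-1", batch, speed))   # -> 2

master.append({"key": "page-1"})                     # the same event lands in the master data set
batch = batch_view(master)                           # the next batch run now covers it...
speed = {}                                           # ...so the speed-layer view is discarded
print(serving_layer_query("page-1", batch, speed))   # -> 2, same answer without the speed view
```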

Furthermore, this architecture requires you to implement the same functionality twice, in two distinct layers: the Speed Layer and the Batch Layer. Thankfully, frameworks like Twitter Summingbird and Apache Spark largely spare you from writing the same code twice for the Batch and Speed Layers. Still, the architecture means maintaining two layers that perform the same functionality with different periodicities.

In the past years, the advances around real-time data pipelines have been huge. Apache Spark, one of the most active open-source projects in the world, created the Spark Streaming framework, which now includes fault tolerance, data replication and exactly-once processing of the data. The argument that pure real-time streaming pipelines are too faulty or unreliable to be used standalone no longer holds. It is therefore now possible to build ETL pipelines on streaming technology that update the views of the Serving Layer as the data comes in. With the advent of new technologies, new architectures are also born ☺ (see Figure 3).

Figure 3: Streaming ETL Architecture
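
As a rough illustration, the sketch below uses Spark Structured Streaming (the successor to the Spark Streaming API mentioned above) to build such a streaming ETL pipeline; the Kafka topic, broker address, column names and console sink are assumptions made for the example, not details from the article.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, count

spark = SparkSession.builder.appName("streaming-etl").getOrCreate()

# Extract: read events as an unbounded stream instead of a periodic batch
# (hypothetical Kafka topic "events" on a local broker).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# Transform: parse the payload and aggregate per 5-minute window (the "view").
pages = events.selectExpr("CAST(value AS STRING) AS page", "timestamp")
view = (pages
        .withWatermark("timestamp", "10 minutes")
        .groupBy(window(col("timestamp"), "5 minutes"), col("page"))
        .agg(count("*").alias("hits")))

# Load: continuously update the serving-layer view as the data comes in.
# The checkpoint location enables fault tolerance and recovery after failures.
query = (view.writeStream
         .outputMode("update")
         .format("console")
         .option("checkpointLocation", "/tmp/etl-checkpoint")
         .start())

query.awaitTermination()
```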
