The term “Lambda Architecture” was coined by Nathan Marz, a Big Data engineer working at Twitter at the time. The architecture enables real-time data pipelines with low-latency reads and high-frequency updates. It was praised and well received by the Big Data community and led to a book: “Big Data – Principles and best practices of scalable real-time data systems” by Nathan Marz and James Warren. The Lambda Architecture has three main components (see Figure 1):

  • The Batch Layer creates views with a batch-oriented framework over the entire, immutable data set.
  • The Speed Layer creates incremental views with a real-time processing framework, based only on the most recent data.
  • The Serving Layer exposes a unified view, built from the Batch and Speed Layer views, that applications can query (a minimal read-time merge is sketched below).
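
To make the Serving Layer's role concrete, here is a minimal Python sketch of the read-time merge it performs over the two views. The view contents and page names are made up for the example; in a real deployment the views would live in a serving store such as Cassandra or Druid, not in process memory.

```python
# Hypothetical in-memory stand-ins for the two views; in a real deployment
# these would live in a serving store, not in process memory.
batch_view = {"page_a": 9500, "page_b": 4200}  # counts up to the last batch run
speed_view = {"page_a": 12, "page_c": 3}       # counts for events since that run

def query_pageviews(page: str) -> int:
    """Serving Layer: merge the batch view with the speed view at read time."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query_pageviews("page_a"))  # 9512: batch total plus the most recent events
```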

NB: an immutable data set means that every (subsequent) version of a record is stored in a distributed data store; no logic ever modifies or overwrites a record in place with a newer version.
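
As a minimal sketch of what append-only storage means in practice (the file path and record fields are hypothetical, and a real master data set would live in a distributed store such as HDFS):

```python
import json
import time

def append_record(log_path: str, record: dict) -> None:
    """Append a new version of a record; nothing is ever updated in place."""
    versioned = {**record, "ingested_at": time.time()}  # each version keeps its own timestamp
    with open(log_path, "a") as f:
        f.write(json.dumps(versioned) + "\n")

append_record("master.log", {"user_id": 42, "email": "old@example.com"})
append_record("master.log", {"user_id": 42, "email": "new@example.com"})
# Both versions now live in the log; consumers derive the current state
# by reading all versions, e.g. keeping the latest one per user_id.
```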

Figure 1: Lambda Architecture Main Components

The Batch Layer runs periodic batch jobs, typically daily or every few hours, over the whole immutable data set to recompute its views. This introduces a technical challenge: the Speed Layer views must be discarded once the Batch Layer has updated its views with the most recent data (see Figure 2).
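
To make this recompute-and-discard cycle concrete, here is a simplified Python sketch; the log format, view names and reset logic are illustrative assumptions:

```python
import json
from collections import Counter

# Hypothetical speed view holding counts for events seen since the last batch run.
speed_view = Counter({"page_a": 12, "page_c": 3})

def recompute_batch_view(log_path: str) -> Counter:
    """Batch Layer: rebuild the view from scratch over the entire master data set."""
    view = Counter()
    with open(log_path) as f:
        for line in f:
            view[json.loads(line)["page"]] += 1
    return view

# When the batch run completes, swap in the fresh batch view and discard the
# speed view entries that the batch results now cover.
batch_view = recompute_batch_view("master.log")
speed_view.clear()  # simplistic reset; real systems track exactly which events the batch absorbed
```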

Figure 2: Update Process of Batch and Speed Layer Views

Furthermore, this architecture requires you to implement the same functionality twice, in two distinct layers: the Speed Layer and the Batch Layer. Thankfully, frameworks like Twitter's Summingbird and Apache Spark largely spare you from writing the same piece of functionality twice for the Batch and Speed Layers. Still, the architecture leaves you maintaining two layers that perform the same work at different periodicities.

In the past years, the advancements around real-time data pipelines have been huge. Apache Spark, one of the most active open source projects in the world, introduced the Spark Streaming framework, which now includes fault tolerance, data replication and exactly-once processing of the data. The argument that pure real-time streaming pipelines are too faulty or unreliable to run standalone no longer holds. It is therefore now possible to build ETL pipelines on streaming technology that update the Serving Layer views as the data comes in. With the advent of new technologies, new architectures are also born ☺ (see Figure 3).

Figure 3: Streaming ETL Architecture
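
Here is a minimal sketch of such a streaming ETL pipeline, using Spark Structured Streaming (the successor to the DStream-based Spark Streaming API). The Kafka topic, event schema and console sink are illustrative assumptions, not a fixed recipe; the job must be run with the Spark Kafka connector package on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("StreamingETL").getOrCreate()

# Hypothetical event schema: who did something, and when.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_time", TimestampType()),
])

# Extract: read the raw event stream from a Kafka topic (topic name assumed).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "events")
       .load())

# Transform: parse the JSON payload and count events per user in 5-minute windows.
events = raw.select(from_json(col("value").cast("string"), schema).alias("e")).select("e.*")
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "5 minutes"), col("user_id"))
          .count())

# Load: continuously update the serving view as data arrives. Checkpointing
# provides fault tolerance; with an idempotent or transactional sink this
# yields end-to-end exactly-once processing.
query = (counts.writeStream
         .outputMode("update")
         .format("console")  # stand-in for a real serving store (e.g. a database)
         .option("checkpointLocation", "/tmp/etl-checkpoint")
         .start())

query.awaitTermination()
```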
