A mesh to clear the mess!

These days, Microservice architecture is often perceived as the obvious solution to legacy application modernization, and rightly so. With business agility being the driver behind most modernization programs, coupled with the proliferation of Cloud, DevOps and Agile, there are few counterarguments to this choice.

Yet not many enterprises have got it right. Why?

Modernization programs using microservice architecture almost always start off well but fail to scale. Going from a few services to an enterprise-scale myriad of micro-sized services creates an operational challenge: the system becomes too large for any one person to comprehend. You may think there is an easy fix: throwing more sysadmins at the problem. Unfortunately, we are well past that: operations teams require new skills and new tools to ensure the health of the system. This operational challenge is too often underestimated and is probably the leading cause of the downfall of modernization initiatives. It is a key issue, but not the topic we are going to address today.

Breaking a legacy monolithic application into smaller, modular, independently deployable micro-applications brings a lot of business agility: changes are smaller and quicker, and so is their impact. But executing an end-to-end business process also means that many of those microservices must interact with one another, which is done through orchestration. This adds a new layer of complexity to architecture and operations, and because the whole user experience relies on it, it is even more critical. That is why failing to manage this orchestration efficiently across several hundred microservices is the other likely root cause of failure for these transformation programs.

So how do you solve this orchestration challenge?

Enterprise Service Buses (ESBs) did this rather well by managing all these orchestrations in a single layer, which also took care of message transformations and the “plumbing” logic such as retries, circuit breakers, timeouts and the handling of dropped or duplicate requests. However, this “centralized architecture” led to other problems:

The ESB was always on the critical path and became a single point of failure.
It could only be scaled vertically, which meant that beyond a point it was very hard to scale.
It often became the bottleneck and slowed development teams down.

To simplify this mess, the concept of “Smart Endpoints and Dumb Pipes” was introduced – where there would be no “orchestration logic” sitting inside a service bus. Instead, microservices would be “choreographed” through an event-driven approach. Each microservice would do “its own thing” and publish an event/message on a queue; only those services that have subscribed to this type of event/message will need to react and then do “their own thing” and so on.

This decentralized, event-driven approach to choreography takes the centralized ESB out of the equation and introduces loose coupling, leading to greater agility and fault tolerance.
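To make the choreography idea concrete, here is a minimal in-memory sketch. The `EventBus` class, the event names (`order.placed`, `payment.completed`) and the two services are hypothetical and chosen purely for illustration; a real system would use a message broker rather than an in-process dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class EventBus:
    """A toy 'dumb pipe': it only routes events, it holds no business logic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # Deliver the event to every service subscribed to this event type.
        for handler in list(self._subscribers[event_type]):
            handler(payload)


def run_order_flow() -> List[str]:
    """Wire up two hypothetical services and trace what happens."""
    bus = EventBus()
    log: List[str] = []

    def payment_service(event: dict) -> None:
        # Reacts to placed orders, then announces its own result.
        log.append(f"payment taken for order {event['order_id']}")
        bus.publish("payment.completed", event)

    def shipping_service(event: dict) -> None:
        # Only cares about completed payments.
        log.append(f"shipment booked for order {event['order_id']}")

    bus.subscribe("order.placed", payment_service)
    bus.subscribe("payment.completed", shipping_service)

    # The order service does "its own thing" and publishes an event.
    log.append("order 42 placed")
    bus.publish("order.placed", {"order_id": 42})
    return log
```

Note that no component knows the whole flow: the sequence emerges from who subscribes to what, which is the essence of “smart endpoints and dumb pipes”.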

However, it also puts a lot more responsibility back on the service side. For example, all the “plumbing” work related to this intercommunication (retry mechanisms, circuit breaker patterns, timeouts, dropped requests, etc.), which would earlier have been handled within the ESB, is now pushed back to the microservice developers to incorporate into their respective services.
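As a rough illustration of how much “plumbing” this implies, the sketch below hand-rolls two of the patterns mentioned: retry with exponential backoff and a simple circuit breaker. All names (`CircuitBreaker`, `call_with_retry`, the thresholds and delays) are assumptions made for this example, not the API of any particular library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the downstream service."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # trip the breaker
            raise
        self._failures = 0  # a success closes the circuit again
        return result


def call_with_retry(func: Callable[[], T], attempts: int = 3,
                    base_delay_s: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry func, doubling the delay after each failed attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except CircuitOpenError:
            raise  # no point retrying while the circuit is open
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")
```

Every service team would have to write, test and maintain something like this, in every language they use, which is exactly the burden the following paragraphs try to remove.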

Of course, we have the option of relying on third-party libraries (Hystrix for Java, Polly for .NET, etc.) that can take care of the “plumbing” aspects, thus freeing up the service developer. However, this takes away a key advantage of microservice architectures, which are polyglot by design. Indeed, it is rather unlikely that you will find such libraries across all programming languages and runtimes with the same consistency, which negates the benefits they provide.

The other option would be to have a separate “proxy” service for each “real” service, running alongside the real one like a sidecar. Even though this proxy would run as an independent service, its function would be to provide the “plumbing” for its corresponding real service. Being an independent service (developed and deployed separately), it is no longer intrusive to the real service.
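The separation of concerns can be sketched as follows. This is an in-process simplification: a real sidecar is a separate process that intercepts network traffic, whereas here a hypothetical `SidecarProxy` class simply wraps a callable. The inventory service, its simulated transient failures and the retry count are all assumptions for illustration.

```python
from typing import Callable, Dict, Optional

Request = Dict[str, str]
Handler = Callable[[Request], str]


def make_inventory_service(failures_before_success: int) -> Handler:
    """Build a 'real' service holding only business logic.

    The first few calls raise ConnectionError to simulate a flaky network.
    """
    remaining = {"failures": failures_before_success}

    def inventory_service(request: Request) -> str:
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            raise ConnectionError("transient network failure")
        return f"item {request['sku']} is in stock"

    return inventory_service


class SidecarProxy:
    """Stands in for a per-service proxy that owns the plumbing."""

    def __init__(self, upstream: Handler, max_attempts: int = 3) -> None:
        self._upstream = upstream
        self._max_attempts = max_attempts
        self.attempts_made = 0  # simple metric the proxy can expose

    def handle(self, request: Request) -> str:
        last_error: Optional[Exception] = None
        for _ in range(self._max_attempts):
            self.attempts_made += 1
            try:
                return self._upstream(request)
            except ConnectionError as err:
                last_error = err  # retry on transient failures only
        raise RuntimeError("upstream unavailable") from last_error
```

A caller would talk to `SidecarProxy(make_inventory_service(2)).handle({"sku": "A1"})` rather than to the service directly; the service code contains no retry logic at all, and could be written in any language.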

This, as a matter of fact, is exactly the principle upon which the service mesh pattern was created.

So, a service mesh is basically just a dedicated infrastructure layer with the sole responsibility of handling service-to-service communications. It is typically implemented as an array of lightweight network proxies that are deployed alongside the application services, without the services needing to be aware of them. It takes care of all the “plumbing” requirements to ensure reliable delivery of requests across the myriad of microservices that comprise a modern-day cloud-native application. A service mesh does not introduce any new functionality. It only enables a shift in where the responsibility lies.

Referring to the diagram above, there is still one more concern: if any configuration change needs to be applied to the proxies, every instance of that proxy service would need to be updated, which would create significant operational overhead. Thus, a typical service mesh pattern always includes an additional control plane layer. This plane is connected to all the proxy service instances and can push configuration and policy changes down to each proxy without redeploying them, making the mesh a lot easier to manage.

This control plane allows the same “centralized control” as with ESBs, without being on the critical path, hence solving a key operational challenge. Note that the control plane has no role to play in the actual service-to-service communication, which happens entirely in what we call the “data plane”. So, even if the control plane were to have an outage, it would only mean that we may not be able to apply policy or configuration changes centrally, which is not business-critical anyway.
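The split between the two planes can be sketched with a toy model. The `ProxyPolicy`, `Proxy` and `ControlPlane` classes and their fields are hypothetical and do not mirror the API of any specific mesh product; the point is only that traffic flows through the proxies whether or not the control plane is reachable.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProxyPolicy:
    timeout_s: float = 1.0
    max_retries: int = 2


class Proxy:
    """Data plane: sits on the request path and uses its last known policy."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.policy = ProxyPolicy()

    def apply_policy(self, policy: ProxyPolicy) -> None:
        self.policy = policy  # no redeploy needed

    def forward(self, payload: str) -> str:
        return f"{self.name} forwarded {payload!r} (timeout={self.policy.timeout_s}s)"


class ControlPlane:
    """Control plane: distributes policy, never touches request traffic."""

    def __init__(self) -> None:
        self._proxies: List[Proxy] = []
        self.available = True  # flip to False to simulate an outage

    def register(self, proxy: Proxy) -> None:
        self._proxies.append(proxy)

    def push(self, policy: ProxyPolicy) -> int:
        if not self.available:
            raise RuntimeError("control plane unavailable")
        for proxy in self._proxies:
            proxy.apply_policy(policy)
        return len(self._proxies)  # number of proxies updated
```

If `available` is set to `False`, `push` fails but every `Proxy.forward` call keeps working with the policy it already holds, which is the property described above.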

Istio is one of the most popular solutions on the market today, providing a comprehensive implementation of the service mesh pattern. But it is not the only one: Consul and Linkerd also offer decent options, as does Envoy, the proxy that Istio itself uses in its data plane.

Gartner extends the mesh concept from the service level to the entire application level in what it calls the “Mesh App and Service Architecture (MASA)”. MASA allows an application to be composed through a mesh of independent apps and services. In other words, MASA links mobile apps, desktop apps, web apps and IoT apps into an interconnected mesh which end users will see as the final product or “the” application! To implement that, one needs to think even beyond the service mesh. An event mesh is an extension of the service mesh concept that facilitates connections between not only microservices, but also legacy applications, devices, data streams, etc. While you may think this is far-fetched, companies like Solace already have event mesh solutions on the market today.

So, is service mesh a must for every cloud-native application?

Well, it depends! Mandating it for every cloud-native application would be rather cloud-naive! The decision depends on where you are on your cloud-native journey. You don’t need it at the start; in fact, because it brings complexity, it might not be the quickest way to build a prototype or an MVP. But as the number of microservices grows and you start seeing the problems discussed earlier in this blog, that is when you may have to consider making this choice. Generally speaking, though, we believe that most enterprise-scale, cloud-native applications would need a service mesh.

If you are planning your cloud-native journey and would like to know more about this, feel free to reach out to us. We at Capgemini have helped several of our clients with their cloud transformation programs, and we will be happy to collaborate with you on this journey as well.

Capgemini

April 2, 2020

Aliasgar Muchhala

Expert in Cloud, Strategy and Transformation
