
Why and how to monitor a SONiC NOS network
A complete approach to monitoring a network based on SONiC NOS, to enable analytics and automation

Kishor Kulkarni
Apr 15, 2024
capgemini-engineering

This blog examines why monitoring networks built on SONiC NOS (Software for Open Networking in the Cloud) is vital to meeting growing user expectations for reliability and speed. We discuss the move towards autonomous networks that adapt to changing traffic and faults, reducing disruptions.

According to the market intelligence company IDC, worldwide spending on public cloud services is forecast to reach $1.35 trillion in 2027.

As the cloud becomes more ubiquitous and capable, consumers expect an ever-improving quality of service from the services they use. Whether it’s accessing a website, streaming media, or gaming, these consumers want it to be faster and more reliable. Network operators are under pressure to ensure that their infrastructure can deliver on these expectations.

This ‘always-on, always-connected, always-available’ expectation drives the need for autonomous networks that can respond dynamically to changing traffic demands, degradation or faults, so that users are affected as little as possible.

One critical aspect of an autonomous network is the ability to monitor network elements and traffic flows at various points in that network. Datacenter networks, based on the well-established Clos or Leaf-Spine topologies along with technologies like MC-LAG (Multi-Chassis Link Aggregation), have an extremely high level of redundancy and, as a result, high fault tolerance. However, enterprise and service-provider access/aggregation networks use different topologies and may not have the luxury of building such a high level of redundancy into the topology itself. Because their infrastructures are less fault-tolerant, these operators need even closer monitoring.

A variety of monitoring tools are currently available and widely deployed in networks. However, many of these take a performance-monitoring approach rather than a telemetry approach. This matters mainly from an investment perspective: large datacenters can often afford expensive monitoring systems, some of which are bespoke to the equipment they use. In contrast, smaller enterprises usually can’t. Most importantly for the focus of this blog, these bespoke tools cannot monitor SONiC NOS-based networks.

Why is this important? SONiC (Software for Open Networking in the Cloud) is a major enabler for open networking in datacenters, which require accurate, up-to-the-second monitoring. SONiC is typically installed on a router or switch in a network, but network monitoring must be done by other, external software, e.g. on a server. However, existing installed tools may not readily be able to collect the necessary information from SONiC, due to compatibility issues.

To this end, we have put together a demonstration and proof of concept that shows a completely open solution to the challenge of properly monitoring your network, leveraging popular and proven open-source components.

Our SONiC monitoring solution

We use a combination of sFlow-RT, Prometheus and Grafana for this. The sFlow-RT tool collects telemetry data using the established sFlow methodology. Prometheus stores the collected data in a real-time database and provides access to this data to other tools/systems. Grafana offers an administrator a rich visual view of the network traffic. We used the open-source Debian Linux distribution as the platform for these tools, and we generated the traffic data patterns using TRex, an open-source traffic generator.

The benefits of such network monitoring include:  

  • Immediate network visibility
  • Real-time insights  
  • Critical metrics on utilization and error statistics  
  • Being able to identify irregular traffic patterns  
  • The ability to respond to emerging network issues  
  • Maintaining optimal performance  

More about sFlow

SONiC-based network elements include switches and routers. We will now delve into the details of sFlow: why it is important and how it is integrated into these network elements to monitor and collect essential network traffic data. This data includes packet and byte counts, traffic patterns, and flow information.

  

sFlow (an abbreviation of “sampled flow”) is a network monitoring technology that provides real-time insight into network traffic. It doesn’t capture every single packet that flows through a network, but rather samples a subset of packets for analysis. From the sampled packets, sFlow collects valuable information about source and destination, the type of traffic, and traffic volumes, which helps to identify patterns, trends, and irregularities within the network’s traffic. This is especially useful in troubleshooting, to spot inefficiencies and problem areas in a network.
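To make the sampling idea concrete, here is a minimal Python sketch of how totals can be estimated from 1-in-N packet samples by scaling the sampled counts by N. The function names and the numbers are illustrative assumptions, not part of any sFlow tool, and real collectors apply further statistical corrections.

```python
import random


def simulate_sampling(packet_sizes, sampling_rate, seed=0):
    """Keep roughly 1 in `sampling_rate` packets, chosen at random."""
    rng = random.Random(seed)
    return [size for size in packet_sizes if rng.randrange(sampling_rate) == 0]


def estimate_totals(sampled_sizes, sampling_rate):
    """Scale the sampled packet and byte counts up by the sampling rate."""
    estimated_packets = len(sampled_sizes) * sampling_rate
    estimated_bytes = sum(sampled_sizes) * sampling_rate
    return estimated_packets, estimated_bytes


# Illustrative traffic: 100,000 packets of 500 bytes each, sampled 1-in-1000.
traffic = [500] * 100_000
samples = simulate_sampling(traffic, sampling_rate=1000)
packets, byte_count = estimate_totals(samples, sampling_rate=1000)
```

The estimate gets closer to the true totals as more samples accumulate, which is why a lower sampling rate (a larger N) trades accuracy for less load on the device.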

The SONiC NOS includes an sFlow agent. This sFlow agent, when correctly set up, samples and collects data about network traffic flowing through the device. It sends the collected information, encapsulated in sFlow datagrams, to a designated destination, typically an sFlow collector. These datagrams contain details about the sampled packets, such as source and destination addresses, ports, and other relevant information.
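As an illustration of what a collector sees first, the sketch below decodes the fixed header at the start of an sFlow version 5 datagram: version, agent address, sub-agent ID, sequence number, uptime and number of samples. It is a simplified example based on the published sFlow v5 format; it does not decode the samples themselves, and the function name is an assumption for this post.

```python
import ipaddress
import struct


def parse_sflow_header(datagram: bytes) -> dict:
    """Decode the header fields of an sFlow v5 datagram (network byte order)."""
    version, address_type = struct.unpack_from("!II", datagram, 0)
    if version != 5:
        raise ValueError(f"unsupported sFlow version: {version}")
    # Address type 1 means IPv4 (4 bytes), 2 means IPv6 (16 bytes).
    if address_type == 1:
        address_length = 4
    elif address_type == 2:
        address_length = 16
    else:
        raise ValueError(f"unknown agent address type: {address_type}")
    offset = 8
    agent = ipaddress.ip_address(datagram[offset:offset + address_length])
    offset += address_length
    sub_agent_id, sequence, uptime_ms, sample_count = struct.unpack_from(
        "!IIII", datagram, offset
    )
    return {
        "version": version,
        "agent_address": str(agent),
        "sub_agent_id": sub_agent_id,
        "sequence_number": sequence,
        "uptime_ms": uptime_ms,
        "sample_count": sample_count,
    }
```

In practice a collector such as sFlow-RT does this decoding for you; the sketch is only meant to show the kind of information each datagram carries.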

sFlow-RT (sFlow-RT installation) is the sFlow Collector, a tool to collect sFlow data sent out by sFlow agents embedded in devices in the network. It can also provide critical metrics, like packets-received, bytes-received, packets-transmitted, bytes-transmitted, utilization, and error statistics. A Collector receives, stores and analyzes sFlow datagrams from multiple agents across the network. It stores the data in a real-time database, providing real-time visibility into network performance, traffic patterns, and anomalies in the network.  
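Utilization, one of the metrics mentioned above, is typically derived from two readings of an interface byte counter taken a known interval apart. A minimal sketch of that arithmetic follows; the function name and the sample values are illustrative and are not taken from sFlow-RT itself.

```python
def utilization_percent(bytes_start, bytes_end, interval_s, link_speed_bps):
    """Return link utilization (%) from two byte-counter readings."""
    if interval_s <= 0 or link_speed_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    delta_bytes = bytes_end - bytes_start
    if delta_bytes < 0:
        # A smaller second reading means the counter wrapped or was reset.
        raise ValueError("counter wrapped or reset between readings")
    bits_transferred = delta_bytes * 8
    return 100.0 * bits_transferred / (interval_s * link_speed_bps)


# Illustrative values: 125,000,000 bytes in 10 seconds on a 1 Gbit/s link.
usage = utilization_percent(0, 125_000_000, 10, 1_000_000_000)
```

With these sample values, 1 Gbit is transferred in 10 seconds on a 1 Gbit/s link, which works out to about 10% utilization.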

Prometheus (Prometheus installation) is a monitoring and alerting toolkit that collects, stores, analyzes, and visualizes time-series data, including network flow data. It stores the collected telemetry data in a real-time database and provides an elegant, user-friendly interface to read and use the collected data, both in real time and after the fact.
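Other tools can read the stored data through the Prometheus HTTP API. The sketch below builds an instant-query URL for the `/api/v1/query` endpoint and extracts label/value pairs from the JSON response format that Prometheus documents. The server address and the metric name `sflow_ifinutilization` are placeholders assumed for illustration; substitute the names your own setup exports.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body):
    """Return (labels, value) pairs from an instant-vector query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each sample carries its labels and a [timestamp, "value"] pair.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


# Hypothetical server address and metric name, for illustration only.
url = build_query_url("http://192.0.2.10:9090", "sflow_ifinutilization")
```

Fetching `url` with any HTTP client and passing the response body to `parse_instant_vector` yields one entry per interface time series.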

Grafana (Grafana installation) is a data visualizer. It transforms the data/metrics from other tools (like Prometheus) into meaningful visualizations. The inbuilt dashboards make it easy to get started, by setting up the most common parameters that network administrators are interested in. The dashboards can be customized and additional elements can be added, depending upon need. This helps users to interpret and analyze network performance data.
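The Prometheus data source can be added through the Grafana web interface, or scripted through Grafana's HTTP API by posting a JSON body to `/api/datasources`. A minimal sketch of such a body follows; the data source name and the address are assumptions for illustration, and authentication is left out.

```python
import json


def grafana_prometheus_datasource(name, prometheus_url, is_default=True):
    """Build the JSON body for creating a Prometheus data source in Grafana."""
    return {
        "name": name,
        "type": "prometheus",   # Grafana's plugin id for Prometheus
        "url": prometheus_url,  # where Grafana can reach Prometheus
        "access": "proxy",      # the Grafana server makes the requests
        "isDefault": is_default,
    }


# Hypothetical name and address, for illustration only.
body = json.dumps(grafana_prometheus_datasource("sonic-prometheus", "http://192.0.2.10:9090"))
```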

The detailed configurations are shown in the video. A summary of the configurations is below.

Switch/router with SONiC
· Enable sFlow
· Configure polling interval
· Add an agent-id
· Enable sFlow on an interface and map it to the added agent-id
· Configure the name and IP-address of the system on which sFlow-RT is installed

sFlow-RT/Collector
· Open configuration file sflow-rt.conf
· Configure exported details (type=Prometheus, IP-address and port number of system running Prometheus)
· Restart sFlow-RT service

Prometheus
· Add one or more jobs into the prometheus.yaml file

Grafana
· Configure Prometheus as the type of data source
· Configure the IP address of the system on which Prometheus is running, as the data source
· Import a pre-created dashboard by selecting from the available list
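The switch-side steps in the summary above can be sketched as a small Python helper that produces the corresponding SONiC CLI commands. The command forms follow the sFlow section of the SONiC command reference as we understand it and may differ between releases, so verify them on your own image. The collector name, address and interface names are placeholders, and the commands normally need to be run with administrative privileges on the switch.

```python
import ipaddress


def sonic_sflow_commands(collector_name, collector_ip, agent_interface,
                         sampled_interfaces, polling_interval_s=20):
    """Return SONiC CLI commands for the switch-side sFlow steps, in order."""
    ipaddress.ip_address(collector_ip)  # raises ValueError on a malformed address
    commands = [
        "config sflow enable",
        f"config sflow polling-interval {polling_interval_s}",
        f"config sflow agent-id add {agent_interface}",
    ]
    # Enable sampling on each interface to be monitored.
    commands += [
        f"config sflow interface enable {name}" for name in sampled_interfaces
    ]
    # Point the agent at the system running sFlow-RT.
    commands.append(f"config sflow collector add {collector_name} {collector_ip}")
    return commands


# Placeholder values, for illustration only.
for command in sonic_sflow_commands("sflow-rt", "192.0.2.20", "eth0", ["Ethernet0"]):
    print(command)
```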

Better oversight: better network performance

This setup helps network administrators gain insights into network traffic, so that they can oversee their networks, identify hot spots, troubleshoot issues and optimize performance. This allows you to get the best out of your deployed network resources.  

Capgemini Engineering helps clients to best use SONiC in their projects. Contact our experts today to see how we can help you leverage the benefits of open networking.

Meet our expert

Kishor Kulkarni

Director-Principal Engineer, Capgemini Engineering
Kishor Kulkarni is a Network Architect with 28 years of experience in the telecommunications industry. He has been involved in the development of networking IPs for edge/core routers, focused on packet forwarding and QoS. He played a crucial part in developing SDN solutions aimed at orchestrating Metro Ethernet services, facilitating closed-loop automation and scalable performance monitoring solutions.