Real-time analytics in the cloud

Here we look at what the big three – AWS, Azure, and GCP – offer in the real-time analytics space.

Real-time analytics, as defined by Gartner, is the discipline that applies logic and mathematics to data to provide insights for making better decisions quickly. It is further categorized into on-demand and continuous analytics, meaning that analysis is completed within a few seconds to minutes of the arrival of data.

Data processing for real-time analytics is a three-stage process.

  • Ingestion, or the collection of data
  • Processing, or the transformation of data into information
  • Analysis, or the derivation of insight from information
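As a rough sketch, the three stages can be chained together as three small functions. This is a minimal, self-contained Python illustration; the event fields and the threshold are invented for the example, and real pipelines would use the managed services described below:

```python
# Toy sketch of the three stages of real-time data processing.
# The event shape (device_id, temp_c) and the 40-degree alert
# threshold are invented for illustration.

def ingest():
    """Stage 1 - Ingestion: collect raw events as they arrive."""
    return [
        {"device_id": "d1", "temp_c": 21.5},
        {"device_id": "d2", "temp_c": 48.0},
        {"device_id": "d1", "temp_c": 22.1},
    ]

def process(events):
    """Stage 2 - Processing: transform raw data into information."""
    return [e for e in events if e["temp_c"] > 40.0]  # keep anomalies only

def analyze(info):
    """Stage 3 - Analysis: derive an actionable insight."""
    return {e["device_id"] for e in info}  # devices that need attention

hot_devices = analyze(process(ingest()))
print(hot_devices)  # → {'d2'}
```

In a managed service each stage maps to a building block: ingestion to a stream, processing to a job or pipeline, and analysis to the output sink.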

A cloud provider's offerings are what define and differentiate it. Let us look at what the big three – AWS (Amazon Web Services), Azure, and GCP (Google Cloud Platform) – offer in the real-time analytics space.




Building Blocks


AWS

  • Kinesis Data Stream – a set of data records, organized into shards
  • Shard – the unit of data in a data stream; replicated three ways, stores up to 1 MB of data
  • Kinesis Client Library – precompiled libraries used by the consumer/client for fault-tolerant consumption of data from the stream

Azure

  • Job – specifies the input source for streaming data, a transformation query to filter, sort, aggregate, and join streaming data over time, and the consuming or output medium to which results are sent
  • Streaming unit – represents the computing resources (compute, memory, and throughput) allocated to execute a job
  • Inputs – the entities from which data is read (e.g. IoT Hub, Blob storage, Event Hubs)
  • Outputs – the entities to which data is sent (e.g. Cosmos DB, Azure Data Lake, Service Bus queues)
  • Function – Azure's serverless computing entity
  • Reference data inputs – a finite set of lookup data that can be used during data processing

GCP

  • Pipeline – an entity that encapsulates an input, a series of transformations, and an output; once deployed on Google Cloud, it becomes a job that can be run repeatedly
  • Uniform interface for both stream (live data) and batch (historical data) modes
  • Shuffle – the set of data transformation operations that enables sorting the data by key in a scalable, efficient, and fault-tolerant manner
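To make the shard concept concrete, the sketch below shows how a partition key can be hashed onto one of N shards – the same idea Kinesis uses, where the MD5 hash of the partition key places each record in one shard's slice of a 128-bit key range. The shard count and keys here are invented for illustration:

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key onto one of num_shards shards, Kinesis-style:
    MD5 the key, then place the 128-bit value in an evenly split range."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    # Guard against the tiny remainder at the top of the key space.
    return min(h // range_size, num_shards - 1)

# Records with the same partition key always land on the same shard,
# which is what preserves per-key ordering within a stream.
assert shard_for_key("sensor-42", 4) == shard_for_key("sensor-42", 4)
for key in ("a", "b", "c", "sensor-42"):
    assert 0 <= shard_for_key(key, 4) < 4
```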

Available APIs


  • AWS – supported via the AWS SDK for .NET, Java, C++, Go, JavaScript, PHP V3, Python, and Ruby V2
  • Azure – supported via the Azure SDK for .NET
  • GCP – utilizes Java- and Python-based Apache Beam, with 100+ packages



Scaling

AWS

  • Increasing instance size, manually or via Amazon EC2 auto-scaling based on metrics
  • Increasing the number of instances up to the maximum number of open shards (so that each shard can be processed independently on its own instance)
  • Increasing the number of shards (increases the level of parallelism)

Azure

  • Scaling with streaming units
  • Query parallelization with partitioning of data – the concept of embarrassingly parallel jobs with granularity 1 (one input partition – one query instance – one output partition)
  • Increasing the batch size of a job improves throughput, as more events are processed within a limited count of calls to the machine-learning web service

GCP

  • Streaming auto-scaling option available (not the default)
  • Worker count set during pipeline development via maxNumWorkers – cannot be changed at run time, requires redeployment, upper limit of 1,000
  • Bucketing data by time – a development-phase feature
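On the AWS side, the number of shards to scale to follows directly from the per-shard ingest limits quoted in the limitations section below (1 MB/second or 1,000 records/second, whichever binds first). A back-of-the-envelope Python helper, with an invented example workload:

```python
import math

def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    """Estimate Kinesis shards from the per-shard ingest limits:
    1 MB/second or 1,000 records/second, whichever binds first."""
    by_throughput = (records_per_sec * avg_record_kb) / 1024  # MB/s per 1 MB/s
    by_records = records_per_sec / 1000                       # per 1,000 rec/s
    return max(1, math.ceil(max(by_throughput, by_records)))

# 5,000 records/s of 0.5 KB each: the record-count limit binds
# (5 shards), not throughput (2.5 MB/s would only need 3 shards).
print(shards_needed(5000, 0.5))  # → 5
```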



Security and Encryption

Encryption before writing to stream storage and decryption after retrieval from storage enhance the security of data at rest within the stream and help meet regulatory requirements. All three providers support encryption/decryption for data at rest, i.e. data stored on disk or backup media, per data chunk.

AWS

  • Uses AWS Key Management Service
  • Available in selected regions (Ireland, London, and Frankfurt in the EU)
  • Increases latency (<100 μs) for the GetRecords and PutRecord/s APIs

Azure

  • Supported with Azure Key Vault and Azure Active Directory
  • Azure Active Directory is a non-regional product; Azure Key Vault is available in all EU regions
  • Additional latency, which can be significantly reduced by selecting the service and key management in the same or nearby regions

GCP

  • Supported with Google Key Management Service
  • Available in selected regions (Finland, Belgium, Netherlands, Frankfurt, and London in the EU)
  • Additional latency, which can be significantly reduced by selecting the service and key management in the same or nearby regions
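The encrypt-on-write / decrypt-on-read flow described above can be sketched as follows. This is a toy illustration only: a real deployment would fetch the key from the provider's KMS or Key Vault and use an authenticated cipher such as AES-GCM; here a keyed SHA-256 XOR stream stands in for the cipher so the example stays dependency-free:

```python
import hashlib
import os

def _keystream(key: bytes, n: int) -> bytes:
    """Derive n pseudo-random bytes from the key (toy stand-in for a cipher)."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Applied before the record is written to stream storage."""
    return bytes(a ^ b for a, b in zip(plaintext, _keystream(key, len(plaintext))))

def decrypt(key: bytes, ciphertext: bytes) -> bytes:
    """Applied after the record is retrieved from storage (XOR is symmetric)."""
    return encrypt(key, ciphertext)

key = os.urandom(32)           # in production this would come from KMS / Key Vault
record = b'{"device_id": "d1", "temp_c": 21.5}'
stored = encrypt(key, record)  # what actually rests on disk
assert stored != record
assert decrypt(key, stored) == record
```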




Pricing

All three providers offer pay-as-you-go pricing with no upfront or minimum fee.

AWS

  • Charges based on a per-million payload-unit rate; data is counted in chunks of 25 KB
  • Enhanced fan-out incurs extra cost, charged per consumer, per shard, per hour, and per GB of data retrieved
  • An increase in the default retention period is charged per shard, per hour
  • Additional cost for API usage for encryption/decryption; the AWS default user on KMS is free, while a custom key ID incurs extra charges

Azure

  • Per-hour billing, charged at the highest count of streaming units used in one hour

GCP

  • Usage billed by the volume of streaming data processed, i.e. ingestion, pipeline stages, shuffling, and output to data sinks
  • Separate prices for batch and stream workers
  • First 5 TB of service-based shuffling discounted
  • Additional cost for resources that a job consumes, e.g. BigQuery, Cloud Bigtable, etc.
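The AWS payload-unit scheme is easy to misestimate: every record is rounded up to the next whole 25 KB chunk, so many small records can bill the same as fewer large ones. A quick sketch of the rounding (record sizes are invented for the example; no price rate is assumed):

```python
import math

CHUNK_KB = 25  # each payload unit covers up to 25 KB

def payload_units(record_sizes_kb):
    """Each record is rounded up to a whole number of 25 KB payload units."""
    return sum(math.ceil(size / CHUNK_KB) for size in record_sizes_kb)

# Three records of 1 KB, 26 KB, and 60 KB:
# 1 KB -> 1 unit, 26 KB -> 2 units, 60 KB -> 3 units.
print(payload_units([1, 26, 60]))  # → 6
```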



Limitations – write, read, storage

AWS

  • 1 MB/second or 1,000 records/second of ingest capacity per shard
  • 2 MB/second of read capacity per shard, shared among all consumers
  • Scaling up to 2 MB/second per consumer for reads is available using enhanced fan-out
  • Default retention period of 24 hours, stretchable to a maximum of 168 hours (seven days)

Azure

  • 200 streaming units per subscription per region
  • 1,500 jobs per subscription per region
  • Fixed maximum of 120 streaming units per job
  • Hard limit of 60 each of inputs, outputs, and functions per job
  • Maximum of 100 MB blob size for reference data
  • No explicit job retention

GCP

  • 1,000 Compute Engine instances per job
  • 25 concurrent jobs allowed per project; 125 concurrent jobs allowed per organization
  • 3,000,000 requests per user per minute; 15,000 monitoring requests per user per minute
  • 160 shuffle slots (~30 TB of data) of concurrent shuffle per project
  • No explicit job retention
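The read-side limits on the AWS row have a practical consequence: without enhanced fan-out, the 2 MB/second per shard is shared across all consumers, while with enhanced fan-out each consumer gets a dedicated 2 MB/second. A small comparison sketch:

```python
def per_consumer_mbps(consumers: int, enhanced_fan_out: bool) -> float:
    """Per-consumer read throughput for one Kinesis shard:
    a shared 2 MB/s without enhanced fan-out, 2 MB/s each with it."""
    return 2.0 if enhanced_fan_out else 2.0 / consumers

# Five consumers reading the same shard:
print(per_consumer_mbps(5, False))  # → 0.4 (MB/s each, shared)
print(per_consumer_mbps(5, True))   # → 2.0 (MB/s each, dedicated)
```

The dedicated throughput is what the enhanced fan-out surcharge in the pricing section pays for.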

Example Use Cases

[Image: example real-time analytics use cases in the cloud]

Each of the three addresses real-time analytics in a broadly similar way, leveraging its existing services for the underlying infrastructure platform. AWS addresses security more competitively than the others by also providing encryption for data on the move. Azure may be the better alternative from a software-lifecycle standpoint, as most of its principles leverage existing Microsoft ETL know-how. Google's niche is data, and it has built its service on Apache Beam, better known as the uber-API for big data. Established data players (such as Attunity, Databricks, Talend, Informatica, and Looker, to name a few) each provide the capability to migrate data layers to the cloud platform of choice, enabling integration with the cloud provider's SaaS offerings.

A unified cloud-provider strategy may help with better pricing, performance, and a cohesive approach to common tasks (such as region selection and security), while the choice of a SaaS implementation weighs heavily on where an organization is in its cloud journey. The real-time analytics space is no different.

Bhawna Gupta is a Microsoft-certified Cloud Architect on Azure and has delivered several cloud implementations.
