Real-time analytics in the cloud

Here we look at what the big three – AWS, Azure, and GCP – offer in the real-time analytics space.

Real-time analytics, as defined by Gartner, is the discipline that applies logic and mathematics to data to provide insights for making better decisions quickly. It is further categorized into on-demand and continuous analytics, and in practice means that analysis is completed within a few seconds to minutes of the data arriving.

Data processing is a three-stage process:

  • Ingestion, or the collection of data
  • Processing, or transforming the data into information
  • Analysis, or deriving insight from the information
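As a rough illustration only (the sensor readings and the temperature-averaging "insight" below are invented, and no cloud SDK is involved), the three stages can be sketched in a few lines of Python:

```python
from statistics import mean

def ingest():
    # Stage 1: ingestion – collect raw events (hard-coded stand-ins here)
    yield from [{"sensor": "s1", "temp_c": 21.5},
                {"sensor": "s1", "temp_c": 22.1},
                {"sensor": "s2", "temp_c": 35.0}]

def process(events):
    # Stage 2: processing – transform raw data into information
    for e in events:
        yield {"sensor": e["sensor"], "temp_f": e["temp_c"] * 9 / 5 + 32}

def analyze(records):
    # Stage 3: analysis – derive an insight (here, a simple average)
    return mean(r["temp_f"] for r in records)

print(analyze(process(ingest())))
```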

A cloud provider's offerings are what define and differentiate it. Let us look at what the big three – AWS (Amazon Web Services), Azure, and GCP (Google Cloud Platform) – offer in the real-time analytics space.

Building Blocks

AWS
  • Kinesis Data Stream – a set of records, organized into shards
  • Shard – the unit of data in a data stream; replicated three ways and storing up to 1 MB of data
  • Kinesis Client Library – precompiled libraries used by the consumer/client for fault-tolerant consumption of data from the stream

Azure
  • Job – specifies the input source of the streaming data, a transformation query to filter, sort, aggregate, and join streaming data over time, and finally the consuming or output medium to send it to
  • Streaming unit – the computing resources (compute, memory, and throughput) allocated to execute a job
  • Inputs – the entities from which data is read (e.g. IoT Hub, Blob storage, Event Hubs)
  • Outputs – the entities to which data is sent (e.g. Cosmos DB, Azure Data Lake, Service Bus queues)
  • Function – Azure's serverless computing entity
  • Reference data inputs – a finite set of lookup data that can be used during data processing

GCP
  • Pipeline – an entity that encapsulates an input, a series of transformations, and an output; once deployed on Google Cloud, it becomes a job that can be run repeatedly
  • Uniform interface for both stream (live data) and batch (historical data) modes
  • Shuffle – the set of data transformation operations that sorts data by key in a scalable, efficient, and fault-tolerant manner
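As a minimal sketch of the AWS building blocks (the stream name "clickstream", the region, and the payload are assumptions, and valid AWS credentials are required), a producer and a bare-bones consumer might look like this with boto3; in practice the Kinesis Client Library handles per-shard iteration and checkpointing:

```python
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

# Producer: each record (up to 1 MB) is routed to a shard by its partition key.
kinesis.put_record(
    StreamName="clickstream",                     # hypothetical stream
    Data=json.dumps({"user": "u42", "event": "page_view"}).encode(),
    PartitionKey="u42",
)

# Consumer: read from the first shard of the stream.
shard_id = kinesis.describe_stream(StreamName="clickstream")[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
print(kinesis.get_records(ShardIterator=iterator)["Records"])
```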

Available APIs

AWS
  • Supported via the AWS SDK for .NET, Java, C++, Go, JavaScript, PHP V3, Python, and Ruby V2

Azure
  • Supported via the Azure SDK for .NET

GCP
  • Uses Java- and Python-based Apache Beam, with 100+ packages
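To illustrate the GCP column (a hedged sketch only; the input file, output prefix, and transforms are invented), an Apache Beam pipeline in Python keeps the same shape whether it runs locally in batch mode or on Dataflow against a streaming source, and the key-based aggregation is what relies on the shuffle described above:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Add runner, project, and region options to submit this as a Dataflow job.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        # A batch source; swapping in a Pub/Sub source makes the same pipeline streaming.
        | "Read" >> beam.io.ReadFromText("events.csv")
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeyByUser" >> beam.Map(lambda fields: (fields[0], 1))
        | "CountPerUser" >> beam.CombinePerKey(sum)   # relies on the shuffle by key
        | "Write" >> beam.io.WriteToText("user_counts")
    )
```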
 

Scaling

AWS
  • Increase instance size, manually or via Amazon EC2 auto scaling based on metrics
  • Increase the number of instances, up to the maximum number of open shards (so that each shard can be processed independently on its own instance)
  • Increase the number of shards (raises the level of parallelism)

Azure
  • Scale with streaming units
  • Query parallelization with partitioning of data – the concept of embarrassingly parallel jobs with granularity 1 (1 input partition – 1 query instance – 1 output partition)
  • Increasing the batch size of a job improves throughput, since more events are processed within a limited count of calls to the machine-learning web service

GCP
  • Streaming auto-scaling option available (not the default)
  • Worker count is set during pipeline development via maxNumWorkers – it cannot be changed at run time, requires redeployment, and has an upper limit of 1,000
  • Bucketing data by time – a development-phase feature
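For example (a sketch assuming the hypothetical "clickstream" stream from earlier; Kinesis limits how far a single call can scale the shard count), resharding on AWS and capping Dataflow workers look like this:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

# AWS: raise the parallelism level by increasing the number of shards.
kinesis.update_shard_count(
    StreamName="clickstream",
    TargetShardCount=4,
    ScalingType="UNIFORM_SCALING",
)

# GCP: the worker ceiling is fixed at deployment time, e.g.
#   PipelineOptions(["--max_num_workers=50"])
# and changing it means redeploying the pipeline.
```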
 

Encryption

AWS
  • Data is encrypted before being written to stream storage and decrypted after retrieval, enhancing the security of data at rest within the stream and helping meet regulatory requirements
  • Uses AWS Key Management Service
  • Available in selected regions (Ireland, London, and Frankfurt in the EU)
  • Adds latency (<100 μs) to the GetRecords and PutRecord/s APIs

Azure
  • Supports encryption/decryption of data at rest (i.e. data stored on disk or backup media), per data chunk
  • Supported with Azure Key Vault and Azure Active Directory; Azure Active Directory is a non-regional product, and Azure Key Vault is available in all EU regions
  • Additional latency, which can be significantly reduced by placing the service and key management in the same or nearby regions

GCP
  • Supports encryption/decryption of data at rest (i.e. data stored on disk or backup media), per data chunk
  • Supported with Google Key Management Service
  • Available in selected regions (Finland, Belgium, Netherlands, Frankfurt, and London in the EU)
  • Additional latency, which can be significantly reduced by placing the service and key management in the same or nearby regions
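As a sketch of the AWS side (stream name as before; "alias/aws/kinesis" is the AWS-managed default key, whereas a customer-managed KMS key would add KMS charges), server-side encryption is switched on per stream:

```python
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

# Records are encrypted with the chosen KMS key before they are written to
# stream storage and decrypted transparently on retrieval.
kinesis.start_stream_encryption(
    StreamName="clickstream",
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",   # or a customer-managed key ARN
)
```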
 

Pricing

AWS
  • Pay as you go, no upfront or minimum fee
  • Charges based on a per-million payload unit rate; data is counted in chunks of 25 KB
  • Enhanced fan-out incurs extra cost, charged per consumer, per shard, per hour, and per GB of data retrieved
  • Increases to the default retention period are charged per shard, per hour
  • Additional cost for API usage for encryption/decryption
  • The AWS default key on KMS is free; a custom key ID incurs extra charges

Azure
  • Pay as you go, no upfront or minimum fee
  • Per-hour billing, charged at the highest count of streaming units used in that hour

GCP
  • Pay as you go, no upfront or minimum fee
  • Usage billed by the volume of streaming data processed, i.e. ingestion, pipeline stages, shuffling, and output to data sinks
  • Separate prices for batch and stream workers
  • The first 5 TB of service-based shuffling is discounted
  • Additional cost for other resources that a job consumes, e.g. BigQuery, Cloud Bigtable, etc.
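To see how the AWS payload-unit model works in practice, here is a back-of-the-envelope sketch; the traffic figures and the per-million rate are placeholders, not published prices:

```python
import math

records_per_second = 2_000
record_size_kb = 3
rate_per_million_units = 0.014   # hypothetical $ per million PUT payload units

# Each record is rounded up to whole 25 KB payload units.
units_per_record = math.ceil(record_size_kb / 25)
units_per_month = records_per_second * units_per_record * 60 * 60 * 24 * 30

print(f"{units_per_month / 1e6:,.0f} million payload units "
      f"≈ ${units_per_month / 1e6 * rate_per_million_units:,.2f} per month")
```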

 

 

Limitations – write, read, storage

AWS
  • 1 MB/second or 1,000 records/second of ingest capacity per shard
  • 2 MB/second of read capacity per shard, shared among all consumers
  • Reads can scale up to 2 MB/second per consumer using enhanced fan-out
  • Default retention period of 24 hours, extendable to a maximum of 168 hours (seven days)

Azure
  • 200 streaming units per subscription per region
  • 1,500 jobs per subscription per region
  • Fixed limit of 120 streaming units per job
  • Hard limit of 60 each of inputs, outputs, and functions per job
  • Maximum 100 MB blob size for reference data
  • No explicit job retention

GCP
  • 1,000 Compute Engine instances per job
  • 25 concurrent jobs allowed per project
  • 125 concurrent jobs allowed per organization
  • 3,000,000 requests per user per minute
  • 15,000 monitoring requests per user per minute
  • 160 shuffle slots (~30 TB of data) of concurrent shuffle per project
  • No explicit job retention
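As a quick sizing sketch against the per-shard write and read limits above (the throughput figures are invented; consumers here are assumed not to use enhanced fan-out, so they share 2 MB/s per shard):

```python
import math

records_per_second = 5_000
record_size_kb = 2
consumers = 3                     # standard consumers share 2 MB/s per shard

ingest_mb_per_s = records_per_second * record_size_kb / 1024

shards_for_ingest = max(
    math.ceil(ingest_mb_per_s / 1.0),          # 1 MB/s write per shard
    math.ceil(records_per_second / 1_000),     # 1,000 records/s per shard
)
shards_for_read = math.ceil(ingest_mb_per_s * consumers / 2.0)   # 2 MB/s read per shard

print("Shards needed:", max(shards_for_ingest, shards_for_read))
```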

Example Use Cases

[Figure: example real-time analytics use cases in the cloud]

Each provider addresses real-time analytics in a broadly similar way and leverages its existing services for the underlying infrastructure platform requirements. AWS addresses security more competitively than the others by providing encryption for data on the move. Azure may be the better alternative from a software-lifecycle standpoint, as most of its principles leverage existing Microsoft ETL know-how. Google's niche is data, and it has built its service on Apache Beam, better known as the uber-API for big data. Established data players (such as Attunity, Databricks, Talend, Informatica, and Looker, to name a few) each provide the capability to migrate data layers onto the cloud platform of choice, enabling integration with the cloud provider's SaaS portfolio.

A unified cloud-provider strategy may help with better pricing, performance, and a cohesive approach to common tasks (such as region selection, security, etc.), while the choice of a SaaS implementation depends heavily on where an organization is in its cloud journey. The real-time analytics space is no different.

Bhawna Gupta is a Microsoft certified Cloud Architect on Azure and has delivered several Cloud implementations.
