Azure Databricks – introduction
Apache Spark is an open-source unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning, AI and graph processing. It was created by Databricks. With Azure Databricks Microsoft intends to help businesses work with data in a faster, easier, and more collaborative way. Using Databricks, customers can accelerate innovation with one-click setup, streamlined workflows, and an interactive workspace that enables easy collaboration between data scientists, engineers and business analysts.
It is 100% based on Spark and is extensible with support for Scala, Java, R and Python along with Spark SQL, GraphX, Streaming, and Machine Library (MLIB). Azure Databricks provides a unified platform to manage clusters for various use cases such as running production ETL pipelines, streaming analytics, and ad-hoc analytics.
Azure Databricks – clusters
Azure Databricks clusters provide a unified platform for various use cases such as running production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.
Azure Databricks has two types of clusters: interactive and job. Interactive clusters are used to analyze data collaboratively with interactive notebooks. Job clusters are used to run fast and robust automated workloads using the UI or API.
Capacity planning in Azure Databricks clusters
Cluster capacity can be determined based on the needed performance and scale. Planning helps to optimize both usability and costs of running the clusters.
Azure Databricks provides different cluster options based on business needs:
|General purpose||Balanced CPU-to-memory ratio. Ideal for testing and development, small to medium databases, and low to medium traffic web servers.|
|Compute optimized||High CPU-to-memory ratio. Good for medium traffic web servers, network appliances, batch processes, and application servers.|
|Memory optimized||High memory-to-CPU ratio. Great for relational database servers, medium to large caches, and in-memory analytics.|
|Storage optimized||High disk throughput and IO ideal for Big Data, SQL, NoSQL databases, data warehousing, and large transactional databases.|
|GPU||Specialized virtual machines targeted for heavy graphic rendering and video editing, as well as model training and inferencing (ND) with deep learning. Available with single or multiple GPUs.|
|High-performance compute||Our fastest and most powerful CPU virtual machines with optional high-throughput network interfaces (RDMA).|
VM options in Azure Databricks
To start with, general purpose clusters can be used for development purposes and once code is production ready, tests can be done on memory or storage optimized VMs and accordingly production clusters can be decided.
Key factors/guidelines for creating and configuring Azure Databricks clusters
Storage location – Azure BLOB storage or Azure Data Lake Store, choose a storage location same as the cluster storage. Choosing a storage location different from the cluster location will introduce data read/write latency.
VM size and type – Choose a VM size and type based on the business needs as illustrated in Fig – 1.
The VM size and type is determined by CPU, RAM, and network. Choosing more CPU cores will have greater degree of parallelism and for in memory processing worker nodes should have enough memory. For most cluster types data is typically stored in BLOB or Data Lake Store and network bandwidth available to a VM typically increases with larger sizes.
For example – if you have use cases where transformation is done using Azure Databricks and directly reporting to Power BI – memory-optimized VM will be a good choice.
Autoscaling – Databricks has an auto-scaling feature, which can help with scaling. As the workload increases more nodes will be spun up to accommodate the workload. This is especially good for SQL Warehouse, and ETL workloads. It probably shouldn’t be used for machine learning workloads that are very iterative. The cluster page has a min and max worker node settings. So, you might start your testing with a min = 4, max = 15. As the workload diminishes the workers are deallocated and you are not charged. You can review the executors tab or cluster page in the spark UI to see how many workers are spun up in your workload/test.
Data partition – Data and the partitions of the data can greatly affect memory consumption and performance. If one partition is skewed it can cause OOM on a worker on shuffle operations.
Partition the data according to the date or key columns and run these partitions sequentially with lesser cluster configuration. For example – If my data volume is 2,000 GB, I can choose a cluster with General Purpose–Standard_D32S_v3 128 GB RAM 32 cores 6 DBU 10-20 nodes or I can partition my data to eight partitions having 250 GB each and have cluster size as Standard_D32S_v3 128 GB RAM 32 cores 6 DBU 1–3 nodes. This will reduce the cluster cost to a greater extent.
Maximum RAM size that can be used in Databricks cluster is 432 GB and maximum number of nodes that can be allocated is 1200. The number of nodes to be used varies according to the cluster location and subscription limits.
Larger memory with fewer workers – In Spark Shuffle, operations are costlier and it will be better to choose larger memory with fewer workers. Larger memory and smaller number of workers will make the shuffle operations more efficient and reduce OOM issues.
Other activities in worker nodes – When you are choosing the worker nodes have some additional memory for the operating system, Spark dll/jars and various other activities.
There is no specific thumb rule for creating Azure Databricks clusters. Testing for your workload and data in development and deciding the right cluster sizes in production based on testing and other factors discussed above is the best possible route.
Cluster configuration in the range of 100 nodes are termed as big clusters and anything more than 400 will be a very huge cluster. To reduce costs, partition the data and use cluster configuration with a smaller number of nodes.