Apache Spark Architecture Explained

Written by Javier Esteban · 30 September 2025


Apache Spark is an open-source framework designed for distributed processing of large volumes of data. It enables high-speed analytics on computer clusters and is much faster than traditional tools such as Hadoop MapReduce thanks to its in-memory processing. While Hadoop MapReduce writes intermediate results to disk between each step, Spark processes data primarily in memory, dramatically reducing latency. In this article, we will explain how Spark works.

 

Architecture

Driver Program

This is the entry point of any Spark application. If you are using PySpark, it is your PySpark script. Its main functions, illustrated in the sketch after this list, are:

  • Create the SparkSession/SparkContext – Initialises Spark and establishes a connection with the cluster.
  • Define the workflow – Receives the transformations and actions written in your code.
  • Build the DAG (Directed Acyclic Graph) – Generates a logical execution plan.
  • Split the work into Stages and Tasks – Schedules these tasks on the Executors that the Cluster Manager has allocated.
  • Receive results – Collects processed data or stores it in external storage.
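
The snippet below is a minimal PySpark sketch of a Driver Program: it creates the SparkSession, records transformations in the DAG, and triggers an action. The input path and column names are hypothetical.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Create the SparkSession: the point where the Driver connects to the cluster.
    spark = SparkSession.builder.appName("driver-example").getOrCreate()

    # Define the workflow: these transformations are only recorded in the DAG for now.
    sales = spark.read.parquet("s3://my-bucket/sales/")  # hypothetical path
    totals = (sales
              .filter(F.col("amount") > 0)
              .groupBy("country")
              .agg(F.sum("amount").alias("total_amount")))

    # An action forces execution; the Driver receives the results.
    for row in totals.collect():
        print(row["country"], row["total_amount"])

    spark.stop()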

Cluster Manager

The Cluster Manager in Apache Spark is responsible for managing cluster resources (CPU, memory, and nodes) and assigning them to the Spark applications that request them (see the configuration sketch after this list):

  • Your Driver Program (the PySpark script) requests resources from the Cluster Manager.
  • The Cluster Manager locates available Workers, creates Executors on them, and these execute distributed tasks before returning results to the Driver.
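
As a rough illustration, the sketch below shows how a Driver can tell the Cluster Manager how many Executors it needs and how much CPU and memory each should get. The values are hypothetical; the property names (spark.executor.instances, spark.executor.cores, spark.executor.memory) are standard Spark settings.

    from pyspark.sql import SparkSession

    # Ask the Cluster Manager for 4 Executors, each with 2 cores and 4 GB of memory
    # (illustrative values; actual sizing depends on your cluster).
    spark = (SparkSession.builder
             .appName("resource-request-example")
             .config("spark.executor.instances", "4")
             .config("spark.executor.cores", "2")
             .config("spark.executor.memory", "4g")
             .getOrCreate())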

Workers

These are the machines where the work actually runs. Each Worker hosts one or more Executors (the processes that run tasks and cache data), and the Cluster Manager assigns Executors to Workers according to the resources available.

 

Execution Flow of a Spark Job

Once the workflow is defined, Spark uses what is called lazy evaluation. Spark does not execute transformations immediately when they are defined. Instead, it builds a DAG (graph) that contains all the transformations. This allows Spark to reorganise and optimise operations.
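
A quick way to see lazy evaluation in practice is the sketch below (with made-up data): the transformations only extend the DAG, and nothing runs until the action at the end.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("lazy-eval-example").getOrCreate()

    df = spark.range(1_000_000)                           # nothing runs yet
    doubled = df.withColumn("double", F.col("id") * 2)    # still nothing: only the DAG grows
    filtered = doubled.filter(F.col("double") % 4 == 0)   # still lazy

    # Only an action triggers execution of the optimised plan.
    print(filtered.count())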

Spark then breaks down the optimised DAG into stages and tasks (the sketch after this list shows how a shuffle marks a stage boundary):

  • A stage groups operations that can run without a shuffle (data reorganisation between nodes); each shuffle marks a boundary between stages.
  • Each stage is divided into tasks. A task processes a data partition. For example, if you have 200 partitions, Spark generates 200 tasks.
  • Executors on different Workers run these tasks in parallel.
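
One way to see a stage boundary is to look at the physical plan: a shuffle shows up as an Exchange, and the work on either side of it belongs to a different stage. The sketch below uses generated data.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("stages-example").getOrCreate()

    # 200 is Spark's default for spark.sql.shuffle.partitions, so the post-shuffle
    # stage has up to 200 tasks (adaptive execution may coalesce them).
    spark.conf.set("spark.sql.shuffle.partitions", "200")

    events = spark.range(10_000_000).withColumn("key", F.col("id") % 100)

    # groupBy requires a shuffle, so the plan contains an Exchange node:
    # the work before it and after it runs in separate stages.
    events.groupBy("key").count().explain()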

This design ensures fault tolerance: if an executor fails, Spark can use lineage to recompute only the affected partitions instead of repeating the entire process.
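
Lineage can be inspected directly from the RDD API: toDebugString() prints the chain of transformations Spark would replay to rebuild lost partitions. A minimal sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lineage-example").getOrCreate()
    sc = spark.sparkContext

    numbers = sc.parallelize(range(1000), 8)
    squares = numbers.map(lambda x: x * x)
    evens = squares.filter(lambda x: x % 2 == 0)

    # The lineage graph: if an executor is lost, Spark replays only the
    # transformations needed to recompute the missing partitions.
    debug = evens.toDebugString()
    print(debug.decode("utf-8") if isinstance(debug, bytes) else debug)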

 

Conclusion

Apache Spark stands out in the Big Data ecosystem thanks to its master–worker distributed architecture and optimised execution flow: the Driver builds a logical DAG of transformations, which is divided into stages and tasks executed in parallel by Executors across Workers, while the Cluster Manager efficiently manages cluster resources.

This design enables Spark to process large volumes of data quickly, at scale, and with fault tolerance—combining automatic optimisation with massive parallelism. In short, Spark turns the complexity of Big Data into an orderly, efficient workflow, transforming enormous datasets into actionable information with speed and reliability.

Beyond ETL, Spark is widely used for real-time stream processing and machine learning pipelines, making it a versatile tool for modern data teams.

Need help designing Spark pipelines or migrating workloads? At Crow Tech, our data engineering experts can help you build reliable, scalable solutions on AWS Glue and beyond.
