Apache Spark has emerged as a powerful and versatile distributed computing system for big data processing, outshining the traditional MapReduce model in many respects. In this blog post, we will explore the key challenges associated with MapReduce and how Spark overcomes them, making it the go-to choice for big data processing.
Let’s delve right in!!
Challenge 1: MapReduce performs many disk input-output operations
The MapReduce framework’s reliance on disk-based storage for intermediate data leads to many disk IO operations, which negatively impacts performance.
Say we have to count the number of words in a document. Let’s look at the number of disk input-output operations in each stage for both MapReduce and Apache Spark.
1. Reading input data:
- MapReduce: reads data blocks from the distributed file system: IOcountMR = 1
- Spark: reads data blocks from the distributed file system: IOcountSpark = 1
2. Writing intermediate data:
- MapReduce: writes intermediate key-value pairs to local disk: IOcountMR = 1
- Spark: keeps intermediate data in memory: IOcountSpark = 0
3. Shuffling and sorting:
- MapReduce: reads intermediate data from local disk and writes it to the local disks of the reducer nodes: IOcountMR = 2
- Spark: shuffles and sorts data in memory: IOcountSpark = 0
4. Reading intermediate data in the Reduce stage:
- MapReduce: reads shuffled and sorted key-value pairs from local disk: IOcountMR = 1
- Spark: accesses shuffled and sorted data in memory: IOcountSpark = 0
5. Writing the final output:
- MapReduce: writes the final output to the distributed file system: IOcountMR = 1
- Spark: writes the final output to the distributed file system: IOcountSpark = 1
Total disk IO operations count:
- MapReduce (IOcountMR): 6
- Spark (IOcountSpark): 2
Apache Spark generally performs fewer disk IO operations than MapReduce, thanks to its in-memory processing. In the word count example, Spark touches disk only to read the input data and write the final output, whereas MapReduce performs additional disk IO to handle intermediate data and shuffling.
Challenge 2: MapReduce requires many lines of code for even a simple task
Please note that the comparison below is an optimistic estimate; the actual number of lines of code may vary depending on the implementation.
1. Defining the problem:
- MapReduce: Specify the word count problem: countMR = 1
- Spark: Specify the word count problem: countSpark = 1
2. Writing the Mapper/Map function:
- MapReduce (optimistic estimate): Create a Mapper class, override the map method, and write the map logic, plus additional lines for boilerplate and imports: countMR = 5+
- Spark: Use the `flatMap` transformation to tokenize the input text into words: countSpark = 1
3. Writing the Reducer/Reduce function:
- MapReduce (optimistic estimate): Create a Reducer class, override the reduce method, and write the reduce logic, plus additional lines for boilerplate and imports: countMR = 5+
- Spark: Use the `reduceByKey` transformation to aggregate word counts by key: countSpark = 1
4. Configuring and running the job:
- MapReduce (optimistic estimate): Create a driver class, set up the configuration, and run the job, plus additional lines for boilerplate and imports: countMR = 5+
- Spark: Create a Spark context, then configure and run the job using `textFile`, `flatMap`, `reduceByKey`, and `saveAsTextFile`: countSpark = 2 (for creating the Spark context and saving the final output as a text file)
Total lines of code count (optimistic estimate):
- MapReduce (countMR): 16+ (likely higher in practice due to Java’s verbosity and the more complex implementation)
- Spark (countSpark): 5 (a concise implementation using high-level APIs and built-in transformations such as `flatMap` and `reduceByKey`)
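To make both comparisons concrete, here is a minimal PySpark sketch of the word count job. The input and output paths are placeholders, and the extra `map` step that builds (word, 1) pairs is included for completeness; the comments note which steps touch disk, tying back to the IO counts from Challenge 1.

```python
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")              # create the Spark context

counts = (sc.textFile("hdfs:///input/docs")         # disk read from the distributed file system
            .flatMap(lambda line: line.split())     # tokenize each line into words
            .map(lambda word: (word, 1))            # build (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))       # aggregate counts per word

counts.saveAsTextFile("hdfs:///output/wordcount")   # disk write of the final output
sc.stop()
```

Everything between reading the input and saving the output is expressed as chained transformations handled by Spark’s execution engine, rather than separate Mapper, Reducer, and driver classes.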
Challenge 3: Comparing support for data handling between MapReduce and Spark
1. Data-processing models:
- MapReduce: Supports only batch processing (processes data in large, discrete chunks, typically whole blocks)
- Spark: Supports batch processing, real-time/stream processing (processes data as it arrives), and micro-batch processing
2. Real-time data processing capabilities:
- MapReduce: Lacks native support for real-time data processing, requiring separate tools or frameworks (e.g., Apache Storm or Apache Flink)
- Spark: Offers built-in support for real-time data processing through Spark Streaming or Structured Streaming
3. Example scenario — processing incoming log data:
- MapReduce: Must wait for log data to accumulate and form large chunks before processing, introducing latency in analysis and insights
- Spark: Can process log data as it arrives using real-time processing, enabling immediate analysis and insights
4. Support for Machine Learning:
- MapReduce: Requires external libraries, such as Mahout, to perform machine learning tasks, resulting in additional complexity and potential compatibility issues
- Spark: Offers built-in machine learning support through the MLlib library, simplifying the implementation of machine learning algorithms and reducing the need for additional tools
5. Support for Graph Processing:
- MapReduce: Lacks native support for graph processing, requiring external frameworks such as Apache Giraph that run on top of Hadoop, which increases complexity and reduces performance
- Spark: Provides built-in graph processing support with the GraphX library, offering a more seamless and performant experience for processing graph data
6. Data processing APIs:
- MapReduce: Primarily uses the lower-level MapReduce programming model (mapper and reducer), which can be more difficult and time-consuming to implement
- Spark: Offers high-level APIs in multiple languages (Scala, Python, Java, and R), making it easier to develop and maintain data processing applications
7. SQL support and data processing engines:
- MapReduce: Requires external tools like Hive or Pig for SQL-like processing, which can add complexity and reduce performance due to the additional layers involved.
- Spark: Includes built-in support for SQL processing through Spark SQL, which integrates seamlessly with the rest of the Spark ecosystem and offers better performance and optimization.
While MapReduce often requires separate tools and libraries for extended functionality, Spark provides a unified platform with built-in support for various data processing tasks, making it easier to work with and offering better performance.
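As a sketch of the log-processing scenario in point 3, here is a minimal Structured Streaming job that counts error lines as new log files land in a directory. The input path, the "ERROR" marker, and the console sink are assumptions made for illustration, not part of any particular production setup.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("LogStream").getOrCreate()

# Treat every new file in the (hypothetical) directory as a stream of log lines
logs = spark.readStream.text("hdfs:///logs/incoming")

# Keep only error lines and maintain a running count as data arrives
error_counts = logs.filter(col("value").contains("ERROR")).groupBy().count()

# Continuously emit the updated count (console sink used here for demonstration)
query = (error_counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()
```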
Challenge 4: MapReduce constrains developers to always think from a Map-Reduce perspective
1. Problem framing:
- MapReduce: Developers must think in terms of Map and Reduce operations, even for tasks that don’t necessarily fit into this model. This can lead to more complex solutions and hinder productivity.
- Spark: Developers can leverage a variety of built-in transformations and actions, allowing them to approach problems in a more natural and intuitive way.
2. Filtering and transforming data example:
- MapReduce: To filter and transform data, developers need to implement custom mappers and reducers that handle both the filtering and transformation logic. This can result in more complex code and additional overhead.
- Spark: Developers can use built-in transformations like `filter()` and `map()` to accomplish the same task with less code and complexity, chaining multiple transformations together in a more readable and maintainable way (a short sketch follows this list).
3. Code readability and maintenance:
- MapReduce: The MapReduce paradigm can lead to code that is harder to read and maintain, as developers must mentally map the problem to the MapReduce model and understand the custom mapper and reducer logic.
- Spark: The high-level APIs and built-in transformation functions make Spark code more readable and maintainable, as they more closely resemble the natural way of expressing the data processing logic.
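Here is a minimal PySpark sketch of the filtering-and-transforming example from point 2, using made-up data: keep the even numbers and square them, chained in a single readable pipeline instead of a custom mapper and reducer.

```python
from pyspark import SparkContext

sc = SparkContext(appName="FilterAndTransform")

numbers = sc.parallelize(range(1, 11))      # sample data: 1..10

result = (numbers
          .filter(lambda x: x % 2 == 0)     # keep only even numbers
          .map(lambda x: x * x)             # transform: square each remaining number
          .collect())                       # action: bring the results to the driver

print(result)  # [4, 16, 36, 64, 100]
sc.stop()
```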
Challenge 5: MapReduce’s eager execution model is inefficient
Let’s consider an example where there are 1 billion records and only 5 need to be filtered and processed based on a specific condition.
MapReduce Pipeline:
1. In the Map phase, a custom Mapper function filters the input data based on the given condition. This Mapper runs on all 1 billion records. (Inefficient because it processes every record, even though only a small fraction is needed)
2. The filtered key-value pairs (5 records in this case) are written to local disk as intermediate data. (Inefficient due to disk I/O overhead)
3. The intermediate data is shuffled, sorted, and transferred across the network to the reducer nodes. (Inefficient because it still involves data transfer and sorting, even though only a handful of records are left)
4. In the Reduce phase, a custom Reducer function processes the 5 records as needed.
5. The final output is written to the distributed file system.
The primary inefficiencies in the MapReduce process are in stages 1, 2, and 3. Processing all records in the Map stage, writing intermediate data to disk, and shuffling and sorting data across the network result in overhead, increased latency, and unnecessary resource consumption.
Spark Pipeline:
1. An RDD or DataFrame is created from the input data. RDDs are much more efficient than MapReduce’s mapper and reducer stages because:
- RDDs enable in-memory processing, which reduces I/O overhead and latency and leads to faster execution.
- RDDs use lazy transformations, which allow Spark to optimize the execution plan before processing the data.
- Spark’s RDD-based pipeline can handle multiple transformations in a single pass, reducing the need for repeated reading and writing of intermediate data and further enhancing performance.
2. A filter transformation is applied to the data based on the given condition. This transformation is lazy, which means:
- Lazy transformations build up a DAG (Directed Acyclic Graph) that represents the sequence of transformations to be executed. They are not executed until an action is called, which lets Spark optimize the overall execution plan.
3. Any additional transformations needed to process the filtered data are applied. These transformations are also lazy; they add to the DAG, allowing Spark to optimize the entire sequence of transformations before executing them.
4. An action (e.g., collect() or count()) is called to trigger the execution of the transformations. When an action is called, Spark evaluates the DAG and optimizes the execution plan, ensuring that only the necessary data is processed and minimizing data movement and computation.
5. In the end, once collect() or count() is called, only the 5 matching records flow through the rest of the pipeline; the full 1 billion records are never materialized as intermediate data or shuffled across the network.
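A minimal sketch of this lazy pipeline, with the data, the condition, and the record count scaled down as stand-ins for the 1-billion-record scenario: nothing is read or computed until the action on the last line runs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyFilter").getOrCreate()

# Stand-in for the large input (a real job would read from HDFS, S3, etc.)
records = spark.range(0, 1_000_000)                        # DataFrame with a single 'id' column

# Lazy transformations: these only extend the DAG, no data is touched yet
matches = records.filter(records.id % 200_000 == 0)        # the "specific condition" (5 matches)
processed = matches.selectExpr("id", "id * 2 AS doubled")  # further lazy processing

# The action triggers planning and execution of the whole DAG at once
print(processed.collect())  # only the handful of matching rows reach the driver

spark.stop()
```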
Thus, Spark’s ability to analyze and optimize the DAG helps avoid unnecessary data processing, leading to lower latency and resource consumption compared to MapReduce. In short, Spark’s efficiency comes from its use of RDDs, DAGs, and lazy transformations.
That’s all for now!!
If you found this blog helpful, please clap for it and share it.
Credit goes to Sumit Mittal, for teaching this awesome technology.