Before discussing Apache PySpark, let's understand what Apache Spark is.
Apache Spark: Apache Spark is a lightning-fast, unified analytics engine for big data and machine learning. Originally developed at UC Berkeley in 2009, it is an open-source, distributed processing system used for big data workloads. It uses in-memory caching and optimized query execution to run fast analytic queries against data of any size.
History of Apache Spark: Apache Spark started in 2009 as a research project at UC Berkeley's AMPLab. The goal of Spark was to create a new framework optimized for fast iterative processing, such as machine learning and interactive data analysis, while retaining the scalability and fault tolerance of Hadoop MapReduce. In June 2013, Spark entered incubation status at the Apache Software Foundation (ASF) and was established as an Apache Top-Level Project in February 2014. Spark can run standalone, on Apache Mesos, or, most frequently, on Apache Hadoop.
How does Apache Spark work?
Hadoop MapReduce is a programming model for processing big data sets with a parallel, distributed algorithm. However, MapReduce runs a job as a sequence of steps. In each step, MapReduce reads data from the cluster, performs operations, and writes the results back to HDFS (Hadoop Distributed File System). Because each step requires a disk read and write, MapReduce jobs are slower due to the latency of disk I/O.
Spark was created to overcome the limitations of MapReduce by doing processing in memory, reducing the number of steps in a job, and reusing data across multiple parallel operations. With Spark, only one step is needed: data is read into memory, operations are performed, and the results are written back, resulting in much faster execution. Spark also uses an in-memory cache to speed up machine learning algorithms that repeatedly call a function on the same dataset.
Features of Spark: Six features that make Spark one of the most extensively used Big Data platforms are:
- Fault Tolerance: Fault tolerance in Apache Spark is the capability to keep operating and to recover from loss after a failure occurs. RDDs (Resilient Distributed Datasets) are designed to handle the failure of any worker node in the cluster.
- Dynamic in Nature: Parallel applications can be developed easily, as Spark provides over 80 high-level operators.
- Lazy Evaluation: All transformations on Spark RDDs are lazy, i.e. they do not compute results right away; they only create new RDDs from existing ones, which increases system efficiency.
- Real-Time Stream Processing: Hadoop MapReduce can process only data that already exists, whereas Spark also supports stream processing in real time.
- Speed: Spark has a very high processing speed thanks to reduced read-write operations to disk; it is up to 100x faster for in-memory computation and about 10x faster for on-disk computation.
- In-Memory Computation: Data is stored in RAM instead of on slow disk drives and is processed in parallel. Spark uses an in-memory cache to speed up machine learning algorithms that repeatedly call a function on the same dataset.
Apache Spark Ecosystem:
The Spark framework includes:
- Spark Core as the foundation for the platform
- Spark SQL for interactive queries
- Spark Streaming for real-time analytics
- Spark MLlib for machine learning
- Spark GraphX for graph processing
Spark Core is the foundation of the platform. It is responsible for memory management, fault recovery, scheduling, distributing & monitoring jobs, and interacting with storage systems.
PySpark: PySpark is an interface for Apache Spark in Python. It supports most of Spark's features, such as Spark SQL, DataFrames, Streaming, MLlib (machine learning), and Spark Core.
PySpark RDD (Resilient Distributed Dataset): The RDD is the fundamental data structure of Apache Spark.
- Resilient: it is fault-tolerant and capable of rebuilding data on failure.
- Distributed: data is distributed among multiple nodes in a cluster.
- Dataset: the data we are working with.
RDD is immutable and follows lazy evaluation. Spark RDD can also be cached and manually partitioned. It supports two types of operations:
Transformation: A transformation creates a new dataset from an existing one. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. Transformations in Spark are lazy, i.e. they do not compute their results right away; instead, they just remember the transformations applied to some base dataset. There are two types of transformations: those with narrow dependencies and those with wide dependencies. In a narrow dependency (narrow transformation), each input partition contributes to only one output partition. In a wide dependency (wide transformation), input partitions contribute to many output partitions; this is often referred to as a shuffle, where Spark exchanges partitions across the cluster. For narrow transformations, Spark automatically performs an operation called pipelining: if we specify multiple filters on a DataFrame, they will all be performed in memory, whereas a shuffle makes Spark write intermediate results to disk. A short sketch of both kinds of operations follows the description of actions below.
Action: An action returns a value to the driver program after running a computation on the dataset. For example, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.
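To make these operations concrete, here is a minimal sketch (run against a local SparkSession; the words and counts are illustrative) that applies a narrow transformation, a wide transformation, and finally an action:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-operations-sketch").getOrCreate()
sc = spark.sparkContext

words = sc.parallelize(["a", "b", "a", "c", "b", "a"])

# Narrow transformation: each input partition feeds exactly one output partition.
pairs = words.map(lambda w: (w, 1))

# Wide transformation: reduceByKey shuffles partitions across the cluster.
counts = pairs.reduceByKey(lambda x, y: x + y)

# Nothing has been computed yet; collect() is an action that runs the whole
# lineage and returns the results to the driver program.
print(counts.collect())   # e.g. [('a', 3), ('b', 2), ('c', 1)]
```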
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk or replicating them across multiple nodes.
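A small sketch of caching (the data is illustrative; cache() is shorthand for persisting at the default memory-only storage level):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(100000))
squares = numbers.map(lambda n: n * n)

# Keep the computed partitions in memory so later actions reuse them
# instead of recomputing the whole lineage.
squares.cache()

print(squares.count())   # first action: computes and caches the RDD
print(squares.sum())     # second action: served from the cache

# Other storage levels allow spilling to disk or replicating across nodes, e.g.:
# squares.persist(StorageLevel.MEMORY_AND_DISK)
```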
PySpark DataFrame: A PySpark DataFrame is similar to a pandas DataFrame, except that a pandas DataFrame stores its data on a single machine, while a PySpark DataFrame is distributed across the cluster (its data is stored on different machines) and any operation on it executes in parallel on all machines. PySpark DataFrames are lazily evaluated and are implemented on top of RDDs. A PySpark application starts by initializing a SparkSession, which is the entry point of PySpark, as below.
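A minimal sketch (the application name and sample rows are illustrative):

```python
from pyspark.sql import SparkSession

# SparkSession is the entry point of a PySpark application.
spark = SparkSession.builder \
    .appName("dataframe-sketch") \
    .getOrCreate()

# Create a distributed DataFrame from local data.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cathy", 29)],
    ["name", "age"],
)

df.printSchema()
df.show()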
PySpark SQL: Spark SQL is a Spark module for structured data processing. It runs on top of Spark Core. DataFrames and Spark SQL share the same execution engine, so they can be used interchangeably and seamlessly. For example, you can register a DataFrame as a table and run SQL against it easily, as below:
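A sketch along these lines (reusing the same kind of illustrative DataFrame created above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 45), ("Cathy", 29)], ["name", "age"])

# Register the DataFrame as a temporary view so it can be queried with SQL.
df.createOrReplaceTempView("people")

# The DataFrame API and Spark SQL share the same execution engine,
# so these two queries produce the same result.
spark.sql("SELECT name FROM people WHERE age > 30").show()
df.filter(df.age > 30).select("name").show()
```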
PySpark UDFs (User-Defined Functions) are similar to UDFs in traditional databases. In PySpark, you write a function in Python syntax and either wrap it with PySpark SQL's udf() to use it with the DataFrame API, or register it as a UDF to call it from SQL.
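A minimal sketch of both styles (the function and column names are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Wrap a plain Python function with udf() for use with the DataFrame API.
capitalize = udf(lambda s: s.capitalize(), StringType())
df.withColumn("name_cap", capitalize(df.name)).show()

# Register the same logic as a UDF so it can be called from SQL.
spark.udf.register("capitalize_sql", lambda s: s.capitalize(), StringType())
df.createOrReplaceTempView("people")
spark.sql("SELECT name, capitalize_sql(name) AS name_cap FROM people").show()
```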
PySpark MLlib: PySpark MLlib is a machine learning library. It is a wrapper over PySpark Core for performing analysis with machine learning algorithms. It provides tools such as:
- ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
- Featurization: feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
- Persistence: saving and loading algorithms, models, and Pipelines
- Utilities: linear algebra, statistics, data handling, etc.
ML Pipelines: ML Pipelines provide high-level APIs built on top of DataFrames that help users create machine learning pipelines. Transformers, Estimators, and Pipelines are the main concepts in ML Pipelines.
Transformers: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. For example, a feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
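As a small sketch, VectorAssembler is one concrete built-in Transformer (the columns and values here are illustrative):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformer-sketch").getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["x1", "x2"])

# A Transformer's transform() reads the input columns and returns a new
# DataFrame with the mapped column ("features") appended.
assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
assembler.transform(df).show()
```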
Estimators: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. For example, a learning algorithm such as LinearRegression is an Estimator, and calling fit() trains a LinearRegressionModel, which is a Model and hence a Transformer.
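A minimal sketch of this fit-then-transform flow (the toy data is illustrative):

```python
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("estimator-sketch").getOrCreate()
train = spark.createDataFrame(
    [(Vectors.dense([1.0]), 2.1), (Vectors.dense([2.0]), 3.9), (Vectors.dense([3.0]), 6.2)],
    ["features", "label"],
)

# LinearRegression is an Estimator: fit() trains and returns a
# LinearRegressionModel, which is a Transformer.
lr = LinearRegression(featuresCol="features", labelCol="label")
model = lr.fit(train)

# The trained model can now append a "prediction" column to a DataFrame.
model.transform(train).show()
```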
Pipeline: A Pipeline consists of a sequence of PipelineStages (Transformers and Estimators) to be run in a specific order. A Pipeline is itself an Estimator; after a Pipeline's fit() method runs, it produces a PipelineModel, which is a Transformer.
In the above figure, the first row represents the Pipeline with three stages. Tokenizer and HashingTF are Transformers while LogisticRegression is an Estimator. The second row represents the data flowing through the pipeline, where cylinders indicate DataFrames. The Pipeline.fit() method is called on the original DataFrame, which has raw text documents and labels. The Tokenizer.transform() method splits the raw text documents into words, adding a new column with words to the DataFrame. The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame. Now, since LogisticRegression is an Estimator, the Pipeline first calls LogisticRegression.fit() to produce a LogisticRegressionModel. For practical implementation visit my GitHub link.
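The pipeline described in the figure can be sketched roughly as follows (the toy documents and hyperparameters are illustrative):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Toy training data: raw text documents with labels.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
], ["id", "text", "label"])

# The three stages from the figure: two Transformers followed by an Estimator.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

# Pipeline is an Estimator: fit() runs each stage in order and returns a
# PipelineModel (a Transformer) that can score new documents.
model = pipeline.fit(training)

test = spark.createDataFrame([(4, "spark i j k"), (5, "hadoop")], ["id", "text"])
model.transform(test).select("id", "text", "prediction").show()
```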
To summarize, in this article we learned about Apache Spark, how Apache Spark works, PySpark DataFrames, PySpark SQL, and MLlib with PySpark. For practical implementations of DataFrames, SQL, and different algorithms using PySpark and MLlib, visit my GitHub repository. For more information on Spark, visit the official Apache Spark website.
About the Author: I am Priya, currently working as a Data Scientist, with knowledge of Exploratory Data Analysis, Machine Learning, and Natural Language Processing. If you have any questions, please connect with me on my LinkedIn profile.