Apache Spark Cluster - Installation!
The following are covered in this post:

- Install Apache Spark (v3.2.0) in pseudo-distributed fashion on a local workstation.
- Run Jupyter Notebook integrated with PySpark.
- Run pi.py, an out-of-the-box example, via:
  - Spark Submit on the command line
  - a notebook using PySpark
It is assumed that you have Docker and Docker Compose installed.
Installation:
docker-compose.yaml
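A minimal sketch of the compose file could look like the one below. The bitnami/spark image and the exact settings here are assumptions; any Spark 3.2.0 image with equivalent configuration will do.

```yaml
version: "3"

services:
  spark-master:
    image: bitnami/spark:3.2.0          # assumed image; swap in your preferred Spark 3.2.0 image
    container_name: spark-master
    environment:
      - SPARK_MODE=master
    ports:
      - "9090:8080"                     # Spark master web UI -> http://localhost:9090
      - "7077:7077"                     # Spark master RPC port

  spark-worker-1:
    image: bitnami/spark:3.2.0
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
    depends_on:
      - spark-master

  spark-worker-2:
    image: bitnami/spark:3.2.0
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
    depends_on:
      - spark-master
```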
Once you have the above content saved into docker-compose.yaml, run:
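For example (this assumes the Compose v1 CLI; with Compose v2 the command is `docker compose up -d`):

```sh
# start the cluster in the background
docker-compose up -d
```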
This will fire up a master node and 2 worker nodes. You can view the Spark UI at http://localhost:9090
in your browser. It will look similar to the picture below.
Run a job using Spark Submit:
Now it's time to run pi.py using spark-submit. Run the following commands to see map-reduce in action.
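The exact commands aren't reproduced here; assuming the container name and image layout from the compose sketch above, something along these lines should work:

```sh
# open a shell inside the master container (name assumed from the compose sketch above)
docker exec -it spark-master /bin/bash

# submit the bundled pi.py example to the cluster;
# the examples path assumes the bitnami/spark layout (SPARK_HOME=/opt/bitnami/spark)
spark-submit --master spark://spark-master:7077 \
  /opt/bitnami/spark/examples/src/main/python/pi.py 100
```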
Run a job via Jupyter Notebook using PySpark:
Run the following command from within the spark-master Docker container.
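A sketch of that command, assuming Jupyter is available inside the container (it may need a `pip install notebook` first), looks like this; PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS tell pyspark to launch Jupyter as its driver:

```sh
pip install notebook    # only if Jupyter is not already present in the image

PYSPARK_DRIVER_PYTHON=jupyter \
PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip=0.0.0.0 --allow-root --port=8888" \
pyspark --master spark://spark-master:7077
```

Note that port 8888 would also need to be published in docker-compose.yaml (e.g. `- "8888:8888"` under spark-master) for the notebook to be reachable from the host; copy the tokenised URL printed on the console into your browser.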
Then follow along below to run the commands in the Jupyter notebook.
```python
import sys
from random import random
from operator import add
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for this notebook
spark = SparkSession\
    .builder\
    .appName("PythonPi")\
    .getOrCreate()

# Monte Carlo estimate of Pi: count random points that land inside the unit circle
partitions = 1000
n = 100000 * partitions

def f(_):
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
```

Output:

```
Pi is roughly 3.141040
```

Finally, stop the Spark session:

```python
spark.stop()
```
To recap, we successfully:

- installed an Apache Spark cluster (v3.2.0)
- submitted a job via spark-submit
- launched Jupyter Notebook using PySpark
- executed Spark commands in the Jupyter notebook
Happy Coding!!