The following topics are covered in this post:

  • Install Apache Spark (v3.2.0) in a pseudo-distributed fashion on your local workstation.
  • Run Jupyter Notebook integrated with PySpark.
  • Run pi.py - an out-of-the-box example - via:
    • Spark Submit on command line
    • Notebook using PySpark

It is assumed that you have Docker and Compose installed.

Installation:

docker-compose.yaml

version: "3.3"
services:
  spark-master:
    container_name: spark-master
    image: mvangala/apache-spark-cluster:3.2.0
    ports:
      - "9090:8080"
      - "7077:7077"
      - "8889:8889"
    environment:
      - SPARK_LOCAL_IP=spark-master
      - SPARK_WORKLOAD=master

  spark-worker-a:
    container_name: spark-worker-a
    image: mvangala/apache-spark-cluster:3.2.0
    ports:
      - "9091:8080"
      - "7000:7000"
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=1G
      - SPARK_DRIVER_MEMORY=1G
      - SPARK_EXECUTOR_MEMORY=1G
      - SPARK_WORKLOAD=worker
      - SPARK_LOCAL_IP=spark-worker-a
  
  spark-worker-b:
    container_name: spark-worker-b
    image: mvangala/apache-spark-cluster:3.2.0
    ports:
      - "9092:8080"
      - "7001:7000"
    depends_on:
      - spark-master
    environment:
      - SPARK_MASTER=spark://spark-master:7077
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=1G
      - SPARK_DRIVER_MEMORY=1G
      - SPARK_EXECUTOR_MEMORY=1G
      - SPARK_WORKLOAD=worker
      - SPARK_LOCAL_IP=spark-worker-b

Once you have the above content saved as docker-compose.yaml, run

docker-compose -f docker-compose.yaml up -d

This will fire up one master node and two worker nodes. You can view the Spark UI at http://localhost:9090 in your browser; it will look similar to the picture below.

Spark UI
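
If you prefer to verify the cluster from a script rather than the browser, a minimal sketch like the one below (assuming the 9090:8080 port mapping from the compose file above) checks that the master UI is reachable from the host:

from urllib.request import urlopen

# fetch the standalone master's web UI through the host port mapping
html = urlopen("http://localhost:9090").read().decode("utf-8")

# the standalone master UI reports an "Alive Workers" count on its status page
print("master UI reachable:", "Alive Workers" in html)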

Run a job using Spark Submit:

Now it's time to run pi.py using spark-submit. Run the following commands to see MapReduce in action.

docker exec -it spark-master bash
# this command launches a shell inside the spark-master container

spark-submit examples/src/main/python/pi.py 1000
# run the pi.py example with 1000 partitions as the argument
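
pi.py ships with the Spark examples inside the container, so nothing extra needs to be copied in. If you want to submit a small job of your own, a sketch along the lines below works the same way - the file name sum_job.py is just an example; save it somewhere inside the container and pass its path to spark-submit:

# sum_job.py - a minimal sketch of a custom job for spark-submit (the name is just an example)
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SumExample").getOrCreate()

# spread the numbers 1..1000 over 4 partitions and add them up on the cluster
total = spark.sparkContext.parallelize(range(1, 1001), 4).reduce(add)
print("sum of 1..1000 = %d" % total)  # expected: 500500

spark.stop()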

Run a job via Jupyter Notebook using PySpark:

Run the following command from within the spark-master container:

pyspark
# this launches a Jupyter Notebook server on port 8889

Run the commands below in the Jupyter notebook.

from random import random
from operator import add
from pyspark.sql import SparkSession

# create (or reuse) a SparkSession for this notebook
spark = SparkSession\
        .builder\
        .appName("PythonPi")\
        .getOrCreate()

partitions = 1000
n = 100000 * partitions

def f(_):
    # sample a random point in the 2x2 square; count it if it lands inside the unit circle
    x = random() * 2 - 1
    y = random() * 2 - 1
    return 1 if x ** 2 + y ** 2 <= 1 else 0

# distribute the n samples across the partitions, then sum the hits
count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
# output: Pi is roughly 3.141040

spark.stop()

PySpark Notebook
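
Depending on how the image configures pyspark, the SparkSession above may come up already attached to the standalone master or fall back to local mode. To target the cluster explicitly, you can pass the master address from the compose file to the builder - a minimal sketch:

from pyspark.sql import SparkSession

# explicitly target the standalone master defined in docker-compose.yaml
spark = SparkSession\
        .builder\
        .master("spark://spark-master:7077")\
        .appName("PythonPiExplicit")\
        .getOrCreate()

print(spark.sparkContext.master)   # spark://spark-master:7077
spark.stop()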

To recap, we successfully:

  • installed an Apache Spark cluster (v3.2.0)
  • submitted a job with spark-submit
  • launched a Jupyter notebook using PySpark
  • executed Spark commands in the notebook.

Happy Coding!! :+1:
