In this post, I share my learnings from a real-world use case built on the Apache Beam + SQL Server + Apache Spark stack. If you are new to Apache Beam, you may read about it here. If you want to follow along with the code and need to bring up a Spark cluster, you may read about that here.
We will use SQL Server as our database, connect to it over JDBC, use Apache Beam to process the data elements in parallel on an Apache Spark cluster, and finally write each processed set to a file.
I went with the Apache Beam Java SDK due to its maturity with respect to the number of available IO transforms.
1) SQL Server Database
Let’s first launch a SQL Server Docker container, create a user, a database, and a table, and load some data.
Log in to the sql-server container and create a database and a user:
Then log in with the #USERNAME and #USERPASS you used above to create a table and load some mock data.
2) Beam Pipeline
Using the Eclipse IDE, create a Maven project with a pom.xml (which includes the Spark and Beam Java dependencies) as below:
Save the code below as BeamSQL.java.
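Here is a minimal sketch of what BeamSQL.java can look like. The hostname sql-server, port 1433, the database name MovieDB, the id and title columns, the credential placeholders, and the /mnt/movie-output prefix are illustrative assumptions; adjust the connection string, query, row mapper, and output path to match your own setup.

```java
import java.sql.ResultSet;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class BeamSQL {

  public static void main(String[] args) {
    // The runner (e.g. --runner=SparkRunner) is taken from the command-line arguments.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);

    pipeline
        // Read rows from the Movie table over JDBC (host, database, and credentials are placeholders).
        .apply("ReadFromSQLServer", JdbcIO.<String>read()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                    "com.microsoft.sqlserver.jdbc.SQLServerDriver",
                    "jdbc:sqlserver://sql-server:1433;databaseName=MovieDB")
                .withUsername("#USERNAME")
                .withPassword("#USERPASS"))
            .withQuery("SELECT id, title FROM Movie")
            .withCoder(StringUtf8Coder.of())
            .withRowMapper((ResultSet resultSet) ->
                resultSet.getInt("id") + "," + resultSet.getString("title")))
        // Process each element in parallel; here we simply upper-case the row.
        .apply("ProcessRows", ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext context) {
            context.output(context.element().toUpperCase());
          }
        }))
        // Write the processed elements to files under /mnt on the workers.
        .apply("WriteToFile", TextIO.write().to("/mnt/movie-output").withSuffix(".txt"));

    pipeline.run().waitUntilFinish();
  }
}
```

Note that the runner is not hard-coded: the same jar can run on the direct runner locally or on Spark, depending on the arguments you pass in.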
3) Run Beam Pipeline on Spark Cluster
Build the project with Maven (e.g., mvn clean install) to generate beam-sql-0.0.1-SNAPSHOT-shaded.jar, then copy this jar onto the spark-master Docker container and submit it to the cluster with spark-submit. If you don’t have a Spark cluster running locally, check my blog post here.
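Typically the runner is selected by passing --runner=SparkRunner (and optionally --sparkMaster) as pipeline arguments to the jar. If you prefer to pin it in code instead, here is a small sketch; the class and method names are illustrative, and the master URL spark://spark-master:7077 is an assumption based on the container name and Spark's default port.

```java
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class BeamSQLSparkOptions {

  /** Builds pipeline options that target the Spark cluster instead of the direct runner. */
  static SparkPipelineOptions buildSparkOptions(String[] args) {
    SparkPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    // Assumed master URL: the spark-master container on Spark's default port 7077.
    options.setSparkMaster("spark://spark-master:7077");
    return options;
  }
}
```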
As you will notice when you list the files in /mnt, there are 7 output files, each containing 2 rows of data from the Movie database table. Apache Beam processes the PCollection elements in parallel in the ParDo, and the final write transform operated on windows of 2 elements (this can be further customized, as documented here), hence the 7 output files.
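As one example of that customization, and assuming the write step from the sketch above, you can pin the number of output files explicitly (at the cost of less parallel writing):

```java
// Force the write into a fixed number of shards (here a single file) instead of
// letting the runner decide based on how it bundles the PCollection.
.apply("WriteToSingleFile", TextIO.write()
    .to("/mnt/movie-output")
    .withSuffix(".txt")
    .withNumShards(1));
```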
This simple example demonstrates how effortlessly the Apache Beam programming model and SDKs let you parallelize complex workflows in a distributed computing environment.