In Part 1, we saw our web app take shape to the point where a user can fetch movie or actor information on demand. However, it has a glaring performance bottleneck: every web call is served by doing expensive file-based joins on the server side. Traditionally, such data lookups are delegated to database systems. However, SQL databases have traditionally been much easier to scale up (a.k.a. vertical scaling) than to scale out (a.k.a. horizontal scaling). It has become common to seek out alternative NoSQL approaches as data sizes grow beyond what database systems based on relational algebra can comfortably handle.
We’ll explore one such alternative system: Neo4j, a database designed around graph theory. The following video illustrates the ease with which one can explore the interconnected nature of data points using Neo4j.
In this post, we go over loading the IMDb datasets into Neo4j and explore fetching a movie’s cast and an actor’s movies. We will also update our FetchInfo API endpoint to fetch data from the database instead of doing file lookups. The picture below depicts how the data can be explored once we populate Neo4j with the IMDb datasets.
Loading IMDb datasets into the database
Before starting the loading process, we need to come up with a blueprint of how the nodes (actors, movies, etc.) are connected via relationships (acted in, released_in, etc.). Below is a picture of one such blueprint.
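One way to pin such a blueprint down is to write it out as data. The sketch below is an assumption about the model, not a transcription of the picture: the labels `Actor`, `Movie`, and `Year` and the relationship types `ACTED_IN` and `RELEASED_IN` are illustrative names. Each relationship is rendered as a Cypher path pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    start: str     # label of the node the relationship leaves
    rel_type: str  # relationship type
    end: str       # label of the node the relationship points to

    def pattern(self) -> str:
        """Render the relationship as a Cypher path pattern."""
        return f"(:{self.start})-[:{self.rel_type}]->(:{self.end})"


# Hypothetical blueprint: labels and relationship types are assumptions.
BLUEPRINT = [
    Relationship("Actor", "ACTED_IN", "Movie"),
    Relationship("Movie", "RELEASED_IN", "Year"),
]


def node_labels(blueprint):
    """Collect the distinct node labels in first-seen order."""
    labels = []
    for rel in blueprint:
        for label in (rel.start, rel.end):
            if label not in labels:
                labels.append(label)
    return labels


for rel in BLUEPRINT:
    print(rel.pattern())
```

Keeping the blueprint in one place like this makes it easy to check that the loading statements and the lookup queries agree on the same labels.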
At this point, we have everything we need to load the data. So, let’s go ahead and run the following commands to fire up a Neo4j Docker container.
It’s time to copy the datasets into the Neo4j container.
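As a rough sketch of these two steps, the script below builds a `docker run` command and a `docker cp` command and prints them. Several details are assumptions: the container name `neo4j-imdb`, the untagged `neo4j` image, the placeholder password, and the dataset filename. The import path is the one the official image is expected to use. With `dry_run=True` nothing is executed.

```python
import subprocess

CONTAINER = "neo4j-imdb"              # hypothetical container name
IMPORT_DIR = "/var/lib/neo4j/import"  # import directory assumed for the official image


def docker_run_cmd(password):
    """Command that starts Neo4j with the browser (7474) and Bolt (7687) ports exposed."""
    return [
        "docker", "run", "--detach", "--name", CONTAINER,
        "--publish", "7474:7474",
        "--publish", "7687:7687",
        "--env", f"NEO4J_AUTH=neo4j/{password}",
        "neo4j",  # image tag left unspecified; pin a version in practice
    ]


def docker_cp_cmd(local_path):
    """Command that copies one dataset file into the container's import directory."""
    return ["docker", "cp", local_path, f"{CONTAINER}:{IMPORT_DIR}/"]


def run(cmd, dry_run=True):
    """Print the command and execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


run(docker_run_cmd("change-me-please"))  # placeholder password
run(docker_cp_cmd("name.basics.tsv"))    # assumed dataset filename
```

Calling `run(..., dry_run=False)` would actually invoke Docker, so the printed commands are worth checking first.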
Log in to the Neo4j container and load the datasets using:
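One possible way to do the load is with Cypher's `LOAD CSV`, driven from Python rather than from a shell inside the container. The sketch below is not necessarily the approach used here. It assumes the unzipped IMDb files `name.basics.tsv`, `title.basics.tsv`, and `title.principals.tsv` sit in the import directory, uses the assumed labels `Actor` and `Movie` with an `ACTED_IN` relationship, and uses the constraint syntax of Neo4j 4.4 or later. The `session` argument is anything with a `run()` method that returns an object with `consume()`, such as a session from the official `neo4j` Python driver.

```python
# Uniqueness constraints also create the indexes that keep MERGE fast.
CONSTRAINTS = [
    "CREATE CONSTRAINT actor_id IF NOT EXISTS FOR (a:Actor) REQUIRE a.id IS UNIQUE",
    "CREATE CONSTRAINT movie_id IF NOT EXISTS FOR (m:Movie) REQUIRE m.id IS UNIQUE",
]


def load_csv(filename, body):
    """Build a LOAD CSV statement for a tab-separated file in the import directory."""
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{filename}' AS row "
        r"FIELDTERMINATOR '\t' " + body
    )


LOADS = [
    # Every person in the file becomes an Actor node, for simplicity.
    load_csv(
        "name.basics.tsv",
        "MERGE (a:Actor {id: row.nconst}) SET a.name = row.primaryName",
    ),
    # Keep only rows whose titleType is 'movie'.
    load_csv(
        "title.basics.tsv",
        "WITH row WHERE row.titleType = 'movie' "
        "MERGE (m:Movie {id: row.tconst}) SET m.title = row.primaryTitle",
    ),
    # Connect people to movies for acting credits only.
    load_csv(
        "title.principals.tsv",
        "WITH row WHERE row.category IN ['actor', 'actress'] "
        "MATCH (a:Actor {id: row.nconst}) "
        "MATCH (m:Movie {id: row.tconst}) "
        "MERGE (a)-[:ACTED_IN]->(m)",
    ),
]


def load_all(session):
    """Run the constraints first, then the three load statements, in order."""
    for statement in CONSTRAINTS + LOADS:
        session.run(statement).consume()
```

The full IMDb files are large, so a real load would likely need batching or Neo4j's bulk import tooling. This sketch leaves that out.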
That’s it. Just like that, we have our IMDb datasets loaded into Neo4j. Let’s explore a movie’s cast and an actor’s movies the graph way!
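The two lookups could be sketched as a pair of parameterized Cypher queries. The labels, the `ACTED_IN` relationship, and the `name` and `title` properties are assumptions about how the data was modeled. The helper functions and `demo()` are illustrative names, and the password and movie title in `demo()` are placeholders. `demo()` needs the third-party `neo4j` driver and a running database, so it is defined but not called.

```python
CAST_QUERY = (
    "MATCH (a:Actor)-[:ACTED_IN]->(m:Movie) "
    "WHERE m.title = $title "
    "RETURN a.name AS name ORDER BY name"
)

MOVIES_QUERY = (
    "MATCH (a:Actor {name: $name})-[:ACTED_IN]->(m:Movie) "
    "RETURN m.title AS title ORDER BY title"
)


def movie_cast(session, title):
    """Names of the actors connected to the movie with the given title."""
    return [record["name"] for record in session.run(CAST_QUERY, title=title)]


def actor_movies(session, name):
    """Titles of the movies connected to the actor with the given name."""
    return [record["title"] for record in session.run(MOVIES_QUERY, name=name)]


def demo(uri="bolt://localhost:7687", user="neo4j", password="change-me-please"):
    """Run one lookup against a live database (requires `pip install neo4j`)."""
    from neo4j import GraphDatabase  # imported lazily so the helpers work without it

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            print(movie_cast(session, "The Matrix"))
```

Passing `$title` and `$name` as parameters, instead of formatting them into the query string, avoids quoting problems and lets the database reuse the query plan.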
Now, it’s time to upgrade our FetchInfo endpoint to use the database instead of file contents.
After implementing database lookup, server.py looks as follows:
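Independently of the actual `server.py`, the shape of the change can be sketched in a framework-agnostic way. The handler receives its two lookups as plain callables, so the file-based versions from Part 1 can be swapped for database-backed ones without touching the handler. Everything here is hypothetical: the `make_fetch_info` factory, the `kind` values `"movie"` and `"actor"`, and the response layout are illustrative, not the endpoint's real contract.

```python
import json


def make_fetch_info(cast_of, movies_of):
    """Build a FetchInfo-style handler from two lookup callables.

    cast_of(title) returns a list of actor names;
    movies_of(name) returns a list of movie titles.
    """
    lookups = {"movie": cast_of, "actor": movies_of}

    def fetch_info(kind, name):
        lookup = lookups.get(kind)
        if lookup is None:
            # Unknown kind: report a client error instead of raising.
            return 400, json.dumps({"error": f"unknown kind: {kind}"})
        body = {"kind": kind, "name": name, "results": lookup(name)}
        return 200, json.dumps(body)

    return fetch_info


# In-memory stand-ins; database-backed lookups would be passed in the same way.
fetch_info = make_fetch_info(
    cast_of=lambda title: ["Actor A", "Actor B"],
    movies_of=lambda name: ["Movie X"],
)

print(fetch_info("movie", "Some Movie"))
```

With this split, the web framework only has to translate a request into a `fetch_info(kind, name)` call and send back the returned status and JSON body.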
And the performance of our application feels much smoother, as shown in the video below.