Real-time data processing is one of the most sought-after skills these days. And Apache Kafka is the clear front-runner for processing data streams at scale, battle-tested at corporate giants such as Netflix and Uber, to name a few.
In this post, I want to share how we can read Twitter's real-time tweet stream via a Kafka Producer into a Kafka Topic, which is in turn read by a Kafka Consumer. Though the current workflow seems trivial, it demonstrates Kafka's profound use case: the publish-subscribe model. One can envision a scenario where multiple consumers consume the same data elements in parallel, processing the downstream business logic in a fault-tolerant fashion.
Without further ado, let's get into action. But first things first: we need to fire up a Kafka environment on our local workstation to explore Kafka's fundamentals.
Kafka Installation using Docker Compose
docker-compose.yaml
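The original compose file isn't shown in this extract, so here is a minimal single-broker sketch. The image tags, listener settings, and the helper kafka-client service (which we will use later to run the Python script) are my assumptions, not necessarily the original setup:

```yaml
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.3.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.3.0
    container_name: kafka
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  # Helper container we will log onto later to run get_tweets.py.
  kafka-client:
    image: python:3.10-slim
    container_name: kafka-client
    depends_on:
      - kafka
    volumes:
      - .:/work
    working_dir: /work
    command: sleep infinity
```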
Save the above contents as docker-compose.yaml and run
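```bash
docker-compose up -d
```

The -d flag runs the whole stack in the background.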
Let's explore Kafka's basics by logging onto the kafka container.
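Assuming the container and service names from the compose sketch above, the session could look like this; tweets is just an example topic name:

```bash
# from the host
docker exec -it kafka bash

# inside the container: create a topic, then start a console producer on it
kafka-topics --bootstrap-server kafka:9092 --create \
  --topic tweets --partitions 1 --replication-factor 1
kafka-console-producer --bootstrap-server kafka:9092 --topic tweets
```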
Now, open another terminal to launch a Kafka Consumer so that we can consume the data off the Kafka Topic we just created.
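Something along these lines, again assuming the tweets topic from the previous step:

```bash
docker exec -it kafka kafka-console-consumer \
  --bootstrap-server kafka:9092 --topic tweets --from-beginning
```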
Now type, say, the proverbial "hello world" phrase into the kafka-console-producer running in the first terminal window. The message magically appears in the kafka-console-consumer window you just opened.
Now that we have seen the publish-subscribe model in action, let's do this with real-time Twitter data.
You will need to:
- create a Twitter Developer Account at https://developer.twitter.com,
- create an app with OAuth2 client credentials,
- copy the key into an api.key file and the secret into an api.secret file (see the snippet after this list), and
- finally, upgrade your account (it's free) to Elevated access to the Twitter API, so that you can read tweets in bulk.
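For reference, here is one way to create those two credential files; the placeholder values are obviously hypothetical and should be replaced with your app's real key and secret:

```bash
# hypothetical placeholders -- substitute your app's real credentials
echo 'YOUR_API_KEY'    > api.key
echo 'YOUR_API_SECRET' > api.secret
```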
Now log onto the kafka-client container using:
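Assuming the kafka-client service from the compose sketch, and kafka-python plus requests as the script's dependencies (my choice for this sketch):

```bash
docker exec -it kafka-client bash

# inside the container: install the script's dependencies
pip install kafka-python requests
```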
get_tweets.py
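The original script isn't included in this extract, so the following is a minimal sketch. It assumes the api.key/api.secret files from earlier, exchanges the OAuth2 client credentials for a bearer token via Twitter's app-only token endpoint, reads the v2 sampled tweet stream, and forwards each tweet to the tweets topic on the broker from our compose file:

```python
# get_tweets.py -- a minimal sketch, assuming kafka-python and requests
import json
import requests
from kafka import KafkaProducer

# Read the app credentials saved earlier.
api_key = open("api.key").read().strip()
api_secret = open("api.secret").read().strip()

# Exchange the OAuth2 client credentials for a bearer token.
resp = requests.post(
    "https://api.twitter.com/oauth2/token",
    auth=(api_key, api_secret),
    data={"grant_type": "client_credentials"},
)
resp.raise_for_status()
bearer_token = resp.json()["access_token"]

# Producer pointed at the broker from docker-compose.yaml;
# "tweets" is the topic we created earlier.
producer = KafkaProducer(bootstrap_servers="kafka:9092")

# Consume Twitter's v2 sampled stream and forward each tweet to Kafka.
with requests.get(
    "https://api.twitter.com/2/tweets/sample/stream",
    headers={"Authorization": f"Bearer {bearer_token}"},
    stream=True,
) as stream:
    stream.raise_for_status()
    for line in stream.iter_lines():
        if not line:  # skip keep-alive newlines
            continue
        tweet = json.loads(line)
        producer.send("tweets", json.dumps(tweet).encode("utf-8"))
```

Run it inside the kafka-client container with python get_tweets.py, keeping the console consumer from earlier open in another terminal.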
When you run the script, you will notice real-time tweets showing up in your consumer window. So, that's a wrap ... for now!!