Twitter Crawler

Introduction

This project crawls public users’ timelines using two different approaches:

  • Scrapy
  • Selenium + Beautiful Soup

The crawled data is written to a file in JSON format. A Kafka producer then reads the file and produces its records into the ‘raw-tweets’ topic, as sketched below.
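A minimal sketch of the producer step, assuming the kafka-python client, a broker at localhost:9092, and a tweets.json file holding the crawled items as a JSON array (these names are illustrative, not the project’s fixed values):

    import json

    from kafka import KafkaProducer

    # Serialize each record as UTF-8 JSON before sending it to the broker.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    # Assumption: tweets.json holds the crawled tweets as a JSON array.
    with open("tweets.json", encoding="utf-8") as f:
        for tweet in json.load(f):
            producer.send("raw-tweets", tweet)  # topic name from this project

    producer.flush()  # block until every record has been delivered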

Technologies/Languages Used

Technology      Usage
Python          The language the project is written in.
Scrapy          Crawls data from Twitter.
Selenium        Drives a real browser to render dynamic pages.
Beautiful Soup  Parses HTML and XML documents.
Docker          Containerizes the services, including the backend and frontend.
Kafka           Produces the crawled data into the ‘raw-tweets’ topic.
Git             Version control.
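To illustrate the Selenium + Beautiful Soup approach listed above, here is a minimal sketch: Selenium drives a real browser so the JavaScript-heavy timeline actually renders, and Beautiful Soup parses the resulting HTML. The user URL and the article selector are assumptions about the page markup, not the project’s actual code.

    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()  # needs a matching ChromeDriver on PATH
    try:
        driver.get("https://twitter.com/some_public_user")  # hypothetical user
        # Hand the fully rendered page to Beautiful Soup for parsing.
        soup = BeautifulSoup(driver.page_source, "html.parser")
        # Assumption: each tweet on the rendered page sits in an <article> tag.
        for article in soup.find_all("article"):
            print(article.get_text(separator=" ", strip=True))
    finally:
        driver.quit()  # always release the browser process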

Project Information

  • Category: Software
  • Project date: December 2020

Project Description
This project crawls Twitter timelines and produces the crawled data into a Kafka topic.
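For the Scrapy approach, a minimal spider sketch might look as follows; the start URL, selector, and item field are illustrative assumptions, since the markup the project’s real spider targets may differ.

    import scrapy

    class TimelineSpider(scrapy.Spider):
        name = "timeline"
        start_urls = ["https://twitter.com/some_public_user"]  # hypothetical user

        def parse(self, response):
            # Assumption: each tweet on the page sits in an <article> tag.
            for tweet in response.css("article"):
                yield {"text": " ".join(tweet.css("::text").getall()).strip()}

Running it with

    scrapy runspider timeline_spider.py -o tweets.json

writes the yielded items as a JSON array, the file format the producer sketch above assumes.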