Scraping Tweets from Twitter using twint, kafka, CMAK and MongoDB

Bhavan-Naik/Twitter_Scraping

Twitter Data Pipeline


Requirements and References:

Apache Kafka 
Installation guide: https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-20-04
Version used: Kafka 2.1.1 (Scala 2.11), https://archive.apache.org/dist/kafka/2.1.1/kafka_2.11-2.1.1.tgz

twint
https://github.com/twintproject/twint

CMAK 
https://github.com/yahoo/CMAK

MongoDB 
https://linuxhint.com/install_mongodb_ubuntu_20_04/

MongoDB-Compass 
https://docs.mongodb.com/compass/current/install/

Java 11+

Python 3.6+

Run twitter_shell.sh to install the basic packages, including twint.

Execution steps:

Step 1:

Start Kafka and MongoDB, then check that both services are running:

$sudo systemctl start kafka

$sudo systemctl status kafka

$sudo systemctl start mongodb

$sudo systemctl status mongodb
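
As a complement to the systemctl checks, the short sketch below tests whether each service is accepting TCP connections. It assumes the Kafka broker listens on localhost:9092 (the address used in the later steps) and that MongoDB listens on its default port 27017. The function names are illustrative and not part of this repository.

```python
import socket

# Assumed ports: 9092 for the Kafka broker, 27017 (the default) for MongoDB.
SERVICES = {"kafka": ("localhost", 9092), "mongodb": ("localhost", 27017)}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

An open port only shows that something is listening there, so the systemctl status output remains the more reliable check.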

Step 2:

Open a first terminal

Navigate to your "CMAK" directory and run the following commands:

$cd target/universal/cmak-3.0.0.5

$bin/cmak -java-home /usr/lib/jvm/java-11-openjdk-amd64/

Step 3:

Open a second terminal

Navigate to your "kafka" home directory and open the ZooKeeper shell:

$bin/zookeeper-shell.sh localhost:2181

Once the ZooKeeper shell is open and waiting for input, run the following commands inside it:

$ls /kafka-manager

$create /kafka-manager/mutex ""

$create /kafka-manager/mutex/locks ""

$create /kafka-manager/mutex/leases ""

In a web browser, open localhost:9000 and add a cluster with the following details, then save: name: (any_name), host: localhost:2181, kafka-version: 2.1.1.
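
The three znodes created in the ZooKeeper shell during this step could also be created from Python. This is a minimal sketch that assumes the third-party kazoo client is installed (it is not listed in the requirements above). The helper name ensure_cmak_znodes is hypothetical.

```python
# Znode paths taken from the zookeeper-shell commands in Step 3.
CMAK_ZNODES = [
    "/kafka-manager/mutex",
    "/kafka-manager/mutex/locks",
    "/kafka-manager/mutex/leases",
]


def ensure_cmak_znodes(hosts="localhost:2181", paths=CMAK_ZNODES):
    """Create each znode (and any missing parent) if it does not already exist."""
    # Imported here so the module loads even when kazoo is not installed.
    from kazoo.client import KazooClient

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        for path in paths:
            # ensure_path is a no-op for znodes that already exist.
            zk.ensure_path(path)
    finally:
        zk.stop()
        zk.close()


if __name__ == "__main__":
    ensure_cmak_znodes()
```

Unlike the shell's create command, ensure_path does not fail when a znode is already present, so the sketch can be re-run safely.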

Step 4:

Open a third terminal

Navigate to your "kafka" home directory and create the topic:

$bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic_name

Navigate to your twitter folder and start the producer:

$python3 producer_filename.py --broker-list localhost:9092 --topic topic_name > /dev/null
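
For orientation, here is a rough sketch of what a producer script accepting the --broker-list and --topic options might look like. It is not the repository's actual producer. It assumes the kafka-python package for the Kafka client, and the --search and --limit options are hypothetical additions used only to make the example complete.

```python
import argparse
import json


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Send scraped tweets to a Kafka topic.")
    parser.add_argument("--broker-list", required=True, help="Kafka broker, e.g. localhost:9092")
    parser.add_argument("--topic", required=True, help="Kafka topic to write to")
    # Hypothetical options, not part of the command shown above.
    parser.add_argument("--search", default="kafka", help="search term (assumed)")
    parser.add_argument("--limit", type=int, default=100, help="maximum tweets (assumed)")
    return parser.parse_args(argv)


def serialize(record):
    """Encode a tweet dict as UTF-8 JSON; non-JSON values fall back to str()."""
    return json.dumps(record, default=str).encode("utf-8")


def scrape(search, limit):
    """Collect tweets with twint and return them as plain dicts."""
    import twint  # imported lazily so the module loads without twint installed

    config = twint.Config()
    config.Search = search
    config.Limit = limit
    config.Store_object = True  # keep results in twint.output.tweets_list
    config.Hide_output = True
    twint.run.Search(config)
    return [vars(tweet) for tweet in twint.output.tweets_list]


def main(argv=None):
    from kafka import KafkaProducer  # kafka-python, assumed to be installed

    args = parse_args(argv)
    producer = KafkaProducer(bootstrap_servers=args.broker_list, value_serializer=serialize)
    for tweet in scrape(args.search, args.limit):
        producer.send(args.topic, tweet)
    producer.flush()  # block until all buffered messages are sent
    producer.close()


if __name__ == "__main__":
    main()
```

Sending each tweet as one JSON message keeps the consumer side simple, since every Kafka record maps to one document.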

Step 5:

Open a fourth terminal and launch MongoDB Compass:

$mongodb-compass

Connect to the database that will store the tweets.

Open a fifth terminal, navigate to your twitter folder, and start the consumer:

$python3 consumer_filename.py --bootstrap-server localhost:9092 --topic topic_name --from-beginning

Press Ctrl+C to stop the consumer after it has consumed all the tweets.

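The consumer side of this step can be sketched in the same spirit. This is not the repository's consumer script. It assumes the kafka-python and pymongo packages, JSON-encoded messages, and a MongoDB instance at the default address mongodb://localhost:27017. The database name "twitter" and collection name "tweets" are hypothetical.

```python
import argparse
import json


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Store tweets from Kafka in MongoDB.")
    parser.add_argument("--bootstrap-server", required=True, help="e.g. localhost:9092")
    parser.add_argument("--topic", required=True, help="Kafka topic to read from")
    parser.add_argument("--from-beginning", action="store_true",
                        help="start from the oldest available message")
    return parser.parse_args(argv)


def offset_reset(from_beginning):
    """Translate the --from-beginning flag into kafka-python's auto_offset_reset value."""
    return "earliest" if from_beginning else "latest"


def deserialize(raw):
    """Decode a UTF-8 JSON message into a dict."""
    return json.loads(raw.decode("utf-8"))


def main(argv=None):
    # Imported lazily so the module loads without these packages installed.
    from kafka import KafkaConsumer
    from pymongo import MongoClient

    args = parse_args(argv)
    # Assumed connection string, database name, and collection name.
    collection = MongoClient("mongodb://localhost:27017")["twitter"]["tweets"]
    consumer = KafkaConsumer(
        args.topic,
        bootstrap_servers=args.bootstrap_server,
        auto_offset_reset=offset_reset(args.from_beginning),
        value_deserializer=deserialize,
    )
    try:
        for message in consumer:
            collection.insert_one(message.value)
    except KeyboardInterrupt:
        pass  # Ctrl+C ends the loop once all tweets are consumed
    finally:
        consumer.close()


if __name__ == "__main__":
    main()
```

Because no consumer group is set in this sketch, no offsets are committed, so every run with --from-beginning re-reads the topic from the start and inserts the tweets again.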