Application Description:
PyCrawlConnect is an application developed using the Python programming language with the aim of connecting data obtained from web crawling to Apache Kafka, and subsequently forwarding this data to Elasticsearch. This application is designed to provide an efficient solution for managing crawled data using cutting-edge technologies.
Key Features:
- Web Data Crawling: The application can crawl data from various websites using web scraping techniques or APIs provided by specific sites. The crawling module is designed for flexibility and easy configuration.
- Kafka Integration: PyCrawlConnect can connect crawled data to Apache Kafka. Kafka is used as middleware to manage message queues, ensuring reliability and fault tolerance in the data delivery process.
- Elasticsearch Connector: After data is generated and sent via Kafka, the application can forward it to Elasticsearch. This allows users to store and index crawled data in Elasticsearch for easy analysis and search.
- Easy Configuration: PyCrawlConnect is designed with easily configurable settings. Users can quickly adjust crawling settings, Kafka configurations, and Elasticsearch parameters through a structured configuration file.
- Comprehensive Documentation: The project comes with comprehensive documentation that explains installation steps, configuration, and how to use the application. This documentation will assist developers or other users who wish to contribute to or use the application.
- Open Source: PyCrawlConnect is an open-source project available on GitHub. Developers can collaborate, provide feedback, or make contributions through pull requests.
By using PyCrawlConnect, users can easily manage and analyze crawled data from various sources in an efficient and structured manner. This project is expected to provide a reliable solution in the context of real-time web data processing.
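The crawl → Kafka → Elasticsearch flow described above can be sketched in miniature. This is an illustrative sketch only: the list and dict below stand in for a Kafka topic and an Elasticsearch index, and every function name here is hypothetical, not PyCrawlConnect's actual API.

```python
import json

# Minimal sketch of the PyCrawlConnect flow. The broker and the index are
# simulated with in-memory structures; in the real application these would
# be a Kafka topic and an Elasticsearch index.

def crawl(url):
    """Stand-in for the crawling module: return one record for a page."""
    return {"url": url, "title": f"Title of {url}"}

def produce(topic, record):
    """Stand-in for a Kafka producer: serialize and append to the topic."""
    topic.append(json.dumps(record).encode("utf-8"))

def consume_and_index(topic, index):
    """Stand-in for the consumer: deserialize messages and index them."""
    for message in topic:
        doc = json.loads(message)
        index[doc["url"]] = doc

topic, index = [], {}
produce(topic, crawl("https://example.com/books/1"))
consume_and_index(topic, index)
print(index["https://example.com/books/1"]["title"])
```

In the real pipeline the same three roles are played by the crawler scripts, a Kafka producer client, and the consumer that writes to Elasticsearch.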
- Virtual Machine: This application runs on a virtual machine, so before you install the application you have to prepare the virtual machine first. See Virtual Machine Installation & Configuration.
- Python: Python version 3.10.12 must already be installed. See Installation and Setting up Python.
- Kafka: If you want to run this application and send the data to a Kafka topic, you have to install and run Kafka first. See How to Install and Configure Kafka.
- Elasticsearch: If the data consumed from Kafka is to be forwarded to Elasticsearch, you must install Elasticsearch first. See the Steps to Elasticsearch Installation & Configuration.
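Since the project requires Python 3.10.12, a quick interpreter check can save debugging time later. This small helper is our own illustration, not part of the project:

```python
import sys

def meets_requirement(version_info, required=(3, 10)):
    """Return True if the interpreter version satisfies the requirement."""
    return tuple(version_info[:2]) >= required

# Report whether the current interpreter is new enough for PyCrawlConnect.
print("Python", sys.version.split()[0],
      "meets the 3.10 requirement:", meets_requirement(sys.version_info))
```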
# Change Directory
cd /home/
# Install gh
sudo apt install gh
# Auth gh
gh auth login
# Cloning Repository
gh repo clone muhfalihr/PyCrawlConnect
NOTE: Perform a clone on the Kafka VM.
- Open the VirtualBox Software.
- Select and click the Kafka VM.
- Click Start.
- Select and click the Elasticsearch VM.
- Click Start.
- Type your VM username and password.
- Switch to the superuser account (root):
sudo su
Remote into the Kafka VM.
- Open 4 Desktop Terminals.
- SSH into the Kafka VM. Do this in each terminal.
ssh root@192.168.57.9
Remote into the Elasticsearch VM.
- Open 1 Desktop Terminal.
- SSH into the Elasticsearch VM.
ssh root@192.168.57.8
# elasticsearch service
systemctl start elasticsearch.service
# kibana service
systemctl start kibana.service
- Change directory:
cd kafka_2.13-3.2.0/
- Start the ZooKeeper server. Do this in the 1st terminal.
./bin/zookeeper-server-start.sh ./config/zookeeper.properties
- Start the Kafka server. Do this in the 2nd terminal.
./bin/kafka-server-start.sh ./config/server.properties
- Run the Kafka UI (CMAK). Do this in the 3rd terminal.
CMAK/target/universal/cmak-*/bin/cmak
- Open Visual Studio Code.
- Install the extension Remote - SSH: Editing Configuration Files.
- Press F1 or Ctrl+Shift+P.
- Select and click Remote-SSH: Connect to Host....
- Click Add New SSH Host....
- Enter the SSH connection command.
- Select /home/ubuntu/.ssh/config.
- A Host added pop-up will appear at the bottom right. Then click Connect.
- Enter the password for SSH to the host earlier.
- Press Ctrl+Shift+E > Click Open Folder > Type /home/PyCrawlConnect/ > Click OK.
- Edit the .env file according to what you want.
- Activate your virtual environment. In this example it is activated from the root directory.
source .venv/my-venv/bin/activate
- Run the files apibook.py, apimovie.py, and apinews.py. NOTE: Run each in a different terminal in VS Code.
# Terminal 1
python3 apibook.py
# Terminal 2
python3 apimovie.py
# Terminal 3
python3 apinews.py
- Install the Live Server extension.
- Open the index.html file in the html directory. Then click Go Live at the bottom right.
- Set up the .env file first.
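A .env file for this kind of pipeline typically holds the broker and Elasticsearch connection settings. The variable names and values below are assumptions for illustration only; check the project's documentation for the actual keys.

```ini
# Hypothetical example values; the real variable names may differ.
KAFKA_BOOTSTRAP_SERVERS=192.168.57.9:9092
KAFKA_TOPIC=pycrawlconnect
ELASTICSEARCH_HOST=http://192.168.57.8:9200
ELASTICSEARCH_USERNAME=elastic
ELASTICSEARCH_PASSWORD=changeme
```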
- Run the kafka_consumer.py file in the helper directory. Do this in the 4th terminal.
python3 helper/kafka_consumer.py
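As a rough idea of the work a consumer like this performs on each message before it reaches Elasticsearch, the sketch below decodes a Kafka payload and wraps it in an index action. This is a hypothetical illustration, not the project's actual code; the index name, field names, and function are all assumptions.

```python
import json

# Hypothetical sketch of a Kafka -> Elasticsearch handoff: decode one raw
# message value and build the pair of lines an Elasticsearch bulk index
# request expects (an action header plus the document itself).

def to_bulk_action(message_value, index="pycrawlconnect"):
    """Turn a raw Kafka message value into an ES bulk (action, doc) pair."""
    doc = json.loads(message_value)
    action = {"index": {"_index": index, "_id": doc.get("url")}}
    return action, doc

action, doc = to_bulk_action(b'{"url": "https://example.com", "title": "Example"}')
print(action["index"]["_index"], doc["title"])
```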
- Open your browser.
- Type the Kibana address (the same host as elasticsearch.hosts, port 5601):
http://192.168.57.8:5601
- Enter your username and password in the login form. Then press ENTER.
- In the Management menu, click Dev Tools.
- Retrieve a specified JSON document from an index:
GET <index>/_doc/<_id>
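For example, if the consumer indexed documents into a hypothetical index named pycrawlconnect, a concrete Dev Tools request would look like this (index name and document id are made up for illustration):

```
GET pycrawlconnect/_doc/1
```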
The PyCrawlConnect project is licensed under the Apache License 2.0.