Project to connect crawled data to Kafka and monitor it using Elasticsearch. Still under development.


Instagram: ____mfr.py

PyCrawlConnect

ProjectImage

Application Description:

PyCrawlConnect is an application developed in the Python programming language that connects data obtained from web crawling to Apache Kafka and subsequently forwards it to Elasticsearch. The application is designed to provide an efficient solution for managing crawled data end to end.

Key Features:

  1. Web Data Crawling: The application can crawl data from various websites using web scraping techniques or APIs provided by specific sites. The crawling module is designed for flexibility and easy configuration.

  2. Kafka Integration: PyCrawlConnect features the capability to connect crawled data to Apache Kafka. Kafka is used as middleware to manage message queues, ensuring reliability and fault tolerance in the data delivery process.

  3. Elasticsearch Connector: After data is generated and sent via Kafka, this application can forward the data to Elasticsearch. This allows users to store and index crawled data in Elasticsearch for easy analysis and search.

  4. Easy Configuration: PyCrawlConnect is designed with easily configurable settings. Users can quickly adjust crawling settings, Kafka configurations, and Elasticsearch parameters through a structured configuration file.

  5. Comprehensive Documentation: The project comes with comprehensive documentation that explains installation steps, configuration, and how to use the application. This documentation will assist developers or other users who wish to contribute to or use the application.

  6. Open Source: PyCrawlConnect is an open-source project available on the GitHub platform. Developers can collaborate, provide feedback, or make contributions through pull requests.

By using PyCrawlConnect, users can easily manage and analyze crawled data from various sources in an efficient and structured manner. This project is expected to provide a reliable solution in the context of real-time web data processing.
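
To make the flow concrete, the sketch below shows the crawl-and-produce step using the kafka-python client and a hypothetical source API. The broker address is the Kafka VM used later in this guide, while the port, topic name, and API URL are placeholders; the actual scripts in this repository (apibook.py, apimovie.py, apinews.py) define their own endpoints and configuration.

import json

import requests
from kafka import KafkaProducer

# Sketch only: the broker port, topic name, and API URL below are
# placeholders, not values taken from this repository.
producer = KafkaProducer(
    bootstrap_servers="192.168.57.9:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

response = requests.get("https://example.com/api/books")  # hypothetical crawl target
for record in response.json():
    producer.send("crawled-data", value=record)  # one Kafka message per record

producer.flush()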

Flowchart PyCrawlConnect

Requirements

Clone the repository to your directory

# Change Directory
cd /home/

# Install gh
sudo apt install gh

# Auth gh
gh auth login

# Clone Repository
gh repo clone muhfalihr/PyCrawlConnect

NOTE: Perform a clone on the Kafka VM.

How to use it?

1. Turn on the Kafka and Elasticsearch VMs and log in as the root user.

  • Open the VirtualBox Software.

  • Select and click the Kafka VM.

  • Click Start.

  • Select and click the Elasticsearch VM.

  • Click Start.

  • Type your VM username and password.

  • Switch to superuser account (root)

    sudo su

2. Remote into the Kafka and Elasticsearch VMs.

Remote into the Kafka VM.

  • Open 4 desktop terminals.

  • SSH into the Kafka VM. Do this in each Terminal.

    ssh root@192.168.57.9

Remote into the Elasticsearch VM.

  • Open 1 desktop terminal.

  • SSH into the Elasticsearch VM.

    ssh root@192.168.57.8

3. Start the Elasticsearch and Kibana services.

# elasticsearch service
systemctl start elasticsearch.service

# kibana service
systemctl start kibana.service
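
Before moving on, it can help to confirm that Elasticsearch is answering. The check below is only a sketch: it assumes Elasticsearch listens on its default port 9200 on the Elasticsearch VM, and depending on your version and security settings it may require HTTPS and credentials.

# Optional reachability check (sketch): port 9200 is an assumption here.
import requests

resp = requests.get("http://192.168.57.8:9200")
print(resp.status_code, resp.json().get("cluster_name"))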

4. Run ZooKeeper, the Kafka server, and CMAK.

  • Change Directory

    cd kafka_2.13-3.2.0/
  • Start the ZooKeeper server. Do this in the 1st terminal.

    ./bin/zookeeper-server-start.sh ./config/zookeeper.properties
  • Start the Kafka server. Do this in the 2nd terminal.

    ./bin/kafka-server-start.sh ./config/server.properties
  • Start the Kafka UI (CMAK). Do this in the 3rd terminal.

    CMAK/target/universal/cmak-*/bin/cmak

5. Run the crawling and producer code using VS Code.

  • Open your Visual Studio Code.

  • Install the extension Remote - SSH: Editing Configuration Files.

  • Press F1 or Ctrl+Shift+P

  • Select and click Remote-SSH: Connect to Host...

  • Click Add New SSH Host...

  • Enter SSH Connection Command

  • Select /home/ubuntu/.ssh/config.

  • A "Host added" pop-up will appear at the bottom right. Then click Connect.

  • Enter the password for the SSH host you added earlier.

  • Press Ctrl+Shift+E > Click Open Folder > Type /home/PyCrawlConnect/ > Click OK.

  • Edit the .env file as needed (the sketch at the end of this list shows the kinds of variables it might hold).

  • Activate your virtual environment. In this example it is activated from the root directory.

    source .venv/my-venv/bin/activate
  • Run the files apibook.py, apimovie.py, and apinews.py. NOTE: Run each in a separate terminal in VS Code.

    # Terminal 1
    python3 apibook.py
    
    # Terminal 2
    python3 apimovie.py
    
    # Terminal 3
    python3 apinews.py
  • Install the extension Live Server.

  • Open the index.html file in the html directory. Then click Go Live at the bottom right.
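
  • For reference, the sketch below shows one way the scripts could read their settings from the .env file using python-dotenv. The variable names are hypothetical; check the repository's own .env for the real keys.

    import os

    from dotenv import load_dotenv

    # Hypothetical variable names for illustration only; the real keys are
    # defined by this repository's .env file and scripts.
    load_dotenv()

    kafka_bootstrap = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "192.168.57.9:9092")
    kafka_topic = os.getenv("KAFKA_TOPIC", "crawled-data")
    es_host = os.getenv("ELASTICSEARCH_HOST", "http://192.168.57.8:9200")

    print(kafka_bootstrap, kafka_topic, es_host)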

6. Run the code for consuming and logging to Elasticsearch in the desktop terminal.

  • Set up the .env file first.

  • Run the kafka_consumer.py file in the helper directory. Do this in the 4th terminal (a conceptual sketch follows below).

    python3 helper/kafka_consumer.py
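
  • Conceptually, the consumer reads messages from Kafka and indexes them into Elasticsearch. The sketch below illustrates that flow with the kafka-python and elasticsearch clients; it is not the repository's actual implementation, and the topic, index name, and addresses are assumptions.

    import json

    from kafka import KafkaConsumer
    from elasticsearch import Elasticsearch

    # Sketch only: topic, index, and addresses are assumptions; add
    # authentication if your cluster requires it.
    consumer = KafkaConsumer(
        "crawled-data",
        bootstrap_servers="192.168.57.9:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    es = Elasticsearch("http://192.168.57.8:9200")

    for message in consumer:
        es.index(index="crawled-data", document=message.value)  # one document per message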

7. Check the data coming into Elasticsearch.

  • Open your browser.

  • Enter the Kibana address in your browser.

    http://192.168.57.8:5601
    
  • Enter your username and password in the login form. Then press ENTER.

  • In the Management menu, click Dev Tools.

  • Retrieve a specific JSON document from an index.

    GET <index>/_doc/<_id>
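
  • Alternatively, the same check can be done from Python with the elasticsearch client, as in the sketch below. The index name and document id are placeholders, and credentials may be required depending on your setup.

    from elasticsearch import Elasticsearch

    # Sketch: replace the index name and document id with real values, and
    # add authentication if your cluster requires it.
    es = Elasticsearch("http://192.168.57.8:9200")
    doc = es.get(index="crawled-data", id="<_id>")
    print(doc["_source"])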
    

License

The PyCrawlConnect project is licensed under the Apache License 2.0.
