Project to connect crawled data to Kafka and monitor it using Elasticsearch. Still under development.


Instagram: ____mfr.py

PyCrawlConnect

ProjectImage

Application Description:

PyCrawlConnect is an application developed in the Python programming language that connects data obtained from web crawling to Apache Kafka and subsequently forwards it to Elasticsearch. The application is designed to provide an efficient solution for managing crawled data end to end.

Key Features:

  1. Web Data Crawling: The application can crawl data from various websites using web scraping techniques or APIs provided by specific sites. The crawling module is designed for flexibility and easy configuration.

  2. Kafka Integration: PyCrawlConnect features the capability to connect crawled data to Apache Kafka. Kafka is used as middleware to manage message queues, ensuring reliability and fault tolerance in the data delivery process.

  3. Elasticsearch Connector: After data is generated and sent via Kafka, this application can forward the data to Elasticsearch. This allows users to store and index crawled data in Elasticsearch for easy analysis and search.

  4. Easy Configuration: PyCrawlConnect is designed with easily configurable settings. Users can quickly adjust crawling settings, Kafka configurations, and Elasticsearch parameters through a structured configuration file.

  5. Comprehensive Documentation: The project comes with comprehensive documentation that explains installation steps, configuration, and how to use the application. This documentation will assist developers or other users who wish to contribute to or use the application.

  6. Open Source: PyCrawlConnect is an open-source project available on the GitHub platform. Developers can collaborate, provide feedback, or make contributions through pull requests.

By using PyCrawlConnect, users can easily manage and analyze crawled data from various sources in an efficient and structured manner. This project is expected to provide a reliable solution in the context of real-time web data processing.
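
To make the flow concrete, the sketch below shows the crawl-and-produce step using the kafka-python client and a hypothetical source API. The broker address is the Kafka VM used later in this guide, while the port, topic name, and API URL are placeholders; the actual scripts in this repository (apibook.py, apimovie.py, apinews.py) define their own endpoints and configuration.

import json

import requests
from kafka import KafkaProducer

# Sketch only: the broker port, topic name, and API URL below are
# placeholders, not values taken from this repository.
producer = KafkaProducer(
    bootstrap_servers="192.168.57.9:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

response = requests.get("https://example.com/api/books")  # hypothetical crawl target
for record in response.json():
    producer.send("crawled-data", value=record)  # one Kafka message per record

producer.flush()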

Flowchart PyCrawlConnect

Requirements

Clone the repository to your directory

# Change Directory
cd /home/

# Install gh
sudo apt install gh

# Auth gh
gh auth login

# Clone Repository
gh repo clone muhfalihr/PyCrawlConnect

NOTE: Perform a clone on the Kafka VM.

How to use it?

1. Turn on the Kafka and Elasticsearch VMs and log in as the root user.

  • Open the VirtualBox Software.

  • Select and click the Kafka VM.

  • Click Start.

  • Select and click the Elasticsearch VM.

  • Click Start.

  • Type your VM username and password.

  • Switch to superuser account (root)

    sudo su

2. Remote into the Kafka and Elasticsearch VMs.

Remote into the Kafka VM.

  • Open 4 desktop terminals.

  • SSH into the Kafka VM. Do this in each Terminal.

    ssh root@192.168.57.9

Remote into the Elasticsearch VM.

  • Open 1 desktop terminal.

  • SSH into the Elasticsearch VM.

    ssh root@192.168.57.8

3. Start the Elasticsearch and Kibana services.

# elasticsearch service
systemctl start elasticsearch.service

# kibana service
systemctl start kibana.service
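
Before moving on, it can help to confirm that Elasticsearch is answering. The check below is only a sketch: it assumes Elasticsearch listens on its default port 9200 on the Elasticsearch VM, and depending on your version and security settings it may require HTTPS and credentials.

# Optional reachability check (sketch): port 9200 is an assumption here.
import requests

resp = requests.get("http://192.168.57.8:9200")
print(resp.status_code, resp.json().get("cluster_name"))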

4. Run ZooKeeper, the Kafka server, and CMAK.

  • Change Directory

    cd kafka_2.13-3.2.0/
  • Start the ZooKeeper server. Do this in the 1st terminal.

    ./bin/zookeeper-server-start.sh ./config/zookeeper.properties
  • Start the Kafka server. Do this in the 2nd terminal.

    ./bin/kafka-server-start.sh ./config/server.properties
  • Start the Kafka UI (CMAK). Do this in the 3rd terminal.

    CMAK/target/universal/cmak-*/bin/cmak

5. Run the crawling and producer code using VS Code.

  • Open your Visual Studio Code.

  • Install the extension Remote - SSH: Editing Configuration Files.

  • Press F1 or Ctrl+Shift+P

  • Select and click Remote-SSH: Connect to Host...

  • Click Add New SSH Host...

  • Enter SSH Connection Command

  • Select /home/ubuntu/.ssh/config.

  • A "Host added" pop-up will appear at the bottom right. Then click Connect.

  • Enter the password for the SSH host you added earlier.

  • Press Ctrl+Shift+E > Click Open Folder > Type /home/PyCrawlConnect/ > Click OK.

  • Edit the .env file as needed (the sketch at the end of this list shows the kinds of variables it might hold).

  • Activate your virtual environment. In this example it is activated from the root directory.

    source .venv/my-venv/bin/activate
  • Run the files apibook.py, apimovie.py, and apinews.py. NOTE: Run each in a separate terminal in VS Code.

    # Terminal 1
    python3 apibook.py
    
    # Terminal 2
    python3 apimovie.py
    
    # Terminal 3
    python3 apinews.py
  • Install the extension Live Server.

  • Open the index.html file in the html directory. Then click Go Live at the bottom right.
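
  • For reference, the sketch below shows one way the scripts could read their settings from the .env file using python-dotenv. The variable names are hypothetical; check the repository's own .env for the real keys.

    import os

    from dotenv import load_dotenv

    # Hypothetical variable names for illustration only; the real keys are
    # defined by this repository's .env file and scripts.
    load_dotenv()

    kafka_bootstrap = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "192.168.57.9:9092")
    kafka_topic = os.getenv("KAFKA_TOPIC", "crawled-data")
    es_host = os.getenv("ELASTICSEARCH_HOST", "http://192.168.57.8:9200")

    print(kafka_bootstrap, kafka_topic, es_host)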

6. Run the code for consuming and logging to Elasticsearch in the desktop terminal.

  • Set up the .env file first.

  • Run the kafka_consumer.py file in the helper directory. Do this in the 4th terminal (a conceptual sketch follows below).

    python3 helper/kafka_consumer.py
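
  • Conceptually, the consumer reads messages from Kafka and indexes them into Elasticsearch. The sketch below illustrates that flow with the kafka-python and elasticsearch clients; it is not the repository's actual implementation, and the topic, index name, and addresses are assumptions.

    import json

    from kafka import KafkaConsumer
    from elasticsearch import Elasticsearch

    # Sketch only: topic, index, and addresses are assumptions; add
    # authentication if your cluster requires it.
    consumer = KafkaConsumer(
        "crawled-data",
        bootstrap_servers="192.168.57.9:9092",
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    )
    es = Elasticsearch("http://192.168.57.8:9200")

    for message in consumer:
        es.index(index="crawled-data", document=message.value)  # one document per message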

7. Check the data coming into Elasticsearch.

  • Open your browser.

  • Enter the Kibana address in your browser.

    http://192.168.57.8:5601
    
  • Enter your username and password in the login form. Then press ENTER.

  • In the Management menu, click Dev Tools.

  • Retrieve a specific JSON document from an index.

    GET <index>/_doc/<_id>
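
  • Alternatively, the same check can be done from Python with the elasticsearch client, as in the sketch below. The index name and document id are placeholders, and credentials may be required depending on your setup.

    from elasticsearch import Elasticsearch

    # Sketch: replace the index name and document id with real values, and
    # add authentication if your cluster requires it.
    es = Elasticsearch("http://192.168.57.8:9200")
    doc = es.get(index="crawled-data", id="<_id>")
    print(doc["_source"])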
    

License

The PyCrawlConnect project is licensed under the Apache License 2.0.
