This repository contains a PDF crawler that extracts text from PDF documents and uploads it to Algolia for indexing and searching. It currently uses the Microsoft Document Intelligence Read model for OCR. Further development ideas are listed in `.ideas`.
- Install the required packages by running `pip install -r requirements.txt`.
- Set up the configuration by modifying the `config.json` file. Ensure that the necessary environment variables are defined.
The configuration for the PDF crawler is stored in the `config.json` file. The following parameters can be configured:

- `websites`: An array of website objects containing the URL, base URL, and skip settings for each website to be crawled.
- `pdf_scrape_enabled`: A boolean value indicating whether PDF scraping is enabled.
- `msdiread_ocr_enabled`: A boolean value indicating whether OCR using Microsoft Document Intelligence (formerly Form Recognizer) is enabled.
- `upload_enabled`: A boolean value indicating whether the upload to Algolia is enabled.
- `full_crawl`: A boolean value indicating whether a full crawl should be performed.
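A minimal `config.json` sketch using these parameters; the exact field names inside each website object (`url`, `base_url`, `skip`) are assumptions based on the descriptions above:

```json
{
  "websites": [
    {
      "url": "https://example.com/reports",
      "base_url": "https://example.com",
      "skip": false
    }
  ],
  "pdf_scrape_enabled": true,
  "msdiread_ocr_enabled": true,
  "upload_enabled": false,
  "full_crawl": false
}
```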
Ensure that the following environment variables are defined:

- `ALGOLIA_API_KEY`: Your Algolia API key
- `ALGOLIA_APP_ID`: Your Algolia application ID
- `ALGOLIA_INDEX_NAME`: The name of the Algolia index to upload the data to
- `microsoft_di_key`: Your Microsoft Document Intelligence API key
- `microsoft_di_endpoint`: The endpoint for the Microsoft Document Intelligence service
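For example, the variables can be exported in your shell before running the crawler; all values below are hypothetical placeholders to replace with your own credentials:

```shell
# Placeholder values -- substitute your own Algolia and Azure credentials.
export ALGOLIA_API_KEY="your-algolia-api-key"
export ALGOLIA_APP_ID="your-algolia-app-id"
export ALGOLIA_INDEX_NAME="pdf_documents"
export microsoft_di_key="your-document-intelligence-key"
export microsoft_di_endpoint="https://your-resource.cognitiveservices.azure.com/"
```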
To use the PDF crawler, follow these steps:

- Install the required packages from `requirements.txt`.
- Configure the settings in the `config.json` file.
- Define the necessary environment variables.
- Run the `main.py` file to start the PDF crawling process.
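The flow above can be sketched in Python. This is a minimal illustration of how the configuration might drive the crawl loop, not the actual `main.py`; the `load_config` and `crawl` helpers and the per-site fields are assumptions based on the parameter descriptions:

```python
import json

def load_config(path="config.json"):
    """Load crawler settings from a JSON file (keys assumed from this README)."""
    with open(path) as f:
        return json.load(f)

def crawl(config):
    """Visit each configured website, honoring the skip flag and feature toggles."""
    visited = []
    for site in config["websites"]:
        if site.get("skip"):
            continue  # per-site skip setting from config.json
        if config.get("pdf_scrape_enabled"):
            visited.append(site["url"])  # placeholder for actual PDF scraping/OCR
    return visited

# Demo with an inline config instead of reading config.json from disk.
demo = {
    "websites": [
        {"url": "https://example.com/docs", "base_url": "https://example.com", "skip": False},
        {"url": "https://example.org", "base_url": "https://example.org", "skip": True},
    ],
    "pdf_scrape_enabled": True,
}
print(crawl(demo))  # → ['https://example.com/docs']
```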
Contributions to the PDF crawler are welcome! Submit a pull request with any improvements or bug fixes.
This project is licensed under the MIT License - see the LICENSE.md file for details.