Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
Updated Jun 6, 2024 - Python
A self-hosted search engine for documents.
PDF text data extraction web app with OCR for scanned documents
A very simple news crawler with a funny name
AI Media and Misinformation Content Analysis Tool: Analyze text and images
RAG with LM Studio, local LLMs, scientific PDF text extraction,
Fan translation tools for SCUMM engine games
This repository contains code for a simple application to detect text in images using Python, Optical Character Recognition (OCR), and Streamlit for creating a user-friendly web application. The application allows users to upload images or capture them via camera input and extracts text present
Extract embedded metadata from HTML markup
Golang PDF library for creating and processing PDF files (pure Go)
Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
Case study using dotfurther's Open Discover Platform with the RavenDB document store to rapidly create a full-text search/eDiscovery/information governance capable demonstration application.
Get text content from any file
Translate visual novels in real time
Module for automatic summarization of text documents and HTML pages.
This GitHub repository hosts the notebooks and tools developed as part of this thesis to automate the extraction, processing, and analysis of data from the MICCAI 2023 conference, aiding in the systematic review and providing a structured foundation for further research in this crucial area.
A TYPO3 CMS extension that provides Apache Tika functionality