Module for automatic summarization of text documents and HTML pages.
Golang PDF library for creating and processing PDF files (pure Go)
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments
This repository has moved! https://github.com/unidoc/unipdf
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
Heuristic-based boilerplate removal tool
Text Extraction, Rendering and Converting of PDF Documents
A very simple news crawler with a funny name
A self-hosted search engine for documents.
A simple library and set of tools for parsing, modifying, and composing SRT files.
PDF text data extraction web app with OCR for scanned documents
AWS Lambda functions to extract text from various binary formats.
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
[UNMAINTAINED] Extract values from strings and fill your structs with NLP.
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
A TYPO3 CMS extension that provides Apache Tika functionality
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)