text-extraction

Here are 209 public repositories matching this topic...

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

python nlp pagerank-algorithm text-extraction reduction summarization html-page summary lsa sumy textteaser summarizer html-extraction html-extractor

Updated May 6, 2024
Python

unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)

golang pdf signing text-extraction pdf-generator pdf-generation pdf-reader pdf-manipulation pdf-library pdf-document-processor pdf-compression pdf-sign pdf-reports

Updated May 1, 2024
Go

whitelok / image-text-localization-recognition

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition.

machine-learning awesome ocr deep-learning text-extraction text-recognition deep-learning-algorithms convolutional-neural-networks text-detection scene-texts

Updated Sep 17, 2023

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Updated Apr 14, 2024
Python

adbar / trafilatura

Python & command-line tool to gather text on the Web: web crawling/scraping, extraction of text, metadata, comments

Updated May 13, 2024
Python

unidoc / unidoc

This repository has moved! https://github.com/unidoc/unipdf

golang pdf text-extraction pdf-files pdf-invoice unidoc pdf-library

Updated May 23, 2019
Go

vsymbol / CUTIE

CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

computer-vision deep-learning text-extraction

Updated Dec 8, 2022
Python

miso-belica / jusText

Heuristic-based boilerplate removal tool

python text-extraction html-parser html-parsing

Updated May 9, 2024
Python

ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

r text-extraction rstats pdf-files r-package poppler pdf-format poppler-library pdftools

Updated Oct 9, 2023
C++

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping news-crawler commoncrawl web-corpus news-scraping cc-news

Updated May 13, 2024
Python

ICIJ / datashare

A self-hosted search engine for documents.

docker elasticsearch extract text-extraction named-entity-recognition web-gui datashare investigative-journalism

Updated May 14, 2024
Java

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

python library tools command-line text-extraction subtitles subtitle srt subtitles-parsing mit-license command-line-tool subtitle-parser subtitle-fixer

Updated Mar 19, 2024
Python

nainiayoub / pdf-text-data-extractor

PDF text data extraction web app with OCR for scanned documents

python pdf ocr text-extraction pdf-to-text ocr-text-reader ocr-python streamlit streamlit-webapp

Updated Jul 6, 2023
Python

skylander86 / lambda-text-extractor

AWS Lambda functions to extract text from various binary formats.

pdf ocr aws-lambda lambda-functions tesseract text-extraction searchable-pdfs pdf-ocr-extraction

Updated Feb 7, 2018
Python

pd3f / pd3f

🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

python pdf machine-learning ocr pipeline text-extraction pdf-to-text language-model extract-text parsr pd3f

Updated Oct 13, 2023
HTML

shixzie / nlp

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

nlp go golang natural-language-processing parse text text-extraction

Updated Sep 18, 2017
Go

archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

scala big-data spark apache-spark hadoop analysis python3 text-extraction pyspark digital-humanities dataframe big-data-analytics webarchives network-graphing

Updated Feb 27, 2024
Scala

rajesh-bhat / spark-ai-summit-2020-text-extraction

keras cnn text-extraction lstm text-recognition text-detection summit ctc-loss spark-ai

Updated Dec 7, 2020
Jupyter Notebook

TYPO3-Solr / ext-tika

A TYPO3 CMS extension that provides Apache Tika functionality

search php metadata cms cms-extension tika language-detection typo3 typo3-cms-extension file-indexing text-extraction

Updated May 14, 2024
PHP

bookieio / breadability

Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now a living alternative)

python text-mining text-extraction html-parsing html-extraction html-extractor

Updated May 9, 2024
HTML
