(Module under active development) Retrieves the parsed content of Wikipedia articles. Created to gather text corpus data quickly and easily, but can be freely used for other purposes too.
Updated Jan 3, 2023 - Python
The AP Exam Corpus Project is a Python application that generates corpora for AP exams.
Tools for creating speech corpora by extracting audio from YouTube videos
Helps you convert an SRT subtitle file into a CN-? parallel corpus
Python scripts for the construction of the LEXB parallel corpus of South Tyrolean legislation (IT-DE).
Open-source Python package that produces word sketches inspired by Sketch Engine, intended for reproducible analyses
Python API for extracting data from the MPQA corpus
This package provides Python utility classes and static methods that make use of third-party software commonly used in text processing, such as Unitex-GramLab, TreeTagger, Apache-Tika and Google-Tesseract.
Branches of https://victorio.uit.no/langtech/trunk/tools/CorpusTools used by Giellatekno.UiT.no for corpus gathering.
Forpus is a Python library for converting plain text corpora into various corpus formats.
Corpus analysis of plain text that reports the type-token ratio and some other statistics.
Tool to generate lists of Bengali words and transcriptions matching given phonological descriptions
An open-source web-based application for multi-task lexical normalisation
Cod yr ap Paldaruo i iOS ar gyfer torfoli casglu corpws lleferydd | Code for the Paldaruo speech corpus crowdsourcing app for iOS
Linguistic resources for adapting FreeLing to Chilean Spanish
Utility to guess some affix splits in Cherokee texts. Developed for use with the Moses machine translation software.
Online parallel text alignment tool.
Analyzes binary executables and generates a test corpus for defined instruction paths, for each discovered function, or to reach every basic block detected in the non-library/shared-object parts of the binary's text section.
Tidy concordances, collocates, and wordlists
Repository providing pre-processed Wikipedia and Simple Wikipedia datasets, along with Python scripts for pre-processing and dataset generation.