# corpus-data

Here are 155 public repositories matching this topic...

esbatmop / MNBVC

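MNBVC (Massive Never-ending BT Vast Chinese corpus): a very large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) text. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripted dialogue, forum posts, wiki, classical poetry, song lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.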

nlp chinese chinese-nlp corpus-data chinese-simplified nlp-machine-learning chinese-language

Updated May 25, 2024

PyThaiNLP / thaigov-v2-corpus

Thai News Dataset from Thai government website.

corpus thai-language corpus-data thai-nlp pythainlp

Updated May 25, 2024
Jupyter Notebook

luciamariaalvarezcrespo / GalMisoCorpus2023

📑 Galician corpus for misogyny detection

nlp machine-learning corpus corpus-data nlp-machine-learning misogyny galician misogyny-detection

Updated May 24, 2024
Python

johentsch / ms3

A parser for annotated MuseScore 3 files.

Updated May 23, 2024
Python

sagesolar / Corpus-of-Taylor-Swift

This is a dataset consisting of all song lyric words found on all of Taylor Swift's studio albums (up to and including TTPD), as well as a selection of other songs written by her.

song-dataset corpus taylor-swift corpus-data song-lyrics ttpd

Updated May 18, 2024

PlexPt / chatgpt-corpus

ChatGPT Chinese corpus: dialogue, novel, and customer-service corpora for training large models.

awesome corpus question-answering corpus-data

Updated May 15, 2024

takamichi-lab / j-spaw

J-SpAW: Japanese speech corpus for speaker verification and anti-spoofing

japanese corpus-data anti-spoofing

Updated May 12, 2024

DFKI-NLP / product-corpus

This repository contains the DFKI Product Corpus, a dataset of 174 documents annotated for product and company named entities, and the relation CompanyProvidesProduct.

nlp natural-language-processing corpus information-extraction english dataset named-entity-recognition corpus-data ner relation-extraction

Updated May 8, 2024

complexico / verb-noun-assoc-corpus-experiment

Repository of data and results for an undergraduate thesis titled "A Corpus-Based Study to Triangulating Experimental Evidence Regarding Verb-Noun Association for Action Verbs" by I Gede Semara Dharma Putra.

experiment corpus-linguistics corpus-data experimental-data udayana-university complexico corpus-of-contemporary-american-english multi-methodology

Updated May 8, 2024

MHenderson / dhlawrencer

An R package for D. H. Lawrence's novels.

rstats corpus-data english-literature rstats-package

Updated Apr 29, 2024
R

MHenderson / thomashardyr

An R package for Thomas Hardy's novels.

rstats corpus-data rstats-package

Updated Apr 29, 2024
R

aplmikex / deduplication_mnbvc

Text deduplication.

nlp chinese chinese-nlp corpus-data chinese-simplified nlp-machine-learning chinese-language

Updated May 23, 2024
Python

EvgeniaViskovatykh / Quantitative-analysis-of-semantic-shift

nlp machine-learning word-embeddings corpus-data embeddings-word2vec linguistic-analysis corpus-processing

Updated Apr 18, 2024
Jupyter Notebook

CanCLID / canto-filter

Cantonese corpus text filter

nlp data corpus cantonese corpus-data cantonese-language

Updated Apr 14, 2024
Python

CLARIAH / wp6-missieven

General Missives in Text-Fabric

nlp history dutch corpus-linguistics corpus-data corpus-tools corpus-processing

Updated Mar 27, 2024
Jupyter Notebook

wsricardo / news-crawler

Bot, web scraping, and web crawler scripts for research.

python crawler machine-learning news web technology corpus requests beautifulsoup corpus-data webscrapping noticias corpus-news

Updated Mar 23, 2024
Jupyter Notebook

grammarly / ua-gec

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

natural-language-processing corpus dataset corpus-data corpus-tools gec nlp-datasets grammatical-error-correction ukrainian-language

Updated Feb 11, 2024
Macaulay2

ToineSayan / wikivitals-lvl5-04-2022

A dataset built from Wikivitals articles: a corpus of documents with an underlying graph structure and a hierarchy of labels

nlp wikipedia dataset corpus-data dataset-generation

Updated Feb 7, 2024
Jupyter Notebook

dimboump / crosswriters

Code for final assignment for CLS course at the University of Antwerp (SoSe 2022)

supervised-learning corpus-data support-vector-machines computational-literary-studies

Updated Jan 23, 2024
Jupyter Notebook

termsurf / mist

Public Domain Words and Texts for Conlangs

vocabulary corpus linguistics corpus-data sentences conlanging conlang word-list

Updated Dec 12, 2023
JavaScript
