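MNBVC (Massive Never-ending BT Vast Chinese corpus) is a very large-scale Chinese corpus that aims to match the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also various niche subcultures, and even text written in "Martian script" (火星文). It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.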
Updated May 25, 2024
Thai News Dataset from Thai government website.
📑 Galician corpus for misogyny detection
A parser for annotated MuseScore 3 files.
This is a dataset consisting of all song lyric words found on all of Taylor Swift's studio albums (up to and including TTPD), as well as a selection of other songs written by her.
ChatGPT Chinese corpus: dialogue, novel, and customer-service corpora for training large models.
J-SpAW: Japanese speech corpus for speaker verification and anti-spoofing
This repository contains the DFKI Product Corpus, a dataset of 174 documents annotated for product and company named entities, and the relation CompanyProvidesProduct.
Repository of data and results for an undergraduate thesis titled "A Corpus-Based Study to Triangulating Experimental Evidence Regarding Verb-Noun Association for Action Verbs" by I Gede Semara Dharma Putra.
An R package for D. H. Lawrence's novels.
Text deduplication.
Cantonese corpus filter (Cantonese text filter)
General Missives in Text-Fabric
Scripts for bots, web scrapers, and web crawlers for research.
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
A dataset built from Wikivitals articles: a corpus of documents with an underlying graph structure and a hierarchy of labels
Code for final assignment for CLS course at the University of Antwerp (SoSe 2022)
Public Domain Words and Texts for Conlangs