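MNBVC (Massive Never-ending BT Vast Chinese corpus) is an ultra-large-scale Chinese corpus that aims to match the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures, including even "Martian script" (火星文) text. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.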
Updated May 25, 2024
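ChatGPT Chinese corpus: dialogue, fiction, and customer-service corpora for training large models.
A curated corpus of modern Chinese poetry: 3,489 poets, 81.7K poems, 15.43M characters. Continuously expanding...
Chinese NLP corpus of Chinese science fiction: about 4,675 Chinese science fiction novels. A text corpus and text database of Chinese science fiction for natural language processing.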
A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.
UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language
Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula Science Fiction Website (乌拉科幻小说网). A text corpus and text database of Chinese science fiction for natural language processing.
Utilities for Processing the Switchboard Dialogue Act Corpus
DANeS is an open-source e-newspaper dataset built in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)
Tunisian Sentiment Analysis Corpus.
Vietnamese Wikipedia Corpus
Golden Arabic corpus built for testing Assem's arabicstemmer and other Arabic stemmers
GermaParl: Corpus of Plenary Protocols of the German Bundestag (TEI Format)
Repository collecting resources and supporting material for Urdu language processing tasks
Utilities for Processing the Meeting Recorder Dialogue Act Corpus
Scraper