# corpus-data

Here are 155 public repositories matching this topic...

esbatmop / MNBVC

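MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) data. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.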

nlp chinese chinese-nlp corpus-data chinese-simplified nlp-machine-learning chinese-language

Updated May 25, 2024

shijiebei2009 / CEC-Corpus

📚 Chinese Emergency Corpus - Shanghai University - Semantic Intelligence Laboratory

Updated Sep 26, 2019

PlexPt / chatgpt-corpus

ChatGPT Chinese corpus: dialogue, fiction, and customer-service corpora for training large models

awesome corpus question-answering corpus-data

Updated May 15, 2024

sheepzh / poetry

A curated corpus of modern Chinese poetry: 3,489 poets, 81.7K poems, 15.43M characters. Continuously expanding...

nlp poetry literature corpus-data chinese-corpus

Updated Aug 1, 2023
Python

guhhhhaa / 4675-scifi

Chinese NLP corpus of Chinese science fiction: about 4,675 Chinese science fiction novels. A Chinese science fiction corpus for natural language processing, usable as a text corpus or text database.

nlp corpus science-fiction scifi chinese-nlp corpus-data datasets nlp-resources nlp-machine-learning nlp-datasets

Updated Oct 22, 2022

gkiril / oie-resources

A curated list of Open Information Extraction (OIE) resources: papers, code, data, etc.

Updated Oct 25, 2022

grammarly / ua-gec

UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language

natural-language-processing corpus dataset corpus-data corpus-tools gec nlp-datasets grammatical-error-correction ukrainian-language

Updated Feb 11, 2024
Macaulay2

guhhhhaa / wula-scifi

Chinese NLP corpus of Chinese science fiction: archive of the Ark Plan of the Ula Science Fiction Website (乌拉科幻小说网). A Chinese science fiction corpus for natural language processing, usable as a text corpus or text database.

nlp corpus science-fiction scifi chinese-nlp corpus-data datasets nlp-resources nlp-machine-learning nlp-datasets

Updated Oct 22, 2022

hailiang-wang / egret-wenda-corpus

A Public Corpus for Machine Learning

qa corpus corpus-data

Updated Jul 3, 2018
JavaScript

NathanDuran / Switchboard-Corpus

Utilities for Processing the Switchboard Dialogue Act Corpus

dialogue corpus corpus-data corpus-tools switchboard dialogues corpus-processing dialogue-data switchboard-corpus dialogue-act

Updated Jan 24, 2021
Python

shijiebei2009 / CEEC-Corpus

📚 Chinese Environment Emergency Corpus - Shanghai University - Semantic Intelligence Laboratory

Updated Nov 3, 2015

KehaoWu / Jinyong-Corpus

Dictionary of Jin Yong's 15 novels

nlp corpus-data

Updated Nov 17, 2018

dataset-vn / DANeS

DANeS is an open-source e-newspaper dataset created in collaboration between DATASET JSC (dataset.vn) and AIV Group (aivgroup.vn)

open-source machine-learning natural-language-processing corpus artificial-intelligence dataset newspaper corpus-data text-sentiment danes datasetvn aivgroup

Updated May 11, 2022
Python

fbougares / TSAC

Tunisian Sentiment Analysis Corpus.

sentiment-analysis corpus-data arabic dialect tunisian

Updated Jan 12, 2021

undertheseanlp / corpus.viwiki

Vietnamese Wikipedia Corpus

vietnamese corpus-linguistics corpus-data vietnamese-nlp

Updated May 18, 2017
Python

ibnmalik / golden-corpus-arabic

Golden Arabic corpus built to test Assem's arabicstemmer and other Arabic stemmers

corpus stemmer corpus-data arabic corpus-generator corpurate

Updated Aug 24, 2018
Python

PolMine / GermaParlTEI

GermaParl: Corpus of Plenary Protocols of the German Bundestag (TEI Format)

text-mining corpus-data parliamentary-data

Updated Jun 1, 2023

PakUrdu-Research-Center / awesome-urdu

Repository dedicated to a collection of resources and supporting material for Urdu language processing tasks

corpus open-data awesome-list corpus-data research-paper urdu urdu-nlp urdu-text-processsing urdu-language

Updated Oct 24, 2019

NathanDuran / MRDA-Corpus

Utilities for Processing the Meeting Recorder Dialogue Act Corpus

dialogue corpus corpus-data corpus-tools dialogues corpus-processing dialogue-act

Updated Jan 24, 2021
Python

magizbox / scraper

Scraper

crawler vietnamese corpus-linguistics corpus-data vietnamese-nlp

Updated Dec 21, 2018
Python
