
A not-complete list of datasets for NLP tasks. All the rights to original authors 🙏.


nluninja/nlp_datasets


A not-complete list of datasets for NLP tasks

A list of datasets I collected for NLP tasks. Fork or clone this repository to get all the available datasets.

git clone https://github.com/nluninja/nlp_datasets

Available datasets

| Name | Description | Classes | Format | Language |
| --- | --- | --- | --- | --- |
| 20 Newsgroups dataset | file set arranged into 20 topic folders | see corpus page | files | en |
| The Anatomical Entity Mention (AnEM) corpus | PubMed dataset | Anatomical_system, Cell, Cellular_component, Developing_anatomical_structure, Immaterial_anatomical_entity, Multi-tissue_structure, Organ, Organism_subdivision, Organism_substance, Pathological_formation, Tissue | conll/iob2 | |
| AG News Topic dataset | News Topic Classification dataset - Antonio Gulli - UniPi | World, Sports, Business, Sci/Tech | csv | en |
| CoNLL 2003 | named entity recognition dataset | People, Location, Organization, Misc | conll/iob2 | en |
| emotions classification dataset | emotion classification dataset which contains tweets labeled into 6 categories | joy, sadness, anger, fear, love, surprise | csv | en |
| Georgetown University Multilayer corpus in CoNLL | CoNLL tagged corpus for entity extraction | 23 classes (person, substance, quantity, time, place, organization) | conll/iob2 | en |
| Relationship and Entity Extraction Evaluation Dataset in CoNLL | CoNLL tagged corpus for entity extraction | 21 classes (person, temporal, weapon, MilitaryPlatform, quantity, organization) | conll/iob2 | en |
| sentiment140 dataset | dataset which contains tweets labeled according to their polarity | negative, neutral, positive | csv | en |
| Toxic Comments dataset | Reviews Wikipedia comments labeled into 6 categories with score | toxic, severe_toxic, obscene, threat, insult, identity_hate | csv | en |
| WikiGold Dataset | named entity recognition dataset | People, Location, Organization, Misc | conll/iob2 | en |
| Wikipedia Movie Plots dataset | descriptions of movies from around the world scraped from Wikipedia | Genre Classes | csv | en |
| WNUT 17 Emerging Entities Dataset | Twitter/StackOverflow data for discovering emerging entities | Entity Classes | conll/iob2 | en |
| Yelp! Reviews | reviews dataset from Yelp! for classification/sentiment analysis tasks | ratings from 1 to 5 | csv | en |
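
Several of the corpora above are listed as conll/iob2. As a rough sketch of how such a file might be read, the Python below groups lines into sentences and collects entity spans from the B-/I- tags. It assumes whitespace-separated columns with the token first and the IOB2 tag last, blank lines between sentences, and optional `-DOCSTART-` marker lines; the exact column layout of each corpus in this repo is not checked here. The function names `read_conll` and `iob2_spans` and the sample text (including its PER/ORG/MISC tags) are illustrative only.

```python
from typing import Iterable, List, Optional, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def read_conll(lines: Iterable[str]) -> List[Sentence]:
    """Group CoNLL-style lines into sentences (blank line = sentence break)."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.strip()
        # Assumption: blank lines and -DOCSTART- markers separate sentences.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # Assumption: token in the first column, IOB2 tag in the last one.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def iob2_spans(sentence: Sentence) -> List[Tuple[str, str]]:
    """Collect (entity text, label) pairs from the B-/I- tags of one sentence."""
    spans: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label: Optional[str] = None
    for token, tag in sentence:
        # An I- tag with the same label continues the open entity.
        if tag.startswith("I-") and label == tag[2:]:
            tokens.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if tokens and label is not None:
            spans.append((" ".join(tokens), label))
        tokens, label = [], None
        # Only a B- tag opens a new entity; a stray I- tag is ignored.
        if tag.startswith("B-"):
            tokens, label = [token], tag[2:]
    if tokens and label is not None:
        spans.append((" ".join(tokens), label))
    return spans


# Made-up sample in a four-column CoNLL 2003 style layout.
sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""

sentences = read_conll(sample.splitlines())
for sentence in sentences:
    print(iob2_spans(sentence))
```

To try it on a real file, pass an open file handle instead of `sample.splitlines()`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points at one of the conll/iob2 files in your clone.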

I appreciate contributions to this repo, so don't hesitate to submit your changes via pull request, whether to fix a bug or to add a new dataset!

Pull requests: https://github.com/nluninja/nlp_datasets

Use the corpus_template when uploading a new dataset. I look forward to seeing your contribution! 🙏 😘
