GitHub - natasha/corus: Links to Russian corpora + Python functions for loading and parsing

Links to publicly available Russian corpora + code for loading and parsing. 20+ datasets, 350Gb+ of text.

Usage

For example, let's use the dump of lenta.ru by @yutkin. Manually download the archive (link in the Reference section):

wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz

Use corus to load the data:

>>> from corus import load_lenta

>>> path = 'lenta-ru-news.csv.gz'
>>> records = load_lenta(path)
>>> next(records)

LentaRecord(
    url='https://lenta.ru/news/2018/12/14/cancer/',
    title='Названы регионы России с\xa0самой высокой смертностью от\xa0рака',
    text='Вице-премьер по социальным вопросам Татьяна Голикова рассказала, в каких регионах России зафиксирована наиболее высокая смертность от рака, сооб...',
    topic='Россия',
    tags='Общество'
)
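
The title in the record above contains `\xa0`, a non-breaking space (here after the prepositions «с» and «от»). A minimal sketch of replacing it with a regular space before further processing, assuming plain spaces are acceptable for your task. The helper name `normalize_spaces` is hypothetical and not part of corus:

```python
def normalize_spaces(text):
    """Replace non-breaking spaces (U+00A0) with regular spaces."""
    return text.replace('\xa0', ' ')


# Title copied from the record shown above.
title = 'Названы регионы России с\xa0самой высокой смертностью от\xa0рака'
clean = normalize_spaces(title)
print(clean)
```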

Iterate over texts:

>>> records = load_lenta(path)
>>> for record in records:
...     text = record.text
...     ...
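
The example calls `load_lenta` again before the loop, which suggests the loader returns a one-pass iterator rather than a list. Under that assumption, a sketch of inspecting only the first few records with `itertools.islice`, so the rest of the archive is never read. `fake_load_lenta` and `Record` are hypothetical stand-ins so the snippet runs without the archive; with the real data you would pass `load_lenta(path)` instead:

```python
from collections import Counter, namedtuple
from itertools import islice

# Hypothetical stand-in for LentaRecord, using the field names shown above.
Record = namedtuple('Record', ['url', 'title', 'text', 'topic', 'tags'])


def fake_load_lenta():
    """Stand-in for load_lenta(path): yields records one at a time."""
    rows = [
        ('url-1', 'title-1', 'text-1', 'Россия', 'Общество'),
        ('url-2', 'title-2', 'text-2', 'Мир', 'Политика'),
        ('url-3', 'title-3', 'text-3', 'Россия', 'Общество'),
        ('url-4', 'title-4', 'text-4', 'Спорт', 'Футбол'),
    ]
    for row in rows:
        yield Record(*row)


def topic_counts(records, limit):
    """Count topics in at most `limit` records; later records are not consumed."""
    return Counter(record.topic for record in islice(records, limit))


counts = topic_counts(fake_load_lenta(), limit=3)
print(counts.most_common())
```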

For links to other datasets and their loaders, see the Reference section.

Documentation

Materials are in Russian:

Install

corus supports Python 3.5+, PyPy 3.

$ pip install corus

Reference

Dataset	API `from corus import`	Tags	Texts	Uncompressed	Description
Lenta.ru
Lenta.ru v1.0	`load_lenta` `#`	`news`	739 351	1.66 Gb	`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz`
Lenta.ru v1.1+	`load_lenta2` `#`	`news`	800 975	1.94 Gb	`wget https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.1/lenta-ru-news.csv.bz2`
Lib.rus.ec	`load_librusec` `#`	`fiction`	301 871	144.92 Gb	Dump of lib.rus.ec prepared for RUSSE workshop `wget http://panchenko.me/data/russe/librusec_fb2.plain.gz`
Rossiya Segodnya	`load_ria_raw` `#` `load_ria` `#`	`news`	1 003 869	3.70 Gb	`wget https://github.com/RossiyaSegodnya/ria_news_dataset/raw/master/ria.json.gz`
Mokoron Russian Twitter Corpus	`load_mokoron` `#`	`social` `sentiment`	17 633 417	1.86 Gb	Russian Twitter sentiment markup. Manually download https://www.dropbox.com/s/9egqjszeicki4ho/db.sql
Wikipedia	`load_wiki` `#`		1 541 401	12.94 Gb	Russian Wiki dump `wget https://dumps.wikimedia.org/ruwiki/latest/ruwiki-latest-pages-articles.xml.bz2`
GramEval2020	`load_gramru` `#`		162 372	30.04 Mb	`wget https://github.com/dialogue-evaluation/GramEval2020/archive/master.zip` `unzip master.zip` `mv GramEval2020-master/dataTrain train` `mv GramEval2020-master/dataOpenTest dev` `rm -r master.zip GramEval2020-master` `wget https://github.com/AlexeySorokin/GramEval2020/raw/master/data/GramEval_private_test.conllu`
OpenCorpora	`load_corpora` `#`	`morph`	4 030	20.21 Mb	`wget http://opencorpora.org/files/export/annot/annot.opcorpora.xml.zip`
RusVectores SimLex-965	`load_simlex` `#`	`emb` `sim`			`wget https://rusvectores.org/static/testsets/ru_simlex965_tagged.tsv` `wget https://rusvectores.org/static/testsets/ru_simlex965.tsv`
Omnia Russica	`load_omnia` `#`	`morph` `web` `fiction`		489.62 Gb	Taiga + Wiki + Araneum. Read "Even larger Russian corpus" https://events.spbu.ru/eventsContent/events/2019/corpora/corp_sborn.pdf Manually download http://bit.ly/2ZT4BY9
factRuEval-2016	`load_factru` `#`	`ner` `news`	254	969.27 Kb	Manual PER, LOC, ORG markup prepared for 2016 Dialog competition `wget https://github.com/dialogue-evaluation/factRuEval-2016/archive/master.zip` `unzip master.zip` `rm master.zip`
Gareev	`load_gareev` `#`	`ner` `news`	97	455.02 Kb	Manual PER, ORG markup (no LOC). Email Rinat Gareev (gareev-rm@yandex.ru) to ask for the dataset. `tar -xvf rus-ner-news-corpus.iob.tar.gz` `rm rus-ner-news-corpus.iob.tar.gz`
Collection5	`load_ne5` `#`	`ner` `news`	1 000	2.96 Mb	News articles with manual PER, LOC, ORG markup `wget http://www.labinform.ru/pub/named_entities/collection5.zip` `unzip collection5.zip` `rm collection5.zip`
WiNER	`load_wikiner` `#`	`ner`	203 287	36.15 Mb	Sentences from Wiki auto annotated with PER, LOC, ORG tags `wget https://github.com/dice-group/FOX/raw/master/input/Wikiner/aij-wikiner-ru-wp3.bz2`
BSNLP-2019	`load_bsnlp` `#`	`ner`	464	1.16 Mb	Markup prepared for 2019 BSNLP Shared Task `wget http://bsnlp.cs.helsinki.fi/TRAININGDATA_BSNLP_2019_shared_task.zip` `wget http://bsnlp.cs.helsinki.fi/TESTDATA_BSNLP_2019_shared_task.zip` `unzip TRAININGDATA_BSNLP_2019_shared_task.zip` `unzip TESTDATA_BSNLP_2019_shared_task.zip -d test_pl_cs_ru_bg` `rm TRAININGDATA_BSNLP_2019_shared_task.zip TESTDATA_BSNLP_2019_shared_task.zip`
Persons-1000	`load_persons` `#`	`ner` `news`	1 000	2.96 Mb	Same as Collection5, only PER markup + normalized names `wget http://ai-center.botik.ru/Airec/ai-resources/Persons-1000.zip`
The Russian Drug Reaction Corpus (RuDReC)	`load_rudrec` `#`	`ner`	4 809	1.73 Kb	RuDReC is a new partially annotated corpus of consumer reviews in Russian about pharmaceutical products for the detection of health-related named entities and the effectiveness of pharmaceutical products. Here you can download and work with the annotated part, to get the raw part (1.4M reviews) please refer to https://github.com/cimm-kzn/RuDReC. `wget https://github.com/cimm-kzn/RuDReC/raw/master/data/rudrec_annotated.json`
Taiga	Large collection of Russian texts from various sources: news sites, magazines, literature, social networks `wget https://linghub.ru/static/Taiga/retagged_taiga.tar.gz` `tar -xzvf retagged_taiga.tar.gz`
Arzamas	`load_taiga_arzamas` `#`	`news`	311	4.50 Mb
Fontanka	`load_taiga_fontanka` `#`	`news`	342 683	786.23 Mb
Interfax	`load_taiga_interfax` `#`	`news`	46 429	77.55 Mb
KP	`load_taiga_kp` `#`	`news`	45 503	61.79 Mb
Lenta	`load_taiga_lenta` `#`	`news`	36 446	95.15 Mb
Taiga/N+1	`load_taiga_nplus1` `#`	`news`	7 696	24.96 Mb
Magazines	`load_taiga_magazines` `#`		39 890	2.19 Gb
Subtitles	`load_taiga_subtitles` `#`		19 011	909.08 Mb
Social	`load_taiga_social` `#`	`social`	1 876 442	648.18 Mb
Proza	`load_taiga_proza` `#`	`fiction`	1 732 434	38.25 Gb
Stihi	`load_taiga_stihi` `#`		9 157 686	12.80 Gb
Russian NLP Datasets	Several Russian news datasets from webhose.io, lenta.ru and other news sites.
News	`load_buriy_news` `#`	`news`	2 154 801	6.84 Gb	Dump of top 40 news + 20 fashion news sites. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2014.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part1.tar.bz2` `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/news-articles-2015-part2.tar.bz2`
Webhose	`load_buriy_webhose` `#`	`news`	285 965	859.32 Mb	Dump from webhose.io, 300 sources for one month. `wget https://github.com/buriy/russian-nlp-datasets/releases/download/r4/webhose-2016.tar.bz2`
ODS #proj_news_viz	Several news sites scraped by members of #proj_news_viz ODS project.
Interfax	`load_ods_interfax` `#`	`news`	543 961	1.22 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/interfax.csv.gz`
Gazeta	`load_ods_gazeta` `#`	`news`	865 847	1.63 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/gazeta.csv.gz`
Izvestia	`load_ods_izvestia` `#`	`news`	86 601	307.19 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/iz.csv.gz`
Meduza	`load_ods_meduza` `#`	`news`	71 806	270.11 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/meduza.csv.gz`
RIA	`load_ods_ria` `#`	`news`	101 543	233.88 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/ria.csv.gz`
Russia Today	`load_ods_rt` `#`	`news`	106 644	187.12 Mb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/rt.csv.gz`
TASS	`load_ods_tass` `#`	`news`	1 135 635	3.27 Gb	`wget https://github.com/ods-ai-ml4sg/proj_news_viz/releases/download/data/tass-001.csv.gz`
Universal Dependencies
GSD	`load_ud_gsd` `#`	`morph` `syntax`	5 030	1.01 Mb	`wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-GSD/raw/master/ru_gsd-ud-train.conllu`
Taiga	`load_ud_taiga` `#`	`morph` `syntax`	3 264	353.80 Kb	`wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-Taiga/raw/master/ru_taiga-ud-train.conllu`
PUD	`load_ud_pud` `#`	`morph` `syntax`	1 000	207.78 Kb	`wget https://github.com/UniversalDependencies/UD_Russian-PUD/raw/master/ru_pud-ud-test.conllu`
SynTagRus	`load_ud_syntag` `#`	`morph` `syntax`	61 889	11.33 Mb	`wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-dev.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-test.conllu` `wget https://github.com/UniversalDependencies/UD_Russian-SynTagRus/raw/master/ru_syntagrus-ud-train.conllu`
morphoRuEval-2017
General Internet-Corpus	`load_morphoru_gicrya` `#`	`morph`	83 148	10.58 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/GIKRYA_texts_new.zip` `unzip GIKRYA_texts_new.zip` `rm GIKRYA_texts_new.zip`
Russian National Corpus	`load_morphoru_rnc` `#`	`morph`	98 892	12.71 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/RNC_texts.rar` `unrar x RNC_texts.rar` `rm RNC_texts.rar`
OpenCorpora	`load_morphoru_corpora` `#`	`morph`	38 510	4.80 Mb	`wget https://github.com/dialogue-evaluation/morphoRuEval-2017/raw/master/OpenCorpora_Texts.rar` `unrar x OpenCorpora_Texts.rar` `rm OpenCorpora_Texts.rar`
RUSSE Russian Semantic Relatedness
HJ: Human Judgements of Word Pairs	`load_russe_hj` `#`	`emb` `sim`			`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/hj.csv`
RT: Synonyms and Hypernyms from the Thesaurus RuThes	`load_russe_rt` `#`	`emb` `sim`			`wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/rt.csv`
AE: Cognitive Associations from the Sociation.org Experiment	`load_russe_ae` `#`	`emb` `sim`			`wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-train.csv` `wget https://github.com/nlpub/russe-evaluation/raw/master/russe/evaluation/ae-test.csv` `wget https://raw.githubusercontent.com/nlpub/russe-evaluation/master/russe/evaluation/ae2.csv`
Toloka Datasets
Lexical Relations from the Wisdom of the Crowd (LRWC)	`load_toloka_lrwc` `#`	`emb` `sim`			`wget https://tlk.s3.yandex.net/dataset/LRWC.zip` `unzip LRWC.zip` `rm LRWC.zip`
The Russian Adverse Drug Reaction Corpus of Tweets (RuADReCT)	`load_ruadrect` `#`	`social`	9 515	2.09 Mb	This corpus was developed for the Social Media Mining for Health Applications (#SMM4H) Shared Task 2020 `wget https://github.com/cimm-kzn/RuDReC/raw/master/data/RuADReCT.zip` `unzip RuADReCT.zip` `rm RuADReCT.zip`
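
The `wget` commands in the table can also be run from Python. The following is a minimal sketch using only the standard library; `filename_from_url` and `download` are hypothetical helpers, not part of corus, and the sketch assumes the URL ends in the file name (which holds for the direct links above but not for the shortened or "manually download" ones):

```python
import os
import shutil
from urllib.parse import urlparse
from urllib.request import urlopen


def filename_from_url(url):
    """Return the last path component of a URL, used here as the local file name."""
    return os.path.basename(urlparse(url).path)


def download(url, directory='.'):
    """Stream `url` into `directory` and return the local path."""
    path = os.path.join(directory, filename_from_url(url))
    with urlopen(url) as response, open(path, 'wb') as target:
        shutil.copyfileobj(response, target)
    return path


url = 'https://github.com/yutkin/Lenta.Ru-News-Dataset/releases/download/v1.0/lenta-ru-news.csv.gz'
name = filename_from_url(url)
print(name)
# download(url) would fetch the archive; it is not called here to avoid network access.
```

The returned path can then be passed to the matching loader from the API column, for example `load_lenta` for the Lenta.ru v1.0 archive.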

Support

Chat — https://t.me/natural_language_processing
Issues — https://github.com/natasha/corus/issues
Commercial support — https://lab.alexkuk.ru

Add new source

Implement corus/sources/<source>.py
Add import into corus/sources/__init__.py
Add meta into corus/sources/meta.py
Add example into docs.ipynb (check meta table is correct)
Run tests (readme is updated)
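
The first step above can be sketched as follows. This only illustrates the loader pattern seen in the Usage section (a function that takes a path and yields records one at a time); it is not the layout corus itself uses. `ExampleRecord`, `load_example`, and the three-column CSV format are all hypothetical:

```python
import csv
import gzip
import os
import tempfile
from collections import namedtuple

# Hypothetical record type; corus defines its own record classes.
ExampleRecord = namedtuple('ExampleRecord', ['url', 'title', 'text'])


def load_example(path):
    """Yield ExampleRecord items from a gzipped CSV with url, title, text columns."""
    with gzip.open(path, 'rt', encoding='utf8', newline='') as file:
        for row in csv.DictReader(file):
            yield ExampleRecord(row['url'], row['title'], row['text'])


# Write a tiny archive so the loader can be tried without a real dataset.
path = os.path.join(tempfile.mkdtemp(), 'example.csv.gz')
with gzip.open(path, 'wt', encoding='utf8', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['url', 'title', 'text'])
    writer.writerow(['https://example.com/1', 'Заголовок', 'Текст новости'])

records = list(load_example(path))
print(records[0].title)
```

Writing the loader as a generator keeps memory use flat, which matters for the multi-gigabyte archives listed in the Reference section.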

Development

Dev env

python -m venv ~/.venvs/natasha-corus
source ~/.venvs/natasha-corus/bin/activate

pip install -r requirements/dev.txt
pip install -e .

python -m ipykernel install --user --name natasha-corus

Lint + update docs

make lint
make exec-docs

Release

# Update setup.py version

git commit -am 'Up version'
git tag v0.10.0

git push
git push --tags
