This repository provides scripts for downloading and cleaning various NLP datasets in Indic languages and converting them into a standard format.

deterministic-algorithms-lab/Std-Indic-NLP

Standardizing Indic NLP

This repository is currently in development.

The purpose of this repository is to provide scripts for downloading various Indic NLP datasets and converting them into a standard format. A shared format makes datasets easy to merge and process, and lets different models be trained without much change to the structure of the data.

Points to Note

1.) By default, files must not be split into train, test, and validation sets.

2.) "Dataset in standard format" is shortened to "standard dataset" everywhere else in this repo.

3.) Read the README.md and CONTRIBUTING.md above, as well as those inside the folder for the relevant NLP task, to understand how to use the scripts and how to contribute.

4.) See Projects if you want to contribute to already planned tasks.

5.) Before adding a new dataset, cleaner, or task, search through the open pull requests to make sure no one else is already working on the same thing, then open a new issue to discuss the changes.
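
Because standard datasets are stored unsplit (point 1), a split has to be made at training time. The following is a minimal sketch of one way to do that, not part of this repository: the helper name `split_examples` and the assumption that a dataset has already been loaded as a list of examples are hypothetical, and the actual standard format for each task is described in that task's folder.

```python
import random


def split_examples(examples, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a list of examples and split it into train/valid/test lists.

    Hypothetical helper for illustration only; not part of Std-Indic-NLP.
    """
    if valid_frac < 0 or test_frac < 0 or valid_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")

    # Copy so the caller's list is left untouched, then shuffle with a
    # fixed seed so the same split can be reproduced later.
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)

    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)

    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test
```

Fixing the seed keeps the split reproducible across runs, so results from different models trained on the same standard dataset remain comparable.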

Installation

To set up the repository for development, run the following commands:

$ git clone https://github.com/deterministic-algorithms-lab/Std-Indic-NLP
$ cd Std-Indic-NLP
$ pip install -e ./

See: Discussion on Reddit.
