GitHub - divkakwani/webcorpus: Generate large textual corpora for almost any language by crawling the web

webcorpus is an end-to-end tool for crawling the web and generating datasets from the crawled data. It can generate monolingual corpora and includes processors that create labelled datasets automatically. webcorpus is particularly suited to low-resource languages, which need automated methods for creating large-scale datasets.

This project has been used to generate IndicCorp, a set of large-scale corpora for Indic languages, and some datasets for IndicGLUE.

Installation

Make sure you have Java installed on your system. Then install webcorpus using pip:

sudo pip3 install webcorpus

Usage

Building a dataset takes two steps: first crawl the web, then process the crawls to create the final dataset.

Step 1: Crawling Sources

To start crawling websites, you first need to start the webcorpus crawling server:

sudo webcorpus start

Once the server has started, you can start a crawl using the following command:

webcorpus crawl --path <path> --name <name> --url <url> --log <path> [--host <ip address>]
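
When several sources need to be crawled, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of webcorpus itself: the helper name build_crawl_command and the example paths, name, and URL are hypothetical, and only the flags shown in the command above are used.

```python
import shlex
from typing import List, Optional


def build_crawl_command(
    path: str, name: str, url: str, log: str, host: Optional[str] = None
) -> List[str]:
    """Return the argument list for a `webcorpus crawl` invocation.

    Hypothetical helper: it only arranges the flags documented above
    and does not start a crawl by itself.
    """
    cmd = [
        "webcorpus", "crawl",
        "--path", path,
        "--name", name,
        "--url", url,
        "--log", log,
    ]
    # --host is optional in the documented command, so add it only if given.
    if host is not None:
        cmd += ["--host", host]
    return cmd


# Placeholder values for illustration only.
example = build_crawl_command(
    path="crawls/example",
    name="example",
    url="https://example.com",
    log="logs/example",
)
print(shlex.join(example))
```

The returned list can be passed to subprocess.run once the crawling server is running.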

You can see the status of the crawls anytime by executing:

webcorpus log [--host <ip address>]

The last two commands can also be run remotely by passing --host, which is useful in distributed mode, where you are running multiple webcorpus servers.
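
In a distributed setup, checking every server by hand gets tedious. As a rough sketch, assuming a list of server addresses you maintain yourself (the addresses and the helper name build_log_commands below are hypothetical), the status command can be generated once per host:

```python
import shlex
from typing import List


def build_log_commands(hosts: List[str]) -> List[List[str]]:
    """Return one `webcorpus log --host <ip address>` command per server.

    Hypothetical helper: it only builds argument lists from the
    documented `log` command and does not contact any server.
    """
    return [["webcorpus", "log", "--host", host] for host in hosts]


# Placeholder addresses from the documentation-only 192.0.2.0/24 range.
servers = ["192.0.2.10", "192.0.2.11"]
for cmd in build_log_commands(servers):
    print(shlex.join(cmd))
```

Each generated list can be handed to subprocess.run to query the corresponding server.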

Step 2: Processing Corpus

After crawling, process the crawled data with the following command:

webcorpus process --operation <operation code> --lang <lang code> --input <input path> --output <output path>

Currently, the following processing operations are supported: extract_arts, extract_sents, extract_genres, archive.
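
Because the operation code must be one of the four listed above, a small wrapper can catch typos before the command is run. This is a sketch under assumptions: build_process_command is a hypothetical helper, the paths are placeholders, and the language code "hi" is only an example value whose accepted format is not specified here.

```python
from typing import List

# Operation codes listed in this README as currently supported.
SUPPORTED_OPERATIONS = ("extract_arts", "extract_sents", "extract_genres", "archive")


def build_process_command(
    operation: str, lang: str, input_path: str, output_path: str
) -> List[str]:
    """Return the argument list for a `webcorpus process` invocation.

    Hypothetical helper: it rejects operation codes that are not in the
    documented list and otherwise just arranges the documented flags.
    """
    if operation not in SUPPORTED_OPERATIONS:
        raise ValueError(
            f"unsupported operation {operation!r}; "
            f"expected one of {', '.join(SUPPORTED_OPERATIONS)}"
        )
    return [
        "webcorpus", "process",
        "--operation", operation,
        "--lang", lang,
        "--input", input_path,
        "--output", output_path,
    ]


# Placeholder values for illustration only.
cmd = build_process_command("extract_sents", "hi", "crawls/example", "out/example")
print(" ".join(cmd))
```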
