simple bs4 based web crawl for a corpus in need of statistical machine translation

MarsPanther/crawl-for-parallel-corpora

crawl-for-parallel-corpora

A simple bs4 (BeautifulSoup) based web crawler that collects a parallel corpus for statistical machine translation.

This project collects a Bible dataset in Ethiopian languages together with the corresponding English translation:

From [https://www.jw.org/am/]

How to Run: Getting Data for the Four Languages

This is an NLP data collection effort that aims to increase the NLP data available for under-resourced languages.

  • print(get_book_data('english'))
  • print(get_book_data('amharic'))
  • print(get_book_data('tigrigna'))
  • print(get_book_data('oromifa'))
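
The four calls above can also be driven from a loop and the results paired into a parallel corpus. The following is only a minimal sketch: it assumes `get_book_data` returns a list of verse strings in the same order for every language, which this README does not confirm. `collect_all`, `align`, and `fake_get_book_data` are hypothetical helpers introduced for illustration; the stand-in replaces the real crawler so the example runs without network access.

```python
from typing import Callable, Dict, List, Tuple

# The four language names used in the calls above.
LANGUAGES = ["english", "amharic", "tigrigna", "oromifa"]


def collect_all(get_book_data: Callable[[str], List[str]]) -> Dict[str, List[str]]:
    """Call the crawler once per language and store the results by language name."""
    return {lang: get_book_data(lang) for lang in LANGUAGES}


def align(source: List[str], target: List[str]) -> List[Tuple[str, str]]:
    """Pair verses by position. zip() stops at the shorter list, so extra
    verses on either side are silently dropped."""
    return list(zip(source, target))


def fake_get_book_data(language: str) -> List[str]:
    """Hypothetical stand-in for the project's get_book_data.
    ASSUMPTION: the real function returns a list of verse strings."""
    return [f"{language} verse {i}" for i in (1, 2)]


corpus = collect_all(fake_get_book_data)
pairs = align(corpus["english"], corpus["amharic"])
for english, amharic in pairs:
    print(english, "|||", amharic)
```

To use this with the real crawler, pass the project's own `get_book_data` to `collect_all` in place of the stand-in. If the function returns something other than a list of verses, adjust `align` to match.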