Curated list of publicly available parallel corpora for Indian languages

Kartikaggarwal98/Indian_ParallelCorpus

Parallel Corpus for Indian Languages

Available parallel data for training machine translation models in Indic languages: Assamese, Bengali, Gujarati, Gondi, Hindi, Kannada, Manipuri, Marathi, Malayalam, Oriya, Punjabi, Sanskrit, Tamil, Telugu.

Assamese-X

  1. Samanantar Corpus
  2. As-En PMIndia Corpus
  3. As-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row asm-eng.

Bengali-X

  1. Samanantar Corpus
  2. Bn-En BUET Parallel corpus: 2.75 million Bengali-English sentence pairs (EMNLP 2020)
  3. Bn-En Project Anuvaad
  4. Bn-En Indian Parallel Corpora
  5. CVIT-IIITH PIB Multilingual Corpus: en, gu, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, gu, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Bn-En Indian-Language Dataset
  8. Bn-En Asian Language Treebank (ALT) Parallel Corpus
  9. Bn-En PMIndia Corpus
  10. Bn-En OPUS: Set source as en and target as bn
  11. Bn-En SUPARA 0.8M: Requires an IEEE DataPort Subscription
  12. Bn-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row ben-eng.

Gujarati-X

  1. Samanantar Corpus
  2. Gu-En WikiTitles Parallel Corpus: wikititles-v1.gu-en.tsv.gz
  3. Gu-En Project Anuvaad
  4. Gu-En Tsardia
  5. CVIT-IIITH PIB Multilingual Corpus: en, bn, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, bn, hi, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Gu-En Shahparth123
  8. Gu-En PMIndia Corpus
  9. Gu-En Bible Corpus
  10. Gu-En OPUS: Set source as en and target as gu
  11. Gu-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row guj-eng.
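Several of the resources above are distributed as compressed tab-separated files, for example `wikititles-v1.gu-en.tsv.gz`. A minimal sketch of loading such a file into sentence pairs follows. It assumes two tab-separated columns per line, with the first column holding the source side and the second the target side; the helper name `read_tsv_pairs` is hypothetical, and the column layout should be checked against each corpus before use.

```python
import csv
import gzip
from typing import Iterator, Tuple


def read_tsv_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from a gzipped, tab-separated file.

    Assumption: column 0 is the source side and column 1 the target side.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps literal quote characters inside sentences intact.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip rows that do not have both columns.
            if len(row) < 2:
                continue
            src, tgt = row[0].strip(), row[1].strip()
            # Skip rows where either side is empty.
            if src and tgt:
                yield src, tgt
```

Rows with a missing or empty side are dropped here rather than raised as errors, which is usually the more convenient behaviour when scanning crawled data.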

Gondi-X

  1. Gondi-Hindi Parallel Corpus

Hindi-X

  1. Samanantar Corpus
  2. Hi-En IITB Parallel Corpus: v3.0 released
  3. Hi-En Project Anuvaad
  4. Hi-En Indian Parallel Corpora
  5. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, ml, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. Hi-En Asian Language Treebank (ALT) Parallel Corpus
  8. Hi-En PMIndia Corpus
  9. Hi-En Bible Corpus
  10. Hi-En Wiki Matrix Comparable Corpus
  11. Hi-En OPUS: Set source as en and target as hi. [Some of these corpora are part of the IITB Parallel Corpus.]
  12. Hi-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row hin-eng.
  13. IIITH Code-Mix Hi-En Corpus
  14. Hi-En Flickr 8k: Multimodal Dataset
  15. Hi-San parallel corpus: Hindi-Sanskrit monolingual and parallel data from Ramayana, Rigveda, Bhagavad Gita, etc.
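Because some of the OPUS corpora overlap with the IITB Parallel Corpus, combining several of the sources above can introduce repeated sentence pairs. The following is a rough sketch of one way to merge corpora while keeping only the first copy of each pair. The helper names `merge_unique` and `_key` are hypothetical, and matching on NFC-normalised, whitespace-collapsed text is an assumption about what should count as a duplicate, not a method taken from any of the listed projects.

```python
import unicodedata
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def _key(text: str) -> str:
    # NFC-normalise and collapse whitespace so trivially different copies match.
    return " ".join(unicodedata.normalize("NFC", text).split())


def merge_unique(*corpora: Iterable[Pair]) -> Iterator[Pair]:
    """Yield pairs from all corpora in order, skipping pairs already seen."""
    seen = set()
    for corpus in corpora:
        for src, tgt in corpus:
            key = (_key(src), _key(tgt))
            if key not in seen:
                seen.add(key)
                yield src, tgt
```

Each argument can be any iterable of pairs, so lazily read files can be passed in directly; only the set of normalised keys is held in memory.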

Kannada-X

  1. Samanantar Corpus
  2. Kn-En Project Anuvaad
  3. Kn-En PMIndia Corpus
  4. Kn-En Bible Corpus
  5. Kn-En OPUS: Set source as en and target as kn
  6. Kn-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row kan-eng.

Manipuri-X

  1. Mn-En PMIndia Corpus

Marathi-X

  1. Samanantar Corpus
  2. Mr-En Project Anuvaad
  3. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  4. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  5. Mr-En PMIndia Corpus
  6. Mr-En Bible Corpus
  7. Mr-En OPUS: Set source as en and target as mr
  8. Mr-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row mar-eng.

Malayalam-X

  1. Samanantar Corpus
  2. Ml-En Project Anuvaad
  3. Ml-En Indian Parallel Corpora
  4. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  5. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, mr, or, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  6. Ml-En Indian-Language Dataset
  7. Ml-En English_Malayalam_ParallelCorpora
  8. Ml-En PMIndia Corpus
  9. Ml-En Bible Corpus
  10. Ml-En OPUS: Set source as en and target as ml
  11. Ml-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row mal-eng.

Oriya-X

  1. Samanantar Corpus
  2. Or-En MTEnglish2Odia
  3. Or-En OdiEnCorp 2.0
  4. Or-En OdiEnCorp 1.0
  5. Or-En IndoWordnet Parallel Corpus
  6. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  7. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, pa, ta, te, ur. [Source-code, pretrained models and other resources also available.]
  8. Or-En PMIndia Corpus
  9. Or-En OPUS: Set source as en and target as or
  10. Or-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row ori-eng.

Punjabi-X

  1. Samanantar Corpus
  2. Pu-En Project Anuvaad
  3. Pu-En Punjabi-English Corpus
  4. Pu-En PMIndia Corpus
  5. Pu-En OPUS: Set source as en and target as pa
  6. Pu-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row pan-eng.

Sanskrit-X

  1. San-Hi parallel corpus: Sanskrit-Hindi monolingual and parallel data from Ramayana, Rigveda, Bhagavad Gita, etc.

Tamil-X

  1. Samanantar Corpus
  2. Ta-En Project Anuvaad
  3. Ta-En Indian Parallel Corpora
  4. Ta-En National Language Process Center
  5. Ta-En EnTam
  6. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, or, pa, te, ur. [Source-code, pretrained models and other resources also available.]
  7. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, or, pa, te, ur. [Source-code, pretrained models and other resources also available.]
  8. Ta-En Indian-Language Dataset
  9. Ta-En Multiple Dataset Links
  10. Ta-En PMIndia Corpus
  11. Ta-En Parallel Corpus
  12. Ta-En OPUS: Set source as en and target as ta
  13. Ta-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row tam-eng.

Telugu-X

  1. Samanantar Corpus
  2. Te-En Project Anuvaad
  3. Te-En Indian Parallel Corpora
  4. CVIT-IIITH PIB Multilingual Corpus: en, bn, gu, hi, ml, mr, or, pa, ta, ur. [Source-code, pretrained models and other resources also available.]
  5. CVIT-IIITH Mann ki Baat Corpus: en, bn, gu, hi, ml, mr, or, pa, ta, ur. [Source-code, pretrained models and other resources also available.]
  6. Te-En Indian-Language Dataset
  7. Te-En PMIndia Corpus
  8. Te-En Bible Corpus
  9. Te-En OPUS: Set source as en and target as te
  10. Te-En Backtranslated Tatoeba Challenge: Parallel data obtained by backtranslation on monolingual data. Row tel-eng.

Other Resources

  1. PMIndia Parallel Corpus Creation: Code for creating a parallel corpus from pmindia.gov.in. [Paper Link]
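
The OPUS and Tatoeba Challenge entries in the lists above are selected by language code: a two-letter target code for OPUS (with en as the source) and a three-letter row name for Tatoeba. As a convenience, the codes mentioned in this list can be gathered into one lookup table. This is only a sketch: the names `LANGUAGE_CODES`, `opus_pair`, and `tatoeba_row` are hypothetical, and the table contains only the codes that appear in this list.

```python
from typing import Dict, Optional, Tuple

# language -> (OPUS target code, Tatoeba Challenge row prefix)
# Only codes that appear in the lists above are included.
LANGUAGE_CODES: Dict[str, Tuple[Optional[str], str]] = {
    "Assamese": (None, "asm"),  # no OPUS entry is listed for Assamese
    "Bengali": ("bn", "ben"),
    "Gujarati": ("gu", "guj"),
    "Hindi": ("hi", "hin"),
    "Kannada": ("kn", "kan"),
    "Marathi": ("mr", "mar"),
    "Malayalam": ("ml", "mal"),
    "Oriya": ("or", "ori"),
    "Punjabi": ("pa", "pan"),
    "Tamil": ("ta", "tam"),
    "Telugu": ("te", "tel"),
}


def opus_pair(language: str) -> Tuple[str, str]:
    """Return the (source, target) codes to select on OPUS."""
    opus_code, _ = LANGUAGE_CODES[language]
    if opus_code is None:
        raise ValueError(f"No OPUS entry is listed for {language}")
    return ("en", opus_code)


def tatoeba_row(language: str) -> str:
    """Return the row name to look for in the Tatoeba Challenge table."""
    _, tatoeba_code = LANGUAGE_CODES[language]
    return f"{tatoeba_code}-eng"
```

For example, `opus_pair("Punjabi")` gives the en and pa codes to select on OPUS, and `tatoeba_row("Bengali")` gives the ben-eng row name.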