State of the Art Language models and Classifier for Sanskrit language (ancient indian language)

goru001/nlp-for-sanskrit

NLP for Sanskrit

This repository contains state-of-the-art language models and a classifier for Sanskrit, an ancient Indian language.

The models trained here are used in the Natural Language Toolkit for Indic Languages (iNLTK).

Datasets

Both datasets were created as part of this project:

  1. Sanskrit Wikipedia Articles

  2. Sanskrit Shlokas Dataset

Results

Language Model Perplexity

| Architecture/Dataset | Sanskrit Wikipedia Articles |
|----------------------|-----------------------------|
| ULMFiT               | ~6                          |
| TransformerXL        | ~3                          |
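
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below is only an illustration of that definition, not the evaluation code used in this repository; the function name `perplexity` and its input format (a list of natural-log token probabilities) are assumptions made for the example.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    `token_log_probs` is a hypothetical input format: one log p(token | context)
    value per token in the evaluation text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Average negative log-likelihood per token (cross-entropy in nats).
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 1/6 to every token has perplexity 6.
uniform_six = [math.log(1 / 6)] * 4
print(perplexity(uniform_six))
```

Under this definition, a perplexity of about 3 means the model is, on average, as uncertain as if it were choosing uniformly among roughly three tokens at each step.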

Classification Metrics

ULMFiT

| Dataset                  | Accuracy | Kappa Score |
|--------------------------|----------|-------------|
| Sanskrit Shlokas Dataset | 84.3     | 76.1        |
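
The kappa score is presumably Cohen's kappa, reported here on a 0 to 100 scale; that reading is an assumption, since the README does not say. Cohen's kappa corrects raw accuracy for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the chance agreement. A minimal sketch follows; the function name `cohen_kappa` and the toy labels are hypothetical and are not taken from the Shlokas dataset.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between true and predicted labels (range -1 to 1)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: fraction of items where the labels match (accuracy).
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement: product of the marginal label frequencies, summed over classes.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Only one class appears on both sides; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Toy labels (hypothetical): accuracy is 0.75, chance agreement is 0.5.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "a", "b", "a"]
print(cohen_kappa(y_true, y_pred))
```

Because kappa discounts chance agreement, it is typically lower than accuracy on the same predictions, which is consistent with the two figures in the table.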

Visualizations

Embedding Space
| Architecture  | Visualization         |
|---------------|-----------------------|
| ULMFiT        | Embeddings projection |
| TransformerXL | Embeddings projection |

Pretrained Language Model

Download pretrained Language Model from here

Classifier

Download classifier from here

Tokenizer

The tokenizer was trained using Google's SentencePiece.

Download the trained model and vocabulary from here
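
Once downloaded, a SentencePiece model can be loaded with the `sentencepiece` Python package. The sketch below assumes that package is installed; the helper names `load_processor` and `tokenize` are hypothetical, and the model path is whatever local file name the download produces, which this README does not specify.

```python
def load_processor(model_path):
    """Load a trained SentencePiece model from a local .model file."""
    # Imported lazily so this module can be imported without the third-party package.
    import sentencepiece as spm

    processor = spm.SentencePieceProcessor()
    processor.Load(model_path)
    return processor


def tokenize(processor, text):
    """Split text into subword pieces with an already loaded processor."""
    return processor.EncodeAsPieces(text)
```

A typical call would be `tokenize(load_processor(path_to_model), text)`, where `path_to_model` points at the downloaded model file.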
