Tamil Language Model - tamil-lm

Language modelling based on skip-grams over a Tamil news dataset: from nothing but newspaper articles, the computer learns how words relate to one another.
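As a rough illustration of what "skip-grams" means here, the sketch below turns a tokenized sentence into (center, context) training pairs. It is a minimal, generic example and not the code in this repo; the function name `skipgram_pairs`, the `window` parameter, and the sample sentence are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every token within `window` positions.

    Hypothetical helper for illustration; not taken from this repository.
    """
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield center, tokens[j]


# Sample sentence chosen for illustration, split on whitespace.
sentence = "தமிழ் மொழி மிகவும் பழமையான மொழி".split()
pairs = list(skipgram_pairs(sentence, window=1))

for center, context in pairs:
    print(center, "->", context)
```

A skip-gram model is then trained to predict the context word from the center word, so words that appear in similar contexts tend to end up with similar embedding vectors.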

Demo

Please go to the link for the demo. There you can watch the model itself identify which words are related and which are not.

Dataset

The data is scraped from Tamil news websites and lightly cleaned. The dataset will be made available soon, and the cleaning methods will be explained as well.

Models

The repo contains three models in total; of these, skipgram works well.

Model weights.

Plain text vocabulary and embedding vectors

The embedding vectors and corresponding tokens can be downloaded from vaaku2vec.zip

How to visualize?

You can upload the vocab.vectors.tsv and vocab.tokens.tsv in TensorFlow Projector to visualize them.

Use the t-SNE projection and let it run for more than 400 iterations. You can see the cone-ice shape come to life. It is really fun.
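Outside the Projector, the same two files can also be explored in plain Python. The sketch below is a minimal example, assuming `vocab.vectors.tsv` holds one tab-separated vector per line and `vocab.tokens.tsv` holds the matching token on the same line number (the layout TensorFlow Projector expects for vectors and metadata without a header row). The helper names `load_embeddings`, `cosine`, and `nearest` are hypothetical and are not part of this repo.

```python
import math


def load_embeddings(vectors_path, tokens_path):
    """Read the two parallel TSV files into a {token: vector} dict.

    Assumes line N of the tokens file labels line N of the vectors file.
    """
    embeddings = {}
    with open(vectors_path, encoding="utf-8") as vf, \
            open(tokens_path, encoding="utf-8") as tf:
        for vec_line, tok_line in zip(vf, tf):
            if not vec_line.strip():
                continue  # skip blank lines
            token = tok_line.rstrip("\n")
            embeddings[token] = [float(x) for x in vec_line.split("\t")]
    return embeddings


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(embeddings, word, k=5):
    """Return the k tokens whose vectors are most similar to `word`."""
    query = embeddings[word]
    scored = [(cosine(query, vec), tok)
              for tok, vec in embeddings.items() if tok != word]
    return [tok for _, tok in sorted(scored, reverse=True)[:k]]
```

With the files unpacked into the working directory, `nearest(load_embeddings("vocab.vectors.tsv", "vocab.tokens.tsv"), some_token)` lists the closest neighbours of `some_token`, where `some_token` is any token present in the vocabulary file.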

Training

$ python main.py train

Sister Projects

Malayalam - Vaaku2Vec

Thanks

Abin Simon for setting up the UI.
Adam Shamshudeen for setting up the server side.
Malaikannan Sankarasubbu for your support and guidance.
Sebastian Ruder for helping me learn word embeddings years ago.

And all the good people who write blogs every day to better humanity.

