
Adding Pretrained Ancient Greek Fasttext #1215

Open
Zoomerhimmer opened this issue Mar 17, 2023 · 4 comments
@Zoomerhimmer

Could these fastText embeddings be added to the models list? They're under the CCA 4.0 license. The 300D vectors are far superior to the NLPL word2vec implementation. I loaded them in gensim but didn't know how to plug them into the cltk pipeline.

@kylepjohnson
Member

Hi @Zoomerhimmer, I am open-minded about changing to the fastText vectors.

The 300D vec is far superior to the NLPL word2vec implementation

I want to learn more about this. What makes them better? Is it that they use character n-grams in addition to word tokens (word2vec)?

@kylepjohnson
Member

Here's our embeddings code: https://github.com/cltk/cltk/blob/c15e0b27bab2526710408d30d5ca3879964ca17c/src/cltk/embeddings/embeddings.py

Would you like to work on this @Zoomerhimmer ?

Important parts of the code:

  1. Import gensim:
    from gensim import models # type: ignore
  2. Map of lang-to-w2v URLs:
    MAP_NLPL_LANG_TO_URL = dict(
    (You would need to make a new lang-to-fastText URL map.)
  3. At the check for MAP_NLPL_LANG_TO_URL, you would add your new fastText URL:
    if self.iso_code not in MAP_NLPL_LANG_TO_URL:

Probably a few other small things to change.
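A hypothetical sketch of what those changes could look like (the map name, helper function, and error message below are my assumptions, not actual cltk code; the Zenodo URL is the one linked later in this thread):

```python
# Hypothetical new map, mirroring MAP_NLPL_LANG_TO_URL, keyed by ISO code.
MAP_FASTTEXT_LANG_TO_URL = dict(
    grc="https://zenodo.org/record/7630945/files/grc_fasttext_skipgram_nn2_xn10_dim300.vec?download=1",
)


def fasttext_url_for(iso_code: str) -> str:
    """Return the download URL for a language, mirroring the NLPL check."""
    if iso_code not in MAP_FASTTEXT_LANG_TO_URL:
        raise ValueError(f"No fastText embeddings registered for '{iso_code}'.")
    return MAP_FASTTEXT_LANG_TO_URL[iso_code]


print(fasttext_url_for("grc"))
```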

Do you know where these grc ft models are saved online?

@Zoomerhimmer
Author

I actually can’t find any details about the fastText model itself, except for the name. However, I believe it outperforms the current NLPL model. Here’s an example (I got fastText to load with your helpful hints):

ft_model.model.most_similar('λέγω')
[('φημὶ', 0.6184707283973694), ('λέγωμαι', 0.590637743473053), ('δείκνυμί', 0.587498128414154), ('δείκνυμι', 0.5802488327026367), ('τέμνομαι', 0.577020525932312), ('ἀποδείκνυμαι', 0.5757586359977722), ('εἴπω', 0.5722758769989014), ('φημι', 0.5673102736473083), ('ἀπολέγω', 0.5672284960746765), ('λέγομαι', 0.5671581029891968)]
These words are all near to each other in meaning. The only word I can’t make sense of is τέμνομαι.
wv_model.model.most_similar('λέγω')
[('ἀμὴν', 0.7812811732292175), ('εἴρηκα', 0.7477512359619141), ('καλῶ', 0.7466088533401489), ('ἐπιδείξω', 0.7412055134773254), ('Ἀμὴν', 0.7397247552871704), ('ποιήσω', 0.7381742596626282), ('ἐρῶ', 0.723713755607605), ('λαβέ', 0.7228215932846069), ('φράσω', 0.719882071018219), ('ἀναγνώσεται', 0.7092446684837341)]
This list has a wider semantic scope, but apparently more confidence that these words are associated. I get why ἀμὴν would be near verbs of speaking, since a speaker uses it to add veridical force to a statement, and ποιήσω makes sense if we think of promises. However, λαβέ seems off the rails to me. I could be speaking out of ignorance, though, and maybe direct-discourse verbs aren’t the best comparison.

Another verb:
The NLPL model didn’t have ἰσχύω in its vocabulary, so I used the 3rd person.

ft_model.model.most_similar('ἰσχύει')
[('ἐξισχύει', 0.7105363011360168), ('ἰσχύεις', 0.7062532901763916), ('κατισχύει', 0.6672629117965698), ('ἐνισχύει', 0.6650412082672119), ('ἰσχύες', 0.6644095182418823), ('ὑπερισχύει', 0.6635435223579407), ('ἰσχύσει', 0.6463064551353455), ('δυνατεῖ', 0.626556396484375), ('δύναται', 0.6209343075752258), ('ἰσχύῃ', 0.6127247214317322)]
All of these make sense. One would expect prepositionally prefixed verbs to tend toward similarity.
wv_model.model.most_similar('ἰσχύει')
[('τοὔλαττον', 0.9440385699272156), ('ʽδῆλον', 0.9435624480247498), ('προαιρεῖται', 0.9398163557052612), ('ἀδιαφόρων', 0.9391382932662964), ('δικαιοπραγεῖν', 0.9363390207290649), ('ἀναγκάζεται', 0.9361328482627869), ('ἀνεξέταστον', 0.934328556060791), ('ἀναγκαιότερον', 0.9341817498207092), ('εὐπρεπὲς', 0.9334261417388916), ('χρηματίζεσθαι', 0.9333732724189758)]
This list doesn’t even include δύναμαι yet it seems highly confident of these other relatives that are worlds apart from each other.

Here are some nouns:

ft_model.model.most_similar('ἀνήρ')
[('ἀνὴρ', 0.8346245884895325), ('ὡνήρ', 0.6498374342918396), ('ἀγύναιος', 0.6230184435844421), ('ἀνδήριος', 0.619379997253418), ('ἀνδραίμων', 0.612857460975647), ('ἀνδρὼν', 0.6128512024879456), ('ἀνδρικώτατος', 0.6093217730522156), ('ἁνήρ', 0.6082747578620911), ('ἀστὸς', 0.6050947904586792), ('αὐτὴς', 0.6047882437705994)]
(I would think normalizing graves to acute accents wouldn’t do any harm and would remove duplicates like the first item.) These terms seem reasonable. Most are just variants of ‘male’, but we have ‘unmarried’, ‘native’, and ‘her husband’.
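The grave-to-acute normalization suggested above can be done with the standard library alone; a minimal sketch (my own, not cltk's normalizer) that decomposes, swaps the combining grave for a combining acute, and recomposes:

```python
# Fold grave accents to acute so variants like ἀνὴρ / ἀνήρ collapse to one
# form before querying or deduplicating neighbour lists.
import unicodedata

GRAVE = "\u0300"  # combining grave accent
ACUTE = "\u0301"  # combining acute accent


def grave_to_acute(word: str) -> str:
    """Replace every combining grave with an acute, preserving breathings."""
    decomposed = unicodedata.normalize("NFD", word)
    return unicodedata.normalize("NFC", decomposed.replace(GRAVE, ACUTE))


print(grave_to_acute("ἀνὴρ"))  # prints ἀνήρ
```

Note that NFC recomposition yields the monotonic (tonos) codepoints for acutes, so the same normalization should be applied to both query and vocabulary for matches to line up.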
wv_model.model.most_similar('ἀνήρ')
[('κακὸς', 0.8629888296127319), ('ἀγαθός', 0.8620637059211731), ('ἐλεύθερος', 0.8437824249267578), ('κακός', 0.839881956577301), ('ἐχθρὸς', 0.8395668268203735), ('θρασὺς', 0.83908611536026), ('Ἕλλην', 0.8368448615074158), ('ἀνὴρ', 0.8336415886878967), ('ξένος', 0.8300831913948059), ('οἷος', 0.8276161551475525)]
These terms focus a lot more on how to describe a man: ‘evil’, ‘good’, ‘free’, ‘hostile’, ‘brave’, ‘Greek’, ‘male’, ‘stranger’, ‘such a one as’. Again a much wider range, but still sensible if we think of how these would occur in context.

ft_model.model.most_similar('ἄνθρωπος')
[('ἄνθρωπός', 0.7523252367973328), ('θεάνθρωπος', 0.7364073991775513), ('ἅνθρωπος', 0.7242015600204468), ('ὥνθρωπος', 0.7177456617355347), ('ἀνθρωπός', 0.7167887091636658), ('γελαστικὸς', 0.6792545914649963), ('ἄνθρωπέ', 0.6771118640899658), ('γελαστικός', 0.653688907623291), ('λογικὸς', 0.6507590413093567), ('λογικός', 0.6483426094055176)]
It is interesting that the God-man (Christ) would come up near the top (is this an artifact of character grams?). Otherwise we’ve got the ‘cheerful/laughing man’ and the ‘reasonable man’ tailing a bunch of alternate forms for ‘man’.
wv_model.model.most_similar('ἄνθρωπος')
[('πλούσιος', 0.808495283126831), ('ἁμαρτάνει', 0.7945271134376526), ('ἰατρὸς', 0.7912680506706238), ('δοῦλος', 0.7888801693916321), ('ἰσχυρὸς', 0.788048267364502), ('ἅνθρωπος', 0.7873396277427673), ('ἰατρός', 0.7864810824394226), ('ἀκριβὴς', 0.7815724015235901), ('τοιοῦτος', 0.7811455130577087), ('μικρὸς', 0.7802046537399292)]
Again this model focuses more on word pairs and less on synonyms, which may be better depending on what you want to do. It’s funny that humans would be associated with sinning (ἁμαρτάνει); it makes sense, though.

To conclude, I guess character n-grams make the difference in quality, since Ancient Greek was one of the most highly inflected languages around. 300 dimensions is also bigger than 100, so that probably contributes, though I don’t know how much. Maybe I should reach out to the researcher and ask about his corpus and training parameters.

@Zoomerhimmer
Author

Zoomerhimmer commented Mar 18, 2023

Do you know where these grc ft models are saved online?

They have four locations. Here is the Zenodo site's download link (https://zenodo.org/record/7630945/files/grc_fasttext_skipgram_nn2_xn10_dim300.vec?download=1). This is the repo (https://zenodo.org/record/7630945); they have other formats and various sizes too.

I was a bit confused by the _build_fasttext_url function. Do all fastText vectors have to be stored on dl.fbaipublicfiles.com? And how do we deal with licensing/attribution? I'm totally unfamiliar with that stuff.

@clemsciences clemsciences self-assigned this Mar 20, 2023