RFC: Use stanza model for Finnish #255
base: dev
Conversation
Oh wow, this looks great! I didn't know about this. I would love to add it. We actually have a language install system, so the image size would not increase; it would only take up space for users who actually use this language. Does this require a GPU? Can you please test what the size would be without the NVIDIA driver? My only problem with it would be GPU dependence, plus my laptop is probably too weak to test this. After adding the 2 missing Spacy languages, my plan was to use different tokenizers, so it would be VERY useful if I could keep using Spacy for more languages. Thank you so much for working on this! @sergiolaverde0 You may be interested in this.
@simjanos-dev I am so glad you liked it! Yes, it works without a GPU: I just added installation of the CPU version of PyTorch.
What are my next actions? Fix the documentation (add references to stanza everywhere spacy is mentioned, write proper commit and PR messages) and undraft the PR, or is there something else that needs to be done?
@@ -8,6 +8,8 @@ RUN apt-get update -y \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN pip install torch --index-url https://download.pytorch.org/whl/cpu
A comment about stanza and NVIDIA drivers is needed here.
Looking at the URL for the PyTorch install, this doesn't need a GPU, since it uses CPU as the computing platform. I heard we can reduce the size of that install by compiling PyTorch from source without the unnecessary features, but I haven't done that before and don't know by how much we can cut it. If the accuracy increase is noticeable for enough languages, maybe we should consider making it the default. I'm concerned about performance when using CPU, so that's another thing to check. I see a future here, but it will take effort.
My enthusiasm has dropped a lot; I thought it would be much smaller. The model size is still huge compared to the 20-50MB model we used before. A few more questions:
I don't think I want to do that. Some users have already had issues with RAM. I've seen attempted installs on Raspberry Pis, small free-tier hosted servers, and old laptops. I have an old laptop myself, and I also want to host LinguaCafe on a VPS in the future and try to optimize it. If anything, I want to make LinguaCafe smaller by default. However, I definitely want to add these models as an option.
I'm not sure; I will need some time to figure out what I would like to do. I will more than likely have a problem with testing this myself. Since this is only needed for lemmas (except for languages that have no spaces or have readings), what if we used a huge amount of text and generated a list of lemmas to use in LinguaCafe? For most languages, that is the only value added by using another model or tokenizer over the multilingual Spacy one. Two other options would be: adding them as extra installable languages like "Finnish (large)", or adding an API that lets people use other tokenizers. It would be easy to copy the current Python container, modify it, and add different models. What do you think?
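The pre-generated lemma list idea could be sketched roughly like this: run a heavyweight lemmatizer once, offline, over a large corpus, store the surface-form-to-lemma mapping as JSON, and ship only that lookup table at runtime. Everything below is hypothetical (function names, the fallback behaviour, and the toy Finnish pairs are illustrative, not part of the PR):

```python
import json

def build_lemma_table(pairs):
    """Build a surface-form -> lemma lookup table from (word, lemma) pairs.

    `pairs` would come from running a heavyweight lemmatizer (e.g. stanza)
    over a large corpus once, offline.
    """
    table = {}
    for word, lemma in pairs:
        # Keep the first lemma seen for a form. Real data would need
        # frequency-based disambiguation, since forms can be ambiguous.
        table.setdefault(word.lower(), lemma)
    return table

def lookup_lemma(table, word):
    # Fall back to the surface form itself when the table has no entry.
    return table.get(word.lower(), word)

# Offline step: serialize, so the runtime container only needs the JSON file.
pairs = [("talossa", "talo"), ("taloja", "talo"), ("kissat", "kissa")]
blob = json.dumps(build_lemma_table(pairs), ensure_ascii=False)

# Runtime step: load and query.
loaded = json.loads(blob)
print(lookup_lemma(loaded, "Talossa"))  # -> talo
print(lookup_lemma(loaded, "koira"))    # -> koira (unknown form, returned as-is)
```

The obvious trade-off is the one hinted at above: a static table cannot disambiguate forms by context, which is exactly what the contextual models are good at.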
Well, seeing how we have already more or less frozen the features for v0.12, and since I have an assignment for this weekend, I suggest giving it some time. Next week I will try to compile PyTorch from source and see what the absolute minimum size would be, so we can make a better-informed decision. For the time being, the option for larger models is my favourite. @rominf, sorry if we take our time, but Simjanos is right to be concerned about the accessibility of the hardware requirements.
Implementing this will definitely take a lot of time. I want to add everything to LinguaCafe that I can, but I can't do it at the rate requests are coming in. Progress has been insane in the 4 months since release.
Please take your time! I will post my results, so that you have some food for thought meanwhile. You are right about the importance of accessibility of the hardware requirements: my mistake, I was not thoughtful about this. I will write about Finnish only, since I have not tried to do lemmatization in other languages. Stanza language support is split into multiple models; for lemmatization, only a subset of the processors is needed. My PC info:
Here are the results of lemmatization of the Universal Dependencies treebank:
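The actual benchmark used here is aajanki/finnish-pos-accuracy; as an illustration of the core metric only, a lemma-accuracy computation over CoNLL-U data can be sketched like this (the parsing helper and the toy predictor are assumptions, not code from that repository):

```python
def parse_conllu_lemmas(conllu_text):
    """Extract (form, gold_lemma) pairs from CoNLL-U text.

    CoNLL-U columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
    Skips comment lines, multi-word token ranges (IDs like '1-2'),
    and empty nodes (IDs like '1.1').
    """
    pairs = []
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue
        pairs.append((cols[1], cols[2]))
    return pairs

def lemma_accuracy(pairs, predict):
    """Fraction of tokens whose predicted lemma matches the gold lemma."""
    correct = sum(1 for form, gold in pairs if predict(form) == gold)
    return correct / len(pairs)

# Tiny hand-made CoNLL-U fragment for demonstration.
sample = (
    "# text = Kissat juoksevat\n"
    "1\tKissat\tkissa\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "2\tjuoksevat\tjuosta\tVERB\t_\t_\t1\tnsubj\t_\t_\n"
)
pairs = parse_conllu_lemmas(sample)
# Toy predictor that gets one of the two lemmas right.
toy = {"Kissat": "kissa"}
print(lemma_accuracy(pairs, lambda w: toy.get(w, w)))  # -> 0.5
```

The real benchmark additionally aligns predicted and gold tokenizations before scoring, which is why F1 rather than plain accuracy is reported there.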
I also measured RAM usage on lemmatization of Alice in Wonderland in Finnish using scalene. Here is the script:
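The original script is not reproduced in this thread. As a hedged sketch of what such a measurement script plausibly looks like (the file name, function name, and processor list mirror the no-pos setup described in this comment, but are assumptions):

```python
# Sketch of a scalene-profiled lemmatization script; run as:
#   scalene lemmatize_alice.py
# The input file name is hypothetical.

def lemmatize_text(text, lang="fi"):
    """Lemmatize `text` with stanza using only the processors needed for
    lemmas (no pos), mirroring the measurement described above."""
    import stanza  # imported lazily so this module loads without stanza installed
    nlp = stanza.Pipeline(lang, processors="tokenize,mwt,lemma")
    doc = nlp(text)
    return [word.lemma for sent in doc.sentences for word in sent.words]

if __name__ == "__main__":
    with open("alice_fi.txt", encoding="utf-8") as f:
        print(len(lemmatize_text(f.read())))
```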
Results:
As you can see from the script, I don't use the pos processor. This is the size of the image now (without pos):
To sum up: on Finnish, stanza without the pos processor is a bit more accurate than spacy and takes significantly less disk space and RAM, but is much slower. Stanza with the pos processor is much more accurate than spacy, but takes significantly more disk space and RAM and is dramatically slower. The proposal about having multiple variants of a language is my favorite as well! Do you want me to do a benchmark of spacy vs stanza for other languages? UPD: added results for Python 3.12 for the Alice in Wonderland test.
This is a really detailed test report, thank you so much!
Wow. I have an i5-8250U and 8GB of RAM.
I think we should go with that as well, to provide the best experience possible. At first I was thinking about it the wrong way: my first idea was to have multiple languages for different tokenizers, but I realized that would be extremely difficult to implement, since language names are used in a ton of places. It is, however, reasonably simple to switch tokenizers, so we can just make the tokenizer selectable on the admin page without separating them into their own languages.
I'm mostly interested in whether, if we add multiple languages, the additional disk space required would decrease due to shared dependencies. I think the latest 2GB disk size you mentioned is very reasonable as an option. But if the models themselves are so small, is there any way to decrease the disk space further? Can we remove Spacy and use Stanza by itself to save space? I know it returns a different format, but I can write a different tokenizer function for it. The tokenizer is quite a bit slower, but the PHP side of processing the text takes time as well, so it might not be that much of an issue; plus, users can decide which one they want to use. I would like to help implement this, but I won't be able to provide testing, or support for users who have issues with it, because my laptop would die trying to run this. I will think about how to implement a tokenizer selector. We should probably rebrand installable languages to installable packages or something.
What if we extend your idea about installable lemmatizers even further? Since some people want to run LinguaCafe in constrained environments, what if:
This can be done! My proposal is to preinstall simplemma instead of spacy, so that the image is minimal. It has a low footprint and runs very fast (as can be seen from the table in my previous message – I added simplemma there) – a good fit for a Raspberry Pi. If the user selects enhanced models, spacy or stanza is installed on demand using uv, which takes just a few seconds (5 seconds for stanza on my machine)! This is just one extra call to uv to install the package in a venv. I created four venvs using uv: empty, simplemma, spacy, and stanza. Here is what I got: pytorch takes up the most space, as @sergiolaverde0 expected. Here is a showcase of how fast uv is:
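The "minimal default" idea could look roughly like the wrapper below. The fallback behaviour is our own illustration, not simplemma's; the `simplemma.lemmatize(word, lang=...)` call follows that library's documented API, but treat the exact signature as an assumption to verify against the installed version:

```python
def lemmatize_word(word, lang="fi"):
    """Lemmatize a single word with simplemma, falling back to the word
    itself if simplemma is not installed (a minimal, degrade-gracefully
    default; the fallback is illustrative)."""
    try:
        # Lightweight, pure-Python, no model download needed - which is
        # what makes it attractive as a preinstalled default.
        import simplemma
    except ImportError:
        return word
    return simplemma.lemmatize(word, lang=lang)

print(lemmatize_word("talossa"))
```

The same try/import shape is what an on-demand install system needs anyway: the heavy libraries only have to be importable after the user opts in.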
Docker image building with uv becomes much, much faster and here is the footprint:
PS: please have a look at the "UPD" in my previous message: stanza on Python 3.12 is quite a bit faster than on Python 3.9.
Hi, I have a few questions, hoping not to derail this too much:
About using uv: I see. I also want to remark:
I will comment on it more later; just a few quick comments from my phone.
I want to check other non-spacy tokenizers as well and compare the sizes. I think Spacy is a good default option based on its size, and if there's another, smaller tokenizer for Vietnamese, I would prefer that instead of Stanza. There's also the option of using the Spacy multilingual model and the simple lemmatizer together. It would be a really good and easy solution for Czech and Latin lemma support. We could replace spacy for most languages with the simple lemmatizer, but there are three points to keep in mind:
I am thinking about it. I have no strong opinions about it, but I feel like Spacy is a good default option when available. The importance of tokenization speed will decrease in the future, because I want to make a queue for tokenization, and users will be able to start reading after the first chapter is finished.
Thank you! I can do testing and support users for this feature. Also, I do not think
I'll try it out sometime.
In that case I am open to adding Stanza as an additional option, at least for Finnish. If it goes well, I think we can add more languages and Stanza tokenizers. I will do everything on the Laravel and front-end side, and can also do Python if needed. (Honestly, I am a bit worried about having parts of the code that I don't test/support completely.)
Currently I think the only thing needed on the Python/Docker side is to make it installable like the other language packages. I will experiment with simplemma for Czech, Latin, Ukrainian and Welsh in the future. It also has Hindi, which was a requested language. I have wanted to split up tokenizer.py for a while, because it keeps growing; now it will be kind of necessary. It should probably have 3 files: tokenizer, import, and models (I'm not sure if this last one can be separated). I will probably do it for v0.13 or v0.14. It might take a while for me to do my part; I will be working a bit less on LinguaCafe, and will work on the parts of it that I want to use, because I feel a bit burned out. And thank you so much for working on this! Both Stanza and Simplemma are great tools for tokenizing; I didn't even know about them.
I did some really quick mockups last night and was able to reduce it to 1.81 GB by changing the base to I will add the first change for v0.13 regardless of what happens with the tokenizers, because there's no reason not to. While doing this I realized we can use Python 3.12 regardless of anything, so sorry for wasting your time with that pointless inquiry. Later I will test how the size evolves as I replace more and more languages with the Stanza variants, and check if I can shrink PyTorch more.
Did you mean replace to test the image size, or did you mean you will replace all spacy packages with stanza? Edit: I think it was the former. I'm a bit slow today and was confused.
@@ -22,6 +24,8 @@ RUN pip install -U --no-cache-dir \
    bottle \
    #spacy
    spacy \
    #stanza integration for spacy
    stanza \
spacy_stanza should be installed as well.
I propose to remove it, since there are at least two issues with the spacy_stanza library that I bumped into:
- Multi-word token expansion issue, misaligned tokens --> failed NER (German) explosion/spacy-stanza#70. As I understand it, this affects quality because of imprecise tokenization. Also, this generates verbose output I could not suppress.
- The pinned stanza version is not the latest.
Also, doing lemmatization using stanza directly is straightforward, see #255 (comment).
The code in this PR should be changed a bit to make it work. Currently it is broken, since I wanted to check the image size and did not care about a usable LinguaCafe at this stage.
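To illustrate the "stanza directly, no adapter" route, here is a hedged sketch. The processor list follows the no-pos setup discussed earlier in the thread; the `Document.to_dict()` output shape (a list of sentences, each a list of word dicts with `id`/`text`/`lemma` keys) follows stanza's documented format, but the helper and its filtering are our own illustration:

```python
def lemmas_from_sentences(sentences):
    """Flatten a stanza Document.to_dict()-style structure into
    (surface, lemma) pairs, skipping multi-word token range entries
    (whose ids are (start, end) tuples rather than ints)."""
    return [(w["text"], w["lemma"])
            for sent in sentences for w in sent
            if isinstance(w.get("id"), int) and "lemma" in w]

def stanza_lemmatize(text, lang="fi"):
    """Direct stanza usage, with no spacy_stanza adapter involved."""
    import stanza  # lazy import: only needed when this tokenizer is selected
    nlp = stanza.Pipeline(lang, processors="tokenize,mwt,lemma")
    return lemmas_from_sentences(nlp(text).to_dict())

# The helper can be exercised without loading any model, using stubbed data
# in the documented to_dict() shape:
sample = [[{"id": (1, 2), "text": "Kissatkin"},
           {"id": 1, "text": "Kissat", "lemma": "kissa"},
           {"id": 2, "text": "juoksevat", "lemma": "juosta"}]]
print(lemmas_from_sentences(sample))  # -> [('Kissat', 'kissa'), ('juoksevat', 'juosta')]
```

Keeping the flattening logic separate from the pipeline call is what makes it possible to write a different tokenizer function for stanza's format, as suggested above, without touching the model loading.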
I replaced where possible to see the image size I ended up with, and also because the easiest way to map the models to a language for testing was to ditch the Spacy counterparts anyway. After doing so, to see whether space savings from shared dependencies could shrink this image, I found that:
I now need to test that languages other than Finnish actually work – as in, check that they actually load and can tokenize a paragraph. I will be grateful for any help, so I have built a test on my fork; pull it with If the decrease in performance is not that big of a deal, if the rest of the languages can be made to work, and if they all follow the pattern of using less RAM than their Spacy counterparts, I can vouch for this to be our new default. But those ifs are doing quite the heavy lifting. Edit: And yes, I'm shying away from compiling PyTorch until we exhaust the alternatives; today I saw their
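The actual test lives on the fork and is not shown in this thread; as a sketch of the shape such a load-and-tokenize smoke test might take (the pipeline factory is injected so the loop itself stays testable; in the real test it would be something like `lambda lang: stanza.Pipeline(lang, ...)`):

```python
def smoke_test(langs, paragraph, make_pipeline):
    """Try to build a pipeline for each language and run it on a paragraph,
    collecting per-language outcomes instead of aborting on first failure."""
    results = {}
    for lang in langs:
        try:
            nlp = make_pipeline(lang)
            doc = nlp(paragraph)
            results[lang] = "ok" if doc else "empty output"
        except Exception as exc:
            # A failed download or model load should not stop the loop.
            results[lang] = f"failed: {exc}"
    return results

# Stub factory for illustration; "fi"/"xx" and the error text are made up.
def fake_factory(lang):
    if lang != "fi":
        raise ValueError("no model for " + lang)
    return lambda text: text.split()

print(smoke_test(["fi", "xx"], "Kissat juoksevat.", fake_factory))
# -> {'fi': 'ok', 'xx': 'failed: no model for xx'}
```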
A few things to keep in mind regarding replacing default tokenizers with stanza:
Thank you for the tests! This is a VERY detailed list. I'll do a few other things this week, but on the weekend or next week I am ready to do my part adding more tokenizers to LinguaCafe, starting with Finnish.
You are welcome! It was fun and educational to work on this benchmark. Here is the release with results in CSV format: https://github.com/rominf/spacy-vs-stanza/releases/tag/v0.1.0. Feel free to ask any questions, but please note that from 14:00 UTC today until the morning of May 27th (so, ~one week), I will be offline.
This PR is a request for comments about using the stanza model for Finnish and is not meant to be merged in its current state, hence it is a draft.
Unfortunately, Finnish lemmatization is not very accurate. I ran a slightly updated benchmark, https://github.com/aajanki/finnish-pos-accuracy, and found that the spacy lemmatization model used in LinguaCafe has F1=0.842, whereas the default stanza model for Finnish gives F1=0.958. I tried to use stanza with the https://github.com/explosion/spacy-stanza adapter (see PR code). It works. Also, the code changes generalize to other languages (stanza supports over 70 languages).
There is a huge downside though: the size of the resulting Docker image, which I guess is mostly because of the NVIDIA libraries that are automatically downloaded with the pytorch installation.
In conclusion, it is possible to significantly increase accuracy for Finnish (and probably some other languages) without increasing code complexity, at the cost of image size.
What do you think about Finnish lemmatization accuracy and introducing stanza?
Before (lemma is whole word – incorrect):
After (lemma is correct):