
RFC: Use stanza model for Finnish #255

Draft
wants to merge 1 commit into base: dev

Conversation

rominf
Contributor

@rominf rominf commented May 10, 2024

This PR is a request for comments about using a stanza model for Finnish and is not meant to be merged in its current state, hence it is a draft.

Unfortunately, Finnish lemmatization is not very accurate. I ran a slightly updated benchmark (https://github.com/aajanki/finnish-pos-accuracy) and found that the spacy lemmatization model used in LinguaCafe has F1=0.842, whereas the default stanza model for Finnish gives F1=0.958.
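For context, the F1 scores here are micro-averaged over gold vs. predicted lemmas. A minimal stdlib sketch of how such a score can be computed (this is an illustrative helper, not the benchmark's actual code, which aligns tokens against the treebank):

```python
from collections import Counter

def lemma_f1(gold: list[str], pred: list[str]) -> float:
    """Micro-averaged F1 over multisets of lemmas (order-insensitive sketch)."""
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # true positives: lemmas present in both outputs, counted with multiplicity
    tp = sum(min(gold_counts[lemma], pred_counts[lemma]) for lemma in gold_counts)
    if not gold or not pred or not tp:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(lemma_f1(["koira", "juosta"], ["koira", "juosta"]))  # 1.0
```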

I tried to use stanza with the https://github.com/explosion/spacy-stanza adapter (see the PR code). It works. Also, the code changes generalize to other languages (stanza supports over 70 languages).

There is a huge downside though: the size of the resulting docker image, mostly due, I guess, to the NVIDIA libraries that are automatically downloaded with the pytorch installation.

$ podman system df -v  # before
REPOSITORY                                      TAG                       IMAGE ID      CREATED        SIZE        SHARED SIZE  UNIQUE SIZE  CONTAINERS
...
localhost/linguacafedev_python                  latest                    85a508ecb0cb  54 minutes  1.095GB     424.3MB      671.1MB      0
...
$ podman system df -v  # after
REPOSITORY                                      TAG                       IMAGE ID      CREATED        SIZE        SHARED SIZE  UNIQUE SIZE  CONTAINERS
...
localhost/linguacafedev_python                  latest                    2c95c59fdae3  26 minutes  6.906GB     5.53GB       1.376GB      0
...

In conclusion, it is possible to significantly increase accuracy for Finnish (and probably some other languages) without increasing code complexity, at the cost of image size.

What do you think about Finnish lemmatization accuracy and introducing stanza?

[chart: lemma F1 vs. speed]

Before (lemma is the whole word – incorrect):
[Screenshot_20240510_163709]

After (lemma is correct):
[Screenshot_20240510_165630]

@simjanos-dev
Owner

simjanos-dev commented May 10, 2024

Oh wow, this looks great! I didn't know about this.

I would love to add this. We actually have a language install system, so the image size would not increase; it would only take up space for users who actually use this language.

Does this require a GPU? Can you please test what the size would be without the nvidia driver?

My only problems with it would be GPU dependence, plus my laptop is probably too weak to test this. After adding the 2 missing Spacy languages, my plan was to use different tokenizers; it would be VERY useful if I could keep using Spacy for more languages.

Thank you so much for working on this!

@sergiolaverde0 You may be interested in this.

@rominf
Contributor Author

rominf commented May 10, 2024

@simjanos-dev I am so glad you liked it!

Yes, it works without a GPU: I just added installation of the CPU version of torch on a separate line. The size of the image dropped significantly:

$ podman system df -v  # before
REPOSITORY                                      TAG                       IMAGE ID      CREATED        SIZE        SHARED SIZE  UNIQUE SIZE  CONTAINERS
...
localhost/linguacafedev_python                  latest                    85a508ecb0cb  54 minutes  1.095GB     424.3MB      671.1MB      0
...
$ podman system df -v  # after - GPU
REPOSITORY                                      TAG                       IMAGE ID      CREATED        SIZE        SHARED SIZE  UNIQUE SIZE  CONTAINERS
...
localhost/linguacafedev_python                  latest                    2c95c59fdae3  26 minutes  6.906GB     5.53GB       1.376GB      0
...
$ podman system df -v  # after - CPU
REPOSITORY                                      TAG                       IMAGE ID      CREATED        SIZE        SHARED SIZE  UNIQUE SIZE  CONTAINERS
...
localhost/linguacafedev_python                  latest                    04233fafac2c  About a minute  2.804GB     1.428GB      1.376GB      0

What are my next actions? Fix the documentation (add references to stanza in all places where spacy is mentioned, write proper commit and PR messages) and undraft the PR, or is there something else that needs to be done?

@@ -8,6 +8,8 @@ RUN apt-get update -y \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

RUN pip install torch --index-url https://download.pytorch.org/whl/cpu
Contributor Author

Comment about stanza and NVIDIA drivers is needed here.

@sergiolaverde0
Contributor

Looking at the URL for the Pytorch install, this doesn't need a GPU since it uses CPU as the computing platform.

I heard we can reduce the size of that install by compiling Pytorch from source without the unnecessary features, but I haven't done it before and I don't know by how much we can cut it.

If the accuracy increase is noticeable for enough languages, maybe we should consider making it the default. I'm concerned about performance when using CPU, so that's another thing to check.

I see a future here, but it will take effort.

@simjanos-dev
Owner

My enthusiasm has dropped a lot; I thought it would be much smaller. The model size is still huge compared to the 20-50MB models we used before.

A few more questions:

  • How much RAM does it use compared to the old model?
  • How much more space does it take, if you install +1 or +2 languages? On hugging face the model is 350MB zipped. I'm asking this because I assume there are some shared parts, and not every language will add 1.8GB.

If the accuracy increase is noticeable for enough languages maybe we should consider the possibility of making it the default.

I don't think I want to do that. Some users already had issues with RAM. I've seen attempted installs on Raspberry Pis, small free-tier hosted servers and old laptops. I myself have an old laptop. And I also want to host LinguaCafe on a VPS in the future and try to optimize it. I would rather make LinguaCafe smaller by default, if possible. However, I definitely want to add these models as an option.

What are my next actions?

I'm not sure, I will need some time to figure out what I would like to do. I will more than likely have a problem with testing this myself.

Since this is only needed for lemmas (except for languages that have no spaces or have readings), what if we used a huge amount of text and generated a list of lemmas for linguacafe? For most languages, that is the only value added by using another model or tokenizer instead of the multilingual Spacy one.

2 other options would be: adding them as extra installable languages like "Finnish (large)", or adding an API that lets people use other tokenizers. It would be easy to copy the current python container, modify it and add different models.

What do you think?
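A rough sketch of what that tokenizer API could look like: a registry keyed by (language, variant) that falls back to the default backend when a variant isn't installed. All names here are invented for illustration; the real backends would call spacy/stanza instead of the `str.split` stand-ins:

```python
from typing import Callable

# (language, variant) -> tokenizer function; stub backends for illustration
TOKENIZERS: dict[tuple[str, str], Callable[[str], list[str]]] = {}

def register(lang: str, variant: str):
    """Decorator that registers a tokenizer backend under (lang, variant)."""
    def wrap(fn):
        TOKENIZERS[(lang, variant)] = fn
        return fn
    return wrap

@register("finnish", "spacy")
def spacy_fi(text: str) -> list[str]:
    return text.split()  # stand-in for the real spacy pipeline

@register("finnish", "stanza")
def stanza_fi(text: str) -> list[str]:
    return text.split()  # stand-in for the real stanza pipeline

def tokenize(text: str, lang: str, variant: str = "spacy") -> list[str]:
    # fall back to the default variant if the requested one is not registered
    fn = TOKENIZERS.get((lang, variant)) or TOKENIZERS[(lang, "spacy")]
    return fn(text)
```

The admin page would then only need to store the selected variant string per language.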

@sergiolaverde0
Contributor

Well, seeing how we have already more or less frozen the features for v0.12, and since I have an assignment for this weekend, I suggest giving it some time.

Next week I will try to compile Pytorch from source and see what the absolute minimum size would be, so we can make a better informed decision.

For the time being, the option for larger models is my favourite.

@rominf sorry if we take our time, but Simjanos is right to be concerned about the accessibility of the hardware requirements.

@simjanos-dev
Owner

Well, seeing how we have already more or less frozen the features for v0.12, and since I have an assignment for this weekend, I suggest giving it some time.

Implementing this will definitely take a lot of time. I want to add everything to linguacafe that I can, but I can't do it at the rate requests are coming in. It's been insane progress in the last 4 months since release.

@rominf
Contributor Author

rominf commented May 11, 2024

Please take your time! I will post my results, so that you have some food for thought meanwhile.

You are right about the importance of accessible hardware requirements: my mistake, I was not thoughtful about this.

I will write about Finnish only, since I have not tried to do lemmatization in other languages.

Stanza language support is split into multiple models. For lemmatization, only the tokenize, mwt and lemma models are required; pos is optional, but it greatly improves accuracy. The size of the tokenize, mwt, lemma models is 6.8 MiB (that is six point eight MiB), while the size of the tokenize, mwt, lemma, pos models is 182.7 MiB.

My PC info:

Processors: 28 × Intel® Core™ i7-14700K
Memory: 62.5 GiB of RAM
Operating System: Fedora Linux 40
Kernel Version: 6.8.8-300.fc40.x86_64 (64-bit)
Python: 3.9.19

Here are the results of lemmatization of Universal Dependencies tree bank:

model                  F1    token/s
spacy-fi_core_news_lg  0.871 25191
spacy-fi_core_news_md  0.870 24768
spacy-fi_core_news_sm  0.842 27826
stanza-fi (no pos)     0.879 4631
stanza-fi (with pos)   0.958 1794

I also measured RAM usage on lemmatization of Alice in Wonderland in Finnish using scalene. Here is the script:

import collections
import sys

text = open("pg46569.txt").read()
if sys.argv[1] == "spacy":
    import spacy
    spacy.require_cpu()
    nlp = spacy.load("fi_core_news_sm", disable=['ner', 'parser'])
    # Just to be sure nothing extra happens on first nlp object call
    nlp("")
    doc = nlp(text)
    # Consume generator to avoid extra memory allocations
    collections.deque(((token.text, token.lemma_) for token in doc), maxlen=0)
elif sys.argv[1] == "stanza":
    import stanza
    # This will download only needed models to ~/stanza_resources/ and store them for next runs 
    nlp = stanza.Pipeline("fi", processors="tokenize,mwt,lemma", verbose=False, use_gpu=False)
    #nlp = stanza.Pipeline("fi", processors="tokenize,mwt,pos,lemma", verbose=False, use_gpu=False)
    # Just to be sure nothing extra happens on first nlp object call
    nlp("")
    doc = nlp(text)
    # Consume generator to avoid extra memory allocations
    collections.deque(((token.text, token.lemma) for sentence in doc.sentences for token in sentence.words), maxlen=0)
elif sys.argv[1] == "simplemma":
    from simplemma import simple_tokenizer, lemmatize
    doc = simple_tokenizer(text, iterate=True)
    # Consume generator to avoid extra memory allocations
    collections.deque(((token, lemmatize(token, lang="fi")) for token in doc), maxlen=0)

Results:

model                  max RAM (GiB) total time (s)
-- Python 3.9.19
spacy-fi_core_news_sm  0.9           2.7
stanza-fi (no pos)     0.4           7.5
stanza-fi (with pos)   2.0           23.4
-- Python 3.12.3
spacy-fi_core_news_sm  0.9           2.7
stanza-fi (no pos)     0.4           5.6
stanza-fi (with pos)   2.0           18.2
simplemma              0.5           1.7
-- Python 3.11.9
spacy-fi_core_news_sm  0.9           2.8
stanza-fi (no pos)     0.4           6.6
stanza-fi (with pos)   2.0           19.2
simplemma              0.6           1.7
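As a lighter-weight cross-check of such peak-RAM numbers, the peak RSS of the current process can also be read from the stdlib, without scalene. This measures the whole process rather than per-line usage, and assumes Linux semantics for ru_maxrss (KiB there, bytes on macOS):

```python
import resource
import sys

def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (Linux semantics assumed)."""
    kib = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        kib /= 1024  # macOS reports ru_maxrss in bytes, not KiB
    return kib / 1024

blob = bytearray(50 * 1024 * 1024)  # allocate ~50 MiB so the peak is visible
print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```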

As you can notice from the script, I no longer use the spacy_stanza library, but call stanza directly: there are no benefits to it for this specific task.

This is the size of the image now (without pos):

$ podman system df -v  # before
REPOSITORY                                      TAG                       IMAGE ID      CREATED        SIZE        SHARED SIZE  UNIQUE SIZE  CONTAINERS
...
localhost/linguacafedev_python                  latest                    85a508ecb0cb  54 minutes  1.095GB     424.3MB      671.1MB      0
...
$ podman system df -v  # after
REPOSITORY                                      TAG                       IMAGE ID      CREATED        SIZE        SHARED SIZE  UNIQUE SIZE  CONTAINERS
...
localhost/linguacafedev_python                  latest                    9cabe369f271  13 seconds      2.074GB     1.429GB      644.8MB      0
...

To sum up, stanza without the pos processor is a bit more accurate on Finnish than spacy and takes significantly less disk space and RAM, but is much slower. Stanza with the pos processor is much more accurate on Finnish than spacy, but takes significantly more disk space and RAM and is tremendously slower.

The proposal about having multiple variants of language is my favorite as well!

Do you want me to do a benchmark of spacy vs stanza for other languages?

UPD: added results for Python 3.12 for Alice in Wonderland test.
UPD: added simplemma for Alice in Wonderland test.
UPD: added results for Python 3.11 for Alice in Wonderland test.

@simjanos-dev
Owner

This is a really detailed test report, thank you so much!

Operating System: Fedora Linux 40
Kernel Version: 6.8.8-300.fc40.x86_64 (64-bit)
Processors: 28 × Intel® Core™ i7-14700K
Memory: 62.5 GiB of RAM

Wow. I have an i5-8250u and 8GB RAM.

The proposal about having multiple variants of language is my favorite as well!

I think we should go with that as well to provide the best experience possible.

At first I was thinking about it the wrong way. My first idea was to have multiple languages for different tokenizers, but I realized it would be extremely difficult to implement, since language names are used in a ton of places.

It is however reasonably simple to switch tokenizers. So we can just make the tokenizer selectable on the admin page without separating them into their own language.

Do you want me to do a benchmark of spacy vs stanza for other languages?

I'm mostly interested in whether, if we add multiple languages, the additional disk space required would decrease due to shared dependencies.

I think the latest 2GB disk size you commented is very reasonable to add as an option. But if the models themselves are so small, is there any way to decrease the disk space further? Can we remove Spacy and use Stanza by itself to save space? I know it returns a different format, but I can write a different tokenizer function for it.

The tokenizer is quite a bit slower, but the PHP side of processing the text takes time as well, so it might not be that much of an issue; plus, users can decide which one they want to use.

I would like to help implement this, but I won't be able to provide testing, or support for users who will have issues with it, because my laptop would die trying to run this.

I will think about how to implement a tokenizer selector. We should probably rebrand installable languages to installable packages or something.

@rominf
Contributor Author

rominf commented May 11, 2024

What if we extend your idea about installable lemmatizers even further? Since some people want to run LinguaCafe in constrained environments, what if:

  1. The size of the usable linguacafedev_python image decreased significantly?
  2. Not just models, but model runners (spacy, stanza) were installable on demand in just a few seconds?

This can be done!

My proposal is to preinstall simplemma instead of spacy, so that the image is minimal. It has a low footprint and runs very fast (as can be seen from the table in my previous message – I added simplemma there) – a good fit for a raspberry pi. If the user selects enhanced models, spacy or stanza is installed on demand using uv, which installs stanza in just a few seconds (5 seconds on my machine)! This is just one extra call to uv to install the package into the venv.
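Mechanically, the on-demand install is a single subprocess command. A hedged sketch of how the command could be built, preferring uv when available and falling back to pip (the helper itself is hypothetical; the CPU wheel index is the one used above):

```python
import shutil
import sys

CPU_INDEX = "https://download.pytorch.org/whl/cpu"

def install_cmd(package: str) -> list[str]:
    """Build the install command, preferring uv when it is on PATH."""
    if shutil.which("uv"):
        base = ["uv", "pip", "install"]
    else:
        base = [sys.executable, "-m", "pip", "install"]
    # extra index so torch resolves to the CPU-only wheels
    return base + [package, "--extra-index-url", CPU_INDEX]

print(install_cmd("stanza"))
```

The result would then be passed to subprocess.run from the python service when the admin enables an enhanced model.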

I created four venvs using uv: empty, simplemma, spacy, and stanza. Here is what I got:
[screenshot: venv sizes]

pytorch takes the most space, as @sergiolaverde0 expected.

Here is the showcase of how fast uv is:

(stanza-pip) rominf@rominf-fedora /t/venv> time pip install stanza --extra-index-url https://download.pytorch.org/whl/cpu
...
________________________________________________________
Executed in   20.89 secs    fish           external
   usr time    8.45 secs    0.00 micros    8.45 secs
   sys time    0.84 secs  506.00 micros    0.84 secs

(stanza-uv) rominf@rominf-fedora /t/venv> time uv pip install stanza --extra-index-url https://download.pytorch.org/whl/cpu
Resolved 20 packages in 2.30s
Downloaded 20 packages in 2.52s
Installed 20 packages in 285ms
...
________________________________________________________
Executed in    5.13 secs    fish           external
   usr time    1.52 secs    0.00 micros    1.52 secs
   sys time    0.95 secs  335.00 micros    0.95 secs

Docker image building with uv becomes much, much faster, and here is the footprint:

$ podman system df -v  # before
REPOSITORY                                      TAG                       IMAGE ID      CREATED        SIZE        SHARED SIZE  UNIQUE SIZE  CONTAINERS
...
localhost/linguacafedev_python                  latest                    85a508ecb0cb  54 minutes  1.095GB     424.3MB      671.1MB      0
...
$ podman system df -v  # after
REPOSITORY                                      TAG                       IMAGE ID      CREATED        SIZE        SHARED SIZE  UNIQUE SIZE  CONTAINERS
...
localhost/linguacafedev_python                  latest                    5f19a370db3b  13 seconds      327.9MB     164.9MB      163MB        0
...

PS: please have a look at "UPD" in my previous message: stanza on Python 3.12 is quite a bit faster than on Python 3.9.

@sergiolaverde0
Contributor

sergiolaverde0 commented May 11, 2024

Hi, I have a few questions, hoping not to derail this too much:

  • How is performance on Python 3.11? That is what Debian 12 currently packages, so RPis and other SBCs probably ship the same, and thus it makes a good baseline of "most users will have this or newer".
  • Is there any documentation about ARM compatibility of these models? I can't seem to find any. Currently our Python image is only built for amd64 because some Spacy languages have dependencies that are not available for ARM. Apple Silicon runs the image via virtualization and other devices are unsupported.
  • Off topic, but I noticed you are using podman; have you encountered any issues? Some months ago users had trouble trying to run the images with rootless podman, and the only solution we had was asking them to use rootful podman or docker.

About using uv: I'm really not a big fan of using pre-1.0 software in "production" for critical tasks. However, it might seriously make installing extra components faster, and if we are having so many of those, the benefits might outweigh the issues it causes. If we go this route we will have to pin the version and update it manually, unlike the rest of the tools we use.

I see simplemma does not consume less RAM than Stanza without pos. Sure, it is faster, but I think we could skip it, at least for the time being, to reduce mental overload while planning what to do. We might also run a survey to ask users how much they care about text import times.

I also want to remark: Stanza supports languages that Spacy doesn't, so this might solve our Vietnamese issue and maybe our Tagalog issue as a side effect.

@simjanos-dev
Owner

simjanos-dev commented May 11, 2024

I will comment on it more later, just a few quick comments from phone.

I also want to remark: Stanza supports languages that Spacy doesn't, so this might solve our Vietnamese issue and maybe our Tagalog issue as a side effect.

I want to check other non-spacy tokenizers as well and compare the sizes. I think Spacy is a good default option based on its size, and if there's another, smaller tokenizer for Vietnamese, I would prefer that over Stanza.

There's also the option of using the Spacy multilingual model and the simple lemmatizer together. It would be a really good and easy solution for Czech and Latin lemma support.

We could replace spacy with the simple lemmatizer for most languages, but there are 3 points to keep in mind:

  • Some languages have or will have gender tagging support.
  • We need to make sure that the simple lemmatizer is accurate enough.
  • Part of speech may be an important core feature. I'm thinking about adding an option to treat the same word with different pos as 2 different unique words, so they can have more accurate lemmas and readings. This is just an idea, and won't be implemented soon.

I am thinking about it. I have no strong opinions about it, but I feel like using Spacy is a good default option when available.

The importance of tokenization speed will decrease in the future, because I want to make a queue for it, and users will be able to start reading after the first chapter is finished.

@rominf
Contributor Author

rominf commented May 11, 2024

@sergiolaverde0

  • I updated the table in the message above. Python 3.11 is a bit slower than 3.12.
  • pytorch is available on ARM (https://download.pytorch.org/whl/torch_stable.html; look for cpu/torch-2.3.0, arm64.whl, and aarch64.whl). As for stanza itself, it is written fully in Python and there is no arch restriction in its PyPI classifiers. It should run fine on ARM. Of course, checking it on a cloud server would not harm.
  • Yes, I use podman. There were issues with SELinux. Here are short instructions (disclaimer: I am not an SELinux expert and I don't know if this is the most secure way, yet it is surely more secure than disabling SELinux):
$ git clone -b deploy https://github.com/simjanos-dev/LinguaCafe.git linguacafe && cd linguacafe
$ sudo semanage fcontext -a -t svirt_sandbox_file_t "$HOME/linguacafe(/.*)?"
$ sudo restorecon -vR ~/linguacafe  # repeat this command every time after downloading dictionaries into storage/app/dictionaries/
$ sudo setsebool container_manage_cgroup 1
$ sudo chmod 777 -R ~/linguacafe/  # as per original instruction
$ podman-compose up -d

@rominf
Contributor Author

rominf commented May 11, 2024

@simjanos-dev

I would like to help implement this, but I won't be able to provide testing, or support for users who will have issues with it, because my laptop would die trying to run this.

Thank you! I can do testing and support users for this feature. Also, I do not think the pos version of stanza will behave differently in any way compared to the non-pos version (except for accuracy), and you should be able to run the non-pos version. :-)

@simjanos-dev
Owner

simjanos-dev commented May 11, 2024

and you should be able to run non-pos version. :-)

I'll try it out sometime.

Thank you! I can do testing and support users for this feature.

In that case I am open to adding Stanza as an additional option for at least Finnish. If it goes well, I think we can add more languages and Stanza tokenizers. I will do everything on the Laravel and front-end side, and can also do Python if needed. (Honestly, I am a bit worried about having parts of the code that I don't test/support completely.)

What are my next actions?

Currently I think the only thing needed on the Python/docker side is to make it installable like other language packages.

I will experiment with simplemma for Czech, Latin, Ukrainian and Welsh in the future. It also has Hindi which was a requested language.

I have wanted to split up tokenizer.py for a while, because it keeps growing. Now it will be kind of necessary. Currently it should have 3 files: tokenizer, import and models (I'm not sure if this one can be separated). I will probably do it for v0.13 or v0.14.

It might take a while for me to do my part, I will be working a bit less on linguacafe, and will work on parts of it that I want to use, because I feel a bit burned out.

And thank you so much for working on this! Both Stanza and Simplemma are great tools for tokenizing, I didn't even know about them.

@sergiolaverde0
Contributor

I did some really quick mockups last night and was able to reduce it to 1.81 GB by changing the base to python:slim and ensuring no cache is used when installing Pytorch.

I will add the first change for v0.13 regardless of what happens with the tokenizers, because there's no reason not to. While doing this I realized we can use Python 3.12 regardless of anything, so sorry for wasting your time with that pointless inquiry.

Later I will test how the size evolves as I replace more and more languages with the Stanza variants, and check if I can shrink Pytorch more.

@simjanos-dev
Owner

simjanos-dev commented May 14, 2024

Did you mean replace to test the image size, or did you mean you will replace all spacy packages with stanza?

Edit: I think it was the former. I'm a bit slow today and was confused.

@@ -22,6 +24,8 @@ RUN pip install -U --no-cache-dir \
bottle \
#spacy
spacy \
#stanza integration for spacy
stanza \
Contributor

spacy_stanza should be installed as well

Contributor Author

I propose to remove it, since there are at least two issues with the spacy_stanza library that I bumped into:

  1. Multi-word token expansion issue, misaligned tokens --> failed NER (German) explosion/spacy-stanza#70. As I understand it, this affects quality because of imprecise tokenization. Also, it generates verbose output that I could not suppress.
  2. The pinned stanza version is not the latest.

Also, doing lemmatization using stanza directly is straightforward, see #255 (comment).

The code in this PR should be changed a bit to make it work. Currently it is broken, since I wanted to check the image size and did not care about a usable LinguaCafe at this stage.

@sergiolaverde0
Contributor

sergiolaverde0 commented May 15, 2024

Did you mean replace to test the image size, or did you mean you will replace all spacy packages with stanza?

Replace where possible, to see the image size I would end up with, and also because the easiest way to map the models to a language for testing was to ditch the Spacy counterparts anyway.

And after doing so, to see if space savings from shared dependencies could shrink this image, I found that:

  • Eight languages, including Norwegian, Swedish, Croatian and Danish, either lack an mwt model or lack any model altogether, so if I try to install them with the generic python3 -c 'import stanza; stanza.download("x", processors="tokenize,mwt,lemma")', it fails. I kept those on their Spacy variants to get the image to build.
  • English also lacks a default mwt, since none of the options listed in the docs are marked as default. It installs just fine, but I don't know how that will impact accuracy and performance; solving it should be easy enough if we dig deeper into the docs.
  • Darn, those Stanza models are tiny! Most if not all of them were less than 7MB, so by using them as a replacement for the usually larger Spacy models I reduced the image down to 1.59GB. I ended up with a total of 11 Stanza languages and 9 Spacy languages, counting the multilingual one separately.
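One way to handle the mwt gap would be to build the processors string per language instead of hardcoding it. A sketch, where the set of mwt-less languages is purely illustrative (not a verified list; the real set would come from testing each download):

```python
# languages assumed (for illustration) to lack an mwt model; not a verified list
NO_MWT = {"no", "sv", "hr", "da"}

def processors_for(lang: str, with_pos: bool = False) -> str:
    """Build the stanza processors string, skipping mwt where unavailable."""
    parts = ["tokenize"]
    if lang not in NO_MWT:
        parts.append("mwt")
    if with_pos:
        parts.append("pos")
    parts.append("lemma")
    return ",".join(parts)

print(processors_for("fi", with_pos=True))  # tokenize,mwt,pos,lemma
print(processors_for("sv"))                 # tokenize,lemma
```

The resulting string would then be passed to stanza.download and stanza.Pipeline for each installable language package.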

I'm now going to test that languages other than Finnish actually work, as in, check that they load and can tokenize a paragraph. I will be grateful for any help, so I have built a test image on my fork; pull it with docker pull ghcr.io/sergiolaverde0/linguacafe-python-service:stanza. Depending on how this goes, I will see how things behave with the languages whose models and dependencies were too big to be shipped by default, like Japanese and Russian.

If the decrease in performance is not that big of a deal, if the rest of the languages can be worked around to be usable, and if they all follow the pattern of using less RAM than their Spacy counterparts, I can vouch for this to be our new default. But those ifs are doing quite the heavy lifting.

Edit: And yes, I'm shying away from compiling Pytorch until we exhaust alternatives; today I saw their setup.py is 1500 lines long.

@simjanos-dev
Owner

simjanos-dev commented May 15, 2024

A few things to keep in mind regarding replacing default tokenizers with stanza:

  • I do not know yet, but part of speech may be needed for an important feature in the future. I haven't decided on it yet.
  • Gender tagging is very important to keep where it's available. Some languages in spacy support it, though I haven't added support for them yet in linguacafe. I'm pretty sure it exists in Danish, Swedish, Italian and Spanish.
  • Japanese has a post-processing step where I combine multiple words into one after tokenization, which relies on the word splitting being the way it is in spacy and on correct POS tags. Chinese and Thai users would also lose data if their words were split differently. I can't speak for their accuracy, except for Japanese, which I find pretty good, apart from 2 problems that my post-processing introduced.

I will work on moving this post-processing from Laravel to Python today. It is just an additional function, so I will merge this in on Friday after the release if there are no PR-s touching the file. If there are, I'll modify my code to avoid creating conflicts with other people's work. Edit: did not work.

@rominf
Contributor Author

rominf commented May 19, 2024

https://rominf.github.io/spacy-vs-stanza

@simjanos-dev
Owner

Thank you for the tests! This is a VERY detailed list.

I'll do a few other things this week, but on the weekend or next week I am ready to do my part adding more tokenizers to linguacafe starting with Finnish.

@rominf
Contributor Author

rominf commented May 20, 2024

You are welcome! It was fun and educational to work on this benchmark.

Here is the release with results in CSV format: https://github.com/rominf/spacy-vs-stanza/releases/tag/v0.1.0.

Feel free to ask any questions, but please note that from 14:00 UTC today until the morning of May 27th (so, ~one week), I will be offline.
