Discussion - stopwords #212

Open
leomaurodesenv opened this issue Jun 30, 2021 · 4 comments

Comments

@leomaurodesenv

I like Texthero and would like to contribute somehow.
First, I want to discuss something that has been bothering me: stopwords.

Problem: I want to deploy a solution without the spaCy stopwords requirement and, if possible, add my own stopwords.
My solution is based on Docker containers. It is bad practice to download files every time a new container is instantiated: it causes a cold-start problem and wastes space (because I don't use the stopwords).

With that in mind:

  • Is it possible to remove the spaCy stopwords requirement?
  • How can we add our own stopwords, according to our language needs?
  • Is there a stopword dictionary for many languages outside of spaCy?
  • How can we turn off the stopwords download?

@jbesomi
Owner

jbesomi commented Jul 1, 2021

Hi Leonardo, thank you for opening this issue. I agree with you, it's quite annoying that stopwords are downloaded even when they are not needed. This should have been fixed in #194. I will soon release a new version that includes the patch.

Regarding your other questions:

  1. Removal of the spaCy stopwords requirement. I believe we can completely get rid of the spaCy requirement by saving all stopwords in a txt file (or another file format) and loading that file directly. Do you want to work on that?
  2. Multilingual support is something we have wanted to introduce for quite a long time ... if you are interested in helping to develop a general solution that works for many languages, I would be more than happy to talk!
  3. Currently, Texthero fully supports only English. Adding stopwords for other languages (with spaCy, for instance) should be trivial though; this is closely related to point 1.
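
As an illustration of point 1, here is a minimal sketch of loading stopwords from a plain-text file and merging in user-supplied ones. It is not Texthero's actual implementation: the function names `load_stopwords` and `remove_stopwords`, the file format (one word per line, `#` for comments), and the demo file `en.txt` are all assumptions made for the example.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional, Set


def load_stopwords(path: Path, extra: Optional[Iterable[str]] = None) -> Set[str]:
    """Read one stopword per line from a plain-text file (hypothetical format)."""
    words = set()
    for line in path.read_text(encoding="utf-8").splitlines():
        word = line.strip().lower()
        # Skip blank lines and '#' comment lines.
        if word and not word.startswith("#"):
            words.add(word)
    # Merge in user-supplied stopwords, if any.
    if extra is not None:
        words.update(w.lower() for w in extra)
    return words


def remove_stopwords(text: str, stopwords: Set[str]) -> str:
    """Drop whitespace-separated tokens that appear in the stopword set."""
    return " ".join(t for t in text.split() if t.lower() not in stopwords)


# Demo with a throwaway file standing in for a bundled stopwords file.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "en.txt"
    demo_file.write_text("# English sample\nthe\na\nis\n", encoding="utf-8")
    demo_stopwords = load_stopwords(demo_file, extra=["docker"])

demo_result = remove_stopwords("The container is a Docker image", demo_stopwords)
```

With one such file per language shipped inside the package (or inside the Docker image), nothing would need to be downloaded at container start, and other languages would only require adding another file.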

Hope it helps!
Best,

@jbesomi
Owner

jbesomi commented Jul 1, 2021

Hi Leonardo,

I just released a new version (Texthero 1.1.0); stopwords should now be downloaded lazily. Would you mind trying it and letting me know? Later on, we can discuss your other great points further!

@leomaurodesenv
Author

Hello @jbesomi, sorry for my late answer.
Sure, I'm going to try it out next week.

Yes, I would like to help, but I'm not sure how to support multilingual stopwords. Adding multilingual embeddings could improve results, but it would also slow down the code. This is tough, heheh.

Regarding the removal of the spaCy stopwords requirement: I'm going to take a look and post a message here.

@jbesomi
Owner

jbesomi commented Jul 8, 2021

Thanks for the update, Leo. As you suggested, we can start by improving the stopwords (for English) and see how it goes. Multilingual support requires some thinking and refactoring; we can discuss that later, once the simpler version is implemented.
Best,
