
Add multiprocessing #20

Open
jdvala opened this issue Nov 22, 2021 · 4 comments
Labels
enhancement New feature or request

Comments


jdvala commented Nov 22, 2021

Given that cleaning text can sometimes be very time-consuming when the amount of text data is huge, it would be really good if clean-text provided built-in multiprocessing support.

It could be as simple as adding a flag, plus an option to pass in a list of texts instead of a single text.

What do you think?

jdvala (Author) commented Nov 22, 2021

I can help you do that if you agree.

jfilter (Owner) commented Nov 22, 2021

Hey @jdvala, this is a good idea. I would suggest using Python's multiprocessing, e.g. with a pool. What's your opinion on this?
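The pool-based approach suggested here could look something like the sketch below. Note that `clean_many` and `_clean` are hypothetical names for this example only; `_clean` is a trivial stand-in for clean-text's real `clean` function so the sketch is self-contained.

```python
from multiprocessing import Pool

def _clean(text):
    # Stand-in for cleantext.clean(text, ...): collapse whitespace, lowercase.
    return " ".join(text.lower().split())

def clean_many(texts, processes=None):
    # Pool.map distributes the texts across worker processes and
    # preserves the input order in the returned list.
    with Pool(processes=processes) as pool:
        return pool.map(_clean, texts)
```

With `processes=None`, `multiprocessing.Pool` defaults to the number of CPUs, so callers only need to set it when they want fewer workers.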

@jfilter jfilter added the enhancement New feature or request label Nov 22, 2021
jdvala (Author) commented Jan 4, 2022

Hi @jfilter, I have a few questions I would like to discuss before starting to implement this.
If we enable multiprocessing, we need to accept a list of texts rather than a single text, but currently the clean function only accepts str.

  • Does it make sense to add a completely separate function that calls the clean function?
  • Or do we change the clean function itself?

I would recommend going for the first option: people have gotten used to the current signature of the function, and changing it would break that. So, in my opinion, we should have a clean_parallel function which calls the clean function.
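A `clean_parallel` wrapper along the lines proposed above might be sketched as follows, using `functools.partial` to forward cleaning options to the workers. This is only an illustration, not the project's actual API: `_clean` is a stand-in for the real `clean(text, **kwargs)`.

```python
from functools import partial
from multiprocessing import Pool

def _clean(text, lower=True):
    # Stand-in for the real clean(text, **kwargs) from clean-text.
    text = " ".join(text.split())
    return text.lower() if lower else text

def clean_parallel(texts, processes=None, **kwargs):
    # Freeze the cleaning options once so every worker applies the same
    # kwargs; partial objects pickle fine when the function is module-level.
    worker = partial(_clean, **kwargs)
    with Pool(processes=processes) as pool:
        return pool.map(worker, texts)
```

Keeping `clean` untouched and putting the parallelism in a wrapper means existing callers see no change at all.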

Secondly, if a single text is large enough, breaking it up and parallelizing the pieces also makes sense.
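Parallelizing a single large text could be sketched like this, splitting on paragraph boundaries so that no chunk cuts a sentence in half. `clean_large_text` and `_clean` are hypothetical names; the stand-in cleaner just normalizes whitespace and case.

```python
from multiprocessing import Pool

def _clean(chunk):
    # Stand-in cleaner: collapse whitespace, lowercase.
    return " ".join(chunk.lower().split())

def clean_large_text(text, processes=None):
    # Split on blank lines, clean the paragraphs in parallel,
    # then stitch the cleaned pieces back together in order.
    chunks = [c for c in text.split("\n\n") if c.strip()]
    with Pool(processes=processes) as pool:
        cleaned = pool.map(_clean, chunks)
    return "\n\n".join(cleaned)
```

One caveat with this approach: any cleaning step that depends on context spanning a chunk boundary would need a smarter splitting strategy.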

At this point I am not sure which of the two we should implement first.

jfilter (Owner) commented Jan 5, 2022

Hey @jdvala, in my opinion, the clean function should also accept a list of texts and then return a list of processed texts.

Then, we need a new parameter, e.g. n_jobs, to specify the maximum number of parallel jobs. This is how joblib does it. We could also use joblib to do the multiprocessing, or take a look at https://github.com/Slimmer-AI/mpire, since working with Python's multiprocessing feels clunky.
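The n_jobs convention that joblib uses (`-1` meaning all cores, `1` meaning no parallelism) could also be mirrored with only the standard library. In this sketch, `clean` accepts either a single str or a list of texts, and `_clean` is a stand-in for the real cleaning routine; none of these names are confirmed API.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def _clean(text):
    # Stand-in for cleantext.clean: collapse whitespace, lowercase.
    return " ".join(text.lower().split())

def clean(texts, n_jobs=1):
    # A single str stays a str, preserving the current call signature.
    if isinstance(texts, str):
        return _clean(texts)
    if n_jobs == 1:
        return [_clean(t) for t in texts]  # sequential, no process overhead
    workers = os.cpu_count() if n_jobs == -1 else n_jobs
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(_clean, texts))
```

Defaulting to `n_jobs=1` keeps behavior identical for existing users; parallelism becomes strictly opt-in.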
