
Add multiprocessing #20

Open
jdvala opened this issue Nov 22, 2021 · 4 comments
Labels
enhancement New feature or request

Comments


jdvala commented Nov 22, 2021

Given that cleaning text can sometimes be very time-consuming when the amount of text data is huge, it would be really good if clean-text provided built-in multiprocessing support.

It could be as simple as adding a flag, plus an option to pass in a list of texts instead of a single text.

What do you think?

jdvala (Author) commented Nov 22, 2021

I can help you do that if you agree.

jfilter (Owner) commented Nov 22, 2021

Hey @jdvala, this is a good idea. I would suggest using Python's multiprocessing, e.g. with a pool. What's your opinion on this?
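The pool-based approach suggested here could look something like the sketch below. Note that `clean_many` and `_clean` are hypothetical names for this example only; `_clean` is a trivial stand-in for clean-text's real `clean` function so the sketch is self-contained.

```python
from multiprocessing import Pool

def _clean(text):
    # Stand-in for cleantext.clean(text, ...): collapse whitespace, lowercase.
    return " ".join(text.lower().split())

def clean_many(texts, processes=None):
    # Pool.map distributes the texts across worker processes and
    # preserves the input order in the returned list.
    with Pool(processes=processes) as pool:
        return pool.map(_clean, texts)
```

With `processes=None`, `multiprocessing.Pool` defaults to the number of CPUs, so callers only need to set it when they want fewer workers.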

@jfilter jfilter added the enhancement New feature or request label Nov 22, 2021
jdvala (Author) commented Jan 4, 2022

Hi @jfilter, I have a few questions I would like to discuss before starting to implement this.
If we enable multiprocessing, we need to accept a list of texts rather than a single text, but currently the clean function only accepts str.

  • Does it make sense to add a completely separate function that calls the clean function?
  • Or do we change the clean function itself?

I would recommend going for the first option: people have gotten used to the current signature of the function, and changing it would break that. So, in my opinion, we should have a clean_parallel function which calls the clean function.
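A `clean_parallel` wrapper along the lines proposed above might be sketched as follows, using `functools.partial` to forward cleaning options to the workers. This is only an illustration, not the project's actual API: `_clean` is a stand-in for the real `clean(text, **kwargs)`.

```python
from functools import partial
from multiprocessing import Pool

def _clean(text, lower=True):
    # Stand-in for the real clean(text, **kwargs) from clean-text.
    text = " ".join(text.split())
    return text.lower() if lower else text

def clean_parallel(texts, processes=None, **kwargs):
    # Freeze the cleaning options once so every worker applies the same
    # kwargs; partial objects pickle fine when the function is module-level.
    worker = partial(_clean, **kwargs)
    with Pool(processes=processes) as pool:
        return pool.map(worker, texts)
```

Keeping `clean` untouched and putting the parallelism in a wrapper means existing callers see no change at all.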

Secondly, if a single text is large enough, breaking it up and parallelizing the pieces also makes sense.
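Parallelizing a single large text could be sketched like this, splitting on paragraph boundaries so that no chunk cuts a sentence in half. `clean_large_text` and `_clean` are hypothetical names; the stand-in cleaner just normalizes whitespace and case.

```python
from multiprocessing import Pool

def _clean(chunk):
    # Stand-in cleaner: collapse whitespace, lowercase.
    return " ".join(chunk.lower().split())

def clean_large_text(text, processes=None):
    # Split on blank lines, clean the paragraphs in parallel,
    # then stitch the cleaned pieces back together in order.
    chunks = [c for c in text.split("\n\n") if c.strip()]
    with Pool(processes=processes) as pool:
        cleaned = pool.map(_clean, chunks)
    return "\n\n".join(cleaned)
```

One caveat with this approach: any cleaning step that depends on context spanning a chunk boundary would need a smarter splitting strategy.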

At this point I am not sure which of the two we should implement first.

jfilter (Owner) commented Jan 5, 2022

Hey @jdvala, in my opinion, the clean function should also accept a list of texts and then return a list of processed texts.

Then, we need a new parameter, e.g. n_jobs, to specify the maximum number of parallel jobs. This is how joblib does it. We could also use joblib to do the multiprocessing, or take a look at https://github.com/Slimmer-AI/mpire, since working with Python's multiprocessing feels clunky.
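The n_jobs convention that joblib uses (`-1` meaning all cores, `1` meaning no parallelism) could also be mirrored with only the standard library. In this sketch, `clean` accepts either a single str or a list of texts, and `_clean` is a stand-in for the real cleaning routine; none of these names are confirmed API.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def _clean(text):
    # Stand-in for cleantext.clean: collapse whitespace, lowercase.
    return " ".join(text.lower().split())

def clean(texts, n_jobs=1):
    # A single str stays a str, preserving the current call signature.
    if isinstance(texts, str):
        return _clean(texts)
    if n_jobs == 1:
        return [_clean(t) for t in texts]  # sequential, no process overhead
    workers = os.cpu_count() if n_jobs == -1 else n_jobs
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(_clean, texts))
```

Defaulting to `n_jobs=1` keeps behavior identical for existing users; parallelism becomes strictly opt-in.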
