
Multiprocessing with apply_async does not work #33

Open
sanjaysrikakulam opened this issue Jun 23, 2021 · 6 comments


sanjaysrikakulam commented Jun 23, 2021

Hi @lmdu,

Sorry again, I found one more issue while trying to parallelize and share the index across multiple processes.

Here is the code:

```python
from multiprocessing import Manager, Pool
from pyfastx import Fasta

def print_seq_names(fasta_obj, lock):
    for i in range(5):
        with lock:
            print(fasta_obj[i].name)

def error_call(err):
    print(err)

fasta_obj = Fasta("uniprot_sprot.fasta.gz")
pool = Pool(5)
lock = Manager().Lock()

for i in range(4):
    pool.apply_async(print_seq_names, args=(fasta_obj, lock), error_callback=error_call)

pool.close()
pool.join()
```

Error:

```
<multiprocessing.pool.ApplyResult object at 0x7f7605085090>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f20390>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f201d0>
can't pickle Fasta objects
<multiprocessing.pool.ApplyResult object at 0x7f76b0f20390>
can't pickle Fasta objects
```

Is it no longer possible to share the Fasta object, the index, or the identifier objects with multiple processes?

Also, if I make the fasta object and the identifier object global variables in my code (the above is a dummy sample), only one process/core runs at a time (out of 64 cores in the real code) and the rest sit in a sleep state. Do you know why this happens?

Any help here would be great as well,

Thank you!

P.S:
OS: CentOS 7
Python: 3.7.7
pyfastx: 0.8.3


sanjaysrikakulam commented Jun 30, 2021

Hi @lmdu,

Any idea what is actually happening here with the multiprocessing? I am writing a paper for my tool, which uses pyfastx and depends on parallelization. Any fix or suggestion to get this working would be really great!


lmdu commented Jun 30, 2021

I am so sorry. Pyfastx does not support pickling, so you cannot pass a Fasta object as a parameter to multiprocessing. It is very complicated to implement. Moreover, I have not yet found a solution for sharing a file handle between different processes. I will add pickle support to pyfastx in v0.9.0.

@sanjaysrikakulam (Author)

OK, thank you for the information. But will there be memory overhead if I create a fasta object in every child process?

Say my fasta/fastq index is 40 or 50 GiB and I use 64 cores; if each of my processes creates its own fasta object, that means there will be memory overhead, right?


lmdu commented Jun 30, 2021

There may be no memory overhead. Pyfastx will not load the entire index into memory.

@sanjaysrikakulam (Author)

OK, I will check this and see whether each process loads anything into memory when a fasta/fastq object is created in every child process.
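[Editor's note: one quick way to check per-process memory growth is to read the resident set size from /proc before and after creating the object. This is a Linux-only sketch; the commented-out pyfastx call and filename are placeholders.]

```python
import os

def rss_kib():
    # Resident set size of the current process, in KiB
    # (Linux-only: parsed from /proc/<pid>/status).
    with open(f"/proc/{os.getpid()}/status") as fh:
        for line in fh:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

baseline = rss_kib()
# fasta_obj = pyfastx.Fasta("uniprot_sprot.fasta.gz")  # create the object here
grown = rss_kib() - baseline
print(f"RSS grew by {grown} KiB")
```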

@sanjaysrikakulam (Author)

Hi @lmdu,

I tried the apply_async technique from the pyfastx documentation, i.e. re-creating the fasta/fastq object inside the worker process. There is no memory overhead, but only one or two of the 64 initiated processes actually run and the rest go to a sleep state, so this won't really work for multiprocessing. I look forward to pyfastx v0.9.0.

Thank you for your support and quick response!
