Is process hanging or will it just take a while? #31

Open
nmb85 opened this issue Jan 12, 2023 · 5 comments
Labels
help wanted Extra attention is needed

Comments


nmb85 commented Jan 12, 2023

I love the concept for MetaCoAG; what a great idea! I'm trying to use your awesome tool to bin contigs from a 30 Gbp MEGAHIT metagenomic assembly with 35 million contigs and it has been paused (or working?) at the step after initially assigning contigs with marker genes to bins for a little more than 48 hours. There is no sign that memory or CPU usage has changed in that time and there haven't been any messages printed to the log file or stdout/stderr. The only files in the output directory are the tetranucleotide frequency pickle file and the log file. The log file is attached below.

Is my process hanging, does it require more memory (current usage is steady at 65% of max: 175 GB/250 GB), or is it working? If it is in fact working, what would you expect the time to complete this step to be and is there a flag that I missed for parallelizing this step?

Thanks for any help! Would love to see how MetaCoAG performs!
metacoag.log
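One way to tell a hang from slow progress, sketched here under stated assumptions: Python's standard `faulthandler` module can write the current Python stack to a file whenever the process receives a signal. The helper name `enable_stack_dump` and the file name `stack_dump.log` are hypothetical and not part of MetaCoAG. The helper would have to be called near the start of the script, before the long-running step, and it only works on Unix-like systems.

```python
import faulthandler
import signal


def enable_stack_dump(path="stack_dump.log"):
    """Dump the Python stack of all threads to `path` on SIGUSR1.

    Hypothetical helper: returns the open file object, which must stay
    referenced for as long as the handler is registered.
    """
    dump_file = open(path, "a")
    # faulthandler writes directly to the file descriptor from the
    # signal handler, so the dump appears even if Python code is busy.
    faulthandler.register(signal.SIGUSR1, file=dump_file, all_threads=True)
    return dump_file
```

With this in place, running `kill -USR1 <pid>` from another shell a few minutes apart and comparing the dumps gives a rough answer: if the reported line or loop position changes between dumps, the process is still making progress.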

Collaborator

Vini2 commented Jan 16, 2023

Hello @nmb85,

Thank you very much for your interest in MetaCoAG!

I haven't tested MetaCoAG on datasets having more than a couple of hundred thousand contigs. I don't know how long it will take to complete (maybe a couple of days?).

If possible, can you share the data you are testing on with me? I would like to give it a try. 35 million contigs sounds very interesting!

Thank you!

@Vini2 Vini2 added the help wanted Extra attention is needed label Jan 16, 2023
Author

nmb85 commented Jan 16, 2023

Thank you, @Vini2! I will reach out to you via your contact form on your professional website in order to share the data. The data is proprietary and entails tens of GBs, so I cannot post it on a public link. In the meantime, this was my attempt at parallelizing the get_non_isolated function in metacoag_utils/graph_utils.py:

I imported multiprocessing as mp and broke up the get_non_isolated function into two functions, abstracting away the outermost for loop as a function to run in parallel with mp.

import multiprocessing as mp
 
def get_connected_components(i, assembly_graph, binned_contigs):
    # Worker: return the connected component of node i if i is a binned
    # contig, otherwise an empty list.
    non_isolated = []
    # non_isolated is local to each call and always empty at this point,
    # so only the binned_contigs test filters anything.
    if i in binned_contigs:
        component = [i]
        length = len(component)
        # Seed the component with the direct neighbours of i.
        neighbours = assembly_graph.neighbors(i, mode="ALL")
        for neighbour in neighbours:
            if neighbour not in component:
                component.append(neighbour)
        component = list(set(component))
        # Keep expanding until a full pass adds no new nodes.
        while length != len(component):
            length = len(component)
            for j in component:
                neighbours = assembly_graph.neighbors(j, mode="ALL")
                for neighbour in neighbours:
                    if neighbour not in component:
                        component.append(neighbour)
        # Keep the component only if it contains a binned contig.
        labelled = False
        for j in component:
            if j in binned_contigs:
                labelled = True
                break
        if labelled:
            for j in component:
                if j not in non_isolated:
                    non_isolated.append(j)
    return non_isolated


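def get_non_isolated(node_count, assembly_graph, binned_contigs, nthreads):
    with mp.Pool(processes=nthreads) as pool:
        # starmap returns one list per node, i.e. a list of lists.
        components = pool.starmap(
            get_connected_components,
            [(i, assembly_graph, binned_contigs) for i in range(node_count)],
        )
    # Flatten and de-duplicate into a flat list of unique node ids.
    non_isolated = list({j for component in components for j in component})
    return non_isolated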

Then in metacoag, I changed the get_non_isolated function call on lines 653-657 to pass the nthreads variable to the new parallelized get_non_isolated function:

non_isolated = graph_utils.get_non_isolated(
    node_count=node_count,
    assembly_graph=assembly_graph,
    binned_contigs=binned_contigs,
    nthreads=nthreads,
)

The changes ran to completion without error messages on a toy dataset, but I couldn't observe multiprocessing via htop.
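In the parallel version above, each call starts with an empty non_isolated list, so every binned contig re-walks its whole component, and each `in` test on a Python list is linear in the list length. A possible alternative to parallelising is to do less work: the sketch below visits each node at most once using a breadth-first search and sets. It is only an illustration under assumptions. The function name `non_isolated_single_pass` is hypothetical, and `adjacency` is assumed to be a dict or list that maps each node id to its neighbour ids. With python-igraph, `assembly_graph.get_adjlist()` should produce such a list, though that is untested here.

```python
from collections import deque


def non_isolated_single_pass(adjacency, binned_contigs):
    """Return all nodes whose connected component contains a binned contig.

    adjacency: dict or list indexed by node id, giving neighbour ids
    (a hypothetical stand-in for assembly_graph.neighbors).
    """
    binned = set(binned_contigs)
    visited = set()
    non_isolated = []
    # Searching only from binned nodes means every component we reach
    # is labelled by construction.
    for start in binned:
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            non_isolated.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    queue.append(neighbour)
    return non_isolated
```

Because each node and edge is handled once, the running time grows roughly linearly with the size of the graph rather than with the square of the component sizes. The order of the returned ids is not guaranteed.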

Author

nmb85 commented Jan 21, 2023

Brief update: I allowed metacoag to run for 6 days, but it was still stuck at running the get_non_isolated function. No errors, just very busy running that function. I believe parallelizing this step and perhaps downstream steps would help here.

Author

nmb85 commented Jan 27, 2023

Hi @Vini2, you're probably swamped - should we move ahead with trying to parallelize this on our end? If so, any hints or suggestions based on our attempt above?

Collaborator

Vini2 commented May 15, 2023

Hi @nmb85,

I'm so sorry I couldn't get back to you regarding this issue. Is everything sorted? Were you able to parallelise the step? I tested your suggested method and it works fine.

Vini2 added a commit that referenced this issue May 15, 2023