
NUCmer job generation for large jobs slows down rapidly. #306

Open · widdowquinn opened this issue on Jul 7, 2021 · 6 comments
Labels: bug, enhancement

widdowquinn (Owner) commented on Jul 7, 2021

Summary:

For large comparison runs (e.g. 2,500 input genomes) the process of generating NUCmer jobs is slow, and the rate of job creation degrades rapidly after about 80k command lines have been created.

I think this may be due to the structure of `generate_joblist()` in `subcmd_anim.py`.

Description:

In generate_joblist(), an empty list (joblist) is created. This is populated with ComparisonJob objects in a for loop.

If we're not in recovery mode, the only problematic-looking task is constructing a `ComparisonJob` and appending it to `joblist`. Appending to a list is meant to be amortised O(1) in Python > 3.1 (there used to be a bug that caused a quadratic slowdown here, but this was apparently resolved: https://bugs.python.org/issue4074).

StackOverflow suggests the problem may be resolved by turning off garbage collection:

> As we're adding objects to the list, garbage collection checks the entire list on each append. Wrapping the appends in gc.disable()/gc.enable() might fix it.
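
A minimal sketch of that workaround, assuming a stand-in `ComparisonJob` and an iterable of genome pairs (both hypothetical here; the real class and loop live in `subcmd_anim.py`):

```python
import gc
from collections import namedtuple

# Hypothetical stand-in for pyani's ComparisonJob class; the real one
# takes different arguments.
ComparisonJob = namedtuple("ComparisonJob", "query subject")


def generate_joblist(comparisons):
    """Build the job list with generational GC paused during the appends."""
    joblist = []
    gc.disable()  # suspend GC while creating many container objects
    try:
        for query, subject in comparisons:
            joblist.append(ComparisonJob(query, subject))
    finally:
        gc.enable()  # always re-enable GC, even if job construction fails
    return joblist
```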

widdowquinn added the bug and enhancement labels and self-assigned this issue on Jul 7, 2021
baileythegreen (Contributor) commented

We have also discussed making the `joblist` container a set, rather than a list (#297).

I don't know whether Python garbage collection would behave the same way with a set, or whether this issue is specific to the list `append` method.

widdowquinn (Owner, Author) commented on Jul 7, 2021

I don't know if it would, either. It's hard to beat amortised O(1) appends for efficiency, so I don't think we'd gain anything from switching to a set at this point (though a set may still be useful elsewhere).
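
For scale, a quick micro-benchmark (not from the thread; purely illustrative) comparing the two container operations on 100k small tuples:

```python
import timeit

# Compare the amortised cost of list.append vs set.add for 100k small
# tuples (illustrative only; not pyani code).
setup = "items = [(i, i + 1) for i in range(100_000)]"

list_time = timeit.timeit(
    "l = []\nfor item in items:\n    l.append(item)", setup=setup, number=10
)
set_time = timeit.timeit(
    "s = set()\nfor item in items:\n    s.add(item)", setup=setup, number=10
)
print(f"list.append: {list_time:.3f}s  set.add: {set_time:.3f}s")
```

Note that a set would also require `ComparisonJob` to be hashable, which a list does not.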

widdowquinn (Owner, Author) commented

I looked at `pyani_orm.py` for comparison: `filter_existing_comparisons()` appears to append a similar number of tuples to a list without losing efficiency.

widdowquinn (Owner, Author) commented on Jul 7, 2021

Wrapping the appends in gc.disable()/gc.enable() didn't appear to have an effect, and neither did substituting a collections.deque for the list.

Replacing the list with a set makes things substantially slower.

Regardless of the container, the rate of appending drops by about 50% between 40k items and 80k items.

Batching the outputs may be the most usefully efficient option.
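
A minimal sketch of how that append rate can be measured (hypothetical instrumentation; `make_job` stands in for `ComparisonJob` construction):

```python
import time


def timed_appends(n_items, batch=10_000, make_job=tuple):
    """Print the append rate for each `batch` of items.

    Illustrative instrumentation only; `make_job` is a stand-in for
    ComparisonJob construction in generate_joblist().
    """
    joblist = []
    start = time.perf_counter()
    for idx in range(n_items):
        joblist.append(make_job((idx, idx + 1)))
        if (idx + 1) % batch == 0:
            elapsed = time.perf_counter() - start
            print(f"{idx + 1:>8} items: {batch / elapsed:,.0f} appends/s")
            start = time.perf_counter()


timed_appends(100_000)  # watch whether the rate falls as the list grows
```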

widdowquinn (Owner, Author) commented

Batching gave an average speed-up of about 100x (very fast!) until it fell over at 400k jobs. I don't yet understand why that happened.

widdowquinn (Owner, Author) commented on Jul 8, 2021

With batching, it appears that the script enters a D (uninterruptible sleep, i.e. blocked on I/O) state when `joblist` in `generate_joblist()` reaches ≈450,000 elements.

I had the function batch jobs into lists of 10k and `.append()` those lists to `joblist`; a sketch of that structure is below. Commenting out the `.append()` allowed the script to avoid the D state and the slowdown, but means that we do not collect or run all of the jobs.
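
The batched structure, as a hedged sketch (the names and the 10k batch size mirror the description above; `ComparisonJob` is again a hypothetical stand-in):

```python
from collections import namedtuple

# Hypothetical stand-in for pyani's ComparisonJob class.
ComparisonJob = namedtuple("ComparisonJob", "query subject")

BATCH_SIZE = 10_000  # size of each inner job list


def generate_joblist_batched(comparisons):
    """Collect ComparisonJobs in 10k-element batches appended to joblist."""
    joblist = []  # outer list of completed batches
    batch = []
    for query, subject in comparisons:
        batch.append(ComparisonJob(query, subject))
        if len(batch) == BATCH_SIZE:
            # It was this outer append that coincided with the D state
            # once joblist held ~45 batches (~450k jobs).
            joblist.append(batch)
            batch = []
    if batch:  # keep any final partial batch
        joblist.append(batch)
    return joblist
```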

It may be that, for scalability, we need to restructure how jobs are passed around the code; this may already be addressed by the snakemake overhaul of job management.

baileythegreen added this to the 0.3.1 milestone on May 6, 2022