
NUCmer job generation for large jobs slows down rapidly. #306

Open · widdowquinn opened this issue on Jul 7, 2021 · 6 comments
Labels: bug, enhancement

widdowquinn (Owner) commented on Jul 7, 2021

Summary:

For large comparison runs (e.g. 2,500 input genomes) the process of generating NUCmer jobs is slow, and the rate of job creation degrades rapidly after about 80k command lines have been created.

I think this may be due to the structure of `generate_joblist()` in `subcmd_anim.py`.

Description:

In generate_joblist(), an empty list (joblist) is created. This is populated with ComparisonJob objects in a for loop.

If we're not in recovery mode, the only problematic-looking task is constructing a `ComparisonJob` and appending it to `joblist`. Appending to a list is meant to be amortised O(1) in Python > 3.1 (there used to be a bug that caused a quadratic slowdown here, but this was apparently resolved: https://bugs.python.org/issue4074).

StackOverflow suggests the problem may be resolved by turning off garbage collection:

> As we're adding objects to the list, garbage collection checks the entire list on each append. Wrapping the appends in gc.disable()/gc.enable() might fix it.
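
A minimal sketch of that workaround, assuming a stand-in `ComparisonJob` and an iterable of genome pairs (both hypothetical here; the real class and loop live in `subcmd_anim.py`):

```python
import gc
from collections import namedtuple

# Hypothetical stand-in for pyani's ComparisonJob class; the real one
# takes different arguments.
ComparisonJob = namedtuple("ComparisonJob", "query subject")


def generate_joblist(comparisons):
    """Build the job list with generational GC paused during the appends."""
    joblist = []
    gc.disable()  # suspend GC while creating many container objects
    try:
        for query, subject in comparisons:
            joblist.append(ComparisonJob(query, subject))
    finally:
        gc.enable()  # always re-enable GC, even if job construction fails
    return joblist
```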

widdowquinn added the bug and enhancement labels and self-assigned this issue on Jul 7, 2021
baileythegreen (Contributor) commented

We have also discussed making the `joblist` container a set, rather than a list (#297).

I don't know whether Python garbage collection would behave the same way with a set, or whether this issue is specific to the list `append` method.

widdowquinn (Owner, Author) commented on Jul 7, 2021

I don't know if it would, either. It's hard to beat amortised O(1) appends for efficiency, so I don't think we'd gain anything from switching to a set at this point (though a set may still be useful elsewhere).
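
For scale, a quick micro-benchmark (not from the thread; purely illustrative) comparing the two container operations on 100k small tuples:

```python
import timeit

# Compare the amortised cost of list.append vs set.add for 100k small
# tuples (illustrative only; not pyani code).
setup = "items = [(i, i + 1) for i in range(100_000)]"

list_time = timeit.timeit(
    "l = []\nfor item in items:\n    l.append(item)", setup=setup, number=10
)
set_time = timeit.timeit(
    "s = set()\nfor item in items:\n    s.add(item)", setup=setup, number=10
)
print(f"list.append: {list_time:.3f}s  set.add: {set_time:.3f}s")
```

Note that a set would also require `ComparisonJob` to be hashable, which a list does not.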

widdowquinn (Owner, Author) commented

I looked at `pyani_orm.py` for comparison: `filter_existing_comparisons()` appears to append a similar number of tuples to a list without losing efficiency.

widdowquinn (Owner, Author) commented on Jul 7, 2021

Wrapping the appends in gc.disable()/gc.enable() didn't appear to have an effect, and neither did substituting a collections.deque for the list.

Replacing the list with a set makes things substantially slower.

Regardless of the container, the rate of appending drops by about 50% between 40k items and 80k items.

Batching the outputs may be the most usefully efficient option.
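
A minimal sketch of how that append rate can be measured (hypothetical instrumentation; `make_job` stands in for `ComparisonJob` construction):

```python
import time


def timed_appends(n_items, batch=10_000, make_job=tuple):
    """Print the append rate for each `batch` of items.

    Illustrative instrumentation only; `make_job` is a stand-in for
    ComparisonJob construction in generate_joblist().
    """
    joblist = []
    start = time.perf_counter()
    for idx in range(n_items):
        joblist.append(make_job((idx, idx + 1)))
        if (idx + 1) % batch == 0:
            elapsed = time.perf_counter() - start
            print(f"{idx + 1:>8} items: {batch / elapsed:,.0f} appends/s")
            start = time.perf_counter()


timed_appends(100_000)  # watch whether the rate falls as the list grows
```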

widdowquinn (Owner, Author) commented

Batching gave an average speed-up of about 100x (very fast!) until it fell over at 400k jobs. I don't yet understand why that happened.

widdowquinn (Owner, Author) commented on Jul 8, 2021

With batching, it appears that the script enters a D (uninterruptible sleep, i.e. blocked on I/O) state when `joblist` in `generate_joblist()` reaches ≈450,000 elements.

I had the function batch jobs into lists of 10k and `.append()` those lists to `joblist`; a sketch of that structure is below. Commenting out the `.append()` allowed the script to avoid the D state and the slowdown, but means that we do not collect or run all of the jobs.
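
The batched structure, as a hedged sketch (the names and the 10k batch size mirror the description above; `ComparisonJob` is again a hypothetical stand-in):

```python
from collections import namedtuple

# Hypothetical stand-in for pyani's ComparisonJob class.
ComparisonJob = namedtuple("ComparisonJob", "query subject")

BATCH_SIZE = 10_000  # size of each inner job list


def generate_joblist_batched(comparisons):
    """Collect ComparisonJobs in 10k-element batches appended to joblist."""
    joblist = []  # outer list of completed batches
    batch = []
    for query, subject in comparisons:
        batch.append(ComparisonJob(query, subject))
        if len(batch) == BATCH_SIZE:
            # It was this outer append that coincided with the D state
            # once joblist held ~45 batches (~450k jobs).
            joblist.append(batch)
            batch = []
    if batch:  # keep any final partial batch
        joblist.append(batch)
    return joblist
```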

It may be that, for scalability, we need to restructure how jobs are passed around the code; this may already be addressed by the snakemake overhaul of job management.

baileythegreen added this to the 0.3.1 milestone on May 6, 2022