c_assign_taxonomy2 constant sized blocks of working memory #1366

Open · wbazant wants to merge 2 commits into base: master

Conversation

@wbazant (Contributor) commented on Jun 28, 2021:

This change reduces the amount of memory used by assignTaxonomy by allocating working arrays of a constant size before dispatching parallel tasks.

Rationale

I was reviewing dada2 wrappers for MicrobiomeDB, with the hope of moving on to a better way of running dada2 at scale.

I ended up with code that runs assignTaxonomy and addSpecies in chunks (https://github.com/VEuPathDB/DJob/blob/4b0a3a512e/DistribJobTasks/bin/dada2/mergeAsvsAndAssignToOtus.R#32), which was necessary to process ASVs from a few thousand samples on a memory budget of 6GB.

I thought I might be able to do this in dada2 itself, so I investigated what the limiting step was and noticed that most memory was spent in C_assign_taxonomy2. I've modified it to use constant-sized blocks of working memory before parallelisation dispatch, which seems to work well enough to make the above workaround unnecessary.
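
Roughly, the pattern looks like this (a minimal C++ sketch of the idea, not the actual patch; process_block, nboot, the flat score layout, and the block size value shown here are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal sketch of the constant-working-memory idea, NOT the actual dada2
// patch: NSEQ is the constant block size; process_block stands in for the
// parallelised per-block taxonomy assignment.
static const std::size_t NSEQ = 8192;

static void process_block(std::size_t start, std::size_t n, double* boot_scores) {
  // Placeholder: dispatch parallel tasks for sequences [start, start + n)
  // and synchronize before returning (one extra sync point per block).
  (void)start; (void)n; (void)boot_scores;
}

void assign_in_blocks(std::size_t nseq_total, std::size_t nboot) {
  // Working buffers are sized for one block rather than for the whole
  // input, so peak memory no longer grows with the number of sequences.
  std::vector<double> boot_scores(NSEQ * nboot);

  for (std::size_t start = 0; start < nseq_total; start += NSEQ) {
    std::size_t n = std::min(NSEQ, nseq_total - start);

    // calloc zeroes only at allocation time; a buffer reused across blocks
    // has to be re-zeroed explicitly in a loop (cf. the commits below).
    std::fill(boot_scores.begin(), boot_scores.end(), 0.0);

    process_block(start, n, boot_scores.data());
  }
}
```

The key point is that boot_scores is sized by the block constant, not by the total number of input sequences.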

Testing results

I profiled the method with a repeated sequence of 100 E. coli bases and a small reference fasta, submitted as a job with 6GB of memory. Without the change I can process 300k sequences; with the change, about 5M, at which point something else becomes the bottleneck.

Currently (dada2 master branch):

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 10000 | 141.3 | 3.78 | 88.4 | 3.42 |
| 20000 | 255.7 | 6.90 | 177.0 | 6.32 |
| 100000 | 1163.0 | 32.36 | 880.1 | 29.30 |
| 200000 | 2290.5 | 63.82 | 1760.1 | 58.28 |
| 300000 | 3432.2 | 96.28 | 2640.2 | 87.80 |
| 400000 | dies | | | |

With this change:

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 10000 | 603.2 | 3.96 | 554.3 | 3.74 |
| 20000 | 629.4 | 6.30 | 559.3 | 5.98 |
| 100000 | 863.9 | 25.7 | 590.8 | 23.9 |
| 200000 | 1142.1 | 55.58 | 631.6 | 52.42 |
| 1000000 | 3459.1 | 246.20 | 958.2 | 232.20 |
| 2000000 | 6509.4 | 528.64 | 1366.3 | 500.96 |
| 5000000 | 15744.8 | 1530.00 | 2590.9 | 1414.58 |
| 8000000 | dies | | | |

The above run times are from two different jobs on a cluster, so they're not directly comparable, but I think the run time didn't change much either way. I don't know whether there are any benefits to a smaller NSEQ, or whether something about this change could help run time. As for potential slowdowns, I had to zero out one array in a for loop instead of using calloc, and there's an extra sync point every NSEQ / INTERRUPT_BLOCK_SIZE parallel tasks, so NSEQ was chosen large enough that this doesn't matter.
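
For a rough sense of the extra synchronisation cost (illustrative constants, not values quoted from the patch):

```cpp
#include <cstdio>

int main() {
  // Illustrative constants only; the real values live in the patch.
  const long NSEQ = 8192;                // sequences per working-memory block
  const long INTERRUPT_BLOCK_SIZE = 64;  // sequences per parallel task
  const long nseq_total = 1000000;       // e.g. the 1M-sequence benchmark above

  long tasks_between_syncs = NSEQ / INTERRUPT_BLOCK_SIZE;  // 128 tasks per block
  long extra_syncs = (nseq_total + NSEQ - 1) / NSEQ;       // 123 sync points total
  std::printf("%ld tasks between syncs, %ld extra sync points\n",
              tasks_between_syncs, extra_syncs);
  return 0;
}
```

A hundred-odd extra sync points over a multi-minute run should be negligible, which matches the run times above.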

I ran some basic checks on the outputs of the changed function in my one test case, and I believe it's still correct.

Would you like to add this change to the code? Let me know if you'd like more testing, style changes, etc.

Commits:

- Can't init bootstrap scores with calloc: zero out the entries instead
- Block size big enough to not slow down the execution too much

@benjjneb (Owner) commented:

This looks intriguing, and memory usage of assignTaxonomy is a real bottleneck, so this is a very welcome potential contribution.

No problems with the style.

Because we don't have proper unit tests, this will need a bit more testing before we can consider pulling it in. The two things not covered right now by your test case are longer sequences, and situations where tryRC is necessary because there is a mixture of orientations.

Could you try running your perf/output comparisons on a set of full-length 16S sequences, and on a set of sequences that are split between normal and reverse-complement orientations?

Also, could you post the output of dada2:::tax.check("path/to/[dada2 formatted training fasta]") using the new code? That does a basic sanity check of consistency between the assignments and the known taxa for a set of 16S sequences.

Commit: For ASVs of 500 bases, the memory usage gets too high when processing tens of thousands
@wbazant (Contributor, Author) commented on Jul 1, 2021:

After looking at longer ASVs I realised that the memory limit is frequently hit at sequence counts below the previous NSEQ (so the blocking wouldn't kick in), and I've lowered NSEQ to 8192.

Longer sequences

This is with sequences of 1200 base pairs (the whole sequence) and 500bp (maybe a bit more realistic as an upper end of ASV length?).

1200bp, current code:

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 5000 | 629.9 | 5.12 | 585.8 | 4.88 |
| 10000 | 1224.9 | 9.64 | 1171.4 | 9.20 |
| 15000 | 1834.0 | 14.86 | 1757.1 | 13.84 |
| 20000 | 2436.6 | 19.50 | 2342.8 | 18.32 |
| 25000 | dies | | | |

500bp, current code:

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 10000 | 526.1 | 5.64 | 469.5 | 5.34 |
| 20000 | 1025.7 | 10.66 | 939.0 | 10.18 |
| 40000 | 2025.3 | 20.38 | 1877.9 | 19.08 |
| 60000 | dies | | | |

1200bp, NSEQ=8192:

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 5000 | 1012.9 | 7.48 | 958.3 | 7.06 |
| 10000 | 1022.6 | 11.88 | 960.3 | 11.22 |
| 15000 | 1048.4 | 16.44 | 962.4 | 15.06 |
| 20000 | 1061.2 | 20.74 | 964.4 | 19.34 |
| 25000 | 1094.0 | 24.82 | 966.5 | 22.86 |

500bp, NSEQ=8192:

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 10000 | 445.4 | 6.88 | 385.3 | 6.46 |
| 20000 | 469.0 | 11.60 | 389.4 | 10.84 |
| 40000 | 539.3 | 22.34 | 397.6 | 20.36 |
| 60000 | 618.6 | 32.00 | 405.7 | 29.46 |
| 80000 | 705.5 | 43.14 | 413.9 | 39.40 |
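
As a consistency check on the plateaus above: Mem (.Call) levels off around 965MB at 1200bp and around 406MB at 500bp, a ratio of roughly 2.4 ≈ 1200/500, which is what you'd expect if the per-block working memory scales with NSEQ times sequence length.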

RC

The above perf numbers are already with tryRC=TRUE. I also ran the output check on a reverse complement of my sequence, and verified that I get a match to the right genus with tryRC = TRUE and no match to the right genus with tryRC = FALSE.

Tax.check

This is the output with SILVA 138 and a test file I downloaded from https://drive5.com/taxxi/benchmark/testfa/sp_ten_16s.100:
tax-check.txt
