c_assign_taxonomy2 constant sized blocks of working memory #1366

Open · wbazant wants to merge 2 commits into base: master

Conversation

@wbazant (Contributor) commented on Jun 28, 2021:

This change reduces the amount of memory used by assignTaxonomy by allocating working arrays of a constant size before dispatching parallel tasks.

Rationale

I was reviewing dada2 wrappers for MicrobiomeDB, with the hope of moving on to a better way of running dada2 at scale.

I ended up with code that runs assignTaxonomy and addSpecies in chunks (https://github.com/VEuPathDB/DJob/blob/4b0a3a512e/DistribJobTasks/bin/dada2/mergeAsvsAndAssignToOtus.R#32), which was necessary to process ASVs from a few thousand samples on a memory budget of 6GB.

I thought I might be able to do this in dada2 itself, so I investigated what the limiting step was and noticed that most memory was spent in C_assign_taxonomy2. I've modified it to use constant-sized blocks of working memory before parallelisation dispatch, which seems to work well enough to make the above workaround unnecessary.
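
Roughly, the pattern looks like this (a minimal C++ sketch of the idea, not the actual patch; process_block, nboot, the flat score layout, and the block size value shown here are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Minimal sketch of the constant-working-memory idea, NOT the actual dada2
// patch: NSEQ is the constant block size; process_block stands in for the
// parallelised per-block taxonomy assignment.
static const std::size_t NSEQ = 8192;

static void process_block(std::size_t start, std::size_t n, double* boot_scores) {
  // Placeholder: dispatch parallel tasks for sequences [start, start + n)
  // and synchronize before returning (one extra sync point per block).
  (void)start; (void)n; (void)boot_scores;
}

void assign_in_blocks(std::size_t nseq_total, std::size_t nboot) {
  // Working buffers are sized for one block rather than for the whole
  // input, so peak memory no longer grows with the number of sequences.
  std::vector<double> boot_scores(NSEQ * nboot);

  for (std::size_t start = 0; start < nseq_total; start += NSEQ) {
    std::size_t n = std::min(NSEQ, nseq_total - start);

    // calloc zeroes only at allocation time; a buffer reused across blocks
    // has to be re-zeroed explicitly in a loop (cf. the commits below).
    std::fill(boot_scores.begin(), boot_scores.end(), 0.0);

    process_block(start, n, boot_scores.data());
  }
}
```

The key point is that boot_scores is sized by the block constant, not by the total number of input sequences.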

Testing results

I profiled the method with a repeated sequence of 100 E. coli bases and a small reference fasta, submitted as a job with 6GB of memory. Without the change I can process 300k sequences; with the change, about 5M, at which point something else becomes the bottleneck.

Currently (dada2 master branch):

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 10000 | 141.3 | 3.78 | 88.4 | 3.42 |
| 20000 | 255.7 | 6.90 | 177.0 | 6.32 |
| 100000 | 1163.0 | 32.36 | 880.1 | 29.30 |
| 200000 | 2290.5 | 63.82 | 1760.1 | 58.28 |
| 300000 | 3432.2 | 96.28 | 2640.2 | 87.80 |
| 400000 | dies | | | |

With this change:

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 10000 | 603.2 | 3.96 | 554.3 | 3.74 |
| 20000 | 629.4 | 6.30 | 559.3 | 5.98 |
| 100000 | 863.9 | 25.7 | 590.8 | 23.9 |
| 200000 | 1142.1 | 55.58 | 631.6 | 52.42 |
| 1000000 | 3459.1 | 246.20 | 958.2 | 232.20 |
| 2000000 | 6509.4 | 528.64 | 1366.3 | 500.96 |
| 5000000 | 15744.8 | 1530.00 | 2590.9 | 1414.58 |
| 8000000 | dies | | | |

The above run times are from two different jobs on a cluster, so they're not directly comparable, but I think the run time didn't change much either way. I don't know whether there are any benefits to a smaller NSEQ, or whether something about this change could help run time. As for potential slowdowns, I had to zero out one array in a for loop instead of using calloc, and there's an extra sync point every NSEQ / INTERRUPT_BLOCK_SIZE parallel tasks, so NSEQ was chosen large enough that this doesn't matter.
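
For a rough sense of the extra synchronisation cost (illustrative constants, not values quoted from the patch):

```cpp
#include <cstdio>

int main() {
  // Illustrative constants only; the real values live in the patch.
  const long NSEQ = 8192;                // sequences per working-memory block
  const long INTERRUPT_BLOCK_SIZE = 64;  // sequences per parallel task
  const long nseq_total = 1000000;       // e.g. the 1M-sequence benchmark above

  long tasks_between_syncs = NSEQ / INTERRUPT_BLOCK_SIZE;  // 128 tasks per block
  long extra_syncs = (nseq_total + NSEQ - 1) / NSEQ;       // 123 sync points total
  std::printf("%ld tasks between syncs, %ld extra sync points\n",
              tasks_between_syncs, extra_syncs);
  return 0;
}
```

A hundred-odd extra sync points over a multi-minute run should be negligible, which matches the run times above.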

I ran some basic checks on the outputs of the changed function in my one test case, and I believe it's still correct.

Would you like to add this change to the code? Let me know if you'd like more testing, style changes, etc.

Commits:

- Can't init bootstrap scores with calloc: zero out the entries instead
- Block size big enough to not slow down the execution too much

@benjjneb (Owner) commented:

This looks intriguing, and memory usage of assignTaxonomy is a real bottleneck, so this is a very welcome potential contribution.

No problems with the style.

Because we don't have proper unit tests, this will need a bit more testing before we can consider pulling it in. The two things not covered right now by your test case are longer sequences, and situations where tryRC is necessary because there is a mixture of orientations.

Could you try running your perf/output comparisons on a set of full-length 16S sequences, and on a set of sequences that are split between normal and reverse-complement orientations?

Also, could you post the output of dada2:::tax.check("path/to/[dada2 formatted training fasta]") using the new code? That does a basic sanity check of consistency between the assignments and the known taxa for a set of 16S sequences.

Commit: For ASVs of 500 bases, the memory usage gets too high when processing tens of thousands
@wbazant (Contributor, Author) commented on Jul 1, 2021:

After looking at longer ASVs I realised that the memory limit is frequently hit at sequence counts below the previous NSEQ (so the blocking wouldn't kick in), and I've lowered NSEQ to 8192.

Longer sequences

This is with sequences of 1200 base pairs (the whole sequence) and 500bp (maybe a bit more realistic as an upper end of ASV length?).

1200bp, current code:

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 5000 | 629.9 | 5.12 | 585.8 | 4.88 |
| 10000 | 1224.9 | 9.64 | 1171.4 | 9.20 |
| 15000 | 1834.0 | 14.86 | 1757.1 | 13.84 |
| 20000 | 2436.6 | 19.50 | 2342.8 | 18.32 |
| 25000 | dies | | | |

500bp, current code:

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 10000 | 526.1 | 5.64 | 469.5 | 5.34 |
| 20000 | 1025.7 | 10.66 | 939.0 | 10.18 |
| 40000 | 2025.3 | 20.38 | 1877.9 | 19.08 |
| 60000 | dies | | | |

1200bp, NSEQ=8192:

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 5000 | 1012.9 | 7.48 | 958.3 | 7.06 |
| 10000 | 1022.6 | 11.88 | 960.3 | 11.22 |
| 15000 | 1048.4 | 16.44 | 962.4 | 15.06 |
| 20000 | 1061.2 | 20.74 | 964.4 | 19.34 |
| 25000 | 1094.0 | 24.82 | 966.5 | 22.86 |

500bp, NSEQ=8192:

| Num seqs | Mem total (MB) | Time total (s) | Mem (.Call) (MB) | Time (.Call) (s) |
|---:|---:|---:|---:|---:|
| 10000 | 445.4 | 6.88 | 385.3 | 6.46 |
| 20000 | 469.0 | 11.60 | 389.4 | 10.84 |
| 40000 | 539.3 | 22.34 | 397.6 | 20.36 |
| 60000 | 618.6 | 32.00 | 405.7 | 29.46 |
| 80000 | 705.5 | 43.14 | 413.9 | 39.40 |
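
As a consistency check on the plateaus above: Mem (.Call) levels off around 965MB at 1200bp and around 406MB at 500bp, a ratio of roughly 2.4 ≈ 1200/500, which is what you'd expect if the per-block working memory scales with NSEQ times sequence length.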

RC

The above perf numbers are already with tryRC=TRUE. I also ran the output check on a reverse complement of my sequence, and verified that I get a match to the right genus with tryRC = TRUE and no match to the right genus with tryRC = FALSE.

Tax.check

This is the output with SILVA 138 and a test file I downloaded from https://drive5.com/taxxi/benchmark/testfa/sp_ten_16s.100:
tax-check.txt
