Performance with large datasets #9

Open
lordkev opened this issue May 6, 2020 · 2 comments
@lordkev

lordkev commented May 6, 2020

Hi Alex,

As I mentioned in nf-core/eager#209 (comment), I'm running into some performance issues with a BAM file containing ~786M merged paired-end reads. I first had to bump the heap size, as with the default settings it ran out of memory fairly quickly. After raising the max heap size to 48G it has now been running for about 1.5 hours and has only processed around 3M reads so far, and it has been sitting at that 3M mark for almost 30 minutes. Is there anything I might be able to do to increase throughput?
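
For reference, the heap bump was just the standard JVM flag in front of the jar (a minimal sketch; the jar name and DeDup arguments below are placeholders for whatever invocation you use):

```bash
# Raise the JVM max heap to 48 GB before DeDup's own arguments.
# "DeDup.jar" and the trailing options are placeholders, not verified flags.
java -Xmx48g -jar DeDup.jar [dedup options...]
```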

Here are the flagstats for the BAM:

786466969 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4730606 + 0 supplementary
0 + 0 duplicates
609772993 + 0 mapped (77.53% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
@apeltzer
Owner

apeltzer commented May 7, 2020

Hi Kevin! Thanks for opening the issue to discuss this. Without major modifications, I think bumping the heap size is the only thing we/you can do here. One possibility would be to split your mapped reads per chromosome (assuming you have multiple chromosomes in your reference genome), run DeDup on each split individually, and then merge the resulting BAM files back together in a "divide and conquer / map reduce" approach.
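
Roughly, the idea could look something like this (just a sketch, assuming a coordinate-sorted BAM and samtools on the PATH; the DeDup jar name, flags, and output naming are placeholders):

```bash
#!/usr/bin/env bash
set -euo pipefail

in=sample.bam
samtools index "$in"

# Split by chromosome and deduplicate each slice independently
# (these iterations could also be run in parallel on a cluster).
for chrom in $(samtools idxstats "$in" | cut -f1 | grep -v '^\*$'); do
    samtools view -b "$in" "$chrom" > "part.${chrom}.bam"
    # Placeholder DeDup call; substitute the actual jar and flags you use:
    java -Xmx8g -jar DeDup.jar -i "part.${chrom}.bam" -o "dedup_${chrom}/"
done

# Merge the deduplicated slices back into a single BAM
# (the *_rmdup.bam glob depends on DeDup's output naming).
samtools merge -f sample.dedup.bam dedup_*/*_rmdup.bam
```

Note that unmapped reads (the `*` record in idxstats) are not picked up by the per-chromosome loop, so they would need to be carried along separately if you want them in the merged BAM.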

I thought about this here: nf-core/eager#30 for nf-core/eager, so maybe this is a good use case now to really make it happen ;-)

@lordkev
Author

lordkev commented May 7, 2020

Yes, I think that would work great, for my use case at least. I work pretty much exclusively on human whole-genome data, so the parallelization benefits of splitting per chromosome would be huge.

I do wonder if there wasn't something else strange going on with particular files. I'd have to go back and look, but I had one file in the neighborhood of 400-500M reads that completed within a reasonable number of hours after bumping the heap size up. However, this other file with ~700M reads was going so slowly that it looked like it might take days. It could have been that one had a much higher duplicate rate than the other, though.

In any case, I suspect parallelizing by chromosome would make it a non-issue.
