Performance with large datasets #9

Open
lordkev opened this issue May 6, 2020 · 2 comments
@lordkev

lordkev commented May 6, 2020

Hi Alex,

As I mentioned in nf-core/eager#209 (comment), I'm running into some performance issues with a BAM file containing ~786M merged paired-end reads. I first had to bump the heap size, as with the default settings it ran out of memory fairly quickly. After raising the max heap size to 48G it has now been running for about 1.5 hours and has only processed around 3M reads so far, and it has been sitting at that 3M mark for almost 30 minutes. Is there anything I might be able to do to increase throughput?
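
For reference, the heap bump was just the standard JVM flag in front of the jar (a minimal sketch; the jar name and DeDup arguments below are placeholders for whatever invocation you use):

```bash
# Raise the JVM max heap to 48 GB before DeDup's own arguments.
# "DeDup.jar" and the trailing options are placeholders, not verified flags.
java -Xmx48g -jar DeDup.jar [dedup options...]
```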

Here are the flagstats for the BAM:

786466969 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
4730606 + 0 supplementary
0 + 0 duplicates
609772993 + 0 mapped (77.53% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
@apeltzer
Owner

apeltzer commented May 7, 2020

Hi Kevin! Thanks for opening the issue to discuss this. Without major modifications, I think bumping the heap size is the only thing we/you can do here. One possibility would be to split your mapped reads per chromosome (assuming you have multiple chromosomes in your reference genome), run DeDup on each split individually, and then merge the resulting BAM files back together in a "divide and conquer / map reduce" approach.
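
Roughly, the idea could look something like this (just a sketch, assuming a coordinate-sorted BAM and samtools on the PATH; the DeDup jar name, flags, and output naming are placeholders):

```bash
#!/usr/bin/env bash
set -euo pipefail

in=sample.bam
samtools index "$in"

# Split by chromosome and deduplicate each slice independently
# (these iterations could also be run in parallel on a cluster).
for chrom in $(samtools idxstats "$in" | cut -f1 | grep -v '^\*$'); do
    samtools view -b "$in" "$chrom" > "part.${chrom}.bam"
    # Placeholder DeDup call; substitute the actual jar and flags you use:
    java -Xmx8g -jar DeDup.jar -i "part.${chrom}.bam" -o "dedup_${chrom}/"
done

# Merge the deduplicated slices back into a single BAM
# (the *_rmdup.bam glob depends on DeDup's output naming).
samtools merge -f sample.dedup.bam dedup_*/*_rmdup.bam
```

Note that unmapped reads (the `*` record in idxstats) are not picked up by the per-chromosome loop, so they would need to be carried along separately if you want them in the merged BAM.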

I thought about this here: nf-core/eager#30 for nf-core/eager, so maybe this is a good use case now to really make it happen ;-)

@lordkev
Author

lordkev commented May 7, 2020

Yes, I think that would work great, for my use case at least. I work pretty much exclusively on human whole-genome data, so the parallelization benefits of splitting per chromosome would be huge.

I do wonder if there wasn't something else strange going on with particular files. I'd have to go back and look, but I had one file in the neighborhood of 400-500M reads that completed within a reasonable number of hours after bumping the heap size up. However, this other file with ~700M reads was going so slowly that it looked like it might take days. It could have been that one had a much higher duplicate rate than the other, though.

In any case, I suspect parallelizing by chromosome would make it a non-issue.
