
cleaned up GPU consensus calling #661

Open
wants to merge 88 commits into master

Conversation

hasindu2008
Contributor

This pull request restructures and cleans up the pull request #468 (GPU accelerated nanopolish consensus) by @vellamike so that:

  1. it can be automatically merged; and
  2. changes to original nanopolish code and behaviour are minimised.

In summary:

  • newly added files are: cuda.mk, src/cuda_kernels/GpuAligner.cu,
    src/cuda_kernels/GpuAligner.h and src/cuda_kernels/gpu_call_variants.inl.

Importantly, these files are effective (i.e. compiled) only if make is invoked as make cuda=1. Those files are completely ignored otherwise.

  • changes to existing files are :

    Makefile - includes cuda.mk if called with make cuda=1

    README.md - a brief guide on compiling and running on GPUs

    src/nanopolish_call_variants.cpp - a gpu option is added, while the default behaviour remains the CPU code path.

    The GPU code path generate_candidate_single_base_edits_gpu is compiled only if make cuda=1 is specified, and thus has no impact on the existing code (see the build sketch below).
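
For reference, a minimal sketch of the two build modes described above (make clean and the -j value are illustrative, not part of this PR):

# CPU-only build: cuda.mk and the src/cuda_kernels files are ignored, behaviour is unchanged
make clean && make -j 8
# GPU-enabled build: the Makefile includes cuda.mk and the CUDA sources are compiled in
make clean && make cuda=1 -j 8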


I did some benchmarks based on the small chr20 dataset (average coverage of around 30-40X) on three different systems: a server, a laptop and a Jetson dev board. In all cases, a speedup of ~5X was observed for the GPU implementation compared to its CPU counterpart (run with the maximum number of threads available on the CPU). Importantly, the outputs from the GPU and CPU are very similar except for a handful of differences, probably due to floating point handling. Further, the implementation was robust: it ran in all these cases without an issue.

The average speedups for three separate 50 kb regions (chr20:5000k-5050k, chr20:5050k-5100k, chr20:5100k-5150k) are as below:

[chart: average GPU-vs-CPU speedups on the three systems]

The averages in the graph are based on three executions and the raw time values are as below:
[table: raw time values for the three executions]

I also checked on a 1M region (chr20:5M-6M), and the speedup on the server with a Tesla V100 was even better, at ~7X. The raw values are as below:
[table: raw time values for the chr20:5M-6M run]


Given that Nanopolish consensus is known to be a very time-consuming process for larger genomes, I think this will benefit Nanopolish users.


  • The compiled binary (with GPU support) is attached if you wish to test on a system with an NVIDIA GPU without the trouble of installing the CUDA toolkit:
    nanopolish-gpu-bin.tar.gz

  • The test script, raw outputs and logs are also attached. Just extracting and running simple_bench.sh may work (see the sketch below):
    test.tar.gz
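
A hedged sketch of trying the attached artifacts (the extracted paths are assumptions based on the file names above):

# extract the prebuilt GPU-enabled binary and the test bundle
tar xzf nanopolish-gpu-bin.tar.gz
tar xzf test.tar.gz
# run the benchmark script from the test bundle (the exact path inside the archive may differ)
./simple_bench.sh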

@lfaino

lfaino commented Oct 5, 2019

Hi,
Very nice work, but I have a question.
I have a PC with 3 GPUs; how can I use all of them? Can I set which card a region is sent to?

Additionally, on a multi-threaded PC you can use GNU parallel to send several processes (regions) at a time, each with a small number of nanopolish threads (-t option).
How beneficial is the GPU over the CPU in this context?

Cheers
Luigi

@hasindu2008
Contributor Author

At the moment, the first GPU is used by default. It is possible to add a command line option so that the user can specify which GPU to execute on; I am planning to do this in the near future.

I have not benchmarked the CPU version with that multi-process approach so far. However, the multi-threaded approaches I benchmarked had around 70-90% CPU utilisation, so I guess the multi-process approach will be slightly faster overall. In fact, on modern GPUs it is possible to launch multiple contexts as well, but these approaches are yet to be evaluated.

@vellamike could give more insights into these questions.

@lfaino

lfaino commented Oct 8, 2019

Hi,

I made some tests myself. Considering that on a GTX 1070 I can launch 7 processes at a time, I was able to make a variant call in 22 minutes compared to 44 on the CPU.

I think that with the dataset I used, the GPU approach is at least 50% faster.

@hasindu2008
Contributor Author

What was the average coverage of the dataset? Is it a publicly available dataset? If so, I can give it a try on a V100 as well. Also, how did you launch the multiple processes: was it through nanopolish's makerange script? If possible, please share the commands.

@vellamike

@lfaino You can also choose the GPU by setting the CUDA_VISIBLE_DEVICES environment variable, for example export CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1. There is no easy way to find out which GPU is which device, but in general device 0 will be the fastest GPU on your system.

If you are able to launch multiple processes from the command line, you should set CUDA_VISIBLE_DEVICES differently for each. In your case you have 3 GPUs, so you could do:

CUDA_VISIBLE_DEVICES=0 nanopolish ...
CUDA_VISIBLE_DEVICES=1 nanopolish ...
CUDA_VISIBLE_DEVICES=2 nanopolish ...

The GTX 1070 is a relatively old consumer card, so a card like a GV100 should be faster.

Multi-GPU support is something we can definitely add in the future.

@lfaino

lfaino commented Oct 8, 2019

@hasindu2008
Here is the command for the CPU:
time python3 /data/software/nanopolish/scripts/nanopolish_makerange.py reads.paf.b02.racon1.fasta | parallel --results nanopolish.results -P 20 nanopolish variants --consensus -o cpu/polished.{1}.vcf -w {1} -r barcode02_guppy.fastq -b test.sorted.bam -g reads.paf.b02.racon1.fasta -t 4 --min-candidate-frequency 0.1

Here is the command for the GPU:
time python3 /data/software/nanopolish/scripts/nanopolish_makerange.py reads.paf.b02.racon1.fasta | parallel --results nanopolish.results -P 7 /data/software/nanopolish_gpu/nanopolish variants --consensus -o gpu/polished.{1}.vcf -w {1} -r barcode02_guppy.fastq -b test.sorted.bam -g reads.paf.b02.racon1.fasta --gpu=2 -t 4 --min-candidate-frequency 0.1

I made an error in the CPU command because I used 80 threads in total (-P 20 and -t 4), but I have a system with 72 threads in total.

I work on a system with a GTX 1070 with 8 GB of RAM.

About the dataset: it is not publicly available (but I can share it as long as you keep it to yourself), and it is about 50X data of a bacterial genome about 6 Mb in size.

Cheers
Luigi

@lfaino

lfaino commented Oct 8, 2019

@vellamike,
My idea was a bit different, but I can try to make a workaround.
I would like to use parallel with makerange.py and send work to one GPU or another based on which process has finished; in simpler words, track which GPU finished a job and use it again (see the sketch below).
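
A minimal sketch of such a workaround using GNU parallel's {%} job-slot number (slots 1-3 map to the 3 GPUs, and a freed slot, i.e. a GPU that has finished, is reused for the next region); the file names and the --gpu flag are copied from the GPU command above:

python3 /data/software/nanopolish/scripts/nanopolish_makerange.py reads.paf.b02.racon1.fasta | parallel -P 3 'CUDA_VISIBLE_DEVICES=$(( {%} - 1 )) /data/software/nanopolish_gpu/nanopolish variants --consensus -o gpu/polished.{1}.vcf -w {1} -r barcode02_guppy.fastq -b test.sorted.bam -g reads.paf.b02.racon1.fasta --gpu=2 -t 4 --min-candidate-frequency 0.1'

To run more than one process per card, -P could be raised to a multiple of 3 and the device picked with $(( ({%} - 1) % 3 )) instead.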

Just to be clear, is it possible in the future to have something like

python -m torch.distributed.launch --nproc_per_node=4 train_flipflop.py ...

like in the taiyaki scripts?

@vellamike

vellamike commented Oct 8, 2019 via email

@hasindu2008
Contributor Author

@jts
This version now supports methylation-aware polishing on the GPU.

The outputs from the CPU (left) and GPU (right) for multi-model (-q dam,dcm) match closely.
[screenshot: CPU vs GPU multi-model VCF comparison]

The outputs from single-model also match closely.
[screenshot: CPU vs GPU single-model VCF comparison]

Experiment details

Reference:
ftp://ftp.ensemblgenomes.org/pub/bacteria/release-45/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa.gz

Reads:
https://nanopore.s3.climb.ac.uk/Zymo-GridION-EVEN-BB-SN.fq.gz
https://nanopore.s3.climb.ac.uk/Zymo-GridION-EVEN-BB-SN_signal.tar

Commands:

minimap2 -ax map-ont -t32 ecoli_k12_ms1655.fa Zymo-GridION-EVEN-BB-SN.fq > Zymo-GridION-EVEN-BB-SN.sam 
samtools sort  Zymo-GridION-EVEN-BB-SN.sam >  Zymo-GridION-EVEN-BB-SN.bam 
samtools index  Zymo-GridION-EVEN-BB-SN.bam 
~/nanopolish/nanopolish index Zymo-GridION-EVEN-BB-SN.fq -d  Zymo-GridION-EVEN-BB-SN 

Then data for the region Chromosome:200000-202000 were extracted and the following commands were run:

./nanopolish variants --consensus -o $DATADIR/polished_cpu_single_model.vcf -w "Chromosome:200000-202000" -r $DATADIR/reads.fasta -b $DATADIR/reads.bam -g $DATADIR/draft.fa 
 ./nanopolish variants --consensus -o $DATADIR/polished_gpu_single_model.vcf -w "Chromosome:200000-202000" -r $DATADIR/reads.fasta -b $DATADIR/reads.bam -g $DATADIR/draft.fa --gpu=1 
./nanopolish variants --consensus -o $DATADIR/polished_cpu_multi_model.vcf -w "Chromosome:200000-202000" -r $DATADIR/reads.fasta -b $DATADIR/reads.bam -g $DATADIR/draft.fa   -q dcm,dam  
./nanopolish variants --consensus -o $DATADIR/polished_gpu_multi_model.vcf -w "Chromosome:200000-202000" -r $DATADIR/reads.fasta -b $DATADIR/reads.bam -g $DATADIR/draft.fa --gpu=1  -q dcm,dam 
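
To quantify how closely the CPU and GPU outputs agree, one simple check (a sketch, not necessarily the comparison that was actually used) is to diff the variant records while ignoring the VCF header lines:

# single-model: compare CPU vs GPU variant records
diff <(grep -v '^#' $DATADIR/polished_cpu_single_model.vcf) <(grep -v '^#' $DATADIR/polished_gpu_single_model.vcf)
# multi-model (-q dcm,dam): compare CPU vs GPU variant records
diff <(grep -v '^#' $DATADIR/polished_cpu_multi_model.vcf) <(grep -v '^#' $DATADIR/polished_gpu_multi_model.vcf)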

The vcf files are here:
vcfs.zip

@hasindu2008
Contributor Author

@jts
Tested on the human genome as well, for chr20:5000000-5050000 from the methylation calling tutorial dataset. The VCF outputs from the CPU and GPU closely match for both single-model and multi-model runs.

On my laptop (12-core Intel i7, 16 GB RAM and an NVIDIA 1050 GPU):
Single-model
CPU : Elapsed (wall clock) time (h:mm:ss or m:ss): 3:15.61
GPU : Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.96

Multi-model (-q cpg)
CPU : Elapsed (wall clock) time (h:mm:ss or m:ss): 4:35.16
GPU : Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.25

scripts used, logs generated and VCF outputs:
scripts_and_results.zip
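
The wall-clock figures above are in the format printed by GNU time. A sketch of how such a run might be timed (file names are placeholders; the window, -q cpg and --gpu=1 follow the text above):

/usr/bin/time -v ./nanopolish variants --consensus -o polished_cpu.vcf -w "chr20:5000000-5050000" -r reads.fasta -b reads.sorted.bam -g draft.fa -q cpg
/usr/bin/time -v ./nanopolish variants --consensus -o polished_gpu.vcf -w "chr20:5000000-5050000" -r reads.fasta -b reads.sorted.bam -g draft.fa -q cpg --gpu=1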
