
cleaned up GPU consensus calling #661

Open
wants to merge 88 commits into master

Conversation

hasindu2008
Contributor

This pull request restructures and cleans up the pull request #468 (GPU accelerated nanopolish consensus) by @vellamike so that:

  1. it can be automatically merged; and
  2. changes to original nanopolish code and behaviour are minimised.

In summary:

  • newly added files are: cuda.mk, src/cuda_kernels/GpuAligner.cu,
    src/cuda_kernels/GpuAligner.h and src/cuda_kernels/gpu_call_variants.inl.

Importantly, these files are effective (i.e. compiled) only if make is invoked as make cuda=1. Those files are completely ignored otherwise.

  • changes to existing files are :

    Makefile - includes cuda.mk if called with make cuda=1

    README.md - a brief guide on compiling and running on GPUs

    src/nanopolish_call_variants.cpp - a gpu option is added, while the default behaviour remains the CPU code path.

    The GPU code path generate_candidate_single_base_edits_gpu is compiled only if make cuda=1 is specified, and thus has no impact on the existing code (see the build sketch below).
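
For reference, a minimal sketch of the two build modes described above (make clean and the -j value are illustrative, not part of this PR):

# CPU-only build: cuda.mk and the src/cuda_kernels files are ignored, behaviour is unchanged
make clean && make -j 8
# GPU-enabled build: the Makefile includes cuda.mk and the CUDA sources are compiled in
make clean && make cuda=1 -j 8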


I did some benchmarks based on the small chr20 dataset (average coverage of around 30-40X) on three different systems: a server, a laptop and a Jetson dev board. In all cases, a speedup of ~5X was observed for the GPU implementation compared to its CPU counterpart (run with the maximum number of threads available on the CPU). Importantly, the outputs from the GPU and CPU are very similar except for a handful of differences, probably due to floating point handling. Further, the implementation was robust: it ran in all these cases without an issue.

The average speedups for three separate 50 kb regions (chr20:5000k-5050k, chr20:5050k-5100k, chr20:5100k-5150k) are as below:

[chart: average GPU-vs-CPU speedups on the three systems]

The averages in the graph are based on three executions and the raw time values are as below:
[table: raw time values for the three executions]

I also checked on a 1M region (chr20:5M-6M), and the speedup on the server with a Tesla V100 was even better, at ~7X. The raw values are as below:
[table: raw time values for the chr20:5M-6M run]


Given that Nanopolish consensus is known to be a very time-consuming process for larger genomes, I think this will benefit Nanopolish users.


  • The compiled binary (with GPU support) is attached if you wish to test on a system with an NVIDIA GPU without the trouble of installing the CUDA toolkit:
    nanopolish-gpu-bin.tar.gz

  • The test script, raw outputs and logs are also attached. Just extracting and running simple_bench.sh may work (see the sketch below):
    test.tar.gz
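
A hedged sketch of trying the attached artifacts (the extracted paths are assumptions based on the file names above):

# extract the prebuilt GPU-enabled binary and the test bundle
tar xzf nanopolish-gpu-bin.tar.gz
tar xzf test.tar.gz
# run the benchmark script from the test bundle (the exact path inside the archive may differ)
./simple_bench.sh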

@lfaino

lfaino commented Oct 5, 2019

Hi,
Very nice work, but I have a question.
I have a PC with 3 GPUs; how can I use all of them? Can I set which card a region is sent to?

Additionally, on a multi-threaded PC you can use GNU parallel to send several processes (regions) at a time, each with a small number of nanopolish threads (-t option).
How beneficial is the GPU over the CPU in this context?

Cheers
Luigi

@hasindu2008
Contributor Author

At the moment, the first GPU is used by default. It is possible to add a command line option so that the user can specify which GPU to execute on; I am planning to do this in the near future.

I have not benchmarked the CPU version with that multi-process approach so far. However, the multi-threaded approaches I benchmarked had around 70-90% CPU utilisation, so I guess the multi-process approach will be slightly faster overall. In fact, on modern GPUs it is possible to launch multiple contexts as well, but these approaches are yet to be evaluated.

@vellamike could give more insights into these questions.

@lfaino

lfaino commented Oct 8, 2019

Hi,

I made some tests myself. Considering that on a GTX 1070 I can launch 7 processes at a time, I was able to make a variant call in 22 minutes compared to 44 on the CPU.

I think that with the dataset I used, the GPU approach is at least 50% faster.

@hasindu2008
Contributor Author

What was the average coverage of the dataset? Is it a publicly available dataset? If so, I can give it a try on a V100 as well. Also, how did you launch the multiple processes: was it through nanopolish's makerange script? If possible, please share the commands.

@vellamike

@lfaino You can also choose the GPU by setting the CUDA_VISIBLE_DEVICES environment variable, for example export CUDA_VISIBLE_DEVICES=0 or CUDA_VISIBLE_DEVICES=1. There is no easy way to find out which GPU is which device, but in general device 0 will be the fastest GPU on your system.

If you are able to launch multiple processes from the command line, you should set CUDA_VISIBLE_DEVICES differently for each. In your case you have 3 GPUs, so you could do:

CUDA_VISIBLE_DEVICES=0 nanopolish ...
CUDA_VISIBLE_DEVICES=1 nanopolish ...
CUDA_VISIBLE_DEVICES=2 nanopolish ...

The GTX 1070 is a relatively old consumer card, so a card like a GV100 should be faster.

Multi-GPU support is something we can definitely add in the future.

@lfaino

lfaino commented Oct 8, 2019

@hasindu2008
Here is the command for the CPU:
time python3 /data/software/nanopolish/scripts/nanopolish_makerange.py reads.paf.b02.racon1.fasta | parallel --results nanopolish.results -P 20 nanopolish variants --consensus -o cpu/polished.{1}.vcf -w {1} -r barcode02_guppy.fastq -b test.sorted.bam -g reads.paf.b02.racon1.fasta -t 4 --min-candidate-frequency 0.1

Here is the command for the GPU:
time python3 /data/software/nanopolish/scripts/nanopolish_makerange.py reads.paf.b02.racon1.fasta | parallel --results nanopolish.results -P 7 /data/software/nanopolish_gpu/nanopolish variants --consensus -o gpu/polished.{1}.vcf -w {1} -r barcode02_guppy.fastq -b test.sorted.bam -g reads.paf.b02.racon1.fasta --gpu=2 -t 4 --min-candidate-frequency 0.1

I made an error in the CPU command because I used 80 threads in total (-P 20 and -t 4), but I have a system with 72 threads in total.

I work on a system with a GTX 1070 with 8 GB of RAM.

About the dataset: it is not publicly available (but I can share it as long as you keep it to yourself), and it is about 50X data of a bacterial genome about 6 Mb in size.

Cheers
Luigi

@lfaino

lfaino commented Oct 8, 2019

@vellamike,
My idea was a bit different, but I can try to make a workaround.
I would like to use parallel with makerange.py and send work to one GPU or another based on which process has finished; in simpler words, track which GPU finished a job and use it again (see the sketch below).
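
A minimal sketch of such a workaround using GNU parallel's {%} job-slot number (slots 1-3 map to the 3 GPUs, and a freed slot, i.e. a GPU that has finished, is reused for the next region); the file names and the --gpu flag are copied from the GPU command above:

python3 /data/software/nanopolish/scripts/nanopolish_makerange.py reads.paf.b02.racon1.fasta | parallel -P 3 'CUDA_VISIBLE_DEVICES=$(( {%} - 1 )) /data/software/nanopolish_gpu/nanopolish variants --consensus -o gpu/polished.{1}.vcf -w {1} -r barcode02_guppy.fastq -b test.sorted.bam -g reads.paf.b02.racon1.fasta --gpu=2 -t 4 --min-candidate-frequency 0.1'

To run more than one process per card, -P could be raised to a multiple of 3 and the device picked with $(( ({%} - 1) % 3 )) instead.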

Just to be clear, is it possible in the future to have something like

python -m torch.distributed.launch --nproc_per_node=4 train_flipflop.py ...

like in the taiyaki scripts?

@vellamike

vellamike commented Oct 8, 2019 via email

@hasindu2008
Contributor Author

@jts
This version now supports methylation-aware polishing on the GPU.

The outputs from the CPU (left) and GPU (right) for multi-model (-q dam,dcm) match closely.
[screenshot: CPU vs GPU multi-model VCF comparison]

The outputs from single-model also match closely.
[screenshot: CPU vs GPU single-model VCF comparison]

Experiment details

Reference:
ftp://ftp.ensemblgenomes.org/pub/bacteria/release-45/fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa.gz

Reads:
https://nanopore.s3.climb.ac.uk/Zymo-GridION-EVEN-BB-SN.fq.gz
https://nanopore.s3.climb.ac.uk/Zymo-GridION-EVEN-BB-SN_signal.tar

Commands:

minimap2 -ax map-ont -t32 ecoli_k12_ms1655.fa Zymo-GridION-EVEN-BB-SN.fq > Zymo-GridION-EVEN-BB-SN.sam 
samtools sort  Zymo-GridION-EVEN-BB-SN.sam >  Zymo-GridION-EVEN-BB-SN.bam 
samtools index  Zymo-GridION-EVEN-BB-SN.bam 
~/nanopolish/nanopolish index Zymo-GridION-EVEN-BB-SN.fq -d  Zymo-GridION-EVEN-BB-SN 

Then data for the region Chromosome:200000-202000 were extracted and the following commands were run:

./nanopolish variants --consensus -o $DATADIR/polished_cpu_single_model.vcf -w "Chromosome:200000-202000" -r $DATADIR/reads.fasta -b $DATADIR/reads.bam -g $DATADIR/draft.fa 
 ./nanopolish variants --consensus -o $DATADIR/polished_gpu_single_model.vcf -w "Chromosome:200000-202000" -r $DATADIR/reads.fasta -b $DATADIR/reads.bam -g $DATADIR/draft.fa --gpu=1 
./nanopolish variants --consensus -o $DATADIR/polished_cpu_multi_model.vcf -w "Chromosome:200000-202000" -r $DATADIR/reads.fasta -b $DATADIR/reads.bam -g $DATADIR/draft.fa   -q dcm,dam  
./nanopolish variants --consensus -o $DATADIR/polished_gpu_multi_model.vcf -w "Chromosome:200000-202000" -r $DATADIR/reads.fasta -b $DATADIR/reads.bam -g $DATADIR/draft.fa --gpu=1  -q dcm,dam 
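
To quantify how closely the CPU and GPU outputs agree, one simple check (a sketch, not necessarily the comparison that was actually used) is to diff the variant records while ignoring the VCF header lines:

# single-model: compare CPU vs GPU variant records
diff <(grep -v '^#' $DATADIR/polished_cpu_single_model.vcf) <(grep -v '^#' $DATADIR/polished_gpu_single_model.vcf)
# multi-model (-q dcm,dam): compare CPU vs GPU variant records
diff <(grep -v '^#' $DATADIR/polished_cpu_multi_model.vcf) <(grep -v '^#' $DATADIR/polished_gpu_multi_model.vcf)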

The vcf files are here:
vcfs.zip

@hasindu2008
Contributor Author

@jts
Tested on the human genome as well, for chr20:5000000-5050000 from the methylation calling tutorial dataset. The VCF outputs from the CPU and GPU closely match for both single-model and multi-model runs.

On my laptop (12-core Intel i7, 16 GB RAM and an NVIDIA 1050 GPU):
Single-model
CPU : Elapsed (wall clock) time (h:mm:ss or m:ss): 3:15.61
GPU : Elapsed (wall clock) time (h:mm:ss or m:ss): 0:20.96

Multi-model (-q cpg)
CPU : Elapsed (wall clock) time (h:mm:ss or m:ss): 4:35.16
GPU : Elapsed (wall clock) time (h:mm:ss or m:ss): 0:29.25

scripts used, logs generated and VCF outputs:
scripts_and_results.zip
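
The wall-clock figures above are in the format printed by GNU time. A sketch of how such a run might be timed (file names are placeholders; the window, -q cpg and --gpu=1 follow the text above):

/usr/bin/time -v ./nanopolish variants --consensus -o polished_cpu.vcf -w "chr20:5000000-5050000" -r reads.fasta -b reads.sorted.bam -g draft.fa -q cpg
/usr/bin/time -v ./nanopolish variants --consensus -o polished_gpu.vcf -w "chr20:5000000-5050000" -r reads.fasta -b reads.sorted.bam -g draft.fa -q cpg --gpu=1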
