Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

am I running this wrong? #183

Open
shinlin77 opened this issue Apr 1, 2023 · 7 comments
Open

am I running this wrong? #183

shinlin77 opened this issue Apr 1, 2023 · 7 comments

Comments

@shinlin77
Copy link

I have SR and LR on the same genomes. Previously, I used the following pipelines on the

SR: bowtie2 -> picard -> GATK
LR: minimap2 -> clair3 -> GATK (to round up the gvcfs)

and got reasonable concordance between the two, meaning most of the variants in one were in the other. Moreover, for variants called by both, the majority of the individual calls were the same, too.

Now, I was told PMD was a great caller for ONT data, so I used the following updated pipeline

LR updated: minimap2 -> PMD -> GLnexus (to round up the gvcfs)

however, the SR and LR updated results had almost no concordance. I feel maybe I'm not running PMD right. This is the commandline

singularity exec --bind /where_data_is/ --bind /local/scratch/ --bind /where_genome_is/ /server/apps/software/PEPPER_Margin_DeepVariant/0.8/docker/pepper_deepvariant run_pepper_margin_deepvariant call_variant -b LR_aligned.sorted.bam -f human_genome.fasta -o /output/PMD -p pepper_margin_deepvariant -t 16 --ont_r9_guppy5_sup --gvcf

Interestingly, the result has very few SNVs (they are almost all SVs). Do you have any suggestions (e.g. parameters to tweak)? Thanks.

@kishwarshafin
Copy link
Owner

kishwarshafin commented Apr 1, 2023

Hello @shinlin77 ,

Can you provide a little more information about this?

  1. How (not why, sorry) are you running GATK after clair3? Would it be the case that you are running GATK for both SR and LR and seeing high concordance?

  2. What is the basecaller you are using for the data, can you make sure it's SUP?

  3. What are the GLnexus parameters are you using? I suspect this is where you are filtering out a lot of variants.

Another way to compare is to take one sample, run PMDV and GATK and then compare before rounding up GVCFs, as this pipeline is modular, it should be easy to compare. You can also compare the VCFs using hap.py to derive a concordance set. We usually expect very high concordance if you have sufficient coverage.

@shinlin77
Copy link
Author

shinlin77 commented Apr 1, 2023

Here are my answers...

  1. For LR pipeline, I am using GATK not to call (clair3 is doing that). I use GATK GenotypeGVCFs just round up all the individual gvcfs generated from clair3.

  2. Sorry, I don't understand your comment on SUP. Which pipeline do you want me to look into (SR, LR, or LR updated)?

  3. singularity run -B /where_files_are/:/in -B /where_files_are/ $GLNEXUS_DOCKERFILE glnexus_cli --config DeepVariant -m 256 -t 12 --dir /tmp/GLnexus.DB
    /data/PMD_/individual1.g.vcf.gz
    /data/PMD_/individual2.g.vcf.gz
    ...
    > /output/cohort_PMD.bcf

I don't think the aggregation of the gvcfs step is making that big of a difference.

@kishwarshafin
Copy link
Owner

@shinlin77 , what happens if you use GATK GenotypeGVCFs with PMDV gvcfs? This behavior is inconsistent.

@shinlin77
Copy link
Author

GATK GenotypeGVCFs does not work on PMDV gvcfs. There are too many incompatibilities. For example PMDV gvcfs have

<*>

instead of

<NON_REF>

I changed those, which was easy enough, but then I ran into other incompatibilities that I couldn't figure out.

@kishwarshafin
Copy link
Owner

hi @shinlin77 ,

One of the easier way to reduce variability between pipeline and compare the variant calls is to take these files from one sample (not combined):

  1. VCF output (not gvcf) from short read variant call (short_read.vcf.gz)
  2. VCF output from long read PMDV pipeline (pmdv.vcf.gz)

And run:

docker run -it -v /input:/input \
jmcdani20/hap.py:v0.3.12 /opt/hap.py/bin/hap.py \
/input/short_read.vcf.gz \
/input/pmdv.vcf.gz \
-r /input/human_reference.fasta \
-o /input/happy_output \
--pass-only \
--no-roc \
--no-json \
--engine=vcfeval \
--threads=$(nproc)

The TP number will tell you how many variants are exactly the same (GT and allele-wise) between the two VCFs. You can do the same with clair3 vcf and compare the numbers.

@shinlin77
Copy link
Author

shinlin77 commented Apr 3, 2023

Here are the results. I'm not sure why the TRUTH.TOTAL columns are different when using the exact same gold standard, but apparently, somebody else has asked that same question on the hap.py github portal (without a response). As you can see, the recall for PMDV is very low.

image

@kishwarshafin
Copy link
Owner

@shinlin77 ,

The numbers look very discordant here. I am also unsure what is happening. Is it possible for you to share the bams so I can run them to see if there's something we are missing? Only chr20 would do in this case I think as the numbers are so off.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants