Skip to content
This repository has been archived by the owner on Feb 7, 2023. It is now read-only.

Missing Indels #49

Open
nathanhaigh opened this issue Mar 16, 2020 · 7 comments
Open

Missing Indels #49

nathanhaigh opened this issue Mar 16, 2020 · 7 comments

Comments

@nathanhaigh
Copy link

nathanhaigh commented Mar 16, 2020

My own analysis of Illumina (SRR11140750, bottom track) and nanopore (SRR11140751, top track) data from the same swab sample shows your variant analysis doesn't include indels:

image

You probably should include --call-indels in your call to lofreq call

@nekrut
Copy link
Collaborator

nekrut commented Mar 16, 2020

what is your protocol?

@nathanhaigh
Copy link
Author

nathanhaigh commented Mar 16, 2020

For these mappings, it's simply bwa mem for the short reads and minimap2 for the nanopore reads.

Given the small size of the genome, I just eye-balled possible variants and saw this. Then realised you don't call indels in your variant analysis.

This is the content I've developed for a Master/Major in Bioinformatics Genomics course:
https://uofabioinformaticshub.github.io/genomics_applications/Practicals/resequencing/resequencing.html

@nathanhaigh
Copy link
Author

Perhaps you are already calling variants. However, they are missing from these outputs: https://github.com/galaxyproject/SARS-CoV-2/tree/master/4-Variation#outputs

@wm75
Copy link
Collaborator

wm75 commented Mar 21, 2020

The current workflow fails to call indels because it does not use lofreq indelqual to add indel qualities to the mapped reads first. Without those lofreq call skips indels even when the --call-indels option is in use (which is the case already).

@nekrut are you interested in having indels get called and do want to update the workflow accordingly yourself?

@saramonzon
Copy link

Hi! I was just stopping by to say this, and you already did! Indels should be called, as we have observed a bunch of them in covid19 data. Moreover they are indels that conserve the open reading frame, only deleting with aa, which makes it a most probably functional variant.

Also I think you don't filter host reads in the variant pipeline, but I'm not sure if you do it in the assembly. We have observed that bwa mem is two much sensitive and softclips a lot of reads against the human genome, bowtie2 is much better option for this type of data, specially if you are using amplicon data.

Thanks for the repo!

@tseemann
Copy link

@saramonzon do you have evidence for host reads (human) aligning to the SARS genome?
It could be possible with bwa mem default settings like minscore=30 but that should not be used here.

You can reduce soft clipping by changing the end-bonus settings, or as you say, using glocal alignment via bowtie2 --end-to-end

@saramonzon
Copy link

saramonzon commented May 19, 2020

Hi @tseemann, yes using bwa with default parameters we obtain a considerably percentage of reads mapping to both human and SARS-Cov-2 depending on the sample, we have fixed it as you say using bowtie2, but using --local not --end-to-end.
But It's more likely that there are more virus reads mapping to human genome, than human reads mapping against SARS-Cov-2, however it could be a possibility that is worth to be avoided just in case.
I share here my notes when I was analyzing this just in case someone finds them useful, there are just personal notes, sorry about the writing!
README.txt

mvdbeek added a commit to mvdbeek/SARS-CoV-2 that referenced this issue Jun 1, 2020
Updates lofreq version and adds indelquals.
xref galaxyproject#49
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants