Mapping hits using a pangenome #183

Open
LeonardosMageiros opened this issue Nov 22, 2021 · 4 comments

@LeonardosMageiros

Hi,

I have run pyseer and would like to map my unitigs not to my list of input files but to a single FASTA/GFF file that represents my pangenome.

This file contains representative sequences of all the genes in my dataset (produced using Roary). If I understand correctly, using it as a reference in your annotate script should map every hit in my results back to a gene. Nonetheless, many unitigs in my output remain unmapped.

Am I missing something?
Any help would be much appreciated.

Best
Leonardos

@johnlees
Collaborator

For a single reference, you probably want to use this method rather than the annotate script:
https://pyseer.readthedocs.io/en/master/usage.html#mapping-to-references-phandango

Many unitigs from assemblies will not be in a Roary pangenome: those in intergenic regions, those not annotated correctly in the input, etc.
This also sounds like a mapping issue if you're using representatives. Some genes are very diverse and you may need to use different settings for the mapping, or a more sensitive aligner such as bowtie2.

This isn't something we directly support in pyseer, however.
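
If you do try an external aligner, the unitigs first need to be in a format it can read. The following is a minimal sketch, not part of pyseer, that turns a tab-separated results table into FASTA records. It assumes the unitig sequence sits in a column named `variant`; the function name `unitigs_to_fasta` and the `unitig_N` record names are hypothetical.

```python
import csv
import io


def unitigs_to_fasta(results_tsv, seq_column="variant"):
    """Convert a tab-separated results table into FASTA text.

    Assumption: the table has a header row and the unitig sequence
    is stored in `seq_column` (default "variant").
    """
    reader = csv.DictReader(io.StringIO(results_tsv), delimiter="\t")
    records = []
    for i, row in enumerate(reader, start=1):
        seq = row[seq_column].strip()
        if seq:  # skip rows with an empty sequence field
            # Record names are arbitrary; the row number keeps them unique.
            records.append(f">unitig_{i}\n{seq}")
    return "\n".join(records) + ("\n" if records else "")


# Toy input for illustration only (not real GWAS output).
example = "variant\taf\tlrt-pvalue\nACGT\t0.10\t1e-9\nGGCC\t0.20\t1e-8\n"
print(unitigs_to_fasta(example), end="")
```

The resulting FASTA could then be passed to an aligner as unpaired query sequences (bowtie2, for instance, reads FASTA input with its `-f` option). The alignment settings would still need tuning for diverse genes, as noted above.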

@LeonardosMageiros
Author

LeonardosMageiros commented Nov 23, 2021

Thank you very much for your quick response.
Things are clearer now.

Just one last question to see if my understanding is correct.

I have a dataset of ~1500 strains and have performed two GWAS runs on two different subsets of isolates.
My goal is to find the overlap between these two runs at the gene level.

If I use the annotate script with my pangenome listed first as the reference, followed by all the strains from the two GWAS runs as draft sequences, will the unitig mapping be consistent between the two runs, so that I can calculate the overlapping genes?

Do you see any reason why this approach would not work and, if so, can you suggest another way to tackle the problem?

Thank you very much in advance
Leonardos

@johnlees
Collaborator

Annotation is all done through mapping, so it is relative to whichever reference you use. If you map to the same sequences in both runs, the results share the same co-ordinate system and can be compared directly.
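
Once both runs are annotated against the same sequences, the gene-level comparison reduces to a set intersection. Here is a minimal sketch, assuming each run has been summarised into a tab-separated table with a header row and a `gene` column. The column name and the helper names `genes_with_hits` and `gene_overlap` are assumptions for illustration, not pyseer functions.

```python
import csv
import io


def genes_with_hits(summary_tsv, gene_column="gene"):
    """Return the set of gene names in a per-run summary table.

    Assumption: tab-separated text with a header row, where
    `gene_column` holds the gene name each hit was mapped to.
    """
    reader = csv.DictReader(io.StringIO(summary_tsv), delimiter="\t")
    return {row[gene_column].strip() for row in reader if row[gene_column].strip()}


def gene_overlap(run_a_tsv, run_b_tsv, gene_column="gene"):
    """Split genes into those shared by both runs and those unique to each."""
    a = genes_with_hits(run_a_tsv, gene_column)
    b = genes_with_hits(run_b_tsv, gene_column)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Toy tables with made-up gene names and hit counts, for illustration only.
run_a = "gene\thits\ngeneA\t12\ngeneB\t3\n"
run_b = "gene\thits\ngeneB\t7\ngeneC\t1\n"
print(gene_overlap(run_a, run_b))
```

This only works if gene names mean the same thing in both tables, which is why both runs must be mapped to the same reference.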

@LeonardosMageiros
Author

Can I also ask whether there is a way to map the results of a gene presence/absence analysis to the same sequences I use for mapping my unitig GWAS?
