-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue with variant calls from minigrah-cactus graph #4062
Comments
This comes from the GBZ file format, where HAPLOTYPE paths can be stored extremely efficiently (you can have thousands) but this comes at the cost of not supporting all types of queries on them. Minigraph-cactus follows this convention in that only samples you specifiy with Now if you only have 7 genomes, you can just set them all as REFERENCE and it'll probably be okay. The easiest way to do this is probably to take the output GFA from minigraph-cactus, and change the "RS" tag in the first line to include all your samples (separated by spaces):
From there you can make a new .gbz with
And then use that with The new .gbz will be the same as the old one, except the path sense will be changed to REFERENCE for everything. |
Hi @glennhickey Thanks that helped a lot. I decided just to run again and provide I did a uniq -c to count the number of calls per chromosome/haplotype.
L1_v1 is Llute_Wodjil and L2_v1 is Llute_Paris You can see Llute_Wodjil is heavily biased. Is this expected? This is for only one sample. I'll be aligning hundreds more samples and then creating a multi sample VCF. Thanks. |
The first reference does get special treatment and there will be indeed a bit of bias towards it. But I'm not quite sure what you're doing and what you're expecting. To get full calls on all samples, I think you will need to run |
Hi @glennhickey sorry its taken a while to get back to this. I've been aligning each sample to the graph, then calling variants to produce a VCF for each sample. I then merge the VCF files to produce a single VCF with multiple samples. For each of the reference/reference-sense paths in the graph, I've split the VCF into different references. You can see that there is a huge difference between different references, with Llute_Wodjil (the first reference) heavily biased:
Does this mean that variants identified in one reference-sense path (wtih Llute_Wodjil set as the first refeence), are completely unique, and any variants identified in say Llute_Wodjil would be a kind of consensus variant (found in all paths, or only on Llute_Wodjil? I'm trying to understand how these VCF would differ if I were to align to each reference usng conventional approaches instread, such as BWA-MEM followed by GATK. In this way could I consider variants identified to Llute_Wodjil as core pangenome sequence regions, and anything aligning to the other reference-sense paths considered as non-core shell regions found in only a subset. Thanks. |
Hi @glennhickey Ideally I would like a multiple sample VCF file. I've realised to really looking at population structure, we'll need a gVCF to get reference alleles in the VCF after I've merged all the VCF together, otherwise there will be missing calls when they should be reference alleles. Is there a way to get |
We don't presently have a way to call or genotype multiple samples at once in For your previous comment about the different VCF sizes, are you running If you're doing that and still seeing a big difference, it could be your other samples did not align very well into the graph. You can use For the gVCF, if you run call with |
Hi @glennhickey Thanks. Adding -a and then merging with -0 is working to bring all the samples together properly. I was wondering, is it possible to iteratively align WGS reads to the graph until a final graph has all samples within the graph with their own paths. I was considering could this be used in conjunction with ODGI PAV to get presence absence across a huge number of haplotypes and WGS samples aligned? Usually I have used PGGB to generate a graph, ODGI PAV to create a PAV matrix across a relatively small subset, and wondered if aligning hundreds of samples to the graph, could I then get a PAV matrix based on aligned samples as well, which would be much more comprehensive across a population. Also if it possible to infer a phylogenetic tree from the final pangenome graph with hundreds of samples aligned? I usually run mashtree across all my assembled genomes of pseudoomolecule quality, to get a newick tree, but also wondered if I aligned WGS data to the graph, is there a tool which can interpret phylogeny based on the graph alone? Thanks. |
The only effective way I"m aware of now to make pangenome graphs from WGS is in one shot from VCF with For what you're doing, I think you'd either need to try to get |
Thanks. I remember trying to generate a graph from VCF once before but found that my chromosome lengths were too long. I have to use CSI indexing for my genomes. Has this been updated in the latest vg toolkit build? |
Unfortunately, it doesn't look like it, as the issues remain open. |
Ok thanks. Hopefully its implemented at some point. I'm currently looking at comparing results between PGGB and MC. Specifically how they compare with identified variants and aligning samples with giraffe. My pangenome in this case is relatively small (7 lupin accessions), and I believe vg autoindex will likely complete, unlike with previous attempts with much larger pangenomes (~20 barley accessions) The d2.gbz outputs from MC, how are they generated? I assume clipping is done so that nodes have a minimmum depth of 2, to avoid edge cases. Which tools and commands are used to clip out these edge cases? I'd like to generate similar d2.gbz output and other necessary files using my PGGB graph and vg autoindex. |
A few things:
|
Hi @glennhickey Thanks. I've tried generating a I use
I simply reran
Once I get a |
Sorry that's a typo on my part. You make the There is also an example of doing all this in this tutorial in this section. |
Hi @glennhickey Thanks. From this workflow it looks like for each sample I need to create a personalised pangenome. Could I create a personalised pangenome from all the reads using kmc or would I have to do kmer counting for each read set prior to mapping the reads? I have around 400 samples, R1 and R2. Thanks. |
Yeah, the idea is to make a personalized genome for each sample. As shown in that preprint, it doesn't affect the overall running time much: the cost of making the personal pangenome is recovered by faster mapping time. Not that there's anything preventing you from making a personal genome for a set of samples together -- it's just we haven't really benchmarked this nor looked at the best parameters, and I wouldn't expect it to work as well as the per-sample approach. |
Thanks. I'll stick to a personalied pangenome per sample. Is there a way to generate a hapl file from the GBZ file? I've been using a more up to date vg tookit than the one that comes with MC. I've been getting these errors before with giraffe (below), and read on another issue post that it would be best to generate the min and dist indexes again from the GBZ prior to running giraffe. I'd imagine it would be best to do the same for the hapl files as well.
|
You can generally ignore most giraffe warnings, including that one. Cactus includes the latest vg release. |
Hi @glennhickey Just looking over the workflow for the personalised pangenome. Is the aim to use the hapl file and the kmer counts with girraffe with the unfiltered pangenome graph (GBZ). Or do I use giraffe (-Z parameter) with the personalised pangenome graph which is output by a previous run of giraffe? for example, theres one example which is from manual sampling and another which integrates with giraffe, both seem a bit different. Which do I use? I assume for both I need to run
what does the -i parameter do? I cant see it in the usage list. |
Once I have the resulting GAM file using the HAPL file using the Girraffe integrartion approach above, I want to augment the genome graph with the GAM file to call variants on. I usually do this:
Should the GBZ file when augmenting with the GAM file, be a personalised GBZ file or the unfiltered GBZ file? If the personalised GBZ file, then I should take the manual sampling approach to create a temporary graph ${sample}.gbz, instead of the Giraffe integration example above? |
Hi @glennhickey I've been getting an error about r-index when running giraffe:
I run this code which gives me the error, when trying to generate the HAPL file:
I need the HAPL file so I can run:
|
It looks like you don't have the r-index file. You can build it like this:
|
Thanks @jltsiren |
Hi @glennhickey and @jltsiren I've managed to get further along, but now get this error:
This is my workflow up untll giraffe:
|
Haplotype sampling expects a graph with a linear high-level structure. The error message indicates that your graph lacks that structure. There are 26 weakly connected components, which should correspond to chromosomes. On the other hand, there are 49 top-level chains, which are the linear structures haplotype sampling uses for generating haplotypes. Because the numbers don't match, many of the internal assumptions don't hold and the algorithm doesn't work. If you want to use haplotype sampling, you need to rebuild the graph so that the chromosomes / other contigs are clearly separated and it's possible to find end-to-end paths over each contig. |
Hi @jltsiren Thanks. I have 26 chromosomes for each of the 7 genomes. There are no contigs, only 26 chromosomes for each of the 7. I ran all of these together with minigraph-cactus, and then ran the join method at the end which merges all chromosomes together. Perhaps I should run the mapping stage on each chromosome manually first and then join afterwards. I've done this for much larger pangenomes, but because this genome is much smaller, I ran it all together. If I rebuild the graph, what exactly should I be changing? The only thing I can see which could change is the join stage:
|
Also worth noting, |
The output of minigraph-cactus should always be compatible with |
Hi @glennhickey The version I have is from November last year. I'll try with the |
Hi @glennhickey Should I use |
You want to use |
Hi
I'm using vg tools v1.50.1 "Monopoli"
I've produced a graph from 7 different genomes of the same species using minigraph-cactus.
I've used the indexes produced from minigraph-cactus with
--giraffe`` parameter and obtained the
.d2.gbz``` index.I then map WGS reads from a different species entirely using giraffe, and follow the tutorial here: https://github.com/vgteam/vg/wiki/SV-Genotyping-and-variant-calling
I've found the resulting VCF, whether I use an augmented graph or not, still gives the calls against the reference I used when I ran minigraph-cactus, and none of the other 6 haplotypes my reads would have also mapped to.
What I want is the VCF to contain variant calls against all different paths in the graph, not just the reference I chose, but to provide variant calls against all 7 different haplotypes in the graph. I want to use the graph as a single point of call for variation across the pangenome, but yet I still get variant calls against only 1 of the 7 genomes (the reference used in minigraph-cactus). Am I missing something, a particular step, or is this expected when using minigraph-cactus? Should I use PGGB, if I want calls against all haplotypes, as PGGB maps all haplotypes against all haplotypes and has no single reference.
Thanks.
The text was updated successfully, but these errors were encountered: