
Systematic decisions to ignore regions of the PRG #96

Open · iqbal-lab opened this issue Mar 30, 2018 · 8 comments

Comments

@iqbal-lab (Collaborator)

Given a (minimum and) maximum read length, and a PRG, we should be able to:

  1. decide that there are some places where we will never be able to draw inference, so we might as well ignore them (i.e. neither put them in the kmer-index nor store allele counts for them)

  2. decide that there are some places that should always be analysed jointly (so that if we chunk the genome, those chunks should be analysed concurrently, as one chunk)

@iqbal-lab (Collaborator, Author) commented Mar 30, 2018

Proposal 1

  1. Divide the PRG into chunks of length 10000bp (or whatever); say this gives N chunks
  2. For each chunk calculate the list of kmers contained within (not just on/overlapping alleles)
  3. Calculate an N x N matrix, where the (i,j) entry is the number of kmers in common between chunk i and chunk j

Any chunk that shares too many kmers with lots of other chunks is a repeat where we will never be able to place reads reliably → mask it out.

Any small set of chunks which share "enough" kmers, should be co-analysed.
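As a sketch of what steps 1–3 and the masking rule could look like — treating each chunk as a plain sequence for simplicity (which sidesteps the alternate-allele problem raised below), with `shared_kmer_matrix`, `repetitive_chunks`, and both thresholds being illustrative names/parameters rather than anything implemented:

```python
def kmers(seq, k):
    """All k-mers in a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_kmer_matrix(genome, chunk_len=10_000, k=31):
    """Steps 1-3: chunk the sequence, collect the k-mers in each chunk,
    and count the k-mers shared between every pair of chunks."""
    chunks = [genome[i:i + chunk_len] for i in range(0, len(genome), chunk_len)]
    kmer_sets = [kmers(c, k) for c in chunks]
    n = len(chunks)
    return [[len(kmer_sets[i] & kmer_sets[j]) for j in range(n)] for i in range(n)]

def repetitive_chunks(matrix, share_threshold, partner_threshold):
    """Mask a chunk if it shares >= share_threshold k-mers with more
    than partner_threshold other chunks (the 'too many, with lots of
    other chunks' rule)."""
    masked = []
    for i, row in enumerate(matrix):
        partners = sum(1 for j, s in enumerate(row) if j != i and s >= share_threshold)
        if partners > partner_threshold:
            masked.append(i)
    return masked
```

Small chunks of a tandem repeat all share their k-mers with each other and get masked, while a unique chunk survives.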

What's good about proposal 1

  • Simple to calculate, unambiguous.
  • Solves problems 1 and 2 as defined in the issue

**What's bad about proposal 1**

  • It doesn't make sense as stated. What exactly are we supposed to do about alternate alleles? If there are 5 long (1000bp) alleles at site 10, do we take all the kmers in all of them? Then we're comparing chunks where some have many more kmers. Chunk 5 might have all the kmers in one path through chunk 6, but not the kmers in the alternate paths/alleles, for example.
  • Well... if there are regions with sites/alleles which we decide are too repetitive, do we leave the sites/alleles in the PRG but never quasimap to them? We could flag them in quasimap output as repetitive? We would like to preserve the 1-1 mapping between PRG-VCF records and sites.

@iqbal-lab (Collaborator, Author)

**Proposal 2**

As proposal 1, except that when comparing chunk i and chunk j (assume diploid; easy to modify what I say for other ploidies), sample 100 (or some number) of possible pairs of paths in chunk i, and likewise in chunk j, calculate the statistic for each pair of these samples, and then take the average.
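A minimal sketch of this path-sampling comparison, under the simplifying assumption that a chunk is represented as a list of sites, each a list of alternate allele strings (the real PRG structure is richer); `sample_path` and `mean_shared_kmers` are hypothetical names, not anything implemented:

```python
import random

def kmers(seq, k):
    """All k-mers in a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def sample_path(chunk_sites, rng):
    """One haploid path through a chunk: pick one allele per site.
    `chunk_sites` is a simplified stand-in for a PRG chunk."""
    return "".join(rng.choice(alleles) for alleles in chunk_sites)

def mean_shared_kmers(chunk_i, chunk_j, k=31, n_samples=100, seed=0):
    """Proposal 2: sample diploid path pairs from each chunk and
    average the shared-k-mer count over the sampled pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        # one diploid sample per chunk = the union of two haploid paths' k-mers
        kmers_i = kmers(sample_path(chunk_i, rng), k) | kmers(sample_path(chunk_i, rng), k)
        kmers_j = kmers(sample_path(chunk_j, rng), k) | kmers(sample_path(chunk_j, rng), k)
        total += len(kmers_i & kmers_j)
    return total / n_samples
```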

@iqbal-lab (Collaborator, Author)

Impact on Plasmodium falciparum (key use case for us): will immediately remove the crazy repeat regions where we should not waste time trying to quasimap, or variant call. Less wasted time, lower RAM use.

Impact on MHC (also key): this is trickier. There are places (of key importance) where there are say 2 sites, a long way apart, each with say 5000 alternate alleles. Now subsets of these alleles are very similar, within each site, and also a smaller subset are very similar between sites.

To be concrete:

Site 1 has:
set 1 of 50 alleles which are very similar to each other
set 2 of 60 alleles which are very similar to each other (but more different from all others)
set 3 of 80 alleles which are very similar to each other (different to others)

Site 2 has:
set 1 of 10 alleles which are almost identical to alleles from Site 1 set 2
set 2 of 50 alleles which are very similar to each other (different to all others)
etc

Now, what should we do in terms of deciding whether site 1 and site 2 should be co-analysed?
In theory they share a bunch of alleles. But for any given sample (human - pair of genomes), probably they don't have anything in site 1 set 2 or site 2 set 1. And if not, we don't need to analyse them together.
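That per-sample reasoning could be expressed as a tiny decision helper — entirely hypothetical, assuming we had already precomputed which cross-site allele pairs are near-identical (e.g. site 1 set 2 vs site 2 set 1):

```python
def needs_coanalysis(candidates_site1, candidates_site2, shared_pairs):
    """Co-analyse two sites for this sample only if the sample's
    candidate alleles hit both sides of a cross-site near-identical
    pair; otherwise the sites can be called independently.
    `shared_pairs` is a precomputed list of (site1_allele, site2_allele)
    pairs known to be almost identical."""
    return any(a in candidates_site1 and b in candidates_site2
               for a, b in shared_pairs)
```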

@iqbal-lab (Collaborator, Author)

BTW, above I said something like
Impact on Plasmodium falciparum (key use case for us) will immediately remove the crazy repeat regions
I have since found a workaround for the moment: build the Plasmodium PRG just from Cortex calls, and Cortex has very low power to detect mutations in repeat regions. So this is not blocking for the Pf analyses we are planning.

@iqbal-lab (Collaborator, Author)

Also, since this issue was raised, Robyn and I had a conversation which resulted in

Proposal 3
As a one-off calculation, sample read-length paths from the PRG, map them back to the PRG, and then mask out regions where the reads from there map to too many places.

@iqbal-lab (Collaborator, Author)

..where "mask out" could mean modify the original VCF/whatever and regenerate a better PRG
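A toy version of proposal 3, using exact substring matching as a crude stand-in for quasimapping back to the PRG (which handles alternate paths); `multimapping_mask` and its parameters are illustrative only:

```python
import random

def count_hits(read, reference):
    """Number of exact-match placements of `read` in `reference` --
    a crude stand-in for quasimapping back to the PRG."""
    hits, start = 0, 0
    while True:
        idx = reference.find(read, start)
        if idx == -1:
            return hits
        hits += 1
        start = idx + 1

def multimapping_mask(reference, read_len=150, n_reads=1000, max_hits=1, seed=0):
    """Proposal 3 sketch: sample read-length substrings, map them back,
    and mark the sampled-from positions whose reads place in too many
    locations. The returned mask could then drive edits to the original
    VCF before regenerating the PRG."""
    rng = random.Random(seed)
    mask = [False] * len(reference)
    for _ in range(n_reads):
        pos = rng.randrange(len(reference) - read_len + 1)
        read = reference[pos:pos + read_len]
        if count_hits(read, reference) > max_hits:
            for p in range(pos, pos + read_len):
                mask[p] = True
    return mask
```

Reads sampled from a tandem repeat map to several places and get their region masked; reads from unique sequence map once and leave it unmasked.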

@ffranr (Contributor) commented May 17, 2018

There's a lot going on in this issue. I think that it would be beneficial to spin out certain problems into separate issues. For instance, I think that chunking the PRG as a memory scaling enhancement can be dealt with separately.

From what I understand, the regions that are beneficial to ignore are repeat regions. Surely there are already existing solutions which can identify repeat regions. @iqbal-lab will any of those solutions work for us?

@iqbal-lab (Collaborator, Author)

  1. I agree about chunking for memory reduction - belongs elsewhere
  2. There are methods for finding repeat regions on linear references. The word "repeat" annoyingly means many things in this field: it can mean an exact copy, a very approximate copy, or sequence that is internally repetitive (e.g. ATGATATAGCCGTCGTCGTCATTATATAC). The reason I like simply sampling read-length paths from the PRG and querying them is that it is a direct measure of what we want to do. I agree there is a scaling problem if you really try to pull out every read-length path from the graph, but you don't need to do all of them.

I think we need more clarity on what this is meant to achieve.

I think the onus is on me to do that, so taking ownership.
