
"Index requested greater than vector's size" when loading pufferfish index using salmon v1.0 #8

Open
mdshw5 opened this issue Dec 18, 2019 · 22 comments

Comments

@mdshw5

mdshw5 commented Dec 18, 2019

I'm opening this issue here because I believe it's more of a pufferfish-related matter, but will use the issue template from your salmon repository.

Is the bug primarily related to salmon (bulk mode) or alevin (single-cell mode)?
salmon/pufferfish
Describe the bug
When running salmon v1.0 using a rather large index, I receive an error raised from the cereal library "Index requested greater than vector's size". The log reads:

-----------------------------------------
| Loading contig table | Time = 12.954 s
-----------------------------------------
size = 35010142
-----------------------------------------
| Loading contig offsets | Time = 269.18 ms
-----------------------------------------
-----------------------------------------
| Loading reference lengths | Time = 7.8427 ms
-----------------------------------------
-----------------------------------------
| Loading eq table | Time = 3.3896 s
-----------------------------------------
-----------------------------------------
| Loading mphf table | Time = 3.8301 s
-----------------------------------------
size = 3567796961
Number of ones: 35010141
Number of ones per inventory item: 512
Inventory entries filled: 68380
-----------------------------------------
| Loading contig boundaries | Time = 11.288 s
-----------------------------------------
size = 3567796961
-----------------------------------------
| Loading sequence | Time = 7.763 s
-----------------------------------------
size = 2517492731
-----------------------------------------
| Loading positions | Time = 171.81 s
-----------------------------------------
size = 3221360466
-----------------------------------------
| Loading reference sequence | Time = 7.9564 s
-----------------------------------------
-----------------------------------------
| Loading reference accumulative lengths | Time = 35.741 ms
-----------------------------------------
Index requested greater than vector's size: 6442720932>6442720932
Index requested greater than vector's size: 6442720996>6442720932
Index requested greater than vector's size: 6442721060>6442720932
Index requested greater than vector's size: 6442721124>6442720932
Index requested greater than vector's size: 6442721188>6442720932
Index requested greater than vector's size: 6442721252>6442720932
Index requested greater than vector's size: 6442721316>6442720932
Index requested greater than vector's size: 6442721380>6442720932
Index requested greater than vector's size: 6442721444>6442720932
...

The index does not finish loading, and so salmon does not enter read quantification routines.

To Reproduce

  • Which version of salmon was used? 1.0
  • How was salmon installed (compiled, downloaded executable, through bioconda)?
    GitHub release tarball
  • Which reference (e.g. transcriptome) was used?
    Gencode v32, with additional elements representing genic introns and intergenic spaces.
  • Which read files were used?
    NCBI SRA run accession GSM2392582
  • Which program options were used?
    --no-version-check --libType ISR --threads 4 --seqBias --gcBias --useVBOpt

Expected behavior
I expected index loading to complete successfully.

Desktop (please complete the following information):

  • OS: CentOS6

Additional context
I've uploaded the index file archive here and the transcripts fasta file here.

@rob-p
Contributor

rob-p commented Dec 18, 2019

Hi @mdshw5,

Thanks for the detailed report! I grabbed the fasta file and indexed it, and (on the current develop branch of salmon, at least, which should not deviate substantially from 1.0) it was able to load the index and align the reads successfully. I'm currently downloading your pre-computed index to try and do a basic cross-check, but iCloud is taking some time ;P. In the meantime, I get the following sizes for the components of the index — does your index have anything that looks different?

total 18G
drwxrwxr-x 2 rob rob 4.0K Dec 18 17:26 .
drwxrwxr-x 4 rob rob 4.0K Dec 18 17:27 ..
-rw-rw-r-- 1 rob rob 697K Dec 18 16:57 complete_ref_lens.bin
-rw-rw-r-- 1 rob rob 2.7G Dec 18 17:22 ctable.bin
-rw-rw-r-- 1 rob rob 122M Dec 18 17:22 ctg_offsets.bin
-rw-rw-r-- 1 rob rob  32K Dec 18 16:57 duplicate_clusters.tsv
-rw-rw-r-- 1 rob rob 1.3G Dec 18 17:22 eqtable.bin
-rw-rw-r-- 1 rob rob  968 Dec 18 17:26 info.json
-rw-rw-r-- 1 rob rob 1.6G Dec 18 17:26 mphf.bin
-rw-rw-r-- 1 rob rob 9.4G Dec 18 17:26 pos.bin
-rw-rw-r-- 1 rob rob  115 Dec 18 17:26 pre_indexing.log
-rw-rw-r-- 1 rob rob 426M Dec 18 17:14 rank.bin
-rw-rw-r-- 1 rob rob 1.4M Dec 18 17:22 refAccumLengths.bin
-rw-rw-r-- 1 rob rob 4.5M Dec 18 17:26 ref_indexing.log
-rw-rw-r-- 1 rob rob 697K Dec 18 17:18 reflengths.bin
-rw-rw-r-- 1 rob rob 769M Dec 18 17:22 refseq.bin
-rw-rw-r-- 1 rob rob 851M Dec 18 17:14 seq.bin
-rw-rw-r-- 1 rob rob   96 Dec 18 17:26 versionInfo.json

Thanks!

@mdshw5
Author

mdshw5 commented Dec 18, 2019 via email

@mdshw5
Author

mdshw5 commented Dec 18, 2019

My file sizes are mostly the same:

-rw-r--r-- 1 shirlma2 shirlma2 697K Dec 17 14:25 complete_ref_lens.bin
-rw-r--r-- 1 shirlma2 shirlma2 2.7G Dec 17 15:36 ctable.bin
-rw-r--r-- 1 shirlma2 shirlma2 122M Dec 17 15:35 ctg_offsets.bin
-rw-r--r-- 1 shirlma2 shirlma2  32K Dec 17 14:24 duplicate_clusters.tsv
-rw-r--r-- 1 shirlma2 shirlma2 1.3G Dec 17 15:35 eqtable.bin
-rw-r--r-- 1 shirlma2 shirlma2  115 Dec 17 15:52 indexing.log
-rw-r--r-- 1 shirlma2 shirlma2 1013 Dec 17 15:51 info.json
-rw-r--r-- 1 shirlma2 shirlma2 1.6G Dec 17 15:52 mphf.bin
-rw-r--r-- 1 shirlma2 shirlma2 9.4G Dec 17 15:52 pos.bin
-rw-r--r-- 1 shirlma2 shirlma2 426M Dec 17 15:23 rank.bin
-rw-r--r-- 1 shirlma2 shirlma2 1.4M Dec 17 15:36 refAccumLengths.bin
-rw-r--r-- 1 shirlma2 shirlma2 697K Dec 17 15:31 reflengths.bin
-rw-r--r-- 1 shirlma2 shirlma2 769M Dec 17 15:36 refseq.bin
-rw-r--r-- 1 shirlma2 shirlma2 851M Dec 17 15:23 seq.bin
-rw-r--r-- 1 shirlma2 shirlma2   96 Dec 17 15:52 versionInfo.json

I'm assuming the info.json difference is down to file paths, but there is some useful info there:

{
    "index_version": 4,
    "reference_gfa": [
        "/home/shirlma2/pisces/human/gencode_basic/salmon"
    ],
    "sampling_type": "dense",
    "k": 31,
    "num_kmers": 2517492731,
    "num_contigs": 35010141,
    "seq_length": 3567796961,
    "have_ref_seq": true,
    "have_edge_vec": false,
    "SeqHash": "0f5545409ac11ab3c53aec46b9ba650a54c350259d7b019545a5d17ac8c0d379",
    "NameHash": "de498fa7bca71090cc5b82de17738e0bcc4146f7533f1e786bbaac1a17d2c632",
    "SeqHash512": "6023f73a5b01e262fbbf55d1600302c56d28aea6fa060fb0b5bb6d7172593ab9bb8154f9ff63a8d0b06734d500a65a8a91e7cf362f0805534c05fecc14fe2f28",
    "NameHash512": "1e01a9565163ab169c34672eeb6384413598235e56b1441c312646e2fe9e0d82a526add56afe5c5b55d1f1c6c2ae8b3a3645a670cd0bfcfc2f513dd70eab23d9",
    "DecoySeqHash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "DecoyNameHash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
    "num_decoys": 0,
    "first_decoy_index": 18446744073709551297
}

I also believe I might have created this index using the v0.99.0-beta2 pre-release. I just poked around comparing it to the current develop branch, and it looks like the cereal lib version was different: COMBINE-lab/salmon@v0.99.0-beta2...develop#diff-af3b638bc2a3e6c650974192a53c7291L556-R553.

That probably explains the issue I'm having; I'll rebuild the indices and confirm whether I still see it. Thanks again for your time!

@rob-p
Contributor

rob-p commented Dec 18, 2019

I just updated this as I noticed cereal released a new version after years. I don't think the issue you had was a listed bug, but I can try to deserialize your index with the new version (after the kids are in bed :)).

@mdshw5
Author

mdshw5 commented Dec 19, 2019

I just reindexed my transcripts fasta using salmon v1.0 and still have the exact same error messages. Just to check, I also tested on a CentOS7 machine with a different CPU architecture (though I think host-specific issues are unlikely), with no luck.

Out of desperation I tried dropping flags from my command and, as with most things done out of desperation, I got it working but am at a loss to explain why. The problem was the --gcBias flag. I'm still not sure why, since the --help-reads output still lists the flag as a valid option, but I'm willing to bet that the selective alignment algorithm broke the GC bias implementation.

@rob-p
Contributor

rob-p commented Dec 19, 2019

Thanks for the detective work. Actually, --gcBias should be a valid flag, and we explicitly tested it with the pufferfish-based selective alignment (as it required some interface changes). But, it's worth noting that it wasn't tested specifically in this scenario. That is, we tested it in selective-alignment with the default transcript index, with the decoy transcript index, and with the transcript + decoy genome index, and it worked well in all of those cases. However, we never tested it where we added all of those other sequences to the index and didn't mark them as decoys. I still have no idea why it might mess with cereal, but I could believe that having many of those other long sequences as non-decoys in the index might trigger a corner case. We'll have to dig deeper. Thanks for providing the info and files. Let us know if you stumble on anything else and we will do the same!

@rob-p
Contributor

rob-p commented Dec 19, 2019

Actually, @mdshw5 — one strange thing that I did notice during indexing (watching the log scroll by) is that there were a bunch of sequences that, after polyA clipping, had a length of 0. I think this could potentially be a corner case. I can dig deeper, but do you see any reason why your transcript fasta would contain sequences that are all A?

edit: a quick look suggests they are actually empty input sequences! I'll see if removing them helps.

@mdshw5
Author

mdshw5 commented Dec 19, 2019

Sounds good. Thanks again for the help tonight. I should point out that I posted my log with --quiet, so the zero-length transcript messages are missing. Looking here:

https://github.com/COMBINE-lab/salmon/blob/ea80f3a12a1a6ccb871a92175eda0bf94d2b114f/include/ReadExperiment.hpp#L311-L320

I can reason about why I encounter this error when building my weird index.

edit: just saw your reply and edit, and I think you're right. I'm constructing the Gencode transcript sequences using the GTF models, which produces some small transcripts that are usually filtered out during indexing, so I've become accustomed to ignoring these messages.

@mdshw5
Author

mdshw5 commented Dec 19, 2019

To explain what I'm working on (and maybe it's a terrible idea):

  1. Decoy sequences are a good idea but it's kind of silly to have limited baits with mappings that are "throw-away"
  2. Your pufferfish index and selective alignment strategy can index and map lots more sequence
  3. I can try to compartmentalize the whole genome into: mature transcripts, unprocessed transcripts (or maybe just concatenate constitutive introns together at the gene level), and intergenic regions. Possibly more compartments related to endogenous viruses and such.
  4. If I can ask salmon to quantify all of these sequences, do some light post-processing to renormalize transcripts to TPM, and apply appropriate normalization to the intronic and intergenic sequence counts, it would let me do a few things: QC my libraries (fraction intronic, intergenic, viral, ribosomal, mitochondrial), try to recover intronic reads from libraries that have a larger-than-normal pool of nascent transcript, and maybe calculate an RNA velocity-type metric based on the (processed/unprocessed) distribution of all genes in a sample. (A rough sketch of the renormalization step is below.)
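
To make point 4 concrete, here is a rough post-processing sketch (my own idea, not an existing salmon feature) that renormalizes TPM over primary targets only, reading a quant.sf-style table plus a hypothetical aux_targets.txt file with one auxiliary target name per line:

    // Rough sketch only: recompute TPM over "primary" targets, treating names
    // listed in aux_targets.txt (hypothetical file) as auxiliary. Assumes a
    // salmon quant.sf-style TSV: Name  Length  EffectiveLength  TPM  NumReads.
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <unordered_set>
    #include <vector>

    struct Row { std::string name; double effLen; double numReads; };

    int main(int argc, char** argv) {
      if (argc < 3) { std::cerr << "usage: renorm quant.sf aux_targets.txt\n"; return 1; }

      std::unordered_set<std::string> aux;                 // auxiliary target names
      std::ifstream auxFile(argv[2]);
      for (std::string n; std::getline(auxFile, n); ) { if (!n.empty()) aux.insert(n); }

      std::ifstream quant(argv[1]);
      std::string header;
      std::getline(quant, header);                         // skip the quant.sf header line

      std::vector<Row> rows;
      double denom = 0.0;                                  // sum of reads/effLen over primary targets
      for (std::string line; std::getline(quant, line); ) {
        std::istringstream ss(line);
        Row r{}; double len = 0.0, tpm = 0.0;
        if (!(ss >> r.name >> len >> r.effLen >> tpm >> r.numReads)) { continue; }
        if (aux.count(r.name) == 0 && r.effLen > 0.0) { denom += r.numReads / r.effLen; }
        rows.push_back(r);
      }
      if (denom == 0.0) { std::cerr << "no primary targets found\n"; return 1; }

      // Counts stay as they are; only TPM is recomputed, and aux targets get NA.
      std::cout << "Name\tNumReads\tPrimaryTPM\n";
      for (const auto& r : rows) {
        bool isAux = (aux.count(r.name) > 0) || r.effLen <= 0.0;
        if (isAux) { std::cout << r.name << '\t' << r.numReads << "\tNA\n"; }
        else {
          double tpm = (r.numReads / r.effLen) / denom * 1e6;
          std::cout << r.name << '\t' << r.numReads << '\t' << tpm << '\n';
        }
      }
      return 0;
    }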

@rob-p
Contributor

rob-p commented Dec 19, 2019

Regarding your first response:

I'm filtering out empty transcripts now and will try re-indexing. "Small" targets should not be a problem; actually, we have an explicit strategy to account for them in the index. However, I think empty targets are a corner case we didn't consider. The indexing process is lossless, so it keeps a record that these inputs existed even though they had no sequence, but that causes a calculation bug (I think) in the code you link to above. I think the best behavior here is to say that short targets are OK but empty targets are disallowed (do they technically violate the FASTA "format"?), and to throw an error during indexing if we see them.

Re your second idea: I see where you're going with this, and we're definitely on board. The main purpose of the decoys right now is to avoid spurious mappings (which they do help to achieve). However, given the ability to map to much more sequence, there are a number of other things we would want to keep around and handle in an appropriate way, and some of these are absolutely on our radar. Interestingly, I think that if we want to start having bias modeling for very long sequences, we might have to think about re-engineering some things (for a reason unrelated to the behavior we are seeing here). If we have very, very long sequences (e.g., transcripts with all retained introns) that have only a few reads mapping to them, it will be a huge waste, and not particularly useful, to model the bias of all potential fragments originating from all potential positions of those target sequences. This is a separate (performance) issue, so I won't wax philosophical about it here, but I'll think about it some more and maybe we can discuss it in another issue / via another forum.

@rob-p
Contributor

rob-p commented Dec 19, 2019

Ok, so I'm not sure if I'll be able to nail this down tonight, but it looks like simply getting rid of length-0 transcripts does not resolve this issue. So, it must be some other corner case showing up in this large collection of transcripts (+ other sequences) that didn't show up when we were testing with the regular transcriptome & decoy transcriptome ... will continue to dig.

@rob-p
Contributor

rob-p commented Dec 19, 2019

@mdshw5,

Ok! Figured it out. It turns out it is due to short targets (targets <= k). The = in <= is because of a quirk of TwoPaCo. I thought we'd handled this corner-case appropriately, but apparently not. If you remove all targets <= k in length, everything works, but we'll have to fix this for targets <= k but > 0 on our end. I'll keep you posted!
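
For anyone who wants to apply that workaround before a fix lands, here is a minimal standalone sketch (not part of salmon; it assumes a plain, uncompressed FASTA on stdin) that drops all targets of length <= k:

    // Standalone sketch (not part of salmon): drop FASTA records whose sequence
    // length is <= k, writing the filtered FASTA to stdout.
    // Assumed usage: ./filter_short 31 < transcripts.fa > transcripts.filtered.fa
    #include <iostream>
    #include <string>

    int main(int argc, char** argv) {
      const unsigned long k = (argc > 1) ? std::stoul(argv[1]) : 31;  // k-mer size used for indexing
      std::string line, name, seq;
      auto emit = [&]() {
        if (!name.empty() && seq.size() > k) {      // keep only targets strictly longer than k
          std::cout << name << '\n' << seq << '\n';
        }
      };
      while (std::getline(std::cin, line)) {
        if (!line.empty() && line[0] == '>') {      // a header line starts a new record
          emit();
          name = line;
          seq.clear();
        } else {
          seq += line;                              // concatenate wrapped sequence lines
        }
      }
      emit();                                       // don't forget the last record
      return 0;
    }

With k = 31 (the value shown in the info.json above), this should remove the short targets that trigger the problem.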

@rob-p
Contributor

rob-p commented Dec 19, 2019

Ohhh man! What a bug! It was caused by missing parentheses around this line:

      if (!isDecoy and !isShort and sopt.biasCorrect or sopt.gcBiasCorrect) {

should have been

      if (!isDecoy and !isShort and (sopt.biasCorrect or sopt.gcBiasCorrect)) {

damn. So, I think this issue would potentially manifest any time we have transcripts of length <= k in the input and we enable the bias correction. I just pushed the fix to develop, and we'll cut a new release soon. However, even before that, I'll drop an exe from our CI server for you to test on your end. It actually has nothing to do with indexing and everything to do with parentheses in that one line, so you shouldn't need to rebuild any index.
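
For reference, here is a minimal sketch of the precedence behavior behind the bug, using made-up flag values rather than salmon's actual variables: in C++, "and"/&& binds more tightly than "or"/||, so the unparenthesized condition is true for a short, non-decoy target whenever gcBiasCorrect is enabled.

    #include <iostream>

    int main() {
      bool isDecoy = false, isShort = true;          // a short, non-decoy target
      bool biasCorrect = false, gcBiasCorrect = true;

      // Without parentheses this groups as ((!isDecoy and !isShort and biasCorrect) or gcBiasCorrect),
      // which evaluates to true here, so the short target wrongly enters the bias-correction path.
      bool buggy = !isDecoy and !isShort and biasCorrect or gcBiasCorrect;

      // With parentheses the short target is excluded, as intended.
      bool fixed = !isDecoy and !isShort and (biasCorrect or gcBiasCorrect);

      std::cout << std::boolalpha << "buggy: " << buggy << ", fixed: " << fixed << '\n';
      // prints: buggy: true, fixed: false
      return 0;
    }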

Edit: new executable can be grabbed from here.

Also, I am curious whether it's the case for you that quantification never started, or whether the log printed out a bunch of stuff and quantification did start but took forever because we are trying to bias-correct a bunch of giant "super transcripts". Basically, after thinking about the effect of the bug, it occurs to me that the result should be log spam (an *insane* amount of log spam in this case, which itself takes a ton of time to write) and memory wastage (neither good nor intended), but should not be the index failing to load or wrong results (since these short transcripts will never actually be allowed to have any reads map to them anyway).

Edit edit: I tested this hypothesis, and the index loading (prior to the bug fix) did eventually finish for me. It simply took forever because of all of the spam being printed to stderr. I think the effect of this bug is greatly exacerbated by the specific transcriptome being indexed here. It will be triggered whenever there is a short sequence (<= k) that appears very close to the end of the target transcriptome, and the number of messages printed will be proportional to the length of the nearest prior sequence of length > k. Your transcriptome file has a number of short sequences near the end, some of which have a closest prior long transcript that is very long. So this was basically the perfect scenario for triggering this bug. Thanks for helping us find it!

@rob-p
Contributor

rob-p commented Dec 19, 2019

And to top it off, this led to the discovery of another bug, where populating the reference sequence used for bias correction in this case relied on reading uninitialized memory ... so this needs to be fixed ASAP.

@mdshw5
Author

mdshw5 commented Dec 19, 2019

I'll have lots of comments when I get back to my computer, but right now I wanted to say I'm absolutely floored by your effort here and will test the new exe today. I was looking suspiciously at that conditional flow-control line and was pretty sure “& & & |” wasn't right, but I'm glad to see it was just operator precedence.

@rob-p
Contributor

rob-p commented Dec 19, 2019

Sure thing. Thanks for tracking this down! The executable above fixes the loading issue; I'll drop a new one soon that also fixes the memory-correctness issue (since we now know about that bug and have resolved it upstream).

@rob-p
Contributor

rob-p commented Dec 19, 2019

Oki dokes, @mdshw5: Here is an executable that should resolve all of those issues and is currently the RC for salmon v1.1.0, which we will try to get out today.

@mdshw5
Author

mdshw5 commented Dec 19, 2019

A couple of points in response to your comments above:

  • I wouldn't worry about supporting zero-length FASTA entries during index construction, but it would be good to raise an error or discard them. Since there isn't a good specification for FASTA, I tend to think about FASTA validation in terms of line-length indexing. In this context, both zero-length names (allow 0-length fasta label samtools/htslib#258) and zero-length lines (https://github.com/mdshw5/pyfaidx/blob/master/pyfaidx/__init__.py#L557) are okay, as long as there is a newline character.
  • You're right that the excessive logging was just making the index loading take longer and my jobs were running into hard run-time limits (or the limit of my patience).
  • Regarding bias modeling for extremely long (or short, or weird) sequences, it could be useful to have some way to incorporate targets during indexing that will never update the bias models. Something like passing "auxiliary" targets in a separate FASTA file to signify that these are not typical transcript targets.

@rob-p
Contributor

rob-p commented Dec 19, 2019

Thanks for the detailed thoughts. Here are my thoughts^2.

  • Yes, I agree with your interpretation here. In fact, once things are parenthesized properly, zero-length sequences are not a problem (since they get properly skipped by the "too short" flag). So the current behavior of issuing a warning already agrees with what would be best practice here.

  • Good to know. I got impatient as well and tested this by simply redirecting stderr to /dev/null. It still took a few minutes to load (it's still writing the messages, it just doesn't have to send them to the console), but it was way faster than letting them print.

  • I really like this specific suggestion. I want to get 1.1.0 out ASAP, since it fixes what I consider a non-trivial bug, but I'd be happy to try and work a feature like this into 1.2.0 (maybe you could help test it?) ;P.

@mdshw5
Author

mdshw5 commented Dec 19, 2019

Thanks for the v1.1.0-beta2 test build, @rob-p! I just tested in my environment and this release fixes the issues I was encountering.

Edit: regarding the aux targets concept, I'd be glad to help test the feature, and the timing (Q1 2020) could work very well for me.

@rob-p
Contributor

rob-p commented Dec 31, 2019

@mdshw5 — Regarding the aux targets concept, what are your thoughts on allowing those to be specified along with the quantification command rather than during indexing? The benefits of this are:

  • It is easier to implement.
  • It allows this feature without changing the index format (and breaking index backward compat).
  • It allows one to use different auxiliary targets in different runs (if you decide that certain targets should undergo correction).

On the flip side, the downsides are:

  • Not being baked into the index means that the aux target list would have to be passed to every quant command (maybe not so big of a problem).
  • We'd want to write down the info somewhere about which targets were marked as aux, for metadata-recording purposes (rather than it being baked into the index).
  • We eat the cost of parsing the aux target list for every quant call (but this should be very small).

@mdshw5
Author

mdshw5 commented Dec 31, 2019

I actually think passing the aux list during quant is the way to go. A few thoughts:

  • Keeping index compatibility is important.
  • The list could be rather long, so something like a file containing one target per line might be good. Something simple enough for most users, and easy to make.
  • The input file can just be copied to the quant.sf directory for provenance.
  • I’d love it if the TPM calculation could be driven by primary targets, and the counts would just be the counts.
  • Maybe introduce NA for aux targets in the quant.sf matrix? This could also provide another point for checking that data matrices come from runs using/not using the same aux target definitions.

Edit:

On second thought, it’s unclear how people will use aux targets, and many use cases (lists of targets where the bias models are just problematic) might still want TPM estimates that include the aux targets in the model. I’d still consider whether TPM should be calculated with or without them; or, at the expense of adding complexity, maybe keep the current behavior and add one more flag that excludes aux targets from TPM calculations.
