control of 2 separate randseed events in sintax #535

Open
todd-desantis opened this issue Oct 17, 2023 · 10 comments

@todd-desantis

I see two places where sintax would need a random seed:

  1. when a subset of kmers is randomly sampled from the query sequence.
  2. when these kmers match more than one sequence in the .udb file equally well, so that the "top" hit would be governed by a random ordering.

Can setting the randseed option be applied to both of these random events to create a reproducible output?

@torognes
Owner

The randseed option can currently be used in sintax to initialise the sequence of random numbers used when sampling kmers from the query sequence, place 1 above.

The choice in place 2 is currently not random. When the query matches two or more sequences in the database equally well regarding the number of shared kmers, the top match is the shortest of the sequences. If two or more are equally long, the sequence that comes first in the database is chosen.
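The deterministic rule described above can be sketched roughly as follows. This is only an illustration of the stated ordering (most shared kmers, then shortest, then earliest in the database), not the vsearch source code; the `Hit` type, its fields, and the example records are hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical record for one database sequence; not a vsearch type.
    name: str
    length: int        # reference sequence length in bases
    db_index: int      # position of the sequence in the database file
    shared_kmers: int  # number of kmers shared with the query


def pick_top_hit_deterministic(hits):
    """Most shared kmers first; ties go to the shortest sequence,
    then to the sequence that comes first in the database."""
    return min(hits, key=lambda h: (-h.shared_kmers, h.length, h.db_index))


hits = [
    Hit("ref_A", 1500, 0, 30),
    Hit("ref_B", 1400, 1, 30),
    Hit("ref_C", 1400, 2, 30),
    Hit("ref_D", 300, 3, 12),
]
top = pick_top_hit_deterministic(hits)
print(top.name)  # ref_B: tied on kmers with A and C, shortest, earlier than C
```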

If the randseed option is used, the output should be reproducible. Please note that this does not currently work when running sintax with multiple threads (a warning is printed). I am considering ways to make it work also with multiple threads.
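As a rough illustration of why a fixed seed makes place 1 reproducible, here is a minimal sketch of seeded kmer subsampling. It is not how vsearch is implemented; the function names, the word length `k=8`, and the sample size of 32 are assumptions chosen for the example.

```python
import random


def kmers(seq, k=8):
    """All overlapping words of length k in the sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sample_kmers(seq, seed, sample_size=32, k=8):
    """Draw sample_size kmers with replacement using a private,
    seeded generator (sample_size and k are assumed values)."""
    rng = random.Random(seed)
    words = kmers(seq, k)
    return [rng.choice(words) for _ in range(sample_size)]


query = "ACGTTGCAAGCTTAGGCTAACGTAGCTAGGATCCGATCGTAGCTAGCTAACG"
first = sample_kmers(query, seed=1)
second = sample_kmers(query, seed=1)
print(first == second)  # True: same seed, same sequence of draws
```

The same seed always yields the same draws here because the generator is consumed in a fixed order by a single caller.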

Do you think adding optional randomness to the choice in place 2 would be valuable?

@torognes torognes self-assigned this Nov 22, 2023
@todd-desantis
Author

Yes, I think that adding randomness to the choice in place 2 would be valuable. In some cases a short 300 base 16S query is very similar, or even exactly identical, to many long 1500 base subjects in a 16S database, and these subjects have different taxonomic positions. I'd rather not have the vsearch user get the perception that the query matches one of these taxa more than the others, which might happen due to the length or order rule.

@torognes
Owner

torognes commented Dec 4, 2023

Thanks for your suggestion. I will try to add an option to randomize the choice in place 2.

@todd-desantis
Author

I also expect that the random selection among the equally top hits might require thousands of sequences in the udb file to be considered in the random draw. Thus, I'd expect that more computation time would be required to collect all the hits (as opposed to only the first hit), which is a fair trade-off. Thanks again for trying to add this randomness option.

@todd-desantis
Author

Any more ideas on implementing this? I was considering a work-around: create 10 shuffles of the sequence ordering in the fasta database, format a udb from each, run my queries against each, and implement a summarization in a post-processing step. But that would require over 15Gb of disk space for all those reference files. It might be too inefficient and would require lots of explanation in the methods section of the journal article.

@torognes
Owner

Yes, I've had a look at this right now and have already implemented it. A new option called --sintax_randomize or similar will enable this functionality. I hope to have it ready for a release tomorrow. Initial tests indicate that it may use approx 30% more time.

I think this could be a significant improvement to the SINTAX algorithm as it will select a random sequence among a wider set of sequences instead of usually picking one of the shortest ones.

@todd-desantis
Author

Wonderful. Thank you. A 30% increase in runtime is less than I expected. That is good news. Looking forward to the update.

@torognes
Owner

Hi, I've now released vsearch version 2.28.1. It implements the feature you suggested. I made other improvements too, and I think the speed should be even better now. Again, thanks a lot for the suggested improvements!

The sintax command has been improved in several ways in this version of vsearch. Please note that several details of this algorithm are not clearly described in the preprint, and the implementation in vsearch differs from that in usearch.

The former vsearch version did not always choose the most common taxonomic entity over the 100 bootstraps among the database sequences with the highest amount of word similarity to the query. Instead, if several sequences had an equal similarity with the query, the sequence encountered in the earliest bootstrap was chosen. The confidence level was calculated based on this sequence compared to the selected sequences from the other 99 bootstraps. This could lead to a suboptimal choice with a low confidence. In the new version, the most common of the sequences with the highest amount of word similarity across the 100 bootstraps will be selected, and ties will be broken randomly.
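The new selection across bootstraps might be pictured like this. It is a simplified sketch, not the vsearch code: the function name is hypothetical, each bootstrap is assumed to contribute one picked name, and the support value (the fraction of bootstraps agreeing with the winner) is an assumption standing in for the actual confidence calculation.

```python
import random
from collections import Counter


def consensus_pick(bootstrap_picks, rng):
    """Return the most common pick across bootstraps, breaking ties
    randomly, plus the fraction of bootstraps that agree with it."""
    counts = Counter(bootstrap_picks)
    best = max(counts.values())
    # Sort the tied names so the random draw does not depend on input order.
    tied = sorted(name for name, c in counts.items() if c == best)
    winner = rng.choice(tied)
    support = best / len(bootstrap_picks)
    return winner, support


# 100 bootstraps: ref_A picked 60 times, ref_B 30 times, ref_C 10 times.
picks = ["ref_A"] * 60 + ["ref_B"] * 30 + ["ref_C"] * 10
winner, support = consensus_pick(picks, random.Random(42))
print(winner, support)  # ref_A 0.6
```

By contrast, the old behaviour described above would correspond to taking the pick from the earliest bootstrap rather than the most common one.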

Another problem with the old implementation was that if several sequences had the same amount of word similarity, the shortest one in the reference database would be chosen, and if they were equally long, the earliest in the database file would be chosen. A new option called sintax_random has now been introduced. This option will randomly select one of the sequences with the highest number of shared words with the query, without considering their length or position. This avoids a bias towards shorter reference sequences. This option is strongly recommended and will probably soon be the default.
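A minimal sketch of the kind of choice this option makes, assuming hits are given as hypothetical `(name, length, shared_words)` tuples. It only illustrates drawing uniformly among the top-scoring sequences while ignoring length and database position; it is not the vsearch implementation.

```python
import random


def pick_top_hit_random(hits, rng):
    """hits: list of (name, length, shared_words) tuples (hypothetical layout).
    Choose uniformly among the sequences sharing the most words with the
    query, ignoring their length and their position in the database."""
    best = max(shared for _, _, shared in hits)
    top = [h for h in hits if h[2] == best]
    return rng.choice(top)


hits = [
    ("ref_A", 1500, 30),
    ("ref_B", 1400, 30),
    ("ref_C", 1400, 30),
    ("ref_D", 300, 12),
]
rng = random.Random(7)
chosen = {pick_top_hit_random(hits, rng)[0] for _ in range(200)}
print(sorted(chosen))  # only names from the tied top set; ref_D never appears
```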

Furthermore, a ninth taxonomic rank, strain (letter t), is now recognized. The speed of the sintax command has also been significantly improved, at least in some cases. Run vsearch with the randseed option and 1 thread to ensure reproducibility of the random choices in the algorithm.
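For scripted runs, a reproducible invocation could be assembled along these lines. This is a sketch under assumptions: the file names are hypothetical, the helper function is not part of vsearch, and the option spellings should be checked against `vsearch --help` for the installed version.

```python
import shlex


def build_sintax_command(query, db, out, seed):
    """Assemble a vsearch sintax command line with a fixed seed and a
    single thread (file names are placeholders supplied by the caller)."""
    return [
        "vsearch",
        "--sintax", query,
        "--db", db,
        "--tabbedout", out,
        "--sintax_random",        # random choice among equally good top hits
        "--randseed", str(seed),  # fixed seed for the random number sequence
        "--threads", "1",         # reproducibility currently needs 1 thread
    ]


cmd = build_sintax_command("queries.fasta", "reference.udb", "sintax.tsv", 42)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` without going through a shell.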

These changes are relevant for issues #210, #325, #498, and #535.

@todd-desantis
Author

todd-desantis commented Apr 26, 2024 via email

@torognes
Owner

Thanks - I've sent you an email.
