control of 2 separate randseed events in sintax #535

todd-desantis · 2023-10-17T20:13:10Z

I see two places where sintax would need a random seed:

1. When a subset of kmers is randomly sampled from the query sequence.
2. When these kmers match equally well to more than one sequence in the .udb file, so that the "top" hit would be governed by a random ordering.

Can setting the randseed option be applied to both of these random events to create a reproducible output?

torognes · 2023-11-22T17:34:53Z

The randseed option can currently be used in sintax to initialise the sequence of random numbers used when sampling kmers from the query sequence, place 1 above.

The choice in place 2 is currently not random. When the query matches two or more sequences in the database equally well regarding the number of shared kmers, the top match is the shortest of the sequences. If two or more are equally long, the sequence that comes first in the database is chosen.
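To illustrate the deterministic rule described above, here is a minimal Python sketch. It is not vsearch's actual code; the `Hit` type, its field names, and the example values are hypothetical and only serve to show the ordering: most shared kmers first, then shortest sequence, then earliest position in the database.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    # Hypothetical record for one database sequence compared to the query.
    db_index: int      # position of the sequence in the database file
    length: int        # sequence length in bases
    shared_kmers: int  # number of kmers shared with the query


def pick_top_hit(hits: list[Hit]) -> Hit:
    """Deterministic choice: most shared kmers, then shortest, then earliest."""
    return min(hits, key=lambda h: (-h.shared_kmers, h.length, h.db_index))


hits = [
    Hit(db_index=0, length=1500, shared_kmers=32),
    Hit(db_index=1, length=1400, shared_kmers=32),
    Hit(db_index=2, length=1400, shared_kmers=32),
    Hit(db_index=3, length=300, shared_kmers=30),
]
top = pick_top_hit(hits)
print(top)
```

In this example three sequences tie on shared kmers, two of them are equally short, and the one that appears first in the database is returned.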

If the randseed option is used, the output should be reproducible. Please note that this does not currently work when running sintax with multiple threads (a warning is printed). I am considering ways to make it work also with multiple threads.

Do you think adding optional randomness to the choice in place 2 would be valuable?

todd-desantis · 2023-11-29T22:07:15Z

Yes, I think that adding randomness to the choice in place 2 would be valuable. In some cases a short 300-base 16S query is very similar, or even exactly identical, to many long 1500-base subjects in a 16S database, and these subjects have different taxonomic positions. I'd rather not have the vsearch user get the perception that the query matches one of these taxa more than the others, which might happen due to the length or order rule.

torognes · 2023-12-04T09:34:26Z

Thanks for your suggestion. I will try to add an option to randomize the choice in place 2.

todd-desantis · 2023-12-04T16:03:54Z

I also expect that the random selection among the equally top hits might require thousands of sequences in the udb file to be considered in the random draw. Thus, I'd expect that more computation time would be required to collect all the hits (as opposed to only the first hit), which is a fair trade-off. Thanks again for trying to add this randomness option.

todd-desantis · 2024-04-23T19:42:11Z

Any more ideas on implementing this? I was considering a workaround: create 10 shuffles of the sequence ordering in the FASTA database, format a udb from each, run my queries against each, and summarize the results in a post-processing step. But that would require over 15 GB of disk space for all those reference files. It might be too inefficient and would require lots of explanation in the methods section of the journal article.

torognes · 2024-04-24T16:13:09Z

Yes, I've had a look at this right now and have already implemented it. A new option called --sintax_randomize or similar will enable this functionality. I hope to have it ready for a release tomorrow. Initial tests indicate that it may use approx 30% more time.

I think this could be a significant improvement to the SINTAX algorithm as it will select a random sequence among a wider set of sequences instead of usually picking one of the shortest ones.

todd-desantis · 2024-04-24T17:37:16Z

Wonderful. Thank you. A 30% increase in runtime is less than I expected. That is good news. Looking forward to the update.

torognes · 2024-04-26T13:32:20Z

Hi, I've now released vsearch version 2.28.1. It implements the feature you suggested. I made other improvements too, and I think the speed should be even better now. Again, thanks a lot for the suggested improvements!

The sintax command has been improved in several ways in this version of vsearch. Please note that several details of this algorithm are not clearly described in the preprint, and the implementation in vsearch differs from that in usearch.

The former vsearch version did not always choose the most common taxonomic entity over the 100 bootstraps among the database sequences with the highest amount of word similarity to the query. Instead, if several sequences had an equal similarity with the query, the sequence encountered in the earliest bootstrap was chosen. The confidence level was calculated based on this sequence compared to the selected sequences from the other 99 bootstraps. This could lead to a suboptimal choice with a low confidence. In the new version, the most common of the sequences with the highest amount of word similarity across the 100 bootstraps will be selected, and ties will be broken randomly.
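The selection step described above can be sketched roughly as follows. This is a simplified illustration in Python, not the vsearch implementation: the function name, the sequence identifiers, and the use of Python's `random` module are assumptions, and the confidence calculation is left out.

```python
import random
from collections import Counter


def most_common_with_random_ties(bootstrap_picks, rng):
    """Return the sequence picked most often across bootstraps.

    bootstrap_picks holds one selected sequence id per bootstrap.
    Ties for the highest count are broken at random using rng.
    """
    counts = Counter(bootstrap_picks)
    best = max(counts.values())
    # Sort the tied ids so the result depends only on the seed,
    # not on the order in which the picks were recorded.
    tied = sorted(seq_id for seq_id, n in counts.items() if n == best)
    return rng.choice(tied), best


# 100 hypothetical bootstrap picks: seqA and seqB tie with 40 each.
picks = ["seqA"] * 40 + ["seqB"] * 40 + ["seqC"] * 20
winner, count = most_common_with_random_ties(picks, random.Random(1))
print(winner, count)
```

With a fixed seed the same winner is returned on every run, which mirrors the role of the randseed option.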

Another problem with the old implementation was that if several sequences had the same amount of word similarity, the shortest one in the reference database would be chosen, and if they were equally long, the earliest in the database file would be chosen. A new option called sintax_random has now been introduced. This option will randomly select one of the sequences with the highest number of shared words with the query, without considering their length or position. This avoids a bias towards shorter reference sequences. This option is strongly recommended and will probably soon be the default.
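As a rough sketch of what the sintax_random behaviour amounts to, the Python snippet below draws uniformly among all sequences that share the highest number of words with the query. It is an illustration under assumptions, not vsearch's code: the function name, the dictionary layout, and the sequence ids are hypothetical.

```python
import random


def pick_random_top_hit(shared_words, rng):
    """Pick uniformly among the sequences with the highest shared-word count.

    shared_words maps a sequence id to the number of words it shares with
    the query. Sequence length and database position play no role.
    """
    best = max(shared_words.values())
    tied = sorted(seq_id for seq_id, n in shared_words.items() if n == best)
    return rng.choice(tied)


shared_words = {"short_ref": 32, "long_ref_1": 32, "long_ref_2": 32, "other": 30}
choice = pick_random_top_hit(shared_words, random.Random(7))
print(choice)
```

Because every tied sequence is equally likely to be drawn, a short reference no longer wins simply for being short.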

Furthermore, a ninth taxonomic rank, strain (letter t), is now recognized. The speed of the sintax command has also been significantly improved at least in some cases. Run vsearch with the randseed option and 1 thread to ensure reproducibility of the random choices in the algorithm.
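For a reproducible run, the invocation could be assembled along these lines. The Python helper below only builds and prints the command; the file names, the seed, and the cutoff value are placeholders chosen for illustration, and the helper itself is hypothetical.

```python
import shlex


def sintax_command(query, db, out, seed, cutoff=0.8):
    """Build a vsearch sintax command with a fixed seed and a single thread."""
    return [
        "vsearch",
        "--sintax", query,
        "--db", db,
        "--tabbedout", out,
        "--sintax_cutoff", str(cutoff),
        "--sintax_random",          # random choice among equally good hits
        "--randseed", str(seed),    # fixed seed for the random choices
        "--threads", "1",           # a single thread, as advised above
    ]


cmd = sintax_command("queries.fasta", "reference.udb", "sintax.tsv", seed=42)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where vsearch is installed.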

These changes are relevant for issues #210, #325, #498, and #535.

todd-desantis · 2024-04-26T15:53:11Z

Thanks for these improvements. I’m working on a manuscript that will benefit from your software engineering allowing me to leverage the StrainSelect database all the way to the strain (t) level and produce confidence values less-affected by db artifacts. Very exciting! Would you consider becoming a co-author on this work? Todd


torognes · 2024-04-29T09:02:55Z

Thanks - I've sent you an email.

frederic-mahe added the enhancement label Oct 18, 2023

torognes self-assigned this Nov 22, 2023

torognes added the question label Nov 22, 2023

torognes added a commit that referenced this issue Apr 25, 2024

Add sintax_random option, fix bug, and improve performance #535

e6d9f49

This was referenced Apr 26, 2024

sintax classifier and multiple identical best hits #325

Open

Sintax taxonomy classifier #210

Open

frederic-mahe pushed a commit that referenced this issue Apr 27, 2024

Add sintax_random option, fix bug, and improve performance #535

41f440d
