Chimera detection --uchime_ref unexpected behaviour #504

nvucic · 2023-01-24T17:58:41Z

Hello,

I've been using --uchime_ref to identify chimeras, providing a list of reference sequences in FASTA format. I noticed that vsearch does not report the highest-scoring match when a large number of sequences (~5k) lie between the best and the second-best match in the FASTA file and the second-best match comes first. I can provide test data if necessary.

Can you suggest any kind of workaround or fix for this?

Best,
Nemanja

frederic-mahe · 2023-01-25T10:19:22Z

Thank you for reporting this. Could you please provide some test data and the exact command line used? I'll try to reproduce the issue, and then make a minimal reproducible example if possible.

nvucic · 2023-01-25T11:28:53Z

Sure,

The first command reports the chimera with the lower score:
vsearch --uchime_ref test_ampl.fasta --db test_design.fasta --uchimeout test.chimera_table.not_expected.tsv

When I move the desired parent A to the top of the design FASTA, the results are as expected:
vsearch --uchime_ref /opt/test_data/test_ampl.fasta --db /opt/test_data/test_design.rep.fasta --uchimeout test.chimera_table.expected.tsv
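For readers who want to try the same reordering, here is a minimal sketch of one way to move a single record to the top of a FASTA file. The function name `move_to_top` and the toy records are made up for illustration and are not part of vsearch or of the original test data. The sketch holds the whole file in memory, so it suits a quick test rather than very large databases.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of vsearch): print a FASTA file with the
# record whose header label equals LABEL moved to the top. All other
# records keep their original relative order.
move_to_top() {
    local label=$1 fasta=$2
    awk -v label="$label" '
        BEGIN { hdr = ">" label }
        # On each header line, decide whether this record is the wanted one.
        # The label must match the whole header or be followed by a space.
        /^>/  { keep = ($0 == hdr || index($0, hdr " ") == 1) }
        keep  { first = first $0 "\n"; next }
              { rest = rest $0 "\n" }
        END   { printf "%s%s", first, rest }
    ' "$fasta"
}

# Toy demonstration with made-up records: s2 is printed first.
tmp_fasta=$(mktemp)
printf '>s1\nAAAA\n>s2 some description\nCCCC\nGGGG\n>s3\nTTTT\n' > "$tmp_fasta"
move_to_top s2 "$tmp_fasta"
rm -f "$tmp_fasta"
```

With the file names used in the commands above, a call would presumably look like `move_to_top <label> test_design.fasta > test_design.rep.fasta`, where `<label>` stands for the header of the desired parent A.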

frederic-mahe · 2023-01-25T16:23:26Z

Thanks, I confirm there is something strange here, maybe a bug. The wrong parent A is selected when there are more than 17,195 entries between the two candidates for parent A.

filler.zip

QUERY="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAAGGTCGTTTCACGATCTCGGCGGATACGTCGAAAAACACGGCCTACCTGCAGATGAACTCGCTGCGTGCCGAGGATACGGCCGTGTATTATTGTTCGCGTTGGGGCGGCCTGGGTTTCATGGCGATGGAC"
PARENT_A_HIGH="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAAGGTCGTTTCACGATCTCGGCGGATACGTCGAAAAACACCGCGTACCTGCAGATGAACAGCCTGCGTGCGGAAGATACGGCGGTTTACTATTGCTCGCGCCACGGCGGTGACGGCCACTACGCCATGGAC" # SAMP23224_sim_16939
PARENT_A_LOW="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAGGGTCGTTTCACGATCTCGGCCGATACGTCGAAGAACACCGCGTACTTACAGATGAACAGCCTGCGCGCGGAAGACACGGCGGTGTATTACTGTAGCCGCTGGGGCGGCGCCCTGTTCTACGCGATGGAC" # SAMP23224_sim_13735
PARENT_B="TATCTACCCGACCAATGGCTATACGCGCTACGCCGACTCGGTTAAAGGTCGCTTCACGATCTCGGCGGATACGAGCAAGAACACGGCCTACCTGCAGATGAACTCGCTGCGTGCCGAGGATACGGCCGTGTATTATTGTTCGCGTTGGGGCGGCCTGGGTTTCATGGCGATGGAC" # SAMP23224_sim_16714
FILLER_FILE="filler.fasta"  # 37,198 lines

## PARENT_B PARENT_A_LOW ... no simulated entries ... PARENT_A_HIGH
vsearch \
    --uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
    --db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
           # cat ${FILLER_FILE}
           printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
          ) \
              --quiet \
              --dbmask none \
              --uchimeout -  # score is 2.4872


## PARENT_B PARENT_A_LOW ... some simulated entries ... PARENT_A_HIGH
N_FILLER=34390
vsearch \
    --uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
    --db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
           head -n ${N_FILLER} ${FILLER_FILE}
           printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
          ) \
              --quiet \
              --dbmask none \
              --uchimeout -  # score is 2.4872


## PARENT_B PARENT_A_LOW ... some simulated entries ... PARENT_A_HIGH
N_FILLER=34392
vsearch \
    --uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
    --db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
           head -n ${N_FILLER} ${FILLER_FILE}
           printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
          ) \
              --quiet \
              --dbmask none \
              --uchimeout -  # score is 0.9055

I don't observe that pattern when using monotonous sequences (only 'A') as filler sequences.

torognes · 2023-01-26T13:54:37Z

I too can confirm that there seems to be a bug here. I'm looking into it.

torognes · 2023-01-26T16:25:37Z

I've looked at the examples provided by @frederic-mahe, and for those examples I do not think a bug is exposed. It is rather a consequence of the heuristics of the algorithm. Actually, sequence number 17196 in the filler.fasta file with the label SAMP23224_sim_17199 has very strong similarity to the query sequence in one region, causing it to be selected before the best parent sA_high, even though the former is worse overall.

The uchime algorithm divides the query into 4 equally long parts and initially finds a few (actually 4 in vsearch) candidate sequences that are very similar to each part. It then goes on to find the best pair of parents among those (up to 16 sequences).

What happens in @frederic-mahe's example is that the SAMP23224_sim_17199 sequence is selected among those four for the second part (nucleotides 45-88) because it is 100% identical there. The overall best sequence (sA_high) is slightly less similar in that part (actually 95.45%), and is not among the 4 best hits in that region. In the first region (1-44 bp), they are both 100% identical, and the algorithm then selects the entry that comes first in the file.
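The order dependence described above can be illustrated with a toy model. This is only a sketch under strong assumptions, not the vsearch implementation: sequences are compared position by position without alignment, the function names `count_matches` and `top_hits` are made up, the sequences are invented, and only one candidate is kept per part (vsearch keeps 4, as noted above). Ties are broken by file order using a stable sort.

```bash
#!/usr/bin/env bash
# Toy model only: ungapped, position-by-position comparison.

# count_matches QUERY TARGET START LEN
# Print the number of identical positions within the given window.
count_matches() {
    local q=${1:$3:$4} t=${2:$3:$4} i n=0
    for ((i = 0; i < ${#q}; i++)); do
        if [[ ${q:i:1} == "${t:i:1}" ]]; then n=$((n + 1)); fi
    done
    echo "$n"
}

# top_hits QUERY START LEN K
# Read "label sequence" lines from stdin and print the labels of the K
# entries with the most matches in the window. The stable sort (-s) keeps
# tied entries in input order, so the earlier entry wins a tie.
top_hits() {
    local query=$1 start=$2 len=$3 k=$4 label seq
    while read -r label seq; do
        printf '%s\t%s\n' "$(count_matches "$query" "$seq" "$start" "$len")" "$label"
    done | sort -s -t $'\t' -k1,1nr | head -n "$k" | cut -f2
}

query="ACGTACGT"
# "best" matches the query at 7 of 8 positions, "decoy" only at the first 2.
db_decoy_first=$'decoy ACTTTTTT\nbest ACGTACGA'
db_best_first=$'best ACGTACGA\ndecoy ACTTTTTT'

# First part of the query (positions 1-2), keeping one candidate:
top_hits "$query" 0 2 1 <<< "$db_decoy_first"   # prints: decoy
top_hits "$query" 0 2 1 <<< "$db_best_first"    # prints: best
```

Both entries are 100% identical to the query in the first part, so the one listed first fills the only candidate slot, even though `best` is far more similar overall. Raising K or changing the tie-breaking rule changes which entries survive this step.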

I am not sure whether the same happens in the original case presented by @nvucic, but it may be related. Perhaps there are several sequences that are all 100% identical in one of the parts. Then the sequences selected may depend on the order in the file.

So it may be due to the heuristics.

To improve the situation, we could increase the number of sequences considered in each region (which would take more time), or we could sort them differently.

I am currently working on an updated chimera algorithm for long reads, and will have these issues in mind.

nvucic · 2023-01-27T13:19:59Z

> To improve the situation, we could increase the number of sequences considered in each region (which would take more time), or we could sort them differently.

@torognes just enabling this as a parameter would be very useful.

Thank you for looking into this!

frederic-mahe added the bug label Jan 25, 2023

frederic-mahe mentioned this issue Jan 26, 2023

uchime_ref --db can't read from stdin #506

Closed

frederic-mahe added enhancement and removed bug labels Jan 27, 2023
