Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chimera detection --uchime_ref unexpected behaviour #504

Open
nvucic opened this issue Jan 24, 2023 · 6 comments
Open

Chimera detection --uchime_ref unexpected behaviour #504

nvucic opened this issue Jan 24, 2023 · 6 comments

Comments

@nvucic
Copy link

nvucic commented Jan 24, 2023

Hello,

I've been using --uchime_ref to identify chimeras providing a list of reference sequences in FASTA format. I noticed that vsearch does not report match with highest score when there are large number of sequences (~5k) between second best and best match in cases where second match comes first in the FASTA file. I could provide test data if necessary.

Can you suggest any kind of workarund or fix for this?

Best,
Nemanja

@frederic-mahe
Copy link
Collaborator

Thank you for reporting this. Could you please provide some test data and the exact command line used? I'll try to reproduce the issue, and then make a minimal reproducible example if possible.

@nvucic
Copy link
Author

nvucic commented Jan 25, 2023

Sure,

First command reports chimera with lower score
vsearch --uchime_ref test_ampl.fasta --db test_design.fasta --uchimeout test.chimera_table.not_expected.tsv

When I move desired parent A to the top of the design FASTA the results are as expected
vsearch --uchime_ref /opt/test_data/test_ampl.fasta --db /opt/test_data/test_design.rep.fasta --uchimeout test.chimera_table.expected.tsv

@frederic-mahe
Copy link
Collaborator

Thanks, I confirm there is something strange here, maybe a bug. The wrong parent A is selected when there are more than 17,195 entries between the
two candidates for parent A.

filler.zip

QUERY="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAAGGTCGTTTCACGATCTCGGCGGATACGTCGAAAAACACGGCCTACCTGCAGATGAACTCGCTGCGTGCCGAGGATACGGCCGTGTATTATTGTTCGCGTTGGGGCGGCCTGGGTTTCATGGCGATGGAC"
PARENT_A_HIGH="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAAGGTCGTTTCACGATCTCGGCGGATACGTCGAAAAACACCGCGTACCTGCAGATGAACAGCCTGCGTGCGGAAGATACGGCGGTTTACTATTGCTCGCGCCACGGCGGTGACGGCCACTACGCCATGGAC" # SAMP23224_sim_16939
PARENT_A_LOW="TATCTACCCAACGAACGGCTATACCCGCTATGCGGACTCGGTGAAGGGTCGTTTCACGATCTCGGCCGATACGTCGAAGAACACCGCGTACTTACAGATGAACAGCCTGCGCGCGGAAGACACGGCGGTGTATTACTGTAGCCGCTGGGGCGGCGCCCTGTTCTACGCGATGGAC" # SAMP23224_sim_13735
PARENT_B="TATCTACCCGACCAATGGCTATACGCGCTACGCCGACTCGGTTAAAGGTCGCTTCACGATCTCGGCGGATACGAGCAAGAACACGGCCTACCTGCAGATGAACTCGCTGCGTGCCGAGGATACGGCCGTGTATTATTGTTCGCGTTGGGGCGGCCTGGGTTTCATGGCGATGGAC" # SAMP23224_sim_16714
FILLER_FILE="filler.fasta"  # 37,198 lines

## PARENT_B PARENT_A_LOW ... no simulated entries ... PARENT_A_HIGH
vsearch \
    --uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
    --db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
           # cat ${FILLER_FILE}
           printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
          ) \
              --quiet \
              --dbmask none \
              --uchimeout -  # score is 2.4872


## PARENT_B PARENT_A_LOW ... some simulated entries ... PARENT_A_HIGH
N_FILLER=34390
vsearch \
    --uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
    --db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
           head -n ${N_FILLER} ${FILLER_FILE}
           printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
          ) \
              --quiet \
              --dbmask none \
              --uchimeout -  # score is 2.4872


## PARENT_B PARENT_A_LOW ... some simulated entries ... PARENT_A_HIGH
N_FILLER=34392
vsearch \
    --uchime_ref <(printf ">query\n%s\n" ${QUERY}) \
    --db <(printf ">sB\n%s\n>sA_low\n%s\n" ${PARENT_B} ${PARENT_A_LOW}
           head -n ${N_FILLER} ${FILLER_FILE}
           printf ">sA_high\n%s\n" ${PARENT_A_HIGH}
          ) \
              --quiet \
              --dbmask none \
              --uchimeout -  # score is 0.9055

I don't observe that pattern when using monotonous sequences (only 'A') as filler sequences.

@torognes
Copy link
Owner

I too can confirm that there seems to be a bug here. I'm looking into it.

@torognes
Copy link
Owner

I've looked at the examples provided by @frederic-mahe, and for those examples I do not think a bug is exposed. It is rather a consequence of the heuristics of the algorithm. Actually, sequence number 17196 in the filler.fasta file with the label SAMP23224_sim_17199 has very strong similarity to the query sequence in one region, causing it to be selected before the best parent sA_high, even though the former is worse overall.

The uchime algorithm divides the query into 4 equally long parts and initially finds a few (actually 4 in vsearch) candidate sequences that are very similar to each part. It then goes on to find the best pair of parents among those (up to 16 sequences).

What happens in @frederic-mahe's example is that a the SAMP23224_sim_17199 sequence is selected among those four for the second part (nucleotides 45-88) because it is 100% identical there. The overall best sequence (sA_high) is slightly less similar in that part (actually 95.45%), and is not among the 4 best hits in that region. In the first region (1-44 bp), they are both 100% identical, and the algorithm then selects the entry that comes first in the file.

I am not sure whether the same happens in the original case presented by @nvucic, but it may be related. Perhaps there are several sequences that are all 100% identical in one of the parts. Then the sequences selected may depend on the order in the file.

So it may be due to the heuristics.

To improve the situation, we could increase the number of sequences considered in each region (which would take more time), or we could sort them differently.

I am currently working on an updated chimera algorithm for long reads, and will have these issues in mind.

@nvucic
Copy link
Author

nvucic commented Jan 27, 2023

To improve the situation, we could increase the number of sequences considered in each region (which would take more time), or we could sort them differently.

@torognes just enabling this as a parameter would be very useful.

Thank you for looking into this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants