Name and Product don't match Dbxref #287

Open
ktmeaton opened this issue May 9, 2024 · 0 comments

ktmeaton commented May 9, 2024

I annotated the S. pyogenes reference with bakta and noticed some oddly named CDS. When I compare bakta's results to the RefSeq annotation, the RefSeq and UniParc accessions are all correct and as expected, but the "Name" and "Product" reported in the bakta GFF output don't seem to match what's in those databases.

For example, locus BEAOJI_08040 should be called M-related protein Enn, based on its RefSeq and UniRef accessions, but the reported name is YSIRK-type signal peptide-containing protein instead. Here are some examples of mismatches in an important sub-typing region:

NCTC12064_contig_1      Prodigal        CDS     1589565 1590701 .       -       0       ID=BEAOJI_08040;Name=YSIRK-type signal peptide-containing protein;locus_tag=BEAOJI_08040;product=YSIRK-type signal peptide-containing protein;Dbxref=RefSeq:WP_111679867.1,SO:0001217,UniParc:UPI000DA29B8C,UniRef:UniRef100_UPI000DA29B8C,UniRef:UniRef50_P50468,UniRef:UniRef90_UPI001CF4D4D8
NCTC12064_contig_1      Prodigal        CDS     1592413 1593579 .       -       0       ID=BEAOJI_08050;Name=Fibrinogen- and Ig-binding protein;locus_tag=BEAOJI_08050;product=Fibrinogen- and Ig-binding protein;Dbxref=GO:0005576,GO:0019864,RefSeq:WP_038431637.1,SO:0001217,UniParc:UPI0004D1BE22,UniRef:UniRef100_UPI0004D1BE22,UniRef:UniRef50_P30141,UniRef:UniRef90_P30141;gene=mrp4
| Locus | Bakta | RefSeq | UniRef | RefSeq Accession | UniRef Accession |
| --- | --- | --- | --- | --- | --- |
| BEAOJI_08040 | YSIRK-type signal peptide-containing protein | M-related protein Enn | M-related protein Enn | WP_111679867.1 | UniRef100_UPI000DA29B8C |
| BEAOJI_08050 | Fibrinogen- and Ig-binding | YSIRK-type signal peptide-containing protein | YSIRK-type signal peptide-containing protein | WP_038431637.1 | UniRef100_UPI0004D1BE22 |
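
For anyone who wants to repeat this comparison on other loci, here is a minimal sketch that pulls the Name, product, and Dbxref accessions out of a GFF feature line so they can be checked against RefSeq and UniRef by hand. The function names `parse_gff_line` and `summarize` are hypothetical helpers and are not part of bakta. The sketch assumes standard GFF3 attribute syntax (`key=value` pairs separated by `;`) and does not decode URL-escaped characters.

```python
from collections import defaultdict

# One of the GFF lines quoted above; columns may be separated by tabs or spaces.
EXAMPLE_LINE = (
    "NCTC12064_contig_1\tProdigal\tCDS\t1589565\t1590701\t.\t-\t0\t"
    "ID=BEAOJI_08040;Name=YSIRK-type signal peptide-containing protein;"
    "locus_tag=BEAOJI_08040;product=YSIRK-type signal peptide-containing protein;"
    "Dbxref=RefSeq:WP_111679867.1,SO:0001217,UniParc:UPI000DA29B8C,"
    "UniRef:UniRef100_UPI000DA29B8C,UniRef:UniRef50_P50468,"
    "UniRef:UniRef90_UPI001CF4D4D8"
)


def parse_gff_line(line):
    """Return (attributes, dbxrefs) for a single GFF3 feature line."""
    # Split into at most 9 columns so spaces inside the attribute column survive.
    fields = line.strip().split(None, 8)
    if len(fields) != 9:
        raise ValueError("expected 9 GFF columns, got %d" % len(fields))

    attributes = {}
    for item in fields[8].split(";"):
        key, _, value = item.partition("=")
        attributes[key] = value

    # Group cross-references by database prefix, e.g. {"UniRef": [...]}.
    dbxrefs = defaultdict(list)
    for ref in attributes.get("Dbxref", "").split(","):
        database, _, accession = ref.partition(":")
        if accession:
            dbxrefs[database].append(accession)
    return attributes, dict(dbxrefs)


def summarize(line):
    """Collect the fields needed to compare Name/product against Dbxref."""
    attributes, dbxrefs = parse_gff_line(line)
    return {
        "locus": attributes.get("ID"),
        "name": attributes.get("Name"),
        "product": attributes.get("product"),
        "refseq": dbxrefs.get("RefSeq", []),
        "uniref": dbxrefs.get("UniRef", []),
    }


if __name__ == "__main__":
    for key, value in summarize(EXAMPLE_LINE).items():
        print(key, value, sep="\t")
```

The accessions printed this way still have to be looked up in RefSeq and UniProt manually; the sketch only lines up what bakta wrote into the GFF.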

I found this line in the debug log, which says it's looking up UniRef90_UPI001CF4D4D8:

13:59:41.390 - DEBUG - PSC - lookup: contig=NCTC12064_contig_1, start=1589565, stop=1590701, strand=-, UniRef90=UniRef90_UPI001CF4D4D8, EC=, gene=, product=YSIRK-type signal peptide-containing protein

But it doesn't seem like UniRef90_UPI001CF4D4D8 exists. UniRef100_UPI001CF4D4D8 does exist, and that one is named "YSIRK-type signal peptide-containing protein". However, UniRef100_UPI001CF4D4D8 isn't mentioned anywhere in the log or output.
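
To check whether other loci are affected the same way, the PSC lookup lines can be extracted from the debug log and listed by UniRef90 ID and product. The sketch below is only a rough aid: the line layout is inferred from the single log line quoted above, not from bakta's source, and `parse_psc_lookup` is a hypothetical helper name.

```python
import re

# Layout assumed from the one debug line quoted in this issue.
LOOKUP_RE = re.compile(r"DEBUG - PSC - lookup: (?P<fields>.*)$")

EXAMPLE_LOG_LINE = (
    "13:59:41.390 - DEBUG - PSC - lookup: contig=NCTC12064_contig_1, "
    "start=1589565, stop=1590701, strand=-, UniRef90=UniRef90_UPI001CF4D4D8, "
    "EC=, gene=, product=YSIRK-type signal peptide-containing protein"
)


def parse_psc_lookup(line):
    """Return the key=value pairs of a PSC lookup line, or None if no match."""
    match = LOOKUP_RE.search(line.rstrip("\n"))
    if match is None:
        return None

    record = {}
    last_key = None
    for pair in match.group("fields").split(", "):
        key, separator, value = pair.partition("=")
        if separator:
            record[key] = value
            last_key = key
        elif last_key is not None:
            # A value (e.g. a product name) contained ", "; glue it back together.
            record[last_key] += ", " + pair
    return record


if __name__ == "__main__":
    record = parse_psc_lookup(EXAMPLE_LOG_LINE)
    print(record["UniRef90"], record["product"], sep="\t")
```

Running this over every line of the log and joining on start/stop coordinates with the GFF would show which UniRef90 ID each product name was taken from.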

Versions

I'm using bakta v1.9.2 from the image bakta:1.9.2--pyhdfd78af_0 and the v5.1-full database.

bakta \
    --debug --genus Streptococcus --species pyogenes \
    --threads 9 \
    --prefix NCTC12064 \
    --db 5.1 \
    --locus NCTC12064_contig \
    Streptoccocus_pyogenes_strain_NCTC12064.fasta \
    > NCTC12064.out 2>&1

NCTC12064.log

@ktmeaton ktmeaton added the bug Something isn't working label May 9, 2024
@oschwengers oschwengers self-assigned this May 13, 2024