I annotated the S. pyogenes reference with bakta and noticed some oddly named CDS. When I compare bakta's results to the RefSeq annotation, the RefSeq and UniParc accessions are all correct and as expected. But the "Name" and "product" reported in the bakta GFF output don't seem to match what's in those databases.
For example, locus BEAOJI_08040 should be called M-related protein Enn, based on its RefSeq and UniRef accessions. But the reported name is YSIRK-type signal peptide-containing protein instead. Here are some examples of mismatches in an important sub-typing region:
NCTC12064_contig_1 Prodigal CDS 1589565 1590701 . - 0 ID=BEAOJI_08040;Name=YSIRK-type signal peptide-containing protein;locus_tag=BEAOJI_08040;product=YSIRK-type signal peptide-containing protein;Dbxref=RefSeq:WP_111679867.1,SO:0001217,UniParc:UPI000DA29B8C,UniRef:UniRef100_UPI000DA29B8C,UniRef:UniRef50_P50468,UniRef:UniRef90_UPI001CF4D4D8
NCTC12064_contig_1 Prodigal CDS 1592413 1593579 . - 0 ID=BEAOJI_08050;Name=Fibrinogen- and Ig-binding protein;locus_tag=BEAOJI_08050;product=Fibrinogen- and Ig-binding protein;Dbxref=GO:0005576,GO:0019864,RefSeq:WP_038431637.1,SO:0001217,UniParc:UPI0004D1BE22,UniRef:UniRef100_UPI0004D1BE22,UniRef:UniRef50_P30141,UniRef:UniRef90_P30141;gene=mrp4
| Locus | Bakta | RefSeq | UniRef | RefSeq Accession | UniRef Accession |
| --- | --- | --- | --- | --- | --- |
| BEAOJI_08040 | YSIRK-type signal peptide-containing protein | M-related protein Enn | M-related protein Enn | WP_111679867.1 | UniRef100_UPI000DA29B8C |
| BEAOJI_08050 | Fibrinogen- and Ig-binding | YSIRK-type signal peptide-containing protein | YSIRK-type signal peptide-containing protein | WP_038431637.1 | UniRef100_UPI0004D1BE22 |
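In case it helps with reproducing the comparison, here is a minimal sketch of how the product name and accessions can be pulled out of a bakta GFF line. It assumes standard GFF3 syntax (tab-separated columns, `key=value` attributes separated by `;` in column 9, and `database:accession` entries in `Dbxref`). The function name `parse_gff_attributes` is my own and is not part of bakta.

```python
def parse_gff_attributes(line):
    """Return the column-9 attributes of one GFF3 feature line as a dict.

    Dbxref is expanded into a dict mapping database name to a list of accessions.
    """
    columns = line.rstrip("\n").split("\t")
    attributes = {}
    for field in columns[8].split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value

    # Dbxref holds comma-separated "database:accession" entries.
    dbxrefs = {}
    for entry in attributes.get("Dbxref", "").split(","):
        if ":" in entry:
            database, accession = entry.split(":", 1)
            dbxrefs.setdefault(database, []).append(accession)
    attributes["Dbxref"] = dbxrefs
    return attributes


# The BEAOJI_08040 record from above, rebuilt with tab separators.
example = "\t".join([
    "NCTC12064_contig_1", "Prodigal", "CDS", "1589565", "1590701", ".", "-", "0",
    "ID=BEAOJI_08040;Name=YSIRK-type signal peptide-containing protein;"
    "locus_tag=BEAOJI_08040;product=YSIRK-type signal peptide-containing protein;"
    "Dbxref=RefSeq:WP_111679867.1,SO:0001217,UniParc:UPI000DA29B8C,"
    "UniRef:UniRef100_UPI000DA29B8C,UniRef:UniRef50_P50468,"
    "UniRef:UniRef90_UPI001CF4D4D8",
])

record = parse_gff_attributes(example)
print(record["locus_tag"], record["product"], record["Dbxref"]["UniRef"])
```

The accessions extracted this way can then be looked up by hand in RefSeq and UniRef and compared with the `product` value.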
I found this line in the debug log, which shows it looking up UniRef90_UPI001CF4D4D8:
13:59:41.390 - DEBUG - PSC - lookup: contig=NCTC12064_contig_1, start=1589565, stop=1590701, strand=-, UniRef90=UniRef90_UPI001CF4D4D8, EC=, gene=, product=YSIRK-type signal peptide-containing protein
But it doesn't seem like UniRef90_UPI001CF4D4D8 exists? UniRef100_UPI001CF4D4D8 does exist, and that one is named "YSIRK-type signal peptide-containing protein". But UniRef100_UPI001CF4D4D8 isn't mentioned anywhere in the log or output.
Versions
I'm using bakta v1.9.2 from the image bakta:1.9.2--pyhdfd78af_0 and the v5.1-full database.
NCTC12064.log