Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backlocate w/ E. coli genome #225

Open
ibrahimkurt opened this issue Dec 20, 2022 · 3 comments
Open

Backlocate w/ E. coli genome #225

ibrahimkurt opened this issue Dec 20, 2022 · 3 comments

Comments

@ibrahimkurt
Copy link

Hi,

Is it possible to use backlocate w/ E. coli files? I have the following file from NCBI as the GTF file. And I have the appropriate reference file with the indices. I am trying to figure out how to call the genes/transcripts.

genomic.gtf.zip

There are RefSeq gene, CDS, start_codon and stop_codon entries per gene. When I try to call using the transcript_id or gene attributes I get the error:

echo -e "thrL\tP1090M" | java -jar backlocate.jar -R ../../Documents/CRISPR\ CBE\ Bacteria\ STOP\ Library/GCA_000005845.2.fasta --gtf ../../Documents/CRISPR\ CBE\ Bacteria\ STOP\ Library/from\ NCBI/GCF_000005845.2/ncbi_dataset/data/GCF_000005845.2/genomic.gtf

[WARN][BackLocate]no transcript found for thrL

Thanks a lot for your help.

@lindenb
Copy link
Owner

lindenb commented Dec 20, 2022

The tool expects gene_name as the attribute for the gene, not gene. see https://github.com/lindenb/jvarkit/blob/master/src/test/resources/gencode.v19.annotation.gtf .

Use sed to "fix" your gtf

@ibrahimkurt
Copy link
Author

Thanks a lot for your prompt reply. I replaced all gene with gene_name but I am still getting the same error. Could it be because there are only gene, CDS, start_codon and stop_codon entries but not transcript per gene? I manually copied and pasted a gene entry block and replaced the feature with transcript as shown below, but I still got the same error saying [WARN][BackLocate]no transcript found for thrL

Do you have any ideas? Thanks!

NC_000913.3	RefSeq	gene	190	255	.	+	.	gene_id "b0001"; transcript_id ""; db_xref "ASAP:ABE-0000006"; db_xref "ECOCYC:EG11277"; db_xref "GeneID:944742"; gbkey "Gene"; gene_name "thrL"; gene_biotype "protein_coding"; gene_synonym "ECK0001"; locus_tag "b0001"; 
NC_000913.3	RefSeq	transcript	190	255	.	+	.	gene_id "b0001"; transcript_id ""; db_xref "ASAP:ABE-0000006"; db_xref "ECOCYC:EG11277"; db_xref "GeneID:944742"; gbkey "Gene"; gene_name "thrL"; gene_biotype "protein_coding"; gene_synonym "ECK0001"; locus_tag "b0001"; 
NC_000913.3	RefSeq	CDS	190	252	.	+	0	gene_id "b0001"; transcript_id "gnl|b0001|mrna.NP_414542"; db_xref "UniProtKB/Swiss-Prot:P0AD86"; gbkey "CDS"; gene_name "thrL"; locus_tag "b0001"; orig_transcript_id "gnl|b0001|mrna.NP_414542"; product "thr operon leader peptide"; protein_id "NP_414542.1"; transl_table "11"; exon_number "1"; 
NC_000913.3	RefSeq	start_codon	190	192	.	+	0	gene_id "b0001"; transcript_id "gnl|b0001|mrna.NP_414542"; db_xref "UniProtKB/Swiss-Prot:P0AD86"; gbkey "CDS"; gene_name "thrL"; locus_tag "b0001"; orig_transcript_id "gnl|b0001|mrna.NP_414542"; product "thr operon leader peptide"; protein_id "NP_414542.1"; transl_table "11"; exon_number "1"; 
NC_000913.3	RefSeq	stop_codon	253	255	.	+	0	gene_id "b0001"; transcript_id "gnl|b0001|mrna.NP_414542"; db_xref "UniProtKB/Swiss-Prot:P0AD86"; gbkey "CDS"; gene_name "thrL"; locus_tag "b0001"; orig_transcript_id "gnl|b0001|mrna.NP_414542"; product "thr operon leader peptide"; protein_id "NP_414542.1"; transl_table "11"; exon_number "1"; 

@lindenb
Copy link
Owner

lindenb commented Dec 20, 2022

set transcript_id to "gnl|b0001|mrna.NP_414542" for the transcript.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants