Problem with talon_initialize_database #140

Open
Kiliankleemann opened this issue Oct 19, 2023 · 13 comments

Comments

@Kiliankleemann

Tried to run talon_initialize_database but got an error:

talon_initialize_database --f  reference/GRCh38_GENCODE_rmsk_TE_reformatted.gtf \
  --g hg38_rmsk_ucsd \
  --a hg38 \
  --o hg38 
chrY
bulk update genes...
bulk update gene_annotations...
Traceback (most recent call last):
  File "/home/kilian/anaconda3/envs/talon/bin/talon_initialize_database", line 8, in <module>
    sys.exit(main())
  File "/home/kilian/anaconda3/envs/talon/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 1073, in main
    populate_db(db_name, annot_name, chrom_genes, chrom_transcripts, exons, genome_build)
  File "/home/kilian/anaconda3/envs/talon/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 634, in populate_db
    add_transcripts(c, transcripts, annot_name, gene_id_map, genome_build)
  File "/home/kilian/anaconda3/envs/talon/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 743, in add_transcripts
    db_gene_id = gene_id_map[native_gene_id]
KeyError: 'AluY'
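
The traceback only shows that `'AluY'` was missing from `gene_id_map` when a transcript referred to it; the underlying cause is not established in this thread. Two things that might be worth ruling out are transcripts whose `gene_id` has no `gene` entry at all, and gene IDs that appear on more than one chromosome (plausible for repeat families). A minimal sketch, assuming a 9-column GTF with `gene_id "..."` attributes; `gene_id_report` is a hypothetical helper and not part of TALON:

```python
import re
from collections import defaultdict

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')

def gene_id_report(gtf_lines):
    """Return (orphans, multi_chrom) for an iterable of GTF lines.

    orphans: gene_ids used by transcript entries that have no gene entry.
    multi_chrom: gene_ids whose gene entries occur on several chromosomes.
    """
    gene_chroms = defaultdict(set)  # gene_id -> chromosomes with a gene entry
    transcript_gene_ids = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a GTF record
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        if fields[2] == "gene":
            gene_chroms[gene_id].add(fields[0])
        elif fields[2] == "transcript":
            transcript_gene_ids.add(gene_id)
    orphans = sorted(transcript_gene_ids - set(gene_chroms))
    multi_chrom = sorted(g for g, chroms in gene_chroms.items() if len(chroms) > 1)
    return orphans, multi_chrom
```

To use it, pass an open file handle for the reformatted GTF, for example `gene_id_report(open(path))`, and check whether `AluY` shows up in either list.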
@Kiliankleemann
Author

I made sure the reformatting of GTF is correct:

wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz
gzip -d *.gz
talon_reformat_gtf -g reference/GRCh38_GENCODE_rmsk_TE.gtf

talon_initialize_database --f reference/GRCh38_GENCODE_rmsk_TE_reformatted.gtf \
  --g hg38_rmsk_ucsd \
  --a hg38 \
  --o hg38 

@fairliereese
Member

Would you be able to share the GTF that you're using with me? I will try running it on my end and see if I can pinpoint the issue.

@Kiliankleemann
Author

You should be able to download the GTF and unzip it with the first command. That's the one I tried.

@fairliereese
Member

This one? https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz This does not look like a GTF to me. For example, the strand should be in column 6 (0-indexed), but it looks like it's in column 9 of your file.
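
A quick way to check the layout is to split the first few records on tabs. A rough sketch, assuming the standard GTF layout of nine tab-separated fields with integer start/end at indices 3 and 4 and the strand at index 6; `looks_like_gtf` is a hypothetical helper, not a TALON function:

```python
def looks_like_gtf(line):
    """Return True if a line has 9 tab-separated fields laid out like a GTF record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return False
    start, end, strand = fields[3], fields[4], fields[6]
    # GTF coordinates are integers; strand is '+', '-' or '.' when unknown
    return start.isdigit() and end.isdigit() and strand in {"+", "-", "."}
```

Running this on the first non-comment line of a file is usually enough to tell a GTF apart from a differently shaped table dump.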

@Kiliankleemann
Author

Which gtf did you use for hg38 repeatmasker?

@fairliereese
Member

fairliereese commented Oct 24, 2023

For me to best help you, you should send all the commands that you used to download / format your GTF. I think I'm missing some information from your side.

@sojichld

sojichld commented Feb 1, 2024

I'm having a similar issue and I'm not really sure why. I've also tried using the gtf formatter with no luck.

It took 0:00:00.01 to process chromosome
NW_023397527.1
Traceback (most recent call last):
  File "/users/aademilu/.local/bin/talon_initialize_database", line 8, in <module>
    sys.exit(main())
  File "/users/aademilu/.local/lib/python3.8/site-packages/talon/initialize_talon_database.py", line 1015, in main
    populate_db(db_name, annot_name, chrom_genes, chrom_transcripts, exons, genome_build)
  File "/users/aademilu/.local/lib/python3.8/site-packages/talon/initialize_talon_database.py", line 596, in populate_db
    transcripts = chrom_transcripts[chromosome]
KeyError: 'NW_023397527.1'

I've attached an example of the file. The full file can be found here.
gtf_example.txt

@fairliereese
Member

Can you please send me the exact command you tried for talon_initialize_database, as well as the version number of TALON that you're using?

@sojichld

sojichld commented Feb 2, 2024

> Can you please send me the exact command you tried for talon_initialize_database, as well as the version number of TALON that you're using?

talon_initialize_database --f ../../reference/GCF_004126475.2_mPhyDis1.pri.v3_genomic.gtf --a discolor_annot --g discolor --o discolor

Where can I find version information?

@fairliereese
Member

I don't think there's a nice way to access the version info now, but if you haven't updated TALON in a long time it might be worth pulling and installing the latest commits. On my machine, I am able to run your init command with gtf_example.txt no problem. Did you also verify that you're having an issue with the small file too?
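
One possible workaround, as a sketch only: on Python 3.8+ the standard library can report the version string recorded in an installed package's metadata. This assumes the package was installed under the distribution name `talon`, which is an assumption here (`pip list` would show the actual name), and it reports only the packaged version, not which commit was installed. `installed_version` is a hypothetical helper:

```python
from importlib import metadata  # available in Python 3.8+

def installed_version(dist_name="talon"):
    """Return the installed version string for dist_name, or None if not installed."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

Calling `installed_version()` in the same environment that runs `talon_initialize_database` should return either a version string or `None`.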

@sojichld

sojichld commented Feb 3, 2024

Yes. That one does run for me as well (it doesn't include NW_023397527.1), but I cannot get other cuts of the file to work. They produce the following error:

    genes, transcripts, exons = read_gtf_file(gtf_file)
  File "/users/aademilu/.local/lib/python3.8/site-packages/talon/initialize_talon_database.py", line 495, in read_gtf_file
    entry_type = tab_fields[2]

I noticed that the gene is the only one on that scaffold; maybe that could be the issue? I have provided the full file, which will run until it reaches the scaffold in question. The program will run if I remove the gene from the GTF.

@fairliereese
Member

The problem is that this gene does not have any transcripts annotated to it. If you look, the file goes from one gene entry (the one on your NW_023397527.1 chromosome) straight to the next gene entry, without any transcript or exon entries in between. I would advise removing this entry from your GTF and moving on with your analysis.
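
If there may be more entries like this, a small script can list them before running the init step. A minimal sketch, assuming a 9-column GTF in which `gene` and `transcript` entries both carry a `gene_id "..."` attribute; `genes_without_transcripts` is a hypothetical helper and not part of TALON:

```python
import re

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')

def genes_without_transcripts(gtf_lines):
    """Return (chromosome, gene_id) pairs for gene entries with no transcript entry."""
    genes = {}  # gene_id -> chromosome of its first gene entry (insertion-ordered)
    with_transcripts = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a GTF record
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        gene_id = match.group(1)
        if fields[2] == "gene":
            genes.setdefault(gene_id, fields[0])
        elif fields[2] == "transcript":
            with_transcripts.add(gene_id)
    return [(chrom, gene_id) for gene_id, chrom in genes.items()
            if gene_id not in with_transcripts]
```

Any pair it returns points at a gene entry that could be dropped from the GTF, in line with the advice above.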

@sojichld

sojichld commented Feb 3, 2024

You're right. Strange. I've removed it and it works just fine. I was able to fully process everything. Thanks.


3 participants