Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue in NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf) #1204

Open
alexmascension opened this issue Jan 30, 2024 · 2 comments
Labels
bug Something isn't working
Milestone

Comments

@alexmascension
Copy link

Description of the bug

I'm running nf-core/rnaseq with the following command:

nextflow run nf-core/rnaseq                                             
-r 3.14.0                                            
-profile docker                                                                                        
--max_cpus 9                                             
--max_memory 24.GB                                            
--max_time 500.h 
--pseudo_aligner salmon 
--outdir results/test_mouse/RNASEQ_PSEUDOALIGNER 
--genome GRCm38 
--input work/test_mouse/samplesheets/RNASEQ_PSEUDOALIGNER.csv 
--skip_alignment 
--fasta database/genomes/GRCm38/genome.fasta 
--gtf database/genomes/GRCm38/genes.gtf 
--gene_bed database/genomes/GRCm38/genes.bed 
--salmon_index database/indexes/GRCm38/SALMON 
--star_index database/indexes/GRCm38/STAR

When I run it I get the following error:

ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf)'

Caused by:
  Missing output file(s) `*.tsv` expected by process `NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf)`

Command executed:

  tx2gene.py \
      --quant_type salmon \
      --gtf genome.filtered.gtf \
      --quants quants \
      --id gene_id \
      --extra gene_name \
      -o tx2gene.tsv
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  0

Command output:
  (empty)

Command error:
  __main__ - 2024-01-30 16:37:17,060 WARNING: No attribute in GTF matching transcripts
  __main__ - 2024-01-30 16:37:17,060 ERROR: Failed to map transcripts to genes.

Work dir:
  /data/Proyectos/NGS_pipeline/work/e2/6ad3cf2cec77f4aeb2434722052450

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

At first I thought that something might be wrong with my gtf of fasta files. However, when I run the "same" command using STAR + salmon I don't get any error:

nextflow run nf-core/rnaseq                                             
-r 3.14.0                                             
-profile docker                                                                                         
--max_cpus 9                                             
--max_memory 96.GB                                            
--max_time 500.h --aligner star_salmon 
--save_unaligned 
--save_untrimmed 
--min_mapped_reads 2 
 --input work/test_mouse/samplesheets/RNASEQ.csv 
--outdir results/test_mouse/RNASEQ 
--genome GRCm38 
--star_index database/indexes/GRCm38/STAR 
--salmon_index database/indexes/GRCm38/SALMON 
--gtf database/genomes/GRCm38/genes.gtf 
--gene_bed database/genomes/GRCm38/genes.bed 
--fasta database/genomes/GRCm38/genome.fasta

So I don't really know where it fails.

Command used and terminal output

No response

Relevant files

nexflow.log

System information

Nextflow version: 23.10.0
Hardware: Desktop
Executor: local
Container engine: docker
OS: Linux
Version of nf-core/rnaseq: 3.14.0

@alexmascension alexmascension added the bug Something isn't working label Jan 30, 2024
@alexmascension
Copy link
Author

BTW, just in case, genome gtf and fasta files are downloaded according to nf-core/rnaseq guidelines, and STAR/salmon/kalisto indexes are built as in the code from the respective .nf files.

@alexmascension
Copy link
Author

Hi! I realised that the problem might be related to the nomenclature of the transcript fasta and the gtf file.

To build kallisto and salmon indexes I used the fasta file from https://ftp.ensembl.org/pub/release-{ENSEMBL_VERSION}/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

In this fasta the transcript id and version are shown together:

>ENST00000631435.1 cdna scaffold:GRCh38:HSCHR7_2_CTG6:809186:809197:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]

However, the transcript_id and transcript_version are separated in the gtf:

1	havana	exon	182696	182746	.	+	.	gene_id "ENSG00000279928"; gene_version "2"; transcript_id "ENST00000624431"; transcript_version "2"; exon_number "1"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L17-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00003759020"; exon_version "2"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";

Therefore, when running kallisto or salmon, the abundace file in quants folder has the quantification of the transcripts as per the fasta file; so when loading the transcripts by tx2gene.py, the following section of the discover_transcript_ºattribute() fails:

    with open(gtf_file) as inh:
        # Read GTF file, skipping header lines
        for line in filter(lambda x: not x.startswith("#"), inh):
            cols = line.split("\t")

            # Use regular expression to correctly split the attributes string
            attributes_str = cols[8]
            attributes = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', attributes_str))

            votes.update(key for key, value in attributes.items() if value in transcripts)

Because no value of the gtf follows the structure "ENSTXXXXXXX.Y".

So far, I've patched this problem by creating a new attribute in the gtf file that combines the transcript id and version.

If this problem can be replicated, I think it would be a good idea to make a more lenient discover_transcript_ºattribute() function that allows for transcript ids with or without version.

@drpatelh drpatelh added this to the 3.15.0 milestone May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants