Issue in NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf) #1204

alexmascension · 2024-01-30T17:25:53Z

Description of the bug

I'm running nf-core/rnaseq with the following command:

nextflow run nf-core/rnaseq                                             
-r 3.14.0                                            
-profile docker                                                                                        
--max_cpus 9                                             
--max_memory 24.GB                                            
--max_time 500.h 
--pseudo_aligner salmon 
--outdir results/test_mouse/RNASEQ_PSEUDOALIGNER 
--genome GRCm38 
--input work/test_mouse/samplesheets/RNASEQ_PSEUDOALIGNER.csv 
--skip_alignment 
--fasta database/genomes/GRCm38/genome.fasta 
--gtf database/genomes/GRCm38/genes.gtf 
--gene_bed database/genomes/GRCm38/genes.bed 
--salmon_index database/indexes/GRCm38/SALMON 
--star_index database/indexes/GRCm38/STAR

When I run it I get the following error:

ERROR ~ Error executing process > 'NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf)'

Caused by:
  Missing output file(s) `*.tsv` expected by process `NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf)`

Command executed:

  tx2gene.py \
      --quant_type salmon \
      --gtf genome.filtered.gtf \
      --quants quants \
      --id gene_id \
      --extra gene_name \
      -o tx2gene.tsv
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  0

Command output:
  (empty)

Command error:
  __main__ - 2024-01-30 16:37:17,060 WARNING: No attribute in GTF matching transcripts
  __main__ - 2024-01-30 16:37:17,060 ERROR: Failed to map transcripts to genes.

Work dir:
  /data/Proyectos/NGS_pipeline/work/e2/6ad3cf2cec77f4aeb2434722052450

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

 -- Check '.nextflow.log' file for details

At first I thought that something might be wrong with my gtf of fasta files. However, when I run the "same" command using STAR + salmon I don't get any error:

nextflow run nf-core/rnaseq                                             
-r 3.14.0                                             
-profile docker                                                                                         
--max_cpus 9                                             
--max_memory 96.GB                                            
--max_time 500.h --aligner star_salmon 
--save_unaligned 
--save_untrimmed 
--min_mapped_reads 2 
 --input work/test_mouse/samplesheets/RNASEQ.csv 
--outdir results/test_mouse/RNASEQ 
--genome GRCm38 
--star_index database/indexes/GRCm38/STAR 
--salmon_index database/indexes/GRCm38/SALMON 
--gtf database/genomes/GRCm38/genes.gtf 
--gene_bed database/genomes/GRCm38/genes.bed 
--fasta database/genomes/GRCm38/genome.fasta

So I don't really know where it fails.

Command used and terminal output

No response

Relevant files

nexflow.log

System information

Nextflow version: 23.10.0
Hardware: Desktop
Executor: local
Container engine: docker
OS: Linux
Version of nf-core/rnaseq: 3.14.0

The text was updated successfully, but these errors were encountered:

alexmascension · 2024-02-01T11:34:21Z

BTW, just in case, genome gtf and fasta files are downloaded according to nf-core/rnaseq guidelines, and STAR/salmon/kalisto indexes are built as in the code from the respective .nf files.

alexmascension · 2024-02-01T13:45:25Z

Hi! I realised that the problem might be related to the nomenclature of the transcript fasta and the gtf file.

To build kallisto and salmon indexes I used the fasta file from https://ftp.ensembl.org/pub/release-{ENSEMBL_VERSION}/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz

In this fasta the transcript id and version are shown together:

>ENST00000631435.1 cdna scaffold:GRCh38:HSCHR7_2_CTG6:809186:809197:1 gene:ENSG00000282253.1 gene_biotype:TR_D_gene transcript_biotype:TR_D_gene gene_symbol:TRBD1 description:T cell receptor beta diversity 1 [Source:HGNC Symbol;Acc:HGNC:12158]

However, the transcript_id and transcript_version are separated in the gtf:

1	havana	exon	182696	182746	.	+	.	gene_id "ENSG00000279928"; gene_version "2"; transcript_id "ENST00000624431"; transcript_version "2"; exon_number "1"; gene_name "DDX11L17"; gene_source "havana"; gene_biotype "unprocessed_pseudogene"; transcript_name "DDX11L17-201"; transcript_source "havana"; transcript_biotype "unprocessed_pseudogene"; exon_id "ENSE00003759020"; exon_version "2"; tag "basic"; tag "Ensembl_canonical"; transcript_support_level "NA";

Therefore, when running kallisto or salmon, the abundace file in quants folder has the quantification of the transcripts as per the fasta file; so when loading the transcripts by tx2gene.py, the following section of the discover_transcript_ºattribute() fails:

    with open(gtf_file) as inh:
        # Read GTF file, skipping header lines
        for line in filter(lambda x: not x.startswith("#"), inh):
            cols = line.split("\t")

            # Use regular expression to correctly split the attributes string
            attributes_str = cols[8]
            attributes = dict(re.findall(r'(\S+) "(.*?)(?<!\\)";', attributes_str))

            votes.update(key for key, value in attributes.items() if value in transcripts)

Because no value of the gtf follows the structure "ENSTXXXXXXX.Y".

So far, I've patched this problem by creating a new attribute in the gtf file that combines the transcript id and version.

If this problem can be replicated, I think it would be a good idea to make a more lenient discover_transcript_ºattribute() function that allows for transcript ids with or without version.

alexmascension added the bug Something isn't working label Jan 30, 2024

drpatelh added this to the 3.15.0 milestone May 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issue in NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf) #1204

Issue in NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf) #1204

alexmascension commented Jan 30, 2024

alexmascension commented Feb 1, 2024

alexmascension commented Feb 1, 2024

Issue in NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf) #1204

Issue in NFCORE_RNASEQ:RNASEQ:QUANTIFY_PSEUDO_ALIGNMENT:TX2GENE (genome.filtered.gtf) #1204

Comments

alexmascension commented Jan 30, 2024

Description of the bug

Command used and terminal output

Relevant files

System information

alexmascension commented Feb 1, 2024

alexmascension commented Feb 1, 2024