Name of blocking variable not correctly recognized #245

aghr · 2024-02-29T11:23:28Z

Description of the bug

When samplesheet.csv has a column header RNA_extraction_date and I use it as a blocking variable in contrasts.csv, the name from contrasts.csv is read as R_extraction_date, and I get an error saying that R_extraction_date is not in samplesheet.csv.

To me this looks like an issue with the 'NA' in RNA_extraction_date. NA is also R's default marker for 'Not Available', which might cause the 'NA' in RNA_extraction_date to be replaced by the empty string, leading to R_extraction_date. When I change RNA_extraction_date to Rna_extraction_date in both samplesheet.csv and contrasts.csv, it seems to work.

Interestingly, the process NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:VALIDATOR (samplesheet.csv) finishes successfully, but it looks as if this validator checks only samplesheet.csv and not contrasts.csv. The error happens in the process NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:DESEQ2_DIFFERENTIAL, which runs after the validator has finished successfully.
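For illustration, a cross-check of the blocking column against the sample sheet header could look roughly like the sketch below. It is not the pipeline's validator: the function name `missing_blocking_columns` is hypothetical, and the `;` separator for multiple blocking variables is an assumption. A check like this on the raw files would also not have caught this particular bug, because contrasts.csv itself is correct and the name is altered later inside the workflow.

```python
import csv
import io


def missing_blocking_columns(samplesheet_text, contrasts_text, sep=";"):
    """Return {contrast id: [blocking names absent from the sample sheet header]}."""
    # Only the header row of the sample sheet is needed.
    header = next(csv.reader(io.StringIO(samplesheet_text)))
    columns = {name.strip() for name in header}

    problems = {}
    for row in csv.DictReader(io.StringIO(contrasts_text)):
        blocking = (row.get("blocking") or "").strip()
        # An empty cell means no blocking variable was requested.
        # `sep` is an assumed separator for multiple blocking variables.
        names = [b.strip() for b in blocking.split(sep) if b.strip()]
        missing = [b for b in names if b not in columns]
        if missing:
            problems[row["id"]] = missing
    return problems


# Toy inputs modelled on the files in this report (values are made up).
samplesheet = "sample,condition,RNA_extraction_date\ns1,A,2023-01-01\n"
contrasts = (
    "id,variable,reference,target,blocking\n"
    "A_vs_B,condition,A,B,RNA_extraction_date\n"
    "A_vs_B_bad,condition,A,B,R_extraction_date\n"
)
print(missing_blocking_columns(samplesheet, contrasts))
```

Only the second, deliberately wrong contrast is reported; the first one passes because its blocking name matches a sample sheet column exactly.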

Command used and terminal output

Header of samplesheet.csv:

sample,fastq_1,fastq_2,strandedness,condition,genotype,sex_Xist,litter_date,dev_time,replicate,flowcell,position_in_wellplate,RNA_extraction_date,genotype_id,NREADS_VST_XIST,SAMPLE_RNA_CONC,LIBPREP_RNA_CONC,row_in_wellplate,col_in_wellplate,SAMPLE_RNA_CONC_2,LIBPREP_RNA_CONC_2,sex_Eif2s3y,sex


contrasts.csv:

id,variable,reference,target,blocking
A_vs_B,condition,A,B,RNA_extraction_date

#########################
Validator finishes with success:


NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:VALIDATOR (samplesheet.csv)

########################
But then:

ERROR ~ Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:DESEQ2_DIFFERENTIAL ([id:A_vs_B, variable:condition, reference:A, target:B, blocking:R_extraction_date])'

Caused by:
  Process `NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:DESEQ2_DIFFERENTIAL ([id:A_vs_B, variable:condition, reference:A, target:B, blocking:R_extraction_date])` terminated with an error exit status (1)

Command executed [/XXX/.nextflow/assets/nf-core/differentialabundance/./workflows/../modules/nf-core/deseq2/differential/templates/deseq_de.R]:


#######################

  Error: Blocking variables R_extraction_date do not correspond to sample sheet columns.
  Execution halted

Relevant files

No response

System information

version 23.10.0 build 5889 (created 15-10-2023 15:07 UTC (17:07 CEST))
Linux desktop
CentOS Linux release 7.7.1908
local execution
with singularity container
version of nfcore diffabundance: 1.4.0

BEFH · 2024-03-07T17:32:52Z

This bug is due to this line of code:

differentialabundance/workflows/differentialabundance.nf

Line 319 in a3d664c

it.blocking = it.blocking.replace('NA', '')

It probably needs to be changed to:

it.blocking = it.blocking.replaceAll('^NA$', '')

aghr · 2024-03-12T22:19:34Z

The comment directly before this code block reads:

// Split the contrasts up so we can run differential analyses and
// downstream plots separately.
// Replace NA strings that might have snuck into the blocking column

Why would this code block try to replace NA sub-strings in column headers of the samplesheet.csv file? Maybe you apply it to samplesheet.csv, too?

BEFH · 2024-03-12T22:23:04Z

The issue is that the current code replaces any "NA" in the string. Mine only replaces it if it's the whole string. However, come to think of it, I don't know why it's replacing it in the variable name anyway, or whether that's the desired behavior.
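The difference between the two calls can be reproduced outside the pipeline. The sketch below mirrors them in Python rather than Groovy; `strip_any_na` and `strip_whole_na` are hypothetical names used only for this illustration.

```python
import re


def strip_any_na(blocking):
    # Mirrors it.blocking.replace('NA', ''): removes every literal "NA" substring.
    return blocking.replace("NA", "")


def strip_whole_na(blocking):
    # Mirrors it.blocking.replaceAll('^NA$', ''): only a value that is exactly "NA" is emptied.
    return re.sub(r"^NA$", "", blocking)


for value in ["RNA_extraction_date", "NA", ""]:
    print(repr(value), "->", repr(strip_any_na(value)), "vs", repr(strip_whole_na(value)))
```

With the substring version, RNA_extraction_date becomes R_extraction_date, which is the name in the error message above. The anchored version leaves it untouched and still empties a cell that contains only NA.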

WackerO · 2024-03-18T08:52:57Z

I assume (carefully) that .splitCsv might add NAs when it finds an empty column; is that correct, @pinin4fjords?

Edit: No it doesn't, at least not in a little test I ran. Why might NAs sneak in? 🤔

pinin4fjords · 2024-03-18T09:34:56Z

Thanks for the bug report!

This was definitely done in response to a bug, possibly people using NA in the input contrasts files to indicate missing values. So I don't want to remove this entirely.

@BEFH - could you PR your regex fix please, since it's so concise? Please add a changelog entry in the same style as the others there when you do so.

We should also document the special meaning of 'NA'.

aghr · 2024-03-19T15:58:19Z

Hey, checking for and dealing with NA entries in contrasts.csv seems reasonable. But replacing every match of 'NA' with the empty string will destroy meaningful entries like 'RNA_CONCENTRATION' or 'ANALYSIS_OUTCOME'. Would it be possible to replace 'NA' with the empty string only if the complete entry is 'NA' (in Perl, ^NA$) and leave all other entries untouched?

pinin4fjords · 2024-03-19T16:06:52Z

Yes, that's the fix proposed by @BEFH

aghr added the bug label Feb 29, 2024

LaurenceKuhl self-assigned this Mar 15, 2024
