Name of blocking variable not correctly recognized #245

Open
aghr opened this issue Feb 29, 2024 · 7 comments

aghr commented Feb 29, 2024

Description of the bug

My samplesheet.csv has a column header RNA_extraction_date. When I use this column as a blocking variable in contrasts.csv, the pipeline reads it as R_extraction_date, and I get an error saying that R_extraction_date is not in samplesheet.csv.

To me this looks like an issue with the 'NA' in RNA_extraction_date. NA is also R's default marker for 'Not Available', so the 'NA' in RNA_extraction_date might be getting replaced by the empty string, which gives R_extraction_date. When I rename RNA_extraction_date to Rna_extraction_date in both samplesheet.csv and contrasts.csv, it seems to work.

Interestingly, the process NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:VALIDATOR (samplesheet.csv) finishes successfully, but it looks as if this validator checks only samplesheet.csv and not contrasts.csv. The error occurs in the process NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:DESEQ2_DIFFERENTIAL, which runs after the validator has finished successfully.
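
A cross-check of the kind the validator seems to be missing could look roughly like the sketch below. It is written in Python for illustration only and is not the pipeline's validator. The function name `missing_blocking_columns`, the semicolon separator for multiple blocking variables, and the toy sample rows are all assumptions.

```python
import csv
import io


def missing_blocking_columns(samplesheet_text, contrasts_text, sep=";"):
    """Return (contrast id, blocking name) pairs that are absent from the sample sheet header."""
    # The first row of the sample sheet holds the column names.
    sample_columns = set(next(csv.reader(io.StringIO(samplesheet_text))))
    missing = []
    for row in csv.DictReader(io.StringIO(contrasts_text)):
        blocking = (row.get("blocking") or "").strip()
        # Assumption: an empty cell, or a cell that is exactly "NA", means "no blocking".
        if blocking in ("", "NA"):
            continue
        # Assumption: several blocking variables are separated by `sep`.
        for name in blocking.split(sep):
            if name not in sample_columns:
                missing.append((row["id"], name))
    return missing


# Toy inputs (hypothetical rows; only the column names come from the report).
samplesheet = "sample,condition,RNA_extraction_date\ns1,A,batch1\n"
contrasts = (
    "id,variable,reference,target,blocking\n"
    "A_vs_B,condition,A,B,RNA_extraction_date\n"
)
print(missing_blocking_columns(samplesheet, contrasts))
```

Run against the contrasts file itself, a check like this would pass for RNA_extraction_date, so it would not catch a name that is altered later in the workflow. It only covers the case where contrasts.csv names a column that the sample sheet really lacks.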

Command used and terminal output

Header of samplesheet.csv:

sample,fastq_1,fastq_2,strandedness,condition,genotype,sex_Xist,litter_date,dev_time,replicate,flowcell,position_in_wellplate,RNA_extraction_date,genotype_id,NREADS_VST_XIST,SAMPLE_RNA_CONC,LIBPREP_RNA_CONC,row_in_wellplate,col_in_wellplate,SAMPLE_RNA_CONC_2,LIBPREP_RNA_CONC_2,sex_Eif2s3y,sex


contrasts.csv:

id,variable,reference,target,blocking
A_vs_B,condition,A,B,RNA_extraction_date

#########################
Validator finishes with success:


NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:VALIDATOR (samplesheet.csv)

########################
But then:

ERROR ~ Error executing process > 'NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:DESEQ2_DIFFERENTIAL ([id:A_vs_B, variable:condition, reference:A, target:B, blocking:R_extraction_date])'

Caused by:
  Process `NFCORE_DIFFERENTIALABUNDANCE:DIFFERENTIALABUNDANCE:DESEQ2_DIFFERENTIAL ([id:A_vs_B, variable:condition, reference:A, target:B, blocking:R_extraction_date])` terminated with an error exit status (1)

Command executed [/XXX/.nextflow/assets/nf-core/differentialabundance/./workflows/../modules/nf-core/deseq2/differential/templates/deseq_de.R]:


#######################

  Error: Blocking variables R_extraction_date do not correspond to sample sheet columns.
  Execution halted

Relevant files

No response

System information

  • version 23.10.0 build 5889 (created 15-10-2023 15:07 UTC (17:07 CEST))
  • Linux desktop
  • CentOS Linux release 7.7.1908
  • local execution
  • with singularity container
  • version of nfcore diffabundance: 1.4.0

aghr added the bug (Something isn't working) label Feb 29, 2024

BEFH commented Mar 7, 2024

This bug is due to this line of code:

it.blocking = it.blocking.replace('NA', '')

It probably needs to be changed to:

it.blocking = it.blocking.replaceAll('^NA$', '')
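
The difference between the two lines can be reproduced outside the pipeline. The following is a Python analogue, not the workflow's Groovy code; the helper names `strip_na_substring` and `strip_na_whole` are made up for this sketch.

```python
import re


def strip_na_substring(value):
    # Mirrors the reported behaviour: every literal "NA" is removed,
    # including the one inside "RNA".
    return value.replace("NA", "")


def strip_na_whole(value):
    # Anchored pattern: only a value that is exactly "NA" becomes empty.
    return re.sub(r"^NA$", "", value)


for name in ["RNA_extraction_date", "NA", ""]:
    print(repr(name), repr(strip_na_substring(name)), repr(strip_na_whole(name)))
```

With the substring version, RNA_extraction_date comes out as R_extraction_date, matching the error message above. With the anchored version, it is left unchanged and only a bare NA is emptied.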

aghr (Author) commented Mar 12, 2024

The comment directly before this code block reads:

// Split the contrasts up so we can run differential analyses and
// downstream plots separately.
// Replace NA strings that might have snuck into the blocking column

Why would this code block try to replace NA sub-strings in column headers of the samplesheet.csv file? Maybe you apply it to samplesheet.csv, too?


BEFH commented Mar 12, 2024

The issue is that the current code replaces any "NA" in the string. Mine only replaces it if it's the whole string. However, come to think of it, IDK why it's replacing it in the variable name anyway, and whether that's desired behavior

LaurenceKuhl self-assigned this Mar 15, 2024

WackerO (Collaborator) commented Mar 18, 2024

I assume (carefully) that .splitCsv might add NAs when it finds an empty column; is that correct, @pinin4fjords?

Edit: No it doesn't, at least not in a little test I ran. Why might NAs sneak in? 🤔

pinin4fjords (Member) commented

Thanks for the bug report!

This was definitely done in response to a bug, possibly people using NA in the input contrasts files to indicate missing values. So I don't want to remove this entirely.

@BEFH - could you PR your regex fix please, since it's so concise? Please add a changelog entry in the same style as the others there when you do so.

We should also document the special meaning of 'NA'.

aghr (Author) commented Mar 19, 2024

Hey, checking for and dealing with NA entries in contrasts.csv seems reasonable. But replacing every match of 'NA' with the empty string will destroy meaningful entries like 'RNA_CONCENTRATION' or 'ANALYSIS_OUTCOME' and so on. Would it be possible to replace 'NA' with the empty string only if the complete entry is 'NA' (as a regex, ^NA$) and otherwise leave NAs alone?

pinin4fjords (Member) commented

Yes, that's the fix proposed by @BEFH
