Process terminated for an unknown reason (SLURM) #339

Open
hdbeukel opened this issue Oct 16, 2023 · 0 comments
Labels
bug Something isn't working


Description of the bug

When running the atac-seq pipeline on our SLURM cluster, it keeps failing at seemingly arbitrary points, with an error message saying that the process was "terminated for an unknown reason -- Likely it has been terminated by the external system" (see full error below).

When I resume the pipeline without changing any parameters, it usually gets past the previously terminated process and then fails again at a later step with the same error message. If I keep resuming, the pipeline eventually reaches the end.
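As a stopgap that mirrors the manual "keep resuming" procedure described here, the resume step could be wrapped in a small retry loop. This is only a sketch of a workaround and does not address the underlying cause; `retry_until_success` and the attempt limit are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# Re-run a command until it exits with status 0, at most max_attempts times.
# Hypothetical helper: it only automates the manual resume loop.
retry_until_success() {
    local max_attempts="$1"; shift
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        echo "starting attempt $attempt of $max_attempts" >&2
    done
}

# Example (not executed here), using the command from this report:
#   retry_until_success 5 nextflow -c atac-seq-slurm.config run nf-core/atacseq \
#       -profile singularity -params-file atac-seq.yaml --save_align_intermeds -resume
```

Because every attempt uses `-resume`, tasks that already completed are taken from the cache and only the terminated ones are resubmitted.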

When a process fails, the working directory contains only two files:

  • .command.sh
  • .command.run

No .out, .trace, or .exitcode files are present, and no symlinks to the input data have been created. If I manually submit the .command.run script to the cluster, without making any changes, it succeeds without any problem and all the files are there.
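To make the state of a failed task directory easy to report, a small helper can list which of the files Nextflow normally writes per task are present. This is a minimal sketch: `check_task_dir` is a hypothetical name, and the file list reflects the usual Nextflow task files (`.command.trace` only appears when tracing is enabled).

```shell
#!/bin/bash
# Report which of Nextflow's usual per-task files exist in a work directory.
# check_task_dir is a hypothetical helper name.
check_task_dir() {
    local dir="$1" f
    for f in .command.sh .command.run .command.begin .command.out \
             .command.err .command.log .command.trace .exitcode; do
        if [ -e "$dir/$f" ]; then
            echo "present  $f"
        else
            echo "missing  $f"
        fi
    done
}

# Example (not executed here): check_task_dir "$task_workdir"
```

If `.command.begin` is missing as well, that would suggest the wrapper script never started on the compute node, or could not write to the work directory, rather than the tool itself failing.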

I have been in touch with our IT support in charge of managing the cluster but they also have no clue what is happening. We used to have a Sun Grid Engine cluster, on which the pipeline ran without problems. The issue started to appear when the cluster was migrated to SLURM.
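Something concrete to hand to cluster support would be the SLURM job ID of a terminated task, so its accounting record can be looked up. The sketch below assumes that `.nextflow.log` contains the executor's debug line of the form `... submitted process <name> > jobId: <id>; workDir: <path>`; that log format is an assumption to verify against your own log, and `job_id_for_workdir` is a hypothetical helper name.

```shell
#!/bin/bash
# Print the job ID that Nextflow logged for a given task work directory.
# Assumes log lines containing "jobId: <id>; workDir: <path>" (check your log).
job_id_for_workdir() {
    local log="$1" workdir="$2"
    grep -F "workDir: $workdir" "$log" \
        | sed -n 's/.*jobId: \([0-9][0-9]*\);.*/\1/p' \
        | tail -n 1
}

# Example (not executed here; needs SLURM accounting to be enabled):
#   jid=$(job_id_for_workdir .nextflow.log "$task_workdir")
#   sacct -j "$jid" --format=JobID,JobName,State,ExitCode,Start,End,NodeList
```

The `State` and `NodeList` columns should show whether SLURM itself cancelled or failed the job and on which node, which may help narrow down whether the terminations cluster on particular nodes.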

Command used and terminal output

#!/bin/bash
#
#SBATCH -p all # partition (queue)
#SBATCH -c 1 # number of cores
#SBATCH --mem 16G # memory pool for all cores
#SBATCH -o slurm.%N.%j.out # STDOUT
#SBATCH -e slurm.%N.%j.err # STDERR

module load java/x86_64/16.0.1+9
module load nextflow/x86_64/23.04.1

nextflow -c atac-seq-slurm.config run nf-core/atacseq \
         -profile singularity \
         -params-file atac-seq.yaml \
         --save_align_intermeds \
         -resume

ERROR ~ Error executing process > 'NFCORE_ATACSEQ:ATACSEQ:MERGED_LIBRARY_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX (CONTROL_REP1)'

Caused by:
  Process `NFCORE_ATACSEQ:ATACSEQ:MERGED_LIBRARY_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX (CONTROL_REP1)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  samtools \
      index \
      -@ 1 \
       \
      CONTROL_REP1.mLb.mkD.sorted.bam
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_ATACSEQ:ATACSEQ:MERGED_LIBRARY_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX":
      samtools: $(echo $(samtools --version 2>&1) | sed 's/^.*samtools //; s/Using.*$//')
  END_VERSIONS

Command exit status:
  -

Command output:
  (empty)

Relevant files

The config file only sets the working directory and the SLURM executor:

workDir = '/scratch/...'

executor {
    name = 'slurm'
}

The parameter file contains these settings:

input: './samplesheet.csv'
fasta: 'data/ath.fasta'
gff: 'data/ath.gff'
outdir: './results'

aligner: bowtie2
macs_gsize: 119481543
narrow_peak: true

max_cpus: 24
max_memory: '100.GB'

System information

  • Nextflow version: 23.04.1
  • Hardware: HPC
  • Executor: SLURM
  • Container engine: Singularity
  • OS: Linux
  • Version of nf-core/atacseq: 2.1.2
@hdbeukel added the bug (Something isn't working) label on Oct 16, 2023