
“An error has occurred when calling silent_system2:” #29

Open
thistleknot opened this issue Dec 22, 2020 · 19 comments
Labels: bug (Something isn't working)

@thistleknot

https://stackoverflow.com/questions/65402764/slurmr-trying-to-run-an-example-job-an-error-has-occurred-when-calling-silent

I set up a Slurm cluster and I can run `srun -N4 hostname` just fine.

I keep seeing "silent_system2" errors. I've installed slurmR using devtools::install_github("USCbiostats/slurmR").

I'm following example 3: https://github.com/USCbiostats/slurmR

Here are my files:

cat slurmR.R

library(doParallel)
library(slurmR)

cl <- makeSlurmCluster(4)

registerDoParallel(cl)
m <- matrix(rnorm(9), 3, 3)
# row-wise computation dispatched to the Slurm-backed cluster
ans <- foreach(i = 1:nrow(m), .combine = rbind) %dopar% sqrt(m[i, ])

stopCluster(cl)
print(ans)

cat rscript.slurm

#!/bin/bash
#SBATCH --output=slurmR.out

cd /mnt/nfsshare/tankpve0/
Rscript --vanilla slurmR.R

cat slurmR.out

Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
slurmR default option for `tmp_path` (used to store auxiliar files) set to:
  /mnt/nfsshare/tankpve0
You can change this and checkout other slurmR options using: ?opts_slurmR, or you could just type "opts_slurmR" on the terminal.
Submitting job... jobid:18.
Slurm accounting storage is disabled
Error: An error has occurred when calling `silent_system2`:
Warning: An error was detected before returning the cluster object. If submitted, we will try to cancel the job and stop the cluster object.
Execution halted
@gvegayon
Member

gvegayon commented Jan 4, 2021

This seems to be an issue with Slurm's configuration. See whether you can run the following:

library(slurmR)
Slurm_lapply(1:10, function(x) runif(10), njobs = 4)

That is the bare minimum. Creating a cluster object may be more complicated.

@gvegayon gvegayon self-assigned this Jan 4, 2021
@ekernf01

ekernf01 commented Apr 18, 2021

I'm also seeing this issue. It appears with the bare minimum example you just posted. When I set plan = "none" and submit the job by hand, I see the jobs on squeue, the log shows no errors, and the answers are all there in the R data files. But when I go to collect, I get

No job found. This may be a false negative as the job may still be on it's way to be submitted.. Waiting 10 seconds before retry.
Error: No job found. This may be a false negative as the job may still be on it's way to be submitted.

I just set up slurm on my laptop for testing, so it certainly could be a problem with my configuration. But given that it all ran and the answers are right there as expected, it seems like Slurm_collect ought to be able to find them.

Edit: I'm using R 4.0.0, slurm-wlm 17.11.2, ubuntu 18.04, slurmR 0.4.2.
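
For reference, a minimal sketch of the manual-submission workflow described above; the plan = "none" value and Slurm_collect() are taken from this thread, so treat this as an illustration rather than the exact code used:

library(slurmR)

# Build the job scripts but neither submit nor wait (plan = "none")
job <- Slurm_lapply(1:10, function(x) runif(10), njobs = 4, plan = "none")

# ...submit the generated batch script by hand with sbatch and wait for it
# to finish; the per-chunk .rds answers are then already on disk...
res <- Slurm_collect(job)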

@gvegayon
Member

Thanks, @ekernf01, I'll try to reproduce your error using Docker. I'm not sure what could be causing it. In the case of @thistleknot, I believe this is an issue with the setup of his cluster. I currently don't have access to a cluster that allows using ssh between nodes (which is what makeSlurmCluster relies on). I am very aware of these issues and will try to solve them ASAP.
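
For anyone debugging the makeSlurmCluster() path, a rough way to check that ssh requirement is to confirm the machine running R can reach a compute node non-interactively ("node01" below is a placeholder for one of your node names):

# makeSlurmCluster needs password-less ssh from the R session to the nodes.
# BatchMode=yes makes ssh fail immediately instead of prompting for a password.
system2("ssh", c("-o", "BatchMode=yes", "node01", "hostname"),
        stdout = TRUE, stderr = TRUE)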

@ekernf01

If it's helpful in setting up the container, I used this guide to set up my Slurm installation.

https://blog.llandsmeer.com/tech/2020/03/02/slurm-single-instance.html

@ekernf01

Can't stop thinking of Futurama.
https://futurama.fandom.com/wiki/Slurm

@edisto69

edisto69 commented May 17, 2021

I am experiencing the same issue. I built an Odroid XU4-based cluster (an XU4 front-end and 12 MC1s as the nodes). When I submit:

job <- Slurm_EvalQ(slurmR::WhoAmI(), njobs = 20, plan = "submit")

It says the job was submitted. Looking at slurmctld.log, I can see the jobs were sent to the 12 nodes, the remaining 8 jobs were assigned as the first ones finished, and all of them eventually completed. But when I enter `job` or `res <- Slurm_collect(job)`, I get:

Slurm accounting storage is disabled
Error: An error has occurred when calling 'silent_system2':

The same issue occurs with the minimal Slurm_lapply example above. Any suggestions will be greatly appreciated!

The system is connected to an NFS server, but I am running R on the front-end, not on the server.

@gvegayon
Member

@edisto69 @ekernf01 @thistleknot I believe you may have found a bug. It could still be that your systems have an issue or two with the Slurm config (which I will check ASAP to see how to handle it properly), but slurmR was supposed to be more explicit about the type of error. It turned out that I was not capturing stderr when needed, which I now do.

I would appreciate it if you could install this version instead, re-run your code, and report back whatever you see.

To install this version, you can either use git:

git clone --branch issue029 https://github.com/USCbiostats/slurmR.git
R CMD INSTALL slurmR

Or download the zip, unzip it, and then install, e.g.,

wget https://github.com/USCbiostats/slurmR/archive/refs/heads/issue029.zip
unzip issue029.zip
R CMD INSTALL slurmR-issue029

I appreciate your help!
cc @USCbiostats/core-c

@edisto69

Thanks for following up!

Now, when I run:

library(slurmR)
Slurm_lapply(1:10, function(x) runif(10), njobs = 4)

I get the more specific error message:

Error: An error has occurred when calling
system2("sacct", flags, stdout = TRUE, stderr = TRUE)
Slurm accounting storage is disabled
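
That message points at the sacct call itself, so a quick sanity check outside slurmR is to run sacct through system2 directly (a sketch; the flags are illustrative, not necessarily the ones slurmR passes):

# If Slurm accounting is not set up, this surfaces the same
# "Slurm accounting storage is disabled" message slurmR is tripping on.
out <- tryCatch(
  system2("sacct", c("--allocations", "--parsable2"), stdout = TRUE, stderr = TRUE),
  warning = function(w) w,
  error   = function(e) e
)
print(out)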

@gvegayon gvegayon added the bug Something isn't working label May 21, 2021
@edisto69

edisto69 commented May 23, 2021

I found one reason I was having an issue (by looking at slurmd.log). I have the same single user on all the nodes and on the front end, but they don't have a shared home folder... I'm trying to figure out how (or whether) I can have that user share a home folder on the NFS mount.

@gvegayon
Member

Thank you @edisto69, I just pushed an update. Could you try to install it again? Thanks

@edisto69

I am probably messing you up by changing things...I'm still working on getting R installed on the NFS server so all the nodes have access, but I have tried it a few times after a new R installation using:

install.packages("devtools")
devtools::install_github("USCbiostats/slurmR")

And I get the generic error:

Error: An error has occurred when calling 'silent_system2':

I hope to have things configured by the end of the week, and I'll try it again.

@edisto69

Sorry for spamming the thread...I am pretty sure that my configuration is good now. I just ran the rslurm::slurm_apply example, and got back the expected results.

Running the minimal example that you gave above, I still get:

Error: An error has occurred when calling 'silent_system2':

But the slurmr-job directory now has no errors in the '02-output-' files, and has '03-answer-' files and 'X_0001.rds' to 'X_0004.rds' (now we have Futurama and the X-files...).

@gvegayon
Member

gvegayon commented Jun 7, 2021

Hey @edisto69, thanks for trying that. The issue is that you got the bugged version, not the patched one. You can either install the updated version like this:

wget https://github.com/USCbiostats/slurmR/archive/refs/heads/issue029.zip
unzip issue029.zip
R CMD INSTALL slurmR-issue029

Or, if you want to use devtools, like this:

devtools::install_github("USCBiostats/slurmR", ref = "issue029")

I'll now try to replicate the issue using Docker.

gvegayon added a commit that referenced this issue Jun 8, 2021
@edisto69

edisto69 commented Jun 8, 2021

Well...it is different.

I ran:

library(slurmR)
slurmR::Slurm_lapply(1:10, function(x) runif(10), njobs = 4)

It now says that it cannot create the slurmr job file in the user's home directory (which is an NFS mount) because permission is denied, but I can access the directory from the terminal, and rslurm::slurm_map() has no issues setting up its job directory.

For my slurm_map() scripts I have been using /home/user/work as my working directory, where 'user' is a link to the NFS-mounted home directory and 'work' is a link in that directory to a different NFS-mounted folder.

Setting the same working directory for the above script resulted in the same error.
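
If the failure is about where the auxiliary files get written, one thing worth trying is pointing slurmR's tmp_path at a directory every node can see and write to. A sketch only: /mnt/nfsshare/tankpve0 is simply the shared path used earlier in this thread, and opts_slurmR$set_tmp_path() is assumed to be the relevant setter (see the ?opts_slurmR hint in the log above):

library(slurmR)

# Write the slurmr-job-* directory to a shared, node-visible location
# instead of the NFS-mounted home directory that is denying permission.
opts_slurmR$set_tmp_path("/mnt/nfsshare/tankpve0")

job <- Slurm_lapply(1:10, function(x) runif(10), njobs = 4)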

@gvegayon
Member

gvegayon commented Jun 8, 2021

Thank you very much, @edisto69, I really appreciate all the time you are giving me! I think it would be great if we could talk more at length to see what's going on. Would you be willing to have a conference call to talk about this? If so, feel free to email me at g.vegayon@gmail.com.

Regarding the docker image, @ekernf01, I was able to build one using an existing image with Slurm. It is available at https://hub.docker.com/repository/docker/uscbiostats/slurmr-dev, and the instructions (partial, though) are here.

@gvegayon gvegayon mentioned this issue Aug 10, 2021
@kgoldfeld

kgoldfeld commented Nov 12, 2021

Has this problem been resolved? I just started getting this message occasionally (that is, not consistently) when submitting the same job:

Warning: The call to -sacct- failed. This is probably due to not having slurm accounting up and running. For more information, checkout this discussion: https://github.com/USCbiostats/slurmR/issues/29
Error in UseMethod("get_tmp_path") : 
  no applicable method for 'get_tmp_path' applied to an object of class "c('integer', 'numeric')"
Calls: Slurm_lapply ... wait_slurm.integer -> status -> status.default -> sacct_ -> get_tmp_path

Does the latest development version of slurmR fix this?

Follow-up: I got the development version of slurmR installed on the HPC, but I'm still getting the same error... any ideas?

@jobstdavid

jobstdavid commented Mar 18, 2022

Unfortunately, I have the same problem as @kgoldfeld.

Submitted batch job 892035
Submitting job...Warning: The call to -sacct- failed. This is probably due to not having slurm accounting up and running. For more information, checkout this discussion: https://github.com/USCbiostats/slurmR/issues/29
Error in UseMethod("get_tmp_path") : 
  no applicable method for 'get_tmp_path' applied to an object of class "c('integer', 'numeric')"
Calls: Slurm_sapply ... wait_slurm.integer -> status -> status.default -> sacct_ -> get_tmp_path
In addition: Warning messages:
1: In normalizePath(file.path(tmp_path, job_name)) :
  path[1]="/home/jobst/test/slurmr-job-9c9aa2b50d464": No such file or directory
2: `X` is not a list. The function will coerce it into one using `as.list` 
Execution halted

Does a solution already exist? That would be great!

gvegayon added a commit that referenced this issue Mar 21, 2022
@gvegayon
Member

Hey @jobstdavid and @kgoldfeld (and others!), I just pushed what I think is a fix to the master branch. I'd appreciate you installing the package and giving it a try.

@kgoldfeld

@gvegayon - I installed the package on our HPC, and did some quick tests. It seems like things are working again - though I will keep you posted in case the errors reappear. Thanks so much for the fix.
