Bernd Bischl edited this page Mar 18, 2015 · 13 revisions

Are BatchJobs and its Registry mechanism written such that all "compute" nodes need to be on the same shared file system?

Yes. But we dislike that fact and would like to relax this requirement at some point, which would require non-trivial work. Also note that Bernd has worked with BatchJobs on Microsoft Azure, where no shared file system was available. He used sshfs as a simple workaround and was surprised how well it worked.

Multicore mode does not work?

We have worked very hard to make sure that it does, which was far from trivial. Run debugMulticore. If you are a Linux Guru (tm), look at linux-helper. Mail us the debug output (and a possible explanation). Also note that the constructor makeClusterFunctionsMulticore lets you define the command-line options that Rscript / R CMD BATCH are called with. Depending on what goes on in your R profile files, you might want to change those.
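As a rough sketch of that last point, a config file entry for multicore mode might look like the following. The values for ncpus and r.options are illustrative assumptions, not recommendations; adjust them to what your R profile files require.

```r
# Config file sketch; all values are illustrative assumptions.
cluster.functions = makeClusterFunctionsMulticore(
  ncpus = 2,
  # Start the worker R processes without profile files or saved workspaces.
  r.options = c("--no-save", "--no-restore", "--no-init-file", "--no-site-file")
)
```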

SSH mode does not work?

We have worked very hard to make sure that it does, which was far from trivial. Do all your nodes share a common file system? Can you log into all of them without typing in passwords? Are all necessary environment variables defined so one can run R in non-interactive shells? Run debugSSH. If you are a Linux Guru (tm), look at linux-helper. Mail us the debug output (and a possible explanation). Also note that the constructor makeSSHWorker, which is used for makeClusterFunctionsSSH, lets you define the command-line options that Rscript / R CMD BATCH are called with. Depending on what goes on in your R profile files, you might want to change those.
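A config file entry for SSH mode could be sketched as follows. The node names are hypothetical placeholders, and the r.options values are only an assumption about what a clean non-interactive start needs on your machines.

```r
# Config file sketch; node names are hypothetical placeholders.
r.opts = c("--no-save", "--no-restore", "--no-init-file", "--no-site-file")

cluster.functions = makeClusterFunctionsSSH(
  makeSSHWorker(nodename = "node1.example.org", r.options = r.opts),
  makeSSHWorker(nodename = "node2.example.org", r.options = r.opts)
)
```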

There seems to be a bug in the cluster functions implementation for my system?

We try to test on all systems available to us, but sometimes there are subtle differences in the outputs of OS commands, etc. Define a very simple test like batchMap(reg, identity, 1) and set the debug option to TRUE as described in the Configuration section. Then have a look at the output in R and inspect the job file which was generated from your brew template (in the "jobs" subdirectory of your registry directory if debug mode was turned on). Please mail us this information, possibly with your interpretation of the error, so we can fix the bug. If you want to, you could also have a look at the source code of the respective cluster functions; the files are all named "clusterFunctions{System}.R", and some helper functions are placed in clusterFunctionsHelpers.R. The code is not very lengthy and should be simple to understand.
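Such a minimal test could look roughly like this. The registry id and directory are arbitrary choices for illustration. On a real cluster the file.dir has to be on the shared file system rather than in a temporary directory, and setting debug via setConfig instead of the config file is an assumption about your installed version.

```r
library(BatchJobs)

# Turn debug mode on for this session (alternatively set debug = TRUE in the config file).
setConfig(debug = TRUE)

# Arbitrary id and directory, chosen for illustration only.
reg <- makeRegistry(id = "debug_test", file.dir = tempfile("debug_test_"))

# One trivial job: identity(1).
batchMap(reg, identity, 1)
submitJobs(reg)
waitForJobs(reg)

# Inspect the state of the job and, if it finished, its result.
showStatus(reg)
result <- loadResult(reg, 1)
```

If the job fails, the R output together with the generated job file from the "jobs" subdirectory is the information worth sending.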

My batch system is apparently supported, what should I do now? What are these template files?

First of all, get a job definition file that works on your system. It is also a good idea to briefly discuss this with your admin. Now transform this job file into a brew template, which is actually pretty straightforward. If you look into our examples subdirectory you will find many working examples of such brew templates. Note that the job resources (walltime, etc.) are not predefined in any way; you are free to use any names or encodings you want. It is probably a good idea to perform some sanity checks for those resources in your brew template, so that typos produce informative error messages. Now write a very short test, possibly with debug mode turned on (this will also store the job files in the "jobs" subdirectory of the registry's file directory so you can inspect them). If this works, specify your desired default job resources in your config file and you are done. If more people want to use the R package on the same site, share the template. If you encounter problems, send a mail to the list and we will help.
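The sanity checks mentioned above are plain R code, so they can be sketched independently of any batch system. The helper name checkResources and the resource names walltime, memory and nodes are hypothetical; use whatever names your own template expects. Code like this could be placed in an R code section at the top of the brew template.

```r
# Hypothetical helper: validate the resources list before the template uses it.
checkResources <- function(resources, allowed = c("walltime", "memory", "nodes")) {
  # Catch misspelled or unsupported resource names early.
  unknown <- setdiff(names(resources), allowed)
  if (length(unknown) > 0L) {
    stop("Unknown resource(s): ", paste(unknown, collapse = ", "),
         ". Allowed: ", paste(allowed, collapse = ", "))
  }
  # Use [[ ]] for exact name matching; $ would allow partial matches on lists.
  walltime <- resources[["walltime"]]
  if (is.null(walltime) || !is.numeric(walltime) || walltime <= 0) {
    stop("Resource 'walltime' must be a positive number.")
  }
  invisible(TRUE)
}

# A valid specification passes silently.
checkResources(list(walltime = 3600, memory = 1024))
```

A typo such as list(waltime = 3600) then stops with a message naming the offending entry instead of producing a broken job file.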

My batch system is not supported. Could you integrate it?

Please get in contact via the mailing list so we know about your request. In principle this will certainly be possible. How long it takes will depend upon how different your system is from the already implemented ones and whether we find a way to run some tests on such a system.

My batch system is not supported. Can I integrate it myself?

Yes, the package has been written with this option in mind. Start by reading the interface definition at makeClusterFunctions. Then read the source code of the already implemented back-ends to get you started; all filenames begin with "clusterFunctions". Here is an example for Torque. Useful helper functions are documented and exported; they are provided here. Now use the constructor to implement your code. You could put it into a separate R file, source this in your config file and then call your own constructor like this:

cluster.functions = makeClusterFunctionsMyCustomSystem('my_template.tmpl')

It is very likely that you should write a brew template for your implementation as well; see the question about template files above. If you have any questions we will certainly support you as well as we can, and we can also integrate your code into future versions of the package if you want that.
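To give an idea of the shape of such a constructor, here is a skeleton loosely modelled on the structure of the existing back-ends. The commands mysub, mykill and mylist are hypothetical stand-ins for your batch system's submit, kill and list tools, and the exact argument lists expected by makeClusterFunctions may differ between package versions, so check the interface documentation before relying on it.

```r
library(BatchJobs)

# Skeleton for a custom back-end; "mysub", "mykill" and "mylist" are
# hypothetical command names, not real tools.
makeClusterFunctionsMyCustomSystem <- function(template.file) {
  template <- cfReadBrewTemplate(template.file)

  submitJob <- function(conf, reg, job.name, rscript, log.file, job.dir, resources, arrayjobs) {
    # Fill in the brew template and write the job file.
    outfile <- cfBrewTemplate(conf, template, rscript, "job")
    output <- suppressWarnings(
      system2("mysub", shQuote(outfile), stdout = TRUE, stderr = TRUE)
    )
    # system2 attaches a "status" attribute only for non-zero exit codes.
    exit.code <- attr(output, "status")
    if (!is.null(exit.code) && exit.code > 0L) {
      cfHandleUnknownSubmitError("mysub", exit.code, output)
    } else {
      # Assumes the submit command prints just the batch job id.
      makeSubmitJobResult(status = 0L, batch.job.id = trimws(paste(output, collapse = "\n")))
    }
  }

  killJob <- function(conf, reg, batch.job.id) {
    cfKillBatchJob("mykill", batch.job.id)
  }

  listJobs <- function(conf, reg) {
    # Assumes the list command prints one batch job id per line.
    system2("mylist", stdout = TRUE)
  }

  # No array job support in this sketch.
  getArrayEnvirName <- function() NA_character_

  makeClusterFunctions(name = "MyCustomSystem", submitJob = submitJob,
    killJob = killJob, listJobs = listJobs, getArrayEnvirName = getArrayEnvirName)
}
```

Real back-ends usually also parse the submit output for known, recoverable conditions (for example a full queue) and report them with a non-zero status instead of treating every failure as an unknown error.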

Can I query the job resources from submit on the slave?

Yes. In your job code simply call getResources. This will return the default resources, overwritten by the resources you specified during submit. This is helpful if you are using multiple cores in one job (e.g. MPI) or you would like to know the walltime to exit gracefully when it runs out.
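A small sketch of this, assuming a resource named walltime (the name is an arbitrary choice, since resource names are up to you). Whether getResources is usable in every execution mode, for instance when jobs run inside the master session, is not something this example guarantees.

```r
library(BatchJobs)

# Arbitrary id and directory, chosen for illustration only.
reg <- makeRegistry(id = "res_demo", file.dir = tempfile("res_demo_"))

job_fun <- function(x) {
  # Runs on the slave: defaults merged with the resources given at submit.
  res <- BatchJobs::getResources()
  list(x = x, walltime = res[["walltime"]])
}

batchMap(reg, job_fun, 1:2)

# The resources passed here are what getResources sees inside the job.
submitJobs(reg, resources = list(walltime = 3600))
```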

Can I query the job resources of submitted jobs later on the master?

Yes. All resources are stored on disk and connected to their jobs in the DB. Simply call getJobResources with the ids of the desired jobs.
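For illustration, a lookup on the master might be sketched like this. The registry id, directory and the walltime resource are assumptions made up for the example.

```r
library(BatchJobs)

# Arbitrary id and directory, chosen for illustration only.
reg <- makeRegistry(id = "res_master", file.dir = tempfile("res_master_"))
batchMap(reg, identity, 1:3)

# Submit only the first two jobs, with an explicit resource specification.
submitJobs(reg, ids = 1:2, resources = list(walltime = 3600))
waitForJobs(reg)

# Look up the resources that were stored for these jobs.
job_resources <- getJobResources(reg, ids = 1:2)
```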

What about package updates? Will they break my experiments?

No. New releases will be backward compatible.

Are the results reproducible?

BatchJobs and BatchExperiments set and store seeds for each possible stochastic computational part. So yes, your results will be reproducible.
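One way to check this for yourself is to run the same stochastic jobs in two fresh registries created with the same seed and compare the results. This is only a sketch: the seed value and the registry id are arbitrary assumptions.

```r
library(BatchJobs)

# Run three small stochastic jobs in a fresh registry with a fixed seed.
runOnce <- function() {
  reg <- makeRegistry(id = "seed_demo", file.dir = tempfile("seed_demo_"), seed = 42L)
  batchMap(reg, function(i) rnorm(1), 1:3)
  submitJobs(reg)
  waitForJobs(reg)
  unlist(loadResults(reg))
}

first <- runOnce()
second <- runOnce()

# Compare the two independent runs.
identical(first, second)
```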

The progress bar is annoying. It also clutters reports that I generate with, e.g., knitr or Sweave.

The progress bar wrote to stdout in BBmisc versions < 1.5. From 1.6 on, it writes to stderr by default, which should resolve the problem. It has always been possible to turn it off with an option, and now the stream can also be selected; please see makeProgressBar. Additionally, setting the option BatchJobs.verbose to FALSE will suppress many messages. In summary, put a chunk with the following content at the start of your report:

options(BatchJobs.verbose = FALSE, BBmisc.ProgressBar.style = "off")