[BUG]: ClusterManager not working on PBS #419

Open · nathaliesoy opened this issue Aug 30, 2023 · 9 comments
Labels: bug (Something isn't working)

@nathaliesoy

What happened?

When using the cluster manager on PBS, the code breaks. It appears to fail to launch the workers due to invalid qsub flags.
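
Roughly, the relevant part of my setup looks like this (simplified; the values below are placeholders rather than my exact script):

from pysr import PySRRegressor

model = PySRRegressor(
    niterations=40,            # placeholder values
    procs=10,
    multithreading=False,
    cluster_manager="pbs",     # this is what dispatches to ClusterManagers.addprocs_pbs
)
model.fit(X, y)                # X, y are my training arrays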

Version

0.14.1

Operating System

Linux

Package Manager

pip

Interface

Script (i.e., python my_script.py)

Relevant log output

output: 
Compiling Julia backend...
Error launching workers
ErrorException("")
Activating environment on workers.
Importing installed module on workers...Finished!
Testing module on workers...Finished!
Testing entire pipeline on workers...Finished!
error: 
qsub: invalid option -- 'w'
qsub: invalid option -- 'd'
qsub: invalid option -- 't'
usage: qsub [-a date_time] [-A account_string] [-c interval]
	[-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
	[-k keep] [-l resource_list] [-m mail_options] [-M user_list]
	[-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
	[-R o|e|oe] [-S path] [-u user_list] [-W otherattributes=value...]
	[-v variable_list] [-V ] [-z] [script | -- command [arg1 ...]]
       qsub --version
/srv01/agrp/soybna/.local/lib/python3.8/site-packages/pysr/sr.py:1230: UserWarning: Note: Using a large maxsize for the equation search will be exponentially slower and use significant memory. You should consider turning `use_frequency` to False, and perhaps use `warmup_maxsize_by`.
  warnings.warn(
/srv01/agrp/soybna/.local/lib/python3.8/site-packages/pysr/julia_helpers.py:195: UserWarning: Your system's Python library is static (e.g., conda), so precompilation will be turned off. For a dynamic library, try `pyenv`.
  warnings.warn(
Traceback (most recent call last):
  File "run_pysr.py", line 28, in <module>
    model.fit(traindata['features'], traindata['init_hidden_rep'])
  File "/srv01/agrp/soybna/.local/lib/python3.8/site-packages/pysr/sr.py", line 1845, in fit
    self._run(X, y, mutated_params, weights=weights, seed=seed)
  File "/srv01/agrp/soybna/.local/lib/python3.8/site-packages/pysr/sr.py", line 1705, in _run
    self.raw_julia_state_ = SymbolicRegression.EquationSearch(
RuntimeError: <PyCall.jlwrap (in a Julia function called from Python)
JULIA: MethodError: reducing over an empty collection is not allowed; consider supplying `init` to the reducer
Stacktrace:
  [1] mapreduce_empty(#unused#::typeof(identity), op::Function, T::Type)
    @ Base ./reduce.jl:367
  [2] reduce_empty(op::Base.MappingRF{typeof(identity), SymbolicRegression.SearchUtilsModule.var"#2#4"{Dict{Int64, Int64}}}, #unused#::Type{Int64})
    @ Base ./reduce.jl:356
  [3] reduce_empty_iter
    @ ./reduce.jl:379 [inlined]
  [4] mapreduce_empty_iter(f::Function, op::Function, itr::Vector{Int64}, ItrEltype::Base.HasEltype)
    @ Base ./reduce.jl:375
  [5] _mapreduce(f::typeof(identity), op::SymbolicRegression.SearchUtilsModule.var"#2#4"{Dict{Int64, Int64}}, #unused#::IndexLinear, A::Vector{Int64})
    @ Base ./reduce.jl:427
  [6] _mapreduce_dim
    @ ./reducedim.jl:365 [inlined]
  [7] #mapreduce#800
    @ ./reducedim.jl:357 [inlined]
  [8] mapreduce
    @ ./reducedim.jl:357 [inlined]
  [9] #reduce#802
    @ ./reducedim.jl:406 [inlined]
 [10] reduce
    @ ./reducedim.jl:406 [inlined]
 [11] next_worker(worker_assignment::Dict{Tuple{Int64, Int64}, Int64}, procs::Vector{Int64})
    @ SymbolicRegression.SearchUtilsModule ~/.julia/packages/SymbolicRegression/Y57Eu/src/SearchUtils.jl:23
 [12] _EquationSearch(parallelism::Symbol, datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, NamedTuple{(), Tuple{}}}}; niterations::Int64, options::Options{Int64, Optim.Options{Float64, Nothing}, L2DistLoss, Nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, numprocs::Int64, procs::Nothing, addprocs_function::typeof(addprocs_pbs), runtests::Bool, saved_state::Nothing)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/Y57Eu/src/SymbolicRegression.jl:572
 [13] _EquationSearch
    @ ~/.julia/packages/SymbolicRegression/Y57Eu/src/SymbolicRegression.jl:412 [inlined]
 [14] EquationSearch(datasets::Vector{Dataset{Float32, Float32, Matrix{Float32}, Vector{Float32}, Nothing, NamedTuple{(), Tuple{}}}}; niterations::Int64, options::Options{Int64, Optim.Options{Float64, Nothing}, L2DistLoss, Nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::typeof(addprocs_pbs), runtests::Bool, saved_state::Nothing)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/Y57Eu/src/SymbolicRegression.jl:399
 [15] EquationSearch(X::Matrix{Float32}, y::Matrix{Float32}; niterations::Int64, weights::Nothing, varMap::Vector{String}, options::Options{Int64, Optim.Options{Float64, Nothing}, L2DistLoss, Nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, parallelism::String, numprocs::Int64, procs::Nothing, addprocs_function::typeof(addprocs_pbs), runtests::Bool, saved_state::Nothing, multithreaded::Nothing, loss_type::Type)
    @ SymbolicRegression ~/.julia/packages/SymbolicRegression/Y57Eu/src/SymbolicRegression.jl:332
 [16] invokelatest(::Any, ::Any, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Any, NTuple{8, Symbol}, NamedTuple{(:weights, :niterations, :varMap, :options, :numprocs, :parallelism, :saved_state, :addprocs_function), Tuple{Nothing, Int64, Vector{String}, Options{Int64, Optim.Options{Float64, Nothing}, L2DistLoss, Nothing, StatsBase.Weights{Float64, Float64, Vector{Float64}}}, Int64, String, Nothing, typeof(addprocs_pbs)}}})
    @ Base ./essentials.jl:818
 [17] _pyjlwrap_call(f::Function, args_::Ptr{PyCall.PyObject_struct}, kw_::Ptr{PyCall.PyObject_struct})
    @ PyCall ~/.julia/packages/PyCall/twYvK/src/callback.jl:32
 [18] pyjlwrap_call(self_::Ptr{PyCall.PyObject_struct}, args_::Ptr{PyCall.PyObject_struct}, kw_::Ptr{PyCall.PyObject_struct})
    @ PyCall ~/.julia/packages/PyCall/twYvK/src/callback.jl:44>

Extra Info

Setting multithreading to False doesn't change anything.

@MilesCranmer (Owner)

Thanks! This looks like it might be an issue in ClusterManagers.jl JuliaParallel/ClusterManagers.jl#179

What is your qsub --version?

@nathaliesoy (Author)

pbs_version = 20.0.1

@MilesCranmer (Owner)

Okay, this might take a bit longer to solve. It turns out to be really hard to set up a local PBS installation for testing. But I'm working on it!

JuliaParallel/ClusterManagers.jl#193

@MilesCranmer (Owner)

Basically, what we need to do to fix ClusterManagers.jl is modify these lines:

https://github.com/JuliaParallel/ClusterManagers.jl/blob/0b0ee3dc772beee0c8cccc77079d941b979ffeac/src/qsub.jl#L52-L54

            qsub_cmd = pipeline(`echo $(Base.shell_escape(cmd))` , (isPBS ?
                    `qsub -N $jobname -wd $wd -j oe -k o -t 1-$np $queue` :
                    `qsub -N $jobname -wd $wd -terse -j y -R y -t 1-$np -V $queue`))

It sounds like they haven't yet updated this qsub call to PBS version 20.

If you are proficient with qsub and know which flags are needed here, you could make a local modification of ClusterManagers.jl and then point PySR at that copy with:

cd ClusterManagers.jl
julia --project=@pysr-0.16.3 -e 'using Pkg; pkg"dev ."'

This will make the PySR 0.16.3 environment use the local copy of ClusterManagers.jl. Then, if you update the qsub call in the src/qsub.jl file to the PBS version 20 syntax, it should work.
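
For example (just a sketch; I haven't been able to test against PBS Pro 20, and the exact flags may differ on your system), PBS Pro uses -J for array jobs and does not accept -wd, -t, or -terse, so the isPBS branch might become something like:

            # Hypothetical sketch only; verify the flags against `man qsub` on your cluster.
            # Note that dropping `-wd $wd` means the working directory would need to be
            # handled inside the wrapped command instead.
            qsub_cmd = pipeline(`echo $(Base.shell_escape(cmd))` , (isPBS ?
                    `qsub -N $jobname -j oe -k o -J 1-$np $queue` :
                    `qsub -N $jobname -wd $wd -terse -j y -R y -t 1-$np -V $queue`))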

@nathaliesoy (Author)

Thank you Miles for investigating this! I think I figured out the new PBS 20 flags and changed it accordingly.

So I added these two lines to my submission shell script:

cd ClusterManagers.jl
julia --project=@pysr-0.16.3 -e 'using Pkg; pkg"dev ."' 

but it doesn't look like it is picking up the local package. The Julia version I am using is globally installed on the cluster. I can't recall: does ClusterManagers.jl need to be in a specific folder? Do I need to set a path somewhere?

@MilesCranmer (Owner)

Even if the Julia version is globally installed, you should have the environments appear in your local folder ~/.julia/environments. There should be a pysr-0.16.3 one in that folder (or whatever version of PySR you have installed).

If you open the file ~/.julia/environments/pysr-0.16.3/Manifest.toml and go to the ClusterManagers entry, it should tell you whether it is a local version or not, and which folder it is using. Maybe the path is relative rather than absolute? You could also try

julia --project=@pysr-0.16.3 -e 'using Pkg; Pkg.develop(path="/path/to/clustermanagers.jl")' 

and give the full absolute path (to the location of your modified ClusterManagers.jl) there?
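
Another quick way to check which copy of the package is actually being loaded is this one-liner (adjust the environment name to your installed PySR version):

julia --project=@pysr-0.16.3 -e 'import ClusterManagers; println(pathof(ClusterManagers))'

If the dev succeeded, that should print a path inside your local ClusterManagers.jl checkout rather than something under ~/.julia/packages/.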

@MilesCranmer (Owner)

Oh wait, sorry. I just realized you said in the original post that you are using PySR 0.14.1. So either (1) update to PySR 0.16.3 and go through the normal installation with python -m pysr install before implementing these changes, or (2) use --project=@pysr-0.14.1 instead of @pysr-0.16.3.
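
In shell terms, the two options look roughly like this (illustrative commands; adjust the paths and versions to your setup):

# Option 1: upgrade PySR, redo the Julia install, then dev against the new environment
pip install --upgrade pysr
python -m pysr install
julia --project=@pysr-0.16.3 -e 'using Pkg; Pkg.develop(path="/path/to/ClusterManagers.jl")'

# Option 2: keep PySR 0.14.1 and dev against its environment instead
julia --project=@pysr-0.14.1 -e 'using Pkg; Pkg.develop(path="/path/to/ClusterManagers.jl")'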

@nathaliesoy (Author)

Okay, that part seems to work now, thanks!
Now the issue is that, when submitting, it can't connect to the server (errno=15010), which seems like a permissions thing... Should I take this up with our system administrator?

@MilesCranmer (Owner)

Hm, yeah the sysadmin might know best for that type of issue. How are you running things?

You could also try running a parallel Julia command manually, just to see if it gives a more helpful error message.

First, create an interactive job on the cluster that you can SSH into. SSH into it and start Julia with julia --project=@pysr-0.16.3. Then execute the following (copy-paste):

import Distributed: pmap
import ClusterManagers: addprocs_pbs

num_workers = 10

# Create the workers:
procs = addprocs_pbs(num_workers)

# Run a computation on each worker:
pmap(worker_id -> worker_id^2, procs)

If successful, it should return a vector like [4, 9, 16, ...], and each of those computations will have run on a different worker across the PBS allocation.
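
If that works, you can remove the workers afterwards with the standard Distributed call:

import Distributed: rmprocs

# Tear down the PBS workers created above:
rmprocs(procs)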
