[JOSS Review] Cannot run Scheduler with SLURM example #273

Open
gomezzz opened this issue Mar 26, 2024 · 7 comments

gomezzz commented Mar 26, 2024

I am trying to follow the example here https://automl.github.io/amltk/latest/examples/dask-jobqueue/

But I am unable to run `pip install openml "amltk[smac, sklearn, dask-jobqueue]"` successfully in a new conda environment

with

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                 conda_forge    conda-forge
_openmp_mutex             4.5                       2_gnu    conda-forge
bzip2                     1.0.8                hd590300_5    conda-forge
ca-certificates           2024.2.2             hbcca054_0    conda-forge
ld_impl_linux-64          2.40                 h41732ed_0    conda-forge
libexpat                  2.6.2                h59595ed_0    conda-forge
libffi                    3.4.2                h7f98852_5    conda-forge
libgcc-ng                 13.2.0               h807b86a_5    conda-forge
libgomp                   13.2.0               h807b86a_5    conda-forge
libnsl                    2.0.1                hd590300_0    conda-forge
libsqlite                 3.45.2               h2797004_0    conda-forge
libuuid                   2.38.1               h0b41bf4_0    conda-forge
libxcrypt                 4.4.36               hd590300_1    conda-forge
libzlib                   1.2.13               hd590300_5    conda-forge
ncurses                   6.4.20240210         h59595ed_0    conda-forge
openssl                   3.2.1                hd590300_1    conda-forge
pip                       24.0               pyhd8ed1ab_0    conda-forge
python                    3.11.8          hab00c5b_0_cpython    conda-forge
readline                  8.2                  h8228510_1    conda-forge
setuptools                69.2.0             pyhd8ed1ab_0    conda-forge
tk                        8.6.13          noxft_h4845f30_101    conda-forge
tzdata                    2024a                h0c530f3_0    conda-forge
wheel                     0.43.0             pyhd8ed1ab_0    conda-forge
xz                        5.2.6                h166bdaf_0    conda-forge

running the pip install I receive

Building wheels for collected packages: pyrfr
  Building wheel for pyrfr (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [2 lines of output]
      [<setuptools.extension.Extension('pyrfr._regression') at 0x7b20d16df910>, <setuptools.extension.Extension('pyrfr._util') at 0x7b20d10c7850>]
      error: command 'swig' failed: No such file or directory
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for pyrfr
  Running setup.py clean for pyrfr
Failed to build pyrfr
ERROR: Could not build wheels for pyrfr, which is required to install pyproject.toml-based projects

This seems to be a problem in one of the dependencies, but I am unsure how to proceed. If you have a conda environment.yml, a Dockerfile, or similar, providing it might help users set up their environment?

(opened as part of JOSS Review openjournals/joss-reviews#6367 )


gomezzz commented Mar 26, 2024

Same seems to apply to the HPO example https://automl.github.io/amltk/latest/examples/hpo/


gomezzz commented Mar 26, 2024

Alright, I managed to bypass the problem by manually installing swig via conda.

This seems to be a quite old issue (automl/auto-sklearn#459). Maybe it's worth mentioning in the docs, or have you encountered this problem before?
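For anyone hitting the same wall, this is the workaround that worked for me (assuming a conda/mamba environment; `swig` comes from conda-forge):

```shell
# swig is needed so pip can compile pyrfr's C++ extension locally
conda install -c conda-forge swig

# then the original install succeeds
pip install openml "amltk[smac, sklearn, dask-jobqueue]"
```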

However, I now receive a different error running the example

Task exception was never retrieved
future: <Task finished name='Task-19' coro=<_wrap_awaitable() done, defined at .../mambaforge-pypy3/envs/amltk/lib/python3.11/site-packages/distributed/deploy/spec.py:124> exception=FileNotFoundError(2, 'No such file or directory')>
Traceback (most recent call last):
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/site-packages/distributed/deploy/spec.py", line 125, in _wrap_awaitable
    return await aw
           ^^^^^^^^
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/site-packages/distributed/deploy/spec.py", line 74, in _
    await self.start()
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/site-packages/dask_jobqueue/core.py", line 426, in start
    out = await self._submit_job(fn)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/site-packages/dask_jobqueue/core.py", line 409, in _submit_job
    return await self._call(shlex.split(self.submit_command) + [script_filename])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/site-packages/dask_jobqueue/core.py", line 494, in _call
    proc = await asyncio.create_subprocess_exec(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/asyncio/subprocess.py", line 223, in create_subprocess_exec
    transport, protocol = await loop.subprocess_exec(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/asyncio/base_events.py", line 1708, in subprocess_exec
    transport = await self._make_subprocess_transport(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/asyncio/unix_events.py", line 207, in _make_subprocess_transport
    transp = _UnixSubprocessTransport(self, protocol, args, shell,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/asyncio/base_subprocess.py", line 36, in __init__
    self._start(args=args, shell=shell, stdin=stdin, stdout=stdout,
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/asyncio/unix_events.py", line 818, in _start
    self._proc = subprocess.Popen(
                 ^^^^^^^^^^^^^^^^^
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File ".../mambaforge-pypy3/envs/amltk/lib/python3.11/subprocess.py", line 1953, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'

(note: I changed `N_WORKERS` to 4, as my workstation is not that beefy :) )
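For what it's worth, the failure above is just `sbatch` missing from `PATH`. A small preflight check (a hypothetical helper, stdlib only, not part of amltk) surfaces that much more clearly than the asyncio traceback:

```python
import shutil


def slurm_available() -> bool:
    """Return True if the SLURM submit command (sbatch) is on PATH."""
    return shutil.which("sbatch") is not None


if not slurm_available():
    print(
        "sbatch not found: this machine is not a SLURM submit host; "
        "fall back to a local scheduler instead"
    )
```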

gomezzz changed the title from "[JOSS Review] Cannot install package to run Scheduler with SLURM" to "[JOSS Review] Cannot run Scheduler with SLURM example" on Mar 26, 2024
@eddiebergman

Hi @gomezzz,

Just wanted to let you know I really appreciate all of these and I will respond on Thursday!


gomezzz commented Mar 27, 2024

Hi @eddiebergman , glad to hear, no rush!

Yeah, just so you know, I am also not particularly worried about the quality of the module; I just want to help make it accessible to an audience that might be new to it, like myself. :)

@eddiebergman

Hi @gomezzz,

Regarding the issues

Assuming this is on Windows, I think the bulk of these issues stem from the fact that I do not test/work on Windows. I had a student use Windows, and he showed me that many of the dask tests failed. I will see what I can do about it. You should still be able to use `Scheduler.with_processes()`.
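For readers following along, the local-process route sidesteps dask entirely. Conceptually it is the pattern below, a plain-stdlib sketch of dispatching trial evaluations to worker processes (not amltk's actual Scheduler API):

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate(config: int) -> int:
    # stand-in for an expensive trial evaluation (e.g. fitting a pipeline)
    return config * config


if __name__ == "__main__":
    # roughly the idea behind Scheduler.with_processes(4):
    # a pool of local worker processes instead of SLURM-submitted jobs
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(evaluate, range(8)))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```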

The `pyrfr` problem is due to technical debt in SMAC, one of the optimizers, which relies on a custom C++ random forest that is only compiled for Linux up to Python 3.10. This is why `swig` was required, so pip can build it locally. You should be able to use `OptunaOptimizer` instead of SMAC.

Documenting these issues

I'm not entirely sure how to do so. This library allows you to plug and play different tools, all of them optional, so there are no required dependencies. Of course, for the examples I need to choose one of them, but as you pointed out, there are issues with it. I'm not entirely sure what to do about this; any recommendations would be greatly appreciated.


gomezzz commented Apr 5, 2024

Hi @eddiebergman ,

> Assuming this is on Windows, I think the bulk of these issues stem from the fact that I do not test/work on Windows. I had a student use Windows, and he showed me that many of the dask tests failed. I will see what I can do about it. You should still be able to use `Scheduler.with_processes()`.

I am on Linux :)

> Documenting these issues
>
> I'm not entirely sure how to do so. This library allows you to plug and play different tools, all of them optional, so there are no required dependencies. Of course, for the examples I need to choose one of them, but as you pointed out, there are issues with it. Any recommendations would be greatly appreciated.

I think a small note in the examples along the lines of "If you encounter problems installing with `pip install openml "amltk[smac]"`, we recommend instead ..." would already help. Alternatively, you could provide a conda environment file or something with fixed versions. I think the main problem here is that normally you expect to install a framework and then be ready to play with it, but here it seems you always need more dependencies. So, alternatively, you could also consider shipping a default configuration with one optimizer that works out of the box, so that extra dependencies are only needed when you want to use another one?
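Concretely, something along the lines of the following environment.yml could be shipped next to the examples (the file name, environment name, and version pins here are illustrative guesses on my part, not tested values):

```yaml
name: amltk-examples
channels:
  - conda-forge
dependencies:
  - python=3.11
  - swig   # needed so pip can build pyrfr for the SMAC optimizer
  - pip
  - pip:
      - openml
      - "amltk[smac, sklearn, dask-jobqueue]"
```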

@eddiebergman

Hi @gomezzz, looking into this again, I think the main issue is that you need to be running on a SLURM cluster, i.e. not a single workstation. SLURM is a scheduling service for distributing workloads across a cluster, not something you'd typically have installed on a single workstation.

However, this raises a good point that there should be more examples for the simple use cases that users are exposed to first!
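In the meantime, the SLURM example could perhaps open with a one-line guard so users on a plain workstation get a clear message up front (plain POSIX shell; the wording is just a suggestion):

```shell
# check whether this machine can submit SLURM jobs before running the example
if command -v sbatch >/dev/null 2>&1; then
    echo "SLURM detected: sbatch is on PATH"
else
    echo "No SLURM found: run the local Scheduler example instead"
fi
```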
