gpus on SlurmCluster #1075

Open
mens-artis opened this issue Sep 26, 2023 · 1 comment
@mens-artis

Description

Running a SLURMCluster with example 7 as a basis works, but when I add `job_extra_directives=["--gres=gpu:2"]`
and send a torch tensor `.to('cuda:0')`, it crashes. It may be related to this warning from the documentation:

Warning
On some clusters you cannot spawn new jobs when running a SLURMCluster inside a job instead of on the login node. No obvious errors might be raised but it can hang silently.

But I used `.to('cuda:0')` precisely to make the failure less silent.
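
To illustrate what I mean by "less silent": `torch.cuda.is_available()` just falls back to CPU, whereas an explicit `.to('cuda:0')` raises an error when no CUDA device is visible. A minimal sketch, outside of SMAC/Dask:

```python
import torch

# Silent path: falls back to CPU without complaining
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

# Loud path: raises an error if the process cannot see a CUDA device
b = torch.randn((4, 5))
b = b.to('cuda:0')
```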

Steps/Code to Reproduce

"""
Parallelization-on-Cluster
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An example of applying SMAC to optimize Branin using parallelization via Dask client on a
SLURM cluster. If you do not want to use a cluster but your local machine, set dask_client
to `None` and pass `n_workers` to the `Scenario`.

:warning: On some clusters you cannot spawn new jobs when running a SLURMCluster inside a
job instead of on the login node. No obvious errors might be raised but it can hang silently.

Sometimes you need to modify your launch command which can be done with
`SLURMCluster.job_class.submit_command`.

```python
cluster.job_cls.submit_command = submit_command
cluster.job_cls.cancel_command = cancel_command
```

Here we optimize the synthetic 2d function Branin.
We use the black-box facade because it is designed for black-box function optimization.
The black-box facade uses a :term:`Gaussian Process <GP>` as its surrogate model.
The facade works best on a numerical hyperparameter configuration space and should not
be applied to problems with large evaluation budgets (up to 1000 evaluations).
"""

import numpy as np
from ConfigSpace import Configuration, ConfigurationSpace, Float
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

from smac import BlackBoxFacade, Scenario

import torch

__copyright__ = "Copyright 2023, AutoML.org Freiburg-Hannover"
__license__ = "3-clause BSD"

class Branin(object):
    @property
    def configspace(self) -> ConfigurationSpace:
        cs = ConfigurationSpace(seed=0)
        x0 = Float("x0", (-5, 10), default=-5, log=False)
        x1 = Float("x1", (0, 15), default=2, log=False)
        cs.add_hyperparameters([x0, x1])

        return cs

    def train(self, config: Configuration, seed: int = 0) -> float:
        # def gpu_checks(self, seed: int = 0, budget: int = 25):
        # setting device on GPU if available, else CPU
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        print('Using device:', device)  # prints Using device: cpu

        b = torch.randn((4, 5))
        b.to('cuda:0')  # fails on the worker when no CUDA device is visible

        x0 = config["x0"]
        x1 = config["x1"]
        a = 1.0
        b = 5.1 / (4.0 * np.pi**2)
        c = 5.0 / np.pi
        r = 6.0
        s = 10.0
        t = 1.0 / (8.0 * np.pi)
        ret = a * (x1 - b * x0**2 + c * x0 - r) ** 2 + s * (1 - t) * np.cos(x0) + s

        return ret

if __name__ == "__main__":
    model = Branin()

    # Scenario object specifying the optimization "environment"
    scenario = Scenario(model.configspace, deterministic=True, n_trials=100)

    n_workers = 2  # Use 2 workers on the cluster

    cluster = SLURMCluster(
        # This is the partition of our slurm cluster.
        queue="...",
        cores=1,
        memory="1 GB",
        walltime="00:10:00",
        processes=1,
        log_directory="tmp/smac_dask_slurm",
        # worker_extra_args=["--gpus-per-task=2"],
        job_extra_directives=["--gres=gpu:2"],
    )
    cluster.scale(jobs=n_workers)
    print(cluster.job_script())

    # Dask will create n_workers jobs on the cluster which stay open.
    # Then, SMAC/Dask will schedule individual runs
    #   on the workers like on your local machine.
    # client = Client(
    #     address=cluster,
    # )
    # Instead, you can also do
    client = cluster.get_client()

    # Now we use SMAC to find the best hyperparameters
    smac = BlackBoxFacade(
        scenario,
        model.train,  # We pass the target function here
        overwrite=True,  # Overrides any previous results that are inconsistent with the meta-data
        dask_client=client,
    )

    incumbent = smac.optimize()

    # Get the cost of the default configuration
    default_cost = smac.validate(model.configspace.get_default_configuration())
    print(f"Default cost: {default_cost}")

    # Let's calculate the cost of the incumbent
    incumbent_cost = smac.validate(incumbent)
    print(f"Incumbent cost: {incumbent_cost}")
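
For reference, the local-machine variant mentioned in the example's docstring (no SLURMCluster, `n_workers` passed to the `Scenario` instead of a `dask_client`) would look roughly like this; a sketch, not part of the failing setup:

```python
if __name__ == "__main__":
    model = Branin()

    # n_workers on the Scenario lets SMAC parallelize locally, no Dask cluster needed
    scenario = Scenario(model.configspace, deterministic=True, n_trials=100, n_workers=2)

    smac = BlackBoxFacade(
        scenario,
        model.train,
        overwrite=True,
        # dask_client omitted -> SMAC manages its own local workers
    )
    incumbent = smac.optimize()
```
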
#### Expected Results
Using device: cuda

#### Actual Results
`Using device: cpu`, followed by a crash:

Traceback (most recent call last):
  File "home/example70.py", line 135, in <module>
    incumbent = smac.optimize()
  File "/home/venv/lib/python3.10/site-packages/smac/facade/abstract_facade.py", line 319, in optimize
    incumbents = self._optimizer.optimize(data_to_scatter=data_to_scatter)
  File "/home/venv/lib/python3.10/site-packages/smac/main/smbo.py", line 304, in optimize
    self._runner.submit_trial(trial_info=trial_info, **dask_data_to_scatter)
  File "/home/venv/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 141, in submit_trial
    self._process_pending_trials()
  File "/home/venv/lib/python3.10/site-packages/smac/runner/dask_runner.py", line 208, in _process_pending_trials
    self._results_queue.append(trial.result())
  File "/home/venv/lib/python3.10/site-packages/distributed/client.py", line 320, in result
    return self.client.sync(self._result, callback_timeout=timeout)
  File "/home/venv/lib/python3.10/site-packages/distributed/client.py", line 328, in _result
    raise exc.with_traceback(tb)
distributed.scheduler.KilledWorker: Attempted to run task run_wrapper-59c7552af3ee317d9a0d09c069418d5b on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tcp://10.5.166.193:37123. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
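
As a first step in the direction the error message suggests, the workers can be queried directly for GPU visibility before calling `smac.optimize()`. A sketch using Dask's `Client.run`; `client` is the cluster client from the script above:

```python
import torch

# Runs the function once on every connected Dask worker and returns a
# {worker_address: result} dict showing whether each worker sees a GPU.
print(client.run(torch.cuda.is_available))

# The worker logs mentioned in the error message are written to the
# log_directory passed to SLURMCluster ("tmp/smac_dask_slurm" above).
```
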
#### Versions
`smac.__version__` is not set (`AttributeError: module 'smac' has no attribute '__version__'`); 2.02 (from pip)
@alexandertornede
Contributor

Thanks for reporting this, we will look into it.
