This repository has been archived by the owner on Nov 14, 2023. It is now read-only.

n_jobs doesn't seem to be taken into account by TuneSearchCV #257

Open
RNarayan73 opened this issue Nov 8, 2022 · 3 comments

@RNarayan73 commented Nov 8, 2022

Hello,

I passed n_jobs=14, equal to the number of physical cores on my computer, when calling TuneSearchCV, as opposed to the total cpu_count of 20, which includes virtual CPUs, per the parameters below (truncated for brevity):

 {'cv': StratifiedKFold(n_splits=5, random_state=0, shuffle=True),
 'early_stopping': False,
 'error_score': nan,
 'estimator__memory': None,

 'estimator__steps': [truncated for brevity]

 'local_dir': 'C:\\Users\\naray/_python/_raytune/',
 'loggers': None,
 'max_iters': 1,
 'mode': 'max',
 'n_jobs': 14,
 'n_trials': 20,
 'name': 'ray_mp_con_main',
 'param_distributions': {'enc__target': <ray.tune.search.sample.Categorical at 0x23d31227550>,
  'enc__time__cyclicity': <ray.tune.search.sample.Categorical at 0x23d31226a40>,
  'clf__base_estimator__max_depth': <ray.tune.search.sample.Integer at 0x23d31227310>,
  'clf__base_estimator__min_child_weight': <ray.tune.search.sample.Float at 0x23cabb9c4f0>,
  'clf__base_estimator__colsample_bytree': <ray.tune.search.sample.Float at 0x23cab49bc40>,
  'clf__base_estimator__subsample': <ray.tune.search.sample.Float at 0x23cab49ace0>,
  'clf__base_estimator__learning_rate': <ray.tune.search.sample.Float at 0x23cab49a800>,
  'clf__base_estimator__gamma': <ray.tune.search.sample.Float at 0x23cab49a6e0>},
 'pipeline_auto_early_stop': True,
 'random_state': 0,
 'refit': 'avg_prec',
 'return_train_score': False,
 'scoring': {'profit_ratio': make_scorer(profit_ratio_score),
  'f_0.5': make_scorer(fbeta_score, beta=0.5, zero_division=0, pos_label=1),
  'precision': make_scorer(precision_score, zero_division=0, pos_label=1),
  'recall': make_scorer(recall_score, zero_division=0, pos_label=1),
  'avg_prec': make_scorer(average_precision_score, needs_proba=True, pos_label=1),
  'brier_rel': make_scorer(brier_rel_score, needs_proba=True, pos_label=1),
  'brier_score': make_scorer(brier_score_loss, greater_is_better=False, needs_proba=True, pos_label=1),
  'logloss_rel': make_scorer(logloss_rel_score, needs_proba=True),
  'log_loss': make_scorer(log_loss, greater_is_better=False, needs_proba=True)},
 'search_kwargs': None,
 'search_optimization': 'bayesian',
 'sk_n_jobs': 1,
 'stopper': None,
 'time_budget_s': None,
 'use_gpu': False,
 'verbose': 1}

However, it seems to ignore this and requests more than 14 CPUs while executing, as shown below:

== Status ==
Current time: 2022-11-08 14:18:43 (running for 00:00:15.96)
Memory usage on this node: 12.7/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 18.0/20 CPUs, 0/1 GPUs, 0.0/3.97 GiB heap, 0.0/1.98 GiB objects
Current best trial: 39cbd78e with average_test_avg_prec=0.5405577423096732 and parameters={'early_stopping': False, 'early_stop_type': , 'groups': None, 'cv': StratifiedKFold(n_splits=5, random_state=0, shuffle=True), 'fit_params': {}, 'scoring': {'profit_ratio': make_scorer(profit_ratio_score), 'f_0.5': make_scorer(fbeta_score, beta=0.5, zero_division=0, pos_label=1), 'precision': make_scorer(precision_score, zero_division=0, pos_label=1), 'recall': make_scorer(recall_score, zero_division=0, pos_label=1), 'avg_prec': make_scorer(average_precision_score, needs_proba=True, pos_label=1), 'brier_rel': make_scorer(brier_rel_score, needs_proba=True, pos_label=1), 'brier_score': make_scorer(brier_score_loss, greater_is_better=False, needs_proba=True, pos_label=1), 'logloss_rel': make_scorer(logloss_rel_score, needs_proba=True), 'log_loss': make_scorer(log_loss, greater_is_better=False, needs_proba=True)}, 'max_iters': 1, 'return_train_score': False, 'n_jobs': 1, 'metric_name': 'average_test_avg_prec', 'enc__target': MeanEncoder(ignore_format=True), 'enc__time__cyclicity': CyclicalFeatures(drop_original=True), 'clf__base_estimator__max_depth': 4, 'clf__base_estimator__min_child_weight': 0.00015754293696729367, 'clf__base_estimator__colsample_bytree': 0.30008924942642895, 'clf__base_estimator__subsample': 0.4478400830132758, 'clf__base_estimator__learning_rate': 0.3258309822543486, 'clf__base_estimator__gamma': 0.011208562277376221}
Result logdir: C:\Users\naray_python_raytune\ray_mp_con_main
Number of trials: 17/20 (1 PENDING, 9 RUNNING, 7 TERMINATED)

Has anyone come across this? Or am I doing something wrong?

Please let me know if you need more info.

Regards
Narayan

@Yard1 (Member) commented Nov 8, 2022

Looking at the code, we try to maintain full cluster resource utilization by dividing the total number of CPUs in the cluster by n_jobs and using that as the number of CPUs to allocate per trial. Honestly, we are probably trying to be too clever here, and this should be handled by a different system altogether. Will add that to the backlog.

As a workaround, try running ray.init(num_cpus=14) before initializing TuneSearchCV(n_jobs=None).
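
A minimal sketch of that workaround (the estimator, search space, and data here are placeholders, not the pipeline from the report above):

    import ray
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from tune_sklearn import TuneSearchCV

    # Cap Ray at the physical core count before tune-sklearn initializes it.
    ray.init(num_cpus=14)

    X, y = make_classification(n_samples=1000, random_state=0)
    search = TuneSearchCV(
        LogisticRegression(max_iter=200),                   # placeholder estimator
        param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},  # placeholder search space
        search_optimization="random",
        n_trials=20,
        n_jobs=None,  # let Ray's num_cpus cap concurrency instead
        random_state=0,
    )
    search.fit(X, y)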

@Yard1 Yard1 self-assigned this Nov 8, 2022
@RNarayan73 (Author) commented Nov 15, 2022

Hi @Yard1

On further digging, I tried out the following scenarios and made these observations:

  1. I realised that the status quo is actually worse than what I described above. Not only does it request up to the maximum number of virtual CPUs (20), it also seems to utilise at most half of the requested CPUs at any one time. I am guessing it arbitrarily assumes that the physical cores are half of the total CPUs reported by cpu_count; that may have been fine until the Intel 12th-gen CPUs came along, which have 14 cores and 20 virtual CPUs. So, with the status quo, 4 of the cores will never be used, and it might be worth revisiting the physical-core logic to use psutil.cpu_count(logical=False) instead of os.cpu_count() (the sketch after this list shows the psutil call).

  2. Your proposed workaround does show the maximum number of CPUs requested correctly (14) and does seem to run at most 14 processes at any one time, when TuneSearchCV is run once. However, it causes issues when you try to parallelise the run with joblib.Parallel using the default loky backend, as in cross_validate().

    2a) For example, when I initialise Ray separately with ray.init and then call TuneSearchCV cv.n_splits times in a loop through cross_validate(), which uses the loky backend for parallel processing, the first Ray instance (gcs_server + raylet + n_jobs processes) is completely ignored by the subsequent calls to TuneSearchCV, which go on to create entirely new instances (gcs_server + raylet + n_jobs processes) cv.n_splits times. For a 5-fold cross-validation this eventually reaches hundreds of additional processes, leading to massive oversubscription and often causing OOM errors or crashes.

    2b) If I don't call ray.init separately and follow the status quo, leaving TuneSearchCV to do the initialisation within cross_validate, there is no 'stranded' instance. Besides the issue of only using 10 cores described above, looping through cv.n_splits still creates a new Ray instance (gcs_server + raylet + n_jobs processes) for each split within cross_validate, with the same effects as above. However, there is one less instance hogging memory, causing fewer crashes, albeit still with massive oversubscription, so I have fallen back on this approach!

    2c) The only way to avoid this is to not parallelise the run at all: pass n_jobs=1 to parallel_backend and execute all cv.n_splits within cross_validate() in sequence, which is a shame as it underutilises the available cores.

    Despite the issues cited above, in all of these cases I get consistent, repeatable results for my metrics.

  3. Finally, I registered Ray as a backend with parallel_backend and used it to parallelise the runs instead of the default loky, roughly as in the sketch after this list. This works well without oversubscription, and if I declare num_cpus in ray.init() it uses all 14 cores, but it issues the warning below:

    2022-11-14 22:15:02,545 WARNING pool.py:591 -- The 'context' argument is not supported using ray. Please refer to the documentation for how to control ray initialization.
    [Parallel(n_jobs=14)]: Using backend RayBackend with 14 concurrent workers.

    However, the metric results are not repeatable, despite using exactly the same setup and random_state seed as earlier.
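
For reference, a rough sketch of the point 1 and point 3 setup combined (the estimator, search space, and data are placeholders, not my actual pipeline; requires psutil):

    import psutil
    import ray
    from joblib import parallel_backend
    from ray.util.joblib import register_ray
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate
    from tune_sklearn import TuneSearchCV

    # Point 1: count physical cores (14 here) rather than logical CPUs
    # (os.cpu_count() reports 20 on this machine).
    n_physical = psutil.cpu_count(logical=False)

    ray.init(num_cpus=n_physical)  # declare the core count to Ray up front
    register_ray()                 # point 3: make "ray" available as a joblib backend

    X, y = make_classification(n_samples=1000, random_state=0)
    search = TuneSearchCV(
        LogisticRegression(max_iter=200),                   # placeholder estimator
        param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},  # placeholder search space
        search_optimization="random",
        n_trials=20,
        random_state=0,
    )

    # The cross_validate splits now go through the Ray joblib backend
    # instead of the default loky backend.
    with parallel_backend("ray", n_jobs=n_physical):
        results = cross_validate(search, X, y, cv=5)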

Conclusions and questions:
A) As it stands, it seems that loky is incompatible with Ray for parallelisation. Is it possible to get TuneSearchCV to play nicely with loky so that we can get repeatable results?

B) What could be the reason for the non-repeatable results when using the ray backend with parallel_backend? Can it be addressed?

Thanks
Narayan

@RNarayan73 (Author)

@Yard1
Is this issue on the backlog for an imminent release?
Regards
Narayan
