
ValueError from default configuration of joblib Parallel in DataTransformer #260

Open
PJPRoche opened this issue Dec 20, 2022 · 0 comments
Labels
bug Something isn't working pending review This issue needs to be further reviewed, so work cannot be started

Comments

Problem Description

I encountered a ValueError as I increased the number of rows in my training set, and I have isolated the issue to the automatic memory mapping in the joblib Parallel class, which CTGAN depends on. I believe the error is caused by the write permissions on the folder where Parallel creates its memory maps by default. The problem is that CTGAN provides no way to pass additional parameters to the Parallel instance to control how the memory mapping is handled.

ctgan.data_transformer uses Parallel with only a single, hard-coded parameter:

return Parallel(n_jobs=-1)(processes)

For larger datasets, arrays passed to the workers that exceed a size threshold (the Parallel default, max_nbytes, is "1M") trigger automatic memory mapping in a temp folder. Because there is no way to pass initial parameters to Parallel, CTGAN is at the mercy of how this is handled by default. This has all worked fine on my laptop, but I am now running into errors on Databricks when running my CTGAN model.
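As a standalone illustration (not CTGAN code), the sketch below shows how these joblib parameters interact; the array size, worker count, and temp folder path are examples only:

```python
import numpy as np
from joblib import Parallel, delayed


def write_first(arr):
    # With joblib's defaults, arrays above max_nbytes (default "1M") are
    # memory-mapped into the workers with mmap_mode="r", so an in-place
    # assignment like this raises "ValueError: assignment destination is
    # read-only".
    arr[0] = 1.0
    return arr[0]


big = np.zeros(2_000_000)  # ~16 MB, well above the default 1M threshold

# Passing a writable mmap_mode (and an explicitly writable temp_folder)
# avoids the error.
result = Parallel(n_jobs=2, temp_folder="/tmp", mmap_mode="r+")(
    [delayed(write_first)(big)]
)
print(result)  # [1.0]
```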

ValueError: assignment destination is read-only
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 428, in _process_worker
    r = call_item()
  File "/databricks/python/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 275, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/databricks/python/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 620, in __call__
    return self.func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-2d8ce7d7-4206-43c2-9fd3-0aad944e643a/lib/python3.8/site-packages/ctgan/data_transformer.py", line 112, in _transform_continuous
    data[column_name] = data[column_name].to_numpy().flatten()
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/frame.py", line 3163, in __setitem__
    self._set_item(key, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/frame.py", line 3243, in _set_item
    NDFrame._set_item(self, key, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/generic.py", line 3832, in _set_item
    NDFrame._iset_item(self, loc, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/generic.py", line 3821, in _iset_item
    self._mgr.iset(loc, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1110, in iset
    blk.set_inplace(blk_locs, value_getitem(val_locs))
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 363, in set_inplace
    self.values[locs] = values
ValueError: assignment destination is read-only
"""

Expected behavior

The parallelization of the data transformations in CTGAN is a crucial step. Would it not be better to provide some configuration scope over it? I can confirm that constructing the Parallel instance as in the example below resolves my original ValueError.

Parallel(n_jobs=-1, temp_folder="/tmp", mmap_mode="r+")(processes)
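One possible shape for that configuration, sketched here with a hypothetical parallel_kwargs parameter that is not part of the current CTGAN API:

```python
from joblib import Parallel


class DataTransformer:
    """Sketch only: `parallel_kwargs` is a hypothetical parameter,
    not part of the current CTGAN API."""

    def __init__(self, parallel_kwargs=None):
        # Defaults preserve today's behaviour (n_jobs=-1, joblib defaults);
        # callers may override or extend them.
        self._parallel_kwargs = {'n_jobs': -1, **(parallel_kwargs or {})}

    def _run(self, processes):
        return Parallel(**self._parallel_kwargs)(processes)


# Usage: opt in to a writable memmap in a known-writable temp folder.
transformer = DataTransformer(
    parallel_kwargs={'temp_folder': '/tmp', 'mmap_mode': 'r+'}
)
```

Keeping the override as a plain dict means any future Parallel parameter works without further CTGAN changes.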

Additional context

I have also opened an issue (joblib/joblib#1373) in the joblib repo, because I think there is some inconsistency, or at least a need for further clarification, in the way multiple input parameters interact. But that does not change the need for the ability to better configure Parallel in CTGAN.

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • CTGAN version: 0.5.2
  • Python version: 3.8
  • Operating System: Databricks 10.4