
ValueError from default configuration of joblib Parallel in DataTransformer #260

Open
PJPRoche opened this issue Dec 20, 2022 · 0 comments
Labels
bug Something isn't working pending review This issue needs to be further reviewed, so work cannot be started

Comments

Problem Description

I encountered a ValueError as I increased the number of rows in my training set, and I have isolated the issue to the automatic memory mapping in the joblib Parallel class, which CTGAN depends on. I believe the error is caused by the write permissions on the folder where Parallel creates its memory maps by default. The problem is that CTGAN provides no way to pass additional parameters to the Parallel instance to control how the memory mapping is handled.

ctgan.data_transformer uses Parallel with only a single, hard-coded parameter:

return Parallel(n_jobs=-1)(processes)

For larger datasets, arrays passed to the workers that exceed a size threshold (the Parallel default, max_nbytes, is "1M") trigger automatic memory mapping in a temp folder. Because there is no way to pass initial parameters to Parallel, CTGAN is at the mercy of how this is handled by default. This has all worked fine on my laptop, but I am now running into errors on Databricks when running my CTGAN model.
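As a standalone illustration (not CTGAN code), the sketch below shows how these joblib parameters interact; the array size, worker count, and temp folder path are examples only:

```python
import numpy as np
from joblib import Parallel, delayed


def write_first(arr):
    # With joblib's defaults, arrays above max_nbytes (default "1M") are
    # memory-mapped into the workers with mmap_mode="r", so an in-place
    # assignment like this raises "ValueError: assignment destination is
    # read-only".
    arr[0] = 1.0
    return arr[0]


big = np.zeros(2_000_000)  # ~16 MB, well above the default 1M threshold

# Passing a writable mmap_mode (and an explicitly writable temp_folder)
# avoids the error.
result = Parallel(n_jobs=2, temp_folder="/tmp", mmap_mode="r+")(
    [delayed(write_first)(big)]
)
print(result)  # [1.0]
```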

ValueError: assignment destination is read-only
---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/databricks/python/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 428, in _process_worker
    r = call_item()
  File "/databricks/python/lib/python3.8/site-packages/joblib/externals/loky/process_executor.py", line 275, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/databricks/python/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 620, in __call__
    return self.func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/databricks/python/lib/python3.8/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/local_disk0/.ephemeral_nfs/envs/pythonEnv-2d8ce7d7-4206-43c2-9fd3-0aad944e643a/lib/python3.8/site-packages/ctgan/data_transformer.py", line 112, in _transform_continuous
    data[column_name] = data[column_name].to_numpy().flatten()
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/frame.py", line 3163, in __setitem__
    self._set_item(key, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/frame.py", line 3243, in _set_item
    NDFrame._set_item(self, key, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/generic.py", line 3832, in _set_item
    NDFrame._iset_item(self, loc, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/generic.py", line 3821, in _iset_item
    self._mgr.iset(loc, value)
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 1110, in iset
    blk.set_inplace(blk_locs, value_getitem(val_locs))
  File "/databricks/python/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 363, in set_inplace
    self.values[locs] = values
ValueError: assignment destination is read-only
"""

Expected behavior

The parallelization of the data transformations in CTGAN is a crucial step. Would it not be better to provide some configuration scope over it? I can confirm that constructing the Parallel instance as in the example below resolves my original ValueError.

Parallel(n_jobs=-1, temp_folder="/tmp", mmap_mode="r+")(processes)
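One possible shape for that configuration, sketched here with a hypothetical parallel_kwargs parameter that is not part of the current CTGAN API:

```python
from joblib import Parallel


class DataTransformer:
    """Sketch only: `parallel_kwargs` is a hypothetical parameter,
    not part of the current CTGAN API."""

    def __init__(self, parallel_kwargs=None):
        # Defaults preserve today's behaviour (n_jobs=-1, joblib defaults);
        # callers may override or extend them.
        self._parallel_kwargs = {'n_jobs': -1, **(parallel_kwargs or {})}

    def _run(self, processes):
        return Parallel(**self._parallel_kwargs)(processes)


# Usage: opt in to a writable memmap in a known-writable temp folder.
transformer = DataTransformer(
    parallel_kwargs={'temp_folder': '/tmp', 'mmap_mode': 'r+'}
)
```

Keeping the override as a plain dict means any future Parallel parameter works without further CTGAN changes.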

Additional context

I have also opened an issue (joblib/joblib#1373) in the joblib repo, because I think there is some inconsistency, or at least a need for further clarification, in the way multiple input parameters interact. But that does not change the need for the ability to better configure Parallel in CTGAN.

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • CTGAN version: 0.5.2
  • Python version: 3.8
  • Operating System: Databricks 10.4