
ParallelRunner raises AttributeError: The following data sets cannot be used by multiprocessing... on datasets not involved in --pipeline being run #3804

Closed
yury-fedotov opened this issue Apr 11, 2024 · 7 comments


Description

Using ParallelRunner puts some restrictions on datasets involved in the run, as the logs mention:

In order to utilize multiprocessing you need to make sure all data sets are serialisable, i.e. data sets should not make use of lambda functions, nested functions, closures etc.
If you are using custom decorators ensure they are correctly decorated using functools.wraps().

Having this constraint on the datasets involved in the pipeline executed with ParallelRunner makes total sense.

However, I found out that if any dataset in the catalog doesn't adhere to this, usage of ParallelRunner becomes impossible even for pipelines that have nothing to do with those datasets.

In other words, the following raises AttributeError: The following data sets cannot be used by multiprocessing...:

kedro run --pipeline pipeline_that_doesnt_involve_problematic_datasets --runner=ParallelRunner
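
To make the behaviour concrete, here is a minimal sketch of how I understand the failure, using the Python API directly instead of the CLI. All names are made up for illustration, and the non-picklable dataset is simulated with a LambdaDataset built from lambdas, as a stand-in for the Spark datasets in my actual project (assumes kedro 0.19 APIs):

```python
import pandas as pd
from kedro.io import DataCatalog, LambdaDataset, MemoryDataset
from kedro.pipeline import node, pipeline
from kedro.runner import ParallelRunner


def profile(df: pd.DataFrame) -> pd.DataFrame:
    return df.describe()


# Stand-in for a dataset that cannot be pickled (here via lambdas; in my project
# it is Spark datasets that trip this check).
problematic = LambdaDataset(load=lambda: pd.DataFrame(), save=lambda data: None)

catalog = DataCatalog(
    datasets={
        "problematic_table": problematic,  # never touched by the pipeline below
        "small_table": MemoryDataset(pd.DataFrame({"a": [1, 2, 3]})),
    }
)

# A pipeline that only reads "small_table".
pipe = pipeline([node(profile, inputs="small_table", outputs="small_table_profile")])

if __name__ == "__main__":
    # Expected: runs fine, because "problematic_table" is not part of this pipeline.
    # Observed: AttributeError: The following data sets cannot be used by
    # multiprocessing: ['problematic_table'] ...
    ParallelRunner().run(pipe, catalog)
```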

Context

This error prevents large projects from leveraging the advantages that ParallelRunner can bring, whenever even a single dataset in the catalog doesn't adhere to the runner's requirements.

Steps to Reproduce

  1. Create a pipeline that uses datasets not adhering to ParallelRunner requirements but that runs fine with SequentialRunner. Let this pipeline have 2 outputs, e.g. pandas dataframes.
  2. Create a second pipeline that does some profiling of those tables, like df.describe(). Build it as one modular pipeline instantiated under 2 namespaces, one per table (see the sketch after this list).
  3. Run the first pipeline with SequentialRunner to produce those 2 outputs.
  4. Try running the second pipeline with ParallelRunner, since it should be able to process those 2 namespaces in parallel, and see the error raised.
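
A rough sketch of the second (profiling) pipeline from step 2, with illustrative names only: one modular pipeline instantiated under two namespaces, so ParallelRunner could in principle profile the two tables in parallel.

```python
import pandas as pd
from kedro.pipeline import node, pipeline


def describe_table(df: pd.DataFrame) -> pd.DataFrame:
    return df.describe()


def create_profiling_pipeline():
    # One reusable modular pipeline...
    base = pipeline([node(describe_table, inputs="table", outputs="profile")])
    # ...instantiated under two namespaces, one per table produced by the first pipeline.
    return pipeline(
        base, namespace="table_1", inputs={"table": "table_1"}
    ) + pipeline(
        base, namespace="table_2", inputs={"table": "table_2"}
    )
```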

Expected Result

The second pipeline involves no datasets that violate ParallelRunner requirements, so it should execute without errors. The runner should not check requirements for datasets that are not involved in the pipeline being run.

Actual Result

ParallelRunner raises AttributeError: The following data sets cannot be used by multiprocessing... on datasets not involved in the --pipeline being run.

Your Environment

  • Kedro version used (pip show kedro or kedro -V): 0.19.3
  • Python version used (python -V): 3.10
  • Operating system and version: Windows 10, Spark 3.5

noklam commented Apr 11, 2024

Haven't read the full thing. Was this working prior to 0.19? In general we recommend ThreadRunner because multiprocessing doesn't work with Spark. The computation doesn't happen locally anyway, so it does not make sense to use multiprocessing.

Would you be able to provide a demo repository that we can run on our side? Something modified from an existing starter would be good enough.


yury-fedotov commented Apr 19, 2024

@noklam Hey! Sorry for late reply.

  1. I haven't tested in < 0.19 tbh.
  2. ThreadRunner has limitations too - e.g. matplotlib does not work with it, since that package is thread-unsafe. That's a limitation in my use case, since the whole point of moving away from SequentialRunner is to parallelize nodes that generate big partitioned datasets of plt.Figures (rough sketch below).
  3. ParallelRunner does not work with Spark - that I get. So the fact that it's not able to run pipelines involving SparkDataset or SparkHiveDataset is clear. But the problem I described is a bit different: if your catalog has any Spark datasets, ParallelRunner cannot be used even for pipelines that have nothing to do with those catalog datasets.
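
For context on point 2, a rough sketch of the kind of node I mean (the names and grouping column are made up; in the real project the output is registered as a partitioned matplotlib dataset in the catalog):

```python
from typing import Dict

import matplotlib

matplotlib.use("Agg")  # headless backend for batch runs
import matplotlib.pyplot as plt
import pandas as pd


def plot_partitions(df: pd.DataFrame) -> Dict[str, plt.Figure]:
    """Return one figure per group; several such nodes are what I want
    ParallelRunner to execute in parallel."""
    figures = {}
    for group, group_df in df.groupby("category"):
        fig, ax = plt.subplots()
        group_df.plot(ax=ax)
        figures[str(group)] = fig
    return figures
```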

On providing the repo - unfortunately I'm not sure I'll have time for that in the near future, but I will post here if I manage to.


noklam commented Apr 19, 2024

Got it, would this be resolved if the datasets were somehow lazily initialised?

@yury-fedotov

> Got it, would this be resolved if the datasets were somehow lazily initialised?

Yeah, lazy initialization would resolve this. That's my understanding, since if I comment out those datasets, it works fine.

Does Kedro support lazy initialization somehow?


noklam commented Apr 27, 2024

Kedro-datasets is lazily imported, but I think during initialisation the DataCatalog would create instances for the entire catalog.
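
If that is right, here is a rough illustration of the consequence (illustrative entries only, not Kedro's internal code; running this would need kedro-datasets[spark] installed):

```python
from kedro.io import DataCatalog

catalog_config = {
    # entry the selected --pipeline never touches
    "problematic_table": {
        "type": "spark.SparkDataset",
        "filepath": "data/01_raw/problematic_table",
        "file_format": "parquet",
    },
    # entry the selected --pipeline actually reads
    "small_table": {
        "type": "pandas.CSVDataset",
        "filepath": "data/01_raw/small_table.csv",
    },
}

# Both entries are turned into dataset objects here, before any runner or pipeline
# selection is involved - so ParallelRunner's serialisability check sees them all.
catalog = DataCatalog.from_config(catalog_config)
```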

@merelcht (Member)

This seems to be related to #2829

@astrojuanlu (Member)

Indeed, closing this as a duplicate of #2829; they are the same problem.

@astrojuanlu closed this as not planned (duplicate) on May 27, 2024