Thanks for the report. This is unfortunately a known issue with the async Client and one we will likely not be able to fix. However, there are ways to avoid this problem.
First of all, it's important to understand where this is coming from. Your data_bag.to_dataframe() call is not providing any meta, i.e. we have to guess/determine what the schema of the dataframe is going to look like. To do this, we actually run the very first task of the bag ourselves to check the schema. This is typically something users want to avoid anyway, and providing meta explicitly side-steps it, i.e.
fixes the issue. The reason is that the internal call that computes the first partition does not handle async code. This is a shortcoming of the async Client when working with dask collections.
Ideally, though, you wouldn't load the data locally at all but would do this on the cluster as well. Most users don't use the async dask Client unless they are working with the lower-level Futures interface.
For example, using dask.delayed and dask.dataframe, this would look like:
Describe the issue:
Using a dask bag to normalize the data into a dask dataframe throws:
TypeError: 'coroutine' object is not iterable
Minimal Complete Verifiable Example:
Anything else we need to know?:
The error was caused by using a dask.distributed async client. I also tried using the .flatten() method and dd.from_delayed(data_bag.to_delayed()).
Environment:
dask, version 2024.2.0
Python 3.11.6
Docker - Debian GNU/Linux 12 (bookworm)
pip