dask.bag.Bag.to_dataframe behavior change in 2024.3.0 - setting dtype to string rather than object by default #10998

Open
kbuma opened this issue Mar 12, 2024 · 4 comments


kbuma commented Mar 12, 2024

Describe the issue:

Since the 2024.3.0 update, dask.bag's to_dataframe generates DataFrames with a string dtype by default rather than object, even when the column holds non-string objects.

Minimal Complete Verifiable Example:

>>> import dask.bag as db
>>> import pandas as pd
>>> pd.DataFrame.from_records([{"obj": range(2)}]).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   obj     1 non-null      object
dtypes: object(1)
memory usage: 140.0+ bytes
>>> db.from_sequence([{"obj": range(2)}]).to_dataframe().compute().info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   obj     1 non-null      string
dtypes: string(1)
memory usage: 151.0 bytes

Anything else we need to know?:

Environment:

  • Dask version: 2024.3.0
  • Python version: 3.11.8
  • Operating System: MacOS 14.3.1
  • Install method (conda, pip, source): pip
github-actions bot added the needs triage label Mar 12, 2024
@phofl
Collaborator

phofl commented Mar 12, 2024

Hi, thanks for your report. This is expected because we are now utilising an option that converts all string-like columns to PyArrow-backed strings. You can disable that behaviour with

dask.config.set({'dataframe.convert-string': False})

@kbuma
Author

kbuma commented Mar 12, 2024

@phofl thanks for the info on how to disable that. Are there any pointers to a summary of the expected behavior changes with the new dataframe engine?

@phofl
Collaborator

phofl commented Mar 12, 2024

Hopefully not too many that are actually user-facing. We now optimise queries before we submit them to the scheduler, but we aimed for as much compatibility from an end-user perspective as possible. Please let us know if you encounter something that breaks.

@mrocklin
Member

Just checking here, I hope that we're considering this to be a bug. We may not have a good way to fix the bug short term, but certainly converting a range object to a string is unexpected and suboptimal behavior.

I think when we convert to dataframe we're looking at a few values, right? We can probably do some simple Python logic there to see if strings or objects are appropriate, right? (This may not be right, but I'm curious why not if not)
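The kind of check suggested here could look roughly like the sketch below. It is not Dask's implementation: the function names `looks_like_string_column` and `choose_dtypes` and the `sample_size` parameter are hypothetical, and the logic only separates "every sampled value is a str" from everything else, so numeric columns would also be reported as "object".

```python
from typing import Any, Dict, Iterable, List


def looks_like_string_column(sample: Iterable[Any]) -> bool:
    """Return True only if every non-missing sampled value is a str."""
    seen_string = False
    for value in sample:
        if value is None:
            continue  # treat None as missing, not as evidence either way
        if not isinstance(value, str):
            return False  # e.g. a range object: keep the column as object
        seen_string = True
    return seen_string


def choose_dtypes(records: List[Dict[str, Any]], sample_size: int = 5) -> Dict[str, str]:
    """Pick "string" or "object" per column from the first few records."""
    sample = records[:sample_size]
    columns: List[str] = []
    for record in sample:
        for key in record:
            if key not in columns:
                columns.append(key)  # preserve first-seen column order
    return {
        col: "string"
        if looks_like_string_column(rec.get(col) for rec in sample)
        else "object"
        for col in columns
    }


records = [{"obj": range(2), "name": "a"}, {"obj": range(3), "name": None}]
print(choose_dtypes(records))  # {'obj': 'object', 'name': 'string'}
```

Because only a sample is inspected, a column whose later values are not strings would still be labelled "string", which may be one reason such a check is harder to get right than it looks.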

fjetter added dataframe convert-string and removed needs triage labels Mar 14, 2024