Pandas series with object type get converted to strings #457

Open
cliffckerr opened this issue Aug 12, 2023 · 0 comments
Labels
bug extensions issues affecting numpy/pandas/etc has-MRE Has a minimal reproducible example for debugging


cliffckerr commented Aug 12, 2023

This error is a slightly different take on #407 and #358.

My issue is that if a dataframe has a column of mixed type (e.g. [4, 'foo']), then it will be converted to strings on unpickling (e.g. ['4', 'foo']):

import pandas as pd
import jsonpickle as jp
import jsonpickle.ext.numpy as jsonpickle_numpy
import jsonpickle.ext.pandas as jsonpickle_pandas
jsonpickle_numpy.register_handlers()
jsonpickle_pandas.register_handlers()

# Create simple data frame of mixed data types
df1 = pd.DataFrame(dict(
  a = [1, 2],
  b = [4, 'foo']
))

# Convert to jsonpickle
string = jp.dumps(df1)
df2 = jp.loads(string)

# Show that it didn't work
print(df1.b.values)
print(df2.b.values)

assert df1.b[0] == df2.b[0] # Fails: the first is 4 and the second is '4'

I know there isn't an easy fix for this, but I'm hopeful that there's some fix, since (at least in this case) the actual dataframe data already looks a lot like JSON! In particular, I noticed that currently just a single dtype is stored per column:

"meta": {"dtypes": {"a": "int64", "b": "object"}}

I believe this particular bug could be solved by storing a list of types (one per element) whenever the dtype is object. For example, in this case:

"meta": {"dtypes": {"a": "int64", "b": {"object": ["int", "object"]}}}
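
To make the per-element idea concrete, here is a rough sketch of what I have in mind. It is not jsonpickle code: `encode_object_column`, `decode_object_column`, and the `"values"`/`"types"` keys are hypothetical names I made up for illustration, and it only handles `int`, `float`, and `str` (anything else falls back to a string). The input is a plain list, such as what `df.b.tolist()` returns for the example above.

```python
import json

# Hypothetical helpers, NOT part of jsonpickle: they only illustrate the idea
# of storing one type name per element of an object-dtype column.
_CASTS = {"int": int, "float": float, "str": str}

def encode_object_column(values):
    """Stringify each value and record its Python type name alongside it."""
    return {
        "values": [str(v) for v in values],
        "types": [type(v).__name__ for v in values],
    }

def decode_object_column(payload):
    """Cast each stored string back using its recorded type name.

    Unknown type names fall back to str, i.e. the value stays a string.
    """
    return [
        _CASTS.get(type_name, str)(value)
        for value, type_name in zip(payload["values"], payload["types"])
    ]

column = [4, "foo"]  # same mixed-type column as in the example above
encoded = json.dumps(encode_object_column(column))
restored = decode_object_column(json.loads(encoded))
```

With this, `restored[0]` comes back as the integer 4 rather than the string '4'. A real implementation would need to cover many more types (bools, None, nested objects), so treat this as a sketch of the metadata layout only.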

Personally, I also think storing values as a dict of lists (rather than as a string) would be more robust and easier to read/interpret. I feel pd.DataFrame.to_dict() is already pretty close to what would be required!
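
As a quick illustration of the dict-of-lists idea, the sketch below round-trips a plain dict of lists through the standard `json` module. The dict is written by hand here to mirror the shape that `pd.DataFrame.to_dict(orient="list")` produces for the example dataframe; `roundtrip_columns` is a hypothetical helper name, and this only works for values that JSON can represent natively.

```python
import json

def roundtrip_columns(columns):
    """Serialize a dict of lists to a JSON string and load it back."""
    return json.loads(json.dumps(columns))

# Hand-written stand-in for df1.to_dict(orient="list") from the example above
columns = {"a": [1, 2], "b": [4, "foo"]}
restored_columns = roundtrip_columns(columns)
```

Here `restored_columns["b"][0]` stays an integer, because JSON keeps numbers and strings distinct. The column dtypes (e.g. int64 vs. object) would still need to be stored separately, as in the existing "meta" entry.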

@Theelx Theelx added bug extensions issues affecting numpy/pandas/etc has-MRE Has a minimal reproducible example for debugging labels Mar 17, 2024