pandas dataframe with list as values #407

Open
feizerl opened this issue Sep 30, 2022 · 10 comments

@feizerl

feizerl commented Sep 30, 2022

Consider this code:

import jsonpickle
import jsonpickle.ext.numpy as jsonpickle_numpy
import jsonpickle.ext.pandas as jsonpickle_pandas
import pandas as pd

jsonpickle_pandas.register_handlers()
jsonpickle_numpy.register_handlers()


a = pd.DataFrame({"date": [["20220921"]]})
b = jsonpickle.decode(jsonpickle.encode(a, keys=True))

If you run this, you get:

$ py -i foo.py 
>>> a['date'][0]
['20220921']
>>> b['date'][0]
"['20220921']"

That is, a list[str] has been changed to a str.

Is there any way to avoid this?
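As a stopgap on the decoding side, a flattened cell can sometimes be parsed back with the standard library. The sketch below is only a workaround under assumptions, not a fix in jsonpickle: `restore_cell` is a hypothetical helper, and it assumes the cell holds the `repr` of a list of plain Python literals. A genuine string that happens to look like a list would also be converted, so it is not safe in general.

```python
import ast
from typing import Any


def restore_cell(value: Any) -> Any:
    """Best-effort recovery of a list that was flattened to its repr string."""
    # Only strings that look like a list literal are candidates.
    if not (isinstance(value, str) and value.startswith("[") and value.endswith("]")):
        return value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a valid Python literal, so leave the string untouched.
        return value
    return parsed if isinstance(parsed, list) else value


decoded_cell = "['20220921']"  # the shape of b['date'][0] in the report above
restored = restore_cell(decoded_cell)
print(restored)
```

Applied to a whole column, this would be something like `b['date'].map(restore_cell)`.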

@Theelx
Contributor

Theelx commented Oct 3, 2022

Apologies for the slow response, I've been busy with classes lately. This is most likely a bug, I'll look into this.

@Theelx
Contributor

Theelx commented Oct 8, 2022

Update: This is definitely a bug, and it'll be hard to fix. By default, pandas stores strings under the generic object dtype rather than a dedicated string dtype, so it's hard for jsonpickle's pandas extension to differentiate between non-numeric types, such as a list and a string. I'm working on fixing the encoding for that, but it'll be difficult as I need to integrate type() checking too, in addition to dtype checking.

@Theelx
Contributor

Theelx commented Nov 15, 2022

Update 2: I just realized this'll be harder to fix than I previously thought, since the lists can contain more than one dtype. For example, one could have a dataframe like so: pd.DataFrame({"date": [["20220921", 20220921]]}), which has string and int dtypes inside the list. This means we'll have to store the dtypes for every single item inside the list, which is going to massively bloat the encoded blob. I'm reconsidering whether it's worth fixing this bug in the general case of a multi-dtype list.
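To get a rough feel for that overhead, here is a small sketch comparing a plain JSON encoding of the mixed list with a per-element tagged one. The `[type_name, value]` layout is an assumption made for illustration; it is not jsonpickle's actual format.

```python
import json

mixed = ["20220921", 20220921]

# Plain JSON keeps str and int apart on its own, but cannot name richer types.
plain = json.dumps(mixed)

# Hypothetical tagged layout: one [type_name, value] pair per element.
tagged = json.dumps([[type(item).__name__, item] for item in mixed])

print(plain)
print(tagged)
print(len(plain), len(tagged))
```

The tag is paid once per element, so the relative cost is largest for long lists of short values.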

@Theelx
Contributor

Theelx commented Dec 1, 2022

Apparently this is basically the same issue as #358.

@Paradoxdruid

Paradoxdruid commented Dec 31, 2022

I wanted to use jsonpickle for a bioinformatics project of mine, but this behavior is a killer; I regularly have dataframes with lists of floats.

Could an implementation like this, which stores lists as lists of tuples of (value, type), be useful? It roughly doubles the size of the representation of lists as values in pandas Series, but wouldn't touch elsewhere, I think.

# a naive implementation
import json
from typing import Any, List

def list_encode(list_input: List[Any]) -> str:
    # Store each element as the string "value, type_name"
    nested_list = [f"{each}, {type(each).__name__}" for each in list_input]
    return json.dumps(nested_list)

def decode_list(str_input: str) -> List[Any]:
    obj = json.loads(str_input)
    final_list: List[Any] = []
    for each in obj:
        # Note: this split fails if the value itself contains a comma
        value, str_type = each.split(",")
        str_type = str_type.strip()
        if str_type == "int":
            final_list.append(int(value))
        elif str_type == "float":
            final_list.append(float(value))
        # Add datetime, etc; or recursively call for nested lists
        else:
            final_list.append(value)
    return final_list

Then, an example:

# Example Encoding
sample = [1, 1.0, "Sue"]
list_encode(sample)
# Output
'["1, int", "1.0, float", "Sue, str"]'

# Example Decoding
str_sample = '["1, int", "1.0, float", "Sue, str"]'
decode_list(str_sample)
# Output
[1, 1.0, 'Sue']

So, you'd add a new check in the PandasSeriesHandler or something:

if isinstance(value, list):
    value = list_encode(value)

@Theelx
Contributor

Theelx commented Jan 1, 2023

Oh, if that works I'd be happy to merge it! I'll try and test it over the next few days, thanks so much for giving some example code!

@Paradoxdruid

Thanks, @Theelx! Looking back at this naïve implementation I wrote, we might need to use a special character in the f-string: as written, if any of the string values in the list contains a comma, it will error out. Given that "Los Angeles, California" would be a reasonable string to have encoded in a dataset, failing on a comma is probably a bad idea. 😉
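One way to sidestep the delimiter problem, sketched here under assumptions rather than as the eventual fix: store each element as a two-item JSON array instead of a comma-joined string, so the decoder never has to split text. The names `encode_typed_list`, `decode_typed_list`, and `_CONVERTERS` are hypothetical, and only `int`, `float`, and `str` are handled.

```python
import json
from typing import Any, List

# Hypothetical converters keyed by type name; extend for datetime, etc.
_CONVERTERS = {"int": int, "float": float, "str": str}


def encode_typed_list(items: List[Any]) -> str:
    # Each element becomes a [text, type_name] pair, so no delimiter parsing is needed.
    return json.dumps([[str(item), type(item).__name__] for item in items])


def decode_typed_list(payload: str) -> List[Any]:
    result: List[Any] = []
    for text, type_name in json.loads(payload):
        # Unknown type names fall back to returning the text unchanged.
        result.append(_CONVERTERS.get(type_name, str)(text))
    return result


sample = [1, 1.0, "Los Angeles, California"]
round_tripped = decode_typed_list(encode_typed_list(sample))
print(round_tripped)
```

Because JSON handles the quoting and escaping, commas, quotes, and brackets inside string values need no special treatment.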

@Theelx
Contributor

Theelx commented Jan 9, 2023

Hm, breaking on a special character isn't a good idea for library code. I'll try to change the behavior so it works for everything.

@Fshahnaj

Could you please assign me to issue #407?

@stefan6419846

No need to explicitly assign you to this issue - just start working on it and open a PR (maybe as a draft at first).
