pandas dataframe with list as values #407

feizerl · 2022-09-30T20:49:35Z

Consider this code:

import jsonpickle
import jsonpickle.ext.numpy as jsonpickle_numpy
import jsonpickle.ext.pandas as jsonpickle_pandas
import pandas as pd

jsonpickle_pandas.register_handlers()
jsonpickle_numpy.register_handlers()


a = pd.DataFrame({"date": [["20220921"]]})
b = jsonpickle.decode(jsonpickle.encode(a, keys=True))

If you run this, you get:

$ py -i foo.py 
>>> a['date'][0]
['20220921']
>>> b['date'][0]
"['20220921']"

That is, the list[str] has been turned into a str.

Is there any way to avoid this?


Theelx · 2022-10-03T15:01:25Z

Apologies for the slow response, I've been busy with classes lately. This is most likely a bug, I'll look into this.

Theelx · 2022-10-08T20:17:12Z

Update: This is definitely a bug, and it'll be hard to fix. By default, pandas stores strings under the object dtype rather than a dedicated string dtype, so it's hard for jsonpickle's pandas extension to tell apart non-numeric values such as a list and a string. I'm working on fixing the encoding, but it'll be difficult because I need to add type() checking in addition to dtype checking.
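To illustrate the point about dtype versus type() checks, here is a small sketch (not jsonpickle's actual code; the variable names are made up for illustration). A Series that mixes a list, a string, and an int reports a single object dtype, so the element types only become visible by inspecting each cell:

```python
import pandas as pd

# One column holding a list, a string, and an int (illustrative data).
mixed = pd.Series([["20220921"], "20220921", 20220921])

# The column-level dtype is the same catch-all for every cell.
column_dtype = str(mixed.dtype)

# Only a per-cell type() check tells the values apart.
cell_types = [type(value).__name__ for value in mixed]

print(column_dtype)
print(cell_types)
```

An encoder that looks only at the dtype has no way to know which of these cells should come back as a list.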

Theelx · 2022-11-15T00:14:17Z

Update 2: I just realized this'll be harder to fix than I previously thought, since the lists can contain more than one type. For example, one could have a dataframe like so: pd.DataFrame({"date": [["20220921", 20220921]]}), which has both a string and an int inside the list. This means we'll have to store the type of every single item inside the list, which is going to massively bloat the encoded blob. I'm reconsidering whether it's worth fixing this bug in the general case of a multi-type list.
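As a rough illustration of the size cost, the sketch below pairs each item with its type name and compares the JSON lengths. `tag_items` is a hypothetical helper, not part of jsonpickle, and the encoding jsonpickle would actually use may differ:

```python
import json
from typing import Any, List


def tag_items(values: List[Any]) -> List[List[Any]]:
    """Pair every item with the name of its Python type (hypothetical helper)."""
    return [[value, type(value).__name__] for value in values]


plain = ["20220921", 20220921]
tagged = tag_items(plain)

plain_blob = json.dumps(plain)
tagged_blob = json.dumps(tagged)

# Extra characters needed to carry one type tag per item.
overhead = len(tagged_blob) - len(plain_blob)
print(plain_blob, tagged_blob, overhead)
```

The overhead grows with the number of items, because every element carries its own tag.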

Theelx · 2022-12-01T16:43:47Z

Apparently this is basically the same issue as #358.

Paradoxdruid · 2022-12-31T17:24:24Z

I wanted to use jsonpickle for a bioinformatics project of mine, but this behavior is a killer; I regularly have dataframes with lists of floats.

Could an implementation like this, which stores lists as lists of tuples of (value, type), be useful? It roughly doubles the size of the representation of lists as values in pandas Series, but wouldn't touch elsewhere, I think.

# a naive implementation
import json
from typing import Any, List

def list_encode(list_input: List[Any]) -> str:
    # Each item becomes a "value, type" string, e.g. "1, int"
    nested_list = [f"{each}, {type(each).__name__}" for each in list_input]
    return json.dumps(nested_list)

def decode_list(str_input: str) -> List[Any]:
    obj = json.loads(str_input)
    final_list: List[Any] = []
    for each in obj:
        value, str_type = each.split(",")
        str_type = str_type.strip()
        if str_type == "int":
            final_list.append(int(value))
        elif str_type == "float":
            final_list.append(float(value))
        # Add datetime, etc; or recursively call for nested lists
        else:
            final_list.append(value)
    return final_list

Then, an example:

# Example Encoding
sample = [1, 1.0, "Sue"]
list_encode(sample)
# Output
'["1, int", "1.0, float", "Sue, str"]'

# Example Decoding
str_sample = '["1, int", "1.0, float", "Sue, str"]'
decode_list(str_sample)
# Output
[1, 1.0, 'Sue']

So, you'd add a new check in the PandasSeriesHandler or something:

if isinstance(value, list):
    value = list_encode(value)

Theelx · 2023-01-01T18:25:48Z

Oh, if that works I'd be happy to merge it! I'll try and test it over the next few days, thanks so much for giving some example code!

Paradoxdruid · 2023-01-01T18:45:52Z

Thanks, @Theelx! Looking back at this naïve implementation I wrote, we might need to use a special character in the f-string. As I wrote it, if any of the string values in the list contain a comma, it will error out. Given that "Los Angeles, California" would be a reasonable string to have encoded in a dataset, failing on a comma is probably a bad idea. 😉
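One possible way around the comma problem, sketched under the assumption that the encoded form may be nested JSON rather than a delimited string: store each item as a [type name, value] pair and let JSON handle the quoting. `encode_list_pairs` and `decode_list_pairs` are hypothetical names, not jsonpickle API:

```python
import json
from typing import Any, Callable, Dict, List

# Casts applied on decode; any type not listed here is returned as JSON parsed it.
_CASTS: Dict[str, Callable[[Any], Any]] = {"int": int, "float": float, "str": str}


def encode_list_pairs(items: List[Any]) -> str:
    """Encode a list as JSON pairs of [type name, value] (hypothetical helper)."""
    return json.dumps([[type(item).__name__, item] for item in items])


def decode_list_pairs(blob: str) -> List[Any]:
    """Rebuild the list, casting each value back according to its stored type name."""
    result: List[Any] = []
    for type_name, value in json.loads(blob):
        cast = _CASTS.get(type_name)
        result.append(value if cast is None else cast(value))
    return result


sample = [1, 1.0, "Los Angeles, California"]
decoded = decode_list_pairs(encode_list_pairs(sample))
print(decoded)
```

Because no delimiter is parsed by hand, a comma inside a string is harmless. JSON already distinguishes int, float, and str, so the tag is redundant for these types; it would matter mainly for types JSON cannot represent directly, each of which would need its own cast entry and a JSON-friendly value form.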

Theelx · 2023-01-09T16:00:47Z

Hm, breaking on a special character isn't a good idea for library code. I'll try to change the behavior so it works for everything.

Fshahnaj · 2023-11-28T17:23:23Z

Can you please assign me for the issue #407?

stefan6419846 · 2023-11-28T17:25:31Z

No need to explicitly assign you to this issue - just start working on it and open a PR (maybe as a draft at first).

Theelx added the bug and good-first-issue labels on Oct 3, 2022

cliffckerr mentioned this issue Aug 12, 2023

Pandas series with object type get converted to strings #457

Open
