de/serialize fails for mixed type series #103

Open
crystalgreen opened this issue Jan 13, 2021 · 6 comments
Labels
version-2 Saving this for Data-Forge version 2.

Comments

@crystalgreen

Mixed types in a series are not correctly deserialized. See example below.

// Requires the 'data-forge' package and (for the assertion) 'chai'.
const dataForge = require('data-forge');
const { expect } = require('chai');

const df = new dataForge.DataFrame({
    columns: {
        Col1: ['a', new Date(2020, 11, 1), 1.1, 'd']
    },
});
console.log(df.getSeries("Col1").detectTypes().toString());
let json = JSON.stringify(df.serialize());
console.log(json);
const deserializedDf = dataForge.DataFrame.deserialize(JSON.parse(json));
console.log(deserializedDf.toString());
// This fails: row 1 held a Date, but it comes back as a string after deserialization.
// (deep.equal so that two equal Date instances would compare by value)
expect(df.at(1).Col1).to.deep.equal(deserializedDf.at(1).Col1);

The reason seems to be that the detected column type is serialized as well (even though there doesn't appear to be a public API to get that type), and during deserialization that type is applied to all values.

Idea for a fix: the column type is the most frequently occurring type. For values of other types, store [type name, value] instead of the bare value. The deserializer would detect these arrays and deserialize those other types correctly.
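A minimal sketch of how a deserializer might revive such tagged values (hypothetical code, not part of Data-Forge):

// Values of the primary column type are stored as-is; everything else is
// stored as a [typeName, value] pair that the deserializer revives.
function reviveValue(raw) {
    if (Array.isArray(raw) && raw.length === 2 && typeof raw[0] === 'string') {
        const [typeName, value] = raw;
        if (typeName === 'date') return new Date(value);
        if (typeName === 'number') return Number(value);
        if (typeName === 'boolean') return Boolean(value);
        return value; // unknown tag: fall back to the raw value
    }
    return raw; // plain value of the primary column type
}

// reviveValue(['date', '2020-12-01T00:00:00.000Z']) -> Date instance
// reviveValue('a') -> 'a'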

PS: The current values format is quite verbose. A more compact format would be an array of values per row.

Of course, this would be a breaking change for anyone who manually parses the serialized format.
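For illustration only, a row-array layout could look roughly like this (not an existing Data-Forge format):

// One array per row instead of one object per row; values of other types tagged as above.
{
    "columnOrder": ["Col1"],
    "values": [
        ["a"],
        [["date", "2020-12-01T00:00:00.000Z"]],
        [["number", 1.1]],
        ["d"]
    ]
}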

@ashleydavis
Member

Thanks for reporting the issue.

Data-Forge isn't very smart with that kind of thing. It only inspects the first item in a series to determine the type, because it would be too expensive to inspect every item in the series.

I can think of a way to fix this actually without even having to record the type of the column.

It could be implemented like how MongoDB deserializes a JSON file: if a value is `{ "$date": "..." }`, then the sub-value is parsed as a date, and the date replaces it.

That could work nicely here. What do you think?

I think this could be made to be backward compatible.
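A minimal sketch of that encoding, assuming hypothetical encode/decode helpers rather than existing Data-Forge API:

// On serialization, wrap dates as { "$date": ISO string }; on deserialization,
// detect the wrapper and replace it with a real Date again.
function encodeValue(value) {
    if (value instanceof Date) {
        return { $date: value.toISOString() };
    }
    return value;
}

function decodeValue(value) {
    if (value && typeof value === 'object' && typeof value.$date === 'string') {
        return new Date(value.$date);
    }
    return value;
}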

Would you be interested in working on it?

@crystalgreen
Author

Thanks for your answer.
I guess you would not write type info for JSON types like string, number, and boolean. But if one does not serialize the (primary) column type of a mixed-type column, the JSON size would blow up for non-JSON types like date, because a lot of redundant type info gets added per value; it is even worse for homogeneous columns of a non-JSON type, such as pure date columns.
Therefore I favor the idea of a primary column type whose values are not written with their types. This primary type can either be given by the user or determined by a heuristic that analyzes the top N rows.
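A sketch of such a heuristic (purely illustrative):

// Pick the most frequent value type among the first N rows as the primary column type.
function detectPrimaryType(values, topN = 100) {
    const counts = {};
    for (const value of values.slice(0, topN)) {
        const type = value instanceof Date ? 'date' : typeof value;
        counts[type] = (counts[type] || 0) + 1;
    }
    const types = Object.keys(counts);
    return types.length > 0
        ? types.reduce((a, b) => (counts[a] >= counts[b] ? a : b))
        : undefined;
}

// detectPrimaryType(['a', new Date(2020, 11, 1), 1.1, 'd']) -> 'string'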

In my use case I expect more diverse types than just string, number, boolean, date. More like what is covered by frictionless, e.g. duration or geopoint.
A possible future requirement for me could be a DataFrame export function to a frictionless compatible format and this requires precise column type information. (I'm not asking you to implement this, just to make another point about known column types.)

I now also need to serialize custom classes containing DataFrames. This exceeds the scope of the DataFrame serialization scheme.

Thanks for the offer to work on it, but it seems I will not use the DataFrame serialization part after all, as my scope has changed.
I started writing my own JSON serialization logic that preserves the types of any object whose type is known/registered, applying the optimizations mentioned above for arrays. I plan to extend your DataFrame/Series to add column type info that can be provided either by the user or by a heuristic. This helps my generic serializer, which will use a custom handler for DataFrame objects. It could also help other use cases, such as the frictionless export mentioned above.
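One way such a handler registry could look, just as an illustration of the idea (this is not the author's actual code):

// Each registered type provides a test plus serialize/deserialize functions;
// plain JSON values pass through unchanged.
const handlers = new Map();

function registerHandler(typeName, handler) {
    handlers.set(typeName, handler);
}

registerHandler('date', {
    test: (v) => v instanceof Date,
    serialize: (v) => v.toISOString(),
    deserialize: (raw) => new Date(raw),
});

function serializeValue(value) {
    for (const [typeName, handler] of handlers) {
        if (handler.test(value)) {
            return { $type: typeName, value: handler.serialize(value) };
        }
    }
    return value;
}

function deserializeValue(raw) {
    if (raw && typeof raw === 'object' && typeof raw.$type === 'string') {
        const handler = handlers.get(raw.$type);
        if (handler) return handler.deserialize(raw.value);
    }
    return raw;
}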

I leave it up to you if you want to close the issue or not.

PS: Great library of yours, it was the best I found (for TypeScript).

@ashleydavis
Member

Thanks for the info. Frictionless is interesting.

If you have written your own serialization code would you mind sharing it with me?

I might be able to integrate it into DF or maybe it will just give me inspiration as to what to do next.

I'll keep this issue open until DF supports mixed types in a series.

@ashleydavis ashleydavis added the version-2 Saving this for Data-Forge version 2. label Feb 1, 2021
@crystalgreen
Author

I created a reduced version of my serializer that focuses on arrays, without class serialization support.
Have a look here; maybe you can use it or at least get some inspiration from it.

@ashleydavis
Member

@crystalgreen that looks awesome. Will you make this available as an npm library or a GitHub repo by itself? I'd love to clone the project, try it out, and see how I could fit it into Data-Forge.

@crystalgreen
Author

I currently do not plan to create a public library, but this may change in the future. My full-blown library, with serialization support for classes and other JS types, is still in the early stages of development and not yet stable. I also haven't done any performance tests yet, but I am thinking about adding a JIT compiler like the one in the very fast and sophisticated serialization lib @deepkit/type, whose JIT presumably makes it one of the fastest TS serializers out there but which unfortunately does not fit my needs. Those libs have an emphasis on schema validation, while I prefer the data to describe itself, especially for mixed-type arrays.

If you like, you can use my code in the above link and adapt it to your needs. I'm open to ideas for improvement.
No promise, but I might come back to you should my lib become public one day.
