Optional plain HDF5 data mapping without reconstruction of data source type #505

Open
koehlerson opened this issue Nov 15, 2023 · 10 comments

Comments

@koehlerson
Contributor

koehlerson commented Nov 15, 2023

As discussed in Ferrite-FEM/Ferrite.jl#678, it would be nice to be able to "dump" data sometimes, especially at the beginning of a computational project, when structs still change. The custom serialization interface offers a way to map bijectively from A -> Aserialized (and vice versa). This, however, is not always needed; sometimes only certain fields need to be stored, namely those that can be expressed as Julia primitives/isbitstypes or arrays thereof.

As an example: Let's say I have a simulation with some struct that holds the state of a material that I'm simulating. This struct is not needed to reproduce the simulation but serves more as "intermediate" results that may be relevant for postprocessing purposes. So there is not really a need to store the full type or a custom serialization of it. Instead, I just want to store a single scalar of this type. See https://ferrite-fem.github.io/Ferrite.jl/stable/examples/plasticity/, especially the MaterialState. You only need the fields (all are isbitstype or primitives).

What I envision is some function, let's call it stuff_to_store for the sake of this issue, whose default method is dispatched on ::Any and is the identity mapping. A user can then overload this function for their own struct and specify a NamedTuple of things to save. What should be expected from the user is that all fields of the NamedTuple are directly supported by HDF5, i.e. they are primitives, isbitstypes, or arrays of these things. The tricky bit is to call this for whatever should be stored and check whether stuff_to_store has a specific method for the current object. If so, the NamedTuple is saved. If not, the usual JLD2 machinery should kick in with all the nice features that are already implemented (including custom serialization).
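A minimal sketch of this dispatch idea, independent of JLD2. Everything here is an assumption made for illustration: `stuff_to_store` is the hypothetical hook named above, `has_plain_mapping` is a made-up helper, and the `MaterialState` fields are invented rather than copied from Ferrite.jl.

```julia
# Hypothetical hook from this proposal; not part of JLD2.
# Fallback: identity, meaning "no plain mapping defined, use regular storage".
stuff_to_store(x) = x

# Illustrative struct; the field names are made up for this sketch.
struct MaterialState
    k::Float64               # scalar internal variable
    eps_p::Vector{Float64}   # some strain components
end

# User opt-in: choose the fields to dump as a NamedTuple of HDF5-friendly values.
stuff_to_store(s::MaterialState) = (k = s.k, eps_p = s.eps_p)

# Detect a type-specific method by comparing the selected method with the fallback.
has_plain_mapping(x) =
    which(stuff_to_store, Tuple{typeof(x)}) !== which(stuff_to_store, Tuple{Any})

s = MaterialState(0.5, [0.0, 0.1])
has_plain_mapping(s)     # a specific method exists, so the NamedTuple would be saved
has_plain_mapping(1.0)   # only the fallback applies, so regular storage would be used
```

Comparing methods via `which` is only one way to detect an overload; a trait function returning `true`/`false` per type would be an alternative that avoids reflection.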

This way, one could achieve an optional "plain HDF5" storage behavior in JLD2. I'm not quite sure where to start tackling this problem, but as @JonasIsensee pointed out, a separate package on top of JLD2 would be a good start. To be helpful, I'd need some guidance on where we could sneak in such dispatches. The description above is already packed with implementation detail even though it isn't meant to be. More or less everything should be understood as an example of how to get to an optional "plain HDF5" storage scheme within JLD2.

@koehlerson
Contributor Author

Along those lines, I think the same people who would be interested in something like this would also be interested in syntactic sugar for #504, i.e. something like file["field"][] returning the NamedTuple representation or any other bare-bones representation.

@JonasIsensee
Collaborator

Along those lines, I think the same people who would be interested in something like this would also be interested in syntactic sugar for #504, i.e. something like file["field"][] returning the NamedTuple representation or any other bare-bones representation.

That sounds nice. This syntax file["field"][] sadly can't work, since file["field"] already tries to load the regular struct before the second getindex is called. Something like file["field", :plain] could work. HDF5.jl works with that kind of syntax.
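To illustrate only the call syntax, here is a small sketch on a stand-in container. `PlainDemoFile`, `fields_as_namedtuple`, and `DemoState` are hypothetical names for this example and not JLD2 types. The mock holds live objects, whereas a real implementation would have to produce the NamedTuple while reading, without constructing the struct first.

```julia
# Stand-in container for this sketch; NOT the JLD2 file type.
struct PlainDemoFile
    data::Dict{String,Any}
end

# One-level dump of a struct's fields into a NamedTuple.
function fields_as_namedtuple(x::T) where {T}
    names = fieldnames(T)
    return NamedTuple{names}(map(n -> getfield(x, n), names))
end

# file["key"] returns the regular object.
Base.getindex(f::PlainDemoFile, key::AbstractString) = f.data[key]

# file["key", :plain] lowers to getindex(file, "key", :plain),
# so the mode is known before anything is returned.
function Base.getindex(f::PlainDemoFile, key::AbstractString, mode::Symbol)
    mode === :plain || throw(ArgumentError("unsupported mode: $mode"))
    return fields_as_namedtuple(f.data[key])
end

struct DemoState
    k::Float64
    n::Int
end

file = PlainDemoFile(Dict{String,Any}("field" => DemoState(0.5, 3)))
regular = file["field"]          # the DemoState object
plain = file["field", :plain]    # NamedTuple with fields k and n
```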

@koehlerson
Contributor Author

koehlerson commented Nov 18, 2023

Something like file["field", :plain] could work.

that sounds nice!

@JonasIsensee
Collaborator

JonasIsensee commented Nov 18, 2023

I like this idea of an optional bijection.

The place to start, I'd say, is to define the expected behaviour and desired API style.
E.g.

  • Do you want to implement a conversion function? Or give a list of fields to keep?
  • How should it compose:
    • only convert at toplevel (no nesting)
    • Partial nesting: Require that the result named tuple only contains basic types and other (same) converted structures. Do not mix with regular structs.
    • full nesting: recursively convert all instances
      Potential problem: the conversion is not a bijection. Loading of (normal) structs with converted fields will not work.
    • What should it do with singleton structures? These are encoded as nothing (i.e. there's the type definition in the file but no actual data attached to it), so the info would be lost on reconstruction.
    • What are basic types ? ;) HDF5 has endless different number and string definitions. Also, e.g. Dicts are also implemented using CustomSerialization.
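The difference between the top-level and full-nesting options can be sketched without JLD2. This is an illustration under assumptions: `plain_shallow`, `plain_deep`, and the example structs are hypothetical, and the set of "basic types" treated as leaves is an arbitrary choice made for this example.

```julia
# Top level only: fields keep their original (possibly struct) values.
function plain_shallow(x::T) where {T}
    names = fieldnames(T)
    return NamedTuple{names}(map(n -> getfield(x, n), names))
end

# Full nesting: recurse into fields and array elements.
# Leaves (an arbitrary choice of "basic types" for this sketch) are returned as-is.
plain_deep(x::Union{Number,AbstractString,Symbol,Nothing}) = x
plain_deep(x::AbstractArray) = map(plain_deep, x)
function plain_deep(x::T) where {T}
    names = fieldnames(T)
    return NamedTuple{names}(map(n -> plain_deep(getfield(x, n)), names))
end

struct Inner
    a::Float64
end

struct Outer
    inner::Inner
    v::Vector{Inner}
end

struct Marker end   # singleton: no fields at all

o = Outer(Inner(1.0), [Inner(2.0), Inner(3.0)])
shallow = plain_shallow(o)        # shallow.inner is still an Inner
deep = plain_deep(o)              # nested NamedTuples all the way down
marker = plain_deep(Marker())     # empty NamedTuple: the type identity is gone
```

The last line shows the singleton concern in miniature: once converted, nothing in the data distinguishes one singleton type from another.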

Another quirk:
JLD2 typically inlines isbits fields (e.g. other structs) into parent structs.
This makes it difficult to make NT reconstruction do the right thing.

One idea:
Do normal JLD2 storage, dumping everything as normal.
Give JLD2 a list of Types to reconstruct and make everything else return NamedTuples.

Another thing to consider is #487 . Julia can construct arbitrarily large structs i.e. through structs with large ntuples as fields or through code generation. JLD2 has a hard limit on the size of the type description.

Last thing is type stability / compilation / run time.
Depending on the desired behaviour, one could also implement a generic JLD2.Compound that is returned for all non-basic types and supports type-unstable getindex or getproperty for the fields.
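A rough sketch of what such a generic container could look like. The name `Compound` is taken from the comment above, but this definition, its fields, and the `original_type` helper are assumptions for illustration and not an existing JLD2 type.

```julia
# Hypothetical generic container; not provided by JLD2.
# Field values live in a Dict{Symbol,Any}, so access is type-unstable by design.
struct Compound
    typename::Symbol
    fields::Dict{Symbol,Any}
end

# getproperty is overloaded below, so internal access goes through getfield.
original_type(c::Compound) = getfield(c, :typename)

Base.getproperty(c::Compound, name::Symbol) = getfield(c, :fields)[name]
Base.getindex(c::Compound, name::Symbol) = getfield(c, :fields)[name]
Base.propertynames(c::Compound) = Tuple(keys(getfield(c, :fields)))

c = Compound(:MaterialState, Dict{Symbol,Any}(:k => 0.5, :n => 3))
c.k                  # field access via getproperty
c[:n]                # field access via getindex
original_type(c)     # the stored type name
```

The trade-off is the one named above: a single concrete return type keeps compilation cheap, at the cost of dynamic dispatch on every field access.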

@JonasIsensee
Collaborator

@koehlerson you probably didn't get notifications after I first commented with an incomplete message.
What are your thoughts?

To start playing around with ideas, one doesn't really need to work with JLD2 at all.
I think it would be sufficient to use something like

struct MockFile
    d::Dict{String,Any}
end

and build an interface that does the desired things.
This could then be tested against.
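One possible shape for such a mock interface, combining it with the "list of types to reconstruct" idea from the earlier comment. This is a sketch under assumptions: `load_entry`, its `reconstruct` keyword, and the example structs are hypothetical, and the mock only handles flat structs.

```julia
struct MockFile
    d::Dict{String,Any}
end
MockFile() = MockFile(Dict{String,Any}())

# "Saving" stores the type name plus a NamedTuple of the fields (flat structs only).
function Base.setindex!(f::MockFile, x::T, key::AbstractString) where {T}
    names = fieldnames(T)
    fields = NamedTuple{names}(map(n -> getfield(x, n), names))
    f.d[key] = (typename = nameof(T), fields = fields)
    return f
end

# "Loading" reconstructs only the types listed in `reconstruct` (hypothetical keyword);
# every other entry comes back as its NamedTuple of fields.
function load_entry(f::MockFile, key::AbstractString; reconstruct = ())
    rec = f.d[key]
    for T in reconstruct
        nameof(T) === rec.typename && return T(rec.fields...)
    end
    return rec.fields
end

struct SimParams
    E::Float64
    nsteps::Int
end

struct LatentState
    k::Float64
end

file = MockFile()
file["params"] = SimParams(210.0, 10)
file["state"] = LatentState(0.5)

params = load_entry(file, "params"; reconstruct = (SimParams,))   # rebuilt struct
state = load_entry(file, "state"; reconstruct = (SimParams,))     # plain NamedTuple
```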

@koehlerson
Contributor Author

koehlerson commented Dec 1, 2023

Hey sorry for the late reply!

Conversion function vs list of fields

Do you want to implement a conversion function? Or give a list of fields to keep?

So far, I have only used some of the fields directly, but I can imagine that some people may want to compute something else before saving it. So I think a function would be nice for the flexibility it offers. Some folks may also want to compute a different representation for saving (thinking of #487 and the variety of array layouts, ML model layouts, etc.) than the layout present in the struct's fields.


Composability

How should it compose:

I'm not sure I can follow 100%, but what I would imagine is that you either have the bijection or the full nesting approach, i.e.

full nesting: recursively convert all instances

Why I'm unsure if I can follow is the following

Loading of (normal) structs with converted fields will not work.

Do you mean that you cannot recreate the struct that was saved? From my perspective this would be the desired behavior: you opt out of the bijection and only save e.g. a NamedTuple with "crucial information" in a simpler form (primitives, isbitstype, ...). If a user wants to rebuild something, nothing is guaranteed, and it's up to the user to save as much as needed to rebuild on their own with some custom function rebuild(file, typename) in their codebase. But maybe I don't understand the statement correctly, or perhaps my view on this issue is too biased by my personal use case.

What should it do with singleton structures? These are encoded as nothing (i.e. there's the type definition in the file but no actual data attached to it), so the info would be lost on reconstruction.

This is a very good point, since we have in Ferrite.jl also some singletons that are exposed to the user. For me personally I'd be fine with saving a string, but I'm not sure if there is any downside to it.
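Saving a singleton as a string could look roughly like this. The helper names and the example types are hypothetical; the sketch assumes the user supplies the list of candidate types when decoding, since the string alone carries no module or type information.

```julia
# Hypothetical singleton marker types, standing in for user-facing singletons.
struct TriangleShape end
struct QuadShape end

# Encode: keep only the type's name as a string.
encode_singleton(x) = String(nameof(typeof(x)))

# Decode: look the name up in a user-supplied collection of known types.
function decode_singleton(name::AbstractString, known)
    for T in known
        String(nameof(T)) == name && return T()
    end
    throw(ArgumentError("unknown singleton name: $name"))
end

stored = encode_singleton(TriangleShape())
restored = decode_singleton(stored, (TriangleShape, QuadShape))
```

One downside of this scheme is that two singleton types with the same name in different modules become indistinguishable after encoding.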

What are basic types ? ;) HDF5 has endless different number and string definitions. Also, e.g. Dicts are also implemented using CustomSerialization.

Maybe the set of things supported by HDF5.jl? https://juliaio.github.io/HDF5.jl/stable/#Supported-data-types If I understand correctly, what you are saying is that a Dict is also deconstructed in some way that is stable across versions? If so, it's of course possible to include that too. I guess the custom serialization dispatches for these objects are somewhere in JLD2.jl. Nonetheless, supporting exactly what HDF5 supports feels somewhat robust, even though this is purely a gut feeling and there is no rational argument behind it from my side :D


Other remarks

Do normal JLD2 storage, dumping everything as normal.
Give JLD2 a list of Types to reconstruct and make everything else return NamedTuples.

This sounds nice, especially since this solves in my head somewhat a user experience problem. Usually I have one big simulation struct with the parameters and I want to reconstruct it, but everything else is okay to be a NamedTuple especially intermediate "latent" results

Another thing to consider is #487 . Julia can construct arbitrarily large structs i.e. through structs with large ntuples as fields or through code generation. JLD2 has a hard limit on the size of the type description.

Does this mean that the ML model from the issue is serialized by JLD2 and the type parameters are too large which could be dodged by utilizing the "plain HDF5" approach with NamedTuple?


To start playing around with ideas, one doesn't really need to work with JLD2 at all.
I think it would be sufficient to use something like

That sounds nice, I will do that as soon as you have given some feedback, because I'm quite unsure to what extent these ideas make sense. My thinking is probably a bit too narrowly focused on my specific problem, so I'm happy to hear other perspectives :)

@JonasIsensee
Collaborator

Why I'm unsure if I can follow is the following

Loading of (normal) structs with converted fields will not work.

Here's an example of a fundamental problem.
This already errors and would also error when you don't try to reconstruct types.
The only way to change this, I guess, would be to disallow non-concrete element types (except Any).
Here, the parent "structure" was just an array, but the same issue would appear for abstract type-restricted struct fields.

julia> using JLD2

julia> struct N <: Real; x::Int; end

julia> arr = Real[1, 2, N(3)]
3-element Vector{Real}:
    1
    2
 N(3)

julia> jldsave("test.jld2"; arr)

## new session
julia> using JLD2

julia> load("test.jld2")
┌ Warning: type Main.N does not exist in workspace; reconstructing
└ @ JLD2 ~/.julia/dev/JLD2.jl/src/data/reconstructing_datatypes.jl:605
Error encountered while load FileIO.File{FileIO.DataFormat{:JLD2}, String}("test.jld2").

Fatal error:
ERROR: MethodError: Cannot `convert` an object of type JLD2.ReconstructedStatic{:N, (:x,), Tuple{Int64}} to an object of type Real

@JonasIsensee
Collaborator

I'm sorry, this is a complex topic, so my answers will be a bit disorganized.
In my view, there are a few problems that could be addressed here.

  1. Some objects have type signatures that JLD2 cannot store. CustomSerialization cannot help here, because it converts the data but still tries to encode the original type signature for loading.

  2. Some objects are heavily nested and immutable. JLD2 tries to inline immutable fields, which yields a struct that is too large for the HDF5 standard (64 kB is the maximum for a type description). This is what happens in "Inexact error when saving large data without compresssion" #487.
    Both (1) and (2) need a different encoding in the file and, if reconstruction is desired, some new way to encode the type signature.

  3. Some very basic Julia objects such as Vector{Real} can be almost impossible to reconstruct in JLD2 currently. Anyone can define new subtypes that may be missing on load. It is not possible to detect this at the type level, and NamedTuples are not <: Real. One could possibly try to reconstruct all abstract types, e.g. Real, as Any, but someone would have to try. [ Quick background info: When loading, JLD2 retrieves all the type info and generates fairly efficient and type-stable code that then loads the data from top to bottom. This is necessary to make things fast but inevitably fails when unexpected types pop up somewhere in between. ]
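The core of problem (3) can be reproduced in plain Julia without JLD2. In this sketch, `try_store` is a hypothetical helper, and the NamedTuple stands in for whatever a missing type would be reconstructed as.

```julia
# A user-defined subtype of Real, as in the example earlier in this thread.
struct N <: Real
    x::Int
end

arr = Real[1, 2, N(3)]   # fine while N is defined
plain = (x = 3,)         # stand-in for what N(3) becomes once N is unknown

# Hypothetical helper: build a T-typed vector, returning the error instead of throwing.
function try_store(::Type{T}, items...) where {T}
    try
        return T[items...]
    catch err
        return err
    end
end

strict = try_store(Real, 1, 2, plain)    # fails: a NamedTuple cannot be converted to Real
relaxed = try_store(Any, 1, 2, plain)    # works once the element type is widened to Any
```

Widening abstract element types to `Any` on load would sidestep the conversion error, at the cost of no longer returning the container type that was saved.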

@JonasIsensee
Collaborator

This sounds nice, especially since this solves in my head somewhat a user experience problem. Usually I have one big simulation struct with the parameters and I want to reconstruct it, but everything else is okay to be a NamedTuple especially intermediate "latent" results

A place to start experimenting might be the usability of typemap (see #504 and the typemap section in the docs).

Here is an experimental package I built at some point. It has some tooling for retrieving type info from JLD2 files.
This could be used to help generate the typemap Dicts.
https://github.com/JonasIsensee/JLD3.jl/

The second (orthogonal) approach would be to implement a function that does all the conversions you can think of prior to handing it to JLD2.

@JonasIsensee
Collaborator

Please test out #522.

I can't say it's elegant but it worked for my test cases.

2 participants