Add support to save model using optimized data formats #504

Open
dmey opened this issue Jul 15, 2020 · 11 comments

@dmey commented Jul 15, 2020

I am currently working with a large model (> 2 GB as JSON). This leads to large files and slow loading due to parsing. Are there any plans to add drivers to save to other formats like HDF5?

@tnagler (Collaborator) commented Jul 15, 2020

That sounds like a good idea. I don't know much about the binary-serialization landscape in C++ though, so we'll need to do some research first. (P.S.: parametric models will be much smaller.)

@tvatter (Collaborator) commented Jul 15, 2020

Interesting! The current implementation uses boost property trees, which can be exported to a number of data formats that can represent such a tree, including XML, INI, and JSON.

Are you using truncated models? If not, you should consider it. Also, are you using nonparametric families? I just tried a 2000-dimensional parametric model truncated after 2 trees and got less than 2 MB. For a nonparametric model, it was below 200 MB.

The issue with nonparametric models is that a 30x30 grid of numbers needs to be stored for each pair, and JSON, XML, and the like are plain-text formats, so the model will take a lot of space in any non-binary format.
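
(As a rough back-of-the-envelope illustration of that overhead, a sketch in Python, not part of the library: one 30x30 grid of doubles is 7.2 kB as raw binary, but typically two to three times that when written as full-precision plain text.)

import json
import numpy as np

# One 30x30 grid of doubles, roughly what is stored per pair copula.
grid = np.random.uniform(size=(30, 30))

binary_bytes = grid.size * 8                 # raw doubles: 7200 bytes
text_bytes = len(json.dumps(grid.tolist()))  # full-precision plain text: typically 2-3x larger

print(binary_bytes, text_bytes)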

@dmey (Author) commented Jul 16, 2020

> Are you using truncated models? If not, you should consider it. Also, are you using nonparametric families? I just tried a 2000-dimensional parametric model truncated after 2 trees and got less than 2 MB. For a nonparametric model, it was below 200 MB.

We have set this to tll with 50 truncation levels at the moment. Happy to share the model (~200 MB compressed).

> The issue with nonparametric models is that a 30x30 grid of numbers needs to be stored for each pair, and JSON, XML, and the like are plain-text formats, so the model will take a lot of space in any non-binary format.

This may be an edge case for copula users. Otherwise, would it be possible to consider moving to something like HDF5 in future releases, instead of text-based formats, which are inherently inefficient for large data volumes? (Some C++ examples: https://support.hdfgroup.org/HDF5/doc/cpplus_RM/examples.html.)
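
(For illustration only, a minimal sketch of what storing the per-pair grids in HDF5 could look like from Python, using h5py; the file layout and dataset names below are hypothetical, not an existing vinecopulib format.)

import h5py
import numpy as np

# Hypothetical layout: one compressed dataset per pair-copula grid.
grids = {
    "tree_1/pc_0_1": np.random.uniform(size=(30, 30)),
    "tree_1/pc_1_2": np.random.uniform(size=(30, 30)),
}

with h5py.File("model.h5", "w") as f:
    for name, grid in grids.items():
        f.create_dataset(name, data=grid, compression="gzip")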

@tvatter (Collaborator) commented Jul 16, 2020

A 2000-dimensional nonparametric model with 50 trees is definitely something that we aim to be able to handle. To be honest, we've never encountered this issue because, when experimenting with such large models, we were using the R interface, where objects are saved in a binary format rather than plain text.

But the Python bindings really only wrap the C++ code, and I've been looking at solutions to this issue. The main problem is that plain text without compression isn't a good idea for that model size. As you've noticed, the compressed files are much smaller, meaning that we could surely do a lot better.

One way that I think is sensible without adding other dependencies (i.e., moving towards HDF5 is currently not feasible because of this requirement): use boost serialization and hook up some compression on top (see, e.g., here and here).
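
(To show just the compression half of that idea from the Python side, a sketch that gzips the plain-text export; it assumes a fitted pyvinecopulib model `cop` with a JSON export as in the examples later in this thread, and is not the boost-based C++ approach proposed above.)

import gzip
import shutil

cop.to_json("model.json")  # existing plain-text export
with open("model.json", "rb") as src, gzip.open("model.json.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)  # text compresses well, as the ~200 MB compressed model suggests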

@tnagler What do you think?

@tvatter (Collaborator) commented Jul 16, 2020

Also, HDF5 was considered as an addition to boost serialization a long time ago, but it didn't pan out. Not sure why.

@tnagler (Collaborator) commented Jul 17, 2020

> One way that I think is sensible without adding other dependencies (i.e., moving towards HDF5 is currently not feasible because of this requirement): use boost serialization and hook up some compression on top (see, e.g., here and here).
>
> @tnagler What do you think?

I like it! Seems both easy and solid.

@tvatter (Collaborator) commented Jul 17, 2020

Quick update: after noticing that boost serialization isn't header-only, we need to find another way.

@tvatter (Collaborator) commented Jul 12, 2021

Small update. I spent some time today on this issue. Since #539, we are now using https://github.com/nlohmann/json instead of boost::property_tree.

In C++, this lets us do things like

#include <fstream>

std::ofstream o("pretty.json");
o << vinecop_json << std::endl; // compact output, without whitespace

and

#include <iomanip> // std::setw

std::ofstream o("pretty.json");
o << std::setw(4) << bicop_json << std::endl; // pretty-printed with 4-space indentation

I'm integrating this into pyvinecopulib right now.

tvatter closed this as completed Jul 12, 2021
tvatter reopened this Jul 12, 2021
@tvatter (Collaborator) commented Jul 12, 2021

Sorry, shouldn't have closed right away :)

@tvatter (Collaborator) commented Jul 12, 2021

Alright, it's in pyvinecopulib: https://github.com/vinecopulib/pyvinecopulib/tree/v0.6.0

I'm closing for now, don't hesitate to reopen!

tvatter closed this as completed Jul 12, 2021
@tvatter (Collaborator) commented Jul 13, 2021

OK, I just did the following:

# Import the required libraries
import numpy as np
import pyvinecopulib as pv

# Simulate some data
np.random.seed(1234)  # seed for the random generator
n = 100  # number of observations
d = 2000  # the dimension
mean = np.random.normal(size=d)  # mean vector
cov = np.random.normal(size=(d, d))  # covariance matrix
cov = np.dot(cov.transpose(), cov)  # make it non-negative definite
x = np.random.multivariate_normal(mean, cov, n)

# Transform to copula data (pseudo-observations) using the empirical distribution
u = pv.to_pseudo_obs(x)

# Fit a vine with nonparametric (tll) pair-copulas, truncated after 50 trees
# (the data is multivariate normal, so the true dependence structure is Gaussian)
controls = pv.FitControlsVinecop(family_set=[pv.BicopFamily.tll],
                                 trunc_lvl=50,
                                 num_threads=60)
cop = pv.Vinecop(u, controls=controls)

And then

%timeit cop.to_json("test.json")
%timeit test2 = pv.Vinecop("test.json")

[screenshot of the %timeit output]

I was also watching top, and the memory spiked to around 16.6/17.3 GB when writing/reading. So not ideal, but acceptable IMO. The file size is still 1.7 GB, however. One thing that could be done is "truncating" the floating-point numbers, since we currently have 900 parameters for each PC and 1999 + 1998 + ... + 1950 PCs.
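
(A rough sketch of that truncation idea as a post-processing step in Python, rounding every float in the exported JSON; `truncate_floats` is just an illustrative helper, not library functionality, and `test.json` is the file from the example above.)

import json

def truncate_floats(obj, digits=6):
    # Recursively round every float in a parsed JSON document.
    if isinstance(obj, float):
        return round(obj, digits)
    if isinstance(obj, list):
        return [truncate_floats(v, digits) for v in obj]
    if isinstance(obj, dict):
        return {k: truncate_floats(v, digits) for k, v in obj.items()}
    return obj

with open("test.json") as f:
    model = json.load(f)

with open("test_truncated.json", "w") as f:
    json.dump(truncate_floats(model), f)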
