Add support to save model using optimized data formats #504

Open
dmey opened this issue Jul 15, 2020 · 11 comments

@dmey commented Jul 15, 2020

I am currently working with a large model (> 2 GB as JSON). This leads to large files and slow loading due to parsing. Are there any plans to add drivers to save to other formats like HDF5?

@tnagler (Collaborator) commented Jul 15, 2020

That sounds like a good idea. I don't know much about the binary-serialization landscape in C++ though, so we'll need to do some research first. (P.S.: parametric models will be much smaller.)

@tvatter (Collaborator) commented Jul 15, 2020

Interesting! The current implementation uses boost property trees, which can be exported to a number of data formats that can represent such a tree, including XML, INI, and JSON.

Are you using truncated models? If not, you should consider it. Also, are you using nonparametric families? I just tried a 2000-dimensional parametric model truncated after 2 trees and got less than 2 MB. For a nonparametric model, it was below 200 MB.

The issue with nonparametric models is that a 30x30 grid of numbers needs to be stored for each pair, and JSON, XML, and the like are plain-text formats, so the model will take a lot of space in any non-binary format.
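
(As a rough back-of-the-envelope illustration of that overhead, a sketch in Python, not part of the library: one 30x30 grid of doubles is 7.2 kB as raw binary, but typically two to three times that when written as full-precision plain text.)

import json
import numpy as np

# One 30x30 grid of doubles, roughly what is stored per pair copula.
grid = np.random.uniform(size=(30, 30))

binary_bytes = grid.size * 8                 # raw doubles: 7200 bytes
text_bytes = len(json.dumps(grid.tolist()))  # full-precision plain text: typically 2-3x larger

print(binary_bytes, text_bytes)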

@dmey (Author) commented Jul 16, 2020

> Are you using truncated models? If not, you should consider it. Also, are you using nonparametric families? I just tried a 2000-dimensional parametric model truncated after 2 trees and got less than 2 MB. For a nonparametric model, it was below 200 MB.

We have set this to tll with 50 truncation levels at the moment. Happy to share the model (~200 MB compressed).

> The issue with nonparametric models is that a 30x30 grid of numbers needs to be stored for each pair, and JSON, XML, and the like are plain-text formats, so the model will take a lot of space in any non-binary format.

This may be an edge case for copula users. Otherwise, would it be possible to consider moving to something like HDF5 in future releases, instead of text-based formats, which are inherently inefficient for large data volumes? (Some C++ examples: https://support.hdfgroup.org/HDF5/doc/cpplus_RM/examples.html.)
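
(For illustration only, a minimal sketch of what storing the per-pair grids in HDF5 could look like from Python, using h5py; the file layout and dataset names below are hypothetical, not an existing vinecopulib format.)

import h5py
import numpy as np

# Hypothetical layout: one compressed dataset per pair-copula grid.
grids = {
    "tree_1/pc_0_1": np.random.uniform(size=(30, 30)),
    "tree_1/pc_1_2": np.random.uniform(size=(30, 30)),
}

with h5py.File("model.h5", "w") as f:
    for name, grid in grids.items():
        f.create_dataset(name, data=grid, compression="gzip")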

@tvatter (Collaborator) commented Jul 16, 2020

A 2000-dimensional nonparametric model with 50 trees is definitely something that we aim to be able to handle. To be honest, we've never encountered this issue because, when experimenting with such large models, we were using the R interface, where objects are saved in a binary format rather than plain text.

But the Python bindings really only wrap the C++ code, and I've been looking at solutions to this issue. The main problem is that plain text without compression isn't a good idea for that model size. As you've noticed, the compressed files are much smaller, meaning that we could surely do a lot better.

One way that I think is sensible without adding other dependencies (i.e., moving towards HDF5 is currently not feasible because of this requirement): use boost serialization and hook up some compression on top (see, e.g., here and here).
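
(To show just the compression half of that idea from the Python side, a sketch that gzips the plain-text export; it assumes a fitted pyvinecopulib model `cop` with a JSON export as in the examples later in this thread, and is not the boost-based C++ approach proposed above.)

import gzip
import shutil

cop.to_json("model.json")  # existing plain-text export
with open("model.json", "rb") as src, gzip.open("model.json.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)  # text compresses well, as the ~200 MB compressed model suggests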

@tnagler What do you think?

@tvatter (Collaborator) commented Jul 16, 2020

Also, HDF5 was considered as an addition to boost serialization a long time ago, but it didn't pan out. Not sure why.

@tnagler (Collaborator) commented Jul 17, 2020

> One way that I think is sensible without adding other dependencies (i.e., moving towards HDF5 is currently not feasible because of this requirement): use boost serialization and hook up some compression on top (see, e.g., here and here).
>
> @tnagler What do you think?

I like it! Seems both easy and solid.

@tvatter (Collaborator) commented Jul 17, 2020

Quick update: after noticing that boost serialization isn't header-only, we need to find another way.

@tvatter (Collaborator) commented Jul 12, 2021

Small update. I spent some time today on this issue. Since #539, we are now using https://github.com/nlohmann/json instead of boost::property_tree.

In C++, this lets us do things like

#include <fstream>

std::ofstream o("pretty.json");
o << vinecop_json << std::endl; // compact output, without whitespace

and

#include <iomanip> // std::setw

std::ofstream o("pretty.json");
o << std::setw(4) << bicop_json << std::endl; // pretty-printed with 4-space indentation

I'm integrating this into pyvinecopulib right now.

tvatter closed this as completed Jul 12, 2021
tvatter reopened this Jul 12, 2021
@tvatter (Collaborator) commented Jul 12, 2021

Sorry, shouldn't have closed right away :)

@tvatter (Collaborator) commented Jul 12, 2021

Alright, it's in pyvinecopulib: https://github.com/vinecopulib/pyvinecopulib/tree/v0.6.0

I'm closing for now, don't hesitate to reopen!

tvatter closed this as completed Jul 12, 2021
@tvatter (Collaborator) commented Jul 13, 2021

OK, I just did the following:

# Import the required libraries
import numpy as np
import pyvinecopulib as pv

# Simulate some data
np.random.seed(1234)  # seed for the random generator
n = 100  # number of observations
d = 2000  # the dimension
mean = np.random.normal(size=d)  # mean vector
cov = np.random.normal(size=(d, d))  # covariance matrix
cov = np.dot(cov.transpose(), cov)  # make it non-negative definite
x = np.random.multivariate_normal(mean, cov, n)

# Transform to copula data (pseudo-observations) using the empirical distribution
u = pv.to_pseudo_obs(x)

# Fit a vine with nonparametric (tll) pair-copulas, truncated after 50 trees
# (the data is multivariate normal, so the true dependence structure is Gaussian)
controls = pv.FitControlsVinecop(family_set=[pv.BicopFamily.tll],
                                 trunc_lvl=50,
                                 num_threads=60)
cop = pv.Vinecop(u, controls=controls)

And then

%timeit cop.to_json("test.json")
%timeit test2 = pv.Vinecop("test.json")

[screenshot of the %timeit output]

I was also watching top, and the memory spiked to around 16.6/17.3 GB when writing/reading. So not ideal, but acceptable IMO. The file size is still 1.7 GB, however. One thing that could be done is "truncating" the floating-point numbers, since we currently have 900 parameters for each PC and 1999 + 1998 + ... + 1950 PCs.
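
(A rough sketch of that truncation idea as a post-processing step in Python, rounding every float in the exported JSON; `truncate_floats` is just an illustrative helper, not library functionality, and `test.json` is the file from the example above.)

import json

def truncate_floats(obj, digits=6):
    # Recursively round every float in a parsed JSON document.
    if isinstance(obj, float):
        return round(obj, digits)
    if isinstance(obj, list):
        return [truncate_floats(v, digits) for v in obj]
    if isinstance(obj, dict):
        return {k: truncate_floats(v, digits) for k, v in obj.items()}
    return obj

with open("test.json") as f:
    model = json.load(f)

with open("test_truncated.json", "w") as f:
    json.dump(truncate_floats(model), f)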
