
Question: Sparse heterogeneous data #118

Open · chrisrossi opened this issue May 18, 2023 · 14 comments

chrisrossi commented May 18, 2023

Hi Rüdiger et al.,

David Phelan has asked me to look at using this library for dClimate, and I was hoping you could help answer a question for me.

We have weather station telemetry data that has the general form:

timestamp   var1  var2  var3
2023-05-18  2.1         6.5
2023-05-19  2.1         6.5
2023-05-20
2023-05-21  2.1   4.2
2023-05-22  2.1   4.2
2023-05-23  2.1   4.2   6.5

In plain English, for a specific index, we have several variables, any of which may or may not have a value.

David did point me to this older work where it's clear you were thinking about heterogeneous data and missing values, but looking at the Banyan repository, it's not obvious to me how these are handled. Is this use case accounted for in Banyan, or should we be looking elsewhere?

If this use case is accounted for, what's the best way to go about it? A separate tree for each variable, using Option<T> for the value, or a single tree with a tuple value?

I'm not sure if this is related or not, but I don't have a good handle on what the "Forest" abstraction is or does. Is there any information you could provide about that?

Thank you!
Chris

rkuhn commented May 19, 2023

Hi Chris,

this repo just tackles the issue of storing a persistent tree data structure on IPFS; it does not implement the columnar storage that @rklaehn blogged about. Rüdiger should be able to give you better pointers.

Regards,

Roland

rklaehn commented May 21, 2023

Hi @chrisrossi. Just saw this. I try to limit my screen time on Sundays, but will give you a detailed answer some time next week.

rklaehn commented May 22, 2023

OK, so what banyan does for you is the following:

  • it combines, compresses and encrypts multiple log entries in an efficient way
  • it allows you to define indexing data that gets bubbled up to the root of the tree
  • it allows you to do efficient queries for the part of the data for which you have indexes
  • it allows you to do efficient offset or offset range queries for any data, even if you don't have an index

What it does not do is prescribe a format for your data; it has to be something that can be serialized to cbor, that's all. It is not a database where you can easily define custom indexes, but more like a database construction kit where you have to put some thought into what you want to index.

So the question is: how will this data be queried? Are you going to, e.g., query just the data for var1 over multiple logs? I assume you also want to query on time ranges.

The easiest thing you could do is store the data in a row-based format and tag each value with the set of variables it contains and the time. The summary would then also be a set of variables and a time range. This will work just fine and give you decent compression, but it is not optimal.
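
For illustration, the shape of such a scheme might look like this; these names are mine for this sketch, not banyan's actual API:

use std::collections::BTreeSet;
use std::ops::Range;

// One log entry: a timestamp plus whichever variables were reported.
struct Row {
    time: u64,
    values: Vec<(String, f64)>,
}

// The tag stored with each entry: which variables are present, and when.
struct Key {
    time: u64,
    vars: BTreeSet<String>,
}

// The summary bubbled up the tree: the union of the variable sets below
// plus the covered time range. A query like "var1 between t0 and t1" can
// then skip any subtree whose summary lacks var1 or whose time range
// does not overlap [t0, t1].
struct Summary {
    time: Range<u64>,
    vars: BTreeSet<String>,
}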

For optimal compression you would have to transpose the data to a columnar format, as I described in https://rklaehn.github.io/2018/06/10/efficient-telemetry-storage-on-ipfs/ . If you do that you can expect extremely good compression, but it is a bit more work. It is worth it, though, if you expect to store a lot of data.

Here are some measurements from my blog post:

                      size      size/sample
JSON                  16265164  162.65164
CBOR                  12268523  122.68523
JSON/deflate          870921    8.70921
CBOR/deflate          923066    9.23066
Columns/CBOR/deflate  162721    1.62721
  + dedup             161769    1.61769
Theoretical optimum   112500    1.125

The columnar + compressed approach is more than a factor of 5 better than the naive "compressed rows" approach. Banyan makes it easy to plug in this row->column transformation.

However, I would suggest starting with the row based approach to get your feet wet and switch to the columnar approach later.

HTH,

Rüdiger

chrisrossi (Author) commented

Got it. Thank you for clarifying. Will be getting into it this week.

Thanks!
Chris

chrisrossi (Author) commented

@rklaehn Is there an open source example of something using the columnar approach with Banyan?

Thanks!
Chris

rklaehn commented May 22, 2023

> @rklaehn Is there an open source example of something using the columnar approach with Banyan?
>
> Thanks! Chris

This https://github.com/Actyx/banyan/blob/master/banyan-utils/src/tag_index.rs is a little bit like it, and a long time ago I started on an implementation of this approach in Rust here: https://github.com/rklaehn/iptm-rs/tree/master/src

But the production-quality tag index is in the Actyx code base, which is proprietary.

Depending on how your data looks (nested or not), this does not have to be hard, though.

chrisrossi commented May 22, 2023

Thanks, I'll take a look at those! The data, AFAIK, is all expressible in CSV, so simple tables, no nesting. I'll look at the columnar approach, but I'm also thinking about just adding a field to each row that is a bitmap indicating the presence/absence of each variable.

E.g., this example data would be transformed from:

timestamp   var1  var2  var3
2023-05-18  2.1         6.5
2023-05-19  2.1         6.5
2023-05-20
2023-05-21  2.1   4.2
2023-05-22  2.1   4.2
2023-05-23  2.1   4.2   6.5

to this (2023-05-20 can be omitted):

2023-05-18, 0b101, 2.1, 6.5
2023-05-19, 0b101, 2.1, 6.5
2023-05-21, 0b110, 2.1, 4.2
2023-05-22, 0b110, 2.1, 4.2
2023-05-23, 0b111, 2.1, 4.2, 6.5

Obviously, I'm anticipating data with many more than 3 columns.

rklaehn commented May 23, 2023

Just f64 values, no need for a variant? How many columns approximately?

So a value would be something like

struct Row {
  time: u64,
  values: BTreeMap<String, f64>,
}

?

A key would then be the set of strings for each column name present in the row, plus a timestamp.

chrisrossi commented May 23, 2023

More or less, yes. To get around storing the column names per row, I might do something more like:

struct Row {
    time: u64,
    values_present: Bitmap,
    values: Vec<f64>,
}

With column order/names defined elsewhere.
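
A minimal sketch of that encoding, assuming at most 64 columns so that a plain u64 can serve as the bitmap (Bitmap, pack and get are illustrative names):

// Presence bitmap for up to 64 columns; bit i set means column i has a value.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bitmap(u64);

struct Row {
    time: u64,
    values_present: Bitmap,
    values: Vec<f64>, // only the values that are present, in column order
}

// Pack a sparse row (one Option per column) into bitmap + dense values.
fn pack(time: u64, sparse: &[Option<f64>]) -> Row {
    let mut bits = 0u64;
    let mut values = Vec::new();
    for (i, v) in sparse.iter().enumerate() {
        if let Some(v) = v {
            bits |= 1 << i;
            values.push(*v);
        }
    }
    Row { time, values_present: Bitmap(bits), values }
}

// Look up the value of column i, if present: the number of set bits
// below bit i gives the position within the dense values vector.
fn get(row: &Row, i: usize) -> Option<f64> {
    if row.values_present.0 & (1 << i) == 0 {
        return None;
    }
    let rank = (row.values_present.0 & ((1u64 << i) - 1)).count_ones() as usize;
    Some(row.values[rank])
}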

rklaehn commented May 23, 2023

Yes, I was asking what is in a Row in principle. Storing this exactly as-is would obviously be inefficient.

So what you would do if you wanted a columnar approach would be to define a RowSeq type that efficiently stores a sequence of rows, and then use that in banyan. E.g. something like this:

struct RowSeq {
  names: Vec<String>,   // column names, unique (or unique and sorted)
  times: Vec<u64>,      // time for each row, should be delta encoded when serializing
  columns: Vec<Column>, // columns, same number of items as names
}

struct Column {
  values: Vec<f64>,  // values
  times: Vec<usize>, // indexes into the times table, guaranteed to be strictly sorted,
                     // should be delta encoded when serializing; this is the equivalent
                     // of the bitmap in your approach, except that it is transposed
}
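
For reference, "delta encoded when serializing" means storing differences instead of absolute values, so that monotonically increasing sequences become small, repetitive numbers that compress well. A sketch:

// Delta-encode a non-decreasing sequence (timestamps, row indexes).
fn delta_encode(xs: &[u64]) -> Vec<u64> {
    let mut prev = 0;
    xs.iter()
        .map(|&x| {
            let d = x - prev;
            prev = x;
            d
        })
        .collect()
}

// Inverse: a running sum restores the original sequence.
fn delta_decode(deltas: &[u64]) -> Vec<u64> {
    let mut acc = 0;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}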

You can then write things like getting a single Row out of a RowSeq and making a RowSeq out of n Rows, serializing and deserializing these, and then write some property-based tests using proptest or quickcheck to make sure these transformations are correct.
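
A sketch of those two transformations, with Row as defined above (derives added for the test below; from_rows and to_rows are hypothetical names, not banyan API):

use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
struct Row {
    time: u64,
    values: BTreeMap<String, f64>,
}

impl RowSeq {
    // Transpose n rows into columnar form. Assumes rows are sorted by time.
    fn from_rows(rows: &[Row]) -> RowSeq {
        let mut seq = RowSeq { names: vec![], times: vec![], columns: vec![] };
        for (i, row) in rows.iter().enumerate() {
            seq.times.push(row.time);
            for (name, value) in &row.values {
                // find the column for this name, creating it on first sight
                let c = match seq.names.iter().position(|n| n == name) {
                    Some(c) => c,
                    None => {
                        seq.names.push(name.clone());
                        seq.columns.push(Column { values: vec![], times: vec![] });
                        seq.names.len() - 1
                    }
                };
                seq.columns[c].values.push(*value);
                seq.columns[c].times.push(i); // row index, strictly increasing per column
            }
        }
        seq
    }

    // Get rows back out by scattering each column into its rows.
    fn to_rows(&self) -> Vec<Row> {
        let mut rows: Vec<Row> = self
            .times
            .iter()
            .map(|&time| Row { time, values: BTreeMap::new() })
            .collect();
        for (name, col) in self.names.iter().zip(&self.columns) {
            for (&i, &v) in col.times.iter().zip(&col.values) {
                rows[i].values.insert(name.clone(), v);
            }
        }
        rows
    }
}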

proptest is the better crate. It has composable minimization of failed tests.
https://crates.io/crates/proptest
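
A round-trip property test might then look like this (a sketch, assuming the from_rows/to_rows above):

use proptest::prelude::*;

proptest! {
    #[test]
    fn rows_roundtrip(
        // BTreeMap guarantees unique, sorted timestamps; the inner maps are
        // the sparse per-row values. NORMAL floats avoid NaN, which would
        // break the equality check.
        input in proptest::collection::btree_map(
            any::<u64>(),
            proptest::collection::btree_map("[a-z]{1,3}", proptest::num::f64::NORMAL, 0..4),
            0..32,
        )
    ) {
        let rows: Vec<Row> = input
            .into_iter()
            .map(|(time, values)| Row { time, values })
            .collect();
        let back = RowSeq::from_rows(&rows).to_rows();
        prop_assert_eq!(rows, back);
    }
}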

chrisrossi (Author) commented

Ok, so if I'm following along, RowSeq is the value type stored in Banyan. So instead of storing a single row per index in the Banyan tree, we'll store a certain number of rows (transformed into columnar format) at each index in the tree. Is there a heuristic for deciding how many rows to store in a RowSeq? Or have I misunderstood?

rklaehn commented May 24, 2023

> Ok, so if I'm following along, RowSeq is the value type stored in Banyan. So instead of storing a single row per index in the Banyan tree, we'll store a certain number of rows (transformed into columnar format) at each index in the tree. Is there a heuristic for deciding how many rows to store in a RowSeq? Or have I misunderstood?

Almost correct.

One of the things banyan does for you is to give you the option to store the rows in a columnar format, which is almost always more efficient because you have more similar data closer to each other. This helps the compression algorithm find the redundancy.

You are storing RowSeq values, but indexing is still by individual rows. How data is distributed over the leaves should not matter for the indexing: if you store a row at index 1234, it should still be at index 1234 even if you pack the data differently.

What you will typically do with an active data source is store data as quickly as possible as it comes in, creating a long chain, and then "pack" the data into bigger leaves. The heuristic for how many rows to pack into one leaf is a configuration parameter of banyan and is based on the compressed size, not the uncompressed size.

This is very important: different data has vastly different compression rates, and for efficient distribution you want the final compressed and encrypted blocks to be of roughly the same size.

https://docs.rs/banyan/0.17.1/banyan/struct.Config.html

pub struct Config {
    pub max_summary_branches: usize,
    pub max_key_branches: usize,
    pub max_leaf_count: usize,
    pub target_leaf_size: usize,
    pub max_uncompressed_leaf_size: usize,
    pub zstd_level: i32,
}

target_leaf_size is the approximate compressed leaf size you are targeting.

All the other parameters are to prevent bad things from happening, and to give you some control over the shape of the tree.

E.g. max_uncompressed_leaf_size caps the uncompressed size of a leaf, to prevent leaves that are unwieldy to handle in uncompressed form.
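
For example, a configuration targeting ~64 KiB compressed leaves might look like this, assuming the struct can be constructed literally as listed above (the numbers are illustrative, not tuned recommendations):

let config = banyan::Config {
    max_summary_branches: 32,
    max_key_branches: 32,
    max_leaf_count: 1024,
    target_leaf_size: 1 << 16,            // aim for ~64 KiB compressed per leaf
    max_uncompressed_leaf_size: 16 << 20, // hard cap: 16 MiB uncompressed
    zstd_level: 10,
};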

rklaehn commented May 24, 2023

@rkuhn any chance you can open source some of the relevant code? It would be really helpful here I think.

rklaehn commented May 24, 2023

Ah, turns out that the tag index part is open source already. It is not exactly what you need, but it might help as inspiration...

https://github.com/Actyx/cbor-tag-index
