
Question: Sparse heterogeneous data #118

Open · chrisrossi opened this issue May 18, 2023 · 14 comments

chrisrossi commented May 18, 2023

Hi Rüdiger et al.,

David Phelan has asked me to look at using this library for dClimate, and I was hoping you could help answer a question for me.

We have weather station telemetry data that has the general form:

timestamp   var1  var2  var3
2023-05-18  2.1         6.5
2023-05-19  2.1         6.5
2023-05-20
2023-05-21  2.1   4.2
2023-05-22  2.1   4.2
2023-05-23  2.1   4.2   6.5

In plain English, for a specific index, we have several variables, any of which may or may not have a value.

David did point me to this older work where it's clear you were thinking about heterogeneous data and missing values, but looking at the Banyan repository, it's not obvious to me how these are handled. Is this use case accounted for in Banyan, or should we be looking elsewhere?

If this use case is accounted for, what's the best way to go about it? A separate tree for each variable, using Option<T> for the value, or a single tree with a tuple value?

I'm not sure if this is related or not, but I don't have a good handle on what the "Forest" abstraction is or does. Is there any information you could provide about that?

Thank you!
Chris

rkuhn commented May 19, 2023

Hi Chris,

this repo just tackles the issue of storing a persistent tree data structure on IPFS; it does not implement the columnar storage that @rklaehn blogged about. Rüdiger should be able to give you better pointers.

Regards,

Roland

rklaehn commented May 21, 2023

Hi @chrisrossi. Just saw this. I try to limit my screen time on Sundays, but will give you a detailed answer some time next week.

rklaehn commented May 22, 2023

OK, so what banyan does for you is the following:

  • it combines, compresses and encrypts multiple log entries in an efficient way
  • it allows you to define indexing data that gets bubbled up to the root of the tree
  • it allows you to do efficient queries for the part of the data for which you have indexes
  • it allows you to do efficient offset or offset range queries for any data, even if you don't have an index

What it does not do is prescribe a format for your data; it has to be something that can be serialized to cbor, that's all. It is not a database where you can easily define custom indexes, but more like a database construction kit where you have to put some thought into what you want to index.

So the question is: how will this data be queried? Are you going to, e.g., query just the data for var1 over multiple logs? I assume you also want to query on time ranges.

The easiest thing you could do is store the data in a row-based format and tag each value with the set of variables it contains and the time. The summary would then also be a set of variables and a time range. This will work just fine and give you decent compression, but it is not optimal.
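
For illustration, the shape of such a scheme might look like this; these names are mine for this sketch, not banyan's actual API:

use std::collections::BTreeSet;
use std::ops::Range;

// One log entry: a timestamp plus whichever variables were reported.
struct Row {
    time: u64,
    values: Vec<(String, f64)>,
}

// The tag stored with each entry: which variables are present, and when.
struct Key {
    time: u64,
    vars: BTreeSet<String>,
}

// The summary bubbled up the tree: the union of the variable sets below
// plus the covered time range. A query like "var1 between t0 and t1" can
// then skip any subtree whose summary lacks var1 or whose time range
// does not overlap [t0, t1].
struct Summary {
    time: Range<u64>,
    vars: BTreeSet<String>,
}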

For optimal compression you would have to transpose the data to a columnar format, as I described in https://rklaehn.github.io/2018/06/10/efficient-telemetry-storage-on-ipfs/ . If you do that you can expect extremely good compression, but it is a bit more work. It is worth it, though, if you expect to store a lot of data.

Here are some measurements from my blog post:

                      size      size/sample
JSON                  16265164  162.65164
CBOR                  12268523  122.68523
JSON/deflate          870921    8.70921
CBOR/deflate          923066    9.23066
Columns/CBOR/deflate  162721    1.62721
  + dedup             161769    1.61769
Theoretical optimum   112500    1.125

The columnar + compressed approach is more than a factor of 5 better than the naive "compressed rows" approach. Banyan makes it easy to plug in this row->column transformation.

However, I would suggest starting with the row based approach to get your feet wet and switch to the columnar approach later.

HTH,

Rüdiger

chrisrossi (Author) commented

Got it. Thank you for clarifying. Will be getting into it this week.

Thanks!
Chris

chrisrossi (Author) commented

@rklaehn Is there an open source example of something using the columnar approach with Banyan?

Thanks!
Chris

rklaehn commented May 22, 2023

> @rklaehn Is there an open source example of something using the columnar approach with Banyan?
>
> Thanks! Chris

This https://github.com/Actyx/banyan/blob/master/banyan-utils/src/tag_index.rs is a little bit like it, and a long time ago I started on an implementation of this approach in Rust here: https://github.com/rklaehn/iptm-rs/tree/master/src

But the production-quality tag index is in the Actyx code base, which is proprietary.

Depending on how your data looks (nested or not), this does not have to be hard, though.

chrisrossi commented May 22, 2023

Thanks, I'll take a look at those! The data, AFAIK, is all expressible in CSV, so simple tables, no nesting. I'll look at the columnar approach, but I'm also thinking about just adding a field to each row that is a bitmap indicating the presence/absence of each variable.

E.g., this example data would be transformed from:

timestamp   var1  var2  var3
2023-05-18  2.1         6.5
2023-05-19  2.1         6.5
2023-05-20
2023-05-21  2.1   4.2
2023-05-22  2.1   4.2
2023-05-23  2.1   4.2   6.5

to this (2023-05-20 can be omitted):

2023-05-18, 0b101, 2.1, 6.5
2023-05-19, 0b101, 2.1, 6.5
2023-05-21, 0b110, 2.1, 4.2
2023-05-22, 0b110, 2.1, 4.2
2023-05-23, 0b111, 2.1, 4.2, 6.5

Obviously, I'm anticipating data with many more than 3 columns.

rklaehn commented May 23, 2023

Just f64 values, no need for a variant? How many columns approximately?

So a value would be something like

struct Row {
  time: u64,
  values: BTreeMap<String, f64>,
}

?

A key would then be the set of strings for each column name present in the row, plus a timestamp.

chrisrossi commented May 23, 2023

More or less, yes. To get around storing the column names per row, I might do something more like:

struct Row {
    time: u64,
    values_present: Bitmap,
    values: Vec<f64>,
}

With column order/names defined elsewhere.
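
A minimal sketch of that encoding, assuming at most 64 columns so that a plain u64 can serve as the bitmap (Bitmap, pack and get are illustrative names):

// Presence bitmap for up to 64 columns; bit i set means column i has a value.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Bitmap(u64);

struct Row {
    time: u64,
    values_present: Bitmap,
    values: Vec<f64>, // only the values that are present, in column order
}

// Pack a sparse row (one Option per column) into bitmap + dense values.
fn pack(time: u64, sparse: &[Option<f64>]) -> Row {
    let mut bits = 0u64;
    let mut values = Vec::new();
    for (i, v) in sparse.iter().enumerate() {
        if let Some(v) = v {
            bits |= 1 << i;
            values.push(*v);
        }
    }
    Row { time, values_present: Bitmap(bits), values }
}

// Look up the value of column i, if present: the number of set bits
// below bit i gives the position within the dense values vector.
fn get(row: &Row, i: usize) -> Option<f64> {
    if row.values_present.0 & (1 << i) == 0 {
        return None;
    }
    let rank = (row.values_present.0 & ((1u64 << i) - 1)).count_ones() as usize;
    Some(row.values[rank])
}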

rklaehn commented May 23, 2023

Yes, I was asking what is in a Row in principle. Storing this exactly as-is would obviously be inefficient.

So what you would do if you wanted a columnar approach would be to define a RowSeq type that efficiently stores a sequence of rows, and then use that in banyan. E.g. something like this:

struct RowSeq {
  names: Vec<String>,   // column names, unique (or unique and sorted)
  times: Vec<u64>,      // time for each row, should be delta encoded when serializing
  columns: Vec<Column>, // columns, same number of items as names
}

struct Column {
  values: Vec<f64>,  // values
  times: Vec<usize>, // indexes into the times table, guaranteed to be strictly sorted,
                     // should be delta encoded when serializing; this is the equivalent
                     // of the bitmap in your approach, except that it is transposed
}
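
For reference, "delta encoded when serializing" means storing differences instead of absolute values, so that monotonically increasing sequences become small, repetitive numbers that compress well. A sketch:

// Delta-encode a non-decreasing sequence (timestamps, row indexes).
fn delta_encode(xs: &[u64]) -> Vec<u64> {
    let mut prev = 0;
    xs.iter()
        .map(|&x| {
            let d = x - prev;
            prev = x;
            d
        })
        .collect()
}

// Inverse: a running sum restores the original sequence.
fn delta_decode(deltas: &[u64]) -> Vec<u64> {
    let mut acc = 0;
    deltas
        .iter()
        .map(|&d| {
            acc += d;
            acc
        })
        .collect()
}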

You can then write things like getting a single Row out of a RowSeq and making a RowSeq out of n Rows, serializing and deserializing these, and then write some property-based tests using proptest or quickcheck to make sure these transformations are correct.
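
A sketch of those two transformations, with Row as defined above (derives added for the test below; from_rows and to_rows are hypothetical names, not banyan API):

use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
struct Row {
    time: u64,
    values: BTreeMap<String, f64>,
}

impl RowSeq {
    // Transpose n rows into columnar form. Assumes rows are sorted by time.
    fn from_rows(rows: &[Row]) -> RowSeq {
        let mut seq = RowSeq { names: vec![], times: vec![], columns: vec![] };
        for (i, row) in rows.iter().enumerate() {
            seq.times.push(row.time);
            for (name, value) in &row.values {
                // find the column for this name, creating it on first sight
                let c = match seq.names.iter().position(|n| n == name) {
                    Some(c) => c,
                    None => {
                        seq.names.push(name.clone());
                        seq.columns.push(Column { values: vec![], times: vec![] });
                        seq.names.len() - 1
                    }
                };
                seq.columns[c].values.push(*value);
                seq.columns[c].times.push(i); // row index, strictly increasing per column
            }
        }
        seq
    }

    // Get rows back out by scattering each column into its rows.
    fn to_rows(&self) -> Vec<Row> {
        let mut rows: Vec<Row> = self
            .times
            .iter()
            .map(|&time| Row { time, values: BTreeMap::new() })
            .collect();
        for (name, col) in self.names.iter().zip(&self.columns) {
            for (&i, &v) in col.times.iter().zip(&col.values) {
                rows[i].values.insert(name.clone(), v);
            }
        }
        rows
    }
}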

proptest is the better crate. It has composable minimization of failed tests.
https://crates.io/crates/proptest
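
A round-trip property test might then look like this (a sketch, assuming the from_rows/to_rows above):

use proptest::prelude::*;

proptest! {
    #[test]
    fn rows_roundtrip(
        // BTreeMap guarantees unique, sorted timestamps; the inner maps are
        // the sparse per-row values. NORMAL floats avoid NaN, which would
        // break the equality check.
        input in proptest::collection::btree_map(
            any::<u64>(),
            proptest::collection::btree_map("[a-z]{1,3}", proptest::num::f64::NORMAL, 0..4),
            0..32,
        )
    ) {
        let rows: Vec<Row> = input
            .into_iter()
            .map(|(time, values)| Row { time, values })
            .collect();
        let back = RowSeq::from_rows(&rows).to_rows();
        prop_assert_eq!(rows, back);
    }
}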

chrisrossi (Author) commented

Ok, so if I'm following along, RowSeq is the value type stored in Banyan. So instead of storing a single row per index in the Banyan tree, we'll store a certain number of rows (transformed into columnar format) at each index in the tree. Is there a heuristic for deciding how many rows to store in a RowSeq? Or have I misunderstood?

rklaehn commented May 24, 2023

> Ok, so if I'm following along, RowSeq is the value type stored in Banyan. So instead of storing a single row per index in the Banyan tree, we'll store a certain number of rows (transformed into columnar format) at each index in the tree. Is there a heuristic for deciding how many rows to store in a RowSeq? Or have I misunderstood?

Almost correct.

One of the things banyan does for you is to give you the option to store the rows in a columnar format, which is almost always more efficient because you have more similar data closer to each other. This helps the compression algorithm find the redundancy.

You are storing RowSeq values, but indexing is still by individual rows. How data is distributed over the leaves should not matter for the indexing: if you store a row at index 1234, it should still be at index 1234 even if you pack the data differently.

What you will typically do with an active data source is store data as quickly as possible as it comes in, creating a long chain, and then "pack" the data into bigger leaves. The heuristic for how many rows to pack into one leaf is a configuration parameter of banyan and is based on the compressed size, not the uncompressed size.

This is very important: different data has vastly different compression rates, and for efficient distribution you want the final compressed and encrypted blocks to be of roughly the same size.

https://docs.rs/banyan/0.17.1/banyan/struct.Config.html

pub struct Config {
    pub max_summary_branches: usize,
    pub max_key_branches: usize,
    pub max_leaf_count: usize,
    pub target_leaf_size: usize,
    pub max_uncompressed_leaf_size: usize,
    pub zstd_level: i32,
}

target_leaf_size is the approximate compressed leaf size you are targeting.

All the other parameters are to prevent bad things from happening, and to give you some control over the shape of the tree.

E.g. max_uncompressed_leaf_size caps the uncompressed size of a leaf, to prevent leaves that are unwieldy to handle in uncompressed form.
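
For example, a configuration targeting ~64 KiB compressed leaves might look like this, assuming the struct can be constructed literally as listed above (the numbers are illustrative, not tuned recommendations):

let config = banyan::Config {
    max_summary_branches: 32,
    max_key_branches: 32,
    max_leaf_count: 1024,
    target_leaf_size: 1 << 16,            // aim for ~64 KiB compressed per leaf
    max_uncompressed_leaf_size: 16 << 20, // hard cap: 16 MiB uncompressed
    zstd_level: 10,
};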

rklaehn commented May 24, 2023

@rkuhn any chance you can open source some of the relevant code? It would be really helpful here I think.

rklaehn commented May 24, 2023

Ah, turns out that the tag index part is open source already. It is not exactly what you need, but it might help as inspiration...

https://github.com/Actyx/cbor-tag-index
