
Support more advanced conflict resolution strategies #22

Open
njaard opened this issue Feb 7, 2023 · 1 comment
Labels
enhancement New feature or request

Comments

njaard commented Feb 7, 2023

Right now, when you have two records (identified by the same key+timestamp), the one from the most recent commit takes precedence. This issue is to decide how to support aggregating those conflicting records instead of just discarding the old one.

User stories

  • One common use is counting events, for example recording the number of events once per day. If you have multiple sources of this data, each source accumulates into the counter.
  • A user might measure temperature. In that case, we want to store the minimum and maximum value, which means that min and max are the aggregation functions.
  • A user might store actual error messages. If you can receive more than one message per timestamp, you might want to concatenate them into a single string. Therefore, it'd be best to have a "join with delimiter" aggregation.

By combining two sum-aggregated fields in one record, one holding a count and one holding a value, you also have enough information to produce the mean.
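As a sanity check that two sum columns are enough to recover a mean, here is a minimal Rust sketch; the `SumPair` type and its methods are hypothetical illustrations, not part of sonnerie's API:

```rust
// Hypothetical: a count column and a value column, both aggregated
// with "sum", survive any number of merges and still yield the mean.
struct SumPair {
    count: u64,
    total: f64,
}

impl SumPair {
    // Merging two conflicting records just sums both columns.
    fn merge(self, other: SumPair) -> SumPair {
        SumPair {
            count: self.count + other.count,
            total: self.total + other.total,
        }
    }

    fn mean(&self) -> f64 {
        self.total / self.count as f64
    }
}

fn main() {
    let a = SumPair { count: 2, total: 10.0 };
    let b = SumPair { count: 3, total: 20.0 };
    // 30.0 total over 5 samples
    println!("mean = {}", a.merge(b).mean());
}
```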

File format

I think it makes sense to store the aggregation method in the format string. This makes sense to me because the aggregation method should never change, and the format string is stored only once, so it's efficient.

The format strings right now are single-character codes, like uff representing an unsigned 32-bit integer and two 32-bit floats. I propose a prefix before each type code indicating the aggregation:

For example, +u9f0f could represent "addition for the u", "maximum for the first f", and "minimum for the second f". I'm not too attached to the particular representation, or even to it being constrained to single characters (in fact, it can't be if you need to specify the delimiter). A more complete list:

  • + sum
  • 9 maximum
  • 0 minimum
  • | join with delimiter. The following character must then be " followed by the actual delimiter, backslash-escaped, and then another ". For example, |"," for delimiting with a comma.
  • No character at all, which means "replace".
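A sketch of how such a prefixed format string might be parsed, assuming the markers above (+, 9, 0, and the quoted | delimiter form); the `Agg` enum, function name, and error handling are illustrative, not sonnerie's actual parser:

```rust
#[derive(Debug, PartialEq)]
enum Agg {
    Replace,
    Sum,
    Max,
    Min,
    Join(String),
}

// Parse a format string like "+u9f0f" into (aggregate, type code) pairs.
// Hypothetical sketch only; real type-code validation is omitted.
fn parse_format(s: &str) -> Result<Vec<(Agg, char)>, String> {
    let mut chars = s.chars().peekable();
    let mut out = Vec::new();
    while let Some(&c) = chars.peek() {
        let agg = match c {
            '+' => { chars.next(); Agg::Sum }
            '9' => { chars.next(); Agg::Max }
            '0' => { chars.next(); Agg::Min }
            '|' => {
                chars.next();
                // Expect a quoted, backslash-escaped delimiter: |","
                if chars.next() != Some('"') {
                    return Err("expected opening quote after |".into());
                }
                let mut delim = String::new();
                loop {
                    match chars.next() {
                        Some('\\') => match chars.next() {
                            Some(e) => delim.push(e),
                            None => return Err("dangling backslash".into()),
                        },
                        Some('"') => break,
                        Some(d) => delim.push(d),
                        None => return Err("unterminated delimiter".into()),
                    }
                }
                Agg::Join(delim)
            }
            // No marker: the column keeps the old "replace" behavior.
            _ => Agg::Replace,
        };
        let ty = chars
            .next()
            .ok_or_else(|| "aggregate marker without a type code".to_string())?;
        out.push((agg, ty));
    }
    Ok(out)
}

fn main() {
    println!("{:?}", parse_format("+u9f0f").unwrap());
}
```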

API

Right now, you can make records with record. We would need a new function like record_agg which generates the format string with the appropriate marker. For example:

  record_agg(sonnerie::Aggregate::Max, 25u32)
    .record_agg(sonnerie::Aggregate::Sum, 25.0f64)
    .record_agg(sonnerie::Aggregate::Join(","), "one message")
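The mapping from the proposed Aggregate variants back to format-string markers could look roughly like this; the `Aggregate` enum and `marker` function are hypothetical names for illustration, not existing sonnerie API:

```rust
// Hypothetical sketch of the proposed Aggregate type and its
// corresponding format-string marker. Not sonnerie's real API.
enum Aggregate<'a> {
    Replace,
    Sum,
    Max,
    Min,
    Join(&'a str),
}

// The marker that record_agg would prepend to the column's type code.
fn marker(agg: &Aggregate) -> String {
    match agg {
        Aggregate::Replace => String::new(),
        Aggregate::Sum => "+".to_string(),
        Aggregate::Max => "9".to_string(),
        Aggregate::Min => "0".to_string(),
        // Delimiter is quoted and backslash-escaped, e.g. |","
        Aggregate::Join(d) => {
            format!("|\"{}\"", d.replace('\\', "\\\\").replace('"', "\\\""))
        }
    }
}

fn main() {
    println!("{}", marker(&Aggregate::Join(",")));
}
```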

Applying the aggregate

Right now, Merge::discard_repetitions just keeps reading values from all the transactions until it gets the last one for a given key+timestamp. Instead, Merge should apply the correct aggregate to each column.

A compaction uses Merge directly, so compaction therefore doesn't need special behavior.
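A sketch of what per-column application of the aggregates might look like during a merge, exercised against the multi-column +u9f0f example later in this issue; the `Value` and `Agg` types are stand-ins for sonnerie's real encoded rows:

```rust
// Hypothetical column value; sonnerie's Merge works on encoded rows.
#[derive(Clone, Debug, PartialEq)]
enum Value {
    U(u64),
    F(f64),
}

enum Agg {
    Replace,
    Sum,
    Max,
    Min,
}

// Combine the older value with the newer one per the column's aggregate.
fn apply(agg: &Agg, old: Value, new: Value) -> Value {
    use Value::*;
    match (agg, old, new) {
        (Agg::Sum, U(a), U(b)) => U(a + b),
        (Agg::Sum, F(a), F(b)) => F(a + b),
        (Agg::Max, F(a), F(b)) => F(a.max(b)),
        (Agg::Min, F(a), F(b)) => F(a.min(b)),
        // Mismatched types, or "replace": keep the newer value.
        (_, _, new) => new,
    }
}

fn main() {
    // The three +u9f0f records: sum the u, max the first f, min the second f.
    let aggs = [Agg::Sum, Agg::Max, Agg::Min];
    let rows = [
        vec![Value::U(3), Value::F(32.0), Value::F(19.0)],
        vec![Value::U(5), Value::F(48.0), Value::F(21.0)],
        vec![Value::U(7), Value::F(23.0), Value::F(6.0)],
    ];
    let merged = rows
        .into_iter()
        .reduce(|old, new| {
            old.into_iter()
                .zip(new)
                .zip(aggs.iter())
                .map(|((o, n), a)| apply(a, o, n))
                .collect()
        })
        .unwrap();
    println!("{:?}", merged);
}
```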

When the aggregate is impossible to apply in some manner

What if the data types don't match? For example, you're using the "summation" operator but one field is an integer and the other is a float, or one is numeric and the other is a string. I think the solution is to "try to do the correct thing" and then fall back on replacing the value.

What this means is that if we can guarantee a lossless conversion, the operator can still be applied. For example, if you're doing addition on an f32 and an f64, we can convert that f32 into an f64 and still do the summation.

In the case of such a lossless conversion, the datatype should then become the wider of the two, even if the wider type comes from the later transaction. That is because if a long-running program commits its transaction after newer processes have committed theirs, it would be surprising for your data to suddenly become corrupt.
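The widening rule could be sketched like this: when an f32 meets an f64 under summation, the f32 is converted losslessly and the result stays f64 regardless of transaction order. The `Num` type and `widen_sum` function are illustrative only:

```rust
// Hypothetical numeric column value with two float widths.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Num {
    F32(f32),
    F64(f64),
}

// Sum two values, widening f32 to f64 when the widths differ.
// The result takes the wider type no matter which operand is newer.
fn widen_sum(a: Num, b: Num) -> Num {
    use Num::*;
    match (a, b) {
        (F32(x), F32(y)) => F32(x + y),
        // f32 -> f64 is lossless, so the operator can still apply.
        (F32(x), F64(y)) => F64(x as f64 + y),
        (F64(x), F32(y)) => F64(x + y as f64),
        (F64(x), F64(y)) => F64(x + y),
    }
}

fn main() {
    // Mirrors the widening example below: +f 1.0, +F 2.0, +f 3.0
    let total = widen_sum(widen_sum(Num::F32(1.0), Num::F64(2.0)), Num::F32(3.0));
    println!("{:?}", total);
}
```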

When the aggregate itself conflicts

The order of transactions isn't defined until commit time. That means that if multiple transactions specify different aggregates for the same records, it's probably just user error, because there's no way to make mathematical sense of it. Practically speaking, when the merging occurs there is a defined order to the records, so the aggregates can just be applied in that order. No special work needs to occur.

CLI

The CLI expects the user to enter valid format strings. We can just leave that as it is until we provide a more user-friendly UI.

sonnerie-serve

sonnerie-serve, like the CLI, accepts format strings in the stream. Therefore nothing special needs to be done there either.

Examples

Support for widening

If you create three separate transactions, the final value is the result of applying the aggregate function to all of the values:

key 2023-01-01T00:00:00 +f 1.0
key 2023-01-01T00:00:00 +F 2.0
key 2023-01-01T00:00:00 +f 3.0

You should read back one record:
key 2023-01-01T00:00:00 +F 6.0

Strings

String values are joined with the delimiter:

key 2023-01-01T00:00:00 |","s One
key 2023-01-01T00:00:00 |","s Two
key 2023-01-01T00:00:00 |","s Three

Read back:
key 2023-01-01T00:00:00 |","s One,Two,Three

Multiple columns

Each column has its own aggregation:

key 2023-01-01T00:00:00 +u9f0f 3 32.0 19.0
key 2023-01-01T00:00:00 +u9f0f 5 48.0 21.0
key 2023-01-01T00:00:00 +u9f0f 7 23.0 6.0

Read back:
key 2023-01-01T00:00:00 +u9f0f 15 48.0 6.0

Conflicting data types

If there's a conflict in the data types and widening can't occur, then just retain the value from the newest transaction:

key 2023-01-01T00:00:00 +u 12
key 2023-01-01T00:00:00 +f 19.0

Read back:
key 2023-01-01T00:00:00 +f 19.0

Retain old behavior

Without an aggregate marker on a column, just select that column's value from the most recent transaction:

key 2023-01-01T00:00:00 f+u 4.0 4
key 2023-01-01T00:00:00 f+u 2.0 6

Read back:
key 2023-01-01T00:00:00 f+u 2.0 10

@njaard njaard added the enhancement New feature or request label Feb 7, 2023
db48x commented Feb 17, 2023

Hmm. My first thought is that this makes the format strings nigh unreadable, and my second is to bikeshed.

But I think it might be more helpful to focus on one specific use case. Suppose we are collecting latencies, perhaps for database queries. How would we collect the mean, max, 99th percentile, 90th percentile, etc?
