add timestamps for json representation for GC #3

Open
reiddraper opened this issue Jan 26, 2012 · 12 comments

Comments

@reiddraper
Collaborator

I don't have a suggestion (yet), but wanted to put this issue down while I was thinking about it. Currently the JSON representations for these types, such as the or-set, don't include the timestamp information that is needed to perform garbage collection.

@aphyr
Owner

aphyr commented Jan 26, 2012

I was thinking one could choose logically or time-ordered tags for
observed-remove which are amenable to GC.


@reiddraper
Collaborator Author

Not sure I follow, could you give an example?

@aphyr
Owner

aphyr commented Jan 26, 2012

Let's say you use snowflake for unique ID generation. Snowflake is
k-ordered and has a time component, so each ID can be roughly located in
time. Use snowflake IDs for the tags in your observed-remove set. For
GC, remove any tag whose time component is older than your threshold.
Or just order all the tags and keep the highest n.
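
The two pruning strategies described above can be sketched roughly as follows. This is only an illustration: it assumes tags are 64-bit integers in the Twitter-style snowflake layout (a millisecond timestamp stored above 22 low bits of worker and sequence data, measured from Twitter's custom epoch). The helper names `make_snowflake`, `prune_older_than`, and `keep_highest` are hypothetical, and which tags they are applied to (for example, only the removed side of the set) is left to the caller.

```python
# Assumption: Twitter-style snowflake layout, 41-bit millisecond timestamp
# stored above 10 worker bits and 12 sequence bits.
TWITTER_EPOCH_MS = 1288834974657  # assumption: Twitter's custom epoch
TIMESTAMP_SHIFT = 22              # worker (10 bits) + sequence (12 bits)


def make_snowflake(time_ms, worker=0, seq=0, epoch_ms=TWITTER_EPOCH_MS):
    """Build an illustrative snowflake-style ID (hypothetical helper)."""
    return ((time_ms - epoch_ms) << TIMESTAMP_SHIFT) | (worker << 12) | seq


def snowflake_time_ms(tag, epoch_ms=TWITTER_EPOCH_MS):
    """Recover the millisecond timestamp embedded in a tag."""
    return (tag >> TIMESTAMP_SHIFT) + epoch_ms


def prune_older_than(tags, threshold_ms):
    """Drop every tag whose time component is older than the threshold."""
    return {t for t in tags if snowflake_time_ms(t) >= threshold_ms}


def keep_highest(tags, n):
    """Order all the tags and keep only the n highest."""
    if n <= 0:
        return set()
    return set(sorted(tags)[-n:])
```

Because the timestamp occupies the high bits, sorting the raw integers orders the tags roughly by time, which is what makes "keep the highest n" work without decoding anything.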


@reiddraper
Collaborator Author

Ah, I see, cool idea. What about the two-phase set? I've thought about removing it from knockbox, to be honest.

@aphyr
Owner

aphyr commented Jan 26, 2012

It makes sense for certain ephemeral structures (say, purchase orders),
and I'm inclined to keep it for completeness' sake. Not really married
to the concept, though.


@reiddraper
Collaborator Author

Regarding using k-ordered timestamps for tags, I'm afraid it's a bit of a large dependency. Suppose you want to use CRDTs in JavaScript in the browser, or you just don't want your CRDT library to depend on an external ID generation service. It seems there should also be an additional "normal" timestamp field that can be used across all CRDT libs without assumptions about infrastructure such as snowflake. Unfortunately, logically ordered timestamps don't help with what I think is going to be the most common pruning predicate: keeping garbage around longer than your longest expected partition.

@aphyr
Owner

aphyr commented Jan 26, 2012

Or you could use unix epoch seconds, or use iso8601 timestamps, etc,
plus hostnames... there are many possible strategies here.

What bothers me is that you might use timestamps for your tags. Then
the timestamp information would be redundant but necessary. LWW-sets
imply a clock of some kind too, though it could be logical and not
wallclock time... hmmmmmm.

I don't want to define the tags to be a specific kind of timestamp,
because you might want to use a logical timestamp or a tag handed to you
by some other system. Maybe we should establish special variants of the
LWW and OR set types that include a predefined timestamp and GC scheme?


@reiddraper
Collaborator Author

Timestamp + hostname might be good enough for the cases I'm thinking of, at least if it's at millisecond resolution. I suppose it also wouldn't be hard to have an incrementing counter for the library, so tags are time + hostname + counter. I definitely like the idea of predefined timestamp/GC schemes. I'm (hopefully) going to have some time this weekend to dive back into knockbox, and GC is high on my TODO list.

@aphyr
Owner

aphyr commented Jan 26, 2012

I like that idea. I'm cool with implementing a flake-like at the library
level. Say default unique tags are

[unix_time, host, pid, unique]

and for lww-sets, just unix-time. Floats OK for times?
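
A library-level generator for the default tag shape above might look something like this sketch. It assumes the time is a float of unix epoch seconds and that "unique" is a per-process incrementing counter; the function name `new_tag` and its `now` parameter are hypothetical, not part of any existing library.

```python
import itertools
import os
import socket
import time

# Per-process counter standing in for the "unique" component (assumption).
_counter = itertools.count()


def new_tag(now=None):
    """Return a tag shaped like [unix_time, host, pid, unique].

    `now` lets callers inject a time, for example in tests; by default the
    wall clock is used, with all of its usual quirks.
    """
    unix_time = time.time() if now is None else now
    return [unix_time, socket.gethostname(), os.getpid(), next(_counter)]
```

Putting the time first means plain list comparison orders tags by time, and the result is a list of floats, strings, and integers, so it serializes directly to a JSON array.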

Note, unix time can jump backwards, stop, hiccup, etc... we expose
people to all of that but at the timescales we're talking (days, right?)
it shouldn't be an issue.

Obviously there will be a (gc obj) function to force garbage collection,
probably with age or size constraints. If you just go by "preserve n
tags" it's easy to implement regardless of tag type: just sort the tags
and take the biggest n. If you want to go by time then we filter (<
threshold (first tag)).

For users who want to use their own tag strategy, they can easily extend
this by just using [unix-time, custom-tag]. The GC culling won't know
the difference. Sound good?

--Kyle
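
A rough sketch of the culling described in this comment, assuming every tag is a list whose first element is a unix time in seconds. The name `gc` follows the `(gc obj)` idea above, but the signature and the parameter names `max_items` and `max_age_seconds` are assumptions made for illustration.

```python
def gc(tags, now, max_items=None, max_age_seconds=None):
    """Return the tags that survive garbage collection, oldest first.

    Only the first element of each tag (the unix time) is inspected, so
    default tags and [unix-time, custom-tag] pairs are treated alike.
    """
    # Sort on the time component only, so tags of different shapes can mix.
    kept = sorted(tags, key=lambda tag: tag[0])
    if max_age_seconds is not None:
        threshold = now - max_age_seconds
        # Same predicate as (< threshold (first tag)): keep newer tags.
        kept = [tag for tag in kept if threshold < tag[0]]
    if max_items is not None:
        # "Preserve n tags": take the biggest n after sorting.
        kept = kept[-max_items:] if max_items > 0 else []
    return kept


tags = [
    [100.0, "a", 1, 0],
    [200.0, "b", 2, 0],
    [300.0, "custom-tag"],  # user-supplied [unix-time, custom-tag]
    [250.0, "a", 1, 1],
]
survivors = gc(tags, now=400.0, max_age_seconds=175)
```

With both constraints given, the age filter runs first and the size limit is applied to whatever remains; that ordering is a design choice of this sketch, not something settled in the thread.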


@reiddraper
Collaborator Author

What are your thoughts on where to store pruning parameters, i.e. the number of items to keep and/or the number of seconds to keep? My gut says it's fine to have the library just decide this, but I could imagine some use cases where you'd want the object to carry this information around, as it might be specific to the object. Thoughts?

@aphyr
Owner

aphyr commented Feb 2, 2012

I would do both; if provided, you use, say "gc_max_items" and
"gc_max_seconds" for history pruning. If not provided, it's library
dependent.

--Kyle
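
The "do both" approach could be sketched like this: read `gc_max_items` and `gc_max_seconds` from the object's JSON when they are present, and fall back to library-chosen values otherwise. The default numbers, the `gc_params` helper, and the surrounding fields of the example object are all made up for illustration.

```python
import json

# Hypothetical library defaults; a real library would pick its own.
LIBRARY_DEFAULTS = {
    "gc_max_items": 1000,
    "gc_max_seconds": 7 * 24 * 60 * 60,
}


def gc_params(obj):
    """Resolve pruning parameters for one decoded JSON object.

    Values carried by the object win; missing or null values fall back
    to the library defaults.
    """
    params = {}
    for key, default in LIBRARY_DEFAULTS.items():
        value = obj.get(key)
        params[key] = default if value is None else value
    return params


# Illustrative object: only gc_max_seconds is set on the object itself.
doc = json.loads('{"type": "or-set", "e": [], "gc_max_seconds": 86400}')
resolved = gc_params(doc)
```

Here `resolved` carries the object's own `gc_max_seconds` and the library's `gc_max_items`, so an object only needs to mention the limits it wants to override.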


@reiddraper
Collaborator Author

+1 to that idea
