Escaping centralization #25
Exploring Vector clocks

Every time we apply or create an operation, we increment our clock. When we apply an operation by somebody else, we update the other clocks accordingly. The ID of an operation is the current vector clock.
Side note: It might work even better if we only increment our clock when changing the document ourselves. Then we can even detect messages overtaking others. I have posted a question on CS StackExchange to make sure there's no gotcha to this tweak. Update: I've corrected some transformations.
My feeling was correct. This adaptation is valid for changing data in a distributed manner. It's called version vectors.
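To make the scheme above concrete, here's a minimal sketch (the names are assumptions for illustration, not gulf's actual API): we tick our own component when we change the document, take the component-wise maximum on receipt, and use component-wise comparison to detect causality, including messages that overtook others.

```javascript
// Minimal version vector sketch (illustrative; not gulf's actual API).
class VersionVector {
  constructor(siteId) {
    this.siteId = siteId
    this.clocks = {}
  }
  // Called when we create a local operation; the returned copy is the op's ID.
  tick() {
    this.clocks[this.siteId] = (this.clocks[this.siteId] || 0) + 1
    return { ...this.clocks }
  }
  // Called when we apply a remote operation: component-wise maximum.
  merge(remoteId) {
    for (const site of Object.keys(remoteId)) {
      this.clocks[site] = Math.max(this.clocks[site] || 0, remoteId[site])
    }
  }
}

// a happened before b iff a <= b component-wise and a != b.
// If neither happened before the other, the two operations are concurrent.
function happenedBefore(a, b) {
  const sites = new Set([...Object.keys(a), ...Object.keys(b)])
  let strictlyLess = false
  for (const s of sites) {
    if ((a[s] || 0) > (b[s] || 0)) return false
    if ((a[s] || 0) < (b[s] || 0)) strictlyLess = true
  }
  return strictlyLess
}
```

An overtaken message then shows up as a remote ID that is neither before nor after our current vector, i.e. concurrent with it.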
Centralization is bad.
Gulf currently fosters centralization via master-slave linking.
A solution would be to support peer-to-peer linking.
(Also see the following paper on the topic: https://smartech.gatech.edu/bitstream/handle/1853/6634/GIT-CC-98-15.pdf)
Options for p2p support:
1. Transform Property 2
An OT algorithm satisfies transform property 2 (TP2) when it handles pseudo-ties (also called the false tie puzzle). A tie occurs when two people insert something at the same position. Ties are handled in standard (non-TP2) OT algorithms by using a tie-breaking indicator, e.g. site IDs. Pseudo-ties are ties that result only from transforming operations. They are only a problem if the order in which operations are transformed and applied is not coherent/deterministic, which is usually the case in peer-to-peer scenarios. See the OT FAQ for more on this.
Algorithms that satisfy TP2 solve the false tie puzzle. The most widely accepted solution seems to be tombstones: if you delete something, you don't remove it from the model, you just mark it as deleted, leaving a tombstone. The problem is that you have to keep every tombstone in memory; there's no easy way to gradually prune older tombstones from the model and re-insert them in case they're needed. Also, the transformation logic (what to transform against what) is much more complex than it is for simple OT.
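A tiny sketch of the tombstone idea (illustrative only, not any particular TP2 algorithm): characters are never removed, only marked dead, so positions stay stable across concurrent edits, which is exactly what resolves pseudo-ties, and exactly why the model keeps growing.

```javascript
// Tombstone text model sketch: deletes leave markers instead of removing.
class TombstoneText {
  constructor() {
    this.chars = [] // entries: { ch, dead }
  }
  // pos counts ALL characters, tombstones included, so positions are stable.
  insert(pos, ch) {
    this.chars.splice(pos, 0, { ch, dead: false })
  }
  // Deleting only marks the character; the tombstone stays in memory forever.
  delete(pos) {
    this.chars[pos].dead = true
  }
  // The user-visible document skips tombstones.
  visible() {
    return this.chars.filter(c => !c.dead).map(c => c.ch).join('')
  }
}
```

Note that after a delete, later inserts still address the same underlying positions, so two sites transforming against the same history agree on where things land; the price is that `chars` never shrinks.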
2. Commutative Replicated Data Types
These claim to be the logical successors of OT; examples are WOOT, Treedoc and Logoot. The rationale is that you want append-only, commutative operations that don't need to be transformed based on their history, because they can be applied in any order.
This is usually achieved by also adopting tombstones, as well as a similar concept for inserts: Instead of keeping the result of the insert (e.g. the flat text document), the document becomes a tree of inserts. This means that history is intrinsic to the operations as they need to remember the path to the leaf in the tree that they want to access. The whole document remembers all changes. This means that you can trivially decide whether an operation is based on one that is not yet known.
On the other hand, the same problem arises here: the whole document can become immensely memory-consuming, and there's no easy way to compact it without losing history, although some authors have tried to remedy this by devising sophisticated compaction protocols.
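A heavily simplified, Logoot-flavoured sketch of why these inserts commute (assumed scheme, not any paper's exact algorithm): every character gets an immutable position identifier, identifiers are ordered lexicographically, and inserting means generating a fresh identifier between two existing ones. Since identifiers never change, applying inserts in any order yields the same document.

```javascript
// Compare two position identifiers (arrays of integers) lexicographically.
function cmp(a, b) {
  for (let i = 0; i < Math.max(a.length, b.length); i++) {
    const x = a[i] || 0
    const y = b[i] || 0
    if (x !== y) return x - y
  }
  return 0
}

// Generate an identifier strictly between a and b (assumes a < b).
// Naive midpoint scheme; real Logoot allocation is considerably subtler.
function between(a, b) {
  if (b[0] - a[0] > 1) return [Math.floor((a[0] + b[0]) / 2)]
  return [a[0], (a[1] || 0) + 1] // no room at this level: descend one level
}
```

Because each new identifier encodes its place relative to existing ones, an operation implicitly carries the history it depends on, the "tree of inserts" from above; the identifiers (and tombstones for deletes) are also what make the document grow.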
(Todo: Read this: https://cs.kaist.ac.kr/upload_files/report/1254967150.pdf)
3. Elect a master
Electing a master that transforms incoming operations against missed ones enforces an order and keeps OT control algorithms simple. In fact, the necessary transformations are equivalent to a three-way merge, as git performs every day.
The downside of having a master is that it represents a single point of failure (from which it is simple to recover, though), possibly a bottleneck and an easy point for applying censorship (thus the electing parties need to trust the master, which is not always possible).
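The master's job can be sketched like this (hedged: `transform` here is a toy insert-only inclusion transformation standing in for the real OT type's, and none of these names are gulf's actual API):

```javascript
// Toy transform for insert-only ops { pos, ch }: if a concurrent insert
// landed at or before our position, shift right (existing op wins the tie).
function transform(op, against) {
  return against.pos <= op.pos ? { ...op, pos: op.pos + 1 } : op
}

// The master transforms an incoming op against everything it missed,
// i.e. all operations committed since the revision it was based on,
// then appends it, fixing the canonical order for everyone.
function masterReceive(history, incoming) {
  let op = incoming.op
  for (const missed of history.slice(incoming.basedOn)) {
    op = transform(op, missed) // one three-way-merge step per missed op
  }
  history.push(op)
  return op // broadcast in this now-canonical order
}
```

Since only the master ever decides the order, peers never have to agree among themselves, which is precisely why the control algorithm stays simple and why the master is a single point of failure.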
4. Use an ordering strategy
The crucial problem of OT in p2p scenarios (as you have seen above) is the arbitrary, non-deterministic order of operations, which is determined by the network: you apply edits when they arrive. So, a solution would be to find a way to impose a global order on operations in p2p scenarios.
Post-commit ordering
This strategy applies an order after the edits have been sent out (an ordering key may have been assigned earlier, but edits aren't actually ordered until they are received).
The expense here is that the transformation (i.e. the OT control layer; gulf in this case) becomes more complex (the greater the latency, the greater the complexity).
Using deterministic ordering
A simple solution is to order edits by a conveniently chosen identifier, a timestamp if you will. The timestamping algorithm must ensure a causal and global (deterministic) order. A naive attempt would be `datetime-siteID`, but the problem is of course the exactness of the datetime value. This can be solved by using version vectors as timestamps. The drawback of this approach is that it is only an after-the-fact ordering: when an older edit is received, all edits that have happened in the meantime have to be undone and transformed against the old edit (assuming non-TP2 OT). This can be remedied by allowing peers to reject old edits, forcing the reconnecting peer to rebase. However, all peers need to consistently reject the old edit, which is hard to achieve.

Pre-commit ordering
Electing an ordering peer
A naive solution would be to elect a peer that orders all operations (note that it is not necessary to transform them in any way, as 'Electing a master' suggests). The problems here are, again, the ordering peer becoming a bottleneck and the trust that needs to be placed in it.
Using a distributed queue algorithm
A simple and common approach is the Distributed Queue Dual Bus, but I'm not sure whether it can resist manipulation by a malicious peer.