Skip to content

Latest commit

 

History

History
235 lines (152 loc) · 12.4 KB

Lecture 08.md

File metadata and controls

235 lines (152 loc) · 12.4 KB

Distributed Systems Lecture 8

Lecture Given by Lindsey Kuper on April 15th, 2020 via YouTube

Previous Next
Lecture 7 Lecture 9

Rules of the Chandy-Lamport Algorithm

This is an example of a decentralised1 algorithm that allows you to take a global snapshot of a running distributed system, and its design gives us two key advantages:

  1. Any participating process can initiate a snapshot.
    The process initiating the snapshot is not required to occupy some elevated role (such as "supervisor") because this task not considered "special" or "privileged".
  2. The process initiating the snapshot does not need to warn the other processes that this action is about to take place.

The act of initiating a snapshot creates a cascade of marker messages throughout the entire system. This message cascade then causes all the other processes to take a snapshot of themselves.

The Initiator Process

  • Records its own state
  • Sends a marker message out on all its outgoing channels
  • Starts recording messages arriving on all incoming channels

If process P1 decides to initiate a snapshot, then the following sequence of events takes place:

  • P1 records its own state as S1
  • Immediately after recording its own state, P1 sends out marker messages on all its outgoing channels (only one in this case: channel C12)
  • P1 starts recording any messages that might arrive on its incoming channels (again, only one in this case: channel C21)

Chandy-Lamport Snapshot 1

Notice that at the time P1's snapshot happens, message m is currently in the channel from P2 to P1 (channel C21).

Processes Receiving a Marker Message

IMPORTANT

A Marker FIFO Anomaly Cannot Happen

Due the fact that all channels behave as FIFO queues, we do not need to be concerned about the possibility of FIFO anomalies. This system is designed such that marker messages cannot arrive before earlier message-send events in the originating process.

None of what follows would work if we had not first eliminated the possibility of FIFO anomalies!

When a process receives a marker message, it can react in one of two different ways. How it reacts depends on whether or not that process has already seen a marker message during this run of the global snapshot.

Scenario 1: Nope, I Haven't Seen a Marker Message Before...

If this is the first time this process has seen a marker message, the receiver:

  • Records its own state
  • Flags the channel on which the marker message was received as empty
  • Sends out a marker message on each of its outgoing channels
  • Starts recording incoming messages on all channels except the one on which it received the original marker message (now flagged as empty)

Q:   During a snapshot, once a channel is marked as empty, what happens if you then receive a message on that channel?
A:   Whilst the snapshot is running, messages received on channels marked as empty are ignored!

In the diagram below, since this is the first marker message P2 has seen, it does the following:

  • It records its own state as S2
  • Flags channel C12 as empty
  • Sends out a marker message on all its outgoing channels (in this case, only channel C21)
  • Normally, it would now start recording any messages that arrive on its other, incoming channels; however, in this case, since its only incoming channel (C12) has already been marked as empty, there is nothing to record

Chandy-Lamport Snapshot 2

Scenario 2: Yup, I've Already Seen a Marker Message...

If a process sends out a marker message, then we consider that process already to have "seen" a marker message (its own). So when a process that has already sent out its own marker message receives someone else's marker message, it:

  • Stops recording incoming messages on that channel
  • Sets that channel's final state to be the sequence of all messages received whilst recording was active

Message m from P2 (sent at event C) arrives on channel C21 as event D in process P1. This message arrived before the marker message because channels always behave as FIFO queues.

Upon receiving this marker message, P1 then:

  • Stops recording on the marker message's channel (C21 in this case)
  • The final state of channel C21 is set to the sequence of messages that arrived whilst recording was active

Chandy-Lamport Snapshot 3

So, we now have a consistent snapshot of our entire system, which in this simple case, consists of four things:

  1. The state of our two processes:
    • P1's state recorded as S1
    • P2's state recorded as S2
  2. The state of all channels between those processes:
    • Channel C12 recorded by P2 (Empty)
    • Channel C21 recorded by P1 (Message m)

The Chandy-Lamport Algorithm in a More Detailed Scenario

When a snapshot takes place, every process ends up sending out a marker message to every other process. So, for a system containing N participating processes, N * (N - 1) marker messages will be sent. This might seem inefficient as the number of messages rises quadratically with the number of participating processes, but unfortunately, there is no better approach.

As stated in the previous lecture, the success of the Chandy-Lamport algorithm relies entirely on the truth of the following assumptions:

  1. Eventual message delivery is guaranteed, thus making delivery failure impossible
  2. All channels act as FIFO queues, thus eliminating the possibility of messages being delivered out of order (FIFO anomalies)
  3. Processes don't crash! (See lecture 10)

A Worked Example

In this example, we have three communicating processes P1, P2 and P3 in our system, and we want to take a snapshot.

Process P1 acts as the initiator; so it follows the above steps:

  • It records its own state as S1
  • It sends out two marker messages; one to P2 and one to P3 - but notice that the arrival of the marker message at P2 is delayed. This turns out not to be a problem.
  • P1 starts recording on both its incoming channels C21 and C31

Chandy-Lamport Example Step 1

Next, P3 receives the marker message from P1. Since this is the first marker message it has received:

  • It records its own state as S3
  • Marks the channel on which it received the marker message (C13) as empty
  • Sends out marker messages on all its outgoing channels
  • Starts recording on its other incoming channel (C23)

Chandy-Lamport Example Step 2

Looking at P3's marker message that now arrives at P1, since P1 initiated the snapshot process, this is not the first marker it has seen, so P1:

  • Stops recording incoming messages on that channel (C31)
  • Sets that channel's final state to be the sequence of all messages received whilst recording was active - which is none - so the channel state of C31 is {}.

Chandy-Lamport Example Step 3

Now look at the other marker message from P3 to P2. This is the first marker P2 has seen, so it:

  • It records its own state as S2
  • Marks the channel on which it received the marker message (C32) as empty
  • Sends out marker messages on all its outgoing channels
  • Starts recording on its other incoming channel (C12)

Chandy-Lamport Example Step 4

Eventually, the initial marker message from P1 arrives at P2. This is the second marker P2 has seen, so it:

  • Stops recording incoming messages on that channel (C12)
  • Sets that channel's final state to be the sequence of all messages received whilst recording was active - which is none - so the channel state of C12 is {}.

Chandy-Lamport Example Step 5

P2's marker message now arrives at P1. This is not the first marker P1 has seen, so it:

  • Stops recording incoming messages on that channel (C21)
  • Sets that channel's final state to be the sequence of all messages received whilst recording was active - which in this case is the message m3 sent at event H in P2 to event D in P1 - so the channel state of C12 is {m3}.

Chandy-Lamport Example Step 6

Lastly, the marker message from P2 arrives at P3. Similarly, this is not the first marker P3 has seen, so it:

  • Stops recording incoming messages on that channel (C23)
  • Sets that channel's final state to be the sequence of all messages received whilst recording was active - which is none - so the channel state of C23 is {}.

Chandy-Lamport Example Step 7

We now have a consistent snapshot of the entire system composed of three process states:

  • P1 = S1
  • P2 = S2
  • P3 = S3

And six channel states:

  • C12 = {}
  • C21 = {m3}
  • C13 = {}
  • C31 = {}
  • C23 = {}
  • C32 = {}

What About the Internal Events Not Recorded in the Process Snapshots?

In the above diagram, events C, D and E do not form part of P1's snapshot recorded in state S1 because these events had not yet occurred at the time P1 decided to take its snapshot.

Similarly, events J and K do not form part of P3's snapshot recorded in state S3 because these events had not yet occurred at the time the marker message arrived from P1.

These events will all be recorded the next time a snapshot is taken.

How Does the Entire System Know When the Snapshot Is Complete?

An individual process knows its local snapshot is complete when it has recorded:

  • Its own internal state, and
  • The state of all its incoming channels

If it can be shown that the snapshot process terminates for an individual process, and all individual processes use the same snapshot algorithm, then it follows that the snapshot will terminate for all participating processes in the system.

Now we can appreciate the importance of the assumptions listed at the start. The success of this entire algorithm rests on the fact that:

  • Eventual message delivery is guaranteed, and
  • Messages never arrive out of order (all channels are FIFO queues), and
  • Processes do not crash (yeah, right! Again, see lecture 10)

In Chandy & Lamport's original paper they provide a proof that the snapshot process does in fact terminate.

However, determining when the snapshot for the entire system is complete lies outside the rules of the Chandy-Lamport algorithm itself. Management of an entire system snapshot needs to be handled by some external coordinating process that:

  1. Receives all the snapshot data from the individual processes, then
  2. Collates that data to form an overall system snapshot.

Previous Next
Lecture 7 Lecture 9

Endnotes

1   In this context, a "decentralised algorithm" is one that does not need to be invoked from a special coordinating process; any process in the system can act as the initiator. A beneficial side-effect of this is that if two processes simultaneously decide to initiate a snapshot, then nothing bad happens.