
Ghostferry binlog streamer lag with large source write volume due to misconfigured BinlogEventBatchSize? #332

Open
shuhaowu opened this issue May 4, 2022 · 1 comment

shuhaowu commented May 4, 2022

When the source experiences a high write load, Ghostferry's binlog streamer can sometimes fail to keep up and falls behind (lags). This is a problem, as binlog streamer lag must be zero during cutover.

Thinking carefully about how Ghostferry works shows why lag may occur. Ghostferry streams the binlog via the BinlogStreamer component. Once a binlog event is read, it is put into a channel whose capacity is BinlogEventBatchSize (set in the config). This channel is read by the BinlogWriter, a separate goroutine, which follows roughly this pseudocode (a Go sketch follows the list):

  1. An event is read from the channel and appended to a buffer.
  2. If there are no more events pending in the channel, or if the buffer reaches BinlogEventBatchSize, the buffered events are written to the target.
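
To make the two steps concrete, here is a minimal Go sketch of that loop. It is not Ghostferry's actual BinlogWriter code: Event, writeToTarget, and writerLoop are hypothetical stand-ins, and batchSize plays the role of BinlogEventBatchSize.

```go
// Event is a hypothetical stand-in for Ghostferry's binlog event type.
type Event struct{}

// writerLoop sketches the BinlogWriter behaviour described above: drain the
// channel into a buffer, then flush the buffer to the target.
func writerLoop(events <-chan Event, batchSize int, writeToTarget func([]Event) error) error {
	buffer := make([]Event, 0, batchSize)
	for ev := range events {
		buffer = append(buffer, ev)

		// Step 1: keep draining the channel until it is momentarily empty or
		// the buffer has reached batchSize.
	drain:
		for len(buffer) < batchSize {
			select {
			case next, ok := <-events:
				if !ok {
					break drain
				}
				buffer = append(buffer, next)
			default:
				break drain
			}
		}

		// Step 2: write the batch to the target. While this write is running,
		// nothing reads from the channel, so the BinlogStreamer blocks as soon
		// as the channel (also sized batchSize) fills up.
		if err := writeToTarget(buffer); err != nil {
			return err
		}
		buffer = buffer[:0]
	}
	return nil
}
```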

Note that during step 2 (while writing to MySQL), the BinlogWriter is not draining the channel, but the BinlogStreamer continues to fill it. If the write to MySQL takes long enough, the channel fills up, and once it is full the BinlogStreamer blocks, since the channel effectively serves as a fixed-size buffer. During a recent move where we had lag, I took a 5-second pprof profile and graphed the synchronization blocking profile, which showed the BinlogStreamer blocked for 4.84s of those 5 seconds.

[pprof synchronization blocking profile screenshot]

Math

There is a possible solution to this: if the batch (and thus the channel) is larger, then the write to the target (step 2) may finish before the channel fills up. In that case the BinlogStreamer never blocks. We can analyze this condition mathematically:

  1. Let's say the write time taken is denoted as T.
  2. The BinlogEventBatchSize is denoted as B.
  3. The rate at which binlog events are generated on the source is R.
  4. During a write that takes time T, the source generates T * R new events; for the streamer never to block, the channel of size B must be able to hold all of them, so the "optimal" batch size is B = T * R (or B = TR).
  5. We know the time it takes to write, T, is a function of B, as larger batch sizes take longer to write.
  6. We can assume a simple linear model, where T = mB + o. We can interpret the parameter m to be the (marginal) time needed for writing a single binlog event on the target, and o to be the overhead of a single write operation (think network round trip and other overhead).

Combining the equations from steps 4 and 6:

B = TR
T = mB + o

B = (mB + o)R
B = mBR + oR
B - mRB = oR
(1 - mR)B = oR
B = oR / (1 - mR)

So the optimal batch size is B = oR / (1 - mR). Unfortunately, we don't have the values for any of these terms. Perhaps in the future Ghostferry can measure them.
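As an illustration only, here is a small Go calculation of that formula. The values of m, o, and R below are made up, since Ghostferry does not measure them today:

```go
package main

import "fmt"

// optimalBatchSize computes B = oR / (1 - mR), and reports whether a finite
// optimal batch size exists at all (it does not when mR >= 1).
func optimalBatchSize(m, o, r float64) (float64, bool) {
	if m*r >= 1 {
		return 0, false
	}
	return o * r / (1 - m*r), true
}

func main() {
	m := 0.001 // marginal target write time per event, in seconds (made up)
	o := 0.05  // fixed overhead per write operation, in seconds (made up)
	r := 500.0 // binlog events generated per second on the source (made up)

	if b, ok := optimalBatchSize(m, o, r); ok {
		fmt.Printf("optimal BinlogEventBatchSize ~ %.0f events\n", b)
	} else {
		fmt.Println("mR >= 1: the target cannot keep up at any batch size")
	}
}
```

With these made-up numbers, mR = 0.5 and the formula suggests a batch size of about 50 events.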

However, if MySQL is too slow, Ghostferry cannot keep up at all, and this also shows up in the equations: if mR >= 1, then B is either undefined (mR = 1) or negative, which is non-physical. An example makes this clearer. Recall m is the time required to write one event; suppose m = 0.01 second / event, so the target can write at most 100 events / second. Recall R is the number of events generated per second on the source; suppose R = 200 events / second. This obviously cannot work, because events are generated twice as fast as the target can write them. Indeed, mR = 2, which means no optimal batch size exists.

Recommendations

  • For cases where you see binlog streamer lag, try increasing BinlogEventBatchSize (see the config sketch after this list). If that doesn't help, the target MySQL must be made to write faster.
  • In the future, Ghostferry should measure m, R, and o to automatically determine the "optimal" batch size, and possibly detect cases where it is impossible to keep up.
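
For the first recommendation, a tuning sketch might look like the following, assuming BinlogEventBatchSize is exposed as a field on ghostferry.Config (as the config option named in this issue suggests); the value shown is arbitrary:

```go
package main

import "github.com/Shopify/ghostferry"

func main() {
	config := &ghostferry.Config{
		// ... source, target, and other required settings elided ...
	}

	// Larger batches amortize the per-write overhead o, at the cost of bigger
	// writes on the target; 2048 is an arbitrary illustrative value.
	config.BinlogEventBatchSize = 2048

	_ = config // the rest of the Ghostferry setup and run is out of scope here
}
```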
@owensmallwood

How many writes/second is considered "high load" on the source database for Ghostferry to handle?
