
Added a generic binlog catchup process and reconcile the binlogs #88

Open: wants to merge 2 commits into main
Conversation

@shuhaowu (Contributor) commented Jan 28, 2019

This PR adds a generic binlog catchup process that runs before Ghostferry resumes (in Ferry.CatchUpOnMissedBinlogsOnResume()). Right now we only reconcile binlog events: we delete any rows that were modified while Ghostferry was down and recopy them from the source. In the future, we'll add mechanisms to reconcile the iterative verifier, and so on.

Note that this PR adds no integration tests, because the current master code can already interrupt/resume by simply replaying all the binlog entries as-is; the tests that cover that also cover this PR. This does mean this PR is not strictly necessary if we only want to interrupt/resume a run without handling schema changes. The reconciliation process becomes more important with a schema change, as we agreed to delete any table whose schema changed and recopy it completely upon resume.

Also performed a minor refactor such that the DataIterator takes the tables to iterate over as an argument to Run, as opposed to a struct member.

Instead of storing the tables as a struct member, this passes them as an
argument to the Run function. That makes more sense because they are not
configuration for how to run the DataIterator, but what to run the
DataIterator on.

This also makes the DataIterator slightly easier to use independently of
the Ferry.
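A minimal sketch of the shape of this refactor; the types below are simplified stand-ins for illustration, not Ghostferry's real DataIterator or TableSchema definitions:

```go
package main

import "fmt"

// Simplified stand-ins for Ghostferry's types, for illustration only.
type TableSchema struct{ Name string }

type DataIterator struct {
	Concurrency int // how to run: stays on the struct as configuration
}

// Run now takes the tables as an argument: they describe what to
// iterate over, not how to iterate. Returning the visited names is
// just for demonstration.
func (d *DataIterator) Run(tables []*TableSchema) []string {
	visited := make([]string, 0, len(tables))
	for _, t := range tables {
		visited = append(visited, t.Name)
	}
	return visited
}

func main() {
	di := &DataIterator{Concurrency: 4}
	fmt.Println(di.Run([]*TableSchema{{Name: "users"}, {Name: "orders"}}))
}
```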
@hkdsun (Member) left a comment

This does mean this PR is not strictly necessary if we just want to interrupt/resume a run without handling schema changes

I thought we came to agreement to ship full interrupt/resume before venturing onto the schema change plans. What changed your mind?

If we decide to go for full interruption, I would like us to not even have a Reconciler struct. Just feed the Ferry the last state and everything else happens automatically. Am I wrong to assume that's possible?

In the original ghostferry algorithm, what's the difference between starting the binlog streamer from the current position vs an earlier position? Assume there's no schema change.

// This is used during the resume when we are catching up to the binlog events
// missed by Ghostferry while it is down.
type ReconcilationDMLEvent struct {
*DMLEventBase
hkdsun (Member):
Why do you need to embed this? It doesn't make sense to call this struct a *DMLEvent either; it's not really an event itself. You may store the event that produced this as a field on the struct, but calling this a DML event itself is very confusing.

Is the only goal of this struct to provide a different AsSQLString so that it works as-is with the BinlogWriter? I feel like we need to refactor the current BinlogWriter to become a MultiStatementWriter.

type MultiStatementWriter struct{}  // current BinlogWriter implementation

type BinlogWriter struct{
  writer *MultiStatementWriter
}

type BinlogReconciler struct{
  writer *MultiStatementWriter
}
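A compilable sketch of that proposal, keeping the reviewer's type names; the fields and methods beyond the original three-type outline are guesses, not the actual Ghostferry implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// MultiStatementWriter batches SQL statements and flushes them as one
// chunk (roughly the role the current BinlogWriter plays).
type MultiStatementWriter struct {
	statements []string
}

func (w *MultiStatementWriter) Append(sql string) {
	w.statements = append(w.statements, sql)
}

func (w *MultiStatementWriter) Flush() string {
	out := strings.Join(w.statements, ";\n")
	w.statements = w.statements[:0]
	return out
}

// Both consumers share the generic writer, so the reconciler does not
// have to disguise its rows as DML events.
type BinlogWriter struct{ writer *MultiStatementWriter }
type BinlogReconciler struct{ writer *MultiStatementWriter }

func main() {
	w := &MultiStatementWriter{}
	bw := &BinlogWriter{writer: w}
	br := &BinlogReconciler{writer: w}
	_ = bw
	br.writer.Append("REPLACE INTO t VALUES (1)")
	fmt.Println(w.Flush())
}
```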

shuhaowu (Contributor, Author):

I tried to refactor the BinlogWriter and it turned out to be a bit more difficult and tedious than I would have liked. Since the interface of the DMLEvent is mainly to call AsSQLString, I decided that it's easier to work with this version and refactor later.

return err
}

batch = append(batch, NewReconciliationDMLEvent(table, pk, row))
hkdsun (Member):

wait... why do you need ReconcilationDMLEvent at all? can't you re-use RowBatch here?

// TODO: Step 2: setup the iterative verifier reconciler

// Step 3: initialize the binlog streamer and the throttler
binlogStreamer := f.NewBinlogStreamer()
hkdsun (Member):

wouldn't it be cleaner to have this inside the reconciler?

shuhaowu (Contributor, Author):

The BinlogReconciler is supposed to catch up on binlog events that occurred while Ghostferry was down. However, if we want to interrupt/resume the verifier, the process of resuming it will also need an instance of the binlog streamer. Since we don't want two binlog streamers, this is managed directly in the Ferry.
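A toy illustration of the single-streamer arrangement described here; all type and method names below are hypothetical, not Ghostferry's actual API:

```go
package main

import "fmt"

// Hypothetical event and streamer types for illustration.
type BinlogEvent struct{ Table string }

type BinlogStreamer struct {
	listeners []func(BinlogEvent)
}

func (s *BinlogStreamer) AddListener(f func(BinlogEvent)) {
	s.listeners = append(s.listeners, f)
}

func (s *BinlogStreamer) emit(ev BinlogEvent) {
	for _, f := range s.listeners {
		f(ev)
	}
}

func main() {
	// The Ferry owns the one streamer; both the reconciler and the
	// verifier's resume path subscribe to it, so no second streamer
	// is needed.
	s := &BinlogStreamer{}
	s.AddListener(func(ev BinlogEvent) { fmt.Println("reconciler:", ev.Table) })
	s.AddListener(func(ev BinlogEvent) { fmt.Println("verifier:", ev.Table) })
	s.emit(BinlogEvent{Table: "users"})
}
```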

batch = make([]DMLEvent, 0, r.BatchSize)
}

row, err := r.selectRowFromSource(table, pk)
hkdsun (Member):

selecting one row at a time? wouldn't that be very slow for high traffic dbs?
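One way to address this concern would be to batch the primary keys into a single SELECT instead of one query per row; a rough, self-contained illustration of building such a statement (the helper name is hypothetical, not Ghostferry code):

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchedSelect builds one SELECT that fetches many rows by
// primary key, rather than issuing one query per row.
func buildBatchedSelect(table, pkColumn string, pks []uint64) (string, []interface{}) {
	placeholders := make([]string, len(pks))
	args := make([]interface{}, len(pks))
	for i, pk := range pks {
		placeholders[i] = "?"
		args[i] = pk
	}
	query := fmt.Sprintf(
		"SELECT * FROM %s WHERE %s IN (%s)",
		table, pkColumn, strings.Join(placeholders, ","),
	)
	return query, args
}

func main() {
	q, args := buildBatchedSelect("db.users", "id", []uint64{1, 2, 3})
	fmt.Println(q, args)
}
```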

}

func (r *BinlogReconciler) replaceBatch(batch []DMLEvent) error {
err := r.BinlogWriter.WriteEvents(batch)
hkdsun (Member):

these are not binlog events. see my comment earlier about MultiStatementWriter

defer func() {
shutdown()
throttlerWg.Wait()
}()
hkdsun (Member):

why is this here? seems very out of place. smells like it should be started before we even get anywhere close to here

return err
}

r.modifiedRows.AddRow(ev.TableSchema(), pk)
hkdsun (Member):

do we really wanna REPLACE for every table? seems like the usual binlog playing logic should handle most tables?

@shuhaowu (Contributor, Author) commented:

In the original ghostferry algorithm, what's the difference between starting the binlog streamer from the current position vs an earlier position? Assume there's no schema change.

The original Ghostferry algorithm is insensitive to the starting location. As long as we don't miss any events, the copied data will be correct.

If we don't start at the position of interruption and instead start at a later point, we'll need something like the Reconciler. We would only really want to do this in the event of a schema change, because the events between the position of interruption and the position of resume could contain schema changes and thus change the binlog event structure.

If we decide to go for full interruption, I would like us to not even have a Reconciler struct. Just feed the Ferry the last state and everything else happens automatically. Am I wrong to assume that's possible?

So I did this PR primarily to future-proof the algorithm. It is not strictly necessary if we don't want to handle schema changes for the time being. It might be possible to resume without the reconciler, but I'll have to try it out to see how it looks, especially with respect to the iterative verifier.
