Refine the initialization of schema snapshots. #11109

Open
asddongmen opened this issue May 15, 2024 · 0 comments
Labels
area/ticdc Issues or PRs related to TiCDC. type/enhancement This is a enhancement PR

asddongmen commented May 15, 2024

Is your feature request related to a problem?

Currently, during initialization, a changefeed retrieves all table schemas from upstream and then applies filter rules to these schemas to determine which tables will be replicated. This process can be time-consuming if there are a large number of tables upstream. Often only a few of those tables need to be replicated, so most of the loaded schemas are never used.

Consider a changefeed with the following filter rules as an example.

[filter]
rules=["db.t1"]

This changefeed only replicates the db.t1 table, yet at startup it still has to fetch the schemas of all tables in the db database. If db contains thousands of tables, this can take a long time.

We recently hit a case where this caused a delay of about 10 minutes in CDC. It occurred during a CDC failover with hundreds of changefeeds and thousands of tables.

Efforts have been made to resolve the lag increase problem, but further improvement is still needed.

So, we need to find a way to make getting the schemas quicker.

We have two ideas to fix this problem.

Solution 1

  1. Use [ListSimpleTables](https://github.com/pingcap/tidb/blob/master/pkg/meta/meta.go#L1026) to retrieve TableNameInfo. This only includes the name and ID of a table, making it smaller than TableInfo.
  2. Apply a filter to the retrieved TableNameInfos to locate the tables of interest.
  3. Use [GetTable](https://github.com/pingcap/tidb/blob/master/pkg/meta/meta.go#L1219) to acquire the schema of the selected tables.

This approach can reduce time costs by minimizing the amount of data that needs to be loaded. However, in the worst-case scenario (where the changefeed is interested in all upstream tables), it could result in additional network costs.
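The three steps and the worst-case cost can be sketched as below. This is only an illustration under stated assumptions: `metaReader`, `fakeMeta`, `newFakeMeta`, and `loadFiltered` are hypothetical names, the two structs are simplified stand-ins for the TiDB model types, and the method signatures are assumed rather than copied from `meta.go`. The fake counts `GetTable` calls to show that the number of per-table lookups equals the number of tables that pass the filter.

```go
package main

import (
	"fmt"
	"strings"
)

// TableNameInfo and TableInfo are simplified stand-ins for the TiDB model types.
type TableNameInfo struct {
	ID   int64
	Name string
}

type TableInfo struct {
	ID      int64
	Name    string
	Columns []string
}

// metaReader is a hypothetical interface for the two calls used by Solution 1.
// The signatures are assumptions made for this sketch.
type metaReader interface {
	ListSimpleTables(dbID int64) ([]*TableNameInfo, error)
	GetTable(dbID, tableID int64) (*TableInfo, error)
}

// fakeMeta is an in-memory metaReader that counts GetTable calls.
type fakeMeta struct {
	tables   map[int64]*TableInfo
	getCalls int
}

func (m *fakeMeta) ListSimpleTables(dbID int64) ([]*TableNameInfo, error) {
	names := make([]*TableNameInfo, 0, len(m.tables))
	for _, t := range m.tables {
		names = append(names, &TableNameInfo{ID: t.ID, Name: t.Name})
	}
	return names, nil
}

func (m *fakeMeta) GetTable(dbID, tableID int64) (*TableInfo, error) {
	m.getCalls++
	t, ok := m.tables[tableID]
	if !ok {
		return nil, fmt.Errorf("table %d not found in db %d", tableID, dbID)
	}
	return t, nil
}

// loadFiltered lists the lightweight name infos, filters them, and fetches
// the full schema only for the tables that keep selects.
func loadFiltered(m metaReader, dbID int64, keep func(*TableNameInfo) bool) ([]*TableInfo, error) {
	names, err := m.ListSimpleTables(dbID)
	if err != nil {
		return nil, err
	}
	var out []*TableInfo
	for _, n := range names {
		if !keep(n) {
			continue
		}
		info, err := m.GetTable(dbID, n.ID)
		if err != nil {
			return nil, err
		}
		out = append(out, info)
	}
	return out, nil
}

// newFakeMeta builds a fake database with tables t1..tn.
func newFakeMeta(n int) *fakeMeta {
	m := &fakeMeta{tables: make(map[int64]*TableInfo, n)}
	for i := 1; i <= n; i++ {
		id := int64(i)
		m.tables[id] = &TableInfo{ID: id, Name: fmt.Sprintf("t%d", i), Columns: []string{"id", "v"}}
	}
	return m
}

func main() {
	m := newFakeMeta(1000)
	onlyT1 := func(n *TableNameInfo) bool { return strings.EqualFold(n.Name, "t1") }
	tables, err := loadFiltered(m, 1, onlyT1)
	if err != nil {
		panic(err)
	}
	fmt.Println("tables loaded:", len(tables), "GetTable calls:", m.getCalls)
}
```

With a filter that accepts every table, the same function issues one `GetTable` call per table on top of the listing, which is the extra cost in the worst case.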

Solution 2

Add a new method to https://github.com/pingcap/tidb/blob/63cf3e54aeaaa2cfdee8d6587064d62ba3ad2a52/pkg/meta/meta.go#L993 as below:

// ListTablesByFn lists the tables in a database, filtered by fn.
func (m *Meta) ListTablesByFn(dbID int64, fn func(info *model.TableNameInfo) bool) ([]*model.TableInfo, error) {
	dbKey := m.dbKey(dbID)
	if err := m.checkDBExists(dbKey); err != nil {
		return nil, errors.Trace(err)
	}

	res, err := m.txn.HGetAll(dbKey)
	if err != nil {
		return nil, errors.Trace(err)
	}

	tables := make([]*model.TableInfo, 0, len(res)/2)
	for _, r := range res {
		// only handle table meta
		tableKey := string(r.Field)
		if !strings.HasPrefix(tableKey, mTablePrefix) {
			continue
		}
		tbName := &model.TableNameInfo{}
		err = json.Unmarshal(r.Value, tbName)
		if err != nil {
			return nil, errors.Trace(err)
		}
		// Skip tables that the caller's fn does not select.
		if !fn(tbName) {
			continue
		}

		tbInfo := &model.TableInfo{}
		err = json.Unmarshal(r.Value, tbInfo)
		if err != nil {
			return nil, errors.Trace(err)
		}
		tbInfo.DBID = dbID

		tables = append(tables, tbInfo)
	}

	return tables, nil
}

This method lets the caller pass a function that selects the wanted TableInfos, which avoids the extra network cost. However, it requires changing code in the TiDB repository and updating the TiDB dependency in the TiFlow repository, which might introduce unforeseen issues and require extra work.

So, I suggest we put solution 1 in place for versions ≤ v8.1.0, and bring in solution 2 in the next release of TiCDC.
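The single-pass idea can be tried outside TiDB with plain JSON. The sketch below is an assumption-laden illustration, not the proposed method: `tableName`, `tableInfo`, and `selectTables` are hypothetical, the JSON layout is simplified compared with the real TableInfo encoding, and `keep` is taken to return true for tables that should be loaded. Each raw value is first decoded into the small name struct, and the full decode happens only for the values the predicate selects.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tableName is a hypothetical lightweight view: only id and name are decoded.
type tableName struct {
	ID   int64  `json:"id"`
	Name string `json:"name"`
}

// tableInfo is a hypothetical, simplified full schema.
type tableInfo struct {
	ID      int64    `json:"id"`
	Name    string   `json:"name"`
	Columns []string `json:"cols"`
}

// selectTables decodes the cheap name view of every raw value and performs
// the full decode only for values that keep selects.
func selectTables(raw [][]byte, keep func(*tableName) bool) ([]*tableInfo, error) {
	out := make([]*tableInfo, 0, len(raw))
	for _, v := range raw {
		name := &tableName{}
		if err := json.Unmarshal(v, name); err != nil {
			return nil, err
		}
		if !keep(name) {
			continue
		}
		info := &tableInfo{}
		if err := json.Unmarshal(v, info); err != nil {
			return nil, err
		}
		out = append(out, info)
	}
	return out, nil
}

func main() {
	raw := [][]byte{
		[]byte(`{"id":1,"name":"t1","cols":["id","v"]}`),
		[]byte(`{"id":2,"name":"t2","cols":["id"]}`),
	}
	tables, err := selectTables(raw, func(n *tableName) bool { return n.Name == "t1" })
	if err != nil {
		panic(err)
	}
	for _, t := range tables {
		fmt.Println(t.ID, t.Name, len(t.Columns))
	}
}
```

The values are still read in one batch, so no extra round trips are added. What the predicate saves is the full decode and allocation for tables the changefeed does not care about.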

This way, we'll save time in most situations and keep changes small.

Describe the feature you'd like

As above.

Describe alternatives you've considered

No response

Teachability, Documentation, Adoption, Migration Strategy

No response
