Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: PostgreSQL destination delete stale requires sequential scan #17913

Open
1 task done
kd7lxl opened this issue May 8, 2024 · 3 comments
Open
1 task done

feat: PostgreSQL destination delete stale requires sequential scan #17913

kd7lxl opened this issue May 8, 2024 · 3 comments

Comments

@kd7lxl
Copy link
Contributor

kd7lxl commented May 8, 2024

Which problem is this feature request solving?

The postgresql destination plugin deletes stale records like this:

sb.WriteString("delete from ")
sb.WriteString(pgx.Identifier{msg.TableName}.Sanitize())
sb.WriteString(" where ")
sb.WriteString(schema.CqSourceNameColumn.Name)
sb.WriteString(" = $1 and ")
sb.WriteString(schema.CqSyncTimeColumn.Name)
sb.WriteString(" < $2")
batch.Queue(sb.String(), msg.SourceName, msg.SyncTime)

The filter on _cq_source_name and _cq_sync_time requires a sequential scan of the table because these fields aren't indexed:

explain delete from "gcp_compute_instances" where _cq_source_name = 'foo' and _cq_sync_time < now();
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Delete on gcp_compute_instances  (cost=0.00..20059.78 rows=0 width=0)
   ->  Seq Scan on gcp_compute_instances  (cost=0.00..20059.78 rows=2 width=6)
         Filter: ((_cq_source_name = 'foo'::text) AND (_cq_sync_time < now()))
(3 rows)

On tables with several sources, this could be slow. It seems like it could benefit from an index.

Describe the solution you'd like

Create indexes on _cq_source_name and _cq_sync_time.

Pull request (optional)

  • I can submit a pull request
@kd7lxl
Copy link
Contributor Author

kd7lxl commented May 21, 2024

Just waiting for permission to contribute per

The CloudQuery framework, SDK and CLI are open source while plugins available under `plugins` are **open core**, hence not all contributions to plugins directory will be accepted if they are part of the commercial plugin offering - please [file an issue](https://github.com/cloudquery/cloudquery/issues/new/choose) before opening a PR.

@erezrokah
Copy link
Contributor

erezrokah commented May 21, 2024

Hi @kd7lxl, thanks for suggesting a contribution. Maybe you can start with a draft PR to show how the implementation would look like? We'll need to also discuss this internally if we want the SDK schema to tell destinations if they should create indexes for columns, or it should be coded in the destination per column

@kd7lxl
Copy link
Contributor Author

kd7lxl commented May 21, 2024

The scope of this request is to create an index to directly support the query used in

sb.WriteString("delete from ")
sb.WriteString(pgx.Identifier{msg.TableName}.Sanitize())
sb.WriteString(" where ")
sb.WriteString(schema.CqSourceNameColumn.Name)
sb.WriteString(" = $1 and ")
sb.WriteString(schema.CqSyncTimeColumn.Name)
sb.WriteString(" < $2")
batch.Queue(sb.String(), msg.SourceName, msg.SyncTime)

This is not a generic request from the SDK to create indexes, since the SDK or source plugin is not responsible for the design of this delete query. The query is defined in plugins/destination/postgresql/client/delete.go, so the matching index will be created in plugins/destination/postgresql/client/migrate.go.

Supporting indexes for other columns via the SDK would be an interesting discussion, but beyond the scope here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Ready
Development

No branches or pull requests

2 participants