Electric pg connection closing on large syncs #519

Open
js2702 opened this issue Oct 3, 2023 · 3 comments
Labels: Improvement Created by Linear-GitHub Sync

Comments

@js2702 (Contributor) commented Oct 3, 2023

We are doing some tests with large quantities of data (10,000-15,000 new rows) on a table with a foreign key relation (so compensation messages are sent). What we've encountered is that the Electric service will sometimes complain about the Postgres connection being closed. We've been progressively increasing the number of rows under test, and at around 10K it may or may not fail. When it does fail, it sometimes retries and the data then syncs correctly, but if we keep increasing the number of rows it starts failing consistently.

The tests use Electric server and client 0.6.4, and we ran them on a macOS machine (in Docker) and on a Linux server; it happens on both.

Electric logs

manabox_sync-electric-1  | 08:35:29.239 pid=<0.2824.0> origin=postgres_1 pg_slot=postgres_1 [debug] Sending 60010 messages to the subscriber: from #Lsn<0/75AA9> to #Lsn<0/108291>
manabox_sync-electric-1  | 08:36:30.160 pid=<0.2824.0> origin=postgres_1 pg_slot=postgres_1 [error] GenServer #PID<0.2824.0> terminating
manabox_sync-electric-1  | ** (MatchError) no match of right hand side value: {:error, :closed}
manabox_sync-electric-1  |     (electric 0.6.4) lib/electric/replication/postgres/tcp_server.ex:620: Electric.Replication.Postgres.TcpServer.tcp_send/2
manabox_sync-electric-1  |     (elixir 1.15.4) lib/enum.ex:984: Enum."-each/2-lists^foreach/1-0-"/2
manabox_sync-electric-1  |     (electric 0.6.4) lib/electric/replication/postgres/slot_server.ex:321: Electric.Replication.Postgres.SlotServer.send_transaction/3
manabox_sync-electric-1  |     (elixir 1.15.4) lib/enum.ex:2510: Enum."-reduce/3-lists^foldl/2-0-"/3
manabox_sync-electric-1  |     (electric 0.6.4) lib/electric/replication/postgres/slot_server.ex:275: Electric.Replication.Postgres.SlotServer.handle_events/3
manabox_sync-electric-1  |     (gen_stage 1.2.1) lib/gen_stage.ex:2578: GenStage.consumer_dispatch/6
manabox_sync-electric-1  |     (stdlib 4.3.1.2) gen_server.erl:1123: :gen_server.try_dispatch/4
manabox_sync-electric-1  |     (stdlib 4.3.1.2) gen_server.erl:1200: :gen_server.handle_msg/6
... // Last Message
manabox_sync-electric-1  | 08:36:30.176 pid=<0.2905.0> origin=postgres_1 pg_slot=postgres_1 [debug] slot server started, registered as {:n, :l, {Electric.Replication.Postgres.SlotServer, "postgres_1"}} and {:n, :l, {Electric.Replication.Postgres.SlotServer, {:slot_name, "postgres_1"}}}

Postgres logs

manabox_sync-postgres-1  | 2023-10-03 08:37:04.899 GMT [184] ERROR:  could not receive data from WAL stream: server closed the connection unexpectedly
manabox_sync-postgres-1  |              This probably means the server terminated abnormally
manabox_sync-postgres-1  |              before or while processing the request.
manabox_sync-postgres-1  | 2023-10-03 08:37:04.901 GMT [1] LOG:  background worker "logical replication worker" (PID 184) exited with exit code 1
manabox_sync-postgres-1  | 2023-10-03 08:37:04.902 GMT [298] LOG:  logical replication apply worker for subscription "postgres_1" has started

Extra

Kind of on topic: in terms of server performance, would there be any difference between one user syncing 10K oplogs and 1K users syncing 10 oplogs each?
If you know of any tool we could use to test with a higher number of users, we'd love to hear about it.

@alco (Member) commented Oct 4, 2023

Hey @js2702. Thanks a lot for sharing your findings!

We have a load-testing/perf-analysis project on our roadmap, but haven't quite got to it yet. Dealing with large amounts of data is definitely something that can be improved, for example by using bulk operations and a more compact subprotocol for data transfer between the client and the server and between Electric and PG.

Kind of on topic: in terms of server performance, would there be any difference between one user syncing 10K oplogs and 1K users syncing 10 oplogs each?

In theory, there shouldn't be a difference. Electric fans in all incoming client writes into a single stream that is then fed into PG via logical replication.

If you know of any tool we could use to test with a higher number of users, we'd love to hear about it.

Could you share some details about the toolset you're currently using to run those tests?

@js2702 (Contributor, Author) commented Oct 4, 2023

Right now we are using a script that reuses part of our application to mass-import CSV files.

To check performance and network bandwidth we are using cAdvisor and Prometheus for analytics.

version: "3.8"
name: docker_metrics

services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    privileged: true
    devices:
      - "/dev/kmsg"
    ports:
      - 8080:8080

    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - 9090:9090
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    depends_on:
      - cadvisor

And the prometheus.yml config

scrape_configs:
  - job_name: cadvisor
    scrape_interval: 5s
    static_configs:
      - targets:
          - cadvisor:8080

We are measuring outgoing bytes from the Electric container and incoming bytes into the Postgres container, then subtracting one from the other to get an approximation of what a hosting provider like GCP could charge for egress (a combined query is sketched after the individual ones below).

Prometheus queries
increase(container_network_receive_bytes_total{name="postgres-1"}[30s])
increase(container_network_transmit_bytes_total{name="electric-1"}[30s])
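
If it's useful, the subtraction can also be done in a single query. This is just a sketch under the same assumptions as above (the container names electric-1 and postgres-1 and the 30s window are taken from the queries we're already running); the sum() wrappers are only there so the two series can be subtracted regardless of their labels:

sum(increase(container_network_transmit_bytes_total{name="electric-1"}[30s]))
  - sum(increase(container_network_receive_bytes_total{name="postgres-1"}[30s]))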

@alco (Member) commented Oct 5, 2023

@js2702 Thank you for those details!

@balegas added the Improvement Created by Linear-GitHub Sync label Oct 9, 2023