

Heartbeat table updating. Positions table not. Logs show successfully processing 0 records. #2010

Open
eihli opened this issue Jun 1, 2023 · 7 comments

Comments

@eihli

eihli commented Jun 1, 2023

We've encountered this at least twice. Both times after a database restart. We're on RDS. The first time was a resize. The second time was a change to a parameter group (or something like that).

We're running in the zendesk/maxwell:v1.37.6 docker image.

docker logs --follow shows this, and it updates in real time:

17:22:52,538 INFO  LogInputStreamReader - [2023-06-01 17:22:52.538664] [0x00000043][0x00007f2efdc38700] [info] [processing_statistics_logger.cc:114] Stage 2 Triggers: { stream: 'users-db', manual: 0, count: 0, size: 0, matches: 0, timed: 0, KinesisRecords: 0, PutRecords: 0 }
17:22:52,538 INFO  LogInputStreamReader - [2023-06-01 17:22:52.538672] [0x00000043][0x00007f2efdc38700] [info] [processing_statistics_logger.cc:129] (users-db) Average Processing Time: -nan ms
17:23:07,538 INFO  LogInputStreamReader - [2023-06-01 17:23:07.538751] [0x00000043][0x00007f2efdc38700] [info] [processing_statistics_logger.cc:111] Stage 1 Triggers: { stream: 'users-db', manual: 0, count: 0, size: 0, matches: 0, timed: 0, UserRecords: 0, KinesisRecords: 0 }
17:23:07,538 INFO  LogInputStreamReader - [2023-06-01 17:23:07.538791] [0x00000043][0x00007f2efdc38700] [info] [processing_statistics_logger.cc:114] Stage 2 Triggers: { stream: 'users-db', manual: 0, count: 0, size: 0, matches: 0, timed: 0, KinesisRecords: 0, PutRecords: 0 }
17:23:07,538 INFO  LogInputStreamReader - [2023-06-01 17:23:07.538799] [0x00000043][0x00007f2efdc38700] [info] [processing_statistics_logger.cc:129] (users-db) Average Processing Time: -nan ms

Running SELECT * FROM maxwell.heartbeats; a few times shows the heartbeat updating.

Running SELECT * FROM maxwell.positions; repeatedly never shows any update. It's stuck at a last_heartbeat_read and a binlog_file from hours ago.
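For reference, these are the two checks described above, run against Maxwell's default schema (table names per the stock maxwell schema; adjust if yours differ):

```sql
-- Healthy case: the heartbeat value advances on repeated runs.
SELECT * FROM maxwell.heartbeats;

-- In this failure mode: last_heartbeat_read and binlog_file stay frozen
-- at values from hours ago, even while the heartbeats table keeps updating.
SELECT * FROM maxwell.positions;
```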

docker restart fixes the issue.

Any ideas? Threads to pull on? More info I can provide?

Thanks for any help.

@osheroff
Collaborator

Is data flowing during this time? I.e., is the whole system stuck, or just heartbeats?

@liuxiaomingg

We encountered the same problem. The data flows normally.

@eihli
Author

eihli commented Jun 13, 2023

Is data flowing during this time? I.e., is the whole system stuck, or just heartbeats?

In my case, the logs always show 0 UserRecords and 0 KinesisRecords even though rows are being updated in the database.

{ stream: 'users-db', manual: 0, count: 0, size: 0, matches: 0, timed: 0, UserRecords: 0, KinesisRecords: 0 }

Also, Maxwell's positions table doesn't update. The row in that table that tracks the latest binary log file and position never changes.

But the heartbeats table does update.

@osheroff
Collaborator

Definitely sounds like Maxwell's binlog replication thread is stuck or dead; the fact that the heartbeats table is updating doesn't necessarily mean it's getting any rows.

Can you capture complete logs from the system? The Kinesis logs you're showing don't necessarily mean that much, but maybe there are errors, warnings, or info messages before that?

Curious when you say "after a database restart" -- does maxwell not die when the database first restarts? or is this when maxwell restarts after the db restarts?

If all else fails you might want to try running with --binlog_heartbeat, which is good at detecting connections that go dead / unresponsive for miscellaneous reasons.
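For anyone else hitting this: the flag can also be set in config.properties. The property spelling below is an assumption based on Maxwell's usual flag-to-property mapping; check `maxwell --help` for your version:

```properties
# config.properties (sketch, assumed key name)
# Enable heartbeats on the binlog connection itself, so a silently dead
# replication connection is detected and re-established.
binlog_heartbeat=true
```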

@liuxiaomingg

Nothing looks wrong in the console output. When we restart Maxwell, the service goes back to normal.

@liuxiaomingg

There's no obvious error in the console logs, either.

@eihli
Author

eihli commented Jun 13, 2023

Can you capture complete logs from the system? The Kinesis logs you're showing don't necessarily mean that much, but maybe there are errors, warnings, or info messages before that?

I don't have the full logs but I did read them and nothing jumped out.

Curious when you say "after a database restart" -- does maxwell not die when the database first restarts? or is this when maxwell restarts after the db restarts?

We have it set to unlimited reconnection retries. I think I saw it retry a few thousand times over a minute or so (those numbers are from memory; don't count on them being accurate), and then when the database came back up it reconnected and started logging as if everything was A-OK.

If all else fails you might want to try running with --binlog_heartbeat, which is good at detecting connections that go dead / unresponsive for miscellaneous reasons.

I'll give this a shot. Thanks!

One other note: this isn't 100% reproducible for us. Sometimes after Maxwell can't connect for a ~minute it will reconnect and continue processing data normally. Rough estimate is that we've seen it break ~3 times out of ~20 database restarts. Perhaps of note is that usually when we do a database restart it's because we're resizing an RDS instance. I didn't expect that to be an issue because we're not upgrading MySQL, changing IP address, or doing anything else that I'd expect to affect the connection/binlog.

If nothing jumps out as an "ah-hah" from this description, I can try to narrow in on making it reproducible.

3 participants