
MySQL connection timeouts and pod restarts #74

Open
gregzip opened this issue Apr 4, 2023 · 25 comments · Fixed by fiskaltrust/middleware#164
Assignees
Labels
bug Something isn't working

Comments

gregzip commented Apr 4, 2023

Describe the bug
Some time after the pods are started, requests to them begin failing and MySQL connection errors show up in the logs. These pods then frequently fail their health checks and are eventually restarted by K8s.

To Reproduce
Let pods process requests for some time.

Expected behavior
Requests are processed correctly.

Screenshots
N/A

STDOUT/STDERR

fail: fiskaltrust.Middleware.Queue.JournalProcessor[0]
      An error occured while processing the Journal request.
      System.AggregateException: One or more errors occurred. (Connect Timeout expired.)
       ---> MySqlConnector.MySqlException (0x80004005): Connect Timeout expired.
         at MySqlConnector.MySqlConnection.CreateSessionAsync(ConnectionPool pool, Int32 startTickCount, Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 948
         at MySqlConnector.MySqlConnection.OpenAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 455
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.Configuration.MySQLCashBoxRepository.GetAsync()
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.MySQLConfigurationRepository.GetCashBoxListAsync()
         at fiskaltrust.Middleware.Queue.JournalProcessor.GetConfiguration()
         --- End of inner exception stack trace ---
         at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
         at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
         at fiskaltrust.Middleware.Queue.JournalProcessor.ProcessAsync(JournalRequest request)
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing configuration data for queue de44da25-f040-47d7-81c9-b573a85e0318 (UrlType: rest; Url: rest://localhost:1500/de44da25-f040-47d7-81c9-b573a85e0318)
      System.AggregateException: One or more errors occurred. (Connect Timeout expired.)
       ---> MySqlConnector.MySqlException (0x80004005): Connect Timeout expired.
         at MySqlConnector.MySqlConnection.CreateSessionAsync(ConnectionPool pool, Int32 startTickCount, Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 948
         at MySqlConnector.MySqlConnection.OpenAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 455
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.Configuration.MySQLCashBoxRepository.GetAsync()
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.MySQLConfigurationRepository.GetCashBoxListAsync()
         at fiskaltrust.Middleware.Queue.JournalProcessor.GetConfiguration()
         --- End of inner exception stack trace ---
         at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
         at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
         at fiskaltrust.Middleware.Queue.JournalProcessor.ProcessAsync(JournalRequest request)
         at fiskaltrust.Middleware.Queue.Queue.Journal(Int64 ftJournalType, Int64 from, Int64 to)
         at fiskaltrust.Middleware.Helper.Helipad.AsyncEnumerablePOSWrapper.JournalAsync(JournalRequest request)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)

POSSystem (please complete the following information):

  • b2460c21-f12b-4b2c-9bf4-a9487f8ede57 (Memento 8.225.0)

Cashbox Information (please complete the following information):
The problem doesn't seem cashbox-dependent at all. Some examples:

  • 7bb401c1-d83d-4342-8031-66e5335f34c9
  • c8316716-6112-4f60-ab92-84da75a42f55
  • 6cbd8538-5840-4202-ab07-b3c7c3dd3c4a
  • 419062af-57f3-4e1f-af02-02ad43e85cb2
  • 4386f17b-1de9-4ae5-aae5-978008eda87f

Additional context
We have 1k+ active cashboxes.

The fewer instances we run, the sooner we start seeing these errors. With 3 instances, errors started showing up within minutes of starting them; with 20 instances it took a few hours.

After adding new instances, old ones are still experiencing the errors until they are restarted.

The database connection limit is 2000 and the number of active connections is always below 700. The errors also showed up when we temporarily spun up a 96-CPU database instance, so a database issue is very unlikely.

gregzip commented Apr 5, 2023

We just tested the hypothesis that some connections are being leaked and are blocking the connection pool. We set our MySQL instance to auto-terminate idle connections after 30 s; the number of concurrent connections dropped, but the problem is still there.

volllly commented Apr 11, 2023

Hey, we'll look into these three issues (#74, #75 and #76) this week.

Can you confirm that you are running into these with version 1.3.45-rc1-bullseye?

gregzip commented Apr 11, 2023 via email

volllly commented Apr 17, 2023

Hey, we have not yet been able to reproduce this issue.

Do you also experience this issue when using only one instance?
And could you get us an SQL dump of the database of an affected queue (the database name is the queue ID without the dashes)? You can upload the dump here.

@forsthug

Hi Gregzip,
please set the timeouts in the connection string according to your requirements. Both the connection and the command timeout need to be set, e.g.:
.....;Connect Timeout=120;default command timeout=120;
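
For illustration, a minimal C# sketch of how those two options could be set with MySqlConnector's MySqlConnectionStringBuilder; the server, credentials, and database below are placeholders, and the 120 s values simply mirror the example above:

    using MySqlConnector;

    // Sketch only: server, credentials, and database are placeholders; the 120 s
    // values mirror the connection string example above and should be tuned.
    var builder = new MySqlConnectionStringBuilder
    {
        Server = "mysql.example.internal",
        Database = "d11f0f217e7c42359860ca52a786b353", // queue ID without dashes
        UserID = "middleware",
        Password = "********",
        ConnectionTimeout = 120,       // maps to "Connect Timeout" (seconds)
        DefaultCommandTimeout = 120    // maps to "Default Command Timeout" (seconds)
    };

    using var connection = new MySqlConnection(builder.ConnectionString);
    connection.Open();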
br

volllly commented May 2, 2023

Hey @gregzip, if this problem still persists: we've just released 1.3.46-rc1-bullseye, which should improve stability and performance.

Can you please update and check whether the problems persist?

gregzip commented May 3, 2023

Hi, it still persists. We were planning to switch from Aurora to regular MySQL to make sure these are not some exotic driver issues again. We'll have a go with 1.3.46-rc1-bullseye then.

Btw, what happened to 1.3.45-rc2-bullseye? We were recommended to use it as a fix for #75, but it now seems to be gone (and this caused an outage on our side, as K8s was no longer able to pull the image from the registry).

cc @fpenz @derekmcauley7

volllly commented May 3, 2023

I apologize for that. 1.3.45-rc2-bullseye is available again.

There was an issue in our CI/CD pipeline yesterday. We've identified and fixed the problem, so it will not occur again :)

gregzip commented May 11, 2023

Hey! There was a slight delay in our verification due to limited load on Monday and Tuesday. After some further checks on 1.3.46-rc1-bullseye, the problem still occurs but is less frequent (it actually takes a bit more time to show up, and then it's consistently there until a pod restarts).

We're continuing with our attempt to migrate the DB, but it has its own challenges (it's 600+ GB at the moment). Is there a way to make it smaller somehow? Rebuilding some cashboxes or removing some unnecessary historical data perhaps? 😅

volllly commented May 15, 2023

Hey, thanks for the update 😁

Can you check how often you see the message Worker created. cashboxid: ... in your logs on a typical day?

Also, can you take a memory dump of a pod that is currently experiencing the problem (we have a guide here) and upload the dump, together with logs from multiple pods covering a whole day, here?

Regarding the DB migration: if you provide us with a list of all queues in the database, we can check whether all data has already been uploaded (or up to which entry it has been uploaded), and you can then take a backup and remove the already uploaded entries.

Also a size analysis of the database would be interesting (size of each table for a representative queue). I suspect that the largest table is the ftJournalDE table. This could be a symptom of open transactions on the TSE. See this article on how to close open transactions before each daily closing or manual export.
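
For reference, a hedged sketch of the kind of per-table size query (information_schema, queried here through MySqlConnector) that could drive such an analysis; the column names and the MB rounding are assumptions, not an exact recipe:

    using System;
    using MySqlConnector;

    // Sketch only: list the ten largest tables with their data/index sizes in MB.
    // Connection details are placeholders.
    const string sizeQuery = @"
        SELECT table_schema AS database_name,
               table_name   AS TABLE_NAME,
               ROUND((data_length + index_length) / 1024 / 1024, 2) AS total_size,
               ROUND(data_length  / 1024 / 1024, 2) AS data_size,
               ROUND(index_length / 1024 / 1024, 2) AS index_size
        FROM information_schema.tables
        ORDER BY data_length + index_length DESC
        LIMIT 10;";

    using var connection = new MySqlConnection("Server=mysql.example.internal;User ID=middleware;Password=********");
    connection.Open();
    using var command = new MySqlCommand(sizeQuery, connection);
    using var reader = command.ExecuteReader();
    while (reader.Read())
        Console.WriteLine($"{reader.GetString(0)}.{reader.GetString(1)}: {reader.GetDecimal(2)} MB total");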

gregzip commented May 18, 2023

Hi!

Regarding the worker created message, I think it might not be representative because we were restarting all the pods every 30 mins to avoid the DB connection issues.

That said, those issues seem to be gone after we moved to MySQL Community and increased the number of pods to 15. I think it's the latter that helped. My current hypothesis is that we were running into MySqlConnector's default pool size limit of 100 - we had 10 pods running for ~1300 active cashboxes. Would that make sense from your perspective?

As I said, we managed to migrate to a regular MySQL server with all of that data in place. I had a quick look at the biggest tables, and indeed these are ftJournalDE:

+----------------------------------+-------------+------------+-----------+------------+
| database_name                    | TABLE_NAME  | total_size | data_size | index_size |
+----------------------------------+-------------+------------+-----------+------------+
| d11f0f217e7c42359860ca52a786b353 | ftJournalDE |    3128.56 |   3128.52 |       0.05 |
| 2c87fccb64314d61923f113404fc528a | ftJournalDE |    3058.56 |   3058.52 |       0.05 |
| 4c040bc573b240b8b382192de1e70b01 | ftJournalDE |    2788.56 |   2788.52 |       0.05 |
| 8567cc8b02f548d482612a5177c05c0d | ftJournalDE |    2761.56 |   2761.52 |       0.05 |
| 229d32f98dbb4c7aa15d2242225f2892 | ftJournalDE |    2551.56 |   2551.52 |       0.05 |
| 79bec6a01fe84f3ead13dd402b43e45f | ftJournalDE |    2505.56 |   2505.52 |       0.05 |
| 3b2f12c90fb740e084fabdffe0eba9d4 | ftJournalDE |    2499.53 |   2499.52 |       0.02 |
| cfd306fd24c140a5be5e025815709a2d | ftJournalDE |    2494.56 |   2494.52 |       0.05 |
| 5c031a79d4d7470ab1b485713c3623db | ftJournalDE |    2442.56 |   2442.52 |       0.05 |
| 236b77b825a64a439115f9b1a76627aa | ftJournalDE |    2388.56 |   2388.52 |       0.05 |
+----------------------------------+-------------+------------+-----------+------------+

I checked that there are indeed open transactions for them, and I tried to close them for a couple of cashboxes (and waited for the daily closing, and even re-created the cashboxes today), but the size is still the same. 🤔

Some questions about this:

  • how long after closing open transactions should we expect the size of these tables to drop?
  • is it okay if we query for open transactions on every closing for every cashbox? (we have to automate this process somehow with the number of cashboxes we handle)

volllly commented May 26, 2023

Hey,
Yes, I think the default pool size could have been a factor.

We could try to add a config parameter for the Maximum Pool Size, and see if increasing that value while reducing the replicas to 10 again also solves the issue.


how long after closing open transactions should we expect the size of these tables to drop?

Sorry, I was not clear on that. Closing the open transactions prevents rapid growth of the database in the future but it will not shrink the database.

At the moment we don't have an automated way to delete historical data. What we can do, though, is check on our side which data was already uploaded (please provide us with a list of queues for that), and then you can remove the old ftJournalDE entries which were already uploaded.

is it okay if we query for open transactions on every closing for every cashbox? (we have to automate this process somehow with the number of cashboxes we handle)

Yes, that is perfectly fine.

gregzip commented May 30, 2023

We could try to add a config parameter for the Maximum Pool Size and see if increasing the value for that and reducing the replicas to 10 again also solves the issue.

That sounds great 🙏

Yes, that is perfectly fine.

Okay, we'll do that soon-ish. I already closed all of the currently open transactions via an ad-hoc script, so I hope this buys us some time.

gregzip commented Jun 14, 2023

Hi! Any news on this? We'd be excited to check whether the max pool size fixes our issues, as this could potentially allow us to use fewer compute resources and save quite a bit on infra costs 💸

volllly commented Jun 20, 2023

Hey, we have not forgotten about you 😅 I hope we'll be able to tackle that in one of the next sprints

volllly commented Jun 22, 2023

Hey, we've just released 1.3.47-rc1-bullseye which introduces the MaxPoolSize configuration parameter to the MySQL queue and uses 500 as the default value 🚀
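
For context, on the MySqlConnector side this corresponds to the Maximum Pool Size connection string option (whose default is 100); a minimal sketch, assuming the new MaxPoolSize parameter ends up in the connection string roughly like this:

    using MySqlConnector;

    // Sketch only: raise the pool cap on an existing (placeholder) connection string.
    // How the middleware wires its MaxPoolSize parameter in is an assumption here.
    var builder = new MySqlConnectionStringBuilder(
        "Server=mysql.example.internal;Database=d11f0f217e7c42359860ca52a786b353;User ID=middleware;Password=********")
    {
        MaximumPoolSize = 500   // maps to "Maximum Pool Size"; MySqlConnector defaults to 100
    };
    var connectionString = builder.ConnectionString;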

gregzip commented Jul 1, 2023

We tried the new version, and it seems the issue still persists 😬 It was mostly fine for a couple of days with fewer instances, so I disabled our restart cron, and today things exploded again with the same errors as before.

For now we're back to lots of instances and a cron that restarts them every night.

gregzip commented Jul 3, 2023

I did some investigation today, and it seems that the pod started failing at around 200-220 connections.

I'm going to re-create the scenario tomorrow and watch the CPU/memory usage more closely to be absolutely sure it's not about that. I should've done it before, but I somehow assumed that if the entire node's CPU usage is very low, then it's not an issue. It seems, though, that it's always an isolated pod that's failing, so that one pod might not be doing great there. I'll post my findings.

gregzip commented Jul 4, 2023

Quick update:

  • I checked the scenario with only one pod with little resources, and those errors only popped up when it was already unresponsive and being restarted by K8s
  • I then upscaled to 4 pods with more resources (1 CPU, 2 GB RAM each), and this worked fine for a while (maybe ~1h) until it started failing again
    • Max CPU usage of each pod was about 30% and memory usage about 20% of the limit
    • The number of connections didn't seem to have much impact on the errors; it fluctuated between 120 and 180 in the 2 pods that started failing, before they eventually got restarted

It seems that the issue is not about connection pooling (anymore?) and it's not about computing resources either 😬

Edit:
One more data point:
I left 14 pods running for the night but forgot to turn on our nightly auto-restart, and some of the pods started failing in the morning with very high CPU usage. I checked, and the usage kept growing throughout the night even though there were absolutely no transactions happening.
This might point to something like a thread leak or some other uncleaned processes going on? 🤔
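
For what it's worth, here is a minimal sketch of the kind of periodic counter logging that could confirm or rule out a thread leak; it's illustrative only and assumes .NET Core 3.0+, where ThreadPool.ThreadCount is available:

    using System;
    using System.Diagnostics;
    using System.Threading;
    using System.Threading.Tasks;

    // Sketch only: log thread and handle counts once a minute so a leak shows up
    // as a steadily growing number even while no transactions are being processed.
    static async Task LogRuntimeCountersAsync(CancellationToken cancellationToken)
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            using var process = Process.GetCurrentProcess();
            Console.WriteLine(
                $"{DateTime.UtcNow:O} threads={process.Threads.Count} " +
                $"threadPoolThreads={ThreadPool.ThreadCount} handles={process.HandleCount}");
            await Task.Delay(TimeSpan.FromMinutes(1), cancellationToken);
        }
    }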

volllly commented Jul 6, 2023

Hey, thanks for the detailed investigation. We'll investigate if we have a thread leak.

Could you also please get us one day's worth of logs from all pods (with log level debug)? This would help us a lot in analyzing the situation 🙏

gregzip commented Aug 11, 2023

Hey! Sorry for the late reply, we were swamped with other topics.

Here are the logs. I think the format is a bit weird, but that's what CloudWatch gave me 😕

pod1.txt
pod2.txt
pod3.txt
pod4.txt
pod5.txt
pod6.txt
pod7.txt
pod8.txt
pod9.txt
pod10.txt
pod11.txt
pod12.txt
pod13.txt
pod14.txt
pod15.txt

gregzip commented Sep 12, 2023

Hey! Any news on this?

volllly commented Oct 11, 2023

Hey, no news yet. Could you also provide us with a memory dump of a pod while it is experiencing the problems? (You can upload it here.)

FYI, here is a KBA on how to get a memory dump.

gregzip commented Oct 23, 2023

Hey there! I know we still owe you a memory dump, but this is not fixed yet, even with a larger pool size.

@volllly volllly reopened this Oct 24, 2023
volllly commented Oct 24, 2023

Sorry, this was auto-closed by mistake.

@volllly volllly self-assigned this Oct 24, 2023