
MySQL connection timeouts and pod restarts #74

Open
gregzip opened this issue Apr 4, 2023 · 25 comments · Fixed by fiskaltrust/middleware#164
Assignees
Labels
bug Something isn't working

Comments

gregzip commented Apr 4, 2023

Describe the bug
Some time after the pods are started, requests to them begin failing and MySQL connection errors show up in the logs. These pods then frequently fail their health checks and are eventually restarted by K8s.

To Reproduce
Let pods process requests for some time.

Expected behavior
Requests are processed correctly.

Screenshots
N/A

STDOUT/STDERR

fail: fiskaltrust.Middleware.Queue.JournalProcessor[0]
      An error occured while processing the Journal request.
      System.AggregateException: One or more errors occurred. (Connect Timeout expired.)
       ---> MySqlConnector.MySqlException (0x80004005): Connect Timeout expired.
         at MySqlConnector.MySqlConnection.CreateSessionAsync(ConnectionPool pool, Int32 startTickCount, Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 948
         at MySqlConnector.MySqlConnection.OpenAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 455
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.Configuration.MySQLCashBoxRepository.GetAsync()
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.MySQLConfigurationRepository.GetCashBoxListAsync()
         at fiskaltrust.Middleware.Queue.JournalProcessor.GetConfiguration()
         --- End of inner exception stack trace ---
         at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
         at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
         at fiskaltrust.Middleware.Queue.JournalProcessor.ProcessAsync(JournalRequest request)
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing configuration data for queue de44da25-f040-47d7-81c9-b573a85e0318 (UrlType: rest; Url: rest://localhost:1500/de44da25-f040-47d7-81c9-b573a85e0318)
      System.AggregateException: One or more errors occurred. (Connect Timeout expired.)
       ---> MySqlConnector.MySqlException (0x80004005): Connect Timeout expired.
         at MySqlConnector.MySqlConnection.CreateSessionAsync(ConnectionPool pool, Int32 startTickCount, Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 948
         at MySqlConnector.MySqlConnection.OpenAsync(Nullable`1 ioBehavior, CancellationToken cancellationToken) in /_/src/MySqlConnector/MySqlConnection.cs:line 455
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.Configuration.MySQLCashBoxRepository.GetAsync()
         at fiskaltrust.Middleware.Storage.MySQL.Repositories.MySQLConfigurationRepository.GetCashBoxListAsync()
         at fiskaltrust.Middleware.Queue.JournalProcessor.GetConfiguration()
         --- End of inner exception stack trace ---
         at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
         at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
         at fiskaltrust.Middleware.Queue.JournalProcessor.ProcessAsync(JournalRequest request)
         at fiskaltrust.Middleware.Queue.Queue.Journal(Int64 ftJournalType, Int64 from, Int64 to)
         at fiskaltrust.Middleware.Helper.Helipad.AsyncEnumerablePOSWrapper.JournalAsync(JournalRequest request)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)

POSSystem (please complete the following information):

  • b2460c21-f12b-4b2c-9bf4-a9487f8ede57 (Memento 8.225.0)

Cashbox Information (please complete the following information):
The problem doesn't seem cashbox-dependent at all. Some examples:

  • 7bb401c1-d83d-4342-8031-66e5335f34c9
  • c8316716-6112-4f60-ab92-84da75a42f55
  • 6cbd8538-5840-4202-ab07-b3c7c3dd3c4a
  • 419062af-57f3-4e1f-af02-02ad43e85cb2
  • 4386f17b-1de9-4ae5-aae5-978008eda87f

Additional context
We have 1k+ active cashboxes.

The fewer instances we run, the sooner we start seeing these errors. With 3 instances, errors started showing up within minutes of starting them; with 20 instances it took a few hours.

After adding new instances, old ones are still experiencing the errors until they are restarted.

The database connection limit is 2000 and the number of active connections is always below 700. The errors also showed up when we temporarily spun up a 96-CPU database instance, so a database issue is very unlikely.

gregzip commented Apr 5, 2023

We just tested the hypothesis that some connections are being leaked and are blocking the connection pool. We set our MySQL instance to auto-terminate idle connections after 30 s; the number of concurrent connections dropped, but the problem is still there.

volllly commented Apr 11, 2023

Hey, we'll look into these three issues (#74, #75 and #76) this week.

Can you confirm that you are running into these with version 1.3.45-rc1-bullseye?

gregzip commented Apr 11, 2023 via email

volllly commented Apr 17, 2023

Hey, we have not yet been able to reproduce this issue.

Do you also experience this issue when using only one instance?
And could you get us an SQL dump of the database of an affected queue (the database name is the queue ID without the dashes)? You can upload the dump here.

@forsthug

Hi Gregzip,
please set the timeouts in the connection string according to your requirements. Both the connection and the command timeout need to be set, e.g.:
.....;Connect Timeout=120;default command timeout=120;
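
For illustration, a minimal C# sketch of how those two options could be set with MySqlConnector's MySqlConnectionStringBuilder; the server, credentials, and database below are placeholders, and the 120 s values simply mirror the example above:

    using MySqlConnector;

    // Sketch only: server, credentials, and database are placeholders; the 120 s
    // values mirror the connection string example above and should be tuned.
    var builder = new MySqlConnectionStringBuilder
    {
        Server = "mysql.example.internal",
        Database = "d11f0f217e7c42359860ca52a786b353", // queue ID without dashes
        UserID = "middleware",
        Password = "********",
        ConnectionTimeout = 120,       // maps to "Connect Timeout" (seconds)
        DefaultCommandTimeout = 120    // maps to "Default Command Timeout" (seconds)
    };

    using var connection = new MySqlConnection(builder.ConnectionString);
    connection.Open();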
br

volllly commented May 2, 2023

Hey @gregzip, if this problem still persists: we've just released 1.3.46-rc1-bullseye, which should improve stability and performance.

Can you please update and check whether the problems persist?

gregzip commented May 3, 2023

Hi, it still persists. We were planning to switch from Aurora to regular MySQL to make sure these are not some exotic driver issues again. We'll have a go with 1.3.46-rc1-bullseye then.

Btw, what happened to 1.3.45-rc2-bullseye? We were recommended to use it as a fix for #75, but it now seems to be gone (and this caused an outage on our side, as K8s was no longer able to pull the image from the registry).

cc @fpenz @derekmcauley7

volllly commented May 3, 2023

I apologize for that. 1.3.45-rc2-bullseye is available again.

There was an issue in our CI/CD pipeline yesterday. We've identified and fixed the problem, so it will not occur again :)

gregzip commented May 11, 2023

Hey! There was a slight delay in our verification due to limited load on Monday and Tuesday. After some further checks on 1.3.46-rc1-bullseye, the problem still occurs but is less frequent (it actually takes a bit more time to show up, and then it's consistently there until a pod restarts).

We're continuing with our attempt to migrate the DB, but it has its own challenges (it's 600+ GB at the moment). Is there a way to make it smaller somehow? Rebuilding some cashboxes or removing some unnecessary historical data perhaps? 😅

volllly commented May 15, 2023

Hey, thanks for the update 😁

Can you check how often you see the message Worker created. cashboxid: ... in your logs on a typical day?

Also, can you take a memory dump of a pod that is currently experiencing the problem (we have a guide here) and upload the dump, together with logs from multiple pods covering a whole day, here?

Regarding the DB migration: if you provide us with a list of all queues in the database, we can check whether all data has already been uploaded (or up to which entry it has been uploaded), and you can then take a backup and remove the already uploaded entries.

Also a size analysis of the database would be interesting (size of each table for a representative queue). I suspect that the largest table is the ftJournalDE table. This could be a symptom of open transactions on the TSE. See this article on how to close open transactions before each daily closing or manual export.
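
For reference, a hedged sketch of the kind of per-table size query (information_schema, queried here through MySqlConnector) that could drive such an analysis; the column names and the MB rounding are assumptions, not an exact recipe:

    using System;
    using MySqlConnector;

    // Sketch only: list the ten largest tables with their data/index sizes in MB.
    // Connection details are placeholders.
    const string sizeQuery = @"
        SELECT table_schema AS database_name,
               table_name   AS TABLE_NAME,
               ROUND((data_length + index_length) / 1024 / 1024, 2) AS total_size,
               ROUND(data_length  / 1024 / 1024, 2) AS data_size,
               ROUND(index_length / 1024 / 1024, 2) AS index_size
        FROM information_schema.tables
        ORDER BY data_length + index_length DESC
        LIMIT 10;";

    using var connection = new MySqlConnection("Server=mysql.example.internal;User ID=middleware;Password=********");
    connection.Open();
    using var command = new MySqlCommand(sizeQuery, connection);
    using var reader = command.ExecuteReader();
    while (reader.Read())
        Console.WriteLine($"{reader.GetString(0)}.{reader.GetString(1)}: {reader.GetDecimal(2)} MB total");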

gregzip commented May 18, 2023

Hi!

Regarding the worker created message, I think it might not be representative because we were restarting all the pods every 30 mins to avoid the DB connection issues.

That said, those issues seem to be gone after we moved to MySQL Community and increased the number of pods to 15. I think it's the latter that helped. My current hypothesis is that we were running into MySqlConnector's default pool size limit of 100 - we had 10 pods running for ~1300 active cashboxes. Would that make sense from your perspective?

As I said, we managed to migrate to a regular MySQL server with all of that data in place. I had a quick look at the biggest tables, and indeed these are ftJournalDE:

+----------------------------------+-------------+------------+-----------+------------+
| database_name                    | TABLE_NAME  | total_size | data_size | index_size |
+----------------------------------+-------------+------------+-----------+------------+
| d11f0f217e7c42359860ca52a786b353 | ftJournalDE |    3128.56 |   3128.52 |       0.05 |
| 2c87fccb64314d61923f113404fc528a | ftJournalDE |    3058.56 |   3058.52 |       0.05 |
| 4c040bc573b240b8b382192de1e70b01 | ftJournalDE |    2788.56 |   2788.52 |       0.05 |
| 8567cc8b02f548d482612a5177c05c0d | ftJournalDE |    2761.56 |   2761.52 |       0.05 |
| 229d32f98dbb4c7aa15d2242225f2892 | ftJournalDE |    2551.56 |   2551.52 |       0.05 |
| 79bec6a01fe84f3ead13dd402b43e45f | ftJournalDE |    2505.56 |   2505.52 |       0.05 |
| 3b2f12c90fb740e084fabdffe0eba9d4 | ftJournalDE |    2499.53 |   2499.52 |       0.02 |
| cfd306fd24c140a5be5e025815709a2d | ftJournalDE |    2494.56 |   2494.52 |       0.05 |
| 5c031a79d4d7470ab1b485713c3623db | ftJournalDE |    2442.56 |   2442.52 |       0.05 |
| 236b77b825a64a439115f9b1a76627aa | ftJournalDE |    2388.56 |   2388.52 |       0.05 |
+----------------------------------+-------------+------------+-----------+------------+

I checked that there are indeed open transactions for them, and I tried to close them for a couple of cashboxes (and waited for the daily closing, and even re-created the cashboxes today), but the size is still the same. 🤔

Some questions about this:

  • how long after closing open transactions should we expect the size of these tables to drop?
  • is it okay if we query for open transactions on every closing for every cashbox? (we have to automate this process somehow with the number of cashboxes we handle)

volllly commented May 26, 2023

Hey,
Yes, I think the default pool size could have been a factor.

We could try to add a config parameter for the Maximum Pool Size, and see if increasing that value while reducing the replicas to 10 again also solves the issue.


how long after closing open transactions should we expect the size of these tables to drop?

Sorry, I was not clear on that. Closing the open transactions prevents rapid growth of the database in the future but it will not shrink the database.

At the moment we don't have an automated way to delete historical data. What we can do, though, is check on our side which data was already uploaded (please provide us with a list of queues for that), and then you can remove the old ftJournalDE entries which were already uploaded.

is it okay if we query for open transactions on every closing for every cashbox? (we have to automate this process somehow with the number of cashboxes we handle)

Yes, that is perfectly fine.

gregzip commented May 30, 2023

We could try to add a config parameter for the Maximum Pool Size and see if increasing the value for that and reducing the replicas to 10 again also solves the issue.

That sounds great 🙏

Yes, that is perfectly fine.

Okay, we'll do that soon-ish. I already closed all of the currently open transactions via an ad-hoc script, so I hope this buys us some time.

gregzip commented Jun 14, 2023

Hi! Any news on this? We'd be excited to check whether the max pool size fixes our issues, as this could potentially allow us to use fewer compute resources and save quite a bit on infra costs 💸

volllly commented Jun 20, 2023

Hey, we have not forgotten about you 😅 I hope we'll be able to tackle that in one of the next sprints

volllly commented Jun 22, 2023

Hey, we've just released 1.3.47-rc1-bullseye which introduces the MaxPoolSize configuration parameter to the MySQL queue and uses 500 as the default value 🚀
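
For context, on the MySqlConnector side this corresponds to the Maximum Pool Size connection string option (whose default is 100); a minimal sketch, assuming the new MaxPoolSize parameter ends up in the connection string roughly like this:

    using MySqlConnector;

    // Sketch only: raise the pool cap on an existing (placeholder) connection string.
    // How the middleware wires its MaxPoolSize parameter in is an assumption here.
    var builder = new MySqlConnectionStringBuilder(
        "Server=mysql.example.internal;Database=d11f0f217e7c42359860ca52a786b353;User ID=middleware;Password=********")
    {
        MaximumPoolSize = 500   // maps to "Maximum Pool Size"; MySqlConnector defaults to 100
    };
    var connectionString = builder.ConnectionString;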

gregzip commented Jul 1, 2023

We tried the new version, and it seems the issue still persists 😬 It was mostly fine for a couple of days with fewer instances, so I disabled our restart cron, and today things exploded again with the same errors as before.

For now we're back to lots of instances and a cron that restarts them every night.

gregzip commented Jul 3, 2023

I did some investigation today, and it seems that the pod started failing at around 200-220 connections.

I'm going to re-create the scenario tomorrow and watch the CPU/memory usage more closely to be absolutely sure it's not about that. I should've done it before, but I somehow assumed that if the entire node's CPU usage is very low, then it's not an issue. It seems, though, that it's always an isolated pod that's failing, so that one pod might not be doing great there. I'll post my findings.

gregzip commented Jul 4, 2023

Quick update:

  • I checked the scenario with only one pod with little resources, and those errors only popped up when it was already unresponsive and being restarted by K8s
  • I then upscaled to 4 pods with more resources (1 CPU, 2 GB RAM each), and this worked fine for a while (maybe ~1h) until it started failing again
    • Max CPU usage of each pod was about 30% and memory usage about 20% of the limit
    • The number of connections didn't seem to have much impact on the errors; it fluctuated between 120 and 180 in the 2 pods that started failing, before they eventually got restarted

It seems that the issue is not about connection pooling (anymore?) and it's not about computing resources either 😬

Edit:
One more data point:
I left 14 pods running for the night but forgot to turn on our nightly auto-restart, and some of the pods started failing in the morning with very high CPU usage. I checked, and the usage kept growing throughout the night even though there were absolutely no transactions happening.
This might point to something like a thread leak or some other uncleaned processes going on? 🤔
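
For what it's worth, here is a minimal sketch of the kind of periodic counter logging that could confirm or rule out a thread leak; it's illustrative only and assumes .NET Core 3.0+, where ThreadPool.ThreadCount is available:

    using System;
    using System.Diagnostics;
    using System.Threading;
    using System.Threading.Tasks;

    // Sketch only: log thread and handle counts once a minute so a leak shows up
    // as a steadily growing number even while no transactions are being processed.
    static async Task LogRuntimeCountersAsync(CancellationToken cancellationToken)
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            using var process = Process.GetCurrentProcess();
            Console.WriteLine(
                $"{DateTime.UtcNow:O} threads={process.Threads.Count} " +
                $"threadPoolThreads={ThreadPool.ThreadCount} handles={process.HandleCount}");
            await Task.Delay(TimeSpan.FromMinutes(1), cancellationToken);
        }
    }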

volllly commented Jul 6, 2023

Hey, thanks for the detailed investigation. We'll investigate if we have a thread leak.

Could you also please get us one day's worth of logs from all pods (with log level debug)? This would help us a lot in analyzing the situation 🙏

gregzip commented Aug 11, 2023

Hey! Sorry for the late reply, we were swamped with other topics.

Here are the logs. I think the format is a bit weird, but that's what CloudWatch gave me 😕

pod1.txt
pod2.txt
pod3.txt
pod4.txt
pod5.txt
pod6.txt
pod7.txt
pod8.txt
pod9.txt
pod10.txt
pod11.txt
pod12.txt
pod13.txt
pod14.txt
pod15.txt

gregzip commented Sep 12, 2023

Hey! Any news on this?

volllly commented Oct 11, 2023

Hey, no news yet. Could you also provide us with a memory dump of a pod while it is experiencing the problems? (You can upload it here.)

FYI, here is a KBA on how to get a memory dump.

gregzip commented Oct 23, 2023

Hey there! I know we still owe you a memory dump, but this is not fixed yet, even with a larger pool size.

@volllly volllly reopened this Oct 24, 2023
volllly commented Oct 24, 2023

Sorry, this was auto-closed by mistake.

@volllly volllly self-assigned this Oct 24, 2023