
Failed to upload to api/service/queue. Statuscode: InternalServerError #76

Open
gregzip opened this issue Apr 4, 2023 · 25 comments
Labels
bug Something isn't working

Comments

@gregzip

gregzip commented Apr 4, 2023

Describe the bug
We're seeing multiple errors with almost no description, pretty much just Failed to upload to api/service/queue. Statuscode: InternalServerError.

To Reproduce
Let pods process requests for some time.

Expected behavior
Requests are processed correctly.

Screenshots
N/A

STDOUT/STDERR

fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService[0]
Failed to upload to api/service/queue. Statuscode: InternalServerError; Response message:

(yes, the response message is empty)

POSSystem (please complete the following information):

  • b2460c21-f12b-4b2c-9bf4-a9487f8ede57 (Memento 8.225.0)

Cashbox Information (please complete the following information):
Hard to say; we send hundreds of requests, some of which fail because of #74 and #75, and the error message doesn't identify the cashbox.

Additional context
It's less important than #74 but still annoying as it triggers alerts on our side.

@gregzip gregzip added the bug Something isn't working label Apr 4, 2023
@gregzip
Author

gregzip commented May 15, 2023

I was able to isolate this problem on a Sunday, when there's almost no other load, and it seemed to be caused by one of the queues. I changed its configuration to use the newest package version and sent a null echo, and the issue is now gone.

As far as the issue is concerned, I think we're happy to close this, but I'd be interested to know why this worked. Are those configuration updates something we should do for other queues as well? Or is it about updating the config in general, not necessarily about the configured package version? 🤔

I got the fix idea from this follow-up line in the logs:

fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
Error while pushing configuration data for queue bab0b7ef-9595-436a-bc94-f1e580da6a09
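(For completeness, the null echo itself was just an HTTP POST against the queue's local REST endpoint, roughly like the sketch below. The /echo route, the empty Message payload, and mapping the rest:// URL to plain HTTP are assumptions from our setup, so please check the middleware docs for the exact path of your hosting type.)

// Minimal sketch of the "null echo" sent against the locally hosted REST queue.
// The queue URL matches the queue id from the log line above; the "/echo" route and
// the empty Message payload are assumptions for illustration, not a verified API.
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class NullEchoSketch
{
    static async Task Main()
    {
        using var client = new HttpClient();
        var queueUrl = "http://localhost:1500/bab0b7ef-9595-436a-bc94-f1e580da6a09";

        var payload = new StringContent("{\"Message\": null}", Encoding.UTF8, "application/json");
        var response = await client.PostAsync($"{queueUrl}/echo", payload);

        Console.WriteLine($"{(int)response.StatusCode}: {await response.Content.ReadAsStringAsync()}");
    }
}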

@gregzip
Author

gregzip commented May 16, 2023

Unfortunately this wasn't a permanent fix - after some time both errors started showing up again for the same queue / cashbox.

@gregzip
Copy link
Author

gregzip commented May 16, 2023

One more little update: these two kinds of errors, "Failed to upload to api/service/queue." and "Error while pushing configuration data for queue", seem to be related (you probably know that better than me 😬), but they don't come in pairs. We sometimes get quite a few "Failed to upload to api/service/queue." errors for zero or one instance of the latter. This seems to suggest that other queues / cashboxes are affected as well, but the log message doesn't say which ones.

@volllly
Contributor

volllly commented May 17, 2023

Hey, thanks for the update. With the added information we were able to find the problem 🎉

We'll release a fix this week or at the beginning of next week. I'll notify you here once it's out.

@volllly
Contributor

volllly commented May 26, 2023

Hey, my estimation was off by a bit 😅

We've just released 1.3.46-bullseye and 1.3.46-jammy including a bugfix of this issue 🚀

@gregzip
Author

gregzip commented May 31, 2023

Hey! Just rolled out 1.3.46-bullseye, and so far it seems there are fewer of these errors, but they're not 100% gone:

fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing configuration data for queue bab0b7ef-9595-436a-bc94-f1e580da6a09 (UrlType: rest; Url: rest://localhost:1500/bab0b7ef-9595-436a-bc94-f1e580da6a09)
      fiskaltrust.Middleware.Helper.Helipad.Exceptions.HelipadException
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadDataRequestAsync(String url, Object item)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadQueueListAsync(List`1 data)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadFromStreamAsync(Stream stream)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)
--



@gregzip
Author

gregzip commented Jun 15, 2023

Hey! Any updates on this? I was doing some cleanup work and ran a couple of queries to see how common the issue is.

This is the graph with messages matching /Error while pushing configuration data/ and /Failed to upload/ excluded:
[image]

This is the graph with those 2 included:
[image]

Not only are these errors still there, they also make up most of the errors we're seeing in the logs at the moment.

@volllly
Contributor

volllly commented Jun 20, 2023

Hey, we haven't forgotten about this one either 😁 we just need to find the time to tackle it.

@gregzip
Author

gregzip commented Sep 12, 2023

Hey! Any news on this?

@mertcanyalhi

mertcanyalhi commented Sep 13, 2023

We've started to observe the same issue after upgrading from 1.3.44 (1.3-buster) to 1.3.46 (1.3-bullseye). Apart from the errors mentioned above, we're observing the following error:

fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing entity data (341, journalde)
      System.Net.Http.HttpRequestException: Response status code does not indicate success: 503 (Service Unavailable).
         at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.GetPointerAsync(Guid queueId, String entityValue)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntities(IPOS worker, Guid queueId, KeyValuePair`2 entity, Int32 uploadReadCount)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntityAsync(ClientConfiguration clientConfiguration, Guid queueId, KeyValuePair`2 entity, Int32 startRecordCount, Action`1 setRecordCount)

After digging into this issue further, we noticed that the requests against Helipad (https://helipad.fiskaltrust.cloud/) are failing (e.g. while performing HelipadService.GetPointerAsync or HelipadService.UploadDataRequestAsync, which use the HttpClient for communicating with Helipad).

When this issue occurs, our requests against the ByoDC middleware time out (our HTTP client timeout is set to 10 seconds for operational reasons); can we please ask for your input/help on this?
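(For context, the 10-second limit is nothing more exotic than the standard HttpClient timeout on our side, along these lines; the base address below is a placeholder:)

// Sketch of the client-side timeout applied to requests against the ByoDC middleware.
// The base address is a placeholder; the 10-second Timeout is the relevant part.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ByoDcClientSketch
{
    static async Task Main()
    {
        using var client = new HttpClient
        {
            BaseAddress = new Uri("http://byodc.internal.example:1500/"),
            Timeout = TimeSpan.FromSeconds(10) // calls taking longer than this surface as timeouts on our side
        };

        // Any request exceeding the timeout throws a TaskCanceledException.
        var response = await client.GetAsync("api/version");
        Console.WriteLine((int)response.StatusCode);
    }
}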

@volllly
Contributor

volllly commented Sep 29, 2023

Hey, can you check if these problems are fixed in the 1.3.49-bullseye/1.3.49-jammy releases?

@mertcanyalhi

Hey @volllly,

Timeouts

  • The majority of the timeouts (coupled with the 5xx errors) we observed in 1.3.46 were fixed, but our requests are still timing out randomly, though not as frequently as before.
  • Before we can provide any further information, we'll need to go through the load balancer logs and trace the Redis locks.

5xx responses in logs

  • We're still observing 5xx responses in the logs, but their time window does not match any of the timeouts we're observing.
  • Not 100% sure, but based on the logs and traces, it looks like Helipad is responding with 500/503:
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService[0]
      Failed to upload to api/service/queuede. Statuscode: InternalServerError; Response message: 
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing configuration data for queue 73257a4a-70a6-4f8d-89cd-581e82eae30d (UrlType: rest; Url: rest://localhost:1500/73257a4a-70a6-4f8d-89cd-581e82eae30d)
      fiskaltrust.Middleware.Helper.Helipad.Exceptions.HelipadException
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadDataRequestAsync(String url, Object item)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadeQueueDEListAsync(List`1 data)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadFromStreamAsync(Stream stream)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService[0]
      Failed to upload to api/service/signaturcreationunitde. Statuscode: ServiceUnavailable; Response message: 
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing configuration data for queue 83b65314-d793-49e4-a56f-634f7c650c8b (UrlType: rest; Url: rest://localhost:1500/83b65314-d793-49e4-a56f-634f7c650c8b)
      fiskaltrust.Middleware.Helper.Helipad.Exceptions.HelipadException
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadDataRequestAsync(String url, Object item)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.UploadSignatureCreationUnitDEListAsync(List`1 data)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadFromStreamAsync(Stream stream)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadConfigurationEntities(IPOS worker)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.ProcessUploadWith(ClientConfiguration clientConfiguration, Guid queueId)
fail: fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader[0]
      Error while pushing entity data (3, queueitem)
      System.Net.Http.HttpRequestException: Response status code does not indicate success: 503 (Service Unavailable).
         at System.Net.Http.HttpResponseMessage.EnsureSuccessStatusCode()
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadService.GetPointerAsync(Guid queueId, String entityValue)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntities(IPOS worker, Guid queueId, KeyValuePair`2 entity, Int32 uploadReadCount)
         at fiskaltrust.Middleware.Helper.Helipad.Services.HelipadUploader.UploadEntityAsync(ClientConfiguration clientConfiguration, Guid queueId, KeyValuePair`2 entity, Int32 startRecordCount, Action`1 setRecordCount)

@mertcanyalhi

Hey @volllly, we tried to find a correlation between the ByoDC request timeouts and the errors, and noticed the following:

  • In 50% of the cases, we noticed the following error in the exact same time window: Failed to upload to api/service/queue. Statuscode: InternalServerError
  • In the other 50% of the cases, we noticed no error/warning logs in the same time window. However, for these cases, we had observed the same error within the preceding 5 to 33 minutes.

At this point, we're not sure if the timeouts are caused by the Helipad uploads, but just wanted to provide input. We'll change the log verbosity to see if we can catch any further logs.

@mertcanyalhi

Hey @volllly, we observed a similar behavior; once we restarted the pods, we did not have any request timeouts against ByoDC for ~3 days. After 3 days, we started to observe timeouts, and one of the pods restarted after it became unhealthy.

Before the pod restarted, there was a small spike in the memory usage (the chart given below shows the memory usage percentage where 100% corresponds to 1Gi):

[image]

After the pod restarted, the spikes in CPU usage dropped, even though the pod was receiving the exact same traffic for 13 out of 15 cashboxes:

[image]

The CPU usage shows an increasing number of spikes over time for the affected pod:

[image]

And there is an increase in CPU usage spikes for all pods, but it depends on the # of transactions processed by the cashboxes they currently hold:

[image]

It looks like the increase in average CPU usage can also be grouped into the following clusters:

  • 00:00-07:59 - After TSE-device TAR file downloads are triggered
  • 08:00-23:59 - After daily operations are initiated

[image]

We also observed an increase in memory usage over time, which may be an indicator of a leak 🤔:

[image]

Some of the spikes in these increases are caused by zero receipt requests (0x4445000000000002) with the TSE-device TAR file download request flag (0x0000000002000000):

[image]
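(For reference, and assuming the flag is simply OR-ed into the base case, as is usual for ftReceiptCase flags, these requests carry a combined receipt case like this:)

// How the TSE-device TAR file download flag combines with the DE zero receipt case.
// Assumption: the flag is bitwise OR-ed into ftReceiptCase; the values are taken from the text above.
using System;

class ReceiptCaseSketch
{
    const long ZeroReceiptDE = 0x4445000000000002;           // zero receipt request (DE)
    const long TseTarFileDownloadFlag = 0x0000000002000000;  // TSE-device TAR file download request flag

    static void Main()
    {
        long ftReceiptCase = ZeroReceiptDE | TseTarFileDownloadFlag;
        Console.WriteLine($"0x{ftReceiptCase:X16}"); // prints 0x4445000002000002
    }
}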

Attached below are the logs for the pod that was restarted:

byodc-7f84d56544-6k8fl.log

@volllly
Contributor

volllly commented Oct 11, 2023

Hey, sorry for not getting back to you earlier. Could you possibly provide us with a memory dump of a pod while it is experiencing the problems? (You can upload it here.)

FYI, here is a KBA on how to get a memory dump.

@volllly
Contributor

volllly commented Oct 13, 2023

@gregzip @mertcanyalhi we've released 1.3.50-rc1-jammy/1.3.50-rc1-bullseye which should fix some of the issues with timeouts and memory usage.

@volllly
Contributor

volllly commented Oct 13, 2023

Also we just deployed an update to the sandbox helipad that should fix the 500/503 upload errors.

@volllly volllly self-assigned this Oct 24, 2023
@mertcanyalhi

mertcanyalhi commented Oct 26, 2023

@volllly Unfortunately, 1.3.50-rc1-bullseye did not solve the issue for us. We're still observing timeouts, increasing memory usage, and InternalServerError/ServiceUnavailable responses.

[image]

We tried to get a memory dump multiple times, but unfortunately we hit the memory limit when taking a dump after the memory usage has increased. We were able to get a dump from the pod with the lowest memory usage and uploaded it, but I'm not sure if it'll help. We'll increase the memory limit once again and try to get a memory dump once we observe increasing memory usage.

@mertcanyalhi

@volllly Uploaded more memory dumps, captured from multiple pods over the past 3 days.
We're observing InternalServerError responses from Helipad in the logs, and also increased latency for requests against ByoDC when Helipad returns InternalServerError responses.

@mertcanyalhi

@volllly The following graph shows the memory usage of a ByoDC pod; once the memory usage started to increase, we started to observe the following more frequently:

  • Increased latencies on all requests (e.g. GET /api/version)
  • MySQL connection errors
  • Helipad upload errors (not sure if this is due to an actual error as we're also frequently observing InternalServerError responses from Helipad)

[image]

@mertcanyalhi

mertcanyalhi commented Nov 13, 2023

@volllly We noticed that the # of SELECT queries grows with signature requests over time, and there is even a huge number of SELECT queries while the system is idle. When the pods are restarted, the # of SELECT queries drops to 0 and then starts to increase again once there is traffic.

The growing # of SELECT queries and the drops after pod restarts can be seen in the chart below:

[image]

Yesterday, while there was low traffic (85 signature requests in total from 3 cashboxes), we captured the SELECT queries (orange line on the chart above).

The following breakdown shows the # of SELECT queries in a 2-minute window when there was only a single signature request:

SELECT Queries: 3903
 |_ `select * from ftCashBox`: 279
 |_ `select * from ftQueue`: 280
 |_ `select * from ftQueueAT`: 279
 |_ `select * from ftQueueDE`: 280
 |_ `select * from ftQueueFR`: 279
 |_ `select * from ftQueueME`: 279
 |_ `select * from ftQueueItem WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 278
 |_ `select * from ftActionJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 278
 |_ `select * from ftReceiptJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 279
 |_ `select * from ftJournalDE WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 278
 |_ `select * from ftSignaturCreationUnitAT`: 278
 |_ `select * from ftSignaturCreationUnitDE`: 278
 |_ `select * from ftSignaturCreationUnitFR`: 278
 |_ `select * from ftSignaturCreationUnitME`: 278
 |_ `Select * from OpenTransaction where cbReceiptReference = {0};`: 2

The following breakdown shows the # of SELECT queries in a 2-minute window after the pods were restarted (6 workers were initialized after the restart):

SELECT Queries:
|_ `select * from ftCashBox`: 4
|_ `select * from ftQueue`: 4
|_ `select * from ftQueueAT`: 4
|_ `select * from ftQueueDE`: 4
|_ `select * from ftQueueFR`: 4
|_ `select * from ftQueueME`: 4
|_ `select * from ftQueueItem WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftActionJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftReceiptJournal WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftJournalDE WHERE TimeStamp >= {0} ORDER BY TimeStamp LIMIT {1}`: 4
|_ `select * from ftSignaturCreationUnitAT`: 4
|_ `select * from ftSignaturCreationUnitDE`: 4
|_ `select * from ftSignaturCreationUnitFR`: 4
|_ `select * from ftSignaturCreationUnitME`: 4
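(As a side note, one minimal way to watch this counter grow over time, though not necessarily how the breakdowns above were captured, is to poll MySQL's server-wide Com_select status variable, e.g.:)

// Sketch: poll MySQL's global SELECT counter to watch it grow while the middleware is idle.
// Uses the MySqlConnector package; the connection string and polling interval are placeholders.
// Note that Com_select is server-wide, so it includes queries from all clients, not only the middleware.
using System;
using System.Threading.Tasks;
using MySqlConnector;

class SelectCounterSketch
{
    static async Task Main()
    {
        await using var connection = new MySqlConnection("Server=localhost;Uid=monitor;Pwd=changeme;");
        await connection.OpenAsync();

        while (true)
        {
            await using var command = new MySqlCommand("SHOW GLOBAL STATUS LIKE 'Com_select'", connection);
            await using var reader = await command.ExecuteReaderAsync();
            if (await reader.ReadAsync())
                Console.WriteLine($"{DateTime.UtcNow:O} Com_select = {reader.GetString(1)}");

            await Task.Delay(TimeSpan.FromMinutes(2)); // same 2-minute window as the breakdowns above
        }
    }
}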

@volllly
Contributor

volllly commented Nov 22, 2023

Hey, we've pushed a fix to Helipad; you should not see 500 and 503 errors anymore.

@TSchmiedlechner
Member

Hey @mertcanyalhi, are you still experiencing this issue, or did the fix deployed in November solve the problem?

@mertcanyalhi

mertcanyalhi commented Jan 22, 2024

Hey @TSchmiedlechner

  • 500/503 errors - we observed such responses after November 22, but have not observed them in the past 7 days
  • Growing # of SELECT queries / growing memory usage - we are now restarting the ByoDC pods daily, but we can still see a daily increase pattern in the # of SELECT queries & memory usage

We're currently using 1.3.54-bullseye.

@gregzip
Author

gregzip commented Jan 23, 2024

We're still seeing heaps of errors like this every day:

fail: fiskaltrust.SignatureCloud.DE.Services.HelipadUploader[0]
      Error while pushing entity data (2, receiptjournal)
      System.Net.WebException: The remote server returned an error: (500) Internal Server Error.
         at System.Net.HttpWebRequest.EndGetResponse(IAsyncResult asyncResult)
         at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)

At some point I just stopped looking at them and decided to only act when clients are seriously affected, but it's something that'd be good to resolve - the very reason we monitor this is to avoid situations where a client has to call us 🙃

(we're at 1.3.55-rc1, also restarting pods daily so that they don't die one by one during the day)
