
Discovery API fails silently after high traffic (200 req / second) #1258

Open · bmitchinson opened this issue Jun 21, 2021 · 5 comments

Labels: bug, dev, tech debt, UrbanOS

Comments

@bmitchinson (Member)

Describe the bug
When stress testing discovery-api, if the service becomes overwhelmed, its Presto connection breaks and does not automatically recover. The API becomes non-functional and does not know to restart itself.

To Reproduce
Steps to reproduce the behavior:

  1. Download JMeter for stress testing
  2. Send GET requests to the dev discovery-api endpoint /api/v1/organization/parkmobile/dataset/parking_meter_transactions_2020/query with 200 concurrent users. (This may take multiple attempts or a slight bump above 200; a minimal scripted equivalent is sketched after the error output below.)
  3. Notice that the dev discovery-ui fails to load any SQL terminals and reports that it could not reach the API.
  4. Confirm that the discovery-api pod errors when attempting to connect to Presto, with the error below appearing in the pod logs.
[error] POST http://kdp-kubernetes-data-platform-presto.kdp:8080/v1/statement -> error: :checkout_timeout (80
[error] #PID<0.23441.1> running DiscoveryApiWeb.Endpoint (connection #PID<0.4612.0>, stream id 11) terminated
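
For anyone reproducing this without JMeter, here is a minimal load-generation sketch. It is an illustrative stand-in for the JMeter plan, not anything that ships with discovery-api: the hostname is a placeholder (the real dev URL is not shown in this issue), the path matches step 2, and it uses only OTP's built-in `:httpc` client.

```elixir
# Illustrative load-generation sketch (not part of discovery-api). The host is
# a placeholder for the dev discovery-api URL; the path matches step 2 above.
Application.ensure_all_started(:inets)

url =
  ~c"http://discovery-api.dev.example.com/api/v1/organization/parkmobile/dataset/parking_meter_transactions_2020/query"

1..200
|> Task.async_stream(
  fn _ -> :httpc.request(:get, {url, []}, [timeout: 30_000], []) end,
  max_concurrency: 200,
  timeout: 60_000,
  on_timeout: :kill_task
)
|> Enum.frequencies_by(fn
  {:ok, {:ok, {{_http_vsn, status, _reason}, _headers, _body}}} -> status
  {:ok, {:error, reason}} -> {:request_error, reason}
  {:exit, :timeout} -> :task_timeout
end)
|> IO.inspect(label: "result breakdown for 200 concurrent requests")
```

If the failure reproduces, the `:checkout_timeout` error above should start appearing in the pod logs while this script is running.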


Expected behavior
If the API reaches the broken Presto checkout_timeout state, that's acceptable, but the state should at least be detected somehow, possibly via the health endpoint, so that the pod can be restored to a functional state.
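
One possible shape for that detection is sketched below: have the health endpoint run a trivial query through the same Prestige/Presto path, so a wedged connection surfaces as a 503 that a Kubernetes liveness probe could act on by recycling the pod. The module name, the `:presto` config key, and the exact Prestige calls are assumptions for illustration, not discovery-api's current implementation.

```elixir
defmodule DiscoveryApiWeb.Plugs.PrestoHealthCheck do
  @moduledoc """
  Sketch only: a health check that exercises the Presto connection so a stuck
  :checkout_timeout state shows up as an unhealthy response. The Prestige
  session options and the :presto config key are assumptions about how
  discovery-api is wired, not a copy of its actual code.
  """
  import Plug.Conn

  def init(opts), do: opts

  def call(conn, _opts) do
    case presto_reachable?() do
      :ok -> send_resp(conn, 200, "OK")
      {:error, reason} -> send_resp(conn, 503, "presto unreachable: #{inspect(reason)}")
    end
  end

  defp presto_reachable? do
    session = Prestige.new_session(Application.get_env(:discovery_api, :presto))

    case Prestige.query(session, "SELECT 1") do
      {:ok, _result} -> :ok
      {:error, reason} -> {:error, reason}
    end
  rescue
    error -> {:error, error}
  catch
    :exit, reason -> {:error, reason}
  end
end
```

With a liveness probe pointed at that path, Kubernetes would restart the pod automatically instead of someone having to notice the broken state and restart it by hand.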

Additional context
Originally discovered by @bmitchinson and @christinepoydence

@bmitchinson added the bug label on Jun 21, 2021
@christinepoydence (Member)

As per @LtChae, this error is likely caused by the worker pool in Prestige not being large enough to accommodate the number of requests.
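
If that is the cause, the `:checkout_timeout` is the error hackney returns when every connection in its pool is busy and a caller gives up waiting for a checkout. A small diagnostic sketch is below; it assumes Prestige's HTTP requests go through hackney's `:default` pool, which is worth confirming against discovery-api's HTTP client configuration before relying on it.

```elixir
# Diagnostic sketch, assuming Prestige's HTTP requests go through hackney's
# :default connection pool -- confirm which pool discovery-api actually uses.

# How saturated is the pool? (look at in_use_count, free_count, queue_count)
IO.inspect(:hackney_pool.get_stats(:default), label: "hackney :default pool")

# Allow more concurrent Presto requests to hold a connection so callers stop
# timing out while waiting for a checkout.
:hackney_pool.set_max_connections(:default, 200)
```

Raising the pool size treats the symptom; the health-check detection above would still be needed so a wedged pod recovers on its own.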

@bmitchinson (Member, Author)

After several further attempts, we were unable to reproduce the silent failure described above.

The API would still crash, but not silently: a new API pod started within seconds, with minimal downtime.

@christinepoydence, are you alright with me closing this until it is reproduced again? We can reopen it at that time.

@bmitchinson (Member, Author)

When making 50 requests to the API, after scaling its deployment to 4 instances, we received the same error (shown below and originally mentioned above) on one of the four pods. The pod was not reported as unhealthy, so we had to find which one it was, restart it, and attempt the test again. On the second attempt the same thing occurred, so this now seems to be more consistently reproducible.

Is this the problem Tim described, where we need to scale the Prestige workers and the API pods are actually fine?

16:06:48.899 [error] POST http://kdp-kubernetes-data-platform-presto.kdp:8080/v1/statement -> error: :timeout (8009.072 ms)
16:06:48.899 [error] Error explaining statement: SELECT * FROM parkmobile__parking_meter_transactions_2020
ORDER BY paymentdate DESC
LIMIT 1
16:06:48.899 [error] %FunctionClauseError{args: nil, arity: 2, clauses: nil, function: :validate, kind: nil, module: Prestige.Client.RequestStream}

@LtChae (Contributor) commented Jun 24, 2021

From that error, it looks like it may be timing out while trying to get access to Presto. Try running the test with the Presto console open and watch whether Presto struggles under the load.

@christinepoydence (Member) commented Jun 24, 2021

@LtChae - we did this and didn't see any issues in the Presto console. Presto seemed to receive only 45 of the 50 requests we sent, but it handled those just fine.

@ksmith-accenture added the dev and On Hold labels and removed the On Hold label on Jul 9, 2021