
Instabilities with browsers under load (300-600 tests in parallel) #1

Open
AlexeyAltunin opened this issue Apr 18, 2020 · 9 comments

Comments

@AlexeyAltunin

Hi! I have been testing Callisto since last week.

Issue description: random container/browser freezes lead to hanging pods; this is reproduced when running a lot of tests in parallel.

There are 3 types of errors:

1. WebDriverError: Pod does not have an IP (not critical, happens very seldom).

2. 500 Internal Server Error returned by nginx:

<center><h1>500 Internal Server Error</h1></center>
<hr><center>nginx/1.17.2</center>
</body>

Fixed after increasing the resources for nginx (see the sketch further below).

3. The most critical one: it happens quite often, but randomly, and impacts pipeline stability. This log was found in the hanging browser pods:
[91:124:0417/171003.763223:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 376: Permission denied (13)
[91:124:0417/171004.767769:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 380: Permission denied (13)
[91:124:0417/171005.367275:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 384: Permission denied (13)
[91:124:0417/171005.594971:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 389: Permission denied (13)
[91:124:0417/171006.003322:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 393: Permission denied (13)
[91:124:0417/171006.581433:ERROR:zygote_host_impl_linux.cc(259)] Failed to adjust OOM score of renderer with pid 397: Permission denied (13)

I didn't find anything useful in the callisto pod logs.
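
As a sketch of the nginx fix mentioned in error 2: the request values below mirror the nginx section of the values.yaml further down, while the limit values are hypothetical and added only for illustration.

nginx:
  resources:
    requests:
      cpu: "2000m"       # matches the nginx requests in the values.yaml below
      memory: "1024Mi"
    limits:
      cpu: "4000m"       # hypothetical value, for illustration only
      memory: "2048Mi"   # hypothetical value, for illustration only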

Our configuration:

  1. 300-600 tests in parallel
  2. GCP GKE cluster
    Spec:
initial_node_count = 1

  autoscaling {
    min_node_count = 1
    max_node_count = 200
  }

  node_config {
    preemptible  = true
    machine_type = "n2-highcpu-8"
  3. Callisto setup (values.yaml):
# Unique ID of callisto instance
instanceID: 'unknown'

rbac:
  create: true

callisto:
...  
  replicas: 1
  resources:
    limits:
      cpu: "500m"
      memory: "512Mi"
    requests:
      cpu: "250m"
      memory: "128Mi"
  logLevel: "DEBUG"
  service:
    type: "LoadBalancer"
 
  browser:
    name: "chrome"
    chromeImage: "selenoid/chrome:81.0"
    resources:
      limits:
        cpu: "1000m"
        memory: "1024Mi"
      requests:
        cpu: "500m"
        memory: "512Mi"
...
    env:
    - name: TZ
      value: 'UTC'
    - name: ENABLE_VNC
      value: 'true'

nginx:
  image:
    registry:
    repository: nginx
    tag: '1.17.2-alpine'
    pullPolicy: Always

  prometheusExporter:
    image:
      registry:
      repository: nginx/nginx-prometheus-exporter
      tag: '0.4.0'
      pullPolicy: Always
  replicas: 2
  minReadySeconds: 15
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  resources:
    requests:
      cpu: "2000m"
      memory: "1024Mi"
  
...

We also tested Callisto with small suites (30-45 tests in parallel) and it works fine.
Did you face the same issue, or do you have any ideas how to fix it?

Thanks in advance!

@vigneshfourkites

Is the issue fixed, @AlexeyAltunin? Did a browser version update make things stable?

@srntqn
Member

srntqn commented Apr 6, 2021

We've discussed this issue over email, and there was an assumption that the cause is the small size of the cluster nodes: small nodes lead to frequent cluster autoscaling under load, which in turn leads to browser freezes and failures. But it was just an assumption and we didn't verify it. Maybe @AlexeyAltunin has some info.

Here at Wrike we use a 32 vCPU / 128 GB RAM node config and there are no such problems with the browsers.

@vigneshfourkites do you experience the same issue?

@vigneshfourkites

@srntqn No, we are in POC mode and are running fewer than 100 browsers. In the future we will definitely scale to more than 300, and the precautions discussed in this issue might give us some ideas for scaling up, so I asked the question. Thanks for the response!

@vigneshfourkites

@srntqn We are running a 32 GB machine with 50 parallel tests; containers are not destroyed properly and pod tainting is happening. What Kubernetes version are you using? Do you have any benchmark information? Our current machine config is 8 vCPU / 32 GB RAM.

@srntqn
Member

srntqn commented Jun 4, 2021

@vigneshfourkites did you check the callisto logs? Are there any errors?
Also, it could be helpful to check the logs of the Kubernetes API server and the kubelet.

> pod tainting is happening

Sorry, there is a chance that I'm misunderstanding this. Could you please provide more details? What do you mean here?

> What Kubernetes version are you using? Do you have any benchmark information?

We currently use Kubernetes 1.18.17; unfortunately, we have no benchmarks for this version. But there are no problems with pod creation/deletion, and the latency is okay.

@vigneshfourkites

vigneshfourkites commented Jun 5, 2021

@srntqn Yeah, I saw some ERROR logs in the Callisto pods:

2021-06-05 12:12:35,603 unknown ERROR >>> {"tid": "web-2b5f388e811b46d9882d15f45f00b045"}
Traceback (most recent call last):
  File "/app/callisto/libs/middleware.py", line 30, in error_middleware
    return await handler(request)
  File "/app/callisto/web/webdriver_logs.py", line 16, in webdriver_logs_handler
    async for line in await uc.get_logs_stream(pod_name=get_pod_name(request)):
  File "/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 39, in __anext__
    rv = await self.read_func()
  File "/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 328, in readline
    await self._wait('readline')
  File "/venv/lib/python3.7/site-packages/aiohttp/streams.py", line 296, in _wait
    await waiter
concurrent.futures._base.CancelledError

What is the root cause of this error? After the above error appears, delete/create requests happen but pods are not removed from or added to the cluster.

@vpokotilov
Contributor

@vigneshfourkites this particular error is related to displaying logs in Selenoid-UI, and not related to starting or stopping pods.
It would be helpful to get more logs. For example, you can enable debug logging by setting logLevel: "DEBUG" in values.yaml.
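
For reference, the relevant key as it appears in the values.yaml posted at the top of this issue:

callisto:
  ...
  logLevel: "DEBUG"   # enables verbose callisto logging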

@vigneshfourkites

vigneshfourkites commented Jun 7, 2021

It is already enabled as DEBUG. I only see the above-mentioned failures in the Callisto pod; other than that, no errors are logged. Do you restrict the browser CPU and memory utilisation internally anywhere? It seems like the CPU is constantly at 100% during execution. @vpokotilov

@srntqn
Member

srntqn commented Jun 10, 2021

@vigneshfourkites it looks like you have some problems with cluster performance. Maybe the reason is the load produced by your tests. Did you try decreasing the number of parallel sessions to see how it affects performance?

> Do you restrict the browser CPU and memory utilisation internally anywhere?

We use only k8s requests/limits.

resources:
  limits:
    cpu: 2500m
    memory: 2000Mi
  requests:
    cpu: 1
    memory: 500Mi
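
In the values.yaml posted at the top of this issue, these settings live under the browser section. A minimal sketch of the placement, using the values quoted above (the surrounding keys follow that values.yaml, so treat this as illustrative rather than an exact copy of our chart):

callisto:
  browser:
    resources:
      limits:
        cpu: "2500m"
        memory: "2000Mi"
      requests:
        cpu: "1"         # 1 full CPU core, equivalent to 1000m
        memory: "500Mi"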
