
Support issue with Kubernetes #101

Open
conan-o-chang opened this issue Jun 30, 2020 · 4 comments

Comments

@conan-o-chang

Expected Behaviour

When we send SIGTERM to fwatchdog, it should exit after fprocess has shut down gracefully.

Current Behaviour

When we send SIGTERM to fwatchdog, it keeps running even after fprocess has shut down gracefully (fprocess exit code is 0).
In the other case, when fprocess exits with an error (non-zero exit code), the watchdog logs Forked function has terminated: ... and kills itself.

Possible Solution

Exit fwatchdog itself once fprocess has exited. It seems related to https://github.com/openfaas-incubator/of-watchdog/blob/c796e1b714d703c90bf6e3f392471746e0aeab2d/executor/http_runner.go#L70-L76
One idea is to exit fwatchdog whether or not fprocess died with an error, but I am not sure whether that change would have any side effects.
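The idea above can be sketched in Go. This is a hypothetical stand-alone sketch of the proposed behaviour, not the of-watchdog implementation — the runFprocess helper and the sh stand-in command are illustrative only:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runFprocess starts a command standing in for fprocess, waits for
// it to finish, and returns its exit code.
func runFprocess(name string, args ...string) int {
	cmd := exec.Command(name, args...)
	if err := cmd.Run(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			return exitErr.ExitCode()
		}
		return 1 // command failed to start at all
	}
	return 0
}

func main() {
	// Exit the watchdog as soon as fprocess exits, whether or not it
	// failed -- rather than only logging and exiting on a non-zero code.
	code := runFprocess("sh", "-c", "exit 0")
	fmt.Printf("Forked function has terminated, exit code: %d\n", code)
	os.Exit(code)
}
```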

Steps to Reproduce (for bugs)

I am using the python3-flask template, and fprocess is gunicorn.

  1. Pull faas template: faas template store pull python3-flask
  2. Generate a new project named python3-flask-gracefully-shutdown: faas new python3-flask-gracefully-shutdown --lang python3-flask
  3. Add gunicorn to requirements.txt: echo 'gunicorn' > python3-flask-gracefully-shutdown/requirements.txt
  4. Build images locally: faas build -f python3-flask-gracefully-shutdown.yml
  5. Run container: docker run --rm -ti -e fprocess='gunicorn index:app' -p 8080:8080 --name python3-flask-gracefully-shutdown python3-flask-gracefully-shutdown
  6. Send SIGTERM to the container: docker kill --signal=SIGTERM python3-flask-gracefully-shutdown
  7. The log shows that the watchdog waits 10 seconds (the default timeout) to exit after receiving SIGTERM, but it should exit immediately:
Forking - gunicorn [index:app]
2020/06/30 05:50:27 Started logging stderr from function.
2020/06/30 05:50:27 Started logging stdout from function.
2020/06/30 05:50:27 OperationalMode: http
2020/06/30 05:50:27 Timeouts: read: 10s, write: 10s hard: 10s.
2020/06/30 05:50:27 Listening on port: 8080
2020/06/30 05:50:27 Writing lock-file to: /tmp/.lock
2020/06/30 05:50:27 Metrics listening on port: 8081
2020/06/30 05:50:27 stderr: [2020-06-30 05:50:27 +0000] [11] [INFO] Starting gunicorn 20.0.4
2020/06/30 05:50:27 stderr: [2020-06-30 05:50:27 +0000] [11] [INFO] Listening at: http://127.0.0.1:8000 (11)
2020/06/30 05:50:27 stderr: [2020-06-30 05:50:27 +0000] [11] [INFO] Using worker: sync
2020/06/30 05:50:27 stderr: [2020-06-30 05:50:27 +0000] [16] [INFO] Booting worker with pid: 16
2020/06/30 05:50:32 SIGTERM received.. shutting down server in 10s
2020/06/30 05:50:32 Removing lock-file : /tmp/.lock
2020/06/30 05:50:32 stderr: [2020-06-30 05:50:32 +0000] [11] [INFO] Handling signal: term
2020/06/30 05:50:32 stderr: [2020-06-30 05:50:32 +0000] [16] [INFO] Worker exiting (pid: 16)
2020/06/30 05:50:32 stderr: [2020-06-30 05:50:32 +0000] [11] [INFO] Shutting down: Master
2020/06/30 05:50:42 No new connections allowed. Exiting in: 10s

Context

I am using OpenFaaS for some long-running task with auto-scaling on Kubernetes. One day I found that there are a lot of Pods did not terminate correctly(status is Terminating), which means it was wasting resource.

Your Environment

  • Docker version (e.g. Docker 17.0.05):
    Docker version 19.03.8, build afacb8b
  • Are you using Docker Swarm or Kubernetes (FaaS-netes)?
    Kubernetes
  • Operating System and version (e.g. Linux, Windows, MacOS):
    MacOS 10.13.6
  • Link to your project or a code example to reproduce issue:
@alexellis
Member

What was your write_timeout value? Pods do not die instantly but are configured to stay for a grace period.

I'd like you to show how to reproduce this on Kubernetes, not with an arbitrary docker kill command as that's not necessarily the correct workflow and behaviour.

@alexellis
Member

/set title: Support issue with Kubernetes

@derek derek bot changed the title fwatchdog keep running after fprocess was gracefully shut down Support issue with Kubernetes Jun 30, 2020
@conan-o-chang
Author

@alexellis Thank you so much for your comment. The cause was that I had set the function's write_timeout too high. After setting write_timeout correctly, those Pods terminate as expected. Sorry for misunderstanding the meaning of write_timeout.

Here is my testing code: https://github.com/conan-o-chang/openfaas-long-running-task
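For reference, of-watchdog timeouts are set through the function's environment in the stack YAML. The values below are illustrative only, not recommendations — keep write_timeout close to the real request duration so SIGTERM shutdown is not delayed needlessly:

```yaml
functions:
  python3-flask-gracefully-shutdown:
    lang: python3-flask
    handler: ./python3-flask-gracefully-shutdown
    image: python3-flask-gracefully-shutdown:latest
    environment:
      # of-watchdog timeouts
      read_timeout: "30s"
      write_timeout: "30s"
      exec_timeout: "30s"
```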

@conan-o-chang
Author

Hi @alexellis , sorry to bother you again.
I found that if I set a high write_timeout on my function, the issue still happens: fwatchdog keeps running after fprocess has finished, which leaves Kubernetes Pods stuck in Terminating.

Steps to Reproduce (for bugs)

  1. The source code I am using: https://github.com/conan-o-chang/openfaas-long-running-task
  2. Pull template python3-flask: faas template store pull python3-flask
  3. faas up the function: faas up --skip-push -f python3-flask-long-running.yml
  4. Patch terminationGracePeriodSeconds: kubectl -n openfaas-fn patch deploy python3-flask-long-running -p "{\"spec\":{\"template\":{\"spec\":{\"terminationGracePeriodSeconds\": 600}}}}"
  5. Scale up the function: kubectl scale deploy/python3-flask-long-running --replicas=5 -n openfaas-fn
# We get 5 pods of the function.
Every 2.0s: kubectl get po -n openfaas-fn                             Mon Jul  6 02:05:20 2020
NAME                                          READY   STATUS    RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running   0          16s
python3-flask-long-running-5b6684b647-f445m   1/1     Running   0          11s
python3-flask-long-running-5b6684b647-jtvtl   1/1     Running   0          11s
python3-flask-long-running-5b6684b647-nprh5   1/1     Running   0          11s
python3-flask-long-running-5b6684b647-tk2mt   1/1     Running   0          11s
  6. Scale down the function: kubectl scale deploy/python3-flask-long-running --replicas=1 -n openfaas-fn
# 4 pods are terminating
Every 2.0s: kubectl get po -n openfaas-fn                             Mon Jul  6 02:05:35 2020
NAME                                          READY   STATUS        RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running       0          31s
python3-flask-long-running-5b6684b647-f445m   0/1     Terminating   0          26s
python3-flask-long-running-5b6684b647-jtvtl   0/1     Terminating   0          26s
python3-flask-long-running-5b6684b647-nprh5   1/1     Terminating   0          26s
python3-flask-long-running-5b6684b647-tk2mt   1/1     Terminating   0          26s
  7. Check the running pod: kubectl -n openfaas-fn exec -ti pods/python3-flask-long-running-5b6684b647-dv92b -- sh
# Both fwatchdog and gunicorn (fprocess) are running.
~ $ ps
PID   USER     TIME  COMMAND
    1 app       0:00 fwatchdog
   11 app       0:00 {gunicorn} /usr/local/bin/python /home/app/.local/bin/gunicorn index
   15 app       0:00 {gunicorn} /usr/local/bin/python /home/app/.local/bin/gunicorn index
   20 app       0:00 sh
   26 app       0:00 ps
  8. Check a terminating pod: kubectl -n openfaas-fn exec -ti pods/python3-flask-long-running-5b6684b647-f445m -- sh
# fwatchdog keeps running even though gunicorn has finished.
~ $ ps
PID   USER     TIME  COMMAND
    1 app       0:00 fwatchdog
   20 app       0:00 sh
   25 app       0:00 ps
  9. Finally, the Pods are gone after 10 minutes (equal to the function's write_timeout).
Every 2.0s: kubectl get po -n openfaas-fn                             Mon Jul  6 02:15:26 2020
NAME                                          READY   STATUS        RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running       0          10m
python3-flask-long-running-5b6684b647-f445m   0/1     Terminating   0          10m
python3-flask-long-running-5b6684b647-jtvtl   0/1     Terminating   0          10m
python3-flask-long-running-5b6684b647-nprh5   0/1     Terminating   0          10m
python3-flask-long-running-5b6684b647-tk2mt   0/1     Terminating   0          10m
Every 2.0s: kubectl get po -n openfaas-fn                             Mon Jul  6 02:15:50 2020
NAME                                          READY   STATUS    RESTARTS   AGE
python3-flask-long-running-5b6684b647-dv92b   1/1     Running   0          10m
  10. To find out what happens to the Pod, I deleted the running Pod while following its logs: kubectl logs -n openfaas-fn deploy/python3-flask-long-running -f
Forking - gunicorn [index:app -b 127.0.0.1:5000 --log-level debug --timeout=600 --graceful-timeout=600]
2020/07/05 18:05:05 Started logging stderr from function.
2020/07/05 18:05:05 Started logging stdout from function.
2020/07/05 18:05:05 OperationalMode: http
2020/07/05 18:05:05 Timeouts: read: 10m0s, write: 10m0s hard: 10m0s.
2020/07/05 18:05:05 Listening on port: 8080
2020/07/05 18:05:05 Writing lock-file to: /tmp/.lock
2020/07/05 18:05:05 Metrics listening on port: 8081
...
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [INFO] Starting gunicorn 20.0.4
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [DEBUG] Arbiter booted
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [INFO] Listening at: http://127.0.0.1:5000 (11)
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [INFO] Using worker: sync
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [15] [INFO] Booting worker with pid: 15
2020/07/05 18:05:05 stderr: [2020-07-05 18:05:05 +0000] [11] [DEBUG] 1 workers
2020/07/05 18:35:12 SIGTERM received.. shutting down server in 10m0s
2020/07/05 18:35:12 Removing lock-file : /tmp/.lock
2020/07/05 18:35:12 stderr: [2020-07-05 18:35:12 +0000] [11] [INFO] Handling signal: term
2020/07/05 18:35:12 stderr: [2020-07-05 18:35:12 +0000] [15] [INFO] Worker exiting (pid: 15)
2020/07/05 18:35:12 stderr: [2020-07-05 18:35:12 +0000] [11] [INFO] Shutting down: Master
2020/07/05 18:45:12 No new connections allowed. Exiting in: 10m0s
rpc error: code = Unknown desc = Error: No such container: ba335b88f0f7862682be186c20d78c0ac46500a3973b2186e93fd76cbba8a80d

@conan-o-chang conan-o-chang reopened this Jul 5, 2020