Endpoint Instability #2535
Comments
Hi @jeysibel, is your Docker environment hosted inside a specific cloud provider infra? (AWS, GCP...) |
We have a similar problem here. I set up a 2-node test cluster where I am testing Portainer 1.20.0 with agent version 1.2.0 on Docker 18.09.0. Before the version updates we had exactly the same problem as described by @jeysibel. Our setup is 2 VMs (Ubuntu 18.04.1) as nodes (1 manager, 1 worker). Our stack is currently defined like this:

version: '3.2'
services:
  portainer:
    image: portainer/portainer:1.20.0
    command: --no-analytics -H tcp://tasks.portainer-agent-internal:9001 --tlsskipverify
    networks:
      agent_network:
      traefik:
    volumes:
      - portainer_data:/data
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]
  portainer-agent-internal:
    image: portainer/agent:1.2.0
    environment:
      AGENT_CLUSTER_ADDR: tasks.portainer-agent-internal
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      agent_network:
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]
networks:
  traefik:
    external:
      name: somename
  agent_network:
    driver: overlay
    attachable: true
volumes:
  portainer_data:

When I deploy this stack fresh (which means I also have to remove the Portainer volume) and then log in to Portainer, the agent has the status UP, but there is already a startup error:

management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:37:37 http error: endpoint snapshot error (endpoint=primary, URL=tcp://tasks.portainer-agent-internal:9001) (err=Error response from daemon: )

When I click on this endpoint, everything seems to work until I click on Volumes. It takes some time until an error occurs, and after this error the agent does not work anymore. Portainer outputs these logs:

management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:21 http: proxy error: context canceled
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:24 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:24 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

And the portainer-agent logs:

management_portainer-agent-internal.0.uirdsgmjzzrd@host1 | 2018/12/14 09:37:14 [INFO] serf: EventMemberJoin: 66e637779153 10.0.107.7
management_portainer-agent-internal.0.uirdsgmjzzrd@host1 | 2018/12/14 09:37:14 [INFO] serf: EventMemberJoin: 2b53335926b2 10.0.107.6
management_portainer-agent-internal.0.uirdsgmjzzrd@host1 | 2018/12/14 09:37:14 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1 | 2018/12/14 09:37:37 http error: Unable to execute cluster operation (err=Get https://10.0.107.6:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1 | 2018/12/14 09:37:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1 | 2018/12/14 09:38:42 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1 | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1 | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1 | 2018/12/14 09:39:21 http error: Unable to execute cluster operation (err=Get https://10.0.107.7:9001/volumes?filters=%7B%22dangling%22:%5B%22false%22%5D%7D: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:37:12 [INFO] serf: EventMemberJoin: 2b53335926b2 10.0.107.6
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:37:12 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:37:14 [INFO] serf: EventMemberJoin: 66e637779153 10.0.107.7
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:37:27 http error: Missing request signature headers (err=Unauthorized) (code=403)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:37:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:38:42 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:38:42 http error: Unable to execute cluster operation (err=Get https://10.0.107.7:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:39:21 http error: Unable to execute cluster operation (err=Get https://10.0.107.7:9001/volumes?filters=%7B%22dangling%22:%5B%22true%22%5D%7D: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:42:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2 | 2018/12/14 09:42:37 http error: Unable to execute cluster operation (err=Get https://10.0.107.6:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1 | 2018/12/14 09:42:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)

When this state is reached, no restart (of the agent or Portainer) helps with recovery; I have to remove the stack and the volume and redeploy everything. This works until someone clicks on Volumes again. As additional info: the volume command is actually really slow on our Docker engine; it takes about 9 seconds. That is a problem we have to work on, but it should not blow up the Portainer agent the way it currently does. Argh. And the volumes are also loaded on the dashboard, so it will break regardless. |
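Given the 9-second volume listing mentioned above, a quick way to confirm that the daemon itself is the bottleneck is to time the same calls the agent makes, directly against the local daemon on each node; a minimal diagnostic sketch, not from the original comment:

# time the plain volume listing that the dashboard triggers
time docker volume ls
# and the dangling-filtered variants visible in the agent logs above
time docker volume ls --filter dangling=true
time docker volume ls --filter dangling=false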
I solved our issue, and I now think it is a different one. My problem was a volume plugin spec pointing to a socket that did not exist anymore, because we had removed its daemon. After removing the specs, Portainer runs fine. But I still think this should not break the whole agent. |
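For anyone hitting the same thing, a sketch of how such a stale plugin spec can be located and removed, assuming Docker's plugin discovery directories and a hypothetical plugin name:

# .spec/.json files here point the daemon at plugin sockets; a dead
# socket behind one of these can hang every volume listing
ls /run/docker/plugins /etc/docker/plugins /usr/lib/docker/plugins 2>/dev/null
# remove the stale entry (myplugin.spec is a placeholder name), then
# restart the daemon so it stops probing the dead socket
sudo rm /etc/docker/plugins/myplugin.spec
sudo systemctl restart docker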
We're currently using Private Cloud + Docker machines (18.06) installed on ubuntu 16.04 VM's, all of them in the same network/vlan, network traffic is ok between hosts. |
Hi! We have a similar issue; the endpoint is marked as down. Restart and update don't help.
We have Portainer in swarm mode deployed with this stack file: |
Same issue and exactly the same stack file as above (straight from the documentation). An endpoint refresh through the web UI usually solves it very quickly, but it's quite unstable. Agent logs:
Portainer log:
etc. |
Similar issue here; sometimes Portainer cannot connect to the swarm even though the swarm/containers are fine. Agent logs:
Portainer log:
|
+1 |
+1 |
+1 |
For those experiencing the
The preview image
Note that OP's issue is not related to this issue, as I suspect network issues inside the infra/between the Swarm nodes.
|
+1 |
Hi |
Due to another strange behaviour in one of our deployed apps, I had to take a deep dive into Docker Swarm networking. I discovered that, although the official create-swarm tutorial only states that each worker must be able to connect to the managers and vice versa (it says nothing about worker-to-worker traffic), every Docker node needs its Docker ports opened (via iptables rules) to the n-1 other nodes; see the sketch below.

I still have occasional problems when I try to read service logs or specific container logs, or try to exec a console, through a docker-agent endpoint; if I do the same via a console on the swarm manager (where Portainer is placed), everything is OK.

I suggest that the Portainer devs put a note or a troubleshooting section in the Portainer agent documentation page (https://portainer.readthedocs.io/en/stable/agent.html) specifying that Portainer agents need to talk to each other, not only n agents to 1 Portainer instance. In other words, communication between Docker worker nodes must be possible, with the service ports opened on both sides.

PS: I think this bug only happens when you apply firewall rules to the Docker machines; in our case the rules were too restrictive, due to the lack of precise information in the official Docker documentation (https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/ - "The other nodes in the swarm must be able to access the manager at the IP address."). |
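A sketch of the node-to-node rules being described, using ufw and the ports Docker documents for Swarm, plus the agent port; the exact commands depend on your firewall, and the agent port only matters on the host if you publish it:

# run on every node, allowing traffic from every other node in the swarm
sudo ufw allow 2377/tcp   # Swarm cluster management (manager nodes)
sudo ufw allow 7946/tcp   # container network discovery, node-to-node
sudo ufw allow 7946/udp
sudo ufw allow 4789/udp   # overlay network (VXLAN) data plane
sudo ufw allow 9001/tcp   # Portainer agent, only if published on the host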
@deviantony Is the agent open-source? I wanted to look at the internals and help contribute to a fix. |
Hi @dang3r No, the agent is closed-source. Thanks for your will to help though. |
@deviantony with regard to #2535 (comment): my Swarm networks are working fine and all required ports are accessible between the nodes. BTW, I think there is a misunderstanding regarding published ports for the agent. I am not publishing the ports due to this issue; it would not make any sense to have published ports as a requirement. Publishing ports is only needed to make them accessible from the world outside the containers. Inside an (overlay) network the containers can reach each other via any port, just like multiple machines on the same LAN; see the demonstration below. |
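A quick way to see that behaviour in isolation (names are illustrative; run on a Swarm manager so an overlay network can be created):

# an attachable overlay lets plain containers join it
docker network create --driver overlay --attachable demo_net
docker run -d --name web --network demo_net nginx:alpine        # note: no -p flag
# port 80 is reachable inside the network even though nothing is published
docker run --rm --network demo_net alpine wget -qO- http://web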
Hi there everyone,
Alongside this, versions 1.5.0 and 1.5.1 of the Agent are out and bring a lot of stability improvements.
If you are still experiencing instability on the latest versions of Portainer and the Agent, feel free to reach out to us and we will happily walk through the issue with you. Many thanks from the Portainer team; have a great day! |
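For those upgrading in place, a minimal sketch; the Swarm service names (portainer_agent for the agent, portainer for the UI) are assumptions, not from the original comment:

# roll each service to the newer image
docker service update --image portainer/agent:1.5.1 portainer_agent
docker service update --image portainer/portainer:1.22.1 portainer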
@deviantony This is already the case for us; this is how we set up our swarm. I will try to set up the swarm from scratch again and see if we missed something during the setup. |
Same behavior with portainer 1.22.1 and agent 1.5.1 |
Hi there, I tested this as follows:
Let me know if there is anything you have done differently, and I can try to reproduce this again. |
I can still reproduce issues as soon as one Docker node is under heavy load and requests to the Docker daemon take a long time or time out. As soon as the CPU load on the node goes down and the Docker CLI can reach the daemon again, the Agent is also able to reach it and everything runs smoothly. As soon as a single Agent in the cluster is unable to reach its Docker daemon, the whole Portainer instance becomes either extremely slow or unresponsive. |
The problem therefore seems related to the load on the nodes at certain moments, which causes the request to time out and thus fail to return the information about the nodes and containers. |
Yes, the Agent and Portainer get stuck waiting for a response from the Docker API.
I think there is a need to handle such situations gracefully, for example:
That would be much better than being completely unable to use Portainer in such a situation. At the moment a single stuck Agent prevents Portainer from showing any results. |
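Until that kind of graceful handling exists, one way to find which node's daemon is the stuck one is to probe each daemon with a bounded timeout. A minimal sketch, assuming SSH access, hypothetical hostnames, and GNU coreutils timeout:

for host in node1 node2 node3; do
  printf '%s: ' "$host"
  # a healthy daemon answers well within 5s; a stuck one hits the bound
  ssh "$host" 'timeout 5 docker info --format "{{.ServerVersion}}" || echo "daemon slow or stuck"'
done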
Very strange... I have this issue on 3 different clusters, all with completely open ports (no firewalls) and Portainer installed with the same deployment file. The only difference is that they all have at least 6 nodes (3 managers plus extra workers), but it happens if I drain any of them. Several other things are working as expected, so I don't think it is a swarm setup or configuration issue. I don't know how to debug further. |
@baskinsy can you share the logs of the agent with us? We might be able to identify the cause of the problem from them. |
@deviantony I think the issue only happens if you restart the Docker daemon or reboot the drained node. I'll check further as soon as I find the opportunity and report back with logs. |
How did you remove it and resolve this issue? Could you provide the steps, please? |
Activity on this issue seems to have slowed down after all the changes we made to make Portainer and the agent more stable. As such, I have moved the discussion to the agent repo, since the problem that continues to be reported now is that the agent is not resilient, so the agent should be made more resilient. If new issues arise with endpoints in Portainer being unstable, we can re-open this issue. Otherwise, please report issues with the agent here. |
Thank you for your efforts on this issue. |
Any update on this? |
@cecchisandrone there's an update in my previous comment. Are you still experiencing endpoint instability? |
It seems I solved it by upgrading the Docker version. |
I'm using the latest version now, but the problem still arises. Adding the endpoint succeeds, but access fails. |
Same here - any idea when this will be solved? |
I confirm the issue with the _ping endpoint still exists (after updating Docker and Portainer). The Portainer UI stops working once in a while (without any change being made); inspecting the network calls from the browser, I noticed that the endpoint api/endpoints/1/docker/_ping returns 403 with the following error: {"message":"Invalid request signature","details":"Unauthorized"}. I have been watching this issue for a year and nothing works except docker update portainer_agent --force. Please fix this long-standing issue after all these years! |
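For reference, that force-restart workaround written out as a Swarm command: docker update on its own targets container resources, so the service form below is presumably what was meant (the service name portainer_agent is an assumption):

# force-redeploy every task of the agent service, even with no spec change
docker service update --force portainer_agent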
IMPORTANT UPDATE (29/07/19):
Since this issue was opened, there have been many reports with the same or similar reproduction steps. One thing is clear: there is no single root cause.
Through extensive testing across multiple OSes, browsers and deployment scenarios, we have confirmed the bugs below, which can lead to endpoint instability.
Confirmed bugs (this list will be updated as others are confirmed):
- #2624 "Failure: Unable to query endpoint": Portainer not resetting agent headers when switching state (Fixed)
- #2938 "Portainer unable to reach endpoint right after a node is unavailable": Agent takes a while to acknowledge that another agent is unavailable
- #2949 "Portainer and Agent have errors when Docker command takes longer than 10 seconds"

Current status:
Portainer v1.22.0 brought the fixes for #2624 and #2949 (available with agent release 1.4.0), as well as the long-awaited open-sourcing of the Portainer Agent. Through open-sourcing, we hope to be increasingly transparent and to open the codebase to contributions from the community.
We are now focusing all efforts towards eliminating endpoint instability within Portainer, while being as transparent as possible.
We have also created the channel #fix2535 on our community Slack server as an easier alternative to discussion on this issue. You can join our slack server here.
As we work on fixes for the multiple bugs causing this issue, we will post images containing the fixes for those willing to test. Any feedback on these fixes, or from your current deployments where you experience this issue, will be of immense help.
There may be bugs that we are unaware of and we want to make sure we cover them all.
---- ORIGINAL BUG REPORT ----
Bug description
Setting up a Portainer agent stack as described in the official documentation leads to an unstable endpoint: sometimes up, sometimes down, sometimes showing correct info, sometimes giving error messages inside Portainer.
Expected behavior
The agents should communicate flawlessly, with no errors or load problems, once they have joined the agent cluster.
Steps to reproduce the issue:
name: docker-agent
endpoint URL: tasks.core_csi-portainer-agent:9001 (the DNS resolution is OK, giving one entry for each agent). I also tried using the IPVS address core_csi-portainer-agent:9001; it appears to be more resilient and works better than the official way, but is still very unstable when showing pages.
public IP: core_csi-portainer-agent (the DNS resolution is OK, giving the IPVS address)
The endpoint appears to be running and ready. When I click on it, the dashboard page appears empty with no data, but the endpoint is selected. I then try to browse between containers, services, images, volumes, etc.; sometimes it works, but after a few commands the agents stop working and the agent logs start filling with entries like these:

[ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
[INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/images/json?all=0: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
If I update the service, it runs for some time and then stops working once more. If I leave it alone and retry browsing the agent endpoint after some time, it works for a while before dying again.
Technical details:
Swarm cluster with one manager and 2 workers
Additional context
I've tried to do some troubleshooting, so I put an Ubuntu container with some tools in the same overlay network as the stack (Portainer + agents). I can see that the internal DNS resolution is OK and I can telnet to the ports listed in the logs (9001 and 7946); see below, and the sketch after the list:
- Agent logs
- DNS resolution inside the overlay network
- Network connectivity to services with a new stack deployed inside the same overlay network
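For anyone repeating that check, a minimal sketch of the debug-container approach; the network name here is hypothetical (the overlay must be attachable, and docker network ls shows the real name):

# attach a throwaway container to the stack's overlay network
docker run --rm -it --network core_csi_agent_network ubuntu:18.04 bash
# inside the container:
apt-get update && apt-get install -y dnsutils telnet
nslookup tasks.core_csi-portainer-agent   # expect one A record per agent task
telnet 172.20.1.6 9001                    # agent API port
telnet 172.20.1.6 7946                    # memberlist gossip port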