
Endpoint Instability #2535

Closed
jeysibel opened this issue Dec 10, 2018 · 142 comments

@jeysibel

jeysibel commented Dec 10, 2018

IMPORTANT UPDATE (29/07/19):

Since this issue was opened, there have been many reports with the same or similar reproduction steps. One thing is clear: there is no single root cause.

Through extensive testing across multiple operating systems, browsers, and deployment scenarios, we have confirmed the bugs below, which can lead to endpoint instability.

Confirmed bugs (This list will be updated as others are confirmed):

Current status:
Portainer v1.22.0 brought the fixes for #2624 and #2949 (available with the agent release 1.4.0), as well as the long-awaited open-sourcing of the Portainer Agent. Through open-sourcing, we hope to be increasingly transparent and to open the codebase to contributions from the community.

We are now focusing all efforts towards eliminating endpoint instability within Portainer, while being as transparent as possible.

We have also created the channel #fix2535 on our community Slack server as an easier alternative to discussion on this issue. You can join our Slack server here.

As we work on fixes for the multiple bugs causing this issue, we will post images containing the fixes for those willing to test. Any feedback from these fixes, or from your current deployments where you experience this issue, will be of immense help.

There may be bugs that we are unaware of and we want to make sure we cover them all.

---- ORIGINAL BUG REPORT ----

Bug description

Setting up a Portainer agent stack as described in the official documentation leads to an unstable endpoint: sometimes up, sometimes down, sometimes showing correct info, sometimes giving error messages inside Portainer.

Expected behavior
The agents should communicate flawlessly, with no errors or load problems, once they have joined the agent cluster.

Steps to reproduce the issue:

  1. Go to the shell and deploy a stack similar to this ("core" stack):

version: '3.3'

services:
  csi-portainer-agent:
    image: portainer/agent
    environment:
      AGENT_CLUSTER_ADDR: tasks.core_csi-portainer-agent
      AGENT_PORT: 9001  # I also tried with no AGENT_PORT set, with no success
      # LOG_LEVEL: debug
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /var/lib/docker/volumes:/var/lib/docker/volumes
#    ports: (after reading another GitHub issue thread I tried setting the port, with no luck)
#      - target: 9001
#        published: 9001
#        protocol: tcp
#        mode: host
    networks:
      redeBackbone:
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]
...
  2. Log in to Portainer and try to add a new endpoint

name: docker-agent
endpoint URL: tasks.core_csi-portainer-agent:9001 (the DNS resolution is OK, giving one entry for each agent; I also tried the IPVS address core_csi-portainer-agent:9001, which appears to be more resilient and works better than the official way, but is still very unstable when showing pages)
public IP: core_csi-portainer-agent (the DNS resolution is OK, giving the IPVS address)

  3. Go to the main page

screenshot_2018-11-29 portainer

The endpoint appears to be running and ready. Click on it, and the dashboard page appears empty with no data, although the endpoint is selected. Try to browse between containers, services, images, volumes, etc.: sometimes it works, but after a few commands the agents stop working and the agent logs start showing entries like this: ([ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout), or this ([INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received), or this (http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/images/json?all=0: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)).

screenshot_2018-11-29 portainer2
screenshot_2018-11-29 portainer3
screenshot_2018-11-29 portainer8

If I update the service, it runs for some time and then stops working once more. If I leave it intact and retry browsing the agent endpoint after some time, it works for a while before dying again.

  4. See error

screenshot_2018-11-29 portainer4

Technical details:
Swarm cluster with one manager and 2 workers

  • Portainer version: 1.19.2
  • Docker version (managed by Portainer): 18.06-ce
  • Platform (windows/linux): linux ubuntu 16.04
  • Command used to start Portainer : docker stack deploy -c core.yml core
  • Browser: any

Additional context
I've tried to do some troubleshooting: I put an Ubuntu container with some tools in the same overlay network as the stack (Portainer + agents). I can see that the internal DNS resolution is OK and I can telnet to the ports listed in the logs (9001 and 7946); see below:

- Agent logs

##### agent on manager node ip: 172.20.1.7

2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 92e9ebd0d3f4 172.20.1.7
2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 8db633a4b3a6 172.20.1.6
2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 6ac3cc1f045c 172.20.1.8
2018/11/30 14:46:58 [INFO] - Starting Portainer agent version 1.1.2 on 0.0.0.0:9001 (cluster mode: true)


##### agent on worker1 node ip: 172.20.1.8

2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 6ac3cc1f045c 172.20.1.8
2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 8db633a4b3a6 172.20.1.6
2018/11/30 14:46:58 [INFO] serf: EventMemberJoin: 92e9ebd0d3f4 172.20.1.7
2018/11/30 14:47:03 [INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
2018/11/30 14:47:08 [INFO] - Starting Portainer agent version 1.1.2 on 0.0.0.0:9001 (cluster mode: true)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/containers/json?all=1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/volumes: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/images/json?all=0: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:54 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:49:04 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:49:44 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:50:24 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:50:52 [INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
2018/11/30 14:51:32 [INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
2018/11/30 14:51:34 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:52:14 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:53:12 [INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
2018/11/30 14:54:30 [INFO] memberlist: Suspect 8db633a4b3a6 has failed, no acks received
2018/11/30 14:54:32 [WARN] memberlist: Refuting a suspect message (from: 8db633a4b3a6)
2018/11/30 14:54:48 [WARN] memberlist: Refuting a suspect message (from: 8db633a4b3a6)



##### agent on worker2 node ip: 172.20.1.6

2018/11/30 14:46:59 [INFO] serf: EventMemberJoin: 8db633a4b3a6 172.20.1.6
2018/11/30 14:46:59 [INFO] serf: EventMemberJoin: 92e9ebd0d3f4 172.20.1.7
2018/11/30 14:46:59 [INFO] serf: EventMemberJoin: 6ac3cc1f045c 172.20.1.8
2018/11/30 14:47:03 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:47:09 [INFO] - Starting Portainer agent version 1.1.2 on 0.0.0.0:9001 (cluster mode: true)
2018/11/30 14:50:35 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:50:53 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:51:32 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:51:45 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:52:25 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:53:12 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:54:05 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:54:30 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:54:33 [INFO] memberlist: Suspect 6ac3cc1f045c has failed, no acks received
2018/11/30 14:54:45 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:54:49 [INFO] memberlist: Suspect 6ac3cc1f045c has failed, no acks received
2018/11/30 14:55:25 [ERR] memberlist: Push/Pull with 6ac3cc1f045c failed: dial tcp 172.20.1.8:7946: i/o timeout
2018/11/30 14:55:29 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:55:31 [WARN] memberlist: Refuting a suspect message (from: 6ac3cc1f045c)
2018/11/30 14:55:58 [INFO] memberlist: Suspect 6ac3cc1f045c has failed, no acks received

- DNS resolution inside the overlay network

dig tasks.core_csi-portainer-agent

; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> tasks.core_csi-portainer-agent
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 35022
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;tasks.core_csi-portainer-agent.	IN	A

;; ANSWER SECTION:
tasks.core_csi-portainer-agent.	600 IN	A	172.20.1.204
tasks.core_csi-portainer-agent.	600 IN	A	172.20.1.206
tasks.core_csi-portainer-agent.	600 IN	A	172.20.1.203

;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Mon Dec 10 13:28:00 UTC 2018
;; MSG SIZE  rcvd: 186

root@3e49a820834c:/# dig core_csi-portainer-agent

; <<>> DiG 9.11.3-1ubuntu1.3-Ubuntu <<>> core_csi-portainer-agent
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34261
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;core_csi-portainer-agent.	IN	A

;; ANSWER SECTION:
core_csi-portainer-agent. 600	IN	A	172.20.1.62

;; Query time: 0 msec
;; SERVER: 127.0.0.11#53(127.0.0.11)
;; WHEN: Mon Dec 10 13:28:07 UTC 2018
;; MSG SIZE  rcvd: 82
=================================

- Network connectivity to services with a new stack deployed inside the same overlay network

root@f199dab2d510:/# telnet 172.20.1.214 7946
Trying 172.20.1.214...
Connected to 172.20.1.214.
Escape character is '^]'.

Connection closed by foreign host.
root@f199dab2d510:/# telnet 172.20.1.214 7946
Trying 172.20.1.214...
Connected to 172.20.1.214.
Escape character is '^]'.

Connection closed by foreign host.
root@f199dab2d510:/# nslookup tasks.core_csi-portainer-agent
Server:		127.0.0.11
Address:	127.0.0.11#53

Non-authoritative answer:
Name:	tasks.core_csi-portainer-agent
Address: 172.20.1.216
Name:	tasks.core_csi-portainer-agent
Address: 172.20.1.214
Name:	tasks.core_csi-portainer-agent
Address: 172.20.1.215

root@f199dab2d510:/# telnet 172.20.1.215 7946
Trying 172.20.1.215...
Connected to 172.20.1.215.
Escape character is '^]'.

Connection closed by foreign host.
root@f199dab2d510:/# telnet 172.20.1.216 7946
Trying 172.20.1.216...
Connected to 172.20.1.216.
Escape character is '^]'.

Connection closed by foreign host.
root@f199dab2d510:/# telnet 172.20.1.214 7946
Trying 172.20.1.214...
Connected to 172.20.1.214.
Escape character is '^]'.

Connection closed by foreign host.
root@f199dab2d510:/# nslookup core_csi-portainer-agent
Server:		127.0.0.11
Address:	127.0.0.11#53

Non-authoritative answer:
Name:	core_csi-portainer-agent
Address: 172.20.1.213

root@f199dab2d510:/# telnet 172.20.1.213 7946
Trying 172.20.1.213...
Connected to 172.20.1.213.
Escape character is '^]'.

Connection closed by foreign host.

### server replies on port 9001
root@f199dab2d510:/# curl -k https://172.20.1.214:9001/images/json?all=0
{"err":"Unable to verify Portainer signature"}
root@f199dab2d510:/# curl -k https://172.20.1.213:9001/images/json?all=0
{"err":"Unable to verify Portainer signature"}
root@f199dab2d510:/# curl -k https://172.20.1.215:9001/images/json?all=0
{"err":"Unable to verify Portainer signature"}
root@f199dab2d510:/# curl -k https://172.20.1.216:9001/images/json?all=0
{"err":"Unable to verify Portainer signature"}
root@f199dab2d510:/# 
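
For reference, a throwaway debug container like the one used above can be attached to the stack's overlay network roughly like this. This is only a sketch: the network name core_csi-portainer-agent's network is assumed to be named core_redeBackbone (the default <stack>_<network> naming), the network must have been created as attachable, and the image/tooling is just an example.

docker run -it --rm --network core_redeBackbone ubuntu:16.04 bash
# inside the container, install the tools used for the checks above
apt-get update && apt-get install -y dnsutils telnet curl
# then repeat the checks
dig tasks.core_csi-portainer-agent
telnet 172.20.1.214 7946
curl -k https://172.20.1.214:9001/images/json?all=0
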
@jeysibel changed the title from "Portainer Agent Unstable as Portainer Endpoint" to "Unstable Portainer Agent as Portainer Endpoint" on Dec 10, 2018
@deviantony
Member

Hi @jeysibel, is your Docker environment hosted inside a specific cloud provider infra? (AWS, GCP...)

@unglaublicherdude

unglaublicherdude commented Dec 14, 2018

We have a similar problem here. I set up a 2-node test cluster where I am testing Portainer 1.20.0 with agent version 1.2.0 under Docker 18.09.0.

Before the version updates we had the exact same problem as described by @jeysibel.

Our setup is 2 VMs (Ubuntu 18.04.1) as nodes (1 manager, 1 worker). Our stack is currently described like this:

version: '3.2'
services:
  portainer:
    image: portainer/portainer:1.20.0
    command: --no-analytics -H tcp://tasks.portainer-agent-internal:9001 --tlsskipverify
    networks:
      agent_network:
      traefik:
    volumes:
     - portainer_data:/data
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: [node.role == manager]

  portainer-agent-internal:
    image: portainer/agent:1.2.0
    environment:
      AGENT_CLUSTER_ADDR: tasks.portainer-agent-internal
    volumes:
     - /var/run/docker.sock:/var/run/docker.sock
     - /var/lib/docker/volumes:/var/lib/docker/volumes
    networks:
      agent_network:
    deploy:
      mode: global
      placement:
        constraints: [node.platform.os == linux]

networks:
  traefik:
    external:
      name: somename
  agent_network:
    driver: overlay
    attachable: true

volumes:
  portainer_data:

When I deploy this stack fresh (meaning I also had to remove the Portainer volume) and then log in to Portainer, the agent has the status UP.

But there is already a startup error:

management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:37:37 http error: endpoint snapshot error (endpoint=primary, URL=tcp://tasks.portainer-agent-internal:9001) (err=Error response from daemon: )

When I click on this endpoint, everything seems to work until I click on Volumes. It takes some time until an error occurs, and after this error the agent no longer works.

Portainer outputs these logs:

management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:21 http: proxy error: context canceled
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:24 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:24 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:25 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
management_portainer.1.hjbuwee3cz5l@host    | 2018/12/14 09:39:49 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

And the portainer-agent logs:

management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:37:14 [INFO] serf: EventMemberJoin: 66e637779153 10.0.107.7
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:37:14 [INFO] serf: EventMemberJoin: 2b53335926b2 10.0.107.6
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:37:14 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:37:37 http error: Unable to execute cluster operation (err=Get https://10.0.107.6:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:37:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:38:42 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:39:21 http error: Unable to execute cluster operation (err=Get https://10.0.107.7:9001/volumes?filters=%7B%22dangling%22:%5B%22false%22%5D%7D: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:37:12 [INFO] serf: EventMemberJoin: 2b53335926b2 10.0.107.6
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:37:12 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:37:14 [INFO] serf: EventMemberJoin: 66e637779153 10.0.107.7
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:37:27 http error: Missing request signature headers (err=Unauthorized) (code=403)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:37:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:38:42 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:38:42 http error: Unable to execute cluster operation (err=Get https://10.0.107.7:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:39:21 http error: Unable to execute cluster operation (err=Get https://10.0.107.7:9001/volumes?filters=%7B%22dangling%22:%5B%22true%22%5D%7D: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:39:21 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:42:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)
management_portainer-agent-internal.0.j2tucp6p2hg0@host2    | 2018/12/14 09:42:37 http error: Unable to execute cluster operation (err=Get https://10.0.107.6:9001/volumes: net/http: request canceled (Client.Timeout exceeded while awaiting headers)) (code=500)
management_portainer-agent-internal.0.uirdsgmjzzrd@host1    | 2018/12/14 09:42:37 http error: Unable to proxy the request via the Docker socket (err=context canceled) (code=500)

Once this state is reached, no restart (of the agent or Portainer) helps it recover. I have to remove the stack and the volume and redeploy everything. This works until someone clicks on the volumes again.

As additional info: the volume command is really slow on our Docker engine; it takes about 9 seconds. That is a problem we have to work on, but it should not break the Portainer agent the way it currently does.
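
(For reference, the slowness can be measured directly against the daemon; a rough sketch, assuming access to the Docker socket on the affected node:)

time docker volume ls
# or against the API socket directly
time curl -s --unix-socket /var/run/docker.sock http://localhost/volumes > /dev/null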

Argh. The volumes are also loaded on the dashboard, so it will break regardless.

@unglaublicherdude

I solved our issue, and I now think it is a different one. My problem was a volume plugin spec pointing to a socket that no longer existed because we had removed its daemon. After removing the spec, Portainer runs fine. But I still think this should not break the whole agent.
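
(Since this gets asked again further down the thread: these are not the commenter's exact steps, just a rough sketch of where such legacy plugin specs live, based on Docker's plugin discovery paths; the plugin name is hypothetical.)

# legacy plugin discovery locations checked by the daemon
ls /run/docker/plugins/ /etc/docker/plugins/ /usr/lib/docker/plugins/ 2>/dev/null
# a stale spec points at a socket that no longer exists, e.g.
cat /etc/docker/plugins/myvolumedriver.spec   # hypothetical plugin name
# removing the stale spec file (and restarting the daemon if needed) stops the broken lookups
rm /etc/docker/plugins/myvolumedriver.spec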

@jeysibel
Author

Hi @jeysibel, is your Docker environment hosted inside a specific cloud provider infra? (AWS, GCP...)

We're currently using a private cloud + Docker (18.06) installed on Ubuntu 16.04 VMs, all of them in the same network/VLAN; network traffic is OK between the hosts.

@deviantony deviantony added this to Need triage in Bug triage via automation Dec 16, 2018
@deviantony deviantony moved this from Need triage to Confirmation required - High priority in Bug triage Dec 16, 2018
@nksupport-protsko

nksupport-protsko commented Dec 29, 2018

Hi! We have a similar issue: the endpoint is marked as down. Restarting and updating don't help.
The same messages appear in the portainer-agent logs:

2018/12/29 16:57:50 [WARN] memberlist: Refuting a suspect message (from: 28e7b98c2624)
2018/12/29 16:58:01 [INFO] memberlist: Suspect b8b37729c657 has failed, no acks received
2018/12/29 16:58:09 [WARN] memberlist: Refuting a suspect message (from: b8b37729c657)
2018/12/29 16:58:14 [INFO] memberlist: Suspect d10dcd14da8d has failed, no acks received
2018/12/29 16:58:17 [INFO] serf: attempting reconnect to 8b3adfa511ac 10.0.0.53:7946
2018/12/29 16:58:20 [ERR] memberlist: Push/Pull with d10dcd14da8d failed: dial tcp 10.0.0.71:7946: i/o timeout
2018/12/29 16:58:20 [WARN] memberlist: Was able to connect to f97057df58c8 but other probes failed, network may be misconfigured
2018/12/29 16:58:31 [INFO] memberlist: Suspect 28e7b98c2624 has failed, no acks received

We have portainer in swarm mode deployed with this stackfile:
https://downloads.portainer.io/portainer-agent-stack.yml

@lifepeer

lifepeer commented Jan 7, 2019

Same issue and exactly the same stack file as above (straight from the documentation). An endpoint refresh through the web UI usually solves it very quickly, but it's quite unstable.

Agent logs:

portainer_agent.0.juz0j6hmsuav@node1    | 2019/01/07 22:44:12 [INFO] serf: EventMemberJoin: 53222c54e5be 10.0.6.6
portainer_agent.0.juz0j6hmsuav@node1    | 2019/01/07 22:44:14 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.juz0j6hmsuav@node1    | 2019/01/07 22:44:14 [INFO] serf: EventMemberJoin: c6141eafd580 10.0.6.8
portainer_agent.0.vb7p2rfcrt2b@node3    | 2019/01/07 22:44:14 [INFO] serf: EventMemberJoin: c6141eafd580 10.0.6.8
portainer_agent.0.vb7p2rfcrt2b@node3    | 2019/01/07 22:44:14 [INFO] serf: EventMemberJoin: 53222c54e5be 10.0.6.6
portainer_agent.0.vb7p2rfcrt2b@node3    | 2019/01/07 22:44:14 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.vb7p2rfcrt2b@node3    | 2019/01/07 22:47:55 http: TLS handshake error from 10.0.6.3:43078: EOF
portainer_agent.0.ivcqpesbpxzk@node2    | 2019/01/07 22:44:46 [INFO] serf: EventMemberJoin: 23ec18d0df73 10.0.6.7
portainer_agent.0.ivcqpesbpxzk@ node2    | 2019/01/07 22:45:06 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)

Portainer log:

portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:44:08 Templates already registered inside the database. Skipping template import.
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:44:08 Instance already has defined endpoints. Skipping the endpoint defined via CLI.
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:44:08 Starting Portainer 1.20.0 on :9000
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:46:34 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:46:34 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:46:34 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:46:39 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:49:38 background schedule error (endpoint snapshot). Unable to create snapshot (endpoint=node1, URL=tcp://tasks.agent:9001) (err=Cannot connect to the Docker daemon at tcp://tasks.agent:9001. Is the docker daemon running?)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:49:47 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:49:47 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.dy7ex5nxmk0i@node1    | 2019/01/07 22:49:47 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

etc.

@Jacq

Jacq commented Jan 9, 2019

Similar issue here: sometimes Portainer cannot connect to the swarm even though the swarm/containers are OK.
Portainer reports that the endpoint is down; after a while, without touching the Portainer UI, the endpoint comes back online.
This happened very rarely with the previous version, but with the latest 1.2.0 it is very common, happening almost every time I've been connected to the Portainer UI for a few minutes.

Agent logs:

portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:27 [INFO] serf: EventMemberJoin: 72652f77b056 10.10.0.4
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: f39c9df400a5 10.10.0.6
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: 236583c83f6e 10.10.0.7
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: f6fd914271eb 10.10.0.5
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: b34307a5c7d2 10.10.0.3
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 12:20:28 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 15:27:40 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:50:45 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:50:50 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:25 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:31 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:35 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:47 http error: An error occured during websocket exec operation (err=websocket: close 1000 (normal): websocket: close 1005 (no status)) (code=500)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:47 http: response.WriteHeader on hijacked connection
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:47 http: response.Write on hijacked connection
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/07 17:51:50 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/09 07:02:59 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.sj6e16xf8enx@docker1    | 2019/01/09 08:47:19 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: 236583c83f6e 10.10.0.7
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: f6fd914271eb 10.10.0.5
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: b34307a5c7d2 10.10.0.3
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: f39c9df400a5 10.10.0.6
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:26 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: 72652f77b056 10.10.0.4
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: b34307a5c7d2 10.10.0.3
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 15:27:10 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 17:51:26 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.jcpuy99ateeb@docker4    | 2019/01/07 17:51:51 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f6fd914271eb 10.10.0.5
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:25 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: b34307a5c7d2 10.10.0.3
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f39c9df400a5 10.10.0.6
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: 236583c83f6e 10.10.0.7
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: 72652f77b056 10.10.0.4
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f6fd914271eb 10.10.0.5
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 15:27:35 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f39c9df400a5 10.10.0.6
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:25 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: 236583c83f6e 10.10.0.7
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: 72652f77b056 10.10.0.4
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 15:27:26 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 15:27:58 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 17:48:28 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/07 17:50:46 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/09 07:13:59 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.ssf0lzo2psfa@docker3    | 2019/01/09 07:50:59 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f39c9df400a5 10.10.0.6
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: f6fd914271eb 10.10.0.5
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 15:28:06 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:25 [INFO] serf: EventMemberJoin: b34307a5c7d2 10.10.0.3
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 17:48:28 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 17:50:38 http error: The agent was unable to contact any other agent (err=Unable to find the targeted agent) (code=500)
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 17:50:51 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/07 17:51:36 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/09 07:00:31 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:25 [INFO] - Starting Portainer agent version 1.2.0 on 0.0.0.0:9001 (cluster mode: true)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:26 [INFO] serf: EventMemberJoin: 236583c83f6e 10.10.0.7
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 12:20:28 [INFO] serf: EventMemberJoin: 72652f77b056 10.10.0.4
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:10 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.s03dl3hb8f36@docker5    | 2019/01/09 07:14:17 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:22 http error: An error occured during websocket exec operation (err=websocket: close 1000 (normal): websocket: close 1005 (no status)) (code=500)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:22 http: response.WriteHeader on hijacked connection
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:22 http: response.Write on hijacked connection
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:26 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:35 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:41 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:27:58 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 15:28:06 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/07 17:51:31 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 07:00:31 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 07:03:00 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 07:13:59 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 07:14:17 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 07:50:59 http error: Missing request signature headers (err=Unauthorized) (code=403)
portainer_agent.0.yj05ptv0q95o@docker2    | 2019/01/09 08:47:20 http error: Missing request signature headers (err=Unauthorized) (code=403)

Portainer log:

portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:21:44 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:21:44 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:48:41 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:48:41 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:48:41 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 12:48:41 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:27:10 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:27:26 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:27:35 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:27:41 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:27:58 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 15:28:06 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:48:28 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:50:38 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:50:46 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:50:51 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:51:22 websocketproxy: Error when copying from backend to client: websocket: close 1006 (abnormal closure): unexpected EOF
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:51:26 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:51:31 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:51:36 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/07 17:51:51 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/08 07:29:37 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/08 07:29:37 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/08 07:29:37 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/08 07:29:37 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/08 15:29:41 http error: Invalid JWT token (err=Invalid JWT token) (code=401)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 07:00:31 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 07:03:00 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 07:13:59 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 07:14:17 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 07:50:59 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 08:47:20 websocketproxy: couldn't dial to remote backend url websocket: bad handshake
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:46 http: proxy error: Docker container identifier not found
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:48 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:48 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:48 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:48 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:50 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:48:50 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:07 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:07 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:07 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:07 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:08 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:10 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:13 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:13 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:16 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:16 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)
portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:50:04 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

@Einstein42

+1

2 similar comments
@ghost

ghost commented Jan 13, 2019

+1

@PrathikGopal

+1

@deviantony deviantony added this to the 1.20.1 milestone Jan 13, 2019
@deviantony
Member

deviantony commented Jan 13, 2019

For those experiencing the "Endpoint is down" error, I think that most of your problems are related to #2556, which is under investigation.

portainer_portainer.1.pm4uc1ek1t5a@docker5    | 2019/01/09 18:49:19 http error: Unable to query endpoint (err=Endpoint is down) (code=503)

The preview image portainerci/portainer:fix2556-frequent-offline-mode contains a potential fix for this problem; I'd encourage you to test it.
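
(For those who deployed the stack above, one way to try that image is to point the existing service at it; the service name portainer_portainer is an assumption based on the stack file.)

docker service update --image portainerci/portainer:fix2556-frequent-offline-mode portainer_portainer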

Note that the OP's issue is not related to this one, as I suspect network issues inside the infrastructure/between the Swarm nodes.

2018/11/30 14:47:08 [INFO] - Starting Portainer agent version 1.1.2 on 0.0.0.0:9001 (cluster mode: true)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/containers/json?all=1: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/volumes: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:30 http error: Unable to execute cluster operation (err=Get https://172.20.1.6:9001/images/json?all=0: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)) (code=500)
2018/11/30 14:47:54 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout
2018/11/30 14:49:04 [ERR] memberlist: Push/Pull with 8db633a4b3a6 failed: dial tcp 172.20.1.6:7946: i/o timeout

@deviantony deviantony added this to To do in 1.20.1 Jan 14, 2019
@Z10yTap0k

+1

@radirad

radirad commented Jan 24, 2019

Hi
I had the same problem.
With agent version 1.1.2 everything works like a charm.
With 1.2.0 I get a timeout almost every time.
Portainer version 1.20.0, Docker 18.09.1.

@jeysibel
Author

jeysibel commented Jan 24, 2019

Due to another strange behaviour in one of our deployed apps, I had to take a deep dive into Docker Swarm networking. I discovered that, although the official "create a swarm" tutorial only states that each worker must be able to connect to the managers and vice versa (saying nothing about worker-to-worker traffic), each Docker node actually needs the Swarm ports opened towards the other n-1 nodes (via iptables rules).
When a container placed on worker node A tries to connect to a container placed on worker B, the traffic is routed directly between the two workers instead of through the managers. Once I opened the iptables ports between every Docker node (managers and workers), the Portainer agent status stopped being shown as down.

I still sometimes have problems when I try to read service logs or specific container logs, or when I try to exec a console from the docker-agent endpoint; if I do the same via a console on the Swarm manager (where Portainer is placed), everything is OK.

I suggest that the Portainer devs add a note or a troubleshooting section to the Portainer agent documentation page (https://portainer.readthedocs.io/en/stable/agent.html) stating that Portainer agents need to talk to each other, not just n agents to 1 Portainer instance; in other words, communication between Docker worker nodes must be possible, with the service ports open in both directions.

PS: I think this bug only happens when you apply firewall rules to the Docker machines; in our case the rules were too restrictive, due to the lack of precise information in the official Docker documentation (https://docs.docker.com/engine/swarm/swarm-tutorial/create-swarm/ - "The other nodes in the swarm must be able to access the manager at the IP address.").
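
(As an illustration of the above, these are the documented Swarm ports that must be reachable between every pair of nodes; a minimal iptables sketch, where 10.0.0.0/24 stands in for the node subnet. -A appends, so in a restrictive setup these rules must come before any blocking rule, e.g. use -I instead.)

# run on every node, allowing traffic from the other nodes
iptables -A INPUT -p tcp --dport 2377 -s 10.0.0.0/24 -j ACCEPT   # cluster management (manager nodes)
iptables -A INPUT -p tcp --dport 7946 -s 10.0.0.0/24 -j ACCEPT   # node-to-node gossip
iptables -A INPUT -p udp --dport 7946 -s 10.0.0.0/24 -j ACCEPT
iptables -A INPUT -p udp --dport 4789 -s 10.0.0.0/24 -j ACCEPT   # overlay network (VXLAN) data traffic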

@dang3r
Contributor

dang3r commented Jan 30, 2019

@deviantony Is the portainer/agent open source? I was unable to find it as part of the portainer organization.

I wanted to look at the internals and help contribute to a fix.

@deviantony
Member

Hi @dang3r

No, the agent is closed-source. Thanks for your willingness to help, though.

@deviantony deviantony modified the milestones: 1.20.1, next Jan 31, 2019
@deviantony deviantony removed this from To do in 1.20.1 Jan 31, 2019
@mback2k

mback2k commented Oct 14, 2019

@deviantony with regards to #2535 (comment): my Swarm networks are working fine and all required ports are accessible between the nodes.

BTW, I think there is a misunderstanding regarding published ports for the agent. I am not publishing the ports due to this issue. It would not make any sense to have published ports as a requirement: publishing ports is only needed to make them accessible from the outside (container) world. Inside an (overlay) network the containers can reach each other via any port, just like multiple machines on the same LAN.
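
(A quick standalone way to see this; the network, service, and image names are just examples.)

docker network create -d overlay --attachable testnet
docker service create --name web --network testnet nginx         # no published ports
docker run --rm --network testnet curlimages/curl -s http://web  # still reachable over the overlay
docker service rm web && docker network rm testnet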

@ghost

ghost commented Oct 15, 2019

Hi there everyone,
In case you missed it, Portainer version 1.22.1 is out and includes several bug fixes aimed at improving the stability of endpoints, particularly Agent-enabled endpoints.

Alongside this, versions 1.5.0 and 1.5.1 of the Agent are out and bring a lot of stability improvements.

If you are still experiencing instability on the latest version of Portainer & the Agent, feel free to reach out to us as we will happily walk through the issue with you.

Many thanks from the Portainer team, have a great day!

@mahmoudawadeen

mahmoudawadeen commented Oct 18, 2019

@mahmoudawadeen yes, there is definitely something wrong inside your Swarm environment regarding networking.

Here are my recommendations regarding Swarm setup, ensure that you have followed these steps when creating the Swarm cluster:

  • Make sure that ports 7946/tcp, 7946/udp and 4789/udp are open on all the nodes. For the manager node, also make sure that 2377/tcp is open.
  • Use the --advertise-addr option when creating the cluster via docker swarm init..., use either the private IP address or NIC name directly (--advertise-addr eth1 for example)
  • Use the --advertise-addr when joining a cluster on worker nodes via docker swarm join, same as above use either private IP or NIC name directly

Then deploy the Portainer stack
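
(For illustration, those two commands would look roughly like this, with eth1 standing in for the private NIC and the token/IP left as placeholders.)

# on the first manager
docker swarm init --advertise-addr eth1
# on each worker, using the token printed by the init command
docker swarm join --advertise-addr eth1 --token <worker-token> <manager-private-ip>:2377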

@deviantony This is already the case for us; this is how we set up our swarm. I will try to set up the swarm again and see if we missed something during the setup.

@baskinsy

I have upgraded my Portainer swarm installation to 1.22.1 and tried to drain a worker node and re-activate it. The agent was redeployed and started, but again I'm facing the semi-missing node issue in Portainer, and wrong values are displayed on the dashboard: "Nodes in the cluster" shows 5 when the Swarm menu shows all 6 of them, and the dashboard values look like they don't count the containers, volumes, etc. that are on the semi-missing node. Force-updating the agent service resolves the issue and brings things back to normal.

Same behavior with portainer 1.22.1 and agent 1.5.1

@RDLRA

RDLRA commented Oct 28, 2019

Same behavior with portainer 1.22.1 and agent 1.5.1

@ghost

ghost commented Nov 5, 2019

Hi there,
I was unable to reproduce this on a 3 node swarm running portainer 1.22.1 and agent 1.5.1. Hosts are running docker version 18.03.0-ce

I tested this as follows:

  1. Navigate to the cluster overview & set worker to drain mode
  2. Navigate to dashboard view & see 2 nodes in cluster + correct amount of resources are shown
  3. Navigate to cluster overview & set drained node to active in UI
  4. Navigate to dashboard view & see 3 nodes in cluster + correct amount of resources are shown

Let me know if there is anything you have done differently and I can try and reproduce this again
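
(The equivalent drain/re-activate cycle from the CLI, for anyone reproducing this outside the UI; the node name is just an example.)

docker node update --availability drain worker-1
# check the dashboard, then bring the node back
docker node update --availability active worker-1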

@mback2k

mback2k commented Nov 5, 2019

I can still reproduce issues as soon as one Docker node is under heavy load and requests to the Docker daemon take long or time out. As soon as the CPU load goes down on the node and the Docker CLI can reach the daemon again, the Agent is also able to reach it again and everything runs smoothly. As soon as a single Agent in the cluster is unable to reach its Docker daemon, the whole Portainer instance becomes either extremely slow or unresponsive.
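
(One way to check whether the local daemon itself is the slow part while the node is under load; /_ping is the Docker API health endpoint, reachable over the socket.)

time curl -s --unix-socket /var/run/docker.sock http://localhost/_ping
time docker info > /dev/null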

@RDLRA

RDLRA commented Nov 5, 2019

I can still reproduce issues as soon as one Docker node is under heavy load and requests to the Docker daemon take long or time out. As soon as the CPU load goes down on the node and the Docker CLI can reach the daemon again, the Agent is also able to reach it again and everything runs smoothly. As soon as a single Agent in the cluster is unable to reach its Docker daemon, the whole Portainer instance becomes either extremely slow or unresponsive.

So the problem is related to the load on the nodes at certain moments, which causes the request to time out and therefore not return the node and container information.
Correct?
So there isn't a solution?

@mback2k

mback2k commented Nov 5, 2019

So the problem is related to the load on the nodes at certain moments, which causes the request to time out and therefore not return the node and container information.
Correct?

Yes, the Agent and Portainer get stuck waiting for a response from the Docker API.

So there isn't a solution?

I think there is a need to handle such situations gracefully, for example:

  • Make the Agent time out after maybe 5 seconds.
  • If an Agent timed out, show either cached or partial results.
  • If only cached or partial results are shown, display a warning to the user in Portainer.

That would be much better than being completely unable to use Portainer in such a situation. At the moment a single stuck Agent prevents Portainer from showing any results.

@baskinsy

baskinsy commented Nov 5, 2019

Hi there,
I was unable to reproduce this on a 3 node swarm running portainer 1.22.1 and agent 1.5.1. Hosts are running docker version 18.03.0-ce

I tested this as follows:

1. Navigate to the cluster overview & set worker to drain mode

2. Navigate to dashboard view & see 2 nodes in cluster + correct amount of resources are shown

3. Navigate to cluster overview & set drained node to active in UI

4. Navigate to dashboard view & see 3 nodes in cluster + correct amount of resources are shown

Let me know if there is anything you have done differently and I can try and reproduce this again

Very strange... I have this issue on 3 different clusters, all with completely open ports (no firewalls) and Portainer installed with the same deployment file. The only difference is that they all have at least 6 nodes (3 managers plus extra workers), but it happens if I drain any of them. Several other things are working as expected, so I don't think it is a swarm setup or configuration issue. I don't know how to debug further.

@deviantony
Member

@baskinsy can you share the logs of the agent with us? We might be able to identify the cause of the problem from here.

@baskinsy

baskinsy commented Nov 7, 2019

@deviantony I think the issue only happens if you restart the Docker daemon or reboot the drained node. I'll check further as soon as I find the opportunity and report back with logs.

@amirakhan

I solved our issue, and I now think it is a different one. My problem was a volume plugin spec pointing to a socket that no longer existed because we had removed its daemon. After removing the spec, Portainer runs fine. But I still think this should not break the whole agent.

How did you remove it and resolve this issue? Could you provide the steps, please?

@ghost

ghost commented Apr 20, 2020

Activity on this issue seems to have slowed down after all the changes we made to make Portainer and the agent more stable. As such, I have moved the discussion to the agent repo, since the problem that continues to be reported now is that the agent is not resilient, so the agent should be made more resilient.

If new issues arise with endpoints in Portainer being unstable, we can re-open this issue. Otherwise, please report issues with the agent here.

@ghost ghost closed this as completed Apr 20, 2020
Bug triage automation moved this from Confirmed to Fixed Apr 20, 2020
@ghost ghost unpinned this issue Apr 20, 2020
@alphaDev23

Thank you for your efforts on this issue.

@cecchisandrone

Any update on this?

@ghost

ghost commented Jun 13, 2020

@cecchisandrone there's an update in my previous comment. Are you still experiencing endpoint instability?

@cecchisandrone

It seems I solved it by upgrading the Docker version.

@mmm8955405

I'm using the latest version now, but the problem still arises. Adding the endpoint succeeded, but access failed.

@sepidre

sepidre commented Jun 5, 2022

Same here - any idea when this will be solved?
Can this be supported somehow? :-)

@alidehghan

I confirm the issue with the _ping endpoint still exists (after updating Docker and Portainer). The Portainer UI stops working once in a while (without any change being made). Inspecting network calls from the browser, I noticed that the endpoint api/endpoints/1/docker/_ping returns 403 with the following error: {"message":"Invalid request signature","details":"Unauthorized"}. I have been watching this issue for a year and nothing works except docker update portainer_agent --force. Please fix this long-standing issue after all these years!

This issue was closed.