This repository has been archived by the owner on Sep 4, 2021. It is now read-only.

Cluster died after rebooting a node #3686

Closed

Alir3z4 opened this issue Nov 15, 2016 · 9 comments


Alir3z4 commented Nov 15, 2016

Hello,

I've installed a 1-node cluster on AWS (via the cloud installer).
While working with it, I got some errors while trying to deploy from a GitHub repo.
When I pressed the Launch button, I saw this error:

Error getting slugrunner image: controller: resource not found

I went to the AWS EC2 console and rebooted the instance from there; after that, the dashboard isn't running anymore.

I've followed other issues here, such as #2075 and similar ones, but none of them worked.

flynn-host ps result:

root@...:/home/ubuntu# flynn-host ps
ID                                              STATE    CREATED        CONTROLLER APP  CONTROLLER TYPE  ERROR
ip1003163-59bf627f-4519-43e0-a260-de3c4578ad88  running  7 minutes ago  postgres        postgres         
ip1003163-55de131e-fe88-4282-82b3-1c4e0985010c  running  9 minutes ago  flannel         app              
ip1003163-23e6afa3-3df5-45c9-90b7-4c6e40f6dc96  running  9 minutes ago  discoverd       app 

When I tried to scale up the dashboard from the CLI:

[alireza@arci]$ ./flynn-cli -a dashboard scale web=1
Get https://controller.2rv1.flynnhub.com/apps/dashboard/release: dial tcp xxx.xxx.xx:443: getsockopt: connection refused
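
For anyone hitting the same symptom: the connection refused error suggests the CLI is dialing an address where nothing is listening on 443 anymore. A quick, generic way to check whether the controller domain still points at the instance's current public IP (standard tools only; the domain below is just my cluster's, substitute your own):

# what does the CLI's controller domain currently resolve to?
dig +short controller.2rv1.flynnhub.com

# what is the instance's current public IP? (queried from the EC2 metadata service)
curl -s http://169.254.169.254/latest/meta-data/public-ipv4

# is anything actually listening on 443 at the resolved address?
nc -zv controller.2rv1.flynnhub.com 443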

Version:

[alireza@arch flynn-cli]$ ./flynn-cli version
v20161115.0

Debug logs: https://gist.github.com/40409e12e74dee695d14ccc9d1492f03


Alir3z4 commented Nov 15, 2016

@titanous suggested running flynn-host fix --min-hosts 1 --peer-ips 127.0.0.1 at #2075 (comment)

Results:

root@:/home/ubuntu# flynn-host fix --min-hosts 1 --peer-ips 127.0.0.1
INFO[11-15|18:15:58] found expected hosts                     n=1
INFO[11-15|18:15:58] ensuring discoverd is running on all hosts 
INFO[11-15|18:15:58] checking flannel 
INFO[11-15|18:15:58] flannel looks good 
INFO[11-15|18:15:58] waiting for discoverd to be available 
INFO[11-15|18:15:58] checking for running controller API 
INFO[11-15|18:15:58] checking status of sirenia databases 
INFO[11-15|18:15:58] checking for database state              db=postgres
INFO[11-15|18:15:58] checking sirenia cluster status          fn=CheckSirenia service=postgres
INFO[11-15|18:15:58] found running leader                     fn=CheckSirenia service=postgres
INFO[11-15|18:15:58] found running instances                  fn=CheckSirenia service=postgres count=1
INFO[11-15|18:15:58] getting sirenia status                   fn=CheckSirenia service=postgres
INFO[11-15|18:15:58] cluster claims to be read-write          fn=CheckSirenia service=postgres
INFO[11-15|18:15:58] checking for database state              db=mariadb
INFO[11-15|18:15:58] skipping recovery of db, no state in discoverd db=mariadb
INFO[11-15|18:15:58] checking for database state              db=mongodb
INFO[11-15|18:15:58] checking sirenia cluster status          fn=CheckSirenia service=mongodb
INFO[11-15|18:15:58] found running leader                     fn=CheckSirenia service=mongodb
INFO[11-15|18:15:58] found running instances                  fn=CheckSirenia service=mongodb count=1
INFO[11-15|18:15:58] getting sirenia status                   fn=CheckSirenia service=mongodb
INFO[11-15|18:15:58] cluster claims to be read-write          fn=CheckSirenia service=mongodb
INFO[11-15|18:15:58] checking for running controller API 
INFO[11-15|18:15:58] killing any running schedulers to prevent interference 
INFO[11-15|18:15:58] no controller web process running, getting release details from hosts 
INFO[11-15|18:15:58] starting controller web job              job.id=ip1003163-3fbc3084-9ead-497b-8e38-618a325a82cc release=c7359e8b-70ab-4f04-acc3-6c6670f2b0f6
INFO[11-15|18:15:58] waiting for job to start 
18:16:58.779286 host.go:157: discoverd: timed out waiting for instances

Running flynn-host fix --min-hosts=1 gave me the same results.
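
To dig further into why "waiting for job to start" times out, it may be worth checking whether the controller web job the fixer launched is actually running, and what it logged before dying. A rough sketch using flynn-host subcommands as I understand them (ps with -a for stopped jobs, and log by job ID; the ID is the one printed by the fix run above):

# list jobs, including stopped ones, to see what happened to the controller web job
flynn-host ps -a

# read that job's output (job ID copied from the fix output above)
flynn-host log ip1003163-3fbc3084-9ead-497b-8e38-618a325a82cc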


Alir3z4 commented Nov 15, 2016

I just found out that after restarting the instance on AWS EC2, its public IP changed, but flynn-cli is still trying to use the old IP.

Shouldn't IP changes already be handled by the VPC that Flynn creates on cluster initialization?
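
As a workaround (separate from whatever the installer is supposed to handle), one option would be to attach an Elastic IP to the instance so its public address survives restarts, and then make sure the cluster's DNS entry, or the controller URL the CLI stores locally (I believe in ~/.flynnrc, but treat that as an assumption), points at it. A sketch using the standard AWS CLI; the IDs below are placeholders:

# allocate an Elastic IP in the VPC
aws ec2 allocate-address --domain vpc

# associate it with the Flynn host instance so the public address no longer changes
aws ec2 associate-address --instance-id i-0123456789abcdef0 --allocation-id eipalloc-0123456789abcdef0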


Alir3z4 commented Nov 16, 2016

I just initialized a 3-node cluster on AWS, rebooted the machines via the console, and the same thing happened again.

The dashboard is not accessible, and scaling it gives a connection refused error. It seems Flynn dies after a simple restart.

The IPs of the instances are the same as before; nothing has changed since the fresh installation. I can't collect debug logs either; all the nodes give me the same error:

root@ip-10-0-0-130:/home/ubuntu# flynn-host collect-debug-info
INFO[11-16|13:20:34] uploading logs and debug information to a private, anonymous gist 
INFO[11-16|13:20:34] this may take a while depending on the size of your logs 
INFO[11-16|13:20:34] getting flynn-host logs 
INFO[11-16|13:20:34] getting sirenia metadata 
INFO[11-16|13:20:34] getting scheduler state 
EROR[11-16|13:20:34] error getting scheduler state            err="object_not_found: no leader found"
INFO[11-16|13:20:34] getting job logs 

I ran flynn-host fix --min-hosts 3, with this result:

ubuntu@ip-10-0-0-130:~$ flynn-host fix --min-hosts 3
INFO[11-16|13:13:12] found expected hosts                     n=3
INFO[11-16|13:13:12] ensuring discoverd is running on all hosts 
INFO[11-16|13:13:12] checking flannel 
INFO[11-16|13:13:12] flannel looks good 
INFO[11-16|13:13:12] waiting for discoverd to be available 
INFO[11-16|13:13:12] checking for running controller API 
INFO[11-16|13:13:12] checking status of sirenia databases 
INFO[11-16|13:13:12] checking for database state              db=postgres
INFO[11-16|13:13:12] checking sirenia cluster status          fn=CheckSirenia service=postgres
INFO[11-16|13:13:12] found running leader                     fn=CheckSirenia service=postgres
INFO[11-16|13:13:12] found running instances                  fn=CheckSirenia service=postgres count=2
INFO[11-16|13:13:12] getting sirenia status                   fn=CheckSirenia service=postgres
INFO[11-16|13:13:12] cluster claims to be read-write          fn=CheckSirenia service=postgres
INFO[11-16|13:13:12] checking for database state              db=mariadb
INFO[11-16|13:13:12] skipping recovery of db, no state in discoverd db=mariadb
INFO[11-16|13:13:12] checking for database state              db=mongodb
INFO[11-16|13:13:12] checking sirenia cluster status          fn=CheckSirenia service=mongodb
INFO[11-16|13:13:12] no running leader                        fn=CheckSirenia service=mongodb
INFO[11-16|13:13:12] found running instances                  fn=CheckSirenia service=mongodb count=0
INFO[11-16|13:13:12] getting sirenia status                   fn=CheckSirenia service=mongodb
INFO[11-16|13:13:12] killing any running schedulers to prevent interference 
INFO[11-16|13:13:12] getting service metadata                 fn=FixSirenia service=mongodb
INFO[11-16|13:13:12] getting primary job info                 fn=FixSirenia service=mongodb job.id=ip100268-e1451985-a4f2-4c61-925e-1dea329881b2
INFO[11-16|13:13:12] getting sync job info                    fn=FixSirenia service=mongodb job.id=ip1004182-a8beaac1-c540-49d6-b89e-537f4c0470ca
INFO[11-16|13:13:12] terminating unassigned sirenia instances fn=FixSirenia service=mongodb
INFO[11-16|13:13:12] starting primary job                     fn=FixSirenia service=mongodb job.id=ip100268-b40310fa-a8c7-4d28-8392-60609577fb88
INFO[11-16|13:13:12] starting sync job                        fn=FixSirenia service=mongodb job.id=ip1004182-301ba978-37b6-478e-a42c-0ea388d91f50
INFO[11-16|13:13:12] waiting for instance to start            fn=FixSirenia service=mongodb job.id=ip100268-b40310fa-a8c7-4d28-8392-60609577fb88
INFO[11-16|13:13:13] waiting for cluster to come up read-write fn=FixSirenia service=mongodb addr=100.100.7.2:27017
13:18:13.278015 host.go:157: timeout waiting for expected status
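
Since the repair stalls at "waiting for cluster to come up read-write" for mongodb, one thing worth checking is whether the restarted mongodb jobs actually registered with discoverd. The sketch below assumes discoverd still exposes its HTTP API on port 1111 with /services/<name>/instances and /services/<name>/leader endpoints; I haven't verified that against this version, so treat the port and paths as assumptions:

# ask discoverd which mongodb instances it knows about (port and endpoint are assumptions)
curl -s http://127.0.0.1:1111/services/mongodb/instances

# and whether it currently has a leader registered
curl -s http://127.0.0.1:1111/services/mongodb/leader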

Here are also the last 100 lines of flynn-host.log:

root@ip-10-0-0-130:/home/ubuntu# tail -100 /var/log/flynn/flynn-host.log  
t=2016-11-16T13:20:35+0000 lvl=info msg=attaching app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:20:35+0000 lvl=info msg="sucessfully attached" app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:20:35+0000 lvl=info msg=finished app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:20:35+0000 lvl=info msg="request completed" component=host req_id=7915c235-a0bc-4665-adbd-20d4a41fed1d status=101 duration=16.041734ms
t=2016-11-16T13:20:35+0000 lvl=info msg="request started" component=host req_id=2bbb78c5-656d-4fd5-aed1-24593a5b7211 method=POST path=/attach client_ip=10.0.0.130
t=2016-11-16T13:20:35+0000 lvl=info msg=starting app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:20:35+0000 lvl=info msg=attaching app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:20:35+0000 lvl=info msg="sucessfully attached" app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:20:35+0000 lvl=info msg=finished app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:20:35+0000 lvl=info msg="request completed" component=host req_id=2bbb78c5-656d-4fd5-aed1-24593a5b7211 status=101 duration=3.740327ms
t=2016-11-16T13:20:35+0000 lvl=info msg="request started" component=host req_id=eceb6ab6-cdd8-4c06-8da5-82d97f763c5b method=POST path=/attach client_ip=10.0.0.130
t=2016-11-16T13:20:35+0000 lvl=info msg=starting app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:20:35+0000 lvl=info msg=attaching app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:20:35+0000 lvl=info msg="sucessfully attached" app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:20:35+0000 lvl=info msg=finished app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:20:35+0000 lvl=info msg="request completed" component=host req_id=eceb6ab6-cdd8-4c06-8da5-82d97f763c5b status=101 duration=799.573µs
t=2016-11-16T13:20:51+0000 lvl=info msg="request started" component=host req_id=c7c43efe-21c8-43c4-a99d-a61c5513030e method=GET path=/host/jobs client_ip=10.0.4.182
t=2016-11-16T13:20:51+0000 lvl=info msg="request completed" component=host req_id=c7c43efe-21c8-43c4-a99d-a61c5513030e status=200 duration=2.070395ms
t=2016-11-16T13:20:51+0000 lvl=info msg="request started" component=host req_id=2b6ddb84-ebf3-44a5-b3dd-c737fe4cfb48 method=POST path=/attach client_ip=10.0.4.182
t=2016-11-16T13:20:51+0000 lvl=info msg=starting app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:20:51+0000 lvl=info msg=attaching app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:20:51+0000 lvl=info msg="sucessfully attached" app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:20:51+0000 lvl=info msg=finished app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:20:51+0000 lvl=info msg="request completed" component=host req_id=2b6ddb84-ebf3-44a5-b3dd-c737fe4cfb48 status=101 duration=10.004283ms
t=2016-11-16T13:20:51+0000 lvl=info msg="request started" component=host req_id=92b53082-39b6-4c6e-a393-0df3420bd067 method=POST path=/attach client_ip=10.0.4.182
t=2016-11-16T13:20:51+0000 lvl=info msg=starting app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:20:51+0000 lvl=info msg=attaching app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:20:51+0000 lvl=info msg="sucessfully attached" app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:20:51+0000 lvl=info msg=finished app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:20:51+0000 lvl=info msg="request completed" component=host req_id=92b53082-39b6-4c6e-a393-0df3420bd067 status=101 duration=2.816599ms
t=2016-11-16T13:20:51+0000 lvl=info msg="request started" component=host req_id=448c03b2-66e4-4e7f-b30c-c28dd46d0ee4 method=POST path=/attach client_ip=10.0.4.182
t=2016-11-16T13:20:51+0000 lvl=info msg=starting app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:20:51+0000 lvl=info msg=attaching app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:20:51+0000 lvl=info msg="sucessfully attached" app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:20:51+0000 lvl=info msg=finished app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:20:51+0000 lvl=info msg="request completed" component=host req_id=448c03b2-66e4-4e7f-b30c-c28dd46d0ee4 status=101 duration=1.947717ms
t=2016-11-16T13:21:23+0000 lvl=eror msg="error repairing cluster" component=cluster-monitor fn=checkCluster err="discoverd: timed out waiting for instances"
t=2016-11-16T13:21:23+0000 lvl=eror msg="did not find any controller api instances" component=cluster-monitor fn=checkCluster
t=2016-11-16T13:21:23+0000 lvl=eror msg="scheduler is not up" component=cluster-monitor fn=checkCluster
t=2016-11-16T13:21:23+0000 lvl=eror msg="fault deadline reached" component=cluster-monitor fn=checkCluster
t=2016-11-16T13:21:23+0000 lvl=info msg="initiating cluster repair" component=cluster-monitor fn=repairCluster
t=2016-11-16T13:21:23+0000 lvl=info msg="killing any running schedulers to prevent interference" component=cluster-monitor fn=repairCluster
t=2016-11-16T13:21:23+0000 lvl=info msg="request started" component=host req_id=0743a3cd-e054-4b49-a253-0c22353eb32f method=GET path=/host/jobs client_ip=10.0.0.130
t=2016-11-16T13:21:23+0000 lvl=info msg="request completed" component=host req_id=0743a3cd-e054-4b49-a253-0c22353eb32f status=200 duration=1.117635ms
t=2016-11-16T13:21:23+0000 lvl=info msg="checking status of sirenia databases" component=cluster-monitor fn=repairCluster
t=2016-11-16T13:21:23+0000 lvl=info msg="checking for database state" component=cluster-monitor fn=repairCluster db=postgres
t=2016-11-16T13:21:23+0000 lvl=info msg="checking sirenia cluster status" component=cluster-monitor fn=repairCluster fn=CheckSirenia service=postgres
t=2016-11-16T13:21:23+0000 lvl=info msg="found running leader" component=cluster-monitor fn=repairCluster fn=CheckSirenia service=postgres
t=2016-11-16T13:21:23+0000 lvl=info msg="found running instances" component=cluster-monitor fn=repairCluster fn=CheckSirenia service=postgres count=2
t=2016-11-16T13:21:23+0000 lvl=info msg="getting sirenia status" component=cluster-monitor fn=repairCluster fn=CheckSirenia service=postgres
t=2016-11-16T13:21:23+0000 lvl=info msg="cluster claims to be read-write" component=cluster-monitor fn=repairCluster fn=CheckSirenia service=postgres
t=2016-11-16T13:21:23+0000 lvl=info msg="checking for database state" component=cluster-monitor fn=repairCluster db=mariadb
t=2016-11-16T13:21:23+0000 lvl=info msg="skipping recovery of db, no state in discoverd" component=cluster-monitor fn=repairCluster db=mariadb
t=2016-11-16T13:21:23+0000 lvl=info msg="no controller web process running, getting release details from hosts" component=cluster-monitor fn=repairCluster
t=2016-11-16T13:21:23+0000 lvl=info msg="request started" req_id=2aca8a7b-1bbc-44bc-b0ca-82fc7e73d4a9 component=host method=GET path=/host/jobs client_ip=10.0.0.130
t=2016-11-16T13:21:23+0000 lvl=info msg="request completed" req_id=2aca8a7b-1bbc-44bc-b0ca-82fc7e73d4a9 component=host status=200 duration=1.045199ms
t=2016-11-16T13:21:23+0000 lvl=info msg="starting controller web job" component=cluster-monitor fn=repairCluster job.id=ip1004182-0c54a7f7-d138-43cd-a8bc-eff1ee4920fc release=c19d8097-cc4e-4a2f-88c4-4a61acf8b8a9
t=2016-11-16T13:21:23+0000 lvl=info msg="waiting for job to start" component=cluster-monitor fn=repairCluster
t=2016-11-16T13:21:39+0000 lvl=info msg="request started" component=host req_id=652a75ac-18e7-45e0-b523-8726ef985af9 method=GET path=/host/jobs client_ip=10.0.2.68
t=2016-11-16T13:21:39+0000 lvl=info msg="request completed" component=host req_id=652a75ac-18e7-45e0-b523-8726ef985af9 status=200 duration=3.269771ms
t=2016-11-16T13:21:39+0000 lvl=info msg="request started" component=host req_id=71f87425-cf51-48d2-9009-e7dadc7f10c8 method=POST path=/attach client_ip=10.0.2.68
t=2016-11-16T13:21:39+0000 lvl=info msg=starting app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:21:39+0000 lvl=info msg=attaching app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:21:39+0000 lvl=info msg="sucessfully attached" app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:21:39+0000 lvl=info msg=finished app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-9dc574db-8bbf-40e3-b6a3-6fa9c8e55bab
t=2016-11-16T13:21:39+0000 lvl=info msg="request completed" component=host req_id=71f87425-cf51-48d2-9009-e7dadc7f10c8 status=101 duration=9.185237ms
t=2016-11-16T13:21:39+0000 lvl=info msg="request started" component=host req_id=6680f1f7-bad2-4107-a07e-c6b900430b90 method=POST path=/attach client_ip=10.0.2.68
t=2016-11-16T13:21:39+0000 lvl=info msg=starting app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:21:39+0000 lvl=info msg=attaching app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:21:39+0000 lvl=info msg="sucessfully attached" app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:21:39+0000 lvl=info msg=finished app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d4b567a2-02e8-4dd4-a3ca-c485311a29d2
t=2016-11-16T13:21:39+0000 lvl=info msg="request completed" component=host req_id=6680f1f7-bad2-4107-a07e-c6b900430b90 status=101 duration=4.855914ms
t=2016-11-16T13:21:39+0000 lvl=info msg="request started" component=host req_id=1163fbbb-ac24-4f2c-a681-415b21811fa3 method=POST path=/attach client_ip=10.0.2.68
t=2016-11-16T13:21:39+0000 lvl=info msg=starting app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:21:39+0000 lvl=info msg=attaching app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:21:39+0000 lvl=info msg="sucessfully attached" app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:21:39+0000 lvl=info msg=finished app=host pid=2751 host.id=ip1000130 fn=attach job.id=ip1000130-d8e13387-ce07-4f5b-b1e7-d866f2d26706
t=2016-11-16T13:21:39+0000 lvl=info msg="request completed" component=host req_id=1163fbbb-ac24-4f2c-a681-415b21811fa3 status=101 duration=694.225µs
t=2016-11-16T13:22:23+0000 lvl=eror msg="error repairing cluster" component=cluster-monitor fn=checkCluster err="discoverd: timed out waiting for instances"
t=2016-11-16T13:22:23+0000 lvl=eror msg="did not find any controller api instances" component=cluster-monitor fn=checkCluster
t=2016-11-16T13:22:23+0000 lvl=eror msg="scheduler is not up" component=cluster-monitor fn=checkCluster
t=2016-11-16T13:22:23+0000 lvl=eror msg="fault deadline reached" component=cluster-monitor fn=checkCluster
t=2016-11-16T13:22:23+0000 lvl=info msg="initiating cluster repair" component=cluster-monitor fn=repairCluster
t=2016-11-16T13:22:23+0000 lvl=info msg="killing any running schedulers to prevent interference" component=cluster-monitor fn=repairCluster
t=2016-11-16T13:22:23+0000 lvl=info msg="request started" component=host req_id=8988a859-c539-4b5f-8f99-4f88f040bbe0 method=GET path=/host/jobs client_ip=10.0.0.130
t=2016-11-16T13:22:23+0000 lvl=info msg="request completed" component=host req_id=8988a859-c539-4b5f-8f99-4f88f040bbe0 status=200 duration=1.183111ms
t=2016-11-16T13:22:23+0000 lvl=info msg="checking status of sirenia databases" component=cluster-monitor fn=repairCluster
t=2016-11-16T13:22:23+0000 lvl=info msg="checking for database state" component=cluster-monitor fn=repairCluster db=postgres
t=2016-11-16T13:22:23+0000 lvl=info msg="checking sirenia cluster status" component=cluster-monitor fn=repairCluster fn=CheckSirenia service=postgres
t=2016-11-16T13:22:23+0000 lvl=info msg="found running leader" component=cluster-monitor fn=repairCluster fn=CheckSirenia service=postgres
t=2016-11-16T13:22:23+0000 lvl=info msg="found running instances" component=cluster-monitor fn=repairCluster fn=CheckSirenia service=postgres count=2
t=2016-11-16T13:22:23+0000 lvl=info msg="getting sirenia status" component=cluster-monitor fn=repairCluster fn=CheckSirenia service=postgres
t=2016-11-16T13:22:23+0000 lvl=info msg="cluster claims to be read-write" component=cluster-monitor fn=repairCluster fn=CheckSirenia service=postgres
t=2016-11-16T13:22:23+0000 lvl=info msg="checking for database state" component=cluster-monitor fn=repairCluster db=mariadb
t=2016-11-16T13:22:23+0000 lvl=info msg="skipping recovery of db, no state in discoverd" component=cluster-monitor fn=repairCluster db=mariadb
t=2016-11-16T13:22:23+0000 lvl=info msg="no controller web process running, getting release details from hosts" component=cluster-monitor fn=repairCluster
t=2016-11-16T13:22:23+0000 lvl=info msg="request started" req_id=56a0fc7b-90a7-4a31-84e0-7dfb8dcf3938 component=host method=GET path=/host/jobs client_ip=10.0.0.130
t=2016-11-16T13:22:23+0000 lvl=info msg="request completed" req_id=56a0fc7b-90a7-4a31-84e0-7dfb8dcf3938 component=host status=200 duration=1.124783ms
t=2016-11-16T13:22:23+0000 lvl=info msg="starting controller web job" component=cluster-monitor fn=repairCluster job.id=ip1004182-c3d4ad69-288b-48f2-9332-b8988d4d6232 release=c19d8097-cc4e-4a2f-88c4-4a61acf8b8a9
t=2016-11-16T13:22:23+0000 lvl=info msg="waiting for job to start" component=cluster-monitor fn=repairCluster

Alir3z4 changed the title from "Rebooted AWS EC2 machine from AWS Console the cluster won't come up anymore" to "Cluster died after rebooting a node" on Nov 16, 2016
@alidavut

Same issue on DigitalOcean.

@bdevore17

Having the same issue with a manual install on Ubuntu 16.04. When I try to deploy a new app, it fails with:

ERROR: Error getting slugrunner image: controller: resource not found
exit status 1


Alir3z4 commented Nov 17, 2016

I tried a 3-node cluster again today and the same thing happened. I have no idea why, but it is reproducible.

The cluster becomes unreachable, both from the internet and from inside the cluster nodes, after a simple restart of the machine(s).


lmars commented Nov 18, 2016

@Alir3z4 apologies for the delay; I have proposed a fix for both issues in #3711.


Alir3z4 commented Nov 19, 2016

Thanks for taking care of this and for the quick fix, merge, and release.
I'll provision another cluster on Monday and see how it goes.

@lmars Is it possible to update the current cluster to the new release, or do I have to re-initialize it?
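
For reference, my understanding from the docs is that an in-place update is run as root on one of the cluster hosts, roughly like the sketch below; I'm treating the exact invocation as an assumption and will double-check before running it:

# run as root on one of the cluster hosts (my understanding of the in-place update flow)
sudo flynn-host update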


Alir3z4 commented Nov 19, 2016

@lmars I tried to update, but an error occurred. I've opened an issue for it at #3714.
