Clean up jenkins nodes which are not contactable over ssh #3486

Closed
Tracked by #3380
sxa opened this issue Mar 21, 2024 · 15 comments

Comments

@sxa
Member

sxa commented Mar 21, 2024

On an ssh failure, jenkins is trying to reconnect to machines about once every half hour. We should analyse the list and ensure we know why each is not contactable, and determine whether to remove it, or remediate it, or whether it is a known temporary outage. There are quite a few, particularly in the test-docker set, so I'm going to tag @Haroon-Khel on this one. This was identified through other work to clear up the jenkins system logs.

Machines which have been non-contactable over ssh by jenkins today
 36 build-alibaba-ubuntu1804-armv8-1
 36 build-alibaba-ubuntu1804-armv8-2
 57 build-spearhead-freebsd12-x64-1
 57 C3jenkins
  4 dockerhost-azure-ubuntu2204-x64-2
 13 dockerhost-marist-ubuntu2204-s390x-1
 56 dockerhost-skytap-ubuntu2204-x64-1
 57 test-alibaba-ubuntu1804-armv8-1
 36 test-alibaba-ubuntu1804-armv8-2
 36 test-aws-ubuntu2004-x64-1
  1 test-docker-alpine314-armv8-3
  1 test-docker-alpine314-x64-1
  1 test-docker-alpine314-x64-2
  1 test-docker-alpine317-x64-1
  1 test-docker-alpine317-x64-2
  1 test-docker-alpine319-armv8-1
 56 test-docker-alpine319-armv8-2
 51 test-docker-alpine319-armv8-3
 52 test-docker-alpine319-armv8-4
 36 test-docker-alpine319-x64-1
 36 test-docker-alpine319-x64-2
 55 test-docker-alpine319-x64-3
 36 test-docker-centos7-x64-1
  1 test-docker-centos8-armv8-1
  1 test-docker-centos8-x64-1
 62 test-docker-centos8-x64-2
  1 test-docker-debain12-armv8l-1
  1 test-docker-debian11-x64-1
  1 test-docker-debian11-x64-2
 36 test-docker-debian12-x64-1
 36 test-docker-debian12-x64-2
 21 test-docker-debian12-x64-3
  1 test-docker-fedora35-x64-1
 62 test-docker-fedora35-x64-2
  1 test-docker-fedora37-x64-1
  1 test-docker-fedora37-x64-2
  1 test-docker-fedora37-x64-3
  1 test-docker-fedora39-armv8l-1
 36 test-docker-fedora39-x64-1
 13 test-docker-sles12-s390x-1
  1 test-docker-sles15-armv8l-1
 13 test-docker-sles15-s390x-1
  1 test-docker-ubi8-x64-1
 62 test-docker-ubi8-x64-2
 36 test-docker-ubi8-x64-3
  1 test-docker-ubuntu1804-armv8l-4
  1 test-docker-ubuntu2004-armv7l-1
  1 test-docker-ubuntu2004-armv7l-2
  1 test-docker-ubuntu2004-armv7l-3
  1 test-docker-ubuntu2004-armv7l-4
  1 test-docker-ubuntu2004-armv7l-5
  1 test-docker-ubuntu2004-armv7l-6
  6 test-docker-ubuntu2004-armv8l-1
 55 test-docker-ubuntu2004-armv8l-2
 55 test-docker-ubuntu2004-armv8l-3
  1 test-docker-ubuntu2004-x64-1
  1 test-docker-ubuntu2004-x64-2
 36 test-docker-ubuntu2004-x64-3
  2 test-docker-ubuntu2004-x64-4
  1 test-docker-ubuntu2204-armv8-1
  1 test-docker-ubuntu2204-armv8-2
 55 test-docker-ubuntu2204-armv8-3
  1 test-docker-ubuntu2204-armv8-4
  6 test-docker-ubuntu2204-armv8l-2
  1 test-docker-ubuntu2204-x64-1
 62 test-docker-ubuntu2204-x64-2
  1 test-docker-ubuntu2204-x64-3
 36 test-docker-ubuntu2204-x64-4
 36 test-docker-ubuntu2204-x64-5
 40 test-docker-ubuntu2204-x64-6
 52 test-docker-ubuntu2310-armv8l-1
 57 test-equinix_esxi-ubuntu2204-x64-2
 57 test-ibmcloud-rhel6-x64-1
 43 test-macincloud-macos1201-x64-1
 43 test-macincloud-macos1201-x64-2
 57 test-osuosl-aix72-ppc64-5
 51 test-osuosl-ubuntu1604-ppc64le-3
 57 test-osuosl-ubuntu1604-ppc64le-4
 36 test-packet-ubuntu1604-armv8-2-OFF
 51 test-rise-debian12-riscv64-4
 57 test-rise-debian12-riscv64-9
 35 trss-node
@sxa sxa mentioned this issue Mar 21, 2024
18 tasks
@sxa
Member Author

sxa commented Mar 26, 2024

A number of these are the ones hosted on the equinix machines which are to be decommissioned as part of #3292:

I guess the docker images have been shut down on those hosts as well as being marked offline which is why jenkins is still trying to connect to them.

@sxa
Member Author

sxa commented Mar 28, 2024

It looks like many of the ones with just one entry in the log are ones that have been marked offline in the jenkins UI.

The following are on dockerhost.dockerhost-equinix-ubuntu2004-x64-1 and can now be decommissioned. The machines on the ubuntu2204 host have all been removed already:

These have all now been removed from jenkins.

@sxa
Member Author

sxa commented Apr 2, 2024

@sxa sxa added this to the 2024-04 (April) milestone Apr 4, 2024
@sxa
Member Author

sxa commented Apr 5, 2024

I've removed the alibaba machines from jenkins. They are still in the inventory file for now.

  • build-alibaba-ubuntu1804-armv8-1
  • build-alibaba-ubuntu1804-armv8-2
  • build-alibaba-win2012r2-x64-1
  • build-alibaba-win2012r2-x64-2
  • test-alibaba-debian-riscv64-1
  • test-alibaba-ubuntu1804-armv8-1
  • test-alibaba-ubuntu1804-armv8-2

Jenkins agent node definitions have been backed up to alibababnodes.tar.gz in the nodes directory on the server in case their information is required in the future.
Ditto for the trss-node, which was pointing to the old server on AWS.

@sxa
Member Author

sxa commented Apr 5, 2024

Remaining test-docker machines that are not contactable:

Noting that these try to connect about once every 20 minutes in a failure case, and take a varying amount of time to fail the connection, up to 825s

@Haroon-Khel
Contributor

Of the offline machines in #3486 (comment) I'm seeing a lot of bash: line 1: /usr/lib/jvm/jdk17/bin/java: No such file or directory

It seems on the dockerhosts, the ports have been changed?

root@dockerhost-azure-ubuntu2204-x64-2:~# docker ps | grep 32771
e60c862b5614   aqa_alp319     "/usr/sbin/sshd -D"   2 weeks ago   Up 6 days   0.0.0.0:32768->22/tcp, :::32768->22/tcp           ALP319.32771
2d2c4cbe2944   aqa_u2004      "/usr/sbin/sshd -D"   2 weeks ago   Up 6 days   0.0.0.0:32771->22/tcp, :::32771->22/tcp           U2004.32768

I wonder what caused this
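
A quick way to spot this kind of swap across a whole dockerhost is to compare the port embedded in each container name with the host port that is actually published for sshd. The sketch below is illustrative only: it assumes (this is inferred from the output above, not a documented convention) that the numeric suffix of the container name is the SSH port the jenkins agent expects, and `find_port_mismatches` is a hypothetical helper name.

```python
import re

# Sample rows copied from the `docker ps` output quoted above.
DOCKER_PS = """\
e60c862b5614   aqa_alp319     "/usr/sbin/sshd -D"   2 weeks ago   Up 6 days   0.0.0.0:32768->22/tcp, :::32768->22/tcp           ALP319.32771
2d2c4cbe2944   aqa_u2004      "/usr/sbin/sshd -D"   2 weeks ago   Up 6 days   0.0.0.0:32771->22/tcp, :::32771->22/tcp           U2004.32768
"""


def find_port_mismatches(docker_ps_text):
    """Return (container_name, expected_port, published_port) per mismatch.

    Assumption: the container name ends in ".<port>", where <port> is the
    host port that should be forwarded to sshd (port 22) in the container.
    """
    mismatches = []
    for line in docker_ps_text.splitlines():
        # Host port published for the container's sshd.
        published = re.search(r"0\.0\.0\.0:(\d+)->22/tcp", line)
        # Container name is the last column, e.g. "ALP319.32771".
        name = re.search(r"(\S+)\.(\d+)\s*$", line)
        if not (published and name):
            continue  # header row or a container without a mapped sshd port
        expected = int(name.group(2))
        actual = int(published.group(1))
        if expected != actual:
            mismatches.append((name.group(0).strip(), expected, actual))
    return mismatches


if __name__ == "__main__":
    for container, expected, actual in find_port_mismatches(DOCKER_PS):
        print(f"{container}: name implies port {expected}, published on {actual}")
```

On the two rows above this reports both containers, since ALP319.32771 is published on 32768 and U2004.32768 is published on 32771.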

@sxa
Member Author

sxa commented Apr 5, 2024

The one I've just looked at seemed to be trying to use jdk21 instead of jdk17 which is on the machine so maybe there is some inconsistency there. If it's not that on your machine, maybe just double check it's got the JDK for the correct architecture e.g.

root@dockerhost-azure-ubuntu2204-x64-2:~# docker exec U2004.32768 file /usr/lib/jvm/jdk17/bin/java
/usr/lib/jvm/jdk17/bin/java: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 2.6.18, not stripped
root@dockerhost-azure-ubuntu2204-x64-2:~# 

@Haroon-Khel
Contributor

The ports were rearranged such that alpine nodes became debian/ubuntu/fedora nodes. Alpine uses jdk21 while the others use 17 hence the confusion in jenkins. We use 21 on x64 and arm64 alpine because there is no arm64 alpine jdk17 binary

@sxa
Member Author

sxa commented Apr 5, 2024

Also noting that we're getting Attempting to reconnect test-ibmcloud-rhel6-x64-1 for a machine which has been marked offline in the jenkins UI just now, which is "somewhat unexpected" since there are no obvious connection issues in the log, so I assume this is just a jenkins oddity ...

EDIT: Noting that the SSH Launch of message does NOT come up for these machines, so that is the better message to look for.
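
For anyone wanting to regenerate a per-node tally like the one at the top of this issue from that message, here is a minimal sketch. The exact log line shape is an assumption (only the "SSH Launch of" prefix comes from this thread), so the regular expression, the sample lines, and the helper names `tally_failed_launches` and `format_tally` are all hypothetical and will need adjusting against the real jenkins log.

```python
import re
from collections import Counter

# ASSUMED line shape, not confirmed against the real log:
#   "... SSH Launch of <node> on <host> failed in <n> ms"
LAUNCH_RE = re.compile(r"SSH Launch of (\S+) on \S+ failed")


def tally_failed_launches(log_lines):
    """Count failed SSH launches per node name."""
    counts = Counter()
    for line in log_lines:
        match = LAUNCH_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def format_tally(counts):
    """Render counts as 'count node' rows sorted by node name."""
    return [f"{n:4d} {node}" for node, n in sorted(counts.items())]


if __name__ == "__main__":
    # Made-up sample lines for illustration only.
    sample = [
        "INFO: SSH Launch of test-docker-ubi8-x64-2 on example-host failed in 825000 ms",
        "Attempting to reconnect test-ibmcloud-rhel6-x64-1",
        "INFO: SSH Launch of test-docker-ubi8-x64-2 on example-host failed in 12000 ms",
    ]
    print("\n".join(format_tally(tally_failed_launches(sample))))
```

Lines that only contain "Attempting to reconnect" are ignored, which avoids counting machines that are merely marked offline in the UI.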

@sxa
Member Author

sxa commented Apr 5, 2024

The ports were rearranged such that alpine nodes became debian/ubuntu/fedora nodes. Alpine uses jdk21 while the others use 17 hence the confusion in jenkins. We use 21 on x64 and arm64 alpine because there is no arm64 alpine jdk17 binary

👍🏻

We should consider a migration of everything up to 21 where possible (arm32 and Solaris being the exceptions, although arm32 could have an ea-beta build but I'd rather leave those at 17) Ref #3442 (comment)

@sxa
Member Author

sxa commented Apr 5, 2024

Noting that as per #1843 (comment) the machine test-aws-ubuntu2004-x64-1 has been decommissioned so I'll remove that from jenkins too.
Similarly https://github.com/adoptium/infrastructure/pull/2150/files removed test-osuosl-ubuntu2004-ppc64le-[34] so they are now removed too.

@sxa
Member Author

sxa commented Apr 8, 2024

Other than the RISE ones which are offline due to the administrator being away last week, we are left with just two systems showing recurring problems today:

@sxa
Member Author

sxa commented Apr 25, 2024

test-docker-ubuntu2004-x64-4 has been rebuilt and now works.

I'm seeing four in the log now, but these are the containers on the Skytap x64 dockerhost, which has run out of credits again despite the reduction in size of that system which was put in place for this month:

@sxa
Member Author

sxa commented Apr 30, 2024

Since the skytap machine is down to 6 cores I'm deleting all of the above agents other than debian12 and UBI8 from the machine

@sxa
Member Author

sxa commented May 3, 2024

Closing on the basis that all of these have been resolved other than the Skytap x64 node which is a "known issue"

@sxa sxa closed this as completed May 3, 2024
@sxa sxa added the reliability label May 3, 2024