New Machine requirement - Third (!) Linux/x64 dockerhost/build host #3470

Closed
sxa opened this issue Mar 13, 2024 · 14 comments

@sxa
Member

sxa commented Mar 13, 2024

I need to request a new machine:

  • New machine operating system (e.g. linux/windows/macos/solaris/aix): Linux
  • New machine architecture (e.g. x64/aarch32/arm32/ppc64/ppc64le/sparc): x64
  • Provider (leave blank if it does not matter): Any ... But we don't have many options
  • Desired usage: Replacement for the two dockerhost x64 systems currently hosted on Equinix
  • Any unusual specification/setup required: docker for running dockerhost containers and build pipelines
  • How many of them are required: 1

Please explain what this machine is needed for: Replacement for Equinix systems which we have to decommission as per #3292

Follow-on to #3378
Replacement for first attempt at #3352 due to Skytap quota issues.

@sxa sxa added this to the 2024-03 (March) milestone Mar 13, 2024
@Haroon-Khel Haroon-Khel self-assigned this Mar 14, 2024
@Haroon-Khel
Contributor

New azure dockerhost machine setup on 20.83.24.86, dockerhost-azure-ubuntu2204-x64-2.
100G for /home/jenkins/
400G for /var/lib/docker

Will add into jenkins, inventory.yml and add docker containers shortly

@Haroon-Khel
Contributor

Haroon-Khel commented Mar 14, 2024

Something I noticed when installing docker.

Installing Docker with the playbooks (the dockerhost.yml playbook) gives this version:

root@dockerhost-azure-ubuntu2204-x64-2:~# docker version
Client:
 Version:           24.0.5
 API version:       1.43
 Go version:        go1.20.3
 Git commit:        24.0.5-0ubuntu1~22.04.1
 Built:             Mon Aug 21 19:50:14 2023
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          24.0.5
  API version:      1.43 (minimum version 1.12)
  Go version:       go1.20.3
  Git commit:       24.0.5-0ubuntu1~22.04.1
  Built:            Mon Aug 21 19:50:14 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.2
  GitCommit:        
 runc:
  Version:          1.1.7-0ubuntu1~22.04.2
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit:    

I get this gpg error when building one of our Docker images (Fedora 39, but I've seen it when building others):

 ---> Running in b5d0a622635b
Removing intermediate container b5d0a622635b
 ---> 352976db5502
Step 7/26 : RUN gpg --verify /tmp/jdk17.sig /tmp/jdk17.tar.gz
 ---> Running in 4ba53986cf16
gpg: Signature made Thu Jan 18 14:00:41 2024 UTC
gpg:                using RSA key 3B04D753C9050D9A5D343F39843C48A565F8F04B
gpg: Note: database_open 134217901 waiting for lock (held by 11) ...
gpg: Note: database_open 134217901 waiting for lock (held by 11) ...
gpg: Note: database_open 134217901 waiting for lock (held by 11) ...
gpg: Note: database_open 134217901 waiting for lock (held by 11) ...
gpg: Note: database_open 134217901 waiting for lock (held by 11) ...
gpg: keydb_search failed: Connection timed out
gpg: Note: database_open 134217901 waiting for lock (held by 11) ...
gpg: Note: database_open 134217901 waiting for lock (held by 11) ...
gpg: Note: database_open 134217901 waiting for lock (held by 11) ...
gpg: Note: database_open 134217901 waiting for lock (held by 11) ...
gpg: Note: database_open 134217901 waiting for lock (held by 11) ...
gpg: keydb_search failed: Connection timed out
gpg: Can't check signature: No public key
The command '/bin/sh -c gpg --verify /tmp/jdk17.sig /tmp/jdk17.tar.gz' returned a non-zero code: 2

I recommend we update the Docker installation in our playbook to follow the method on Docker's own site: https://docs.docker.com/engine/install/ubuntu/#installation-methods

root@dockerhost-azure-ubuntu2204-x64-2:~# docker version
Client: Docker Engine - Community
 Version:           25.0.4
 API version:       1.44
 Go version:        go1.21.8
 Git commit:        1a576c5
 Built:             Wed Mar  6 16:32:12 2024
 OS/Arch:           linux/amd64
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          25.0.4
  API version:      1.44 (minimum version 1.24)
  Go version:       go1.21.8
  Git commit:       061aa95
  Built:            Wed Mar  6 16:32:12 2024
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

I don't get the same gpg error when building images with this version. I haven't looked deeper into this, so I don't know exactly what's causing the error, but this isn't the first time I'm seeing it.
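
For anyone comparing hosts, here is a minimal sketch (not part of the playbooks) that pulls the client and server engine versions out of plain `docker version` output, so the two installs can be told apart at a glance. The function name and the trimmed sample strings are illustrative assumptions; the version numbers come from the two outputs quoted in this comment.

```python
import re


def parse_docker_versions(text):
    """Return {'Client': ..., 'Server': ...} from plain `docker version` output.

    Only the first 'Version:' line after each 'Client:'/'Server:' header is
    used, so the containerd/runc/docker-init versions are ignored.
    """
    versions = {}
    section = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Client:"):
            section = "Client"
        elif stripped.startswith("Server:"):
            section = "Server"
        elif section and section not in versions:
            match = re.match(r"Version:\s+(\S+)", stripped)
            if match:
                versions[section] = match.group(1)
    return versions


# Trimmed copies of the two outputs quoted in this comment.
PLAYBOOK_INSTALL = """\
Client:
 Version:           24.0.5
 API version:       1.43

Server:
 Engine:
  Version:          24.0.5
 containerd:
  Version:          1.7.2
"""

UPSTREAM_INSTALL = """\
Client: Docker Engine - Community
 Version:           25.0.4

Server: Docker Engine - Community
 Engine:
  Version:          25.0.4
 containerd:
  Version:          1.6.28
"""

for label, output in (("playbook", PLAYBOOK_INSTALL), ("docker.com", UPSTREAM_INSTALL)):
    print(label, parse_docker_versions(output))
```

On a live host, `docker version --format '{{.Server.Version}}'` gives the server version directly without any parsing.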

@sxa
Member Author

sxa commented Mar 14, 2024

Both of your examples seem to be on the same machine - have you changed it from the defaults installed by the playbooks? I've not personally noticed anything like this on any of the other systems we've provisioned with Ubuntu 22.04.

What happens if you start up a new container and run the commands from the Dockerfile individually?

@Haroon-Khel
Contributor

have you changed it from the defaults installed by the playbooks?

On this new dockerhost I installed Docker following https://docs.docker.com/engine/install/ubuntu/#installation-methods which is different from how we do it in the playbooks. Installing it the playbook way is how I hit the above error. The two ways install different versions.

I've created 4 nodes on the machine, with no errors apart from having to add a firewall rule for the machine in the Azure console.

test-docker-ubuntu2204-x64-6
test-docker-ubuntu2004-x64-4
test-docker-debian12-x64-3
test-docker-alpine319-x64-3

@Haroon-Khel
Contributor

I was seeing a "process apparently never started" error on the dockerhost machine during a jdk22 build job: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk22/job/jdk22-linux-x64-temurin/31/

The 1000 user ID and group ID were taken by the azureuser user. I've swapped them so that the jenkins user now has uid and gid 1000.

@Haroon-Khel
Contributor

Haroon-Khel commented Mar 18, 2024

Still seeing the error https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk22/job/jdk22-linux-x64-temurin/33/console

12:26:33  process apparently never started in /home/jenkins/workspace/build-scripts/jobs/jdk22/jdk22-linux-x64-temurin@tmp/durable-332cc27a
12:26:33  (running Jenkins temporarily with -Dorg.jenkinsci.plugins.durabletask.BourneShellScript.LAUNCH_DIAGNOSTICS=true might make the problem clearer)
[Pipeline] }

@Haroon-Khel
Contributor

Haroon-Khel commented Mar 18, 2024

Fixed it. Despite the change of uid and gid, the cached adoptopenjdk/centos7_build_image still had the wrong uid and gid for its jenkins workspace:

root@dockerhost-azure-ubuntu2204-x64-2:~# docker run -it adoptopenjdk/centos7_build_image bash
[root@a8994343424d ~]# ls -la /home/jenkins/
total 20
drwx------ 2 1005 1005 4096 Mar 14 04:51 .
drwxr-xr-x 1 root root 4096 Mar 14 04:51 ..
-rw-r--r-- 1 1005 1005   18 Nov 24  2021 .bash_logout
-rw-r--r-- 1 1005 1005  193 Nov 24  2021 .bash_profile
-rw-r--r-- 1 1005 1005  231 Nov 24  2021 .bashrc

Deleting this image and allowing the build job to pull a new one fixed the issue https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk22/job/jdk22-linux-x64-temurin/35/console
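
A rough sketch of how this kind of mismatch could be spotted automatically, assuming the listing is captured as text. It scans `ls -l` style output for owner/group columns printed as bare numbers and reports any pair that is not the expected 1000:1000. The function names are hypothetical and the sample is the listing above; `ls -lan` would force numeric ids even when the container has a matching passwd entry.

```python
def numeric_owners(ls_output):
    """Collect (uid, gid) pairs that a long `ls` listing printed as numbers."""
    owners = set()
    for line in ls_output.splitlines():
        fields = line.split()
        # A long-listing entry has at least 9 fields:
        # mode, links, owner, group, size, three date fields, name.
        if len(fields) >= 9 and fields[2].isdigit() and fields[3].isdigit():
            owners.add((int(fields[2]), int(fields[3])))
    return owners


def unexpected_owners(ls_output, expected=(1000, 1000)):
    """Return the numeric (uid, gid) pairs that differ from the expected one."""
    return {owner for owner in numeric_owners(ls_output) if owner != expected}


# The listing from the cached image, as quoted above.
LS_OUTPUT = """\
total 20
drwx------ 2 1005 1005 4096 Mar 14 04:51 .
drwxr-xr-x 1 root root 4096 Mar 14 04:51 ..
-rw-r--r-- 1 1005 1005   18 Nov 24  2021 .bash_logout
-rw-r--r-- 1 1005 1005  193 Nov 24  2021 .bash_profile
-rw-r--r-- 1 1005 1005  231 Nov 24  2021 .bashrc
"""

print(unexpected_owners(LS_OUTPUT))
```

Entries owned by a named user such as root are skipped, since a name cannot be compared against a numeric id without the container's passwd file.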

@Haroon-Khel
Contributor

Haroon-Khel commented Mar 21, 2024

Regarding getting a Solaris box up and running on dockerhost-azure-ubuntu2204-x64-2

Using Stewart's setup instructions here, I'm getting caught up on the virtualbox installation

There were problems setting up VirtualBox.  To re-start the set-up process, run
  /sbin/vboxconfig
as root.  If your system is using EFI Secure Boot you may need to sign the
kernel modules (vboxdrv, vboxnetflt, vboxnetadp, vboxpci) before you can load
them. Please see your Linux system's documentation for more information.

I believe this is because Secure Boot is enabled. It's disabled on dockerhost-azure-ubuntu2204-x64-1, which I believe is why that machine is able to host its Solaris machine. I'm going to try disabling it on the -2 machine to see if this fixes the VirtualBox installation.
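
To confirm the Secure Boot state on each host before and after changing it, something like the following could be used. This is a sketch under the assumption that the machine boots via UEFI and exposes efivarfs at the usual path; the function names are made up for illustration. `mokutil --sb-state` reports the same thing where that tool is installed.

```python
from pathlib import Path

# Standard efivarfs location of the SecureBoot variable (EFI global variable GUID).
SECURE_BOOT_VAR = Path(
    "/sys/firmware/efi/efivars/SecureBoot-8be4df61-93ca-11d2-aa0d-00e098032b8c"
)


def secure_boot_enabled(raw):
    """Decode the raw efivarfs contents of the SecureBoot variable.

    efivarfs prefixes the value with a 4-byte attribute mask; the value
    itself is a single byte, where 1 means Secure Boot is enabled.
    """
    if len(raw) < 5:
        raise ValueError("unexpected SecureBoot variable length")
    return raw[4] == 1


def read_secure_boot():
    """Return True/False on UEFI hosts, or None if the variable is absent."""
    if not SECURE_BOOT_VAR.exists():
        return None  # legacy BIOS boot, or efivarfs not mounted
    return secure_boot_enabled(SECURE_BOOT_VAR.read_bytes())
```

Calling `read_secure_boot()` on the -1 and -2 hosts would show whether they really differ in this setting.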

@Haroon-Khel
Contributor

I managed to disable Secure Boot through the Azure console.

@Haroon-Khel
Contributor

Seeing this error

Stderr: VBoxManage: error: VT-x is disabled in the BIOS for all CPU modes (VERR_VMX_MSR_ALL_VMX_DISABLED)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component ConsoleWrap, interface IConsole

Virtualisation is enabled on the D series v4 cpus

@sxa
Member Author

sxa commented Mar 25, 2024

Interesting - /proc/cpuinfo has vmx and some other differences in the flags despite both being listed as the same CPU:
model name : Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
First machine that works:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow ept vpid ept_ad fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves vnmi avx512_vnni arch_capabilities
Second machine:
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_vnni arch_capabilities
The working machine has these extra ones: vmx tpr_shadow ept vpid ept_ad vnmi
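
The comparison above can be reproduced with a small set difference over the two `flags` lines. This is only a sketch: the function names are illustrative, and the sample lines are shortened versions of the two listings (the full lines from /proc/cpuinfo work the same way).

```python
def cpu_flags(flags_line):
    """Parse a 'flags : ...' line from /proc/cpuinfo into a set of flag names."""
    # Drop the 'flags :' label if present; a bare list of flags also works.
    values = flags_line.split(":", 1)[-1]
    return set(values.split())


def extra_flags(working_line, problem_line):
    """Flags present on the working machine but missing on the problem one."""
    return sorted(cpu_flags(working_line) - cpu_flags(problem_line))


# Shortened versions of the two flags lines quoted above.
WORKING = ("flags : fpu vme pclmulqdq vmx ssse3 invpcid_single tpr_shadow "
           "ept vpid ept_ad fsgsbase xsaves vnmi avx512_vnni")
PROBLEM = ("flags : fpu vme pclmulqdq ssse3 invpcid_single fsgsbase xsaves "
           "avx512_vnni")

print(extra_flags(WORKING, PROBLEM))
# ['ept', 'ept_ad', 'tpr_shadow', 'vmx', 'vnmi', 'vpid']
```

All six of the missing flags relate to hardware virtualisation (VT-x and EPT), which is consistent with the VERR_VMX_MSR_ALL_VMX_DISABLED error from VBoxManage.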

@gdams @karianna Are you aware of why two supposedly identical CPUs hosted on Azure might be serving up different CPU flags in this situation? Any help you can provide to get the new one able to run VMs would be appreciated. There was some experimentation in #3347 (comment) but I would have expected two machines of the same type and processor to support it. Could that have been an invalid assumption?

@karianna
Contributor

Are they the same SKU from the same region?

@sxa
Member Author

sxa commented Mar 26, 2024

Are they the same SKU from the same region?

The ones we're talking about are the second and third in this list (The first is unused as it was an AMD system that didn't seem to support running the VMs - it seems the third is having the same problem) so they seem to be in the same location and are D16s_v4 (Is that the SKU you're asking for?):

[image: Azure portal list of the virtual machines]

Working machine: dockerhost-azure-ubuntu2204-x64-1 (The -intel one in the portal image above)
Problem machine from this issue: dockerhost-azure-ubuntu2204-x64-2

@Haroon-Khel
Contributor

The Solaris build box is up at https://ci.adoptium.net/computer/build-azure-solaris10-x64-1/. It's hosted on dockerhost-azure-ubuntu2204-x64-1. A jdk8 build ran without error: https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-hotspot/802/
