Look at moving macstadium machines to orka #2536

Open
sxa opened this issue Apr 22, 2022 · 38 comments


@sxa
Member

sxa commented Apr 22, 2022

I need to request a new machine:

  • New machine operating system (e.g. linux/windows/macos/solaris/aix): mac
  • New machine architecture (e.g. x64/aarch32/arm32/ppc64/ppc64le/sparc): x64
  • Provider (leave blank if it does not matter): macstadium
  • Desired usage: Replacement for the non-orka machine that we have to reduce costs
  • Any unusual specification/setup required: Standard playbooks, although if there is an opportunity for more lower spec ones that could be beneficial
  • How many of them are required: Start with 2, look at increasing.

Please explain what this machine is needed for:

@sxa sxa self-assigned this Apr 22, 2022
@sxa sxa changed the title Looks at moving macstadium machines to orka Look at moving macstadium machines to orka Apr 29, 2022
@sxa sxa removed their assignment Feb 6, 2023
@sxa sxa assigned sxa and gdams Jul 12, 2023
@sxa
Member Author

sxa commented Jul 12, 2023

As per the discussion a few weeks ago, the action is on me to progress this, so George and I will look at this migration together.

@sxa sxa added this to the 2023-08 (August) milestone Jul 12, 2023
@sxa
Member Author

sxa commented Jul 12, 2023

Related: adoptium/temurin-build#3354

@sxa
Member Author

sxa commented Aug 8, 2023

Our orka systems have been deprovisioned due to inactivity - we are currently in negotiations to determine a way forward.

@sxa
Member Author

sxa commented Aug 22, 2023

Discussions with MacStadium have indicated that an orka-based solution (which would not be sponsored at present) would be approximately twice the cost of the static systems we have at present, so we are looking at alternative options.

Here is a breakdown of the number of systems and their types we have at macstadium:

| Use | x64 | aarch64 |
|---|---|---|
| Build | 2xG3 (4core) | 2xG5G |
| Test | 6xG3B (4core), 1xG4B (6core) | 2xG5A |
| TCK | 2xC3D (sml), 1xG4D (lge) | 2xG5E |

So that's a total of 4+9+5 = 18 systems.
We currently have two hosted with MacInCloud with a potential option to increase that, particularly for x64 capacity
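
As a quick cross-check of the 4+9+5 = 18 figure, here is a small tally. The counts are copied from the breakdown above; the dictionary layout and function names are just for illustration.

```python
# Machine counts copied from the breakdown above: {use: {arch: [counts per type]}}
inventory = {
    "Build": {"x64": [2], "aarch64": [2]},
    "Test": {"x64": [6, 1], "aarch64": [2]},
    "TCK": {"x64": [2, 1], "aarch64": [2]},
}


def totals_by_use(inv):
    """Sum the machine counts for each use (Build/Test/TCK)."""
    return {use: sum(sum(counts) for counts in archs.values())
            for use, archs in inv.items()}


def totals_by_arch(inv):
    """Sum the machine counts for each architecture across all uses."""
    result = {}
    for archs in inv.values():
        for arch, counts in archs.items():
            result[arch] = result.get(arch, 0) + sum(counts)
    return result


if __name__ == "__main__":
    print(totals_by_use(inventory))
    print(totals_by_arch(inventory))
    print("total:", sum(totals_by_use(inventory).values()))
```

Splitting by architecture as well is handy when weighing up how much x64 capacity would need to move elsewhere.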

@sxa
Member Author

sxa commented Aug 22, 2023

Looking at the performance of various systems, here are some runs of the JDK8/x64 extended.openjdk suite on the different machines:

| System | Time | Failures? |
|---|---|---|
| TC G4D [*] | 2h28 | 17 (hostname issues) 3702 |
| TC G3D - i5/2C/8G [*] | 6h51 | Same hostname issues as G4D 3701 |
| G3B - i7/4C/16G | 3h03 | All passed |
| G3B - i7/4C/16G [*] | 3h38 | Three failures in java.nio |
| aarch64 (Rosetta) [*] | 2h24 | 14 failures |
| MacInCloud i7-8700B | 3h15 | 1 failure in com/sun/jndi/ldap |
| G4B i7/6C/32G [*] | 1h46 | 10 failures in net/nio/rmi |

[*] - These machines have not typically been used for running the openjdk suites in the past so these may be newly visible failures. The second G3B machine was one of the build machines rather than one tagged for test.

So with the exception of the second line, the performance of these for running the full extended.openjdk suite looks reasonable. It should be noted that it is between 2x and 2.5x slower to run the same tests on JDK21 so around 8h for a G3B and 3h30 for a G4B.
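
For anyone wanting to redo the JDK21 estimate for a different machine, a rough sketch of the scaling is below. It assumes the 2x to 2.5x factor applies uniformly to the whole suite, which is only an approximation, and the helper names are made up for this example.

```python
def parse_hm(text):
    """Convert a duration written like '3h03' into minutes."""
    hours, minutes = text.split("h")
    return int(hours) * 60 + int(minutes)


def format_hm(minutes):
    """Format a number of minutes back into the 'XhYY' style used above."""
    return f"{minutes // 60}h{minutes % 60:02d}"


def jdk21_range(jdk8_time, low=2.0, high=2.5):
    """Scale a JDK8 run time by the low and high slowdown factors."""
    base = parse_hm(jdk8_time)
    return format_hm(round(base * low)), format_hm(round(base * high))


if __name__ == "__main__":
    for system, jdk8_time in [("G3B", "3h38"), ("G4B", "1h46")]:
        print(system, jdk21_range(jdk8_time))
```

With the 1h46 G4B run this gives a range of 3h32 to 4h25, so the 3h30 figure sits at the optimistic end of the range.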

@sxa
Member Author

sxa commented Aug 22, 2023

Some other pieces of note:

  • We are currently using a number of older machines running macos 10. We should consider whether we wish to retain any such machines for compatibility testing.
  • We have also tried some runs with cross-compiling from aarch64 to x64 with a view to reducing the number of the (larger) build machines which we require by not requiring dedicated x64 build systems. This has been generally positive, although JDK8 has not yet been verified (we'll need to get a suitable boot JDK installed, since Adoptium doesn't produce one for JDK8 on aarch64).
  • The cross-compilation described above would also be useful if we switch to using dynamically created VMs using the macos Virtualization Framework on aarch64 in support of having fully isolated environments for each build.

@sxa
Member Author

sxa commented Aug 23, 2023

Noting that JDK8 will not build on macos12 with Xcode 13:

checking for xcodebuild... /usr/bin/xcodebuild
configure: error: Xcode 6, 9-12 is required to build JDK 8, the version found was 13.1. Use --with-xcode-path to specify the location of Xcode or make Xcode active by using xcode-select.
No configurations found for /Users/jenkins/sxa/temurin-build/build-farm/workspace/build/src/! Please run configure to create a configuration.
Makefile:55: *** Cannot continue.  Stop.
OpenJDK make failed, archiving make failed logs
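
A pre-flight check along these lines could flag an unsuitable machine before a build is attempted. This is only a sketch: the accepted major versions are taken from the configure message above ("Xcode 6, 9-12"), but the function is not the actual configure logic and its name is invented for this example.

```python
# Major Xcode versions accepted for JDK 8, per the configure error quoted above.
JDK8_SUPPORTED_XCODE_MAJORS = {6, 9, 10, 11, 12}


def xcode_ok_for_jdk8(version):
    """Return True if an Xcode version string such as '13.1' has an accepted major version."""
    major = int(version.split(".")[0])
    return major in JDK8_SUPPORTED_XCODE_MAJORS


if __name__ == "__main__":
    for candidate in ["11.7", "12.0", "13.1"]:
        print(candidate, xcode_ok_for_jdk8(candidate))
```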

If I try a cross-compile from macos11/aarch64 with Xcode 12, I need to make a couple of other changes.
The build can be attempted by adjusting mac.sh to ensure xcode-select -switch / is run, and by using --openjdk-target=x86_64-apple-darwin in the configure args. However, for JDK8 the build fails with some more errors:

error: use of undeclared identifier 'finite'; did you mean 'isfinite'?

Which seems to have been deprecated and then removed in earlier Xcode versions (Possible backport?)

Error: value size does not match register size specified by the constraint and modifier [-Werror,-Wasm-operand-widths]

may be more problematic

@Haroon-Khel
Contributor

Haroon-Khel commented Aug 23, 2023

Just to be rigorous, I've kicked off the AQA test pipeline on all of our mac machines: JDK8 and 11 for x64, just 11 for arm. The focus is the build and test-macstadium machines; the other machines can be used as a 'control'.

test-macstadium-macos1014-x64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/158/console
test-macstadium-macos1014-x64-2 https://ci.adoptium.net/job/AQA_Test_Pipeline/157/console
test-macstadium-macos11-arm64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/162/console
test-macstadium-macos11-arm64-2 https://ci.adoptium.net/job/AQA_Test_Pipeline/161/console
test-macstadium-macos1014-x64-3 https://ci.adoptium.net/job/AQA_Test_Pipeline/163/console
test-macstadium-macos1014-x64-4 https://ci.adoptium.net/job/AQA_Test_Pipeline/164/console
test-macstadium-macos1015-x64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/165/console
build-macstadium-macos11-arm64-2 https://ci.adoptium.net/job/AQA_Test_Pipeline/166/console
build-macstadium-macos11-arm64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/167/console
build-macstadium-macos1014-x64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/168/console
build-macstadium-macos1014-x64-2 https://ci.adoptium.net/job/AQA_Test_Pipeline/169/console
test-macincloud-macos1201-x64-1 https://ci.adoptium.net/job/AQA_Test_Pipeline/170/console
test-macincloud-macos1201-x64-2 https://ci.adoptium.net/job/AQA_Test_Pipeline/171/console

@Haroon-Khel
Contributor

Bit of a bad idea to run all of them at the same time. Some of the test jobs have expired even after 1 day.

Sifting through the tests that have finished and not expired, avoiding duplicates (i.e. if jdk_security1_0 and jdk_security1_1 have the same failed tests, only jdk_security1_0 is shown):

test-macstadium-macos11-arm64-1 jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,jvm_compiler_0,jdk_io_0,jdk_other_0,jdk_net_0,jdk_net_0,jdk_time_0,jdk_tools_0,jdk_jfr_0,jdk_jdi_0,jdk_security_infra_0

test-macstadium-macos11-arm64-2 (same failures as -1)
jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,jvm_compiler_0,jdk_io_0,jdk_other_0,jdk_net_0,jdk_net_0,jdk_time_0,jdk_tools_0,jdk_jfr_0,jdk_jdi_0,jdk_security_infra_0

build-macstadium-macos11-arm64-2
jdk_math_1,jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,jvm_compiler_0,jdk_io_0,jdk_other_0,jdk_net_0,jdk_security3_0,jdk_time_0,jdk_tools_0,jdk_jfr_0,jdk_jdi_0,jdk_security_infra_0

build-macstadium-macos11-arm64-1 (same failures as -2)
jdk_math_1,jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,jvm_compiler_0,jdk_io_0,jdk_other_0,jdk_net_0,jdk_security3_0,jdk_time_0,jdk_tools_0,jdk_jfr_0,jdk_jdi_0,jdk_security_infra_0
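
To see where the test and build arm64 machines differ, the two target lists above can be compared as sets. The lists are copied verbatim from this comment; the helper name is just for illustration.

```python
# Failing targets copied from the lists above (test-* and build-* arm64 machines).
TEST_ARM64 = (
    "jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,jvm_compiler_0,"
    "jdk_io_0,jdk_other_0,jdk_net_0,jdk_net_0,jdk_time_0,jdk_tools_0,jdk_jfr_0,"
    "jdk_jdi_0,jdk_security_infra_0"
)
BUILD_ARM64 = (
    "jdk_math_1,jdk_security1_0,jdk_security4_0,jdk_util_0,jdk_svc_sanity_0,"
    "jvm_compiler_0,jdk_io_0,jdk_other_0,jdk_net_0,jdk_security3_0,jdk_time_0,"
    "jdk_tools_0,jdk_jfr_0,jdk_jdi_0 ,jdk_security_infra_0"
)


def parse_targets(text):
    """Split a comma-separated target list into a set, ignoring stray whitespace."""
    return {name.strip() for name in text.split(",") if name.strip()}


test_targets = parse_targets(TEST_ARM64)
build_targets = parse_targets(BUILD_ARM64)

only_on_build = build_targets - test_targets
only_on_test = test_targets - build_targets

if __name__ == "__main__":
    print("only failing on build machines:", sorted(only_on_build))
    print("only failing on test machines:", sorted(only_on_test))
```

Going by the lists as posted, the build machines fail everything the test machines do plus jdk_math_1 and jdk_security3_0.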

@sxa
Member Author

sxa commented Aug 29, 2023

So the failures you've got are only from the arm64 ones? And are all those targets from the openjdk suite - were the other targets all good?
I'm a bit surprised we're seeing issues on arm64 when using the arm64 builds - I would expect some issues when trying to run the x64 ones on arm64 but it looks like you've run those with the real arm64 build - is that correct?
I'm particularly interested in test-macstadium-macos1014-x64-4 and the build-x64 ones so if those results have got lost we should get those re-run

@Haroon-Khel
Contributor

Haroon-Khel commented Aug 29, 2023

| Machine | Xcode version | JDK11 x64 build | JDK17 x64 build | JDK20 x64 build |
|---|---|---|---|---|
| build-macstadium-macos11-arm64-1 | Apple clang version 12.0.0 (clang-1200.0.32.29) | build | build | build |
| build-macstadium-macos11-arm64-2 | Apple clang version 12.0.0 (clang-1200.0.32.29) | build | build | build |
| test-macstadium-macos11-arm64-1 | Apple clang version 12.0.0 (clang-1200.0.32.29) | build | build | build |
| test-macstadium-macos11-arm64-2 | Apple clang version 12.0.0 (clang-1200.0.32.29) | build | build | build |
| test-macincloud-macos1201-x64-1 | Apple clang version 13.0.0 (clang-1300.0.29.3) | build | build | build |
| test-macincloud-macos1201-x64-2 | Apple clang version 13.1.6 (clang-1316.0.21.2.3) | build | build | build |

Can only kick off one build job at a time and on one machine at a time 😅 , this will take a while

@sxa
Member Author

sxa commented Aug 29, 2023

A couple of other things to add to this list - see if we can build ok on clang13 on macos12 (The two macincloud machines) but also see if we can install the older version of xcode (The one used for JDK8) on a newer macos version.

@Haroon-Khel
Contributor

Haroon-Khel commented Sep 6, 2023

Notes from building x64 jdk8 on my m1 mac

Install xcode11.7. I can do this on my own mac (with GUI), need to find a way to do this headless

Switch to xcode 11.7
xcode-select -switch 'path to Xcode11.7'

Install 'intel' homebrew into /usr/local/Homebrew, requires a new Rosetta bash shell

arch -x86_64 /usr/bin/env bash
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Back to a non Rosetta shell:
export PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"
Install intel libpng (for freetype)
arch -x86_64 brew install libpng

Command to run build

arch -x86_64 ./makejdk-any-platform.sh --clean-git-repo --jdk-boot-dir 'path to x64 jdk8 mac binary'/Contents/Home --configure-args '--with-toolchain-type=clang --openjdk-target=x86_64-apple-darwin --with-cups=/opt/homebrew/opt/cups/' --target-file-name jdk8_x64.tar.gz --build-variant temurin jdk8u

If there are still errors with the freetype compilation, install intel freetype and rerun the build
arch -x86_64 brew install freetype
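
If this ends up being scripted rather than typed by hand, the build invocation above could be assembled something like this. It is a sketch only: the flags are the ones from the command in these notes, while the function name and the example boot JDK path are placeholders, and nothing is executed here.

```python
import shlex


def jdk8_cross_build_command(boot_jdk_home, cups_path="/opt/homebrew/opt/cups/"):
    """Build the argv list for the x64 JDK8 cross-build described in the notes above."""
    configure_args = " ".join([
        "--with-toolchain-type=clang",
        "--openjdk-target=x86_64-apple-darwin",
        f"--with-cups={cups_path}",
    ])
    return [
        "arch", "-x86_64", "./makejdk-any-platform.sh",
        "--clean-git-repo",
        "--jdk-boot-dir", boot_jdk_home,
        "--configure-args", configure_args,
        "--target-file-name", "jdk8_x64.tar.gz",
        "--build-variant", "temurin",
        "jdk8u",
    ]


if __name__ == "__main__":
    # Placeholder path: substitute the location of an x64 JDK8 mac binary.
    command = jdk8_cross_build_command("/path/to/x64-jdk8/Contents/Home")
    print(shlex.join(command))
```

Keeping the command as a list means it can be handed straight to subprocess.run without worrying about quoting the configure args.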

@Haroon-Khel
Contributor

I built another x64 jdk8 binary on build-macstadium-macos11-arm64-1 and uploaded it to jenkins here

I kicked off the aqa test pipeline, https://ci.adoptium.net/job/AQA_Test_Pipeline/173/console. Only sanity openjdk failed
https://ci.adoptium.net/job/Test_openjdk8_hs_sanity.openjdk_x86-64_mac/883/

jdk_jdi_jdk8_0
 com/sun/jdi/RedefineCrossEvent.java.RedefineCrossEvent
 com/sun/jdi/PrivateTransportTest.sh.PrivateTransportTest

@Haroon-Khel
Contributor

Haroon-Khel commented Sep 15, 2023

In the interest of seeing how x64 mac tests run on arm64 mac, I kicked off https://ci.adoptium.net/job/AQA_Test_Pipeline/174/console (jdk11 aqa tests on test-macstadium-macos11-arm64-1)

Most tests passed. Failing ones are:

Jlink_ReqMod

MathLoadTest_all_5m

jdk_io
   java/io/Serializable/serialFilter/GlobalFilterTest.java

jdk_time
   java/time/test/java/time/format/TestUTCParse.java

jdk_jfr_0 44 failed tests

jdk_jdi
   com/sun/jdi/JdbOptions.java

jdk_security_infra
   security/infra/java/security/cert/CertPathValidator/certification/GoogleCA.java

jdk_svc_sanity
   jdk/jfr/jcmd/TestJcmdStartStopDefault.java

@Haroon-Khel
Contributor

Haroon-Khel commented Sep 18, 2023

Ref #2536 (comment)

com/sun/jdi/RedefineCrossEvent.java.RedefineCrossEvent is excluded on openj9, https://github.com/adoptium/aqa-tests/blob/80e978693163b65ce6d3caabeb823ba594766167/openjdk/excludes/ProblemList_openjdk8-openj9.txt#L333

Known issue adoptium/aqa-tests#227, it fails the same way

Execution failed: `main' threw exception: com.sun.jdi.VMDisconnectedException: connection is closed    

Rerunning com/sun/jdi/PrivateTransportTest.sh.PrivateTransportTest on test-macstadium-macos1014-x64-2 https://ci.adoptium.net/job/Grinder/7564/console. Test passed ✅

So a cross compiled x64 jdk8 binary passes the tests in the AQA pipeline. Excellent news

@Haroon-Khel
Contributor

Haroon-Khel commented Sep 19, 2023

Ref #2536 (comment)

Rerunning the failing tests on different arm64 mac machines to rule out infra related failure

Jlink_ReqMod, MathLoadTest_all_5m https://ci.adoptium.net/view/Test_grinder/job/Grinder/7568/console on build-macstadium-macos11-arm64-2

MathLoadTest_all_5m passed, rerunning Jlink_ReqMod on build-macstadium-macos11-arm64-1 https://ci.adoptium.net/view/Test_grinder/job/Grinder/7574/console

On build-macstadium-macos11-arm64-1
java/io/Serializable/serialFilter/GlobalFilterTest.java https://ci.adoptium.net/job/Grinder/7569/console
java/time/test/java/time/format/TestUTCParse.java https://ci.adoptium.net/job/Grinder/7570/console
com/sun/jdi/JdbOptions.java https://ci.adoptium.net/job/Grinder/7571/console
security/infra/java/security/cert/CertPathValidator/certification/GoogleCA.java https://ci.adoptium.net/job/Grinder/7572/console
jdk/jfr/jcmd/TestJcmdStartStopDefault.java https://ci.adoptium.net/job/Grinder/7573/console

security/infra/java/security/cert/CertPathValidator/certification/GoogleCA.java rerun
https://ci.adoptium.net/job/Grinder/7575/console on build-macstadium-macos11-arm64-2

java/time/test/java/time/format/TestUTCParse.java rerun https://ci.adoptium.net/job/Grinder/7576/console on build-macstadium-macos11-arm64-2

@gdams
Member

gdams commented Nov 17, 2023

JDK11 build completed using XCode command line tools (same as before) https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk11u/job/jdk11u-mac-x64-temurin/327/

@gdams
Member

gdams commented Nov 20, 2023

Right now the main issues I'm seeing are with the VPN expiring after a certain amount of time, this should be resolved once the firewall is configured to allow Jenkins in

@sxa
Member Author

sxa commented Dec 7, 2023

@gdams Not sure it's been explicitly mentioned in here, but since it came up in the PMC this week, can you clarify the reason for moving to XCode 15? The openjdk build matrix lists 12 as the Oracle-supported compiler, with 13.1 as "known good" too. It seems possible that this is the cause of a lot of warnings showing in the build: adoptium/temurin-build#3562 so we should consider how to handle this.

@smlambert
Contributor

Still seeing some Terminated failures, for example, from https://ci.adoptium.net/job/Test_openjdk8_hs_sanity.openjdk_x86-64_mac/938/console:

14:17:45  TESTING:
14:17:47  Directory "/Users/admin/workspace/workspace/Test_openjdk8_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/../TKG/output_17023199822147/jdk_lang_1/work" not found: creating
14:17:47  Directory "/Users/admin/workspace/workspace/Test_openjdk8_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/../TKG/output_17023199822147/jdk_lang_1/report" not found: creating
14:17:49  XML output with verification to /Users/admin/workspace/workspace/Test_openjdk8_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/output_17023199822147/jdk_lang_1/work
14:36:34  make[1]: *** [sanity.openjdk-..] Terminated: 15
14:36:34  make: *** [_sanity.openjdk] Terminated: 15
14:36:34  /Users/admin/workspace/workspace/Test_openjdk8_hs_sanity.openjdk_x86-64_mac@tmp/durable-bbc054c3/script.sh: line 1:  1778 Terminated: 15          $MAKE _sanity.openjdk
14:36:34  make[2]: *** [sanity.openjdk-openjdk] Terminated: 15
14:36:34  make[3]: *** [jdk_lang_1] Terminated: 15
[Pipeline] sh

@sxa
Member Author

sxa commented Dec 14, 2023

Still seeing some Terminated failures, for example, from https://ci.adoptium.net

Grinding away to see how reproducible this is and if there's any consistency:

Noting that "Worked?" is going to indicate:

  • JDK8/x64 passed all test suites other than jdk_jdi_jdk8_0, jdk_jdi_jdk8_1
  • JDK21/aarch64 passed all test suites other than jdk_lang_0 jdk_lang_1 jdk_security2_0 jdk_security2_1 jdk_util_0 jdk_util_1
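
| Grinder | arch | machine | Worked? | Comment |
|---|---|---|---|---|
| 8244 | x64 | cloud-2 | | jdi failures |
| 8247 | x64 | j4dtq | | |
| 8248 | x64 | cloud-2 | | jdi failures |
| 8249 | x64 | cloud-1 | | jdi failures |
| 8250 | x64 | bnxp5 | | |
| 8251 | x64 | 6zdxr | | |
| 8252 | x64 | 4jxrn | | |
| 8253 | aarch64 | ckvq7 | | |
| 8254 | aarch64 | lwrdg | | |
| 8255 | aarch64 | 7gffk | | |
| 8256 | aarch64 | gnvw4 | | |
| 8257 | aarch64 | fm88f | | |
| 8258 | aarch64 | d9tcd | | |
| 8259 | aarch64 | q5bmc | | |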

Noting that the macincloud x64 machines completed sanity.openjdk in about 40 minutes, the orka ones took about 1h30

None of them seemed to have any of the unexpected termination problems, although since these were all kicked off in parallel they were unlikely to have re-used any existing machines...
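
The "Worked?" rule described above amounts to a subset check: a run counts as working if every failing target is in the known-failure list for that JDK/arch pair. A minimal sketch, with the allow-lists copied from the two bullets above and the function name invented for this example:

```python
# Known-failure allow-lists, copied from the "Worked?" criteria in this comment.
EXPECTED_FAILURES = {
    ("JDK8", "x64"): {"jdk_jdi_jdk8_0", "jdk_jdi_jdk8_1"},
    ("JDK21", "aarch64"): {
        "jdk_lang_0", "jdk_lang_1",
        "jdk_security2_0", "jdk_security2_1",
        "jdk_util_0", "jdk_util_1",
    },
}


def worked(jdk, arch, failed_targets):
    """True if every failed target is one of the expected failures for this JDK/arch."""
    return set(failed_targets) <= EXPECTED_FAILURES[(jdk, arch)]


if __name__ == "__main__":
    print(worked("JDK8", "x64", ["jdk_jdi_jdk8_0"]))
    print(worked("JDK21", "aarch64", ["jdk_lang_0", "jdk_net_0"]))
```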

@sxa
Member Author

sxa commented Dec 14, 2023

Looking at some recently failing jobs on macos:

JDK8 extended.system#929 - Failed during the setup phase

18:35:37  Uncompressing file: OpenJDK8U-jdk_x64_mac_hotspot_2023-12-13-18-05.tar.gz ...
18:35:44  Cannot contact test-orka-macos14-x64-5p7nd: hudson.remoting.RequestAbortedException: java.io.IOException: Unexpected termination of the channel
04:35:12  Cancelling nested steps due to timeout

JDK8 extended.functional#572 - Failed a few minutes after the start

18:17:34  TESTING:
18:17:35  Directory "/home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/../TKG/output_17024914523857/CryptoTests_0/work" not found: creating
18:17:35  Directory "/home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/../TKG/output_17024914523857/CryptoTests_0/report" not found: creating
18:17:35  XML output  to /home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/output_17024914523857/CryptoTests_0/work
18:17:42  make[1]: *** [settings.mk:356: extended.functional-..] Terminated
18:17:42  make: *** [makefile:65: _extended.functional] Terminated
18:17:42  Terminated
18:17:42  make[2]: *** [/home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/../TKG/settings.mk:356: extended.functional-functional] Terminated
18:17:42  make[3]: *** [/home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/../TKG/settings.mk:356: extended.functional-security] Terminated
18:17:42  make[4]: *** [/home/jenkins/workspace/Test_openjdk8_hs_extended.functional_x86-64_linux/aqa-tests/TKG/../TKG/settings.mk:356: extended.functional-Crypto] Terminated
18:17:42  make[5]: *** [autoGen.mk:31: CryptoTests_0] Terminated
[Pipeline] sh

JDK17 extended.functional - 2-3 minutes after the start

00:04:49  TESTING:
00:04:50  Directory "/Users/admin/workspace/workspace/Test_openjdk17_hs_extended.functional_x86-64_mac/aqa-tests/TKG/../TKG/output_17024258838618/CryptoTests_0/work" not found: creating
00:04:50  Directory "/Users/admin/workspace/workspace/Test_openjdk17_hs_extended.functional_x86-64_mac/aqa-tests/TKG/../TKG/output_17024258838618/CryptoTests_0/report" not found: creating
00:04:51  XML output  to /Users/admin/workspace/workspace/Test_openjdk17_hs_extended.functional_x86-64_mac/aqa-tests/TKG/output_17024258838618/CryptoTests_0/work
00:06:46  Cannot contact test-orka-macos14-x64-jpxkh: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@3c1aacd7:test-orka-macos14-x64-jpxkh": Remote call on test-orka-macos14-x64-jpxkh failed. The channel is closing down or has closed down
10:01:31  Cancelling nested steps due to timeout
10:01:31  Could not connect to test-orka-macos14-x64-jpxkh to send interrupt signal to process

JDK17 sanity.openjdk - very early failure
JDK11 sanity.openjdk - within five minutes of job start

18:43:50  Directory "/Users/admin/workspace/workspace/Test_openjdk11_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/../TKG/output_17024066213950/jdk_lang_0/work" not found: creating
18:43:50  Directory "/Users/admin/workspace/workspace/Test_openjdk11_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/../TKG/output_17024066213950/jdk_lang_0/report" not found: creating
18:44:23  XML output with verification to /Users/admin/workspace/workspace/Test_openjdk11_hs_sanity.openjdk_x86-64_mac/aqa-tests/TKG/output_17024066213950/jdk_lang_0/work
18:52:45  Cannot contact test-orka-macos14-x64-zr7cl: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@51d72d7:test-orka-macos14-x64-zr7cl": Remote call on test-orka-macos14-x64-zr7cl failed. The channel is closing down or has closed down
04:39:57  Cancelling nested steps due to timeout
04:39:57  Could not connect to test-orka-macos14-x64-zr7cl to send interrupt signal to process
[Pipeline] sh

This is looking like it might be the Orka system decommissioning the machine because it thinks it's no longer used after being provisioned in a previous run but it's not immediately clear.
Looking at the last one there is an entry in the jenkins log from two minutes later about it being deleted (subject to time sync being correct)

[12/12/23 19:39:24] SSH Launch of test-orka-macos14-x64-zr7cl on xxx.yyy.zz.aa completed in 30,723 ms
jenkins.log.1:2023-12-13 04:41:44.163+0000 [id=107]	INFO	h.slaves.CloudRetentionStrategy#check: Disconnecting test-orka-macos14-x64-zr7cl
jenkins.log.1:2023-12-13 04:41:44.163+0000 [id=107]	INFO	i.j.p.orka.OrkaProvisionedAgent#_terminate: Terminating agent. VM id: test-orka-macos14-x64-zr7cl
jenkins.log.1:2023-12-13 04:41:44.201+0000 [id=107]	INFO	i.jenkins.plugins.orka.OrkaCloud#deleteVM: VM test-orka-macos14-x64-zr7cl is successfully deleted.
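
To put a number on the "two minutes later" gap, the two timestamps can be lined up as below. This assumes the job console stamp (04:39:57) is on the same date and UTC offset as the controller log entry, which is exactly the time sync caveat mentioned above; the helper names are made up for this sketch.

```python
from datetime import datetime


def parse_jenkins_log_time(stamp):
    """Parse a controller log timestamp such as '2023-12-13 04:41:44.163+0000'."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f%z")


def console_time_on_same_day(reference, clock):
    """Place a console 'HH:MM:SS' stamp on the reference's date and UTC offset (an assumption)."""
    parsed = datetime.strptime(clock, "%H:%M:%S").time()
    return datetime.combine(reference.date(), parsed, tzinfo=reference.tzinfo)


# "Disconnecting test-orka-macos14-x64-zr7cl" entry from jenkins.log.1
disconnect = parse_jenkins_log_time("2023-12-13 04:41:44.163+0000")
# "Cancelling nested steps due to timeout" line from the job console
timeout = console_time_on_same_day(disconnect, "04:39:57")
gap = disconnect - timeout

if __name__ == "__main__":
    print("gap between console timeout and disconnect:", gap)
```

Under that assumption the gap comes out a little under two minutes.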

@sxa
Member Author

sxa commented Jan 10, 2024

@gdams has raised the disconnect issues with MacStadium. Awaiting a response.

@smlambert
Contributor

smlambert commented Jan 12, 2024

Regularity of x64 mac test jobs being terminated / disconnected seems to have increased (4/9 of the dry run jobs fail to run). jdk17 dry run pipeline

Screenshot 2024-01-11 at 9 09 49 PM

aarch64 mac test jobs seem not to suffer from this problem (as frequently, if at all)

Screenshot 2024-01-11 at 9 13 22 PM

@smlambert
Contributor

jdk8 dry run pipeline

Screenshot 2024-01-11 at 10 35 57 PM

@smlambert
Contributor

4 test cases in jdk_net and jdk_nio related to multicasting do not pass on Orka machines, details here:
adoptium/aqa-tests#5156 (comment)

jdk_net
TEST: java/net/DatagramSocket/DatagramSocketExample.java
TEST: java/net/DatagramSocket/DatagramSocketMulticasting.java

jdk_nio
TEST: java/nio/channels/DatagramChannel/AdaptorMulticasting.java
TEST: java/nio/channels/DatagramChannel/BasicMulticastTests.java

@sxa
Member Author

sxa commented Apr 10, 2024

@gdams as discussed - here are some examples of the errors I'm seeing in the jenkins log as a result of Orka:

Unable to make field private static final long java.nio.channels.ClosedChannelException.serialVersionUID accessible

2024-04-09 22:00:28.171+0000 [id=2688702] WARNING jenkins.util.Listeners#lambda$notify$0
java.lang.reflect.InaccessibleObjectException: Unable to make field private static final long java.nio.channels.ClosedChannelException.serialVersionUID accessible: module java.base does not "opens java.nio.channels" to unnamed module @6be968ce
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:354)
at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
at java.base/java.lang.reflect.Field.checkCanSetAccessible(Field.java:178)
at java.base/java.lang.reflect.Field.setAccessible(Field.java:172)
at com.thoughtworks.xstream.converters.reflection.FieldDictionary.buildDictionaryEntryForClass(FieldDictionary.java:176)
at com.thoughtworks.xstream.converters.reflection.FieldDictionary.buildMap(FieldDictionary.java:142)
at com.thoughtworks.xstream.converters.reflection.FieldDictionary.fieldsFor(FieldDictionary.java:80)
at com.thoughtworks.xstream.converters.reflection.PureJavaReflectionProvider.visitSerializableFields(PureJavaReflectionProvider.java:167)
at hudson.util.RobustReflectionConverter.doMarshal(RobustReflectionConverter.java:206)
at hudson.util.RobustReflectionConverter.marshal(RobustReflectionConverter.java:163)
at com.thoughtworks.xstream.converters.extended.ThrowableConverter.marshal(ThrowableConverter.java:62)
at com.thoughtworks.xstream.core.AbstractReferenceMarshaller.convert(AbstractReferenceMarshaller.java:68)
at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:59)
at com.thoughtworks.xstream.core.AbstractReferenceMarshaller$1.convertAnother(AbstractReferenceMarshaller.java:83)
at hudson.util.RobustReflectionConverter.marshallField(RobustReflectionConverter.java:283)
at hudson.util.RobustReflectionConverter$2.writeField(RobustReflectionConverter.java:270)
Caused: java.lang.RuntimeException: Failed to serialize hudson.slaves.OfflineCause$ChannelTermination#cause for class hudson.slaves.OfflineCause$ChannelTermination
at hudson.util.RobustReflectionConverter$2.writeField(RobustReflectionConverter.java:274)
at hudson.util.RobustReflectionConverter$2.visit(RobustReflectionConverter.java:241)
at com.thoughtworks.xstream.converters.reflection.PureJavaReflectionProvider.visitSerializableFields(PureJavaReflectionProvider.java:174)
at hudson.util.RobustReflectionConverter.doMarshal(RobustReflectionConverter.java:226)
at hudson.util.RobustReflectionConverter.marshal(RobustReflectionConverter.java:163)
at com.thoughtworks.xstream.core.AbstractReferenceMarshaller.convert(AbstractReferenceMarshaller.java:68)
at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:59)
at com.thoughtworks.xstream.core.AbstractReferenceMarshaller$1.convertAnother(AbstractReferenceMarshaller.java:83)
at hudson.util.RobustReflectionConverter.marshallField(RobustReflectionConverter.java:283)
at hudson.util.RobustReflectionConverter$2.writeField(RobustReflectionConverter.java:270)
Caused: java.lang.RuntimeException: Failed to serialize hudson.model.Node#temporaryOfflineCause for class hudson.slaves.DumbSlave
at hudson.util.RobustReflectionConverter$2.writeField(RobustReflectionConverter.java:274)
at hudson.util.RobustReflectionConverter$2.visit(RobustReflectionConverter.java:241)
at com.thoughtworks.xstream.converters.reflection.PureJavaReflectionProvider.visitSerializableFields(PureJavaReflectionProvider.java:174)
at hudson.util.RobustReflectionConverter.doMarshal(RobustReflectionConverter.java:226)
at hudson.util.RobustReflectionConverter.marshal(RobustReflectionConverter.java:163)
at com.thoughtworks.xstream.core.AbstractReferenceMarshaller.convert(AbstractReferenceMarshaller.java:68)
at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:59)
at com.thoughtworks.xstream.core.TreeMarshaller.convertAnother(TreeMarshaller.java:44)
at com.thoughtworks.xstream.core.TreeMarshaller.start(TreeMarshaller.java:83)
at com.thoughtworks.xstream.core.AbstractTreeMarshallingStrategy.marshal(AbstractTreeMarshallingStrategy.java:37)
at com.thoughtworks.xstream.XStream.marshal(XStream.java:1303)
at com.thoughtworks.xstream.XStream.marshal(XStream.java:1292)
at com.thoughtworks.xstream.XStream.toXML(XStream.java:1265)
at com.thoughtworks.xstream.XStream.toXML(XStream.java:1252)
at hudson.plugins.jobConfigHistory.FileHistoryDao.hasDuplicateHistory(FileHistoryDao.java:1299)
at hudson.plugins.jobConfigHistory.ComputerHistoryListener.onChange(ComputerHistoryListener.java:117)
at hudson.plugins.jobConfigHistory.ComputerHistoryListener.onConfigurationChange(ComputerHistoryListener.java:69)
at jenkins.util.Listeners.lambda$notify$0(Listeners.java:59)
at jenkins.util.Listeners.notify(Listeners.java:70)
at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:278)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1705)
at jenkins.model.Nodes$5.run(Nodes.java:279)
at hudson.model.Queue._withLock(Queue.java:1401)
at hudson.model.Queue.withLock(Queue.java:1275)
at jenkins.model.Nodes.removeNode(Nodes.java:270)
at jenkins.model.Jenkins.removeNode(Jenkins.java:2266)
at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:91)
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:61)
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:45)
at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:970)
at hudson.model.Queue._withLock(Queue.java:1401)
at hudson.model.Queue.withLock(Queue.java:1275)
at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:967)
at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:147)
at hudson.model.AbstractCIBase$1.run(AbstractCIBase.java:255)
at hudson.model.Queue._withLock(Queue.java:1401)
at hudson.model.Queue.withLock(Queue.java:1275)
at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:238)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1705)
at jenkins.model.Nodes$5.run(Nodes.java:279)
at hudson.model.Queue._withLock(Queue.java:1401)
at hudson.model.Queue.withLock(Queue.java:1275)
at jenkins.model.Nodes.removeNode(Nodes.java:270)
at jenkins.model.Jenkins.removeNode(Jenkins.java:2266)
at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:91)
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:61)
at hudson.slaves.CloudRetentionStrategy.check(CloudRetentionStrategy.java:45)
at hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:970)
at hudson.model.Queue._withLock(Queue.java:1401)
at hudson.model.Queue.withLock(Queue.java:1275)
at hudson.slaves.SlaveComputer.setNode(SlaveComputer.java:967)
at hudson.model.AbstractCIBase.updateComputer(AbstractCIBase.java:147)
at hudson.model.AbstractCIBase$1.run(AbstractCIBase.java:255)
at hudson.model.Queue._withLock(Queue.java:1401)
at hudson.model.Queue.withLock(Queue.java:1275)
at hudson.model.AbstractCIBase.updateComputerList(AbstractCIBase.java:238)
at jenkins.model.Jenkins.updateComputerList(Jenkins.java:1705)
at jenkins.model.Nodes$5.run(Nodes.java:279)
at hudson.model.Queue._withLock(Queue.java:1401)
at hudson.model.Queue.withLock(Queue.java:1275)
at jenkins.model.Nodes.removeNode(Nodes.java:270)
at jenkins.model.Jenkins.removeNode(Jenkins.java:2266)
at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:91)
at io.jenkins.plugins.orka.WaitSSHLauncher.deleteAgent(WaitSSHLauncher.java:58)
at io.jenkins.plugins.orka.WaitSSHLauncher.launch(WaitSSHLauncher.java:45)
at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:297)
at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:840)
2024-04-09 22:00:28.174+0000 [id=2688702] INFO o.j.p.cloudstats.CloudStatistics#getIdFor: No support for cloud-stats-plugin by class io.jenkins.plugins.orka.OrkaProvisionedAgent
2024-04-09 22:00:28.207+0000 [id=2688702] WARNING jenkins.util.Listeners#lambda$notify$0

Deploying VM failed with: HTTP Code: 500, Error: Internal error occurred: Requested CPU is not available in the cluster

_Note: I'm not sure if the exception underneath it in the log is directly related to the Orka message_

2024-04-10 00:32:35.570+0000 [id=2701897] WARNING i.j.plugins.orka.AgentTemplate#provision: Deploying VM failed with: HTTP Code: 500, Error: Internal error occurred: Requested CPU is not available in the cluster
No available nodes with sufficient memory
No node in READY state is available to deploy to. Run orka3 nodes list --namespace orka-default to check nodes state
2024-04-10 00:32:35.596+0000 [id=2701646] WARNING i.j.plugins.orka.AgentTemplate#provision: Deploying VM failed with: HTTP Code: 500, Error: Internal error occurred: Requested CPU is not available in the cluster
No available nodes with sufficient memory
No node in READY state is available to deploy to. Run orka3 nodes list --namespace orka-default to check nodes state
2024-04-10 00:32:40.865+0000 [id=2701607] WARNING h.i.i.InstallUncaughtExceptionHandler#handleException
java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:170)
at org.eclipse.jetty.io.IdleTimeout.idleCheck(IdleTimeout.java:112)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
Caused: java.io.IOException

There's also this:
2024-04-09 22:00:28.129+0000 [id=2688702] INFO o.j.p.cloudstats.CloudStatistics#getIdFor: No support for cloud-stats-plugin by class io.jenkins.plugins.orka.OrkaProvisionedAgent

For the second one above, I guess it's possible that it's being generated as a result of us hitting capacity on the cluster but might be good to verify whether such a condition has happened today. Since we've been kicking off five release runs in parallel it's entirely possible this is a fairly unique condition :-)
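
To check whether the capacity errors line up with the parallel release runs, the provisioning failures could be pulled out of the jenkins log with something like the following. The regular expression is written against the log lines quoted above and is an assumption about the general log layout, not something taken from the Orka plugin; the function name is invented for this sketch.

```python
import re
from collections import Counter

# Matches the "Deploying VM failed" warnings as they appear in the excerpts above.
PROVISION_FAILURE = re.compile(
    r"^(?P<stamp>\S+ \S+)\s+\[id=(?P<thread>\d+)\]\s+WARNING\s+"
    r"\S+AgentTemplate#provision:\s+Deploying VM failed with: "
    r"HTTP Code: (?P<code>\d+), Error: (?P<error>.+)$"
)


def provisioning_failures(lines):
    """Return one dict (stamp, thread, code, error) per provisioning failure line."""
    failures = []
    for line in lines:
        match = PROVISION_FAILURE.match(line.strip())
        if match:
            failures.append(match.groupdict())
    return failures


# Sample lines copied from the log excerpts in this comment.
SAMPLE = [
    "2024-04-10 00:32:35.570+0000 [id=2701897] WARNING i.j.plugins.orka.AgentTemplate#provision: Deploying VM failed with: HTTP Code: 500, Error: Internal error occurred: Requested CPU is not available in the cluster",
    "No available nodes with sufficient memory",
    "2024-04-10 00:32:35.596+0000 [id=2701646] WARNING i.j.plugins.orka.AgentTemplate#provision: Deploying VM failed with: HTTP Code: 500, Error: Internal error occurred: Requested CPU is not available in the cluster",
    "2024-04-09 22:00:28.129+0000 [id=2688702] INFO o.j.p.cloudstats.CloudStatistics#getIdFor: No support for cloud-stats-plugin by class io.jenkins.plugins.orka.OrkaProvisionedAgent",
]

if __name__ == "__main__":
    found = provisioning_failures(SAMPLE)
    for error, count in Counter(f["error"] for f in found).items():
        print(count, error)
```

Grouping the matches by timestamp would then show whether the failures cluster around the times the release pipelines were started.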
