
JDK22: zLinux: Extended.OpenJDK regularly times out and aborts #852

Closed
adamfarley opened this issue Dec 5, 2023 · 13 comments

@adamfarley
Contributor

Summary
Tests run very slowly on this platform, and they abort on most machines due to the timeout.

Details
Apparently this is par for the course for this test set. Past timeouts occurred on:

And the sole pass happened on:

Next steps

I'm not seeing signs of a hang, just slow performance. Perhaps we could extend the timeout in this case?

@adamfarley
Contributor Author

adamfarley commented Dec 5, 2023

Some data for comparison:

| Test name      | JDK22 zLinux | JDK21 zLinux | JDK22 xLinux | JDK22 xWindows |
| -------------- | ------------ | ------------ | ------------ | -------------- |
| jvm_compiler_1 | 126 mins     | 146 mins     | 153 mins     | 104 mins       |
| jvm_compiler_0 | 150 mins     | 210 mins     | 150 mins     | 105 mins       |

Ok, so maybe it's not that this platform+version combination is particularly slow. Hmm.

@Haroon-Khel - What do you think? Should we just increase the timeout to 15 hours and move on?

@Haroon-Khel
Contributor

I'd recommend 20 hrs.

@Haroon-Khel
Contributor

IIRC there was another issue in the infra repo documenting extended.openjdk timeouts. I can't find it at the moment, but this information should be added to that issue before closing this one.

@sxa
Member

sxa commented Dec 28, 2023

We've got adoptium/infrastructure#2662 regarding general machine-specific issues.

I'm not sure what that table comparing different platforms is showing us, but it doesn't seem to demonstrate the problem - was that data not from the slow runs we initially documented? Was there a reason for choosing jvm_compiler_? as the comparison?

@smlambert Was there a reason to transfer this from the tests to pipelines repository? I didn't think anything in here could have an influence over this, or does this control the timeout values somewhere for the extended.openjdk jobs? The answer to that will potentially answer my Slack question on how configurable the default TIME_LIMIT options for the jobs are.

@smlambert
Contributor

Transferred here because it is the right place to set the TIME_LIMIT parameter so that it doesn't become 'unset' when/if test jobs are regenerated.

@sxa
Member

sxa commented Jan 3, 2024

My last comment on here seems to have got lost. I thought the jobs were created via Test_Job_Auto_Gen, which runs aqatests -> testJobTemplate. Where could we make a change in ci-jenkins-pipelines that would affect the TIME_LIMIT value in the generated test jobs?

@smlambert
Contributor

Since we automated the generation of new test jobs from this pipeline code base when they do not exist (here), it is good to set TIME_LIMIT from this code base, so that JDK23+ test jobs that are generated also get the new value for TIME_LIMIT. While there are several places where it could be done, a good spot to set this would be getCommonTestJobParams.

Adding a few lines of code after L187:

    if (arch == 's390x') {
        jobParams.put('TIME_LIMIT', '20')
    }

Alternatively, a larger endeavour would be to add a mechanism to hold testArgs in the configuration files (much like there are buildArgs), but I would rather not go that route for one use case of one parameter. If we later see the need to modify many other test parameters, we could look at such an approach.
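
For illustration, here is a minimal sketch of where that check could sit. The method shape below is simplified and assumed for this example; only the s390x condition and the '20' value come from the suggestion above.

    // Simplified, assumed shape of getCommonTestJobParams - the real method in
    // ci-jenkins-pipelines builds many more parameters than shown here.
    Map getCommonTestJobParams(String arch, String jdkVersion) {
        def jobParams = [:]
        jobParams.put('ARCHITECTURE', arch)      // assumed parameter names, for illustration only
        jobParams.put('JDK_VERSION', jdkVersion)

        // Suggested addition: give the slower s390x machines a longer limit (in hours)
        // so extended.openjdk does not abort mid-run.
        if (arch == 's390x') {
            jobParams.put('TIME_LIMIT', '20')
        }

        return jobParams
    }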

@sxa
Member

sxa commented Jan 4, 2024

Noting that this should also be applied for riscv64, where sanity.openjdk can take around 17 hours on some of the systems.
Also, extended.openjdk without Parallel=Dynamic can take up to three days depending on the machine, so I've been setting TIME_LIMIT to 100 hours when I've been running them (although obviously there's good scope for running with Parallel=Dynamic in that case!).
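
For anyone triggering those long riscv64 runs by hand, a hedged sketch of overriding the values from a Jenkins pipeline (the downstream job name here is hypothetical; TIME_LIMIT and PARALLEL mirror the settings mentioned above):

    // Hypothetical job name - substitute the real extended.openjdk riscv64 job.
    build job: 'Test_openjdk22_hs_extended.openjdk_riscv64_linux',
        parameters: [
            string(name: 'TIME_LIMIT', value: '100'),   // hours, per the comment above
            string(name: 'PARALLEL', value: 'Dynamic')  // split targets to cut wall-clock time
        ]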

@adamfarley
Contributor Author

adamfarley commented Feb 5, 2024

Ok, I've put this together to resolve this issue as advised, and to cover Stewart's riscv case.

Here's a test run for zlinux.
And another for riscv.

Note: Both are queued, and will not exist until the previous job finishes.

Update:

  • The riscv run appears to have failed due to a networking issue unrelated to this change. Rerunning here.
  • The s390 run appears to have vanished for some reason. It definitely existed, as the build numbering on the replacement job has gone straight from 13 to 15. Link.

@sxa
Member

sxa commented Feb 5, 2024

@adamfarley It looks like you've done the PR for s390x, but you've said it'll cover the riscv use case too. Have I missed something? (Quite possible - I'm reading this on my phone!)

@adamfarley
Contributor Author

@sxa - It should cover the riscv case as well; the change is an if-then-elseif (see the sketch below).
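
A rough sketch of that conditional, assuming the same getCommonTestJobParams spot discussed above (the riscv64 limit below is a placeholder; the merged change may use a different figure):

    // Placeholder limits (in hours): 20 for s390x per the thread; the riscv64
    // value is illustrative only.
    if (arch == 's390x') {
        jobParams.put('TIME_LIMIT', '20')
    } else if (arch == 'riscv64') {
        jobParams.put('TIME_LIMIT', '20')
    }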

@sxa
Member

sxa commented Feb 6, 2024

Yeah LGTM - that didn't seem to show on my phone yesterday :-)

@adamfarley
Contributor Author

Ok, the change is in and proven to work, plus a timeout extension PR here.
