[CI] MixedClusterEsqlSpecIT classMethod failing #107879

Closed
lkts opened this issue Apr 24, 2024 · 9 comments · Fixed by #107890
Labels
:Core/Infra/Core Core issues without another label needs:risk Requires assignment of a risk label (low, medium, blocker) Team:Core/Infra Meta label for core/infra team >test-failure Triaged test failures from CI

Comments

@lkts
Contributor

lkts commented Apr 24, 2024

Build scan:
https://gradle-enterprise.elastic.co/s/7lya5sv63r7qu/tests/:x-pack:plugin:esql:qa:server:mixed-cluster:v8.14.0%23javaRestTest/org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT

Reproduction line:

null

Applicable branches:
main, 8.14, 8.13

Reproduces locally?:
Didn't try

Failure history:
Failure dashboard for org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT#classMethod

Failure excerpt:

java.lang.RuntimeException: An error occurred orchestrating test cluster.

  at __randomizedtesting.SeedInfo.seed([9C9B4EDBE2330773]:0)
  at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.execute(DefaultLocalClusterHandle.java:264)
  at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.writeUnicastHostsFile(DefaultLocalClusterHandle.java:245)
  at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.waitUntilReady(DefaultLocalClusterHandle.java:188)
  at org.elasticsearch.test.cluster.local.DefaultLocalClusterHandle.start(DefaultLocalClusterHandle.java:74)
  at org.elasticsearch.test.cluster.local.DefaultLocalElasticsearchCluster$1.evaluate(DefaultLocalElasticsearchCluster.java:45)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
  at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
  at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
  at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
  at org.junit.rules.RunRules.evaluate(RunRules.java:20)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:390)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:850)
  at java.lang.Thread.run(Thread.java:1583)

  Caused by: java.lang.RuntimeException: Timed out after PT2M waiting for ports files for: { cluster: 'test-cluster', node: 'test-cluster-1' }

    at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.waitUntilReady(AbstractLocalClusterFactory.java:285)
    at org.elasticsearch.test.cluster.local.AbstractLocalClusterFactory$Node.getTransportEndpoint(AbstractLocalClusterFactory.java:204)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197)
    at java.util.AbstractList$RandomAccessSpliterator.forEachRemaining(AbstractList.java:722)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:960)
    at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:934)
    at java.util.stream.AbstractTask.compute(AbstractTask.java:327)
    at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:754)
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
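
For readers unfamiliar with the failure mode: the "Caused by" message points to a bounded wait for a node's ports file, and `PT2M` is simply how `java.time.Duration` prints two minutes. The sketch below shows that general pattern only. It is not the code in `AbstractLocalClusterFactory`; the class `PortsFileWaiter`, the method `waitForPortsFile`, the poll interval, and the `transport.ports` file name are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration; NOT the Elasticsearch test framework's implementation.
public class PortsFileWaiter {

    /**
     * Polls until the given file exists and is non-empty, or the timeout elapses.
     * Throws a RuntimeException whose message embeds the Duration in ISO-8601 form (e.g. PT2M).
     */
    public static Path waitForPortsFile(Path portsFile, Duration timeout, Duration pollInterval)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {
            if (isReady(portsFile)) {
                return portsFile;
            }
            Thread.sleep(pollInterval.toMillis());
        }
        // Final check so a file written during the last sleep is not missed.
        if (isReady(portsFile)) {
            return portsFile;
        }
        throw new RuntimeException("Timed out after " + timeout + " waiting for ports file: " + portsFile);
    }

    private static boolean isReady(Path file) {
        try {
            return Files.isRegularFile(file) && Files.size(file) > 0;
        } catch (IOException e) {
            return false; // file vanished or is unreadable; treat as not ready yet
        }
    }

    public static void main(String[] args) throws Exception {
        Path dir = Files.createTempDirectory("node-logs");
        Path ports = dir.resolve("transport.ports"); // illustrative file name

        try {
            // The node never "starts" here, so this short wait is expected to time out.
            waitForPortsFile(ports, Duration.ofMillis(200), Duration.ofMillis(20));
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }

        // Simulate the node finally writing its bound address, then wait again.
        Files.writeString(ports, "127.0.0.1:9300\n");
        System.out.println("ready: " + waitForPortsFile(ports, Duration.ofMinutes(2), Duration.ofMillis(20)));
    }
}
```

With a wait like this, a node that starts slowly (for example because many clusters are starting on the same machine at once) fails the whole suite in class setup, which would be consistent with the failure being reported against `classMethod` rather than an individual test.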

@lkts lkts added :Analytics/ES|QL AKA ESQL >test-failure Triaged test failures from CI labels Apr 24, 2024
@elasticsearchmachine elasticsearchmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) needs:risk Requires assignment of a risk label (low, medium, blocker) labels Apr 24, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@dakrone
Member

dakrone commented Apr 24, 2024

This failed for me (twice) also. Buildscan at https://gradle-enterprise.elastic.co/s/yt6gb6fd5gwm6

@alex-spies alex-spies self-assigned this Apr 25, 2024
@astefan
Contributor

astefan commented Apr 25, 2024

This might not be related to ESQL: #99727, #94126 and #104166 are relevant.
Also, the buildscan @dakrone linked shows failures in FullClusterRestartIT.

@alex-spies
Contributor

++, I don't think this is specific to ESQL. Maybe some earlier tests do not correctly clean up stopped/restarted clusters?

Both build scans fail after the full cluster restart test; this one also fails after the rolling upgrade test.

:qa:rolling-upgrade:v8.14.0#bwcTest
:qa:ccs-rolling-upgrade-remote-cluster:v8.14.0#oldClusterTest FAILED
:qa:full-cluster-restart:v8.14.0#bwcTest
:x-pack:plugin:downsample:qa:mixed-cluster:v8.14.0#yamlRestTest FAILED
:x-pack:plugin:ent-search:qa:full-cluster-restart:v8.14.0#bwcTest FAILED
:x-pack:plugin:shutdown:qa:full-cluster-restart:v8.14.0#bwcTest
:x-pack:plugin:inference:qa:rolling-upgrade:v8.14.0#bwcTest FAILED
:x-pack:plugin:esql:qa:server:mixed-cluster:v8.14.0#javaRestTest FAILED

@alex-spies alex-spies added :Core/Infra/Core Core issues without another label and removed :Analytics/ES|QL AKA ESQL labels Apr 25, 2024
@alex-spies alex-spies removed their assignment Apr 25, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team and removed Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Apr 25, 2024
@alex-spies
Contributor

@elastic/es-core-infra, I hope it's okay to throw this one your way, since it looks similar to past issues, as @astefan has pointed out.

@mark-vieira
Contributor

We're seeing a high number of failures in multi-node tests, with the nodes failing to start within 2 minutes. This suggests we may have introduced a change that is slowing down node startup.

@mark-vieira
Contributor

I'm wondering if adding #107619 has triggered this. I think we just cannot have rolling upgrade tests run in parallel like this. I'll open a PR to serialize those tests and see if that stabilizes things a bit.
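
To illustrate what serializing buys here, the following is a minimal sketch of capping how many suites may run at once, assuming each suite starts its own multi-node cluster and competes for the same CPU and disk. It is not how the Elasticsearch build limits parallelism (that is configured in the build, not in test code); `SerializedSuites`, `runSuites`, and the sleep standing in for cluster startup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; NOT the mechanism used by the Elasticsearch build.
public class SerializedSuites {

    /**
     * Runs `suites` simulated test suites on `threads` worker threads while allowing
     * at most `maxConcurrent` of them to be active at once.
     * Returns the highest number of suites observed running simultaneously.
     */
    public static int runSuites(int suites, int threads, int maxConcurrent) throws Exception {
        Semaphore permits = new Semaphore(maxConcurrent);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < suites; i++) {
                futures.add(pool.submit(() -> {
                    permits.acquire(); // block until a slot is free
                    try {
                        int now = running.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(20); // stand-in for starting a cluster and running tests
                    } finally {
                        running.decrementAndGet();
                        permits.release();
                    }
                    return null; // makes the lambda a Callable so checked exceptions are allowed
                }));
            }
            for (Future<?> f : futures) {
                f.get(); // propagate any failure from a suite
            }
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("peak with limit 1: " + runSuites(6, 4, 1));
        System.out.println("peak with limit 4: " + runSuites(6, 4, 4));
    }
}
```

With a limit of 1 the observed peak is always 1, so only one simulated cluster is starting at any moment. The trade-off is a longer total run time in exchange for each suite getting the machine to itself during startup.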

@alex-spies
Contributor

Hey, just wanted to say thanks to @mark-vieira and @thecoop for looking into this!

elasticsearchmachine pushed a commit that referenced this issue Apr 25, 2024
This fixes #107879

Reduce parallelism for java rest tests for esql
thecoop added a commit to thecoop/elasticsearch that referenced this issue Apr 25, 2024
This fixes elastic#107879

Reduce parallelism for java rest tests for esql

6 participants