
🐞 Setting --oci-max-parallelism 2 causes deadlocks #6894

Open
gerhard opened this issue Mar 16, 2024 · 4 comments

@gerhard
Member

gerhard commented Mar 16, 2024

What is the issue?

Setting --oci-max-parallelism 2 on the Engine causes deadlocks, both with & without modules.

This is what that looks like from the CLI / user perspective:

dagger-0.10.2-deadlock.mp4

Attaching Dagger Engine v0.10.2 logs + SIGQUIT. Thank you @vito for https://vito.github.io/swirly/ and the tips!

FWIW:

dagger-0.10.2-sigquit-swirly.mp4

cc @sipsma @vito @jedevc @aluzzardi

Dagger version

Confirmed on 0.9.11 & 0.10.2

Steps to reproduce

No response

Log output

No response

@gerhard gerhard added the kind/bug Something isn't working label Mar 16, 2024
@gerhard gerhard changed the title 🐞 Setting --oci-max-parallelism 2 can cause deadlocks 🐞 Setting --oci-max-parallelism 2 causes deadlocks Mar 16, 2024
@Scalahansolo

The team over here at Motion has seen this happen on 0.10.3 as well.

@jedevc
Member

jedevc commented Mar 28, 2024

I think this is expected - suppose the following module structure:

  • Module A starts (running a container)
  • Module A calls a function in Module B
  • Module B starts (running a container)
  • Module B calls dag.Container() (running a container)
    • This hangs forever! There are already 2 containers running in parallel, so no more can be scheduled (see the sketch below this list).
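
To make that concrete, here is a minimal Go sketch of the same pattern - an assumption-laden illustration, not Dagger's actual scheduler code, with a weighted semaphore standing in for the parallelism limit. Each nested exec holds a slot while waiting on the exec below it, so with a cap of 2 the third container can never be scheduled:

```go
// Sketch only: models --oci-max-parallelism 2 as a weighted semaphore.
// Running this hangs forever, which is the point.
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

var slots = semaphore.NewWeighted(2) // the parallelism cap

// runContainer holds one slot for the entire duration of fn,
// including any nested runContainer calls fn makes.
func runContainer(ctx context.Context, name string, fn func()) {
	fmt.Println("waiting:", name)
	_ = slots.Acquire(ctx, 1) // blocks until a slot frees up
	defer slots.Release(1)
	fmt.Println("running:", name)
	fn()
}

func main() {
	ctx := context.Background()
	runContainer(ctx, "module A", func() { // slot 1 of 2
		runContainer(ctx, "module B", func() { // slot 2 of 2
			// Third acquire: both slots are held by our own callers,
			// which can't release them until we finish. Deadlock.
			runContainer(ctx, "dag.Container()", func() {})
		})
	})
}
```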

I think this effectively means that, with modules, max-parallelism is pretty much completely broken - it just fits very badly with the idea of nested containers.

Not sure what the right fix is here:

  • Don't use max-parallelism (generally my preference - enforce resource limits through other mechanisms instead, since a single job step could use anywhere from 0% to 100% of available resources, so steps aren't interchangeable units)
  • Don't have a nested step "consume a resource" (this prevents deadlocks, but isn't great, since there are now more containers running than the user asked for)

I don't think there's a clear way forward here, but to me this highlights that we need a better way of constraining resources - max-parallelism just isn't a good fit for the shape that pipelines tend to take these days. Maybe we should provide some cgroup config options instead (if the end desire is to constrain CPU usage/etc - see the sketch below).
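
For illustration, a hedged sketch of the cgroup idea (the cgroup path here is an assumption, and this is not an existing engine option): cgroup v2's cpu.max caps total CPU time for everything in the group, no matter how many containers run inside it, which avoids the slot-counting problem entirely.

```go
// Sketch: cap the engine's total CPU via cgroup v2 instead of
// capping the number of concurrently running containers.
package main

import "os"

func main() {
	// cpu.max takes "$MAX $PERIOD" in microseconds:
	// "50000 100000" allows at most half a CPU per 100ms period
	// for the whole group, however many execs are in flight.
	err := os.WriteFile(
		"/sys/fs/cgroup/dagger-engine/cpu.max", // hypothetical cgroup path
		[]byte("50000 100000"),
		0o644,
	)
	if err != nil {
		panic(err)
	}
}
```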

@jedevc
Copy link
Member

jedevc commented May 15, 2024

Adding this to the v0.12.0 milestone since by that release we should probably either:

  • Remove this flag
  • Ensure that nested execs don't count towards max-parallelism

As is, with modules, this flag can cause huge performance issues, so should be avoided.

@jedevc
Copy link
Member

jedevc commented May 16, 2024

x-linking to #6465 as well.

@samalba samalba removed the kind/bug Something isn't working label May 31, 2024
gerhard added a commit to gerhard/dagger that referenced this issue Jun 10, 2024
While we are keeping this default so that we don't break existing users,
setting this value is something that we want to move away from.

The problem is that this setting limits how many operations can run in
parallel, yet a single operation can still max out all available cores.
A value of `2` is also known to cause deadlocks; see dagger#6894.

For now, we just allow this to be disabled with either
`--set engine.args=''` or by explicitly setting this value to an empty list.

This started as dagger#7395, which turned out to be too big of a
change. We have since scaled back the initial ambition & are taking a
smaller step towards eventually phasing this out.

FWIW, all the Dagger Engines that we run inside the Dagger infra do not
use the `--oci-max-parallelism` option.

This also removes the option from the tekton-dagger-task docs example.

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
gerhard added a commit that referenced this issue Jun 10, 2024