[BUG] GPU training not working with non-default tabular presets #4196

Closed · 1 of 3 tasks
Newtoniano opened this issue May 14, 2024 · 2 comments · Fixed by #4210
Labels
bug Something isn't working · module: tabular · Needs Triage Issue requires Triage · priority: 1 High priority

@Newtoniano

Bug Report Checklist

  • I provided code that demonstrates a minimal reproducible example.
  • I confirmed bug exists on the latest mainline of AutoGluon via source install.
  • I confirmed bug exists on the latest stable version of AutoGluon.

Describe the bug
Fitting the dataset with num_gpus=1 only works when presets='medium_quality'.
This might have been introduced by the recent Ray version bump in #3774.

In the ray dashboard I see these events, and the tasks are stuck:
Warning: The following resource request cannot be scheduled right now: {'CPU': 12.0, 'GPU': 0.5}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.

I only have 1 GPU. Setting num_gpus to some value < 1 seems to at least let the Ray tasks run, although this doesn't really help, since num_gpus >= 1 is expected downstream.

From what I can tell, this issue is also present on Google Colab, so it should be easy to reproduce.

Expected behavior
GPU training should also work for the more complex tabular model presets.

To Reproduce
Run the tabular quickstart example notebook with predictor = TabularPredictor(label=label).fit(train_data, num_gpus=1, presets='good_quality') or a better preset. A self-contained sketch is below.
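
A minimal reproduction sketch, assuming the AdultIncome dataset and label column used in the AutoGluon tabular quickstart tutorial (the dataset URL may change; any tabular dataset should do):

```python
from autogluon.tabular import TabularDataset, TabularPredictor

# Dataset and label column as used in the tabular quickstart tutorial.
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
label = 'class'

# With presets='medium_quality' this completes; with 'good_quality' or higher
# the Ray tasks hang with the resource warning quoted above.
predictor = TabularPredictor(label=label).fit(
    train_data,
    num_gpus=1,
    presets='good_quality',
)
```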

Installed Versions
autogluon==1.1.0

@Newtoniano Newtoniano added bug: unconfirmed Something might not be working Needs Triage Issue requires Triage labels May 14, 2024
@Innixma Innixma added this to the 1.2 Release milestone May 15, 2024
@Innixma Innixma added module: tabular bug Something isn't working priority: 1 High priority and removed bug: unconfirmed Something might not be working labels May 15, 2024
@Innixma Innixma modified the milestones: 1.2 Release, 1.1.1 Release May 15, 2024
@Innixma Innixma self-assigned this May 15, 2024
@dadlobster

Same issue for me. I ran this on WSL2 and got one output model and that's it: no CPU/GPU usage after that, but it did eat up a chunk of RAM.

@Innixma
Contributor

Innixma commented May 21, 2024

I've reproduced the issue. The problem is a resource deadlock caused by an inconsistency in how Ray handles GPU resources compared to CPU resources for nested tasks. I am working on a fix in #4210. A sketch of the pattern is below.
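
A minimal sketch of the nested-task pattern involved (not AutoGluon code; the function names and resource numbers are illustrative only), assuming a single node with 12 CPUs and 1 GPU as in the warning above:

```python
import ray

# Assume a single-node cluster with 12 CPUs and 1 GPU.
ray.init(num_cpus=12, num_gpus=1)

@ray.remote(num_cpus=12, num_gpus=0.5)
def nested_fit():
    # Stand-in for an inner training task that requests part of the GPU.
    return "done"

@ray.remote(num_gpus=1)
def outer_fit():
    # When a task blocks in ray.get, Ray temporarily releases its CPU
    # reservation so nested tasks can be scheduled, but it keeps its GPU
    # reservation. The whole GPU therefore stays claimed by outer_fit,
    # the nested request {'CPU': 12.0, 'GPU': 0.5} can never be satisfied,
    # and both tasks wait forever: a resource deadlock.
    return ray.get(nested_fit.remote())

print(ray.get(outer_fit.remote()))  # hangs on a single-GPU machine
```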
