
Feature Request - Add Cores used to tibanna stat output #304

Open
nhartwic opened this issue Oct 17, 2020 · 7 comments

Comments

@nhartwic

Tibanna is great for scaling up analysis on AWS, but it doesn't play nicely with CPU quotas (#267). While error handling for this special case would be appreciated, it would also be nice to be able to track how much of the quota is being used by the running jobs. I'm imagining that in addition to reporting instance type when calling "tibanna stat", we could also see how many CPUs are associated with that instance type. This doesn't eliminate the need for properly handling quota limits in Tibanna, but it should make it easier for users to track how their jobs interact with the quota: assuming they know their CPU quota (which they should), they can tell when they are getting close to it.

Unless there is a reason that this feature is a bad idea, I think I'll look into making a pull request for it.

I'm not entirely sure what the best way to implement this is. The easiest is probably to add a dictionary somewhere that maps instance type to CPU count and update it manually as AWS updates its EC2 offerings. That dictionary could then be used during 'stat' calls to add a column to the table. The drawback is that it requires manual code updates to stay current as AWS changes its EC2 offerings, and it feels generally clunky. Thoughts?
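To make the dictionary idea concrete, here is a minimal sketch. The instance names, vCPU counts, and function names are all illustrative (not Tibanna's actual API), and the static table would need manual updates as AWS adds new EC2 offerings, which is exactly the drawback noted above.

```python
# Hypothetical static mapping from EC2 instance type to vCPU count.
# Values here are illustrative; a real table must track AWS's offerings.
INSTANCE_VCPUS = {
    "t3.small": 2,
    "t3.medium": 2,
    "m5.2xlarge": 8,
    "c5.4xlarge": 16,
}

def vcpus_for(instance_type):
    """Return the vCPU count for an instance type, or 0 if unknown."""
    return INSTANCE_VCPUS.get(instance_type, 0)

# During 'stat', each job row could gain a CPU column plus a running total:
jobs = [("job-1", "c5.4xlarge"), ("job-2", "t3.medium")]
rows = [(jid, itype, vcpus_for(itype)) for jid, itype in jobs]
total_vcpus = sum(r[2] for r in rows)  # cores currently requested: 18
```

Unknown instance types fall back to 0 rather than raising, so a stale table degrades the display instead of breaking 'stat'.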

@SooLee

SooLee commented Aug 6, 2021

@nhartwic have you tried plot_metrics? There is also a summary CPU utilization (%) in the post-run JSON.

@nhartwic

nhartwic commented Aug 6, 2021

I'm referring more to the potential issues with requesting more compute than your account has available under its quota. plot_metrics is great for determining how many cores are actually being used by a node, but bad at telling me how many cores I currently have requested across all of my jobs. If I have twenty jobs running and each is using a different number of cores, it's hard to know how many cores are left available for the 21st job I'm about to submit.

It's literally just a CPU quota issue with AWS. At the time I wrote this issue, it seemed like the easiest solution would be to include that info in the tibanna stat output so that a user can just do...

tibanna stat -t RUNNING

...to see how many cores they are currently using. Assuming they know their quota, they can then decide whether to launch another job and with how many cores.
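The headroom calculation being described can be sketched as follows. This is a hypothetical helper, not Tibanna code: it assumes an augmented `tibanna stat -t RUNNING` could yield (job_id, vcpus) pairs, and that the user supplies their own quota.

```python
# Hypothetical quota check: given (job_id, vcpus) pairs for RUNNING
# jobs, report how many vCPUs remain under a user-known quota.
def quota_headroom(running_jobs, vcpu_quota):
    used = sum(vcpus for _, vcpus in running_jobs)
    return vcpu_quota - used

running = [("jobA", 16), ("jobB", 8), ("jobC", 4)]
print(quota_headroom(running, 64))  # prints 36
```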

@SooLee

SooLee commented Aug 6, 2021

I see. There is an optional field in the input JSON, behavior_on_capacity_limit, that can be set to wait_and_retry when certain limits are hit, e.g. InsufficientInstanceCapacity. It's not the same thing, but could it be an alternative?
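For reference, a minimal input JSON fragment using this field might look like the following; the field name and value are as named above, but the surrounding keys (the config section and instance type) are illustrative and should be checked against the Tibanna docs.

```json
{
  "config": {
    "instance_type": "t3.medium",
    "behavior_on_capacity_limit": "wait_and_retry"
  }
}
```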

@nhartwic

nhartwic commented Aug 6, 2021

At the time I made the issue, I was either not aware of that option or it hadn't been implemented yet. In general that works, but I don't hate the idea of adding a column to the stat output either. It's up to you. Honestly, AWS quotas have been high enough for me that I haven't ever really come close to hitting limits. It's more of a 'would be nice to know' than anything.

@SooLee

SooLee commented Aug 6, 2021

Tibanna keeps an internal mapping file (not totally up to date) of instance types to CPU, memory, etc., which is used to pick an instance type when the user specifies 'cpu' and 'memory' instead of 'instance_type'. We could maybe add the reverse of that functionality, recording the 'cpu' and 'memory' of the auto-determined or user-specified instance type somewhere so that stat can pick it up and display it.

@nhartwic

nhartwic commented Aug 6, 2021

That sounds good. I'd consider it a nice feature anyway. I wouldn't consider this a high priority feature though. Obviously it wasn't that important or I'd have taken the time to implement it myself.

@laurentiush

Hi, I am running into the same issue here. I configured "behavior_on_capacity_limit": "wait_and_retry" but it does not seem to wait and retry. It just tries several instance types and then fails.

```
"errorMessage": "Unexpected result from create_fleet command: {\"FleetId\": \"fleet-xxxxxxxxxxxx\", \"Errors\": [{\"LaunchTemplateAndOverrides\": {\"LaunchTemplateSpecification\": {\"LaunchTemplateId\": \"lt-xxxxxxxxxxx\", \"Version\": \"1\"}, \"Overrides\": {\"InstanceType\": \"t3.small\", \"SubnetId\": \"subnet-xxxxxxxxxxxxxx\", \"ImageId\": \"ami-xxxxxxxxxxx\"}}, \"Lifecycle\": \"spot\", \"ErrorCode\": \"MaxSpotInstanceCountExceeded\", \"ErrorMessage\": \"Max spot instance count exceeded\"}, {\"LaunchTemplateAndOverrides\": {\"LaunchTemplateSpecification\": {\"LaunchTemplateId\": \"lt-xxxxxxxxxx\", \"Version\": \"1\"}, \"Overrides\": {\"InstanceType\": \"t3.medium\", \"SubnetId\": \"subnet-xxxxxxxxxxxxxx\", \"ImageId\": \"ami-xxxxxxxxxxx\"}}, \"Lifecycle\": \"spot\", \"ErrorCode\": \"MaxSpotInstanceCountExceeded\", \"ErrorMessage\": \"Max spot instance count exceeded\"},
```

and so on for a few more instance types.

Do you have any idea what might cause this?
