
Feature Request - Add Cores used to tibanna stat output #304

Open
nhartwic opened this issue Oct 17, 2020 · 7 comments

Comments

@nhartwic

Tibanna is great for scaling up analysis on AWS, but it doesn't play nicely with CPU quotas (#267). While error handling for this special case would be appreciated, it would also be nice to be able to track how much of the quota is being used by the running jobs. I'm imagining that in addition to reporting instance type when calling "tibanna stat", we could also see how many CPUs are associated with that instance type. This doesn't eliminate the need for properly handling quota limits in Tibanna, but it should make it easier for users to track how their jobs interact with the quota: assuming they know their CPU quota (which they should), they can tell when they are getting close to it.

Unless there is a reason that this feature is a bad idea, I think I'll look into making a pull request for it.

I'm not entirely sure what the best way to implement this is. The easiest is probably to add a dictionary somewhere that maps instance type to CPU count and update it manually as AWS updates its EC2 offerings. That dictionary could then be used during 'stat' calls to add a column to the table. The drawback is that it requires manual code updates to stay current as AWS changes its EC2 offerings, and it feels generally clunky. Thoughts?
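To make the dictionary idea concrete, here is a minimal sketch. The instance names, vCPU counts, and function names are all illustrative (not Tibanna's actual API), and the static table would need manual updates as AWS adds new EC2 offerings, which is exactly the drawback noted above.

```python
# Hypothetical static mapping from EC2 instance type to vCPU count.
# Values here are illustrative; a real table must track AWS's offerings.
INSTANCE_VCPUS = {
    "t3.small": 2,
    "t3.medium": 2,
    "m5.2xlarge": 8,
    "c5.4xlarge": 16,
}

def vcpus_for(instance_type):
    """Return the vCPU count for an instance type, or 0 if unknown."""
    return INSTANCE_VCPUS.get(instance_type, 0)

# During 'stat', each job row could gain a CPU column plus a running total:
jobs = [("job-1", "c5.4xlarge"), ("job-2", "t3.medium")]
rows = [(jid, itype, vcpus_for(itype)) for jid, itype in jobs]
total_vcpus = sum(r[2] for r in rows)  # cores currently requested: 18
```

Unknown instance types fall back to 0 rather than raising, so a stale table degrades the display instead of breaking 'stat'.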

@SooLee

SooLee commented Aug 6, 2021

@nhartwic have you tried plot_metrics? There is also a summary CPU utilization (%) in the post-run JSON.

@nhartwic

nhartwic commented Aug 6, 2021

I'm referring more to the potential issues with requesting more compute than your account has available under its quota. plot_metrics is great for determining how many cores are actually being used by a node, but bad at telling me how many cores I currently have requested across all of my jobs. If I have twenty jobs running and each is using a different number of cores, it's hard to know how many cores are left available for the 21st job I'm about to submit.

It's literally just a CPU quota issue with AWS. At the time I wrote this issue, it seemed like the easiest solution would be to include that info in the tibanna stat output so that a user can just do...

tibanna stat -t RUNNING

...to see how many cores they are currently using. Assuming they know their quota, they can then decide whether to launch another job and with how many cores.
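The headroom calculation being described can be sketched as follows. This is a hypothetical helper, not Tibanna code: it assumes an augmented `tibanna stat -t RUNNING` could yield (job_id, vcpus) pairs, and that the user supplies their own quota.

```python
# Hypothetical quota check: given (job_id, vcpus) pairs for RUNNING
# jobs, report how many vCPUs remain under a user-known quota.
def quota_headroom(running_jobs, vcpu_quota):
    used = sum(vcpus for _, vcpus in running_jobs)
    return vcpu_quota - used

running = [("jobA", 16), ("jobB", 8), ("jobC", 4)]
print(quota_headroom(running, 64))  # prints 36
```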

@SooLee

SooLee commented Aug 6, 2021

I see. There is an optional field in the input JSON, behavior_on_capacity_limit, that can be set to wait_and_retry when certain limits are hit, e.g. InsufficientInstanceCapacity. It's not the same thing, but could it be an alternative?
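For reference, a minimal input JSON fragment using this field might look like the following; the field name and value are as named above, but the surrounding keys (the config section and instance type) are illustrative and should be checked against the Tibanna docs.

```json
{
  "config": {
    "instance_type": "t3.medium",
    "behavior_on_capacity_limit": "wait_and_retry"
  }
}
```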

@nhartwic

nhartwic commented Aug 6, 2021

At the time I made the issue, I was either not aware of that option or it hadn't been implemented yet. In general that works, but I don't hate the idea of adding a column to the stat output either. It's up to you. Honestly, AWS quotas have been high enough for me that I haven't ever really come close to hitting limits. It's more of a 'would be nice to know' than anything.

@SooLee

SooLee commented Aug 6, 2021

Tibanna keeps an internal mapping file (not totally up to date) of instance types to CPU, memory, etc., which is used to pick an instance type when the user specifies 'cpu' and 'memory' instead of 'instance_type'. We could maybe add the reverse of that functionality, recording the 'cpu' and 'memory' of the auto-determined or user-specified instance type somewhere so that stat can pick it up and display it.

@nhartwic

nhartwic commented Aug 6, 2021

That sounds good. I'd consider it a nice feature anyway. I wouldn't consider this a high priority feature though. Obviously it wasn't that important or I'd have taken the time to implement it myself.

@laurentiush

Hi, I am running into the same issue here. I configured "behavior_on_capacity_limit": "wait_and_retry" but it does not seem to wait and retry. It just tries several instance types and then fails.

```
"errorMessage": "Unexpected result from create_fleet command: {\"FleetId\": \"fleet-xxxxxxxxxxxx\", \"Errors\": [{\"LaunchTemplateAndOverrides\": {\"LaunchTemplateSpecification\": {\"LaunchTemplateId\": \"lt-xxxxxxxxxxx\", \"Version\": \"1\"}, \"Overrides\": {\"InstanceType\": \"t3.small\", \"SubnetId\": \"subnet-xxxxxxxxxxxxxx\", \"ImageId\": \"ami-xxxxxxxxxxx\"}}, \"Lifecycle\": \"spot\", \"ErrorCode\": \"MaxSpotInstanceCountExceeded\", \"ErrorMessage\": \"Max spot instance count exceeded\"}, {\"LaunchTemplateAndOverrides\": {\"LaunchTemplateSpecification\": {\"LaunchTemplateId\": \"lt-xxxxxxxxxx\", \"Version\": \"1\"}, \"Overrides\": {\"InstanceType\": \"t3.medium\", \"SubnetId\": \"subnet-xxxxxxxxxxxxxx\", \"ImageId\": \"ami-xxxxxxxxxxx\"}}, \"Lifecycle\": \"spot\", \"ErrorCode\": \"MaxSpotInstanceCountExceeded\", \"ErrorMessage\": \"Max spot instance count exceeded\"},
```

and so on for a few more instance types.

Do you have any idea what might cause this?
