Reduce GCP Fixed Costs by 50% #2453

Merged · 7 commits into develop · May 14, 2024

Conversation

@Adam-D-Lewis (Member) commented May 7, 2024

Reference Issues or PRs

Fixes #2452

What does this implement/fix?

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe): Make cost-optimized E2 instances the default for GCP node groups.

Testing

  • Did you test the pull request locally?
  • Did you add new tests?

Any other comments?

@dcmcand (Contributor) left a comment

❓ Are we sure that the e2 is suitable for the user and worker node groups? It seems like, for the worker group especially, they might handicap performance. Additionally, they do not support GPUs. I think it might be better to run the General node group on the e2-highmem-4 and the user and worker node groups on the n4-standard-4. Especially with them scaling down to zero now, I think that would be an acceptable tradeoff for performance vs. price.
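
For concreteness, a minimal sketch of what that proposed split could look like in nebari-config.yaml (the node group layout follows Nebari's GCP provider schema; the min/max node counts are illustrative assumptions, not values from this PR):

```yaml
# Hypothetical nebari-config.yaml excerpt for the split proposed above:
# a memory-optimized E2 machine for the always-on general pool, and N4
# machines for the user and worker pools that scale to zero when idle.
google_cloud_platform:
  node_groups:
    general:
      instance: e2-highmem-4    # always on, so unit price dominates fixed cost
      min_nodes: 1
      max_nodes: 1
    user:
      instance: n4-standard-4   # illustrative counts; scales to zero when idle
      min_nodes: 0
      max_nodes: 5
    worker:
      instance: n4-standard-4
      min_nodes: 0
      max_nodes: 5
```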

@Adam-D-Lewis (Member, Author) commented May 7, 2024

> ❓ Are we sure that the e2 is suitable for the user and worker node groups? It seems like, for the worker group especially, they might handicap performance. Additionally, they do not support GPUs. I think it might be better to run the General node group on the e2-highmem-4 and the user and worker node groups on the n4-standard-4. Especially with them scaling down to zero now, I think that would be an acceptable tradeoff for performance vs. price.

Great points, @dcmcand. I looked into it a bit more.

It looks like CPU performance (Coremark score) is roughly equal between the two machine types.

[Screenshots: Coremark score comparison for both machine types]

Also, for GPU instances, we usually create new node groups specifically for the GPU profiles, although this is not fully documented at the moment (see the screenshot and sketch below). So I don't see this as an issue for the user or worker node defaults, since users would not use these default node groups for their GPU instances.

[Screenshot: example of a dedicated GPU node group configuration]
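
A rough sketch of such a dedicated GPU node group in nebari-config.yaml (guest_accelerators follows Nebari's GCP schema, but the group name, instance type, and accelerator model here are illustrative assumptions, not taken from the screenshot above):

```yaml
# Hypothetical dedicated GPU node group: GPU workloads target this pool
# explicitly, so the user and worker defaults never need GPU support.
google_cloud_platform:
  node_groups:
    gpu-t4:
      instance: n1-standard-8    # E2 machines lack GPU support; use a GPU-capable family
      min_nodes: 0               # scales to zero when no GPU work is running
      max_nodes: 2
      guest_accelerators:
        - name: nvidia-tesla-t4  # accelerator model chosen as an example
          count: 1
```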

The only disadvantage I see is that maximum egress drops from 10 to 8 Gbps, but I believe the cost savings are worth the 20% reduction in bandwidth for most users, though that's just a hunch.

[Screenshots: maximum egress bandwidth comparison for the two machine types]

@Adam-D-Lewis requested a review from dcmcand on May 7, 2024 at 20:58
@Adam-D-Lewis (Member, Author)

@dcmcand any further concerns or comments?

@dcmcand (Contributor) commented May 9, 2024

I still feel a bit of concern, specifically around the Dask worker. However, I have no data to actually justify my concern.

If someone upgrades and then applies the new config, it will result in the nodes being replaced. Do we have any concerns about that?

I also feel like we should make sure we document this change and how to restore the original functionality. Maybe in the FAQ? But probably also in the release notes.

@Adam-D-Lewis (Member, Author) commented May 9, 2024

> I still feel a bit of concern, specifically around the Dask worker. However, I have no data to actually justify my concern.
>
> If someone upgrades and then applies the new config, it will result in the nodes being replaced. Do we have any concerns about that?

We've added node types to the nebari config that is created when running nebari init, so people who have node types in their config (as is the default) won't be affected by this change. If they don't have node types in their config, the nodes will be replaced. That makes Nebari unusable for roughly 15 minutes while the nodes are switched out, but it shouldn't cause any other problems. I tested this on a deployment and it worked as expected.
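
Put differently, the opt-out is to pin the instance types explicitly before upgrading. A minimal sketch, assuming the standard general/user/worker group names (the angle-bracket values are placeholders for your cluster's actual machine types, visible in the GCP console or via `gcloud container node-pools list`; other node group fields are omitted for brevity):

```yaml
# nebari-config.yaml excerpt: pinning instance types so that upgrading
# and redeploying does not replace the running node pools.
google_cloud_platform:
  node_groups:
    general:
      instance: <current-general-machine-type>
    user:
      instance: <current-user-machine-type>
    worker:
      instance: <current-worker-machine-type>
```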

> I also feel like we should make sure we document this change and how to restore the original functionality. Maybe in the FAQ? But probably also in the release notes.

I'll document it in the Nebari upgrade command, so users will be notified if this change affects them and told what to add to their config to opt out, and we can copy something similar into the release notes.

@dcmcand (Contributor) commented May 9, 2024

Sounds good. Thanks, @Adam-D-Lewis.

@Adam-D-Lewis mentioned this pull request on May 14, 2024
@Adam-D-Lewis (Member, Author)

We don't know yet what the next Nebari version will be (2024.5.2 vs. 2024.6.1), so I opened a separate PR, #2466, and assigned it to the 2024.5.2 milestone. My thought is that we merge this as is and make sure to merge the other PR, with the appropriate version number, during the next release.

@Adam-D-Lewis merged commit 2a2f2ee into develop on May 14, 2024
25 of 26 checks passed
@Adam-D-Lewis deleted the reduce_gcp_costs_by_50_percent branch on May 14, 2024 at 18:13
Linked issue: [ENH] - Lower cost node sizes on GCP Nebari (#2452)