Add KEDA HPA TriggerAuthentication and postgresql ScaledObject. #2384

Open
pt247 wants to merge 144 commits into develop

Conversation

pt247
Contributor

@pt247 pt247 commented Apr 8, 2024

Reference Issues or PRs

Fixes #2284

What does this implement/fix?

Adds a KEDA TriggerAuthentication and a PostgreSQL-triggered ScaledObject for the conda-store workers.
In the current develop branch, one conda-store-worker pod is always running. With this PR, the
conda-store-worker pods start scaled down to zero, scale up with the number of pending/queued builds, and scale back down to zero when the builds are complete.
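
For reviewers unfamiliar with KEDA, the sketch below shows the shape of the two objects being added. The object names, secret reference, host, and SQL query are illustrative placeholders rather than the exact values in the diff:

apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: conda-store-postgres-auth           # placeholder name
spec:
  secretTargetRef:
    - parameter: password
      name: conda-store-postgresql          # placeholder secret holding the DB password
      key: postgresql-password
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: conda-store-worker-scaler           # placeholder name
spec:
  scaleTargetRef:
    name: nebari-conda-store-worker         # the conda-store-worker deployment
  minReplicaCount: 0                        # scale to zero when nothing is queued
  maxReplicaCount: 50                       # driven by the new conda_store.max_workers setting
  triggers:
    - type: postgresql
      metadata:
        host: nebari-conda-store-postgresql # placeholder service name
        port: "5432"
        userName: postgres
        dbName: conda-store
        sslmode: disable
        # one worker per queued/building build (illustrative query)
        query: "SELECT COUNT(*) FROM build WHERE status IN ('QUEUED', 'BUILDING');"
        targetQueryValue: "1"
      authenticationRef:
        name: conda-store-postgres-auth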

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features not to work as expected)
  • Documentation Update PR
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

  • Did you test the pull request locally?
  • Did you add new tests?: New local integration test added.

Any other comments?

  1. We can now set the following parameters when configuring conda-store:

     conda_store:
       max_workers: 50
       worker_resources:
         requests:
           cpu: 1
           memory: 4Gi

  2. The new local integration test is slow and adds 10-15 minutes to the local integration test run.
  3. We can add follow-up PRs to expose KEDA and KEDA ScaledObject parameters if needed.

@pt247 pt247 requested a review from viniciusdc April 8, 2024 11:06
@pt247 pt247 marked this pull request as ready for review April 8, 2024 11:07
@pt247 pt247 requested a review from dcmcand April 8, 2024 11:20
@Adam-D-Lewis Adam-D-Lewis self-requested a review April 9, 2024 16:01
Member

@Adam-D-Lewis Adam-D-Lewis left a comment

We should also set a memory request for the conda store worker deployment (I'd suggest a minimum of 1 GiB) and allow the general node group to scale above 1 by default for each provider (e.g. set the max_nodes default value to 5). For reference, the current Azure default is:

"general": AzureNodeGroup(instance="Standard_D8_v3", min_nodes=1, max_nodes=1),

Otherwise, if we try to build 100 conda envs simultaneously, we'll get 100 conda-store worker pods. They will all be scheduled on a single general node (assuming only one is running). The general node group won't even try to scale up (because the conda-store worker doesn't have any resource requests set), but that node won't have enough memory to solve all the envs at the same time, and many will fail.

We should also make this part of the next Nebari release and tell existing deployments that it's recommended to increase the max_nodes setting in their nebari-config if it is explicitly set to 4 or less.

It also might be a good idea to set a maximum number of conda store workers as a safeguard (even if the maximum is high). For example, if 100 envs are all scheduled at once, maybe we should only start a maximum of 50 workers (in the KEDA SQL query, take the min of the number of queued/building envs and 50, for example). Ideally that max would be configurable in the nebari config under the conda_store section.
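
To make the suggestions above concrete, the relevant nebari-config pieces would look roughly like the following (the numbers are suggestions, conda_store.max_workers / worker_resources are the keys this PR introduces, and Azure is assumed for the node group example):

azure:
  node_groups:
    general:
      instance: Standard_D8_v3
      min_nodes: 1
      max_nodes: 5              # let the general node group scale out

conda_store:
  max_workers: 50               # upper bound on concurrently running workers
  worker_resources:
    requests:
      cpu: 1
      memory: 4Gi               # at least 1 GiB, so solves spread across nodes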

@dharhas
Member

dharhas commented Apr 9, 2024

Are there conda env solves that require more than 1 GB of RAM? I remember early on we had to keep bumping up the RAM on the conda-store main node.

@Adam-D-Lewis
Member

Adam-D-Lewis commented Apr 9, 2024

@dharhas

Are there conda env solves that require more than 1 GB of RAM? I remember early on we had to keep bumping up the RAM on the conda-store main node.

Yes, there are certainly conda solves (most?) that use more than 1 GB of RAM. I don't think there's a way to estimate beforehand how much RAM a conda solve will need. Setting the memory resource request doesn't prevent the pod from using more RAM if it's available on the node, but it does put a limit on the number of conda-store workers scheduled on the same node.

@pt247
Contributor Author

pt247 commented Apr 9, 2024

It also might be a good idea to set a maximum number of conda store workers as a safeguard (even if the maximum is high). For example, if 100 envs are all scheduled at once, maybe we should only start a maximum of 50 workers (in the KEDA SQL query, take the min of the number of queued/building envs and 50, for example). Ideally that max would be configurable in the nebari config under the conda_store section.

The default, per the documentation (link), is maxReplicaCount: 100.
@Adam-D-Lewis I will change it to 50 as suggested.
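
For reference, a sketch of where the cap would live in the ScaledObject (the SQL-side cap suggested above is shown as a commented alternative; the table and column names are illustrative):

spec:
  maxReplicaCount: 50           # instead of KEDA's default of 100
  triggers:
    - type: postgresql
      metadata:
        query: "SELECT COUNT(*) FROM build WHERE status IN ('QUEUED', 'BUILDING');"
        # or cap inside SQL instead, e.g.
        # SELECT LEAST(COUNT(*), 50) FROM build WHERE status IN ('QUEUED', 'BUILDING');
        targetQueryValue: "1"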

@pt247
Contributor Author

pt247 commented Apr 15, 2024

I have tried to create 100 conda environments for testing, and not all of them transition to a completed state. Some 24 get stuck in a queued state with the following error.

[2024-04-15 00:08:26,319: WARNING/Beat]   File "/opt/conda/envs/conda-store-server/lib/python3.10/dbm/dumb.py", line 316, in open return _Database(file, mode, flag=flag)                                                                                                                   
[2024-04-15 00:08:26,320: WARNING/Beat]   File "/opt/conda/envs/conda-store-server/lib/python3.10/dbm/dumb.py", line 70, in __init__ self._create(flag)       
[2024-04-15 00:08:26,320: WARNING/Beat]   File "/opt/conda/envs/conda-store-server/lib/python3.10/dbm/dumb.py", line 86, in _create with _io.open(self._datfile, 'w', encoding="Latin-1") as f:                                                                                               
[2024-04-15 00:08:26,320: WARNING/Beat]   FileNotFoundError: [Errno 2] No such file or directory: '/root/.conda-store/celerybeat-schedule.dat'    

@dcmcand In this case, should we roll this out based on the manual testing we have done so far?

@Adam-D-Lewis Are you happy with the other changes?

@pt247 pt247 requested a review from Adam-D-Lewis April 15, 2024 08:03
@Adam-D-Lewis
Member

Adam-D-Lewis commented Apr 15, 2024

We should also set a memory request for the conda store worker deployment (I'd suggest a minimum of 1GiB)

@pt247 looks like we still need this part. This could be one of the reasons the conda store workers are failing.

[screenshot: suggested resources block for the worker deployment]
Add lines 116-121, similar to what I've shown in the screenshot above, in src/_nebari/stages/kubernetes_services/template/modules/kubernetes/services/conda-store/worker.tf.
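
For context, the effect of that worker.tf change is that the rendered worker container spec carries explicit requests, roughly like this (a sketch; the actual values would come from the new conda_store.worker_resources setting):

containers:
  - name: conda-store-worker
    resources:
      requests:
        cpu: "1"
        memory: 4Gi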

@dcmcand
Contributor

dcmcand commented Apr 16, 2024

I have tried to create 100 conda environments for testing, and not all of them transition to a completed state. Some 24 get stuck in a queued state with the following error.

[2024-04-15 00:08:26,319: WARNING/Beat]   File "/opt/conda/envs/conda-store-server/lib/python3.10/dbm/dumb.py", line 316, in open return _Database(file, mode, flag=flag)                                                                                                                   
[2024-04-15 00:08:26,320: WARNING/Beat]   File "/opt/conda/envs/conda-store-server/lib/python3.10/dbm/dumb.py", line 70, in __init__ self._create(flag)       
[2024-04-15 00:08:26,320: WARNING/Beat]   File "/opt/conda/envs/conda-store-server/lib/python3.10/dbm/dumb.py", line 86, in _create with _io.open(self._datfile, 'w', encoding="Latin-1") as f:                                                                                               
[2024-04-15 00:08:26,320: WARNING/Beat]   FileNotFoundError: [Errno 2] No such file or directory: '/root/.conda-store/celerybeat-schedule.dat'    

@dcmcand In this case, should we roll this out based on the manual testing we have done so far?

@Adam-D-Lewis Are you happy with the other changes?

@pt247 After adding the changes that @Adam-D-Lewis requested, can you redo your load test? We need to be able to gracefully scale conda-store workers. Ensuring that we have guidance around what resources/configs are needed for what scale would be important too. Finally, please also open a docs PR to document how to configure this.

@viniciusdc viniciusdc added the needs: review 👀 This PR is complete and ready for reviewing label May 14, 2024
@pt247
Contributor Author

pt247 commented May 14, 2024

I had to comment out the Cypress tests to get the tests passing.
https://github.com/nebari-dev/nebari/actions/runs/9082667704/job/24959793640?pr=2384

Labels
needs: review 👀 This PR is complete and ready for reviewing
Projects
Status: Changes requested 🧱
Development

Successfully merging this pull request may close these issues.

[ENH] - Add kubernetes horizontal autoscaler for conda-store workers based on queue depth
5 participants