[BUG] - Config generated by guided init should contain default settings #2122

Open
rsignell opened this issue Nov 18, 2023 · 8 comments · May be fixed by #2294

Comments

rsignell commented Nov 18, 2023

Describe the bug

We generated a config using guided init and deployed Nebari on AWS with no apparent issues.

We then tried to use the deployment for training, and students were not able to get the 30 workers they requested. We double-checked our service quota in the region where we were running the training and it was set to 500, so we were mystified (and frustrated) about why things weren't working.

After the training we realized that the generated config didn't contain the node_groups section with the max limits for the different worker pools. In other words, the amazon_web_services section just looked like:

amazon_web_services:
  kubernetes_version: '1.26'
  region: us-west-2

So I imagine we were just getting the default max, which must have been something small.

When I expanded that section to:

amazon_web_services:
  kubernetes_version: '1.26'
  region: us-west-2
  node_groups:
    general:
      instance: m5.2xlarge
      min_nodes: 1
      max_nodes: 1
    user:
      instance: m5.xlarge
      min_nodes: 1
      max_nodes: 100
    worker:
      instance: m5.xlarge
      min_nodes: 1
      max_nodes: 450

and re-rendered and redeployed, multiple students could spin up clusters of 30 workers.

Expected behavior

The expected behavior would be for guided init to create a config that contains the default settings, so that it (1) shows the user what the defaults are, and (2) provides a template that the user can easily modify.

OS and architecture in which you are running Nebari

aws

How to Reproduce the problem?

Use the guided init on AWS.

Command output

No response

Versions and dependencies used.

nebari 2023.10.1

Compute environment

AWS

Integrations

No response

Anything else?

No response

@rsignell rsignell added needs: triage 🚦 Someone needs to have a look at this issue and triage type: bug 🐛 Something isn't working labels Nov 18, 2023
aktech (Member) commented Feb 23, 2024

Yes, indeed. This would be super helpful.

Adam-D-Lewis (Member) commented Feb 27, 2024

We discussed this in the community meeting and agreed that displaying the entire config in nebari-config.yaml is better; we would accept PRs doing that (reverting to prior Nebari behavior).

Adam-D-Lewis (Member) commented Feb 28, 2024

It looks like we're manually putting each thing we want in the config into src/_nebari/initialize.py. I think we should do something similar to config = read_configuration(config_filename, config_schema=config_schema) in src/_nebari/subcommands/deploy.py when we write the config file.
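
For illustration, here's a rough sketch of what writing out the full config could look like (the write_full_configuration helper name and the dump flags are assumptions for this sketch, not Nebari's actual API):

import yaml
from pydantic import BaseModel

def write_full_configuration(config_model: BaseModel, config_filename: str) -> None:
    # Hypothetical helper: dump the entire validated model, defaults included,
    # rather than only the keys that initialize.py sets explicitly.
    data = config_model.dict(exclude_unset=False, exclude_defaults=False)
    with open(config_filename, "w") as f:
        yaml.safe_dump(data, f, default_flow_style=False, sort_keys=False)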

If we want to prevent some attributes from showing up in the config, I think we could add a private _hide_from_config set on each model whose fields we don't want to show up in the config by default (e.g. terraform overrides, maybe). I put together a small demo showing how this might work; I assume we'd modify the Base class in src/nebari/schema.py. There's still more work to do, but here's the example so it's not lost.

from typing import Optional

from pydantic import BaseModel  # uses the Pydantic v1 API (__fields__, field.type_, .json)


class Base(BaseModel):
    # field names listed here are omitted when the model is written to the config
    _hide_from_config: set = set()

    @classmethod
    def _get_config_exclude_attrs(cls):
        # build the nested `exclude` mapping expected by .dict()/.json(),
        # recursing into any field whose type is itself a Base subclass
        exclude_dict = {}
        for attr_name, field in cls.__fields__.items():
            if attr_name in cls._hide_from_config:
                exclude_dict[attr_name] = True
            elif hasattr(field.type_, '_get_config_exclude_attrs'):
                exclude_dict[attr_name] = field.type_._get_config_exclude_attrs()
        return exclude_dict

    def write_config(self):
        return self.json(exclude=self._get_config_exclude_attrs())


class SubSubModel(Base):
    _hide_from_config = {'hidden_sub_sub_model_value'}
    sub_sub_model_value: int = 2
    hidden_sub_sub_model_value: int = -9999


class SubModel(Base):
    _hide_from_config = {'hidden_sub_model_value'}
    sub_model_value: SubSubModel = SubSubModel()
    hidden_sub_model_value: int = -9999


class MyModel(Base):
    _hide_from_config: set = {'should_not_be_written'}
    should_be_written: SubModel
    should_not_be_written: Optional[str] = None
    other_should_be_written: str = '1'


mm = MyModel(should_be_written=SubModel())
config = mm.write_config()
print('=' * 80)
# manually built exclude mapping, for comparison with write_config()
print(mm.dict(exclude={
    'should_not_be_written': True,
    'should_be_written': {
        'hidden_sub_model_value': True,
        'sub_model_value': {
            'hidden_sub_sub_model_value': True
        },
    },
}))
print('=' * 80)
print(mm.write_config())
print('=' * 80)

# Output looks like
# ================================================================================
# {'should_be_written': {'sub_model_value': {'sub_sub_model_value': 2}}, 'other_should_be_written': '1'}
# ================================================================================
# {"should_be_written": {"sub_model_value": {"sub_sub_model_value": 2}}, "other_should_be_written": "1"}
# ================================================================================

Update: I looked into this a bit more. Potential issues are:

  • key order (provider, version at the top of the config file, etc.)
    • define an ordered dict with the desired order for deployment (see the sketch after this list)
  • complex logic about when fields should be displayed or not (e.g. only write the other attributes of ci_cd if ci_cd.type is not None; only show the azure section if provider == 'azure')
  • Currently, when updating Nebari, if the latest Nebari changes the default value of a setting not specified in the nebari-config.yaml file, that value changes to the new default. If we put everything in the config, that will no longer be the case.
    • I doubt this was relied upon, so it's fine for this behavior to change.
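
A minimal sketch of the key-order idea, assuming the rendered config is already a plain dict (the TOP_LEVEL_ORDER list below is illustrative, not the actual ordering Nebari would use):

TOP_LEVEL_ORDER = ["provider", "nebari_version", "namespace", "project_name"]

def reorder_top_level(config: dict) -> dict:
    # put the well-known keys first, then keep everything else in its existing order
    ordered = {key: config[key] for key in TOP_LEVEL_ORDER if key in config}
    ordered.update({k: v for k, v in config.items() if k not in ordered})
    return ordered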

Adam-D-Lewis (Member) commented Mar 6, 2024

With a very small code change we could get the config file to look something like what I've pasted below.
PR here

prevent_deploy: false
nebari_version: 2024.3.1rc1
provider: local
namespace: dev
project_name: scratch-2efc
ci_cd:
  type: none
  branch: main
  commit_render: true
  before_script: []
  after_script: []
terraform_state:
  type: remote
  backend:
  config: {}
digital_ocean:
azure:
amazon_web_services:
google_cloud_platform:
existing:
local:
  kube_context:
  node_selectors:
    general:
      key: kubernetes.io/os
      value: linux
    user:
      key: kubernetes.io/os
      value: linux
    worker:
      key: kubernetes.io/os
      value: linux
external_container_reg:
  enabled: false
  access_key_id:
  secret_access_key:
  extcr_account:
  extcr_region:
dns:
  provider:
  auto_provision: false
ingress:
  terraform_overrides: {}
certificate:
  type: self-signed
  secret_name:
  acme_email:
  acme_server: https://acme-v02.api.letsencrypt.org/directory
domain:
security:
  authentication:
    type: password
  shared_users_group: true
  keycloak:
    initial_root_password: sltj9yhbg2uatvknvasd6kb8bnzp49q4
    overrides: {}
    realm_display_name: Nebari
jhub_apps:
  enabled: false
jupyterlab:
  default_settings: {}
  idle_culler:
    terminal_cull_inactive_timeout: 15
    terminal_cull_interval: 5
    kernel_cull_idle_timeout: 15
    kernel_cull_interval: 5
    kernel_cull_connected: true
    kernel_cull_busy: false
    server_shutdown_no_activity_timeout: 15
  initial_repositories: []
  preferred_dir:
jupyterhub:
  overrides: {}
telemetry:
  jupyterlab_pioneer:
    enabled: false
    log_format:
monitoring:
  enabled: true
argo_workflows:
  enabled: true
  overrides: {}
  nebari_workflow_controller:
    enabled: true
    image_tag: 2024.2.1rc2
conda_store:
  extra_settings: {}
  extra_config: ''
  image: quansight/conda-store-server
  image_tag: 2024.1.1
  default_namespace: nebari-git
  object_storage: 200Gi
environments:
  environment-dask.yaml:
    name: dask
    channels:
    - conda-forge
    dependencies:
    - python==3.11.6
    - ipykernel==6.26.0
    - ipywidgets==8.1.1
    - nebari-dask==2024.1.1
    - python-graphviz==0.20.1
    - pyarrow==14.0.1
    - s3fs==2023.10.0
    - gcsfs==2023.10.0
    - numpy=1.26.0
    - numba=0.58.1
    - pandas=2.1.3
    - xarray==2023.10.1
  environment-dashboard.yaml:
    name: dashboard
    channels:
    - conda-forge
    dependencies:
    - python==3.11.6
    - cufflinks-py==0.17.3
    - dash==2.14.1
    - geopandas==0.14.1
    - geopy==2.4.0
    - geoviews==1.11.0
    - gunicorn==21.2.0
    - holoviews==1.18.1
    - ipykernel==6.26.0
    - ipywidgets==8.1.1
    - jupyter==1.0.0
    - jupyter_bokeh==3.0.7
    - matplotlib==3.8.1
    - nebari-dask==2024.1.1
    - nodejs=20.8.1
    - numpy==1.26.0
    - openpyxl==3.1.2
    - pandas==2.1.3
    - panel==1.3.1
    - param==2.0.1
    - plotly==5.18.0
    - python-graphviz==0.20.1
    - rich==13.6.0
    - streamlit==1.28.1
    - sympy==1.12
    - voila==0.5.5
    - xarray==2023.10.1
    - pip==23.3.1
    - pip:
      - streamlit-image-comparison==0.0.4
      - noaa-coops==0.1.9
      - dash_core_components==2.0.0
      - dash_html_components==2.0.0
profiles:
  jupyterlab:
  - access: all
    display_name: Small Instance
    description: Stable environment with 2 cpu / 8 GB ram
    default: true
    users:
    groups:
    kubespawner_override:
      cpu_limit: 2
      cpu_guarantee: 1
      mem_limit: 8G
      mem_guarantee: 5G
  - access: all
    display_name: Medium Instance
    description: Stable environment with 4 cpu / 16 GB ram
    default: false
    users:
    groups:
    kubespawner_override:
      cpu_limit: 4
      cpu_guarantee: 3
      mem_limit: 16G
      mem_guarantee: 10G
  dask_worker:
    Small Worker:
      worker_cores_limit: 2
      worker_cores: 1
      worker_memory_limit: 8G
      worker_memory: 5G
      worker_threads: 2
    Medium Worker:
      worker_cores_limit: 4
      worker_cores: 3
      worker_memory_limit: 16G
      worker_memory: 10G
      worker_threads: 4
theme:
  jupyterhub:
    hub_title: Nebari - scratch-2efc
    hub_subtitle: Your open source data science platform, hosted
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    logo: 
      https://raw.githubusercontent.com/nebari-dev/nebari-design/main/logo-mark/horizontal/Nebari-Logo-Horizontal-Lockup-White-text.svg
    favicon: 
      https://raw.githubusercontent.com/nebari-dev/nebari-design/main/symbol/favicon.ico
    primary_color: '#4f4173'
    primary_color_dark: '#4f4173'
    secondary_color: '#957da6'
    secondary_color_dark: '#957da6'
    accent_color: '#32C574'
    accent_color_dark: '#32C574'
    text_color: '#111111'
    h1_color: '#652e8e'
    h2_color: '#652e8e'
    version: v2024.3.1rc1
    navbar_color: '#1c1d26'
    navbar_text_color: '#f1f1f6'
    navbar_hover_color: '#db96f3'
    display_version: 'True'
storage:
  conda_store: 200Gi
  shared_filesystem: 200Gi
default_images:
  jupyterhub: quay.io/nebari/nebari-jupyterhub:2024.2.1rc2
  jupyterlab: quay.io/nebari/nebari-jupyterlab:2024.2.1rc2
  dask_worker: quay.io/nebari/nebari-dask-worker:2024.2.1rc2
tf_extensions: []
helm_extensions: []

Issues:

  • having all cloud providers show up (most of them blank) is too confusing
  • certificate section is confusing with both secret_name and acme_* keys since you'll only need one or the other depending on the type.

I think we can solve this by using more specific Pydantic models, e.g. SelfSignedCertificate and SecretCertificate, instead of capturing both cases in one Pydantic model.
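
For example, a rough sketch of that split (class and field names here are illustrative and may not match the final schema):

from typing import Literal, Union

from pydantic import BaseModel

class SelfSignedCertificate(BaseModel):
    type: Literal["self-signed"] = "self-signed"

class SecretCertificate(BaseModel):
    type: Literal["existing"] = "existing"
    secret_name: str

class LetsEncryptCertificate(BaseModel):
    type: Literal["lets-encrypt"] = "lets-encrypt"
    acme_email: str
    acme_server: str = "https://acme-v02.api.letsencrypt.org/directory"

# only the fields belonging to the selected certificate type would be written out
Certificate = Union[SelfSignedCertificate, SecretCertificate, LetsEncryptCertificate]

Dumping a config whose certificate field holds a SelfSignedCertificate would then only emit type: self-signed, with no dangling secret_name or acme_* keys.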

Update: I think I fixed the certs using more specific Pydantic models. I'd want to fix the providers also though.

@Adam-D-Lewis Adam-D-Lewis linked a pull request Mar 6, 2024 that will close this issue
rsignell (Author) commented Mar 7, 2024

Well, I think even with its issues this would be a big improvement!

Adam-D-Lewis (Member) commented Mar 9, 2024

After a few more changes, the config now looks like what's below (I got rid of the extra aws, gcp, etc. sections).

prevent_deploy: false
nebari_version: 2024.3.1rc2.dev2+g44c9f71.d20240306
provider: local
namespace: dev
project_name: scratch-2efc
ci_cd:
  type: none
  branch: main
  commit_render: true
  before_script: []
  after_script: []
terraform_state:
  type: remote
  backend:
  config: {}
local:
  kube_context:
  node_selectors:
    general:
      key: kubernetes.io/os
      value: linux
    user:
      key: kubernetes.io/os
      value: linux
    worker:
      key: kubernetes.io/os
      value: linux
external_container_reg:
  enabled: false
  access_key_id:
  secret_access_key:
  extcr_account:
  extcr_region:
dns:
  provider:
  auto_provision: false
ingress:
  terraform_overrides: {}
certificate:
  type: self-signed
domain:
security:
  authentication:
    type: password
  shared_users_group: true
  keycloak:
    initial_root_password: syrl4br916u50lhxi7ifv3m921qkmckh
    overrides: {}
    realm_display_name: Nebari
jhub_apps:
  enabled: false
jupyterlab:
  default_settings: {}
  idle_culler:
    terminal_cull_inactive_timeout: 15
    terminal_cull_interval: 5
    kernel_cull_idle_timeout: 15
    kernel_cull_interval: 5
    kernel_cull_connected: true
    kernel_cull_busy: false
    server_shutdown_no_activity_timeout: 15
  initial_repositories: []
  preferred_dir:
jupyterhub:
  overrides: {}
telemetry:
  jupyterlab_pioneer:
    enabled: false
    log_format:
monitoring:
  enabled: true
argo_workflows:
  enabled: true
  overrides: {}
  nebari_workflow_controller:
    enabled: true
    image_tag: 2024.2.1rc2
conda_store:
  extra_settings: {}
  extra_config: ''
  image: quansight/conda-store-server
  image_tag: 2024.1.1
  default_namespace: nebari-git
  object_storage: 200Gi
environments:
  environment-dask.yaml:
    name: dask
    channels:
    - conda-forge
    dependencies:
    - python==3.11.6
    - ipykernel==6.26.0
    - ipywidgets==8.1.1
    - nebari-dask==2024.1.1
    - python-graphviz==0.20.1
    - pyarrow==14.0.1
    - s3fs==2023.10.0
    - gcsfs==2023.10.0
    - numpy=1.26.0
    - numba=0.58.1
    - pandas=2.1.3
    - xarray==2023.10.1
  environment-dashboard.yaml:
    name: dashboard
    channels:
    - conda-forge
    dependencies:
    - python==3.11.6
    - cufflinks-py==0.17.3
    - dash==2.14.1
    - geopandas==0.14.1
    - geopy==2.4.0
    - geoviews==1.11.0
    - gunicorn==21.2.0
    - holoviews==1.18.1
    - ipykernel==6.26.0
    - ipywidgets==8.1.1
    - jupyter==1.0.0
    - jupyter_bokeh==3.0.7
    - matplotlib==3.8.1
    - nebari-dask==2024.1.1
    - nodejs=20.8.1
    - numpy==1.26.0
    - openpyxl==3.1.2
    - pandas==2.1.3
    - panel==1.3.1
    - param==2.0.1
    - plotly==5.18.0
    - python-graphviz==0.20.1
    - rich==13.6.0
    - streamlit==1.28.1
    - sympy==1.12
    - voila==0.5.5
    - xarray==2023.10.1
    - pip==23.3.1
    - pip:
      - streamlit-image-comparison==0.0.4
      - noaa-coops==0.1.9
      - dash_core_components==2.0.0
      - dash_html_components==2.0.0
profiles:
  jupyterlab:
  - access: all
    display_name: Small Instance
    description: Stable environment with 2 cpu / 8 GB ram
    default: true
    users:
    groups:
    kubespawner_override:
      cpu_limit: 2
      cpu_guarantee: 1
      mem_limit: 8G
      mem_guarantee: 5G
  - access: all
    display_name: Medium Instance
    description: Stable environment with 4 cpu / 16 GB ram
    default: false
    users:
    groups:
    kubespawner_override:
      cpu_limit: 4
      cpu_guarantee: 3
      mem_limit: 16G
      mem_guarantee: 10G
  dask_worker:
    Small Worker:
      worker_cores_limit: 2
      worker_cores: 1
      worker_memory_limit: 8G
      worker_memory: 5G
      worker_threads: 2
    Medium Worker:
      worker_cores_limit: 4
      worker_cores: 3
      worker_memory_limit: 16G
      worker_memory: 10G
      worker_threads: 4
theme:
  jupyterhub:
    hub_title: Nebari - scratch-2efc
    hub_subtitle: Your open source data science platform, hosted
    welcome: Welcome! Learn about Nebari's features and configurations in <a href="https://www.nebari.dev/docs/welcome">the
      documentation</a>. If you have any questions or feedback, reach the team on
      <a href="https://www.nebari.dev/docs/community#getting-support">Nebari's support
      forums</a>.
    logo: 
      https://raw.githubusercontent.com/nebari-dev/nebari-design/main/logo-mark/horizontal/Nebari-Logo-Horizontal-Lockup-White-text.svg
    favicon: 
      https://raw.githubusercontent.com/nebari-dev/nebari-design/main/symbol/favicon.ico
    primary_color: '#4f4173'
    primary_color_dark: '#4f4173'
    secondary_color: '#957da6'
    secondary_color_dark: '#957da6'
    accent_color: '#32C574'
    accent_color_dark: '#32C574'
    text_color: '#111111'
    h1_color: '#652e8e'
    h2_color: '#652e8e'
    version: v2024.3.1rc2.dev2+g44c9f71.d20240306
    navbar_color: '#1c1d26'
    navbar_text_color: '#f1f1f6'
    navbar_hover_color: '#db96f3'
    display_version: 'True'
storage:
  conda_store: 200Gi
  shared_filesystem: 200Gi
default_images:
  jupyterhub: quay.io/nebari/nebari-jupyterhub:2024.2.1rc2
  jupyterlab: quay.io/nebari/nebari-jupyterlab:2024.2.1rc2
  dask_worker: quay.io/nebari/nebari-dask-worker:2024.2.1rc2
tf_extensions: []
helm_extensions: []

I want to take a look at the other config files (aws, gcp, etc.) and see if we should make any other changes.

Adam-D-Lewis (Member) commented Apr 11, 2024

@rsignell We want to be careful to get this fix right, so a PR is going to take more time. However, in the meantime I've created a PR to add node groups back to the nebari config, which is a fairly straightforward change.

rsignell (Author) commented

@Adam-D-Lewis , awesome, that alone will be super useful!
