Creating pangeo-eosc infrastructure based on elastic Kubernetes Virtual Cluster using IM-Dashboard #22

Open
tinaok opened this issue Sep 21, 2022 · 39 comments

Comments

@tinaok
Collaborator

tinaok commented Sep 21, 2022

This is an issue so that we can coordinate for creating pangeo-eosc infrastructure based on elastic Kubernetes Virtual Cluster using IM-Dashboard.

@tinaok
Collaborator Author

tinaok commented Sep 21, 2022

Here is what I have in mind as a to-do list. Any thoughts @guillaumeeb & @j34ni ?

    • We need resources to test the elastic option, i.e. we need to shut down some instances to free vCPUs before we can start testing.
    • Feedback from the previous tests during August?
    • Find a way to create the infrastructure using the 'elastic' option on IM-Dashboard
    • Creation (how do we set the limits: minimum and maximum? Can we change the maximum from time to time, for example to host tutorials?)
    • Benchmarks

@guillaumeeb
Member

👍 for me! I'll let @j34ni answer the first two questions. For point 4, not sure we really need a maximum 🙂.

@j34ni
Collaborator

j34ni commented Sep 21, 2022

@tinaok :
1- Feel free to remove all the nodes you want from the existing infrastructures (I did not manage to log in to the IM dashboard lately)
2- The two of them worked great. The one without EGI Check-in saved the day at the workshop since a lot of the participants had not enrolled. I have also used them since and they are quite "responsive"; more than 8 GB of RAM would be nice for some applications though

@guillaumeeb
Member

@j34ni @tinaok, I think at one point the Elastic Kubernetes offer from IM Dashboard was tried? Was there any reason why it was not kept?

@j34ni
Collaborator

j34ni commented Sep 22, 2022

@guillaumeeb I do not remember that; maybe it was at the same time as other things that failed and we went back a few steps to get something working?

@guillaumeeb
Member

Yeah probably. I guess we just need to make some room on our VMs and try to redeploy an elastic version on Kubernetes to host our Pangeo platform.

@sebastian-luna-valero
Collaborator

Ok, following up from #21

I think we never tried the elastic option before since we wanted to focus on other higher priority issues.

Have you tried the elastic option now? If so, could you please provide feedback?

Happy to help with this.

@guillaumeeb
Member

Have you tried the elastic option now? If so, could you please provide feedback?

Not yet for my part. I can put this on my to-do list for the next days/weeks if @j34ni or @tinaok don't have time to do so.

@tinaok
Collaborator Author

tinaok commented Sep 28, 2022

Hi @guillaumeeb, thank you very much. Unfortunately I had no time (and will have no time at all this week either) to try it out. Your help will be much appreciated.
I think it can be named 'pangeo-eosc'.

@guillaumeeb
Member

guillaumeeb commented Sep 28, 2022

In the process of creating a Pangeo deployment with elastic Kubernetes on IM Dashboard, following https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md, I've got some questions/remarks (noting them here as much for other people as for myself):

  • https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md#step-1-dns-name says to create a new domain, but what we really need is a DNS name, right, so a new host.
  • I selected Kubernetes, then added the Elastic option and Grafana. No other options. I'm planning to do all the Daskhub thing via Helm only.
  • On IM Dashboard K8S cluster creation, HW Data tab, there are 1 front node and several workers. Is the front node only used for the K8S master components, or will it also host some user pods? This is important to know whether a small VM is enough, or if we should use the same flavor as for workers.
  • Which Kubernetes version should we use? I kept the 1.23 default (not sure if Dask Gateway or JupyterHub is up to date).
  • Cloud Provider / Select Site image: I chose Ubuntu Jammy. Is that OK?
  • For Elastic K8S, we can only choose the maximum number of Worker nodes (not the minimum, but this might be the number of workers entered on the K8S tab).
  • EGI.md lacks information on how to connect to the K8S front VM to issue commands (getting the pem file, connecting with cloudadm).
  • I don't know how to retrieve the EGI Check-in secrets other than by searching in a previous deployment. Is there a way through https://aai.egi.eu/federation?
  • My K8S deployment is marked in status "running", not "configured", but it looks like the deployment is terminated? Oh no, it just took a long time to end up in the "configured" state. I'm not sure all is well though, the prompt on the front node is not the same as on other deployments.

So after step 2 from EGI.md, I created my daskhub.yaml file as below:

dask-gateway:
  enabled: true
  gateway:
    auth:
      jupyterhub:
        apiToken: <token1>
      type: jupyterhub
    extraConfig:
      dasklimits: |
        c.ClusterConfig.cluster_max_cores = 6
        c.ClusterConfig.cluster_max_memory = "24 G"
        c.ClusterConfig.cluster_max_workers = 4
        c.ClusterConfig.idle_timeout = 1800
      optionHandler: |
        from dask_gateway_server.options import Options, Integer, Float, String

        def options_handler(options):
          if ":" not in options.image:
            raise ValueError("When specifying an image you must also provide a tag")
          return {
            "worker_cores": options.worker_cores,
            "worker_memory": int(options.worker_memory * 2 ** 30),
            "image": options.image,
          }

        c.Backend.cluster_options = Options(
          Integer("worker_cores", default=1, min=1, max=4, label="Worker Cores"),
          Float("worker_memory", default=2, min=2, max=8, label="Worker Memory (GiB)"),
          String("image", default="pangeo/ml-notebook:2022.09.21", label="Image"),
          handler=options_handler,
        )
    backend:
      worker:
        cores:
          limit: 4
        memory:
          limit: 8G
        threads: 2
dask-kubernetes:
  enabled: false
jupyterhub:
  hub:
    config:
      GenericOAuthenticator:
        client_id: <client>
        client_secret: <secret>
        oauth_callback_url: https://pangeo-elastic.vm.fedcloud.eu/hub/oauth_callback
        authorize_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/auth
        token_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/token
        userdata_url: https://aai-dev.egi.eu/auth/realms/egi/protocol/openid-connect/userinfo
        login_service: EGI Check-In
        scope:
          - openid
          - email
          - profile
          - eduperson_entitlement
        username_key: preferred_username
        userdata_params:
          state: state
        allowed_groups:
          - urn:mace:egi.eu:group:vo.pangeo.eu:role=member#aai.egi.eu
        claim_groups_key: eduperson_entitlement
      JupyterHub:
        authenticator_class: generic-oauth
    services:
      dask-gateway:
        apiToken: <token1>
  proxy:
    secretToken: <token2>
    service:
      type: ClusterIP
  singleuser:
    cpu:
      guarantee: 1
      limit: 2
    defaultUrl: /lab
    extraEnv:
      DASK_GATEWAY__CLUSTER__OPTIONS__IMAGE: '{JUPYTER_IMAGE_SPEC}'
    image:
      name: pangeo/ml-notebook
      tag: 2022.09.21
    memory:
      guarantee: 2G
      limit: 4G
    startTimeout: 600
    storage:
      capacity: 2Gi
      type: dynamic
rbac:
  enabled: true

and just issued the helm command:

sudo helm upgrade daskhub daskhub --repo=https://helm.dask.org --install --wait --cleanup-on-fail --create-namespace --namespace daskhub --version 2022.8.2 --values daskhub.yaml

Followed by reconfiguring ingress.
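Roughly, that step boils down to pointing the cluster's nginx ingress for the DNS name at JupyterHub's proxy-public service, something like the sketch below (the resource name and annotation are assumptions on my side; the actual manifest is the one from EGI.md):

sudo kubectl apply -n daskhub -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: daskhub-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
  - host: pangeo-elastic.vm.fedcloud.eu
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: proxy-public    # public proxy Service created by the JupyterHub chart
            port:
              number: 80
EOF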

The helm command and kubectl for ingress worked with no error.
I can see Jupyterhub, but I get a login error:
(screenshot of the login error page)

Same as the other day when trying to access IM Dashboard.

I'll stop there for tonight. Could someone test the deployment at https://pangeo-foss4g.vm.fedcloud.eu and see if they can log in?


Edit: correct link is https://pangeo-elastic.vm.fedcloud.eu/

@tinaok
Collaborator Author

tinaok commented Sep 28, 2022

Thank you @guillaumeeb !!
I just tried to log on to https://pangeo-foss4g.vm.fedcloud.eu/jupyterhub/, but maybe this one is the foss4g (i.e. old) configuration (all my history is still there, so I guess this is not the new one you are creating based on the elastic Kubernetes?)

https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md#step-1-dns-name says to create a new domain, but what we really need is a DNS name right, so a new host.

In the example it was pangeo.vm.fedcloud.eu, but I guess we can have something like pangeo-eosc.vm.fedcloud.eu?

I selected Kubernetes, then added Elastic option and Grafana. No other options. I'm planning to do all the Daskhub thing via Helm only.

👍

I checked the IP address of the cluster you created from the swift dashboard and I could reach the Grafana login portal.

EGI.md lacks information on how to connect to the K8S front VM to issue commands (getting the pem file, connecting with cloudadm).

If I remember right, the button in the IM dashboard showing the cluster configuration as "configured" can be clicked, and there is information there about cloudadm.
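Once you have the private key from the IM Dashboard, connecting is roughly (key filename and host below are placeholders):

chmod 600 key.pem
ssh -i key.pem cloudadm@<front-node-DNS-or-IP>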

@guillaumeeb
Member

Crap, I didn't indicate the right link.

The correct one is https://pangeo-elastic.vm.fedcloud.eu. This is only a test deployment for now. There is no need to add /jupyterhub/. Jupyterhub is available at /.

@tinaok
Collaborator Author

tinaok commented Sep 29, 2022

Thanks for the new link!
I confirm I get the error too
(screenshot of the same login error)

@guillaumeeb
Member

@sebastian-luna-valero do you have any idea why the auth might have failed?

I'll have another look at it tonight.

@sebastian-luna-valero
Collaborator

Hi,

Could you please double check that the Redirect URI in https://aai-dev.egi.eu/federation has the same value as oauth_callback_url in the values.yaml?

(screenshot of the Redirect URI field in the federation registry form)

@guillaumeeb
Member

Could you please double check that the Redirect URI in https://aai-dev.egi.eu/federation has the same value as oauth_callback_url in the values.yaml?

I think I get the problem: I simply didn't go through the registration of a new Service in https://aai-dev.egi.eu/federation as I thought I could reuse the old OpenID credentials. Could we use the same credentials by adding another Redirect URI to the form you're showing? How can we get access to the management of the already existing service?

@sebastian-luna-valero
Collaborator

Trying to answer some questions:

On IM Dashboard K8S cluster creation, HW Data tab, there are 1 front node, and several workers. Is the front node only used for K8S master component, or will it also host some user pods? This is important to know if a small VM is enough, or if we should use the same flavor as for workers.

I believe user pods end up running on worker nodes. However, for large clusters, I like to have a big flavor for the front-end too (with the master role).
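To double-check on a live cluster, something like this shows which node each pod landed on:

sudo kubectl get pods -n daskhub -o wide    # the NODE column shows where each pod was scheduled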

Which Kubernetes version should we use? I kept the 1.23 default (not sure if Dask gateway or Jupyterhub is up to date).

I normally take a note of the version I choose, trying to be reproducible. Last time I deployed 1.23.11 and it was looking good.

Cloud Provider/ Select Site image: I chose Ubuntu Jammy. Is that OK?

It should be ok, but I still prefer to stay with 20.04.

For Elastic K8S, we can only choose the maximum number of Worker nodes (not the minimum, but this might be the number of workers entered on the K8S tab).

This is something that I wanted to explore myself since I haven't tried it yet. I believe the minimum number of workers can also be configured from the CLI; maybe we need to ask for this option to be exposed on the IM dashboard as well.

EGI.md lacks information on how to connect to the K8S front VM to issue commands (getting the pem file, connecting with cloudadm).

Correct, would you open a PR with this and other suggestions? Happy to review.

I don't know how to retrieve EGI Check-in secrets other than by searching in previous deployment. Is there a way through https://aai.egi.eu/federation?

Please make sure you use https://aai-dev.egi.eu/federation since you can self-approve your request to add this new service. The secrets are available on that form after the service is approved.

My K8S deployment is marked in status "running", not "configured", but it looks like the deployment is terminated? Oh no, it took a long time to end up in "configured" state. I'm not sure all is well though, the prompt on the front node is not the same as on other deployments.

If you only get $ after ssh'ing into the front-end, that's expected after deployment. If other deployments had a different prompt, maybe it is because someone reconfigured the default option.

@sebastian-luna-valero
Collaborator

I think I get the problem: I simply didn't go through the registration of a new Service in https://aai-dev.egi.eu/federation as I thought I could reuse the old OpenID credentials. Could we use the same credentials by adding another Redirect URI to the form you're showing? How can we get access to the management of the already existing service?

Each service (different URIs) has its own credentials. These credentials are only available to the "service owner", i.e. the one adding the config to https://aai-dev.egi.eu/federation, I am afraid.

@guillaumeeb
Member

Please make sure you use https://aai-dev.egi.eu/federation since you can self-approve your request to add this new service. The secrets are available on that form after the service is approved.

Each service (different URIs) has its own credentials. These credentials are only available to the "service owner", i.e. the one adding the config to https://aai-dev.egi.eu/federation, I am afraid.

Okay, trying this now!

Correct, would you open a PR with this and other suggestions? Happy to review.

Yes I'm planning to do this once everything works fine.

Thanks for all the answers, that's really helpful!

@guillaumeeb
Member

@sebastian-luna-valero Thanks to your inputs, I created a service in https://aai-dev.egi.eu/federation/egi/services and self-approved it. I think I was able to give you access to this service.

However, it is still pending (Deployment in progress status). Do you know how much time it can take (it's already been about 30min)?

I'm trying with NativeAuthenticator while waiting for the OIDC credentials to be OK.

@guillaumeeb
Member

So the platform seems to be working (Jupyterhub and Dask-gateway), however, I don't see any scaling up when I ask for more pods.

I've launched a Dask-gateway cluster and scaled it up. The default platform has only one worker node with 8 CPUs and 32 GB.

Here are my pods waiting:

$ sudo kubectl get pods -n daskhub
NAME                                                 READY   STATUS    RESTARTS   AGE
api-daskhub-dask-gateway-547c8f684-vmxlw             1/1     Running   0          17m
continuous-image-puller-9kf6h                        1/1     Running   0          23h
controller-daskhub-dask-gateway-6d988656cf-66kht     1/1     Running   0          23h
dask-scheduler-04b0b67bffc840cc9f7bb0dc24c7c350      1/1     Running   0          19m
dask-scheduler-449348c96b1f4f76b530b86a5286bb4a      1/1     Running   0          15m
dask-worker-04b0b67bffc840cc9f7bb0dc24c7c350-2ptp2   1/1     Running   0          19m
dask-worker-04b0b67bffc840cc9f7bb0dc24c7c350-5wnvc   1/1     Running   0          19m
dask-worker-04b0b67bffc840cc9f7bb0dc24c7c350-crjkk   1/1     Running   0          19m
dask-worker-04b0b67bffc840cc9f7bb0dc24c7c350-lh25q   1/1     Running   0          19m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-7w6tv   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-9d79d   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-cgs8t   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-k54br   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-mhrln   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-nq5z9   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-qp52k   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-rk28n   0/1     Pending   0          4m43s
dask-worker-449348c96b1f4f76b530b86a5286bb4a-rm7vt   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-wwvt2   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-zb55p   0/1     Pending   0          15m
dask-worker-449348c96b1f4f76b530b86a5286bb4a-zv2m5   0/1     Pending   0          4m43s

And if I check the details of one Pending pod:

Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  16m                default-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  14m (x1 over 15m)  default-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

But I still have two nodes:

$ sudo kubectl get nodes
NAME                     STATUS   ROLES                  AGE   VERSION
kubeserver.localdomain   Ready    control-plane,master   24h   v1.23.11
vnode-1.localdomain      Ready    <none>                 24h   v1.23.11

Not sure where to look to see where the problem could be.
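For reference, what I've been checking so far is roughly this (standard kubectl, nothing IM-specific; the pod name is a placeholder):

# why a given worker pod is still Pending
sudo kubectl describe pod <pending-dask-worker-pod> -n daskhub
# recent scheduling events in the namespace
sudo kubectl get events -n daskhub --sort-by=.lastTimestamp
# what the scheduler sees as allocatable/allocated on each node
sudo kubectl describe nodes | grep -A 8 "Allocated resources"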

@guillaumeeb
Member

I also see

44 updates can be applied immediately.
To see these additional updates run: apt list --upgradable


*** System restart required ***

When connecting to the front node. Should I do something about that?

@guillaumeeb
Member

Correct, would you open a PR with this and other suggestions? Happy to review.

Yes I'm planning to do this once everything works fine.

PR opened at #28.

Still waiting for https://aai-dev.egi.eu/federation/egi/services: my service is still in Deployment in Progress status. Maybe I've done something wrong. @sebastian-luna-valero, let me know if you have any hint.

I also tested manual scaling using IM Dashboard on this new deployment, this worked well.

@sebastian-luna-valero
Collaborator

However, it is still pending (Deployment in progress status). Do you know how much time it can take (it's already been about 30min)?

My bad, we should use https://aai.egi.eu/federation (instead of https://aai-dev.egi.eu/federation). Then select Development in the Integration Environment option. This was actually correct in https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md

Please try again, and this should solve the issue (i.e. after a few seconds, the service will be automatically Deployed).

So the platform seems to be working (Jupyterhub and Dask-gateway), however, I don't see any scaling up when I ask for more pods.

I am double checking why that would be the case in: grycap/clues#114 (comment)

*** System restart required ***

Ideally we should update and restart all the VMs in the cluster before the initial deployment, and then periodically to apply updates to the underlying operating system. This implies a downtime, so I would do it immediately before the workshop and immediately afterwards.
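For each node, that boils down to the usual drain/upgrade/uncordon cycle, roughly as sketched below (using the worker name from the earlier kubectl get nodes output as an example):

# from the front node: stop scheduling on the worker and evict its pods
sudo kubectl drain vnode-1.localdomain --ignore-daemonsets --delete-emptydir-data
# on the worker VM itself: apply pending updates and reboot
sudo apt update && sudo apt full-upgrade -y && sudo reboot
# once the worker is back: allow scheduling again
sudo kubectl uncordon vnode-1.localdomain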

I also tested manual scaling using IM Dashboard on this new deployment, this worked well.

Great!

@guillaumeeb
Member

Please try again, and this should solve the issue (i.e. after a few seconds, the service will be automatically Deployed).

This worked! Thanks a lot.

So that leaves us with the Elastic functionality not working. I just tried again by scaling a Dask cluster, but I still have pods that are not able to be scheduled due to insufficient resources, and no new nodes incoming.

@sebastian-luna-valero maybe we should open a new issue in https://github.com/grycap/clues rather than adding a comment in an existing issue? cc @micafer.

Everyone should be able to log in to the https://pangeo-elastic.vm.fedcloud.eu deployment. I've just noted a display error using Dask-gateway when displaying the cluster object in the notebook, but it's probably some minor bug.

@micafer

micafer commented Oct 3, 2022

So that leaves us with the Elastic functionality not working. I just tried again by scaling a Dask cluster, but I still have pods that are not able to be scheduled due to insufficient resources, and no new nodes incoming.

@sebastian-luna-valero maybe we should open a new issue in https://github.com/grycap/clues rather than adding a comment in an existing issue? cc @micafer.

Please send me an email with the detailed problem and we can try to debug the issue.

@tinaok
Collaborator Author

tinaok commented Oct 4, 2022

@guillaumeeb

Everyone should be able to log in to the https://pangeo-elastic.vm.fedcloud.eu deployment. I've just noted a display error using Dask-gateway when displaying the cluster object in the notebook, but it's probably some minor bug.

I logged in and it looks great! Thank you @guillaumeeb!! I didn't test Dask yet, but I do not see the cloud bucket? Is it the same Pangeo notebook Docker image as the one we used for the https://pangeo-foss4g.vm.fedcloud.eu/ infrastructure?

@guillaumeeb
Member

You mean the S3 browser on the left side bar? Yes, not sure why it is not there.

I used pangeo/ml-notebook in the last available version. See the yaml file in https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md. I cannot check right now, but we might have used the pangeo-notebook image in the other deployment. I thought that ml-notebook was more complete, but I may well be wrong. This can easily be changed!

Other than that, it's the exact same Daskhub deployment, so there won't be more functionality. As the elastic part of Kubernetes is not working currently, there is no point in using this deployment instead of the other one.

@sebastian-luna-valero
Collaborator

I think https://github.com/IBM/jupyterlab-s3-browser needs to be added explicitly (and we could update https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md accordingly)
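If we go that way, it is just an extra JupyterLab extension in the user image; something along these lines (exactly where this runs, image build vs. post-start hook, is an assumption on my side):

pip install jupyterlab-s3-browser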

We are currently trying to solve the elastic k8s option, and we will report back here.

In the meantime, manual scaling up and down is the best option.

@tinaok
Collaborator Author

tinaok commented Oct 4, 2022

Hi, I think the foss4g configuration was based on the pangeo-notebook Docker image and not on the ml-notebook; the purpose was not to use too many resources. @j34ni or @annefou, can you please confirm?

@sebastian-luna-valero

Once the automatic scaling up works, will the existing DaskHub need to be destroyed and re-created to benefit from it?

@j34ni
Collaborator

j34ni commented Oct 4, 2022

Yes, it used the pangeo-notebook:latest

@sebastian-luna-valero
Collaborator

Yes, we would need to redeploy to get the elasticity.

@guillaumeeb
Member

I just redeployed the Daskhub with the pangeo/pangeo-notebook Docker image, and the S3 Object Storage Browser tab is back!

So now we're in the same setup, we just need to wait and see if we manage to get elasticity working.

Also, I still encounter an error when starting a dask-gateway cluster, but this does not prevent using it. I'll open an issue on the pangeo-docker repo to get some feedback. This is probably due to the latest version of the image. We can pin a previous one if needed.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
File /srv/conda/envs/notebook/lib/python3.9/site-packages/IPython/core/formatters.py:921, in IPythonDisplayFormatter.__call__(self, obj)
    919 method = get_real_method(obj, self.print_method)
    920 if method is not None:
--> 921     method()
    922     return True

File /srv/conda/envs/notebook/lib/python3.9/site-packages/dask_gateway/client.py:1225, in GatewayCluster._ipython_display_(self, **kwargs)
   1223 widget = self._widget()
   1224 if widget is not None:
-> 1225     return widget._ipython_display_(**kwargs)
   1226 else:
   1227     from IPython.display import display

AttributeError: 'VBox' object has no attribute '_ipython_display_'
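This looks like the widget display path calling _ipython_display_, which recent ipywidgets releases removed; if it gets annoying before a fix lands upstream, a possible workaround (an untested assumption on my side) is to pin an older ipywidgets in the image or user environment:

pip install "ipywidgets<8"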

@guillaumeeb
Member

And I confirm that the ml-notebook image does not have the s3-browser installed, see pangeo-data/pangeo-docker-images#383.

@sebastian-luna-valero
Collaborator

Do we want to keep pangeo/ml-notebook in https://github.com/pangeo-data/pangeo-eosc/blob/main/EGI.md or should we replace it with pangeo/pangeo-notebook?

@guillaumeeb
Member

We should replace it for now, feel free to do it!

@sebastian-luna-valero
Collaborator

Sure! #29

@guillaumeeb
Member

Just for information, I'm going to delete the pangeo-elastic infrastructure and create a new one using the operational IM Dashboard instance, after discussing with @micafer.

@guillaumeeb
Member

I just redeployed the pangeo-elastic infrastructure. I see an improvement, but there are still things to debug before elasticity is fully working.
