
[Bug] Issues with Ray execution on the M1 CPU #67

Open
1 of 2 tasks
roytman opened this issue May 4, 2024 · 3 comments
Labels: bug Something isn't working

roytman (Member) commented May 4, 2024

Search before asking

  • I searched the issues and found no similar issues.

Component

Library/kfp

What happened + What you expected to happen

In this release, we moved from the CodeFlare SDK to the Ray API Server, and I now observe various error/warning messages in the Ray logs; see below.
The messages may stem from wrong API parameters or from the internal Ray API Server implementation.

From the Ray API Server pod logs:

  • W0504 07:30:13.041268 1 interceptor.go:17] Get compute template failure: NotFoundError: Compute template noop-kfp--78783-head-template not found: configmaps "noop-kfp--78783-head-template" not found. (It looks like the server tries to access the template before it was created; see the sketch at the end of this section.)

  • W0504 07:56:26.660498 1 warnings.go:70] unknown field "spec.headGroupSpec.template.metadata.creationTimestamp"
    W0504 07:56:26.660565 1 warnings.go:70] unknown field "spec.workerGroupSpecs[0].template.metadata.creationTimestamp"
    W0504 07:56:26.660585 1 warnings.go:70] unknown field "status.desiredCPU"
    W0504 07:56:26.660599 1 warnings.go:70] unknown field "status.desiredGPU"
    W0504 07:56:26.660630 1 warnings.go:70] unknown field "status.desiredMemory"
    W0504 07:56:26.660648 1 warnings.go:70] unknown field "status.desiredTPU"
    W0504 07:56:26.680745 1 cluster_server.go:43] Failed to get cluster's event, cluster: kubeflow/noop-kfp--1d2d3, err: No Event with RayCluster name noop-kfp--1d2d3

  • I0504 07:57:47.189239 1 interceptor.go:14] /proto.RayJobSubmissionService/SubmitRayJob handler starting
    {"level":"info","v":0,"logger":"jobsubmissionservice","message":"RayJobSubmissionService submit job"}
    [controller-runtime] log.SetLogger(...) was never called; logs will not be displayed.
    Detected at:

    goroutine 1775 [running]:
    runtime/debug.Stack()
    /usr/lib/golang/src/runtime/debug/stack.go:24 +0x65
    sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/log/log.go:60 +0xcd
    sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithValues(0xc00042e3c0, {0x0, 0x0, 0x0})
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/log/deleg.go:168 +0x54
    github.com/go-logr/logr.Logger.WithValues(...)
    /opt/app-root/src/go/pkg/mod/github.com/go-logr/logr@v1.2.4/logr.go:323
    sigs.k8s.io/controller-runtime/pkg/log.FromContext({0x1d4b538?, 0xc000a86030?}, {0x0, 0x0, 0x0})
    /opt/app-root/src/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.16.3/pkg/log/log.go:98 +0xfd
    github.com/ray-project/kuberay/ray-operator/controllers/ray/utils.(*RayDashboardClient).SubmitJobReq(0xc000751100, {0x1d4b538, 0xc000a86030}, 0xc000ac96f0?, 0x0)
    /workspace/ray-operator/controllers/ray/utils/dashboard_httpclient.go:299 +0x91
    github.com/ray-project/kuberay/apiserver/pkg/server.(*RayJobSubmissionServiceServer).SubmitRayJob(0xc000164090, {0x1d4b538, 0xc000a86030}, 0xc00049a0f0)
    /workspace/apiserver/pkg/server/ray_job_submission_service_server.go:89 +0x484
    github.com/ray-project/kuberay/proto/go_client._RayJobSubmissionService_SubmitRayJob_Handler.func1({0x1d4b538, 0xc000a86030}, {0x1975500?, 0xc00049a0f0})
    /workspace/proto/go_client/job_submission_grpc.pb.go:166 +0x78
    github.com/ray-project/kuberay/apiserver/pkg/interceptor.ApiServerInterceptor({0x1d4b538, 0xc000a86030}, {0x1975500, 0xc00049a0f0}, 0xc00044e6e0, 0xc0008
    > .....

  • A successfully finished Ray job returns:
    > 00:59:16 INFO - Exception running ray remote orchestration
    Initialization failure from server:
    Traceback (most recent call last):
    File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/util/client/server/proxier.py", line 711, in Datapath
    raise RuntimeError(
    RuntimeError: Starting Ray client server failed. See ray_client_server_23000.err for detailed logs.

There are no errors in ray_client_server_23000.err, but in ray_client_server.err we can see some info:

ray_client_server.err.zip
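
Regarding the first warning (compute template not found), it looks like the RayCluster create call races the creation of the head compute-template ConfigMap. A minimal client-side sketch of a possible workaround, assuming the template ConfigMap name and the kubeflow namespace seen in the logs above, is to poll for the ConfigMap before creating the cluster:

```python
import time

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def wait_for_compute_template(name: str, namespace: str, timeout_s: int = 60) -> None:
    """Poll until the compute-template ConfigMap exists (hypothetical workaround)."""
    config.load_kube_config()  # use config.load_incluster_config() when running in a pod
    core = client.CoreV1Api()
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            core.read_namespaced_config_map(name, namespace)
            return  # template ConfigMap is present; safe to create the RayCluster
        except ApiException as exc:
            if exc.status != 404:
                raise
            time.sleep(2)
    raise TimeoutError(f"ConfigMap {name!r} not found in {namespace!r} after {timeout_s}s")


# Names taken from the log messages above; adjust for the actual run.
wait_for_compute_template("noop-kfp--78783-head-template", "kubeflow")
```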

Reproduction script

Run the noop pipeline and check the Ray server logs
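
To collect the logs referenced above after running the pipeline, one option is to read the API server pod logs with the Kubernetes Python client. This is a minimal sketch in which the kubeflow namespace and the app.kubernetes.io/name=kuberay-apiserver label selector are assumptions about the deployment:

```python
from kubernetes import client, config


def dump_apiserver_logs(namespace: str = "kubeflow",
                        label_selector: str = "app.kubernetes.io/name=kuberay-apiserver") -> None:
    """Print the logs of the Ray API server pod(s); namespace and label are assumptions."""
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(namespace, label_selector=label_selector)
    for pod in pods.items:
        print(f"--- {pod.metadata.name} ---")
        print(core.read_namespaced_pod_log(pod.metadata.name, namespace))


dump_apiserver_logs()
```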

Anything else

No response

OS

Other

Python

3.11

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
roytman added the bug label on May 4, 2024
daw3rd changed the title from "Unmaturity of Ray API Server [Bug]" to "[Bug] Immaturity of Ray API Server" on May 6, 2024
blublinsky (Collaborator) commented:

Sorry, I am not convinced. ray_client_server_23000.err is a Ray error, not an API server one; see ray-project/ray#19792 for an explanation.

roytman (Member, Author) commented May 7, 2024

Most (if not all) of the mentioned issues are related to the execution of Ray on the Apple M1 CPU.
In the meantime, I have renamed the issue.

roytman changed the title from "[Bug] Immaturity of Ray API Server" to "[Bug] Issues with Ray execution on the M1 CPU" on May 7, 2024
blublinsky (Collaborator) commented:

@roytman do we still need this?
