KServe performance tutorial failed #1410

Open
kalantar opened this issue Feb 27, 2023 · 2 comments
Labels
kind/bug Something isn't working

Comments

@kalantar
Member

Describe the bug
Running the tutorial failed. The report had no output. The logs show:

% kubectl logs job.batch/default-1-job 
time=2023-02-27 18:25:23 level=info msg=task 1: run: started
time=2023-02-27 18:25:23 level=info msg=task 1: run: completed
time=2023-02-27 18:25:23 level=info msg=task 2: http: started
time=2023-02-27 18:25:23 level=error msg=fortio failed stack-trace=below ... 
::Trace:: lookup sklearn-irisv2.default.svc.cluster.local on 10.96.0.10:53: no such host
time=2023-02-27 18:25:23 level=error msg=failed to get results since fortio run was aborted
time=2023-02-27 18:25:23 level=error msg=task 2: http: failure
Error: lookup sklearn-irisv2.default.svc.cluster.local on 10.96.0.10:53: no such host

Referenced service seems to exist.

To Reproduce
Run tutorial.

Additional context
On retry, the inference service took more than 5 minutes to become ready. It may be that a failed readiness check did not cause the experiment to fail.
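
One way to guard against the "no such host" failure in the log above is to wait until the service name resolves before starting the load test. The following is a minimal Python sketch of that idea and is not part of Iter8 or the tutorial. The function name `wait_for_host` and all of its parameters are hypothetical, and the 10-minute default timeout is only an assumption based on the startup time of more than 5 minutes observed here.

```python
import socket
import time


def wait_for_host(host, timeout_s=600.0, interval_s=5.0,
                  resolve=socket.gethostbyname,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll DNS until `host` resolves or `timeout_s` elapses.

    Hypothetical helper: returns True once the name resolves,
    False if the deadline passes first.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            resolve(host)
            return True
        except OSError:
            # socket.gaierror (name not found) is a subclass of OSError.
            if clock() >= deadline:
                return False
            sleep(interval_s)
```

For example, `wait_for_host("sklearn-irisv2.default.svc.cluster.local")` could be called from inside the cluster before the http task runs. The `resolve`, `sleep`, and `clock` arguments exist only so the loop can be exercised without real DNS or real waiting.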

kalantar added the kind/bug (Something isn't working) label on Feb 27, 2023
@kalantar
Member Author

It appears that the basic readiness check works. I set the timeout to 10s and increased the logging level. I repeatedly see:

time=2023-02-27 21:35:44 level=trace msg=looking for resource (serving.kserve.io/v1beta1) inferenceservices: sklearn-irisv2 in namespace default
time=2023-02-27 21:35:44 level=trace msg=looking for condition: Ready
time=2023-02-27 21:35:44 level=error msg=condition status not True
followed by
time=2023-02-27 21:35:44 level=error msg=task 1: ready: failure

iter8 k report correctly identifies a failed task/experiment:

Experiment summary:
*******************

  Experiment completed: false
  No task failures: false
  Total number of tasks: 4
  Number of completed tasks: 0
  Number of completed loops: 1

@kalantar
Member Author

Copy of Slack comment:

I wonder if Fortio's

  -allow-initial-errors
        Allow and don't abort on initial warmup errors

should be exposed as a parameter in the http task. This might be a simple "fix" worth trying ... of course, more "warmup" behavior can also be defined in the task if this is insufficient.
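
To illustrate the behavior being proposed, here is a minimal Python sketch of a load loop that either aborts on an error during an initial warmup phase or tolerates it, depending on a flag. This is only a model of the idea described by the flag's help text, not Fortio's implementation and not the Iter8 http task; `run_load`, `send`, `warmup_requests`, and `allow_initial_errors` are all hypothetical names.

```python
def run_load(send, total_requests, warmup_requests, allow_initial_errors):
    """Call `send` total_requests times and return the number of errors.

    An error among the first `warmup_requests` calls aborts the run
    (re-raises) unless `allow_initial_errors` is True. Errors after the
    warmup phase are always counted rather than raised.
    """
    errors = 0
    for i in range(total_requests):
        try:
            send()
        except Exception:
            in_warmup = i < warmup_requests
            if in_warmup and not allow_initial_errors:
                raise  # abort, as in the failed run reported in this issue
            errors += 1
    return errors
```

With `allow_initial_errors=True`, a service that is not yet resolvable on the first request would contribute to the error count instead of aborting the whole experiment. Whether that is sufficient depends on how long the service takes to become reachable.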
