build image step of tutorial fails due to kaniko issue on bare metal k3s #252

Open
vliaskov opened this issue Sep 27, 2021 · 1 comment

@vliaskov

To reproduce:

FuseML (HEAD f09f8679) was installed successfully with fuseml-installer install on k3s running on an x86_64 bare-metal machine.

Fuseml Environment
Platform: k3s
Kubernetes Version: v1.21.4+k3s1

The mlflow-e2e tutorial https://fuseml.github.io/docs/v0.2/tutorials/ was followed, with no problems up to and including step 6.

Actual behavior:

The trainer step of the pipeline times out with ImagePullBackOff because it cannot find the image that the builder step should have produced.
Problem: according to the builder step log, the builder neither builds nor pushes the image, because of a kaniko issue:

requesting list . done
invalid API response status 500
mlflow/trainer:754795333 not found in registry.fuseml-registry, building...
kaniko should only be run inside of a container, run with the --force flag if you are sure you want to continue

Step completed

Possibly related upstream kaniko issue: GoogleContainerTools/kaniko#1542
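
To spot this failure without reading the whole log by hand, a step log can be scanned for the kaniko message quoted above. This is only a minimal sketch: the function name kaniko_refused_to_run is hypothetical, and the log text is the excerpt from this report, assumed to be already available as a string.

```python
# Marker text copied from the builder step log in this report.
KANIKO_CONTAINER_CHECK = "kaniko should only be run inside of a container"


def kaniko_refused_to_run(log_text: str) -> bool:
    """Return True if any log line contains kaniko's container-check message.

    Hypothetical helper, not part of FuseML or kaniko.
    """
    return any(KANIKO_CONTAINER_CHECK in line for line in log_text.splitlines())


# Log excerpt as reported in this issue.
builder_log = """requesting list . done
invalid API response status 500
mlflow/trainer:754795333 not found in registry.fuseml-registry, building...
kaniko should only be run inside of a container, run with the --force flag if you are sure you want to continue

Step completed
"""

if kaniko_refused_to_run(builder_log):
    print("builder step did not build the image: kaniko refused to run")
```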

Expected behavior:

All steps of the pipeline run successfully.

A possible secondary issue: the builder step status is shown as Completed in the Tekton status tab even though the image was never built. Let me know if a separate bug should be created for this.

container: step-builder
imageID: >-
  ghcr.io/fuseml/mlflow-builder@sha256:18c38d8c09765a9b3b52f6822a28602e8bc944b9343168fd548309ad5ac07a3f
name: builder
terminated:
  containerID: >-
    containerd://aeadae25fa1f4782090c03efb5cf8a40b6aa6e7143f4d96f6be09930a59f5ef9
  exitCode: 0
  finishedAt: '2021-09-27T13:07:51Z'
  message: >-
    [{"key":"image","value":"127.0.0.1:30500/mlflow/trainer:754795333","type":"TaskRunResult"}]
  reason: Completed
  startedAt: '2021-09-27T13:07:51Z' 
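
The status above can also be inspected programmatically. The following is a rough sketch, assuming the step status has already been loaded into a Python dict with the same fields as shown; the helper name summarize_step is hypothetical. It parses the JSON in the message field and computes the step duration. A build step that reports exit code 0 with a duration of zero seconds, as here, is a hint that no image was actually built.

```python
import json
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def summarize_step(status: dict) -> dict:
    """Summarize a terminated step status (hypothetical helper).

    Assumes the dict mirrors the fields of the status shown in this issue.
    """
    term = status["terminated"]
    started = datetime.strptime(term["startedAt"], TIME_FORMAT)
    finished = datetime.strptime(term["finishedAt"], TIME_FORMAT)
    # The message field holds a JSON list of {"key", "value", "type"} entries.
    results = {r["key"]: r["value"] for r in json.loads(term.get("message") or "[]")}
    return {
        "exit_code": term["exitCode"],
        "reason": term["reason"],
        "duration_seconds": (finished - started).total_seconds(),
        "results": results,
    }


# Values copied from the status reported in this issue.
step_status = {
    "container": "step-builder",
    "name": "builder",
    "terminated": {
        "exitCode": 0,
        "startedAt": "2021-09-27T13:07:51Z",
        "finishedAt": "2021-09-27T13:07:51Z",
        "message": '[{"key":"image","value":"127.0.0.1:30500/mlflow/trainer:754795333","type":"TaskRunResult"}]',
        "reason": "Completed",
    },
}

summary = summarize_step(step_status)
print(summary)
```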
vliaskov added a commit to vliaskov/examples that referenced this issue Sep 29, 2021
…ble.

This is a hack/workaround for the currently unsolved kaniko issue of refusing
to build images on non-container/bare-metal environments, see:
GoogleContainerTools/kaniko#1542

Workaround for fuseml/fuseml#252
@vliaskov

I have set the environment variable named container to the value kube in the builder step of mlflow-e2e as a hack to work around this issue. A proper resolution requires the kaniko issue to be fixed.
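
As an illustration of the workaround, the sketch below adds the variable to a step definition held as a Python dict. It assumes the step accepts a Kubernetes-style env list of name/value entries; the helper with_env and the example step dict are hypothetical and are not taken from the referenced commit.

```python
import copy


def with_env(step: dict, name: str, value: str) -> dict:
    """Return a copy of a step definition with the env var set.

    Hypothetical helper; any existing entry with the same name is replaced.
    """
    patched = copy.deepcopy(step)
    env = [e for e in patched.get("env", []) if e.get("name") != name]
    env.append({"name": name, "value": value})
    patched["env"] = env
    return patched


# Hypothetical builder step; the real mlflow-e2e definition may differ.
builder_step = {"name": "builder", "image": "ghcr.io/fuseml/mlflow-builder"}

# Workaround from this issue: set container=kube on the builder step.
patched_step = with_env(builder_step, "container", "kube")
print(patched_step)
```

The original dict is left untouched, so the patched step can be compared against it before the change is applied anywhere.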
