
Multi platform images #2273

Draft · bdovaz wants to merge 50 commits into main from dev/multi-platform-images
Conversation

@bdovaz (Collaborator) commented Jan 17, 2023

We continue the discussion from #1553

@nvuillam (Member) commented Jan 21, 2023

🦙 MegaLinter status: ⚠️ WARNING

| Descriptor | Linter | Files | Fixed | Errors | Elapsed time |
|---|---|---|---|---|---|
| ✅ BASH | bash-exec | 5 | | 0 | 0.01s |
| ✅ BASH | shellcheck | 5 | | 0 | 0.09s |
| ✅ BASH | shfmt | 5 | 0 | 0 | 0.37s |
| ✅ COPYPASTE | jscpd | yes | | no | 3.42s |
| ✅ DOCKERFILE | hadolint | 123 | | 0 | 17.86s |
| ✅ JSON | eslint-plugin-jsonc | 23 | 0 | 0 | 2.28s |
| ✅ JSON | jsonlint | 21 | | 0 | 0.19s |
| ✅ JSON | v8r | 23 | | 0 | 14.3s |
| ✅ MAKEFILE | checkmake | 1 | | 0 | 0.01s |
| ⚠️ MARKDOWN | markdownlint | 255 | 0 | 254 | 28.05s |
| ✅ MARKDOWN | markdown-link-check | 255 | | 0 | 5.89s |
| ✅ MARKDOWN | markdown-table-formatter | 255 | 0 | 0 | 29.08s |
| ✅ OPENAPI | spectral | 1 | | 0 | 1.41s |
| ⚠️ PYTHON | bandit | 200 | | 61 | 3.18s |
| ✅ PYTHON | black | 200 | 0 | 0 | 4.78s |
| ✅ PYTHON | flake8 | 200 | | 0 | 1.93s |
| ✅ PYTHON | isort | 200 | 0 | 0 | 0.84s |
| ✅ PYTHON | mypy | 200 | | 0 | 10.74s |
| ✅ PYTHON | pylint | 200 | | 0 | 14.4s |
| ⚠️ PYTHON | pyright | 200 | | 319 | 22.8s |
| ✅ PYTHON | ruff | 200 | 0 | 0 | 0.5s |
| ✅ REPOSITORY | checkov | yes | | no | 35.64s |
| ✅ REPOSITORY | git_diff | yes | | no | 0.37s |
| ⚠️ REPOSITORY | grype | yes | | 1 | 10.34s |
| ✅ REPOSITORY | secretlint | yes | | no | 11.58s |
| ✅ REPOSITORY | trivy | yes | | no | 25.56s |
| ✅ REPOSITORY | trivy-sbom | yes | | no | 1.07s |
| ✅ SPELL | cspell | 665 | | 0 | 26.81s |
| ✅ SPELL | lychee | 335 | | 0 | 3.6s |
| ✅ XML | xmllint | 3 | 0 | 0 | 0.37s |
| ✅ YAML | prettier | 160 | 0 | 0 | 4.92s |
| ✅ YAML | v8r | 102 | | 0 | 161.76s |
| ✅ YAML | yamllint | 161 | | 0 | 1.58s |

See detailed report in MegaLinter reports

MegaLinter is graciously provided by OX Security

@echoix (Collaborator) commented Jan 29, 2023

It isn't ideal for us, but I think what is done elsewhere to work around this is that instead of loading (--load), they push to the registry (--push). What that brings up is what to do when we aren't ready to make a release: since we are building it, we would have to push it either way... which may mean that some retagging and removal of the tags used for building would be needed.
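
A rough sketch of the difference, with placeholder image names rather than the actual MegaLinter build commands:

```bash
# Current approach: build one architecture and load it into the local daemon.
docker buildx build --platform linux/amd64 --tag megalinter:local --load .

# Workaround used elsewhere: build all platforms and push the multi-arch
# manifest straight to a registry, since --load cannot import a multi-arch image.
docker buildx build --platform linux/amd64,linux/arm64 \
  --tag registry.example.com/megalinter:build-test --push .
```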

@nvuillam (Member)

Internal contributors can use push because they have access to credentials, but I wouldn't like forks to be able to push to the MegaLinter images on Docker Hub ^^

@echoix (Collaborator) commented Jan 29, 2023

Of course, but a fork would need to log in to the registry to be able to do that, and it couldn't.

@bdovaz (Collaborator, Author) commented Jan 29, 2023

@echoix I would appreciate anything you can work on to advance this PR, because right now I am quite busy with #2294, which is quite tedious because I have to go through all the linters that do format/autofix.

@bdovaz (Collaborator, Author) commented Feb 26, 2023

@echoix I have rebased, and I am working on this PR to see if I can at least clear up the doubt about which linters will be compatible with arm64 and which will not.

My idea is to have build.py generate a page with a table that lists, for each linter, its compatibility per platform, since I see it as impossible today (and maybe forever) to get all linters to work on arm64.

@echoix (Collaborator) commented Feb 26, 2023

Yep, that's a good idea. Last weekend I stumbled upon the status again and saw that there was some progress on the Docker-side issue in the last few weeks, but what was just released isn't a complete fix. I think it will start to work easily one day, without manually playing with manifests :)

@echoix (Collaborator) commented Feb 26, 2023

Once we get the infrastructure sorted out correctly, my plan is that we could start with only one flavor, even without all linters included, and ramp up from there. That will leave some time to sort things out. What is still bugging me is that I haven't found a way that would allow us to test the flavors quickly in CI. It's a bit risky to ship untested images.

@echoix (Collaborator) commented Feb 26, 2023

How come pyright needs Node for a Python linter?

[screenshot]

@echoix (Collaborator) commented Feb 26, 2023

We can maybe take a look at why the two PowerShell jobs are taking more than an hour to build...
https://github.com/oxsecurity/megalinter/actions/runs/4276668304/jobs/7444926852
[screenshot]

@nvuillam (Member)

@echoix there's probably a blocking prompt somewhere, or a retry loop of 1000 attempts when GitHub API limits are reached.

@echoix (Collaborator) commented Feb 26, 2023

Maybe... but even the raw logs don't contain anything other than waiting for a runner to pick up the job.

@Kurt-von-Laven (Collaborator)

I see #30 [linux/arm64 stage-1 7/19] RUN pwsh -c 'Install-Module -Name PSScriptAnalyzer -RequiredVersion ${PSSA_VERSION} -Scope AllUsers -Force' as the last step in the logs before the job was cancelled. I have tended to rerun our CI a few times to make sure the issue is reproducible. Where are the raw logs you're looking at?

@echoix (Collaborator) commented Feb 27, 2023

During the execution we didn't get any more output unless we had that log section opened while it ran, and for the last 10 minutes (where I had some sections opened) no new log lines were output. In the gear menu there's a link to see the raw logs, but it wasn't available yet. We wanted other CI jobs to run, and they were blocked there.

@echoix (Collaborator) commented Feb 27, 2023

Now we can see what was running:

[screenshot] https://github.com/oxsecurity/megalinter/actions/runs/4276668304/jobs/7444926825

[screenshot] https://github.com/oxsecurity/megalinter/actions/runs/4276668155/jobs/7444927563

On many of the cancelled tasks, we see that there wasn't any output for a long time, more than an hour. I see that the rest of the image was building fine; the amd64 steps were practically finished within a minute (they started at 20:17:11, and 20:18:03 was the second-to-last amd64 step), even when interleaving steps for both platforms together (since the emulation for arm64 takes longer, it will always finish last, but still).

There's probably a crash or some really long emulated step for there to be no output in an hour.

@bdovaz force-pushed the dev/multi-platform-images branch 2 times, most recently from 9b31731 to 5e9c6a1 on March 26, 2023 at 20:26
@waterfoul (Contributor)

Not sure about your PowerShell issue, but I've been trying to do something similar. We have an internal requirement that causes us to rebuild most tools (so we can patch CVEs before the upstream project does), and I noticed that many of the tools that don't support native arm can easily be compiled. For example, any of the Go binaries (of which there are many) can be built using the following Dockerfile snippet:

# syntax=docker/dockerfile:1.5-labs
# Fetch the pinned gitleaks sources using the labs ADD-from-git syntax.
FROM --platform=$BUILDPLATFORM scratch as gitleaks-download
ADD --chown=65532 https://github.com/gitleaks/gitleaks.git#v8.16.2 /gitleaks

# Cross-compile on the build platform; Go targets TARGETARCH via GOARCH.
FROM --platform=$BUILDPLATFORM golang:alpine as build-gitleaks
COPY --link --from=gitleaks-download --chown=65532:65532 /gitleaks /gitleaks
WORKDIR /gitleaks
USER 65532
ENV GOCACHE=/gitleaks/.cache
ARG BUILDARCH
ARG TARGETARCH
ENV GOARCH=$TARGETARCH
RUN --mount=type=cache,id=gopkg-${BUILDARCH},sharing=locked,target=/go/pkg,uid=65532 \
    --mount=target=/gitleaks/.cache,id=go-${BUILDARCH},sharing=shared,type=cache,uid=65532 \
## Patch CVEs
## END Patch
    go mod tidy \
 && go mod vendor

RUN --mount=target=/gitleaks/.cache,id=gitleaks-cache,sharing=shared,type=cache,uid=65532 \
    CGO_ENABLED=0 go build -mod=vendor -o /tmp/gitleaks

# Smoke-test the target-arch binary (runs under emulation when cross-building).
FROM scratch as gitleaks-fs
COPY --link --from=build-gitleaks /tmp/gitleaks /usr/bin/gitleaks
RUN ["/usr/bin/gitleaks", "--help"]

# Final stage: only the verified binary, in clean layers.
FROM scratch as gitleaks
COPY --link --from=gitleaks-fs / /

This will use the native CPU to build the binaries for both platforms, then check them using the target architecture (the final FROM just cleans up the output layers if you build that target). If you don't want to use docker/dockerfile:1.5-labs, you can replace the first stage with any OS and use git clone. To translate this to another project, you just need to change the repo, the project name, and some of the test scripts (--help isn't always supported). If you would like, I could help with this effort; I understand it's WIP, though, so I didn't want to step on any toes.
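
For reference, a hypothetical invocation of that snippet (the file name and tag are placeholders, not part of the original example):

```bash
# Cross-compile gitleaks for both architectures on the native build host;
# the gitleaks-fs stage then smoke-tests each binary under its target arch.
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --target gitleaks \
  --file Dockerfile.gitleaks \
  --tag gitleaks:multiarch \
  .
```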

@echoix (Collaborator) commented Apr 10, 2023

I like your pattern!
Indeed, Go tools are so easy to cross-build!
If you look in the main branch, certain tools written in Go are built in a multi-stage approach in case their latest releases aren't published automatically by CI. If BUILDPLATFORM isn't used in what is in main, my plan was to use it if it was needed. You have a complete, written example here of all the cross-compilation concepts in one place, so it is a great reference and help.

One of my concerns with the multi-arch images is build time. Ideally we would use published, pre-compiled versions for the correct architecture, and only fall back to compiling when needed.

Some of the problems we might need help with:

Finding a way to build and test multi-arch images in CI. On the surface, almost everything is there, but in practice we hit a gray area with Docker that wasn't quite resolved last time I looked. We can build multi-arch images, but cannot load them. It seems we need to push to a registry, then pull one into the local Docker daemon to run it. We would like to avoid the old manifest-based approach of publishing individual images with tags per arch and then having a third tag that is only a manifest linking the two images.

Is there a way to run an image of another arch through QEMU to run a test in the container? (Like making sure it runs; for tools that aren't static single files like Go-based linters, making sure they can actually start.)

Lastly, what is missing before we can start enabling linters for other arches (once the minimal pushing of the images can be done; pushing directly to the registry is a workaround) is to finish parsing the YAML descriptor files so that the default behaviour is followed when overrides aren't used (allowing exceptions to accommodate each linter), and then write out the Dockerfiles (with that build.py script, we usually know what Dockerfile we want to be output). I assume that's outside your specialty. 😃

@echoix (Collaborator) commented Apr 10, 2023

@waterfoul, I forgot to ask about the chown handling in your example. I know about Linux permission modes like 755, 655, 777, 400, etc., but is 65532 a permission or a user? What does it solve, and what were the drawbacks when you implemented it?

I know that we have had, and might still have, issues especially with some node modules that are downloaded with high user ids and cause problems, and the workaround of changing the owner to root is another source of problems, especially when files created by MegaLinter, mounted from the computer of a user running it locally, become undeletable. Is this something that you and your team have encountered and solved?

@waterfoul (Contributor)

> One of my concerns with the multi-arch images is build time. Ideally we would use published and compiled versions for the correct architecture, and only fall back to compiling if needed.

I understand the concern, but there are quite a few things you can leverage to shrink the image if you build it yourself. This example doesn't do those (yet), since I'm still working on shrinking our internal image. What is the longest build time that would be acceptable?

> It seems we need to push to the registry, then pull one into the local docker to run it. We would like to avoid the old manifest-based approach to publishing individual images with tags per arch, then having a third tag that is only a manifest linking the two images.

BuildKit already does this for you if the environment is configured correctly. Try using the flag "--platform linux/amd64,linux/arm64/v8", which should automatically use a local QEMU to run the arm layers, and the final tagged image should already contain both images with the manifest wired up correctly. If you want to build them separately, you can run both builds with different --export-cache/--import-cache (or --cache-from/--cache-to with docker buildx build) registries and no outputs. You can then add flags to import both separate registries for the final build, and it should import the caches and merge the images.
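
A sketch of that split-build variant; the flags are standard buildx options, but the registry and tag names are placeholders:

```bash
# Build each architecture separately, exporting the layer cache to a registry...
docker buildx build --platform linux/amd64 \
  --cache-to type=registry,ref=registry.example.com/megalinter:cache-amd64,mode=max .
docker buildx build --platform linux/arm64/v8 \
  --cache-to type=registry,ref=registry.example.com/megalinter:cache-arm64,mode=max .

# ...then run one final multi-platform build that imports both caches and
# pushes the merged multi-arch manifest.
docker buildx build --platform linux/amd64,linux/arm64/v8 \
  --cache-from type=registry,ref=registry.example.com/megalinter:cache-amd64 \
  --cache-from type=registry,ref=registry.example.com/megalinter:cache-arm64 \
  --tag registry.example.com/megalinter:test --push .
```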

> Finding a way to be able to build and test multi-arch images in CI. At the surface, almost everything is there, but in practice, we hit a gray spot with Docker that wasn't quite resolved last time I looked. We can build multi-arch images, but cannot load them. It seems we need to push to the registry, then pull one into the local docker to run it. We would like to avoid the old manifest-based approach to publishing individual images with tags per arch, then having a third tag that is only a manifest linking the two images.

> Is there a way to run an image of another arch through QEMU to run a test in the container? (Like making sure it runs. Like tools that aren't static, single files like Go based linters, making sure they can actually start)

docker run --platform linux/arm64 should run the arm version of the image
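
For completeness, a minimal sketch of running a foreign-arch image, assuming QEMU/binfmt emulation has been registered on the host:

```bash
# Register QEMU handlers for foreign architectures (one-time, per host).
docker run --privileged --rm tonistiigi/binfmt --install arm64

# Quick sanity check that emulation works: this should print "aarch64".
docker run --rm --platform linux/arm64 alpine uname -m
```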

> I assume that's out of your specialty.

Yes and no. I'll take a look. I have a pretty broad set of skills at this point, so I might be able to help (I was a full-stack app dev before I moved into DevSecOps). There are quite a few things I've found in the Dockerfile build process that I'd like to adjust, so I'll take a look at that.

> I forgot to ask about your chown handling in your example. I know about the Linux permissions styles like 755, 655, 777, 400 etc, but is it a permission or a user that is 65532? What does it solve and what were the drawbacks when you implemented this?

We have a policy that we only run as root when explicitly necessary; we even apply this when building, since any process running in a build is just running inside a container. If you run everything as root, a poisoned binary can be used to escape the build process and potentially be used for nefarious purposes like installing remote shells inside build infrastructure. While difficult, it's not impossible, and running as a non-root user thwarts 99% of attempts, since you need root to access most of what could be used to escape the container (including most Docker engine exploits). 65532 is just a generic user id we used for the MegaLinter build. Making the container itself run under a non-root user would have issues (including those you listed and others) which could be solved, but building as that user in a separate stage that outputs a static binary has no impact on the final image except for the owner id set on the binary. I'd like to eventually help you tackle the final user, but that should be a separate effort.

@Kurt-von-Laven (Collaborator)

This is an exciting thread to see! When the time comes, I have been very slowly making progress on setting a non-root user and running the container as a non-root user in #1985.

@bdovaz (Collaborator, Author) commented Apr 10, 2023

@waterfoul I see you have a lot of experience and ideas, so it makes perfect sense for you to continue this PR if you want to.

@bdovaz (Collaborator, Author) commented Jan 5, 2024

@echoix in #3107 you solved part of the problem, but it seems there are still problems in some workflows, like this one: https://github.com/oxsecurity/megalinter/actions/runs/7419637362/job/20189512158?pr=2273

Can you fix it in another PR? Thanks!

@bdovaz (Collaborator, Author) commented Jan 5, 2024

For this, it needs

https://github.com/echoix/megalinter/pull/28/files#diff-c40ecec00fa976e23241a7188bb7ca5e905d288775c1286f81dd85d1e07ba2f0R41-R48

And

https://github.com/echoix/megalinter/pull/28/files#diff-c40ecec00fa976e23241a7188bb7ca5e905d288775c1286f81dd85d1e07ba2f0R205

And

https://github.com/echoix/megalinter/pull/28/files#diff-c40ecec00fa976e23241a7188bb7ca5e905d288775c1286f81dd85d1e07ba2f0R213-R214

And

https://github.com/echoix/megalinter/pull/28/files#diff-c40ecec00fa976e23241a7188bb7ca5e905d288775c1286f81dd85d1e07ba2f0L217-R226

For the trivy change at the end, I'm not sure.

But the important bits are the local registry: not loading, but pushing (to that registry).
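
A minimal sketch of that local-registry workaround (image names, ports, and builder name are illustrative, not taken from the linked diffs):

```bash
# Run a throwaway registry next to the build job.
docker run -d --name local-registry -p 5000:5000 registry:2

# Use a buildx builder that can reach it (the docker-container driver is
# isolated from the host network by default; it may also need to be allowed
# to talk to the plain-HTTP local registry).
docker buildx create --name multiarch --driver-opt network=host --use

# Build both platforms and push the multi-arch manifest to the local registry
# instead of using --load.
docker buildx build --platform linux/amd64,linux/arm64 \
  --tag localhost:5000/megalinter:test --push .

# Pull one platform back into the local daemon for a smoke test.
docker run --rm --platform linux/arm64 localhost:5000/megalinter:test
```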

It seems that you have it very clear. As I said, can you create a PR with the necessary changes, and then I'll merge main into this branch? That way I don't mix things up, and this PR doesn't drift from its purpose.

Thanks @echoix!

@echoix (Collaborator) commented Jan 5, 2024

I was hoping that @nvuillam could cut a release before changing to Alpine 3.19, this weekend (or even today). I didn't initially plan to also have the upgrade to Python 3.12 in the same point release, but it happened anyway.

I want to be sure that the chances of breaking something aren't too big, and to be able to help users debug by knowing the source of incompatibilities. Even if there are only a couple of days between the releases, at least we could say: go back to this release in the meantime.

@echoix (Collaborator) commented Jan 5, 2024

And I also hoped to continue with the next step of applying the metadata action PR to the other workflows, but after a release, to make sure it didn't impact it. We've now been using it (for dev) for about a month, but it hasn't touched the beta or release workflows.

@echoix (Collaborator) commented Jan 5, 2024

So, you could add the changes I suggested here in the meantime; they're the kind of workarounds that are related to multi-platform images. It will get you unblocked, and they'll land in main at some point via a PR.

@echoix (Collaborator) commented Feb 3, 2024

@bdovaz https://github.blog/changelog/2024-01-30-github-actions-introducing-the-new-m1-macos-runner-available-to-open-source/ !

@nvuillam (Member) commented Feb 7, 2024

Can this announcement be a wake-up call for this long-awaited new feature? :p

@echoix (Collaborator) commented Feb 7, 2024

But we need to manage to do all this with 14 GB of storage and 7 GB of RAM...

@echoix (Collaborator) commented Feb 8, 2024

I'm not sure yet, but at least they're fast. That kind of fast: OSGeo/grass#3395 (comment). For serial-only, unoptimized tests to speed up like that, the raw CPU speed and disk access speed must count for something.
