ci: Add conditional CI for the containerd backed image store #46402

rumpl · 2023-09-05T08:23:36Z

replaces / closes [don't merge] ci: enable TEST_INTEGRATION_USE_SNAPSHOTTER #45232

- What I did

Added CI for the containerd integration

Any PR that has the label containerd-integration will also run tests with the snapshotters

Example PR without the needed label, note the skipped jobs rumpl#138

- How I did it

Moved everything from test.yml into a reusable workflow in .test.yml and used that to run tests with graph drivers and containerd snapshotters. Also added a workflow to test snapshotters in windows.

- How to verify it

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

rumpl · 2023-09-05T09:36:45Z

Note: we can't do the same in jenkins (filter on the label), we don't have the required plugins that would let us do that. We could have a check on the branch name perhaps? Check that the branch contains c8d or something like that. It's not ideal but would kinda do the trick.

Another possibility is to run the unit/integration tests (~15 minutes run) for all pull requests, but make it pass no matter what.

Any preferences?

crazy-max

Note that we have currently 98 jobs. With this change we would have 98 + 45 (.test) + 21 (.windows) = 164 which would increase queue drastically for any repos on moby org (such as buildkit).

.github/workflows/.test.yml

rumpl · 2023-09-05T10:00:44Z

Note that we have currently 98 jobs. With this change we would have 98 + 45 (.test) + 21 (.windows) = 164 which would increase queue drastically for any repos on moby org (such as buildkit).

True, but these new jobs don't run for all PRs, only for those that they are needed

rumpl · 2023-09-05T11:34:59Z

True, but these new jobs don't run for all PRs, only for those that they are needed

This is true... for now. At one point it would be great to test both graph drivers and containerd snapshotters at the same time (once the CI is green).

neersighted · 2023-09-06T15:56:37Z

.github/workflows/.test.yml

+      use_snapshotter:
+        required: true
+        type: boolean
+      graph_driver:


Suggested change

graph_driver:

storage_driver:

Since the flag is actually --storage-driver.

🤦 changed it!

.github/workflows/.test.yml

TBBle · 2023-09-07T14:14:18Z

I don't think it's actually testing the containerd image store. (Very late edit: I meant on Windows here, but didn't write it....)

When I tried something similar (see master...TBBle:moby:hack/containerd_image_store^) I found that the env-var TEST_INTEGRATION_USE_SNAPSHOTTER didn't affect the tests as desired because we register the Docker daemon as a service. So I had to set registry keys to affect the daemon's environment. It was pretty easy to do with reg between registering the daemon as a service and starting that service.

Anyway, you should see Enabling containerd snapshotter through the $TEST_INTEGRATION_USE_SNAPSHOTTER environment variable. This should only be used for testing. in the daemon logs if it's testing the containerd image store (like this), and I don't see that in, e.g., https://github.com/moby/moby/actions/runs/6109262248/job/16580260658?pr=46402.

Another sign it's not using the snapshotter is that it should have failed all of the tests during setup due to (I think) the wildly incorrect WCOW mount code in containerd 1.6 (which I contributed, AFAIR) vendored into Docker. That was replaced with good code in containerd 1.7.2, so I think getting anything meaningful out of the testsuite will depend on #45966.

You can also see a couple of quick-hack code changes I had to get the daemon even start, to deal with two issues: One was that when the containerd storage backend is enabled, the network setup code for builder-next assumes we're on Linux and hence uses an explicit network mode instead of "auto". The other is that the graph driver name (also used to select snapshotter) is different between Docker (windowsfilter) and Containerd (windows). Windows doesn't honour either the env-var or config option to control the graphdriver name, so you can't override the latter one from the CI script.

I also added the matrix in a different spot, but that should not affect any of the above.

Side thought: Are we actually running the Linux build through CI with TEST_INTEGRATION_USE_SNAPSHOTTER anywhere? I don't see in-tree configs, e.g., the Jenkinsfile, mentioning that env-var. I'm assuming containerd snapshotter support is not on-by-default yet; Jenkins doesn't appear to capture the daemon logs, but I see a couple of SKIPs due to !testEnv.UsingSnapshotter() so I assume the main Jenkins run is testing graphdrivers.

rumpl · 2023-09-07T14:36:20Z

Another sign it's not using the snapshotter is that it should have failed all of the tests during setup due to (I think) the wildly incorrect WCOW mount code in containerd 1.6 (which I contributed, AFAIR) vendored into Docker.

I did at one point manage to make it build/run and run a container:

But I never pushed the code because it was a quick hack to see what's needed

Side thought: Are we actually running the Linux build through CI with TEST_INTEGRATION_USE_SNAPSHOTTER anywhere? I don't see in-tree configs, e.g., the Jenkinsfile, mentioning that env-var. I'm assuming containerd snapshotter support is not on-by-default yet; Jenkins doesn't appear to capture the daemon logs, but I see a couple of SKIPs due to !testEnv.UsingSnapshotter() so I assume the main Jenkins run is testing graphdrivers.

For Linux we have a draft PR that @vvoland opened and rebases from time to time, which is not ideal. That's why I opened this PR. I didn't update Jenkins because we don't have the plugins needed to filter with PR labels as I do in this PR. So basically, we don't yet test on linux but this PR changes that.

I don't think it's actually testing the containerd image store.

Thanks, let me double check that!

TBBle · 2023-09-13T14:33:23Z

The logs from the relevant run have expired, but the problem I was alluding to is triggered by the Docker integration test suite setup code, not usual container-running operations. I assume COPY in the classic builder relies on being able to mount the container image locally, and so that should show the same failure. Apart from host-mounting, I think all the relevant stuff landed in the 1.6 series, so running containers and Dockerfiles without COPY and similar should work approximately as-is. (That's what I'd hoped to see in my own tests, but the testsuite setup failing denied me that pleasure.)

rumpl · 2023-09-19T08:25:38Z

I have rebased this PR and removed the windows part, at least with this we will have CI run on linux/amd64

TBBle

Comments, but nothing majorly blocking as far as I can see, just some possible opportunities to reduce future work a little bit (and a typo).

My only real concern here is the artifact/code-cov question. The other questions/ideas are Windows-specific and since Windows isn't being enabled here, they can be punted to whoever does that, if you prefer.

TBBle · 2023-09-19T11:27:32Z

.github/workflows/.windows.yml

+            If ("${{ inputs.use_snapshotter }}" -eq "true") {
+              echo "TEST_INTEGRATION_USE_SNAPSHOTTER=1" | Out-File -FilePath $Env:GITHUB_ENV -Encoding utf-8 -Append
+              echo "DOCKER_GRAPHDRIVER=windows" | Out-File -FilePath $Env:GITHUB_ENV -Encoding utf-8 -Append
+            }


Unless things have changed since last time I looked, this change won't have any effect:

The daemon gets its env-vars from the service setup, created during the register-service call, and Go's CreateService API doesn't appear to support populating env-vars. If there was a command-line option for this, that would work, as the command-line passed to the daemon during --register-service is bundled into the service configuration. The best option appears to be editing the service definition between registration and starting it. (Edit: The FIXME comment was added after I wrote this, so at least it's explicit this doesn't currently work now)

The daemon doesn't honour DOCKER_GRAPHDRIVER or DOCKER_DRIVER on Windows. I'm not sure if there's any intention to do so, either, but it might be valuable once containerd image store works, as there's possibly going to be multiple Windows snapshotters available.

So I'd suggest moving this block down between the Start-Process ... --register-service and the Start-Service calls now, and dropping DOCKER_GRAPHDRIVER, as that'll have to be done later anyway. I'm not sure if the test suite actually cares about TEST_INTEGRATION_USE_SNAPSHOTTER, but no harm in setting it when also setting the registry key for the service, I guess? (I did so in my version, but don't recall if I checked for it having any effect or not)

.github/workflows/windows-2019.yml

.github/workflows/windows-2022.yml

TBBle · 2023-09-19T11:41:48Z

.github/workflows/.windows.yml

+      use_snapshotter: # always false for now until we update containerd
+        required: false
+        type: boolean
+        default: false


I think we should skip the runtime==builtin axis of the two matrix jobs in this file, when this is true. use_snapshotter makes no sense when not using containerd, and since it's started from a separate optional job, simply ignoring use_snapshotter in that case will needlessly duplicate the integration test run (and fail).

I'm not sure how to do that in GitHub Workflows though. Maybe you can exclude from the matrix using the job inputs?

Buildkit has something like this https://github.com/moby/buildkit/blob/master/.github/workflows/validate.yml#L22-L60

I'm sure I can find something with a little help from @crazy-max

Also, we can do this in a followup since we are not calling this with containerd snapshotters enabled any more

I just looked more-closely at your link to BuildKit, and wow, today I learned that GitHub Actions lets you do crimes.

But yeah, should be feasible to mess with the matrix to insert an appropriate exclude clause based on the workflow inputs.

.github/workflows/.test.yml

TBBle · 2023-09-23T04:55:57Z

For the record, I was curious how the test suite would go on Windows now that containerd 1.7.6 has been vendored, so I rebased my version of this PR onto the mainline branch.

Sadly, it still fails at the first action during busybox build (RUN mkdir C:\tmp && mkdir C:\bin), and the logs aren't super-helpful there (unexpected EOF), so not sure what went wrong. There's like a 60% chance it has something to do with WCOW stdio though, since that's the main problem source for all of the Docker/Windows/Containerd work. ^_^ (Edit: Worked this one out)

Merged logs of failed step

2023-09-23T04:13:57.9142767Z MOBY [Information] image pulled [traceID=20a7c6128d72d420a4d28266b1803fd3 spanID=9d94b34734d8a189 digest=sha256:d227d7ab11f4aa0da7779adc31616924cf0b15846a10313d83a644541208f80b remote=mcr.microsoft.com/windows/servercore:ltsc2022]
2023-09-23T04:13:57.915340600Z CTRD level=debug msg="delete image" name="moby-dangling@sha256:d227d7ab11f4aa0da7779adc31616924cf0b15846a10313d83a644541208f80b"
2023-09-23T04:13:57.9285776Z Digest: sha256:d227d7ab11f4aa0da7779adc31616924cf0b15846a10313d83a644541208f80b
2023-09-23T04:13:57.9287309Z Status: Downloaded newer image for mcr.microsoft.com/windows/servercore:ltsc2022
2023-09-23T04:13:57.9563826Z MOBY [Information] debug: [BUILDER] Cache miss: [cmd /S /C mkdir C:\tmp && mkdir C:\bin]
2023-09-23T04:13:57.9563826Z MOBY [Information] debug: [BUILDER] Command to be executed: [cmd /S /C mkdir C:\tmp && mkdir C:\bin] [traceID=20a7c6128d72d420a4d28266b1803fd3 spanID=9d94b34734d8a189]
2023-09-23T04:13:57.9654515Z  ---> d227d7ab11f4
2023-09-23T04:13:57.9654911Z Step 6/13 : RUN mkdir C:\tmp && mkdir C:\bin
2023-09-23T04:13:57.973779100Z CTRD level=debug msg="stat snapshot" key="sha256:313b663315113261ec8ee4b9b3652b3504b995aa3fd9d3f63fdebf2316a3cfe0"
2023-09-23T04:13:57.985329600Z CTRD level=debug msg="prepare snapshot" key=d63a977fa40b8866ec74a25901c5b5d85bc669de6eaea25d398cd4d8b7943ad9-init-key parent="sha256:313b663315113261ec8ee4b9b3652b3504b995aa3fd9d3f63fdebf2316a3cfe0"
2023-09-23T04:13:57.992154900Z CTRD level=debug msg=createSnapshot
2023-09-23T04:13:58.011612200Z CTRD level=debug msg="event published" ns=moby topic=/snapshot/prepare type=containerd.events.SnapshotPrepare
2023-09-23T04:13:58.0955145Z MOBY [Information] debug: Calling proc (1)
2023-09-23T04:13:58.0962106Z MOBY [Information] debug: Calling proc (2)
2023-09-23T04:13:58.186150800Z CTRD level=debug msg="garbage collected" d=8.4145ms
2023-09-23T04:13:58.1874698Z unexpected EOF
2023-09-23T04:13:58.7239645Z ##[error]Process completed with exit code 1.
2023-09-23T04:13:58.7583587Z ##[group]Run $ErrorActionPreference = "SilentlyContinue"

But at least it's pretty easy to get it into a "testable" state, as the only CI things in my branch that this PR doesn't have is the TODO-marked Windows enablement.

I might rebase my branch onto this PR later to verify that, and provide a base for further exploration, if I find time to play with it before this PR is merged.

Edit: I did the rebase, it was relatively simple. Manually triggered a Windows 2022 run and it produces the same result as above.

@rumpl if you wish to steal TBBle@7d4ffa1 or part thereof for this PR, you're welcome to. If not, that's fine, it can sit in my branch, and once this PR lands, I'll probably put it up in a draft PR for visibility along with whatever hacks I have to get the Windows-side (closer) to parity.

Final edit: With enough hacks, I got that branch to actually run the testsuite on Windows. Still waiting for the last few to complete, but it looks like it'll be pretty close to the pass-rate of Linux (from this PR), which is nice. So there'll be at least a draft PR (due to all the hacks) for the Windows side of this coming in the future, once this one lands.

Final final edit: #46539 created, since it's now in a good-enough state for initial review, and is mostly pending on this PR to merge.

Post-edit: That 60% guess was wrong, it was calling a nil function-type value and panicking.

thaJeztah · 2023-09-29T10:57:47Z

For future reference / future self; we discussed this PR in Yesterday's maintainers call;

currently, this PR requires the containerd-integration label to be set on PRs for it to run
☝️ (probably) adding the label after the PR was opened won't trigger it "after-the-fact", but (we can give it some testing once merged) a "restart CI" or good'ol open-and-close should fix that
we also wanted to update this PR, and have it run against master / main: we know it's "red" (for now), but having CI run on all commits merged into master makes it more visible to track "what's left" on the containerd-integration work
BUT we also discussed that we should add back some CI -status badges in the README, and we don't want master to always be red, so I asked @rumpl to move this CI to a separate file, so that it can get its own badge (separate from "non-containerd-intregration")

I should also add that this is a large diff, and "lotsa YAML", so I'll give it another look, but in doing so will somewhat depend on prior reviews of those that already happened, and I'll ask @crazy-max (as our in-house GHA guru) to give it a final "once-over".

thaJeztah · 2023-09-29T10:59:31Z

@vvoland FYI; I added "closes" #45232 (as I think we can close that one once this is merged)

TBBle · 2023-09-29T14:10:26Z

.github/workflows/snapshotter-test.yml

@@ -0,0 +1,32 @@
+name: snapshotter-test


Since it's an analogue with test but using the snapshotter, should it be test-snapshotter? Then it should sort adjacent-to but below the test workflow, I hope.

This occurred to me when I was thinking about whether we'd want to do split out the snapshotter test job for windows-2022 the same way, and if so, what to call it.

Also, thinking about future naming if and when the snapshotter becomes the default, i.e. in test, what will the non-snapshotter test be named? non-snapshotter-test is very English-like, but sorts weirdly, and the most important aspect (test) is way over at the end.

We could call them test-graphdriver and test-snapshotter

We could call them test-graphdriver and test-snapshotter

Works for me 👍

(GitHub grouping / listing checks remains horrible in either case 😂)

Guess it's too much to ask to have this sorter alphabetically

thaJeztah · 2023-09-29T15:12:11Z

Thanks for splitting! Gonna have a look; one quick thing if you're updating/pushing; there's a typo in the first commit (s/vlaidate/validate/);

CI: move smoke, vlaidate, test and build-dev jobs

Moving these jobs to reusable workflows to declutter the test.yaml workflow, this will also help reuse these workflows to run tests with the container-snapshotter feature enabled Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>

This workflow will run all the tests (unit, integration and integration-cli) with the containerd-snapshotter feature enalbed. We only run these tests for pull requests that have the "containerd-integration" label and also on any other branch/tag. Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>

rumpl · 2023-09-29T16:14:34Z

Thanks for splitting! Gonna have a look; one quick thing if you're updating/pushing; there's a typo in the first commit (s/vlaidate/validate/);
CI: move smoke, vlaidate, test and build-dev jobs

Fixed that typo and another one in the second commit that you didn't see :)

TBBle · 2023-09-29T16:38:13Z

Passing thought: Are there unit tests affected by the TEST_INTEGRATION_USE_SNAPSHOTTER or DOCKER_GRAPHDRIVER env-vars? If not, can we skip them in snapshotter-test? If so, should they be, or is that a future task to refactor so-affected unit tests to always cover both variations as long as they're supported, disirregardless of the environment they run under?

Prompted because unit tests are about 25% of the GHA runtime, and serialise ahead of the integration tests, so avoiding needless duplication seems valuable, although this would lead to the fascinating result of the snapshotter integration tests starting (and completing?) before the graphdriver integration tests.

thaJeztah · 2023-09-29T17:18:57Z

Passing thought: Are there unit tests affected by the TEST_INTEGRATION_USE_SNAPSHOTTER or DOCKER_GRAPHDRIVER env-vars? If not, can we skip them in snapshotter-test? If so, should they be, or is that a future task to refactor so-affected unit tests to always cover both variations as long as they're supported, disirregardless of the environment they run under?

I don't think the unit-tests should be affected by this, but (honestly) can't say with 100% certainty (I know that some unit-tests are not-as-unit-as-they-should be (looking at you: "networking unit-tests that depend on iptables").

I definitely think we need to look at the matrices as a whole, as I think there's some unneeded duplication (as well as looking if we can make them more efficient; some checks depend "too much" on others, which means having to wait for others to complete before being able to restart them, whereas others are fast to run (but relatively high overhead for building the initial parts).

All that said; I think it's ok to look at that in follow-ups (but we could consider creating a tracking ticket for this).

(Also trying to give @rumpl some sleep after waking up in the middle of night, dreaming of YAML and more YAML) 😂

rumpl · 2023-09-29T18:08:43Z

Sooo much yaml! 😭😂

But you are right @TBBle, we might be able to run unit tests only once. I’m off for Dockercon next week so won’t have time to take a closer look, so maybe we should review this as is and we can optimize later? The double unit tests only hit those that open a PR in the snapshotter case; ok on master too but no one really waits for that I guess.

Hold off merging this though, now I kinda want to take a look at this this weekend while on the plane.

On top of this there are other optimisations we can do to our CI but I would say they are out of scope of what I want to accomplish here but will definitely get back to it while the ~~pain~~ memory is still fresh

crazy-max

I think we could merge test-graphdriver and test-snapshotter together and create a dynamic matrix if snapshotter is enabled. That would reduce changes footprint significantly but needs to make sure it's possible. I'm looking at it on another branch and keep you posted.

crazy-max · 2023-09-30T00:13:12Z

.github/workflows/.smoke.yml

+# reusable workflow
+name: .smoke


Why do we need a reusable workflow for smoke tests? It seems only called from test-graphdriver.

Maybe instead we could just move this to a regular workflow named test-smoke.yml as it's not tied to graphdriver or snapshotter.

I assume the idea is that if smoke fails, we can abort the run early. That's what the name suggests to me, anyway. So perhaps it should be called from both test-runs?

crazy-max · 2023-09-30T00:15:12Z

.github/workflows/.test.yml

+      -
+        name: Set up tracing
+        uses: ./.github/actions/setup-tracing


Don't think we need to set up tracing for unit tests? cc @cpuguy83

Correct we aren't doing any tracing in unit tests.

crazy-max · 2023-09-30T00:17:48Z

.github/workflows/.windows-build.yml

+# reusable workflow
+name: .windows-build


Why can't we just keep this in .windows.yml? We already have a bunch of workflows, would be nice to avoid unnecessary split.

crazy-max · 2023-09-30T00:19:24Z

.github/workflows/.windows.yml

@@ -249,7 +174,6 @@ jobs:
    runs-on: ${{ inputs.os }}
    timeout-minutes: 120
    needs:
-      - build


Related to previous comment, this is racy as we could call this reusable workflow without having a dependency on build so I think it's best to keep it build here.

crazy-max · 2023-09-30T00:20:44Z

.github/workflows/.test.yml

nit: I think it should be a rename of test.yml > test-graphdriver.yml instead

AFAIK in Git, renames are detected automatically at diff time, and target minimising diff size. In this case, the bulk of what was in test.yml ended up in .test.yml so it picked that as a rename.

crazy-max · 2023-09-30T01:18:57Z

@rumpl Opened #46572 as an alternative. Let me know what you think.

crazy-max · 2023-09-30T10:11:52Z

.github/workflows/test-snapshotter.yml

+  build-dev:
+    needs:
+      - validate-dco
+    uses: ./.github/workflows/.build-dev.yml


build-dev is called from test-graphdriver and test-snapshotter which would make the cache racy.

rumpl · 2023-10-16T08:36:40Z

Closing in favor of #46572

rumpl added the containerd-integration Issues and PRs related to containerd integration label Sep 5, 2023

rumpl force-pushed the ci-snapshotter-test branch 2 times, most recently from 8841e54 to 9e9af81 Compare September 5, 2023 08:26

rumpl marked this pull request as ready for review September 5, 2023 09:33

crazy-max reviewed Sep 5, 2023

View reviewed changes