Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn about potential incomplete layer downloads in cache #3748

Draft
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

ben-krieger
Copy link

- What I did

I encountered an issue with "invalid index" after stopping an image pull with an interrupt.

$ linuxkit build linuxkit.yml
Extract kernel image: docker.io/linuxkit/kernel:5.10.76
Image docker.io/linuxkit/kernel:5.10.76 not found in local cache, pulling
^C                                                                                                                     

$ linuxkit build linuxkit.yml
Extract kernel image: docker.io/linuxkit/kernel:5.10.76
Image docker.io/linuxkit/kernel:5.10.76 not found in local cache, pulling
FATA[0022] Failed to extract kernel image and tarball: Could not pull image docker.io/linuxkit/kernel:5.10.76: invalid index       

The error wasn't very useful and it took me awhile to realize that a file in ~/.linuxkit/cache/blobs/sha256 was smaller than it should be.

Verbose output also wasn't particularly useful.

$ linuxkit -v build linuxkit.yml
time="2021-12-22T14:42:41-05:00" level=debug msg="Formats selected: [kernel+initrd]"
time="2021-12-22T14:42:41-05:00" level=debug msg="validating output: [kernel+initrd]"
time="2021-12-22T14:42:41-05:00" level=info msg="Extract kernel image: docker.io/linuxkit/kernel:5.10.76"
time="2021-12-22T14:42:41-05:00" level=debug msg="image tar: docker.io/linuxkit/kernel:5.10.76 "
time="2021-12-22T14:42:45-05:00" level=debug msg="ImagePull to cache docker.io/linuxkit/kernel:5.10.76 trusted reference docker.io/linuxkit/kernel:5.10.76"
time="2021-12-22T14:42:49-05:00" level=info msg="Image docker.io/linuxkit/kernel:5.10.76 not found in local cache, pulling"

...

time="2021-12-22T14:42:56-05:00" level=debug msg="<-- 200 https://production.cloudflare.docker.com/registry-v2/docker/registry/v2/blobs/sha256/47/476eaf3f642da71f069d62e736164bf85db1148c8ce95b98a58362b2e4aaf4a6/data?verify=1640205176-A2ksdpXKc%2Bw8H%2FWiix7BR8ncUdc%3D (1.088881975s)"
time="2021-12-22T14:42:56-05:00" level=debug msg="HTTP/2.0 200 OK"
time="2021-12-22T14:42:56-05:00" level=debug msg="Content-Length: 1104"
time="2021-12-22T14:42:56-05:00" level=debug msg="Accept-Ranges: bytes"
time="2021-12-22T14:42:56-05:00" level=debug msg="Age: 16095"
time="2021-12-22T14:42:56-05:00" level=debug msg="Cache-Control: public, max-age=14400"
time="2021-12-22T14:42:56-05:00" level=debug msg="Cf-Cache-Status: HIT"
time="2021-12-22T14:42:56-05:00" level=debug msg="Cf-Ray: 6c1bd611296d252c-SJC"
time="2021-12-22T14:42:56-05:00" level=debug msg="Content-Type: application/octet-stream"
time="2021-12-22T14:42:56-05:00" level=debug msg="Date: Wed, 22 Dec 2021 19:42:56 GMT"
time="2021-12-22T14:42:56-05:00" level=debug msg="Etag: \"adf1587a8c32779ccdefbcf1afa52da3\""
time="2021-12-22T14:42:56-05:00" level=debug msg="Expect-Ct: max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\""
time="2021-12-22T14:42:56-05:00" level=debug msg="Expires: Wed, 22 Dec 2021 23:42:56 GMT"
time="2021-12-22T14:42:56-05:00" level=debug msg="Last-Modified: Mon, 08 Nov 2021 11:42:36 GMT"
time="2021-12-22T14:42:56-05:00" level=debug msg="Server: cloudflare"
time="2021-12-22T14:42:56-05:00" level=debug msg="Vary: Accept-Encoding"
time="2021-12-22T14:42:56-05:00" level=debug msg="X-Amz-Id-2: kGzFjLMCjX9GuoD9YdupXF0LdlfMiyaDeMXhHkIFN2KAtiw2YyBPXwt+HBZ7Qu16ZI9m6ioWVG8="
time="2021-12-22T14:42:56-05:00" level=debug msg="X-Amz-Request-Id: QHAW76336Z6C1SMN"
time="2021-12-22T14:42:56-05:00" level=debug msg="X-Amz-Version-Id: oBXtWVpolFH2CQUH.GPUM2jge_XkdbUR"
time="2021-12-22T14:42:56-05:00" level=debug
time="2021-12-22T14:42:56-05:00" level=debug msg="{\"architecture\":\"arm64\",\"config\":{\"Env\":[\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\"],\"WorkingDir\":\"/\",\"Labels\":{\"org.mobyproject.linuxkit.kernel.buildimage\":\"linuxkit/alpine:2be490394653b7967c250e86fd42cef88de428ba\",\"org.opencontainers.image.revision\":\"56c08df66bc4496dd856e588f9559224ace08279\",\"org.opencontainers.image.source\":\"https://github.com/linuxkit/linuxkit\"},\"ArgsEscaped\":true,\"OnBuild\":null},\"created\":\"2021-11-08T11:39:57.833975129Z\",\"history\":[{\"created\":\"2021-11-08T11:39:57.833975129Z\",\"created_by\":\"ENTRYPOINT []\",\"comment\":\"buildkit.dockerfile.v0\",\"empty_layer\":true},{\"created\":\"2021-11-08T11:39:57.833975129Z\",\"created_by\":\"CMD []\",\"comment\":\"buildkit.dockerfile.v0\",\"empty_layer\":true},{\"created\":\"2021-11-08T11:39:57.833975129Z\",\"created_by\":\"WORKDIR /\",\"comment\":\"buildkit.dockerfile.v0\",\"empty_layer\":true},{\"created\":\"2021-11-08T11:39:57.833975129Z\",\"created_by\":\"COPY /out/* / # buildkit\",\"comment\":\"buildkit.dockerfile.v0\"}],\"os\":\"linux\",\"rootfs\":{\"type\":\"layers\",\"diff_ids\":[\"sha256:bc347e483ed23251bd80081de38c54c305b79c45150a9c87cfb9a79296bdb594\"]}}"
time="2021-12-22T14:43:01-05:00" level=fatal msg="Failed to extract kernel image and tarball: Could not pull image docker.io/linuxkit/kernel:5.10.76: invalid index"

- How I did it

I added logs to the vendored code for github.com/google/go-containerregistry/pkg/v1/validate until it was clear that a tar archive was being created and then read. The tar.Reader for this archive spat out an io.ErrUnexpectedEOF on a call to Next(). When moving to the header of the next file, it hit the end of the archive before discarding enough bytes. The header was correct, but the content (an image layer) was incomplete.

- How to verify it

See in "What I did" and start with a clean cache. Technically, you need to time it so that a large layer is being downloaded, but this should almost always be the case.

With this branch, you will see the new output with a warning.

$ make local && ./bin/linuxkit build linuxkit.yml 
make -C ./src/cmd/linuxkit local
make[1]: Entering directory '/home/bkrieger/go/src/github.com/linuxkit/linuxkit/src/cmd/linuxkit'
gofmt...
govet...
golint...
ineffassign...
CGO_ENABLED=0 go build -o /home/bkrieger/go/src/github.com/linuxkit/linuxkit/bin/linuxkit  --ldflags "-X github.com/li"
go test -mod=vendor ./...
?       github.com/linuxkit/linuxkit/src/cmd/linuxkit   [no test files]
?       github.com/linuxkit/linuxkit/src/cmd/linuxkit/cache     [no test files]
?       github.com/linuxkit/linuxkit/src/cmd/linuxkit/docker    [no test files]
?       github.com/linuxkit/linuxkit/src/cmd/linuxkit/initrd    [no test files]
ok      github.com/linuxkit/linuxkit/src/cmd/linuxkit/moby      0.011s
?       github.com/linuxkit/linuxkit/src/cmd/linuxkit/pad4      [no test files]
ok      github.com/linuxkit/linuxkit/src/cmd/linuxkit/pkglib    0.154s
?       github.com/linuxkit/linuxkit/src/cmd/linuxkit/registry  [no test files]
?       github.com/linuxkit/linuxkit/src/cmd/linuxkit/spec      [no test files]
?       github.com/linuxkit/linuxkit/src/cmd/linuxkit/util      [no test files]
?       github.com/linuxkit/linuxkit/src/cmd/linuxkit/version   [no test files]
make[1]: Leaving directory '/home/bkrieger/go/src/github.com/linuxkit/linuxkit/src/cmd/linuxkit'
Extract kernel image: docker.io/linuxkit/kernel:5.10.76
WARN[0004] A file in your linuxkit cache seems to be incomplete. You may want to try clearing your cache. 
WARN[0008] A file in your linuxkit cache seems to be incomplete. You may want to try clearing your cache. 
Image docker.io/linuxkit/kernel:5.10.76 not found in local cache, pulling
WARN[0019] A file in your linuxkit cache seems to be incomplete. You may want to try clearing your cache. 
FATA[0019] Failed to extract kernel image and tarball: Could not pull image docker.io/linuxkit/kernel:5.10.76: invalid index

- Description for the changelog

Warn about potential incomplete layer downloads in cache

- A picture of a cute animal (not mandatory but encouraged)

( )_( )
(='.'=)
(")_(")

Signed-off-by: Ben Krieger <ben.krieger@intel.com>
Signed-off-by: Ben Krieger <ben.krieger@intel.com>
@ben-krieger
Copy link
Author

The root cause of this issue is that downloaded layers aren't written to a temp location and then moved after complete download and verification. I would be happy to contribute this fix, but given the numerous ways to implement this and side effects it could have (i.e. creating more directories or using /tmp), I thought it would be better to wait for a discussion. Would you like me to raise a separate issue?

@sprat
Copy link
Contributor

sprat commented Dec 22, 2021

I also encounter this issue many times yesterday, probably due to HTTP connections dropped in the middle of transfer... and then the cache gets corrupted.

@ben-krieger
Copy link
Author

ben-krieger commented Dec 22, 2021

One option would be to automatically delete undersized layers in the cache like this:

diff --git a/src/cmd/linuxkit/cache/write.go b/src/cmd/linuxkit/cache/write.go
index 172f6fdfd..083e902dc 100644
--- a/src/cmd/linuxkit/cache/write.go
+++ b/src/cmd/linuxkit/cache/write.go
@@ -13,7 +13,7 @@ import (
        "github.com/containerd/containerd/reference"
        "github.com/google/go-containerregistry/pkg/authn"
        "github.com/google/go-containerregistry/pkg/name"
-       "github.com/google/go-containerregistry/pkg/v1"
+       v1 "github.com/google/go-containerregistry/pkg/v1"
        "github.com/google/go-containerregistry/pkg/v1/layout"
        "github.com/google/go-containerregistry/pkg/v1/match"
        "github.com/google/go-containerregistry/pkg/v1/partial"
@@ -40,15 +40,14 @@ func (p *Provider) ImagePull(ref *reference.Spec, trustedRef, architecture strin
        }
        log.Debugf("ImagePull to cache %s trusted reference %s", image, pullImageName)
 
-       // unless alwaysPull is set to true, check locally first
-       if !alwaysPull {
-               imgSrc, err := p.ValidateImage(ref, architecture)
-               if err == nil && imgSrc != nil {
-                       log.Printf("Image %s found in local cache, not pulling", image)
-                       return imgSrc, nil
-               }
-               // there was an error, so try to pull
+       // always check locally in order to check for incomplete layers
+       if imgSrc, err := p.ValidateImage(ref, architecture); !alwaysPull && err == nil && imgSrc != nil {
+               log.Printf("Image %s found in local cache, not pulling", image)
+               return imgSrc, nil
+       } else if strings.Contains(err.Error(), "unexpected EOF") {
+               // TODO: delete last existing layer of index, since it is undersized
        }
+
        log.Printf("Image %s not found in local cache, pulling", image)
        remoteRef, err := name.ParseReference(pullImageName)
        if err != nil {

Obviously we need to be careful about deleting things that fail verification, since users expect verification to protect them. However, in this case, it might be justifiable.

Still, never storing a partial file in the cache from the beginning would be better.

@ben-krieger
Copy link
Author

@ben-krieger
Copy link
Author

One solution, though ugly, would be to simply modify the vendored copy of go-containerregistry. This has the downside of being difficult to maintain through updates, but it's a very small fix.

diff --git a/src/cmd/linuxkit/vendor/github.com/google/go-containerregistry/pkg/v1/layout/write.go b/src/cmd/linuxkit/vendor/github.com/google/go-containerregistry/pkg/v1/layout/write.go
index f912b124e..dae94bf6a 100644
--- a/src/cmd/linuxkit/vendor/github.com/google/go-containerregistry/pkg/v1/layout/write.go
+++ b/src/cmd/linuxkit/vendor/github.com/google/go-containerregistry/pkg/v1/layout/write.go
@@ -231,14 +231,19 @@ func (l Path) WriteBlob(hash v1.Hash, r io.ReadCloser) error {
                // Blob already exists, that's fine.
                return nil
        }
-       w, err := os.Create(file)
+
+       tmpfile := filepath.Join(dir, "."+hash.Hex)
+       w, err := os.Create(tmpfile)
        if err != nil {
                return err
        }
-       defer w.Close()
-
        _, err = io.Copy(w, r)
-       return err
+       _ = w.Close()
+       if err != nil {
+               return err
+       }
+
+       return os.Rename(tmpfile, file)
 }

@ben-krieger
Copy link
Author

ben-krieger commented Dec 25, 2021

I discussed this with another developer and they suggested an alternative that I think is better than either of my prior two suggestions:

We could add a linuxkit cache autoclean command to go with linuxkit cache clean. Instead of clearing the entire cache, it looks for incomplete layers and deletes them. If in the future downloaded layers are prefixed with a . and renamed upon completion, autoclean could instead clean up these temporary files.

I think this suggestions works well with the current PR, as the warning message may then include a recommended resolution - running linuxkit cache autoclean - which doesn't require any caveats, like "this may require you to download a large amount of data again."

Thoughts on this approach?

@djs55
Copy link
Contributor

djs55 commented Dec 29, 2021

Thanks for the analysis! I'd prefer not to modify the vendored code here, although if someone made a PR upstream and it was merged, I'd be happy to vendor the merge commit.

Since the ultimate fix has to be in google/go-containerregistry , I like @ben-krieger 's proposal to create a linuxkit cache autoclean to check for incomplete files and a warning message to suggest running it. (I assume running it automatically is a bad idea because it would be too expensive to check the shas of everything?)

@ben-krieger
Copy link
Author

@djs55 Thanks for weighing in.

As I started playing around to see what the implementation would look like, I found myself adding to and improving (*cache.Provider).ValidateImage to do the checks for undersized layers before calling go-containerregistry's validate functions. It's getting pretty large, messy, and duplicative, so I might actually try to get it upstreamed.

I think I'll have more success getting Improve error message from "unexpected EOF" to "blob size is smaller than manifest claims it should be" upstreamed than Use temporary files when storing blobs, but I'll try both.

If neither of those get accepted, I'll push my changes to ValidateImage and from there we can take a final look to see if it appears better to automatically delete undersized layers during downloads or to have a new subcommand find them and do it.

@deitch
Copy link
Collaborator

deitch commented Dec 29, 2021

I like both of the ideas mentioned above:

  • submit a fix upstream - I actually do not think it should be a tmp dir, but rather that, if you are downloading cache/blobs/sha256/abcd123, then you should download to cache/blobs/sha256/abcd123.download and, when complete and sha-verified, rename it. Either way, this should go upstream.
  • lkt cache autoclean - very much so. We have wanted to expand some of the lkt cache subcommands for a while. I think I prefer something a little more descriptive than autoclean, though. Perhaps sweep or remove-invalid-files (yeah, that is horrible, but gets the idea across), or simply validate?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants