New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Warn about potential incomplete layer downloads in cache #3748
base: master
Are you sure you want to change the base?
Warn about potential incomplete layer downloads in cache #3748
Conversation
Signed-off-by: Ben Krieger <ben.krieger@intel.com>
Signed-off-by: Ben Krieger <ben.krieger@intel.com>
The root cause of this issue is that downloaded layers aren't written to a temp location and then moved after complete download and verification. I would be happy to contribute this fix, but given the numerous ways to implement this and side effects it could have (i.e. creating more directories or using /tmp), I thought it would be better to wait for a discussion. Would you like me to raise a separate issue? |
I also encounter this issue many times yesterday, probably due to HTTP connections dropped in the middle of transfer... and then the cache gets corrupted. |
One option would be to automatically delete undersized layers in the cache like this: diff --git a/src/cmd/linuxkit/cache/write.go b/src/cmd/linuxkit/cache/write.go
index 172f6fdfd..083e902dc 100644
--- a/src/cmd/linuxkit/cache/write.go
+++ b/src/cmd/linuxkit/cache/write.go
@@ -13,7 +13,7 @@ import (
"github.com/containerd/containerd/reference"
"github.com/google/go-containerregistry/pkg/authn"
"github.com/google/go-containerregistry/pkg/name"
- "github.com/google/go-containerregistry/pkg/v1"
+ v1 "github.com/google/go-containerregistry/pkg/v1"
"github.com/google/go-containerregistry/pkg/v1/layout"
"github.com/google/go-containerregistry/pkg/v1/match"
"github.com/google/go-containerregistry/pkg/v1/partial"
@@ -40,15 +40,14 @@ func (p *Provider) ImagePull(ref *reference.Spec, trustedRef, architecture strin
}
log.Debugf("ImagePull to cache %s trusted reference %s", image, pullImageName)
- // unless alwaysPull is set to true, check locally first
- if !alwaysPull {
- imgSrc, err := p.ValidateImage(ref, architecture)
- if err == nil && imgSrc != nil {
- log.Printf("Image %s found in local cache, not pulling", image)
- return imgSrc, nil
- }
- // there was an error, so try to pull
+ // always check locally in order to check for incomplete layers
+ if imgSrc, err := p.ValidateImage(ref, architecture); !alwaysPull && err == nil && imgSrc != nil {
+ log.Printf("Image %s found in local cache, not pulling", image)
+ return imgSrc, nil
+ } else if strings.Contains(err.Error(), "unexpected EOF") {
+ // TODO: delete last existing layer of index, since it is undersized
}
+
log.Printf("Image %s not found in local cache, pulling", image)
remoteRef, err := name.ParseReference(pullImageName)
if err != nil { Obviously we need to be careful about deleting things that fail verification, since users expect verification to protect them. However, in this case, it might be justifiable. Still, never storing a partial file in the cache from the beginning would be better. |
FYI: This is a known issue in google/go-containerregistry: https://github.com/google/go-containerregistry/blob/7c19fa370dbd0179c9ce73a0432bf826f174c775/pkg/v1/layout/write.go#L247-L248 |
One solution, though ugly, would be to simply modify the vendored copy of go-containerregistry. This has the downside of being difficult to maintain through updates, but it's a very small fix. diff --git a/src/cmd/linuxkit/vendor/github.com/google/go-containerregistry/pkg/v1/layout/write.go b/src/cmd/linuxkit/vendor/github.com/google/go-containerregistry/pkg/v1/layout/write.go
index f912b124e..dae94bf6a 100644
--- a/src/cmd/linuxkit/vendor/github.com/google/go-containerregistry/pkg/v1/layout/write.go
+++ b/src/cmd/linuxkit/vendor/github.com/google/go-containerregistry/pkg/v1/layout/write.go
@@ -231,14 +231,19 @@ func (l Path) WriteBlob(hash v1.Hash, r io.ReadCloser) error {
// Blob already exists, that's fine.
return nil
}
- w, err := os.Create(file)
+
+ tmpfile := filepath.Join(dir, "."+hash.Hex)
+ w, err := os.Create(tmpfile)
if err != nil {
return err
}
- defer w.Close()
-
_, err = io.Copy(w, r)
- return err
+ _ = w.Close()
+ if err != nil {
+ return err
+ }
+
+ return os.Rename(tmpfile, file)
} |
I discussed this with another developer and they suggested an alternative that I think is better than either of my prior two suggestions: We could add a I think this suggestions works well with the current PR, as the warning message may then include a recommended resolution - running Thoughts on this approach? |
Thanks for the analysis! I'd prefer not to modify the vendored code here, although if someone made a PR upstream and it was merged, I'd be happy to vendor the merge commit. Since the ultimate fix has to be in google/go-containerregistry , I like @ben-krieger 's proposal to create a |
@djs55 Thanks for weighing in. As I started playing around to see what the implementation would look like, I found myself adding to and improving I think I'll have more success getting If neither of those get accepted, I'll push my changes to ValidateImage and from there we can take a final look to see if it appears better to automatically delete undersized layers during downloads or to have a new subcommand find them and do it. |
I like both of the ideas mentioned above:
|
- What I did
I encountered an issue with "invalid index" after stopping an image pull with an interrupt.
The error wasn't very useful and it took me awhile to realize that a file in
~/.linuxkit/cache/blobs/sha256
was smaller than it should be.Verbose output also wasn't particularly useful.
- How I did it
I added logs to the vendored code for
github.com/google/go-containerregistry/pkg/v1/validate
until it was clear that a tar archive was being created and then read. Thetar.Reader
for this archive spat out anio.ErrUnexpectedEOF
on a call toNext()
. When moving to the header of the next file, it hit the end of the archive before discarding enough bytes. The header was correct, but the content (an image layer) was incomplete.- How to verify it
See in "What I did" and start with a clean cache. Technically, you need to time it so that a large layer is being downloaded, but this should almost always be the case.
With this branch, you will see the new output with a warning.
- Description for the changelog
Warn about potential incomplete layer downloads in cache
- A picture of a cute animal (not mandatory but encouraged)