Cache extracted layers #3850

dgageot · 2022-10-11T13:50:35Z

When a linuxkit image is built, the longest part is to merge image layers into single tarballs.

If the image is in the docker daemon, we docker run the image and docker export the resulting container.
If the image is in the linuxkit cache, we merge the layers into a single tar file.

In both cases, we could cache the result locally. It would basically be a digest->single tarball cache.

I've prototyped this and my build goes from 26s to 9s.

wdyt?


deitch · 2022-10-11T14:24:15Z

You are talking about pretty much this line here where we call src.TarReader(), where src points to the image in either the docker image cache or the linuxkit image cache.

Is it worth it? Sure, subsequent builds go from 26s to 9s, but the initial build still has to convert to layers and create the tar, which then gets cached.

The trade-off is the complexity. We would need to implement it on our own, as the cache mechanism we are using, ggcr, does not have support for caching that, as far as I know. How would you indicate what the cache entry actually is?

For example, when I currently try to get linuxkit/init:97b398b5deab3fc62531fae833085c19d9f92a67 it points to sha256:32908513b9ad552eeab6720a69723b01d5e2a46fc67bd821b277fef3d6272de0, which is an index, and goes on from there.

Let's say that we create the flattened image tar for amd64. That tar file has a sha256sum, so we could store it in the OCI layout under ~/.linuxkit/cache/blobs/sha256/, which fits with the oci-layout spec just fine.

What would we use to reference that linuxkit/init:97b398b5deab3fc62531fae833085c19d9f92a67 has a flattened image that it should use for amd64? Some annotation in the index.json? Here is the entry from my current cache for that init image:

      {
         "mediaType": "application/vnd.oci.image.index.v1+json",
         "size": 1785,
         "digest": "sha256:32908513b9ad552eeab6720a69723b01d5e2a46fc67bd821b277fef3d6272de0",
         "annotations": {
            "org.opencontainers.image.ref.name": "linuxkit/init:97b398b5deab3fc62531fae833085c19d9f92a67"
         }
      },

I can see the lookup process going from linuxkit/init:97b398b5deab3fc62531fae833085c19d9f92a67 to the index at sha256:32908513b9ad552eeab6720a69723b01d5e2a46fc67bd821b277fef3d6272de0, from which I can read the index to find the manifest for amd64, and from there the config and layers.
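The first hop of that lookup, from image name to index digest, could be sketched like this. It assumes index.json has a top-level `manifests` array of descriptors shaped like the entry above; the type and function names (`layoutDescriptor`, `layoutIndex`, `resolveRef`) are made up for the example and are not ggcr or linuxkit APIs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// layoutDescriptor mirrors the fields of the index.json entry shown above.
type layoutDescriptor struct {
	MediaType   string            `json:"mediaType"`
	Size        int64             `json:"size"`
	Digest      string            `json:"digest"`
	Annotations map[string]string `json:"annotations,omitempty"`
}

// layoutIndex is the part of index.json this sketch cares about.
type layoutIndex struct {
	Manifests []layoutDescriptor `json:"manifests"`
}

const refNameAnnotation = "org.opencontainers.image.ref.name"

// resolveRef (hypothetical helper) returns the first descriptor whose
// ref.name annotation equals ref.
func resolveRef(indexJSON []byte, ref string) (layoutDescriptor, error) {
	var idx layoutIndex
	if err := json.Unmarshal(indexJSON, &idx); err != nil {
		return layoutDescriptor{}, err
	}
	for _, d := range idx.Manifests {
		if d.Annotations[refNameAnnotation] == ref {
			return d, nil
		}
	}
	return layoutDescriptor{}, fmt.Errorf("no entry for %q in index.json", ref)
}
```

The returned digest names a blob under blobs/sha256/, which is the index to read next when looking for the amd64 manifest.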

What would the process be for going from linuxkit/init:97b398b5deab3fc62531fae833085c19d9f92a67 to the flattened tar?

One other thing to keep in mind is that the linuxkit cache is expected to last another 12-18 months or so. Docker image cache is finally moving to containerd under the covers, which means support for multi-arch indexes and images stored in their native OCI format (not just expanded layers). It is experimental, so I expect another 6-9 months for them to have all the features we need, and another equal amount for sufficient adoption that we can drop the linuxkit cache entirely (which will make me very happy).

If we can find a way to do this sanely, then by all means.

dgageot · 2022-10-11T16:12:05Z

@deitch thanks for all the feedback.

First, I should be more clear about why I'd like to do that. On Docker Desktop, we have pretty heavy images that contain a lot of packages. When we rebuild the application after a change, we have a lot of caching to avoid doing the same thing twice. However, as soon as we have changed the code of a single package, one of the images has to be rebuilt and we'd like this to be as quick as possible.

With the change I'd like to see, all the images of all the packages, except the single one that needs to be rebuilt, are already in cache, whether it's the on-disk cache of linuxkit or a docker daemon. However, each of those images is present in the form of layers that we need to merge again and again, on each build.

I'd like to cache that result, by image. On my machine, this accelerates a no-op build from 16s to 9s and that's pretty important because 99% of my builds will be able to leverage that cache.

dgageot · 2022-10-11T16:12:35Z

I pushed some demo code that is flawed in many ways here

deitch · 2022-10-11T18:50:57Z

I get the purpose of it. You are saying that it is a frequent usage. Let's work with that.

The design you propose in #3851 says, when I need an image, say, linuxkit/init:f0c103b4550bd6b84fa4cdd3abb04ab63c13a0a8, I do the following:

1. Look for a cached flattened image of it; if I find it, use it.
2. If I cannot find it (like now), create a flattened tar image from the layers, but save that, so the next time I come back to step 1, I can just read the flattened image.

It becomes a branch, rather than a simple, "read image as tar stream of flattened filesystem with layers applied."

Where do we "look for a cached flattened image"? Is it in lkt cache, blobs/sha256/<sum of flattened tar stream>?
How do we save the mapping of "image name to blob of flattened stream"? The PR looks like it is just saving a key of the index to its flattened output?

dgageot · 2022-10-11T19:20:16Z

> Look for a cached flattened image of it; if I find it, use it.
>
> If I cannot find it (like now), create a flattened tar image from the layers, but save that, so the next time I come back to step 1, I can just read the flattened image.

Yes. All I added is the step that saves the expensive computation of the flattened image.

> Where do we "look for a cached flattened image"? Is it in lkt cache, blobs/sha256/<sum of flattened tar stream>?

My goal was something like ~/.linuxkit/flatten/sha256.tar.

> How do we save the mapping of "image name to blob of flattened stream"? The PR looks like it is just saving a key of the index to its flattened output?

From the image name, I get an image ID or digest. This ID/digest becomes the key to the blob.

deitch · 2022-10-12T07:46:00Z

OK, I think I see where you are going. It does make sense. I would change the storage approach.

I definitely would store them in the linuxkit cache. Let's continue our example from above, working with linuxkit/init:f0c103b4550bd6b84fa4cdd3abb04ab63c13a0a8. I actually created a flattened tar of it, comes out to be 794def3d3d731c5c5d1831ab362e2aa53556b0bcf8e1db1e5ff69d3a96ef84a6 (we probably could gzip it, which would make it different, but good enough for now).

Thinking this through, I would:

1. Save that tar file in ~/.linuxkit/cache/blobs/sha256/794def3d3d731c5c5d1831ab362e2aa53556b0bcf8e1db1e5ff69d3a96ef84a6, which fits exactly with what all the other blobs are.
2. Create an "index" to the flattened blobs that looks like an OCI index (has references to multiple archs) but is not.
3. Store a reference to the "index" in index.json using OCI Artifacts.

Since the only customer for this OCI layout is linuxkit, we can do this without worrying about compatibility with other clients.

At first blush, I would think about having an index.json entry like this. I made up the mediaTypes to use here, so they are subject to discussion.

      // this is the existing one and points to the OCI image index
      {
         "mediaType": "application/vnd.oci.image.index.v1+json",
         "size": 1785,
         "digest": "sha256:32908513b9ad552eeab6720a69723b01d5e2a46fc67bd821b277fef3d6272de0",
         "annotations": {
            "org.opencontainers.image.ref.name": "linuxkit/init:97b398b5deab3fc62531fae833085c19d9f92a67"
         }
      },

      // this is the new one we add and points to the "index of flattened tars"
      {
         "mediaType": "application/vnd.oci.image.index.flattened.v1+json",
         "size": 1785,
         "digest": "sha256:e24fb1cd1bd9346ca452853611e3e1688dc7ec3197fa91b4ef2726282c803028",
         "annotations": {
            "org.opencontainers.image.ref.name": "linuxkit/init:97b398b5deab3fc62531fae833085c19d9f92a67"
         }
      },

And the "index of flattened tars" looks like:

{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.flattened.v1+json",
  "manifests": [
    {
      "mediaType": "application/vnd.oci.flattened.filesystem.tar",
      "size": 20509696,
      "digest": "sha256:794def3d3d731c5c5d1831ab362e2aa53556b0bcf8e1db1e5ff69d3a96ef84a6",
      "platform": {
        "architecture": "amd64",
        "os": "linux"
      }
    }
  ]
}
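Reading that "index of flattened tars" and picking the blob for a platform could be sketched as below. The struct and function names are hypothetical, and the mediaTypes are the made-up ones from the example above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flattenedPlatform and flattenedEntry mirror the JSON shown above.
type flattenedPlatform struct {
	Architecture string `json:"architecture"`
	OS           string `json:"os"`
}

type flattenedEntry struct {
	MediaType string            `json:"mediaType"`
	Size      int64             `json:"size"`
	Digest    string            `json:"digest"`
	Platform  flattenedPlatform `json:"platform"`
}

type flattenedIndex struct {
	SchemaVersion int              `json:"schemaVersion"`
	MediaType     string           `json:"mediaType"`
	Manifests     []flattenedEntry `json:"manifests"`
}

// findFlattened (hypothetical helper) returns the entry for the requested
// OS and architecture, or an error if no flattened tar is cached for it.
func findFlattened(raw []byte, osName, arch string) (flattenedEntry, error) {
	var idx flattenedIndex
	if err := json.Unmarshal(raw, &idx); err != nil {
		return flattenedEntry{}, err
	}
	for _, e := range idx.Manifests {
		if e.Platform.OS == osName && e.Platform.Architecture == arch {
			return e, nil
		}
	}
	return flattenedEntry{}, fmt.Errorf("no flattened tar for %s/%s", osName, arch)
}
```

A miss here would send the build back to the regular path: flatten the layers for that platform, store the tar as a blob, and add an entry to this index.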

dgageot mentioned this issue Oct 11, 2022

Store extracted layers into a cache #3851

Closed
