Skip to content

Latest commit

 

History

History
100 lines (77 loc) · 4.25 KB

2019-04-29-cache.md

File metadata and controls

100 lines (77 loc) · 4.25 KB
created last updated status title authors
2019-04-29
2019-04-29
implemented
Forcing non-cache-hits in the repository cache
aehlig

Abstract

This document describes an extension of the repository build API that can be used to artifically avoid cache hits in the repository cache.

Background

External Repositories

Bazel requires a view of the whole dependency graph. But not everybody organizes their source code in a single repository. Therefore, bazel has the concept of external repositories to specify additional parts of the source tree to be fetched from some other location.

Repository Cache

Several projects, especially related ones, often depend on the same external resources. To avoid re-downloading them for every project, bazel has a cache, shared between all workspaces, of downloaded files, indexed by their hash.

It is good practice, for reasons of security and hermeticity, to specify a hash of any external file needed. Therefore, those hashes are available anyway for most downloads, making the cache work efficiently.

Downloads with non-authoritative hash

Upon updating external dependencies, sometimes it is forgotten to update the hash as well, leading to cache hits not desired by the user (#5045, #6089). This is especially confusing, if the download is only indirectly through a different rule (e.g., maven_jar) where a canonical identifier is available, and the hash is considered only a security measure (#5144, #5250).

Proposal

To avoid such unintended cache hits, it was requested to artifically split the cache. We propose to do it in the following way.

API extension

The repository-context functions download and download_and_extract are extended by an additional optional parameter canonicial_id. If this parameter is set to a non-empty string, a file is taken only from cache, if the a file for the logical pair consisting of hash and canonical_id is already in the logical cache; such a pair enters the logical cache, if a file with that hash is downloaded via a download or download_and_extract request specifying that canonical_id.

On-disk representation of the cache

We do not want to store the same file several times on disk. Moreover, we want a file downloaded with a canonical_id also be available in cache for requests asking for a hash only. Therefore, we still store the files by hash, as we currently do, and only add the additional information by which canonical_ids a file is known by. As the on-disk format uses one directory per hashed file, we can directly add the information there. To avoid having conflicts when updating the list, we use one file per canonical_id. More precisely, we add a file where the name id- followed by the hash of the canonical_id; the content of the file will be the canonical_id itself. As the hash is assummed to be sufficiently secure, there won't be any name clashes, also not with the payload entry called file. Moreover, using (essentially) the hash rather than the id avoids problems with very long canonical_ids and with difficult characters in the canonical_id, including /.

Limitations

The approach of artificially splitting the cache has several limitations users should be aware of.

  • Unintended Fragmentation: if several rules (or variants thereof used by different projects) refer to the same file via a different canonical_id, it will be downloaded once for every such identifier. In particular, URLs don't work well as canonical identifieres for archives distributed via different organisations.

  • Bazel has no means of verifying the canonical id; it is purely declarative from bazel's point of view. So, if a an incorrect use of a rule specifies an unintended pair of canonical_id and URL, bazel will add whatever file downloaded from there to the logical cache. That will be available for future cache hits.

Backward-compatibility

The change adds a new optional argument to ctx.download and ctx.download_and_extract. If the argument is not provided, the behavior does not change. So the proposed change is backwards compatible.