Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal on caching #412

Closed
wants to merge 11 commits into from
Closed

proposal on caching #412

wants to merge 11 commits into from

Conversation

henrjk
Copy link
Member

@henrjk henrjk commented Jan 19, 2022

This proposal aims to address issue #147

Looking forward to your feedback!

@gerardcl
Copy link
Member

I like it so far! my suggestion would be to finish it by having a proposal on how one would do caching based on each main technology currently supported (aka go, java, typescript, python).
By doing such exercise we can ensure we match all by design and also document howto.

@henrjk
Copy link
Member Author

henrjk commented Jan 19, 2022

I like it so far! my suggestion would be to finish it by having a proposal on how one would do caching based on each main technology currently supported (aka go, java, typescript, python).
By doing such exercise we can ensure we match all by design and also document howto.

Makes sense to me. I will start looking at node/npm. Happy to have volunteers for all techs!

The way I am testing this is locally against one of our projects. I wrote a little helper script for this available at https://gist.github.com/henrjk/4265bc1b4f6fe3f7cd9027b0bcc284f9

@michaelsauter
Copy link
Member

volunteering for Go

@gerardcl
Copy link
Member

volunteering for python

@henrjk
Copy link
Member Author

henrjk commented Jan 20, 2022

With commit 4dc6da4 I added a first draft to enable typescript caching. Some of this should be familiar to @gerardcl.
In our project we have several repos each having two builds for backend and frontend. In this case the ability to skip a build when there were no changes in the associated subdirectory (--working-dir) will be of utmost importance.
For this I added a parameter called --working-dir-commit-sha which would have the commit hash for the subdirectory. I believe git rev-parse "HEAD:$working_dir" provides this value. In addition the build script expects the file $pipeline-cache-dir/$working-dir/.ods/git-dir-commit-sha to capture this value. This should be ideally done outside of the script to guarantee that it will always be updated. This would allow its last modified timestamp to be used for cleanup. Unless there are concerns I would add --working-dir-commit-sha to the proposal.

The script will require rsync which would need to be added to the image. I also noted that git is not available and so it uses .ods/git-ref to protect master and develop. This should not be hardcoded but I wasn't sure how best to configure this.

FYI @henninggross, @netzartist


== Con

* Tools such as for typescript https://github.com/pnpm/pnpm[pnpm] and https://github.com/yarnpkg/berry[yarn berry] support deduplication allowing to reference artifacts from a central location without packaging them again. pnpn uses hard links to make this efficient. However hard linking is not supported across PVCs and thus could not be used with the proposed design as is. See also related discussion at https://github.com/opendevstack/ods-pipeline/discussions/411[node typescript security and caching - npm or yarn? #411].
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To enable exploring the newer npm techs using hard linking, I am now in favor of changing the proposal so that pipeline-cache-dir and global-cache-parent would always be on the same PVC. However the checkout dir might still not be on the same PVC and perhaps not even be a PVC at all.

@henrjk henrjk marked this pull request as draft January 20, 2022 08:52
== Background

Currently there is a single PVC per project which is shared by all build pipelines.
As a consequence only a single build at a time can run for a project.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically this isn't actually true due to #394, but with #407 and #160 solved, it will be "one build at a time for a repository".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add this clarification to the proposal so we don't loose it when resolving the conversation.


With sufficient caching space available:

* A global cache directory where the parent directory is indicated by parameter `global-cache-parent`. Cleanup will only spare directories with in it so that build tasks must keep stuff in a subdirectory. For example in `$(global-cache-parent)/go-modules/`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need the parameter global-cache-parent? Do you think some build tools will not be configurable to point to a directory of our choice?


* A pipeline cache directory as indicated by parameter `pipeline-cache-dir`.

In the initial implementation the parameters will be set to `/.cache/` and `/.cache-p/<pipeline-name>/` if space permits and be empty if there is not sufficient space available.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why /.cache-p/<pipeline-name>/? Does that mean you want caching within a branch to be on by default?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pipeline caching should only be enabled when the tasks declares how much free space it requires for a new pipeline with a parameter pipeline-caching-mb greater than 0.
There should probably also be way to not using caching for a specific commit, perhaps by having [build-no-cache] as part of a commit message.
The pipeline cache should be left undisturbed from the previous build of the same pipeline, except in case where cache space is reclaimed. Not sure it answers your first question.


Files in the cache locations would persist between builds unless they are removed by the cleanup mechanism.

The build scripts are adjusted to take advantage of the cache parameters above. In addition each build task will support a new parameters `global-caching-mb` and `pipeline-caching-mb`. These defaults to 0 to indicate the task is not using caching. Otherwise it specify the minimum required cache space in MB.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reasoning behind two separate parameters? Why do we need both?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As now mentioned in the comment above one purpose of these parameters is to enable caching by explicitly stating how much cache space the task will require in the respective cache.
For the pipeline cache this will also allow for a good cleanup strategy as the space is not shared with other pipelines.
For the global cache the situation is more complex as it would be shared by all pipelines. So it would grow over time and cleanup might from time to time wipe out the cache (e.g. rm -rf /.cache/go) when the disk space on the caching PVC no longer is able to fulfill the demand. However to do a good job the cleaner would actually need to understand which tasks use which global caches which the proposal as is does not make explicit.


Before cache cleanup `ods-start` cleans up the prior PVC checkout dir, which means everything except `/.cache/<some-dir>` or `/.cache-p/<some-dir>`. In particular to prevent having files at the top levels regular files or links below `/.cache/` or `/.cache-p/` are deleted.

`ods-start` then cleans up separately the global and pipeline cache to fulfill the declared requirements of all build tasks in the odl.yaml. If disk space remains too low in a particular cache area a warning message is logged and the respective parameter `global-cache-parent` and/or `pipeline-cache-dir` will be empty.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleans up separately the global and pipeline cache

I think while reading through this I am actually confused what the difference is and why we need it. Could you clarify your proposal regarding this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pipeline cache enable to reuse installs of dependencies in subsequent builds similar to how happens on your local machine.
For example npm install will install its dependencies in node_modules. A subsequent build will verify that the installs are consistent with the latest package*.json version (which is nearly instantaneous). Before installing a new dependency npm will always download these to a local cache first. This cache is by default at ~/.npm. One can reconfigure this cache to be in the global cache and would speed up the build, however the performance gain probably does not outweigh the extra space requirement in my opinion. See also #411 (comment) for some impressions of performance gains.
The global cache in contrast is shared by multiple build pipelines and thus not meant to share installations of dependencies but only the dependencies itself.

@henrjk
Copy link
Member Author

henrjk commented Jan 24, 2022

Michael and I were talking directly about some of the details before the weekend.
While there were some other thought two noteworthy points for an evolution of this proposal emerged:

  • The pipeline cache confused him partly because he expected it to be named workspace directory (correct me if I am wrong).
  • Also he suggested we could eliminate the distinction between checkout directory and simple git pull if the workspace persists between pipeline builds.

This would indeed simplify things a bit, in particular the build script would not have copy sources from the checkout dir to the pipeline cache dir. Also there would be no longer a need to distinguish these directories and we could simply call it workspace directory :-)

If one enable caching for a workspace by setting a pipeline-caching-mb > 0 the workspace would git pull if the space wasn't cleaned up since the previous build and then call the build-script.

@michaelsauter is this a good summary? If it makes sense to you I would work on an updated proposal based on this.

@michaelsauter
Copy link
Member

@henrjk Yes, I think it does, but I would like to add the general view I shared to provide some background. I am not too excited to add a "workspace cache" at all into the main build tasks. In my view, starting from a clean slate is one of the "features" of CI and I personally would not encourage usage of a "workspace cache" in my projects. I believe we could also use the problem as a trigger to explore completely different approaches to avoid long build times, such as using a bundler like esbundle, or using YARN as you suggested. For other languages / build tools I think only a global cache is needed, if it all (as Nexus should be fast already!).

@henrjk
Copy link
Member Author

henrjk commented Jan 25, 2022

I am not too excited to add a "workspace cache" at all into the main build tasks. In my view, starting from a clean slate is one of the "features" of CI and I personally would not encourage usage of a "workspace cache" in my projects. I believe we could also use the problem as a trigger to explore completely different approaches to avoid long build times, such as using a bundler like esbundle, or using YARN as you suggested. For other languages / build tools I think only a global cache is needed, if it all (as Nexus should be fast already!).

I agree with your concerns and call to action! At the moment the workspace cache would not improve FE builds by much.

Nonetheless I would like to pursue a workspace cache proposal a bit further to learn whether we can provide a workspace cache and

  • keep the build scripts simple or make them even simpler
  • understand whether cached builds can be integrated without impacting quality

Can we support multi build repos better by skipping builds where nothing changed? Do we need a workspace cache to enable this?

For example the idea emerged after my initial proposal was to utilize the workspace cache to skip build tasks if their work-dir did not change as per git commit. When looking at the details the build-task would still be invoked to copy prior reports and bits into the docker context so that the image build could still continue. This is for the case where we have a FE and BE in the same repo and the FE packed js files are copied into the final docker build. Of course this increases complexity in the build scripts and perhaps one could perhaps factor out the uploading of reports and skip this entirely in the further pipeline. Or as mentioned above implement this without a workspace cache.

@henrjk
Copy link
Member Author

henrjk commented Jan 27, 2022

I update the proposal for increased clarity. It still contains both workspace caching and global caching. I believe things a bit simpler also in terms of build script complexity, but there are areas where I am not sure it can be implemented as proposed.

Also the Con section of caching.adoc is now somewhat better informed.

I will investigate what it would take to skip builds in multi build repos where sources did not change without using a workspace cache.
At the moment in my opinion workspace caching should only be considered once the mentioned cons are addressed. I assume this will take some time.
Perhaps a separate proposal for global caching should be made, if we believe this will have more traction.

@michaelsauter
Copy link
Member

@henrjk Like the overhaul of the proposal!

I guess I am still having trouble to understand the actual benefit of a "workspace cache" over using a global cache, within the context of the original problem statement of "caching dependencies". I do not know much about Python/NPM, so what I'm going to say may sound stupid - sorry in advance. I am basically approaching this coming from my - hopefully somewhat comparable - Ruby/Bundler experience. For Bundler, it is common in CI systems to place the bundle not into the workspace, but in a directory which is cached. I did some googling and it seems that npm --prefix achieves the same result (see https://stackoverflow.com/questions/14742553/how-to-set-custom-location-for-local-installation-of-npm-package), allowing to place node_modules anywhere I want, e.g. in a "global cache dir", making it possible to cache it for next runs. That's what brings me back to #147 (comment), in which my suggestion is to defer how much reuse a cache should get to the author of the ods.yaml file. The author can select a cache key of their liking, controlling how many different caches there are and how often they will be reused.

So - sorry if this doesn't make sense, and then I'll stop bringing it up 😄 However if it does make sense I propose to split this up into two suggestion, where at least the global cache seems way clearer and easier to achieve to me.

Also note that my current understanding is that npm ci will wipe the node_modules directory, in which case any kind of caching is pointless.

@gerardcl
Copy link
Member

I also see the point of splitting and start with the global option one, since it will also work for python by caching and using the venv on that cache (same as bundler) -> we just need to ensure we give the option to load the venv fromthe given cache and if present. A problem we need to address for sure first of all is to ensure we are not having many pipelines using the same cache (related to race condition already under discussion).

@henrjk
Copy link
Member Author

henrjk commented Jan 31, 2022

Corrected: earlier I said --index where I meant --prefix.

@michaelsauter Thanks for doubting!

Also note that my current understanding is that npm ci will wipe the node_modules directory, in which case any kind of caching is pointless.

In the workspace cache proposal the script actually adjusts to use npm i instead if it has a prior build cached. I am not sure actually that npm ci is much different from npm i in case there is a package-lock file available.

I am basically approaching this coming from my - hopefully somewhat comparable - Ruby/Bundler experience. For Bundler, it is common in CI systems to place the bundle not into the workspace, but in a directory which is cached. I did some googling and it seems that npm --prefix achieves the same result

Answering this could be its own blog series ;-)
But I am trying to be more concise here.

With node and npm if you have two apps depending on the same version of package foo each app may resolve the foo's own package dependencies differently depending on additional version constraints and thus one cannot share the node_modules as they are created by npm install or ci.

Using --prefix could potentially be used to place the install outside the workspace directory so that workspace caching could be implemented without needing to distinguish whether to git init or git pull. However I tested using --prefix but preliminary testing did not work for me unless I am doing pilot errors. Also the link you posted indicates that it would not work for require (https://stackoverflow.com/a/19975303).

To not overshare this it would need to be a branch cache and thus essentially be a workspace cache. Using a cache-key for npm node_modules caching would lack features I see as essential. In particular ensuring that mainline builds would not use such caching is important to me. Also in my testing $(context.pipeline.name) was not available.

There are however alternative to npm which are pnpm and yarn for which a true global shared cache is probably worthwhile.
Yarn in its latest version (Berry) uses an innovative approach dubbed Plug'n'Play where node_modules is replaced by an overridden require function. I believe this can even access js from within packages that remain tar.gz.
While npm has a cache, it unpacks and and copy files into the node_modules folder which makes it slow. As indicated in #411 to me having that npm cache persist is not really a worthwhile improvement.
However pnpm has a shareable cache (unless I am mistaken) where plugins already are unpacked and when creating a node_modules install it uses hard links as well as symlinks to express the right versions for node to discover.

The benchmarks from pnpm compare the various package managers. The code behind it is at commandsMap.js

@netzartist
Copy link

netzartist commented Jan 31, 2022

Deno has a interesting concept that you could skip dependency management like NPM at all via importing desired versions of dependencies via URLs into the Deno runtime, and exporting them in a custom lockfile to be used for caching.

This design circumvents a plethora of complexity spawned by package management software, centralized code repositories, and superfluous file formats.

The inventor of NPM created Deno after reflecting a lot about the architectural issues of NPM.


=== Global cache

Only build tasks for which a global cache provides a significant performance gain will support global caching. At the moment the primary candidates is the go language. Nexus does not support go and we have no similar go dependency artifacts manager in place.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gradle will also benefit. Normally because it always download the wrapper, unless the url that points to the gradlew wrapper is being updated to point to nexus down (opendevstack/ods-core#919). However to avoid this problem, I added to the gradle task an instruction to download the gradlew wrapper so that it lands in the gradle cache folder (

gradle wrapper --gradle-distribution-sha256-sum ${GRADLE_WRAPPER_DOWNLOAD_SHA256} && ./gradlew -version && \
). This has the limitation that just only one version of the gradle wrapper is available.

Copy link
Member

@stitakis stitakis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some comments to share some thoughts but I don't any blockers. I think the concept is ready and we could give it a try. It looks good to me!

@henrjk
Copy link
Member Author

henrjk commented Feb 8, 2022

Just update the proposal to focus on global caching only. I also aimed to undo my changes that showcased the workspace cache. The typescript build file was intended to be the same as in master now.

Copy link
Member

@michaelsauter michaelsauter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very happy with where this is now 👍. Thanks heaps for all the work you put into this! Marking this as "approved", even though I guess we would not merge the PR as-is but maybe just the cache proposal design doc?

docs/proposals/caching.adoc Show resolved Hide resolved

The following new parameter is introduced to build tasks supporting dependency caching:

* `cache-deps` a boolean defaulting to `false`. Only if set to `true` the task will used dependency caching.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer cache-dependencies.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change this as you propose.


In addition to the build task parameter, build tasks also receives the following:

* File `.ods/cache-deps-parent-dir` contains an absolute path without trailing '/' to an existing directory (no whitespace or newlines). If file `.ods/cache-deps-parent-dir` does not exits the task must not use dependency caching. This is used to switch off dependency caching dynamically for example with a special tag in a git commit message.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose to start without disabling via commit message. Do we need this file then at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking we wouldn't need it. However without it the caching locations would need to be hardcoded, which I would rather not have.

Instead if disk space issues arise one should manually:

* Increase the PVC space of the associated repository or
* Recreate the PVC (TODO: should this be done automatically for example on a randomized bi-weekly schedule for example)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd propose to keep it manual for now

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense to me as well. A first version should do that this way.


The following cache locations on the PVC are used:

- `+/.c/deps/<technology-name>/+` for dependency caches of a particular build technology. The technology-name would be defined by the build script and unknown to `ods-start`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer e.g. /.ods-cache/dependencies/<technology-name>/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this a bit lengthy. How about . /.ods-cache/deps/<technology-name/ ?

@henrjk
Copy link
Member Author

henrjk commented Feb 16, 2022

I am very happy with where this is now 👍. Thanks heaps for all the work you put into this! Marking this as "approved", even though I guess we would not merge the PR as-is but maybe just the cache proposal design doc?

Not sure. If we were to merge the proposals as well, there would still be a discussions etc around it missing. So perhaps it is better to have a link to the proposal PR in the changelog.

I would still aim to remove the other unrelated historic changes, so that only the latest adoc is kept (with final additions).

I could start work on this soon (unless somebody else if keen to) and work on a PR which likely should also include one consumer. I'd choose go for this.

@henrjk henrjk mentioned this pull request Feb 22, 2022
4 tasks
@michaelsauter
Copy link
Member

I would propose to close this now that #460 is merged and the changelog refers to this PR. @henrjk If you agree, please close.

@henrjk
Copy link
Member Author

henrjk commented Feb 25, 2022

This has now been implemented with PR #460

@henrjk henrjk closed this Feb 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants