Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

always_update_cookbooks should not be enabled by default #1879

Open
karmix opened this issue Mar 14, 2022 · 11 comments
Open

always_update_cookbooks should not be enabled by default #1879

karmix opened this issue Mar 14, 2022 · 11 comments

Comments

@karmix
Copy link
Contributor

karmix commented Mar 14, 2022

#1765 configures the chef_infra Test Kitchen provider to enable always_update_cookbooks by default.
The only explanation I can find is the PR description, which says:

Without this the user doesn't get their updated cookbooks transferred to the box.
The user shouldn't need to set this.

Unfortuantely, this is not true; Test Kitchen always gathers fresh copies of the cookbooks it needs and sends them to the test box before each converge to ensure that it is not testing a stale environment. I'm sure this change was made with good intent but I can't for the life of me think of a scenario in which it does more good than harm, and I have been unable to locate any issues or other discussion that might help me understand the problem it was trying to address.

If there was a scenario in which Test Kitchen was not sync-ing cookbooks before a converge, that would be a bug for which always_update_cookbooks is not the proper solution. always_update_cookbooks makes Test Kitchen create a new lock file with which to perform testing. The purpose of a lock file is to maintain a consistent and reproducable environment around which we can build stable and reliable workflows. Changing the lock file in the middle of a test suite can render the results of some of the tests irrelevant, which is a dangerous feature.

Consider how this new default can impact IaC:

  1. An engineer updates the lock file and creates a new commit. The new lock file uses a cookbook with some flaw that would impact the customer's services and cause their test suite to fail.
  2. The engineer runs the test suite to validate the new environment. kitchen.yml does not configure always_update_cookbooks, and with the feature defaulting to enabled, Test Kitchen updates the lock file finding and installing a new release of the cookbook that fixes the flaw. Test Kitchen and all tools run after it in the test suite use the new cookbook and do not detect the flaw.
  3. The engeineer creates a PR for the commit (which contained the original lock file).
  4. Detecting a new PR, the pipeline pulls the commit and runs the test suite. Again, Test Kitchen updates the lock file and the tests pass with the newer cookbooks. The pipeline signs off on the PR.
  5. An engineer merges the PR into the main branch of the repo.
  6. Detecting updates to the main branch, the pipeline checks it out and uses berks apply or chef push to publish the lock file containing the flawed cookbook to an environment or policy on the Chef Server.
  7. Chef Clients configured to use that environment or policy download the and use the flawed cookbook, impacting production services.

Here the new default for always_update_cookbooks causes Test Kitchen to test something other than what it was given, preventing it from providing the intended guard in a typical IaC workflow. The problem exists because external resources might receive updates between an initial commit and the execution of a test suite, causing Test Kitchen to generate a different lock file. In reality, the window of opportunity here is often larger than you might think, for example, updating a baseline cookbook may result in updates to dozens or hundreds of policies or environments. Most engineers would simply skip local testing and rely on their pipeline to tell them whether any environments failed and need their attention. The engineer's pipeline, being flooded with updates, may take several hours to get to some cookbooks.

Thankfully, in my organization enabling always_update_cookbooks causes Test Kitchen to crash, which is very visible and we were able to catch this early. Other customers are most likely unaware that Test Kitchen is no longer testing with the environment it was provided and are potentially integration testing their cookbooks in production. IMHO, we should not implicitly enable dangerous features and expect users to know or remember to explicitly disable them.

@damacus
Copy link
Contributor

damacus commented Mar 17, 2022

I might be mis-understanding this, so please correct me if my understanding is incorrect here.

The lock file created locks the cookbook you're testing. So if you make an update to it, it won't get sync'd to the node because it doesn't match the version required. Requiring this change in the first place.

Impacting IaC

Here we're doing what we would expect, though definitely not what you're expecting. And I'd love to hear from other users on this(!). Pulling in a new cookbook, which is subsequently tested by Test Kitchen, will catch breakages early.

I'm totally on the fence on this one. So on one hand, we want to test the updates to the current cookbook every time we make a change to it (that's what this change was designed to fix) and for that we need to update the lock file. Running chef update, before we run kitchen test isn't an ideal user experience.

That said, the only thing that really wants to be updated in that scenario is the current cookbook rather than it's deps.

@lamont-granquist or @marcparadise from the infra team might be able to weigh in with some other opinions on this one.

@damacus
Copy link
Contributor

damacus commented Mar 17, 2022

In the mean time, if you haven't done it already, you'll want to set your provisioner to always_update_cookbooks: false

@karmix
Copy link
Contributor Author

karmix commented Mar 17, 2022

@damacus: The lock file contains a lock for the cookbook you are testing, but does not enforce locks for local cookbooks for precicely the cookbook development use case you are describing. What you are describing would be a bug, however, I cannot reproduce the behavior you describe with Test Kitchen when always_update_cookbooks is set to false. Was the cookbook directive changed in the Policyfile.rb, or the name of the cookbook modified in its metadata.rb?

When using Policyfiles to develop cookbooks, the cookbook being developed uses a "path" cookbook location (e.g. cookbook 'my_cookbook', path: '.'). This is important because cookbooks located using path are considered local cookbooks (as opposed to cached cookbooks that come from some external source). The identifiers for local cookbooks are ignored (well, actually updated in #refresh!) when the lock file is validated so modifying the cookbook does not generate any exceptions. Since they are not cached, local cookbooks are always read directly from their given path, and include all updates made to them even after the lock file is generated.

Was this issue something you ran across in one of the Sous Chefs or other public repos? Perhaps if you could point me to the project where it was encountered, or provide an example, I might be able to help determine what is going on.

@damacus
Copy link
Contributor

damacus commented Mar 28, 2022

I'm fairly sure it was the result of me trying to move over from Berks to Policyfiles in HAproxy.

Seeing as no-one else has piped into correct you on this. I'm going to run that pipeline again, with it set to false, and see how we get on 👍🏼

@jakauppila
Copy link

We set always_update_cookbooks: false so that we are not constantly resolving the dependencies and updating the lockfile, but I would say that if I'm testing something locally, that I do want any local cookbook changes pushed for testing.

Did always_update_cookbooks change its intent over time?

@karmix
Copy link
Contributor Author

karmix commented Apr 15, 2022

I agree, when testing locally, I want local cookbook changes pushed for testing, but this always happens regardless what you set for always_update_cookbooks.

I do not believe the behavior of always_update_cookbooks has ever changed, though the name is a bit misleading. When it is set to true, kitchen has chef or berks update the lock file before sending cookbooks to the test environment and when set to false, it uses the lock file it was given. Though both policy locks and berks locks track information about your local cookbooks, neither cache or pin them to a specific version in any way. When told to use a cookbook from a local path, they will always copy the cookbook (including the latest modifications) exactly as it exists at that path, even if the metadata doesn't match what they expect to find there (chef policies complaining if the cookbook's name changes is an exception). This feature might have been better named something like "always_resolve_dependencies" as it has nothing to do with the cookbook being developed.

@karmix
Copy link
Contributor Author

karmix commented Jul 26, 2022

@damacus: Did you ever get a chance to re-run that pipeline you spoke about?

@Stromweld
Copy link
Contributor

I believe it was set to make it easier for pulling in dependencies. If you add to your CB metadata a new dependency and didn't run chef update you'd get an error. Same thing if you aren't locking CB versions and strive to use latest versions of cookbook dependencies then they wouldn't be pulled in if you don't update the policyfile lock. If you are pinning versions then I don't see any harm in it being enabled except it makes the run a few seconds longer. The pollicyfile ID is based on the content so if it doesn't change your policyfile would be generated the same each time.

@karmix
Copy link
Contributor Author

karmix commented Sep 19, 2022

I believe it was set to make it easier for pulling in dependencies. If you add to your CB metadata a new dependency and didn't run chef update you'd get an error.

Correct, if you add new dependencies to a cookbook and forget to update your lock file, Chef client may abort. However, no individual tool in your test suite can handle this task for you because the correct time to update your SUT is before you run the test suite, not in the middle of it (when tools like test-kitchen get to run). When tests are run in parallel, having a tool in the suite modify the SUT introduces race conditions that will result in inconsistent test results and ultimately impact the stability of the infrastructure being managed.

Even when tests are run serially, this feature is not likely to help because tests are ordered by their expected run duration to promote fast feedback. When the lockfile is missing dependencies, faster tests in the suite such as ChefSpec will fail before slower tests like test-kitchen even get to run. If test-kitchen doesn't get to run, it can't update the lock file.

Same thing if you aren't locking CB versions and strive to use latest versions of cookbook dependencies then they wouldn't be pulled in if you don't update the policyfile lock. If you are pinning versions then I don't see any harm in it being enabled except it makes the run a few seconds longer. The pollicyfile ID is based on the content so if it doesn't change your policyfile would be generated the same each time.

I'm not sure I understand what you are trying to say, here. We test against and use the latest cookbook dependencies wherever possible. We only pin cookbook versions to address compatibility issues. If someone creates a P/R with a stale lock file, our pipeline will complain that the lock file is out of date and soft-fail, completing the rest of the test suite so our repo maintainers can use the test results to decide whether to wait for an update to the P/R or to force merge then update the dependencies themselves.

However, we do not use our test suite to update our lock files. The lock files need to be updated as part of the P/R. More specifically, we do not want anything in our pipeline, test suite or otherwise, to modify lockfiles or any other part of the product before or during testing or deployment because that would introduce a disconnect between our git repo and the infrastructure our pipeline created, preventing us from using our git repository as the source of truth.

As for pinning everything, that doesn't seem like a feasible approach to ask people to adopt. You would have to pin every cookbook in the environment, all of it's dependencies, and their sub-dependencies, etc. You would be left trying to manually track and depsolve dozens or even hundreds of cookbook requirements, which would quickly become a burden.

@Stromweld
Copy link
Contributor

I think there is confusion on the policy files. Since there are multiple types of cookbooks the community has come with library/resource, wrapper, and potentially role cookbooks you have policyfiles that are for cookbook testing and then policyfiles that are uploaded to chef-server and assigned to servers. Typically lock files aren’t checked into git unless they are ones for assigning to servers in a role cookbook or policy file repo as best practice.

When developing and testing most users are modifying code and potentially adding dependencies. The default behavior now makes it so you don’t have to run ‘chef update’ every time you try to run a ‘kitchen converge’ for testing.

You can always set the value to false in your kitchen.yml file and since this setting has been defaulted to true for over a year and this is the first and only issue I’ve seen raised about it, I’m inclined to think others aren’t having issues with it. https://github.com/test-kitchen/test-kitchen/blob/main/CHANGELOG.md#300-2021-07-02

@karmix karmix changed the title always_update_cookbook should not be enabled by default always_update_cookbooks should not be enabled by default Sep 20, 2022
@karmix
Copy link
Contributor Author

karmix commented Sep 20, 2022

Typically lock files aren’t checked into git unless they are ones for assigning to servers in a role cookbook or policy file repo as best practice.

There is considerable value in committing lock files for cookbook repos where the lock file is never pushed to a chef server. Contributors come from various backgrounds and some contributions take months or even years for a contributor to create. When a potential contributor runs a test suite and finds a bunch of failed tests that aren't related to their work, a portion of them give up on their contribution because they lack either the skills or time required to track down and fix whatever is causing the tests to fail.

Committing lock files in a cookbook repo helps to isolate contributors from failures related to incompatible changes in the cookbook's dependencies, allowing them to focus on the fix or feature they want to add. This increases the likelihood that a P/R will make it in front of a cookbook maintainer that can assist with any dependency related issues. We value this and believe it improves the quality of our cookbooks.

I suspect that as the chef community continues to evolve we will see a migration toward including berks and policy lock files in cookbook repositories for the same reasons the ruby community moved toward including bundler lock files in application repositories. That is, unless our tooling makes it difficult to realize those benefits.

When developing and testing most users are modifying code and potentially adding dependencies. The default behavior now makes it so you don’t have to run ‘chef update’ every time you try to run a ‘kitchen converge’ for testing.

The primary reason I brought this up for discussion is because the new default also:

  1. causes test-kitchen to crash in environments where it cannot download new dependencies (e.g. air-gapped environments or where test-kitchen does not have credentials to access cookbook sources),
  2. can prevent pipelines from detecting problems with cookbooks (as I described in the initial post), and
  3. introduces race conditions resulting in inconsistent test results when test suites run tests in parallel

but did not provide a clear explanation what the intended benefit was for the change.

In my experience, changing dependencies actually occurs very rarely. For example, out of 1997 commits in the sous-chefs/apache2 repo, only 15 modify the dependencies in the metadata. I believe these are significant regressions to introduce in test-kitchen's base configuration in order to make a very small number of updates more convenient for some users.

Berks and policies were designed to preserve and recreate the environment in which a cookbook gets run. In the scenario you describe, they are not actually being utilized at all and their behavior is undesirable. The only reason the berks and policy files exist in that situation is because the tooling requires one to locate, depsolve, and vendor/export cookbooks into a directory before it can do its work.

Updating the lock file every time test-kitchen is run masks the behavior of berks and policies from those users that don't want to use them, but at the same time prevents the tools from performing their function for users that depend on them. If we have identified the original issue, I believe there are more compatible ways address it. (@damacus: Please chime in if you recall anything about updating the dependencies of a local cookbook when you ran into the original issue.)

Instead of enabling always_update_cookbooks by default (and breaking berks/policies), we should be able to avoid the issue if we do not introduce a lock file that does not already exist. Specifically:

  • If a lock file exists: use it (unless always_update_cookbooks is explicitly enabled).
  • If a lock file does not exist: Generate a temporary lockfile, populate the sandbox, then remove the temporary lock file. (The temporary lock file should ideally use a non-standard name or location to avoid creating race conditions during parallel testing.)

This would address the issue when just test-kitchen is being used. Other tools like ChefSpec leave lock files behind that didn't originally exist and would need similar updates. I think it is easy to argue that the purpose of these tools is not to manage dependencies and so they should leave the lock files in the same state as they were before they were run.

You can always set the value to false in your kitchen.yml file and since this setting has been defaulted to true for over a year and this is the first and only issue I’ve seen raised about it, I’m inclined to think others aren’t having issues with it. https://github.com/test-kitchen/test-kitchen/blob/main/CHANGELOG.md#300-2021-07-02

My intent in creating this issue was to highlight the issues this feature introduces, to identify the factors that led to the change so we could consider whether there are better alternatives, and discuss whether it is appropriate for test-kitchen in its default configuration to break common tools like berks and policy file.

I don't believe we can assume that nobody else is affected by this change. How many people have to have to encounter an issue before one speaks up? My institution is not the only one that uses berks or policies. I suspect it is more likely that others have either not noticed that their lock files are not working, or started explicitly setting always_update_cookbooks to false and didn't say anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants