Thread conflict in Puppet::Pops::Loaders #9252

Open
jstange opened this issue Feb 12, 2024 · 5 comments
Labels: bug (Something isn't working), help wanted (Issue has been reviewed & PRs are welcome), triaged (Jira issue has been created for this)

Comments

jstange commented Feb 12, 2024

Describe the Bug

We observe the following error across ~5% of all Puppet agent runs in our organization against our Puppet 7.x masters when we set multithreaded: true, with any value for max-active-instances (even 2):

ERROR [qtp735649759-289] [puppetserver] Puppet Server Error: Evaluation Error: Error while evaluating a Type-Name, Internal Error: Attempt to redefine loader named '<foo> private' (file: /path/to/our/modules/foo/manifests/init.pp, line: 12, column: 18) on node clientnode.example.com
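
For context, the knobs involved live in the jruby-puppet section of puppetserver.conf. Something like the following (illustrative values, not our exact config):

  # puppetserver.conf (HOCON) -- illustrative, not our exact config
  jruby-puppet: {
      # one JRuby interpreter services requests on multiple threads
      multithreaded: true
      # in multithreaded mode, this bounds the number of worker threads
      max-active-instances: 2
  }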

The error disappears if we disable threading, or if we hand-insert thread synchronization into the offending method or its caller. It is not tied to any particular client node, and nodes whose agent runs throw the error work fine the other ~95% of the time.

The source of the error, add_loader_by_name in puppet/lib/ruby/vendor_ruby/puppet/pops/loaders.rb, is fairly straightforward; it adds a new key to an instance variable, throwing the error we observe if the key is already present:

  def add_loader_by_name(loader)
    name = loader.loader_name
    if @loaders_by_name.include?(name)
      raise Puppet::ParseError, _("Internal Error: Attempt to redefine loader named '%{name}'") % { name: name }
    end
    @loaders_by_name[name] = loader
  end

Interestingly, adding Puppet.info or Puppet.warn output before the if statement (to debug the call stack and thread state) makes the issue disappear. I'd theorize that whatever output synchronization the logging system uses adds a sort of back-door thread safety. Adding logging inside the if statement doesn't have this effect, which makes sense.

In every instance of this error we've logged to date, the calling method has been []. If we add naïve Ruby-native synchronization to that method, as below, our problem disappears:

  def [](loader_name)
    loader = @loaders_by_name[loader_name]
    if loader.nil?
      # Unable to find the module private loader. Try resolving the module
      # @my_semaphore: a Mutex we hand-initialize elsewhere for this experiment
      @my_semaphore.synchronize {
        loader = private_loader_for_module(loader_name[0..-9]) if loader_name.end_with?(' private')
        raise Puppet::ParseError, _("Unable to find loader named '%{loader_name}'") % { loader_name: loader_name } if loader.nil?
      }
    end
    loader
  end

That's presumably not a correct-for-this-codebase fix by itself, but it does point in the general direction of "it's threads stepping on each other; add a semaphore somewhere."
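
To illustrate the failure mode outside Puppet, here's a standalone Ruby sketch of the same check-then-act race (not Puppet code; the names are made up):

  # Several threads can all see the key as absent, all "create" the loader,
  # and a guard like add_loader_by_name's would then raise for the losers.
  loaders_by_name = {}
  creations = Hash.new(0)

  threads = 10.times.map do
    Thread.new do
      name = 'foo private'
      unless loaders_by_name.include?(name)  # check...
        sleep 0.001                          # widen the race window, like real work
        creations[name] += 1                 # ...then act; the two steps are not atomic
        loaders_by_name[name] = Object.new
      end
    end
  end
  threads.each(&:join)

  puts "created #{creations['foo private']} time(s)"  # frequently > 1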

Environment

  • Puppet Server 7.x (various, up to and including 7.15.0)
  • CentOS 7.9.2009, AlmaLinux 8.9
  • Sanitized puppet.conf and puppetserver.conf attached

Additional Context

It's not clear why our ecosystem in particular triggers this issue. While we're a medium-large organization, our Puppet masters are spread across various regional collections of nodes. The one on which we did the above debugging manages only about a dozen nodes; the issue manifests on all of them.

Our module ecosystem is an old-growth forest with a lot of interwoven dependency. It's possible that the trigger here is multiple references to the same dependent object being resolved in parallel.

It's worth noting that we have a number of modules with names of the form *_private (in a search path distinct from our "main" module repo). But this error occurs with any module, and does not occur with threading disabled, so it's unlikely to be some kind of namespace collision.

jstange added the bug label on Feb 12, 2024
joshcooper (Contributor) commented

This was originally reported in https://puppet.atlassian.net/browse/PUP-11324. That issue tracker is now read-only, so please continue discussion here.

joshcooper (Contributor) commented Feb 13, 2024

Thanks for letting us know @jstange. We currently use the environment object to protect against concurrent access, such as:

  environment.lock.synchronize do

I think you'd want to do something similar for each of the Loaders instance methods that mutate state, like:

  def [](loader_name)
    environment.lock.synchronize do
      loader = @loaders_by_name[loader_name]
      if loader.nil?
        # Unable to find the module private loader. Try resolving the module
        loader = private_loader_for_module(loader_name[0..-9]) if loader_name.end_with?(' private')
        raise Puppet::ParseError, _("Unable to find loader named '%{loader_name}'") % { loader_name: loader_name } if loader.nil?
      end
      loader
    end
  end

Alternatively, you could take the lock only when loader is nil, but then I think you need to look it up again inside the lock (a double-checked lock):

  def [](loader_name)
    loader = @loaders_by_name[loader_name]
    if loader.nil?
      environment.lock.synchronize do
        loader = @loaders_by_name[loader_name]  # re-lookup under the lock
        if loader.nil?
          ...
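
Spelled out fully, that might look like the following (untested sketch; the body just mirrors the existing method):

  def [](loader_name)
    loader = @loaders_by_name[loader_name]
    if loader.nil?
      environment.lock.synchronize do
        # Re-check under the lock: another thread may have created the
        # loader between our first lookup and acquiring the lock.
        loader = @loaders_by_name[loader_name]
        if loader.nil?
          loader = private_loader_for_module(loader_name[0..-9]) if loader_name.end_with?(' private')
          raise Puppet::ParseError, _("Unable to find loader named '%{loader_name}'") % { loader_name: loader_name } if loader.nil?
        end
      end
    end
    loader
  end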

We don't have any immediate plans for working on this, but we welcome pull requests. I'll put the help wanted label on for now.

joshcooper added the help wanted label on Feb 13, 2024
jstange (Author) commented Feb 15, 2024

After some poking, I might have a vague idea of how the threading works in here. It seems to be JRuby-specific, and in the case I care about, it's probably instances of Puppet::Util::Autoload competing to touch things in parallel.

[] probably isn't the only method that needs a thread seatbelt put on it, but I don't want to blindly add locks and create potential deadlock conditions.

I'll spend some time with my stack traces to see if there's anywhere else that looks thread-dangerous and deadlock-safe, wrap those spots in environment.lock.synchronize, and submit a PR. Someone who knows what they're doing will have to review it; my guesswork about threading in here may be wildly off base.
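
(The deadlock worry is concrete, by the way: Ruby's core Mutex is not reentrant, so if environment.lock behaves like one, a synchronized method calling another synchronized method on the same lock fails immediately. Toy sketch:)

  lock = Mutex.new
  lock.synchronize do
    lock.synchronize { }  # raises ThreadError ("deadlock; recursive locking")
  end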

Migrated issue to PUP-12038

jstange (Author) commented Apr 24, 2024

Finally got my corporate overlords to sign off on the CLA for #9299. Our fix has been deployed locally for a while now and seems to solve the issue.

We've recently identified a similar, but distinct, threading issue, this one in Resource::Type. I'll open a separate issue for it once I've made sure my report isn't duplicating someone else's.
