Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix broken runtime config sync #10013

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

yhabteab
Copy link
Member

@yhabteab yhabteab commented Mar 1, 2024

This PR kind of reverts 008fcd1 but it's still using AtomicFile to persist the configs to disk. Additionally, ConfigObjectUtility::CreateObject() utilises the ObjectNameMutex class to prevent an object of a given type from being created concurrently, even if it is hard to trigger that data race via API requests.

Tests

Start two Icinga endpoints that are connected to each other and must accept config updates from each other (accept_config = true).

Before

for j in {1..100} ; do
    curl -k -s -S -i -u root:icinga -H 'Accept: application/json' \
    -X PUT "https://localhost:5665/v1/objects/hosts/example-$j" \
    -d '{ "templates": [ "generic-host" ], "attrs": { "address": "127.0.0.1", "check_command": "hostalive", "vars.os" : "Linux" }, "pretty": true }'
done

It silently aborts the synchronisation attempts.

After

Works as it should!

fixes #10012

@cla-bot cla-bot bot added the cla/signed label Mar 1, 2024
@icinga-probot icinga-probot bot added area/distributed Distributed monitoring (master, satellites, clients) area/runtime Downtimes, comments, dependencies, events bug Something isn't working labels Mar 1, 2024
@yhabteab yhabteab added this to the 2.15.0 milestone Mar 1, 2024
@yhabteab yhabteab added the consider backporting Should be considered for inclusion in a bugfix release label Mar 1, 2024
@yhabteab yhabteab requested review from oxzi, julianbrost and Al2Klimov and removed request for oxzi March 1, 2024 10:02
@julianbrost julianbrost added the blocker Blocks a release or needs immediate attention label Mar 5, 2024
@julianbrost
Copy link
Contributor

This looks like it significantly changes what was only added in #9980. If it stays that way, it might make sense to revert #9980 as a whole and make a complete new attempt. That would make the backport of this cleaner (as you wouldn't have to backport a commit and its reversion at once).

What's the reason for moving ObjectNameMutex to types.{hpp,cpp}?

ceab5a6 seems to add a second independent ObjectNameMutex. What's the reason for that, shouldn't this use the same one?

@yhabteab
Copy link
Member Author

yhabteab commented Mar 6, 2024

What's the reason for moving ObjectNameMutex to types.{hpp,cpp}?

Well, my first thought was that we might be using it even more than we thought in #9980, and that it might introduce a circular dependency. However, if you are strongly against it, then of course I can undo it.

ceab5a6 seems to add a second independent ObjectNameMutex. What's the reason for that, shouldn't this use the same one?

Feels strange to use the same mutex for the cluster config sync and the API request. Otherwise, there's nothing against it, I guess.

@yhabteab
Copy link
Member Author

yhabteab commented Mar 6, 2024

ceab5a6 seems to add a second independent ObjectNameMutex. What's the reason for that, shouldn't this use the same one?

Feels strange to use the same mutex for the cluster config sync and the API request. Otherwise, there's nothing against it, I guess.

Plus ConfigObjectUtility:CreateObject() can also be used without an API listener.

@yhabteab
Copy link
Member Author

yhabteab commented Mar 6, 2024

ceab5a6 seems to add a second independent ObjectNameMutex. What's the reason for that, shouldn't this use the same one?

Feels strange to use the same mutex for the cluster config sync and the API request. Otherwise, there's nothing against it, I guess.

Plus ConfigObjectUtility:CreateObject() can also be used without an API listener.

Plus ApiListener::ConfigUpdateObjectAPIHandler() calls CreateObject() while holding a lock on that object and its type, so that's a good reason not to use the same mutex, I'd say.

@julianbrost
Copy link
Contributor

Well, my first thought was that we might be using it even more than we thought in #9980, and that it might introduce a circular dependency. However, if you are strongly against it, then of course I can undo it.

I mainly asked the question because if it is moved, it's more code that was added in #9980 that wouldn't stay the way it was added there. So if it's moved, that would be more reason for me to revert #9980 completely and do a clean new PR. (I haven't looked at the PR in enough detail to assess whether it should be moved.)

Feels strange to use the same mutex for the cluster config sync and the API request. Otherwise, there's nothing against it, I guess.

Why does it feel strange? Think of what this mutex is supposed to protect: the consistency of the configuration of a specific object. So anything that attempts to modify the object at runtime should use the same mutex.

Imagine someone would send a PUT /v1/objects/hosts/foo request to Master 1 and then the same HTTP request to Master 2 at about the same it receives the JSON-RPC request from Master 1 to create the object. Should both requests be processed at the same time on Master 2?

Plus ApiListener::ConfigUpdateObjectAPIHandler() calls CreateObject() while holding a lock on that object and its type, so that's a good reason not to use the same mutex, I'd say.

Does the locking in ConfigUpdateObjectAPIHandler() actually do anything useful anymore, or would the locking performed in CreateObject() already suffice for ConfigUpdateObjectAPIHandler()?

@yhabteab
Copy link
Member Author

yhabteab commented Mar 6, 2024

Does the locking in ConfigUpdateObjectAPIHandler() actually do anything useful anymore,

IMHO yes, it does! It acquires the lock before it does ctype->GetObject(objName);, which would cause two config updates to think the object doesn't exist yet and try to create it when it doesn't.

or would the locking performed in CreateObject() already suffice for ConfigUpdateObjectAPIHandler()?

That would be too late for the cluster config sync part.

@yhabteab yhabteab force-pushed the broken-runtime-config-sync branch 2 times, most recently from 2b40c6f to a259a68 Compare March 6, 2024 17:22
@julianbrost
Copy link
Contributor

Feels strange to use the same mutex for the cluster config sync and the API request. Otherwise, there's nothing against it, I guess.

Why does it feel strange? Think of what this mutex is supposed to protect: the consistency of the configuration of a specific object. So anything that attempts to modify the object at runtime should use the same mutex.

Imagine someone would send a PUT /v1/objects/hosts/foo request to Master 1 and then the same HTTP request to Master 2 at about the same it receives the JSON-RPC request from Master 1 to create the object. Should both requests be processed at the same time on Master 2?

What about that part? Currently, the PR adds different locking based on whether an object is created via HTTP or JSON-RPC. Both would create the same object though, so both should use the same locking. Wouldn't you end up with race conditions between HTTP and JSON-RPC otherwise?

@yhabteab
Copy link
Member Author

yhabteab commented Mar 7, 2024

What about that part? Currently, the PR adds different locking based on whether an object is created via HTTP or JSON-RPC. Both would create the same object though, so both should use the same locking. Wouldn't you end up with race conditions between HTTP and JSON-RPC otherwise?

I don't think so! In case of a JSON-RPC connection, the object is locked twice! There is only one case in which a race condition may occur. This's when the API request enters CreateObject(), while the cluster sync takes place in this part of the code. It's hard to trigger, I'd say!

ConfigObject::Ptr object = ctype->GetObject(objName);
String config = params->Get("config");
bool newObject = false;
if (!object && !config.IsEmpty()) {
newObject = true;
/* object does not exist, create it through the API */
Array::Ptr errors = new Array();
/*
* Create the config object through our internal API.
* IMPORTANT: Pass the origin to prevent cluster sync loops.
*/
if (!ConfigObjectUtility::CreateObject(ptype, objName, config, errors, nullptr, origin)) {

@yhabteab
Copy link
Member Author

yhabteab commented Mar 11, 2024

I've tested all the use cases of the config sync and it is working perfectly now. However, we still need to address the use of ObjectNameMutex in the modifyobjecthandler.cpp file, particularly whether it is even necessary to lock the object name there.

@yhabteab yhabteab removed the request for review from julianbrost March 11, 2024 15:06
@yhabteab yhabteab requested review from julianbrost and Al2Klimov and removed request for Al2Klimov March 11, 2024 15:06
lib/remote/configobjectslock.cpp Outdated Show resolved Hide resolved
lib/remote/configobjectutility.cpp Outdated Show resolved Hide resolved
lib/remote/configobjectslock.hpp Outdated Show resolved Hide resolved
@julianbrost
Copy link
Contributor

However, we still need to address the use of ObjectNameMutex in the modifyobjecthandler.cpp file, particularly whether it is even necessary to lock the object name there.

Some synchronization has to happen there for sure. Am I missing something or isn't even using ObjectLock so that two post requests that modify the same object could result in some race condition?

At first glance, ObjectNameMutex also doesn't sound like a bad idea there. That will prevent both starting a modification before the object was fully created and triggering a deletion while the modification is still being processed.

@yhabteab
Copy link
Member Author

Am I missing something or isn't even using ObjectLock so that two post requests that modify the same object could result in some race condition?

There is no olock on the actual configuration object, but only on the restore_attributes and these acquired in RestoreAttribute().


static std::mutex m_Mutex;
static std::condition_variable m_CV;
static std::map<Type*, std::set<String>> m_LockedObjectNames;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider unordered_{map,set}.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exact types won't matter that much here as the overall size of those containers will stay rather small anyways.

For completeness, as I already talked with @yhabteab about it: I was thinking about maybe making this a std::set<std::pair<Type*, String>> instead (or std::unordered_set for that matter, there won't be much of a difference), as that would reduce the map + set operations down to just the set operations. But this seems to bring no huge improvements for simplifying the code (which would have been my motivation here, not performance).

Copy link
Contributor

@julianbrost julianbrost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine in general to me. One commit message refers to the outdated class name ObjectNameMutex though. And if you're touching the PR anyways, you might do a slight improvement to one of the comments (see my inline comment).

lib/remote/configobjectutility.cpp Outdated Show resolved Hide resolved
@tbauriedel
Copy link
Member

ref/NC/804054

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/distributed Distributed monitoring (master, satellites, clients) area/runtime Downtimes, comments, dependencies, events blocker Blocks a release or needs immediate attention bug Something isn't working cla/signed consider backporting Should be considered for inclusion in a bugfix release ref/NC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Runtime created objects don't get synced
4 participants