Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OF-2824: Addressing data consistency in caches of routing table #2459

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

guusdk
Copy link
Member

@guusdk guusdk commented May 8, 2024

The distinct commits in this PR each intend to address a separate issue, where the first three commits are preparing the code for easier application of the fourth commit.

The over-arching goal of all commits is to make the implementation of RoutingTable less error prone (in context of client sessions/routes). it does so by:

  • adding strictness with regards to allowed null and bare vs. full JID usage
  • simplifying the code (doing away with an optimization that I suspect plays a role in the cause of observed data inconsistency)
  • replacing per-cache lock usage to usage of one lock, to guard the manipulation of three related caches. This intends to remove a potential synchronization issue.

guusdk added 4 commits May 8, 2024 11:19
Any Session's `address` property is expected to be non-null by the existing implementation. One notable exemption to this is `RemoteIncomingServerSession`, where the address _field_ is lazily initialized (and thus _can_ be null). Its getter value is guaranteed to be non-null though.

This commit annotates and asserts that various Session address fields, getters and setters to have non-null values.

The intention of this change is to simplify implementations (by dropping unneeded null-checks), more strictly document expected state (for easier maintenance) and introduce fail-fast behavior (that should help catch any current or future bugs).

It is not expected that this change will introduce runtime changes (as the values already are non-null).
Any Session's address is non-null, but _client_ sessions also are guarenteed to be full JIDs (in other words: they have a non-null resource part).

This commit explicitly checks for this (to fail fast on any current and future bugs), and removes unneeded checks on usage. This intends to reduce code maintenance costs, by having less complex code.

It is not expected that this change will introduce runtime changes (as the values already are full JIDs).
It is suspected that, under race conditions, the optimization removed by this commit introduces data inconsistency.

The optimization prevents modification of a second cache, as through the output of manipulation of the first cache it can be deduced that the second cache should already be in the expected state. By skipping manipulation of that second cache, a cluster-wide operation is prevented, which improves performance.

The presence-based override for the optimization - one that I frankly do not understand - becomes obsolete by this change, and is also removed.

The change introduced by this commit trades performance for more reliable data consistency. As an added benefit, the code becomes less complex, reducing maintenance costs.

This commit changes the public signature of the `addClientRoute` method of `RoutingTable`: it no longer returns a value. The return value is currently not used by Openfire's own code. It was used for only two days, back in 2007: it was introduced in commit a940eef where it was used to update a statistic. This statistic got removed two days later, in commit 67f9ab6. Given the nature of the code (RoutingTable being _very_ low level), it is not expected that third-party code uses this method. It contract change should therefor be reasonably safe to do.
…the same lock

RoutingTableImpl has three closely related caches, that are used to represent the state of client session routes:
- `usersSessionsCache`
- `anonymousUsersCache`
- `usersCache`

Each value in the first cache is expected to correspond to a value in one of the other two caches.

Under OF-2824, a bug is described where `usersCache` contains values that are _not_ in `usersSessionsCache`. That shouldn't be possible.

Prior to this commit, manipulation of these caches is performed under a lock obtained from each of the caches. This means that the overall operation of adding an entry to `usersSessionsCache` and one of the other two caches is _not_ guarded by one singular lock (instead, two locks are used, each guarding the operation pertaining to that particular cache). This leaves room for a race-condition.

This commit addresses the race condition by using one singular lock to guard manipulations in all of these caches.

As all caches use a JID (in either bare or full form) as their key value, the singular lock introduced by this commit is based on the bare JID of the key that's being manipulated. This lock is obtained from the `usersSessionsCache`.

This change can lead to more lock contention (as more operations are guarded by the same lock). Simultaneously, less acquiring of locks will take place (as many operations previously required two locks to be acquired, while now, only one is needed. What the effects are on performance is as of yet undetermined.

This commit also introduces some related, minor changes:
- Logged messages are made more consistent
- Some operations have been moved outside of the protection of a (potentially cluster-wide) lock, to improve performance
- Where methods expect to be called with a full JID, exceptions are thrown when a bare JID is used. This fail-fast behavior is intended to uncover any existing or future bugs.
@guusdk guusdk added the backport 4.8 on merge, GHA will generate a PR with these changes against 4.8 branch label May 8, 2024
Copy link
Member

@akrherz akrherz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't see anything obviously wrong with this change.

@Fishbowler
Copy link
Member

I've reviewed this commit by commit, and it seems sensible.
On locks, do those locks that we no longer acquire (Users, AnonymousUsers) get locked and edited anywhere else? i.e. Could this be trading race conditions?

@Fishbowler
Copy link
Member

Those smack tests though 😬

Copy link
Contributor

@GregDThomas GregDThomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One typo, one thing that worries me but I may be off base there.

@guusdk
Copy link
Member Author

guusdk commented May 9, 2024

do those locks that we no longer acquire (Users, AnonymousUsers) get locked and edited anywhere else?

No, I've checked (but cannot programatically ensure) that all interaction with those caches either use UsersSession's lock, or no lock at all (in cases were the cache interaction is not specific to one entry - I'm not sure how to improve on that).

In commit ef582e1 fail-fast behavior was introduced that causes exceptions to be thrown when certain methods are invoked using a bare JID while the expected argument is a full JID.

For certain lookups, these unexpected bare JIDs appear to be used all the time, causing Openfire to become pretty unusable.

In this commit, the invocations that do not _modify_ cache content are short-circuited when a bare JID argument is used. This has the benefit of preventing the acquisition of a cluster-wide lock.
@guusdk
Copy link
Member Author

guusdk commented May 9, 2024

Those smack tests though 😬

I had introduced fail-fast behavior. Guess what: things started to fail rather fast. I was wrong to assume that the arguments used would always be of the expected type (a full JID rather than a bare JID).

There might be room to improve things, as the invocation with bare JIDs might be a symptom of an underlying issue, but that's potentially a larger refactoring - while it wouldn't fix the same problem in plugins or third-party code.

Given that it appears to be wide-spread, I've now modified to code to, instead of failing immediately, simply prevent an operation that would have a predictable null-result when a bare JID is provided.

@Fishbowler
Copy link
Member

LGTM 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 4.8 on merge, GHA will generate a PR with these changes against 4.8 branch
Projects
None yet
4 participants