Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fix] grouping skipping opRepoPostCreateDelay, causing operations being applied out of order when multiple login operations are pending. (fixes issue since 5.1.10) #2087

Merged
merged 7 commits into from
May 16, 2024

Conversation

jkasten2
Copy link
Member

@jkasten2 jkasten2 commented May 14, 2024

Description

One Line Summary

Fix operations being applied out of order when multiple login operations are pending.

Details

OperationRepo.getGroupableOperations() was not respecting opRepoPostCreateDelay like OperationRepo.getNextOps(), this would lead to some being applied out of order. The only known case is when there is a login in the queue for an Anonymous User crate and also User A and another for User B. The anonymous User create happens first with a push subscription which is correct. Then we have to wait for opRepoPostCreateDelay time after that create before we can attempt to identify it has User A, this is also ok. However we skip to User B and try to create it and transfer the subscription, this isn't correct as we also should wait opRepoPostCreateDelay for the new subscription id.

We addressed getGroupableOperations() in this PR however it was not enough, the OperationRepo and it's executors still ran into edge cases where ids would get mixed up due to intermixing of operations on different users. So in this PR we consistently stall the whole OperationRepo processing to avoid the issues, until we can make a flow up PR to improve this later.

While this PR fixes the issue noted above there should be another mechanism in place to prevent such an issue. This should be done in a follow up PR.

Motivation

No matter how quickly or the offline state of the device operations should be applied in the correct order.

Scope

Corrects OperationRepo ordering and makes now consistently delays the OperationRepo after creating records.

Related

This is a follow up to PR #2059, where that logic created the issue noted above.

Testing

Unit testing

Manual testing

Tested on Android 6 with the following scenario:

  1. Turned on Airplane mode
  2. Opened the app
  3. OneSignal.login("A")
  4. OneSignal.User.addEmail("a@a.com")
  5. OneSignal.User.addTag("a", "a")
  6. OneSignal.login("B")
  7. OneSignal.User.addEmail("b@b.com")
  8. OneSignal.User.addTag("b", "b")
  9. Closed app
  10. Turned off Airplane mode
  11. Opened app again
  12. Observed correct REST API calls and checked records on OneSignal dashboard.

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing
    • If it is hard to explain how any codes changes are related to each other then it most likely needs to be more than one PR
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible.
    • Simplify with less code, followed by splitting up code into well named functions and variables, followed by adding comments to the code.
  • I have reviewed this PR myself, ensuring it meets each checklist item
    • WIP (Work In Progress) is ok, but explain what is still in progress and what you would like feedback on. Start the PR title with "WIP" to indicate this.

This change is Reviewable

Added test to prove that grouping logic is not waiting on
opRepoPostCreateDelay for new ids.This has side effect of operations
being done out of order in some cases. A real world case is login
operations can be done out of order, which also results in things like
the push subscription being on the wrong users. There is also a smaller
issue where we are simply not waiting the time limit so some 404 error
were not being avoid but would have still resulted in correct behavior
in the end.
@jkasten2 jkasten2 changed the title WIP [Fix] grouping skipping opRepoPostCreateDelay, causing some operations applying out of order WIP [Fix] grouping skipping opRepoPostCreateDelay, causing login + other operations applying out of order May 14, 2024
@jkasten2 jkasten2 changed the title WIP [Fix] grouping skipping opRepoPostCreateDelay, causing login + other operations applying out of order WIP [Fix] grouping skipping opRepoPostCreateDelay, causing operations being applied out of order when multiple login operations are pending May 14, 2024
opRepoPostCreateDelay like OperationRepo.getNextOps(), this would lead
to some being applied out of order. The only known case is when there
is a login in the queue for an Anonymous User crate and also User A and
another for User B. The anonymous User create happens first with a push
subscription which is correct. Then we have to wait for
opRepoPostCreateDelay time after that create before we can attempt to
identify it has User A, this is also ok. However we skip to User B and
try to create it and transfer the subscription, this isn't correct as
we also should wait opRepoPostCreateDelay for the new subscription id.

While this commit fixes the issue noted above there should be another
mechanism in place to prevent such an issue. This should be done in a
follow up commit.

See PR #2087 for more details.
@jkasten2 jkasten2 changed the title WIP [Fix] grouping skipping opRepoPostCreateDelay, causing operations being applied out of order when multiple login operations are pending WIP [Fix] grouping skipping opRepoPostCreateDelay, causing operations being applied out of order when multiple login operations are pending. (fixes issue since 5.1.10) May 14, 2024
Reversing the list means we insert each item at index 0 to keep the
order correct. Always inserting at index 0 means we no longer need to
keep an index state, which both simples the code but also means we
don't depend on state which may be wrong sometimes, noted in the
warning comment that was removed in this commit. This reversed()
logic for list used in a few places in this file already to solve this
problem.
@jkasten2 jkasten2 changed the title WIP [Fix] grouping skipping opRepoPostCreateDelay, causing operations being applied out of order when multiple login operations are pending. (fixes issue since 5.1.10) [Fix] grouping skipping opRepoPostCreateDelay, causing operations being applied out of order when multiple login operations are pending. (fixes issue since 5.1.10) May 14, 2024
@nan-li
Copy link
Contributor

nan-li commented May 14, 2024

I'm seeing if user A already exists, it 409s, and CreateUser will eventually be called with the pushsub and new email.

The response is parsed and the ID for the email subscription is extracted as the push subscription ID, which has odd side effects of being transferred to user B and the ID used to update the push subscription.

Perhaps because the email was the first item in the returned subscriptions array, the push sub was the second item.

UPDATE: Looks like an existing issue if a CreateUser is sent with an email subscription, not related to this PR.

@nan-li
Copy link
Contributor

nan-li commented May 14, 2024

I'm testing this flow on a new install with network off:

// Init OneSignal
// Anon user creation
OneSignal.login("<EXISTING_EUID_USER_A");
OneSignal.getUser().addTag("a", "a");

OneSignal.login("B");
OneSignal.getUser().addEmail("b@b.com");

Then turn network on.

There's some unexpected ordering and requests made, but eventually everything ends up in the mostly correct states, it seems.

  1. CreateUser "B" sends immediately after the anonymous user is created (email is in payload, no push sub in payload). I think because the transfer-push-sub operation is not grouped since it needs the ID from the anon create.
  2. IdentifyUser "A" sends 5 seconds after the anonymous user is created, returns 409
  3. CreateUser "A" sends with push sub ID in payload (push sub has no token)
  4. Call go GetUser "B" happens 5 second after its create
  5. Call to GetUser "A" happens 5 seconds after its create
  6. Push token received, enqueues a CreateSub with the token, because no push subscription is in the store?
  7. Call to TransferPushSub to "B" (by push sub ID, note that this sub has no token)
  8. Call to CreateSubscription with token in payload for "B", that was previously enqueued. A new push subscription ID is returned.
  9. Any updates to push sub do send to the new push sub ID
  10. Old push sub ID is still the one in the OneSignal-Subscription-Id header

@nan-li
Copy link
Contributor

nan-li commented May 14, 2024

Looking at this scenario again 👆, I think the issue is that for User B, there are 3 operations enqueued:

  • login-user
  • transfer-subscription
  • create-subscription

However, since transfer-subscription is having the 5 seconds block, the other 2 operations go together without waiting, and that creates side effects down the line.

Since opRepoPostCreateDelay is passed to delay() canAccess() should use
inclusive logic.
The OperationRepo can't juggle two different users correctly, so
stall the whole queue by opRepoPostCreateDelay. We plan to address this
limitation in a future PR.
@jkasten2 jkasten2 merged commit 91c9f50 into main May 16, 2024
2 of 3 checks passed
@jkasten2 jkasten2 deleted the fix/canaccess_breaking_order_with_grouping branch May 16, 2024 16:44
@jinliu9508 jinliu9508 mentioned this pull request May 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants