Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tasks/WP-190: Handle concurrency with Tapis OAuth Token Refresh #932

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

chandra-tacc
Copy link
Collaborator

@chandra-tacc chandra-tacc commented Jan 25, 2024

Overview

When Tapis OAuth token expires for a user and multiple tapis api calls are requested concurrently (example: page refresh), all of them send requests to Tapis to refresh token. This duplicate request is a waste of resource and slows performance.

The solution is to only make one request per user when token expires. Any solution should work with requests distributed across multiple processes.

Possible Solutions

django select_for_update

  • operates at the row level, Tapis OAuth info is one row per user. So, select_for_update a convenient mechanism to do distributed locking.
  • allows for transactions to wait until they acquire the lock. This allows for incoming requests to not fail because they cannot acquire lock.

django-db-mutex

  • allows you to create higher-level named locks that can be used to protect larger sections of code or critical operations. This needs a third party library installation
  • If a request cannot acquire a mutex because it is already locked, the db-mutex throws an error. This will fail the request if a retry-wait like mechanism is not implemented.

This PR uses select_for_update since is readily available in django and has waits.

Related

Changes

  • Uses a feature knob to allow disabling this in case of issues. This is due to lack of easy ability to test E2E with multiple users and various scenarios.
  • Uses select_for_update for a specific user, check if the updated data has non-expired token. Otherwise, refreshes the token.
  • There are expiry checks at two places - one before acquiring a lock (on current model data) and one after acquiring row lock, but with latest db data.

Testing

Expire token

docker exec -it portal_django /bin/bash
from django.contrib.auth.models import User
u = User.objects.get(username='cyemparala')
u.tapis_oauth.expires_in = 1
u.tapis_oauth.save()

Enable Setting

In settings_local.py, add entry:
ENABLE_OPTIMIZED_OAUTH_REFRESH = True

Test cases:

Whole page refresh - triggers multiple concurrent requests with expired token:
  • Go to dashboard on cep.test
  • Expiry token
  • Refresh dashboard page.
  • See logs
    ** Only one "Refreshing tapis oauth token" message is seen, rest are all waiting for row lock
    ** 2 of waiting transactions, get the update access token info and process the request.
core_portal_django         | [DJANGO] INFO 2024-01-25 19:38:00,932 UTC models portal.apps.auth.models.optimized_client:99: Tapis OAuth token expired
core_portal_django         | [DJANGO] INFO 2024-01-25 19:38:00,933 UTC models portal.apps.auth.models.optimized_client:104: Refreshing tapis oauth token
core_portal_django         | [DJANGO] INFO 2024-01-25 19:38:00,952 UTC models portal.apps.auth.models.optimized_client:99: Tapis OAuth token expired
core_portal_django         | [DJANGO] INFO 2024-01-25 19:38:01,142 UTC models portal.apps.auth.models.optimized_client:99: Tapis OAuth token expired
core_portal_django         | [DJANGO] INFO 2024-01-25 19:38:01,450 UTC models portal.apps.auth.models.optimized_client:114: Token expired, but already refreshed. Refreshing token from DB.
core_portal_django         | [DJANGO] INFO 2024-01-25 19:38:01,762 UTC basehttp django.server.log_message:212: "GET /api/workspace/jobs/listing?offset=0&limit=50&query_string= HTTP/1.0" 200 421569
core_portal_django         | [DJANGO] INFO 2024-01-25 19:38:01,765 UTC models portal.apps.auth.models.optimized_client:114: Token expired, but already refreshed. Refreshing token from DB.
Single request
  • Expiry token
  • Click on history menu item in cep.test
  • See logs
    ** Only one "Refreshing tapis oauth token" message is seen, rest are all waiting for row lock
    ** No other request acquired row lock because they already saw the non-expired token
core_portal_nginx          | 172.22.0.1 - - [25/Jan/2024:20:30:35 +0000] "PATCH /api/notifications/ HTTP/2.0" 200 17 "https://cep.test/workbench/history/jobs" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" "-"
core_portal_django         | [DJANGO] INFO 2024-01-25 20:30:35,397 UTC models portal.apps.auth.models.optimized_client:98: Tapis OAuth token expired
core_portal_django         | [DJANGO] INFO 2024-01-25 20:30:35,398 UTC models portal.apps.auth.models.optimized_client:104: Refreshing tapis oauth token
core_portal_django         | [DJANGO] INFO 2024-01-25 20:30:36,262 UTC basehttp django.server.log_message:212: "GET /api/workspace/jobs/listing?offset=0&limit=50&query_string= HTTP/1.0" 200 421569
core_portal_nginx          | 172.22.0.1 - - [25/Jan/2024:20:30:36 +0000] "GET /api/workspace/jobs/listing?offset=0&limit=50&query_string= HTTP/2.0" 200 30057 "https://cep.test/workbench/history/jobs" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" 
Basic UI Sanity Tests

Ran through basic UI sanity tests to ensure acquiring client does not fail

UI

Notes

Copy link

codecov bot commented Jan 25, 2024

Codecov Report

Attention: Patch coverage is 23.33333% with 23 lines in your changes are missing coverage. Please review.

Project coverage is 64.71%. Comparing base (461e53c) to head (0baa7b0).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #932      +/-   ##
==========================================
- Coverage   64.81%   64.71%   -0.10%     
==========================================
  Files         434      434              
  Lines       12482    12509      +27     
  Branches     2573     2608      +35     
==========================================
+ Hits         8090     8095       +5     
- Misses       4161     4183      +22     
  Partials      231      231              
Flag Coverage Δ
javascript 69.83% <ø> (ø)
unittests 59.55% <23.33%> (-0.18%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
server/portal/settings/settings.py 0.00% <0.00%> (ø)
server/portal/apps/auth/models.py 52.00% <25.00%> (-16.00%) ⬇️

@chandra-tacc chandra-tacc marked this pull request as ready for review January 25, 2024 23:15
Copy link
Member

@rstijerina rstijerina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good I think

server/portal/apps/auth/models.py Outdated Show resolved Hide resolved
server/portal/apps/auth/models.py Outdated Show resolved Hide resolved
chandra-tacc and others added 2 commits February 9, 2024 13:21
Co-authored-by: Sal Tijerina <r.sal.tijerina@gmail.com>
@nathanfranklin
Copy link
Member

I haven’t had the chance to test it yet, but I appreciate the detailed PR description, particularly the 'Possible Solutions' section 💯

Copy link
Member

@rstijerina rstijerina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

client.refresh_tokens()
except Exception:
logger.exception('Tapis Token refresh failed')
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On failure, perhaps we should call logout

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, will test it and share info.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current logic is all in a model, can't do http redirect or control view responses from here. Have to do from view. I setup an custom exception and handled it in Base View to send 401 back to client. On testing, by forcing an error - it turned the 401 to redirect to tapis oauth, but that failed due to CORS policy issue, I have to check if this is local setup or something else.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came to that realization as well today. I tried another solution with DesignSafe, which is to put the refresh logic in a middleware. In CEP we originally moved that logic from middleware to the client() method, but I think the solution you propose here might solve the original issue there?

Haven't tested yet, what are your thoughts?
https://github.com/DesignSafe-CI/portal/blob/task/DES-2702--tapis-v3-oauth/designsafe/apps/auth/middleware.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rstijerina - sorry for delay in response, I missed this note.
I looked at the code in that branch. It looks good, one comment on overall integration:

  • Should you do this also for logout?
    logout(request)
    return HttpResponseRedirect(reverse('designsafe_auth:login'))

Some testing aspects:

  • Behavior on xhr requests when tapis token expiry fails. If xhr does not handle 302 cleanly, some extra check and redirect might be needed.
  • Walking through the code, it is protected from infinite loop through this(which is good). If refresh fails, goes to login, which has to authenticate with tapis, if authentication works, this middleware immediately returns (because there is no expiry) and move away from this middleware.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should use do this also for logout?
logout(request)
return HttpResponseRedirect(reverse('designsafe_auth:login'))

Yes, thanks. Added:
https://github.com/DesignSafe-CI/portal/blob/task/DES-2709--v3-apps-views/designsafe/apps/auth/middleware.py

Behavior on xhr requests when tapis token expiry fails. If xhr does not handle 302 cleanly, some extra check and redirect might be needed.

Can you expand more on this part? Where would tapis token expiry fail?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Behavior on xhr requests when tapis token expiry fails. If xhr does not handle 302 cleanly, some extra check and redirect might be needed.

Can you expand more on this part? Where would tapis token expiry fail?
I meant - "Behavior on xhr requests when tapis token expires, and the refresh fails - this will hit the logout code and send a 302 back to client. If javascript side of response handling does not handle 302 cleanly (page rendering after 302, etc), may be extra logic need be needed to check for 302 status and specific error type(token expired) and then setting location href to logout".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rstijerina - regarding this PR, if middleware is the right place for auth and if it is working in designsafe, should I do the same here and start testing. What is your opinion?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution I have in DesginSafe does work, here are example logs from a refresh that just occurred for me:

des_django         | [DJANGO] INFO 2024-04-12 14:28:57,764 middleware designsafe.apps.auth.middleware.process_request:49: Tapis OAuth token expired for user sal. Refreshing token
des_django         | [DJANGO] INFO 2024-04-12 14:28:57,769 middleware designsafe.apps.auth.middleware.process_request:49: Tapis OAuth token expired for user sal. Refreshing token
des_django         | [DJANGO] INFO 2024-04-12 14:28:57,775 middleware designsafe.apps.auth.middleware.process_request:61: Refreshing Tapis OAuth token
des_django         | [DJANGO] INFO 2024-04-12 14:29:02,626 middleware designsafe.apps.auth.middleware.process_request:72: Token updated by another request. Refreshing token from DB.

It might not be fool-proof though, and could definitely use more testing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could talk about best place for token refresh in the next infra scrum?

Copy link
Contributor

@edmondsgarrett edmondsgarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! And tested well

@chandra-tacc
Copy link
Collaborator Author

Copy link
Contributor

@edmondsgarrett edmondsgarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@chandra-tacc chandra-tacc marked this pull request as draft April 19, 2024 22:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants