Review error scenario of full network outage (no replica can be reached) #116

Open
zecke opened this issue Apr 24, 2019 · 2 comments
zecke commented Apr 24, 2019

It seems several aspects need to be reviewed and better understood. This issue might become obsolete once Mongo's SDAM (Server Discovery and Monitoring) specification is implemented.

  1. VOMongoRepository class >> #defaultReplicaSetLookupTimeout vs. Socket standardTimeout:
    When trying to find a new primary, possibly only a single replica gets tried in VOMongoReplicationUrlResolver>>#lookupReplicaSetStatusIfFound:, because the socket timeout is large (Mongo>>#open uses the default socket timeout of 45 s).
    It may make sense to pick from knownMongoUrls atRandom so the lookup does not always get stuck on the first entry (see the first sketch after this list).

  2. Do not (attempt to) populate the session pool when there is no primary:
    If >>#basicReset fails, primaryMongoUrl should not be assigned the first URL; it should be left pointing at no server.

  3. Do not execute VOMongoReplicationUrlResolver>>#reset concurrently. Each VOExecutionStrategy calls >>#reset on failure and the last one wins; these calls should be serialized (see the second sketch below). The observed effect is that after a change of primary it takes some time for Voyage to "warm up" again and reach the previous transaction speed, which should not be the case.
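As an illustration of points 1 and 3, here are two minimal sketches. They are not the actual Voyage code: the method bodies, the replicaSetStatusFrom: helper and the resetLock variable are assumptions; only the selector names mentioned above come from the issue.

```smalltalk
"Sketch only: try the known URLs in random order so an unreachable first replica
 does not block the whole lookup for the full 45 s socket timeout.
 replicaSetStatusFrom: is a hypothetical helper; NetworkError is Pharo's
 generic network exception."
lookupReplicaSetStatusIfFound: aBlock
	knownMongoUrls shuffled do: [ :eachUrl |
		[ ^ aBlock value: (self replicaSetStatusFrom: eachUrl) ]
			on: NetworkError
			do: [ :ex | "this replica is unreachable, try the next one" ex return ] ].
	^ nil
```

```smalltalk
"Sketch only: serialize >>#reset behind a Mutex so concurrent failing
 VOExecutionStrategy instances do not rebuild the pool over each other.
 resetLock is a hypothetical instance variable, initialized with Mutex new."
reset
	resetLock critical: [
		self flush.
		self populate ]
```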

zecke commented Apr 24, 2019

Another issue is that some of the pooled Mongo instances are never closed, which eventually leads to >>#populate raising an exception because no more external semaphores can be registered. A sketch of an explicit cleanup follows.
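A minimal sketch of such a cleanup, assuming the resolver keeps its connections in a pool collection and that Mongo instances respond to #close (both are assumptions; the real structure in Voyage may differ):

```smalltalk
"Sketch only: when discarding the pool, explicitly close every pooled Mongo
 connection so its socket (and the external semaphore registered for it) is
 released, instead of leaving that to finalization."
flush
	pool do: [ :mongo |
		[ mongo close ] on: Error do: [ :ex | ex return ] ].
	pool := OrderedCollection new
```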

zecke commented Apr 24, 2019

With a session pool of at most 10 entries I end up with 101 open Mongo connections during a step-down from the old primary to the new one.

This seems to be due to a lack of concurrency control in the VOMongoRepositoryResolver. When a query fails, either >>#resetPool or just >>#reset is invoked. In the latter case the existing pool is flushed and a new one is created. There are three race conditions.

  1. >>#flush and >>#populate are run concurrently, so new connections get added to a pool that has already been flushed and discarded. This can be mitigated by setting the desired pool size to 0 before calling flush (see the sketch after this list); it does seem to fix the issue I am seeing.

  2. With multiple calls to >>#reset happening at the same time we can populate or discard pools we have not even seen yet. There should be some locking around the pool...
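A sketch of the mitigation from point 1, assuming the resolver exposes the desired pool size via a poolSize: setter and remembers its configured maximum via maxPoolSize (both names are assumptions based on the issue, not the actual Voyage selectors):

```smalltalk
"Sketch only: drop the target pool size to zero before flushing so a concurrent
 >>#populate has nothing to add to the discarded pool, then restore the size and
 repopulate."
resetPool
	self poolSize: 0.
	self flush.
	self poolSize: self maxPoolSize.
	self populate
```

For point 2, the same Mutex-based serialization sketched under the first comment would also cover >>#resetPool.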
