
HMAC keys thread hits Invariant failure !_committedSnapshot || *_committedSnapshot < nameU64 during replset startup #112

IanWhalen opened this issue Oct 17, 2017 · 7 comments


@IanWhalen

https://jira.mongodb.org/browse/SERVER-31598

Also, @igorcanadi, please let me know whether this is still the best way to communicate these kinds of failures to your team.

@igorcanadi
Contributor

Hey @IanWhalen, thanks for reporting. This failure is most likely related to the change in how MongoDB manages oplog snapshots (same as #106 and #102). I would expect many more failures until we adapt to MongoDB's new storage engine API.

I have spent some time researching https://jira.mongodb.org/browse/SERVER-28620 and unfortunately adapting to the new behavior will most likely require changes in RocksDB itself. There are two behaviors that are currently unsupported in RocksDB:

  1. Assigning timestamps to transactions. We might be able to work around this if we keep an in-memory mapping from timestamps to RocksDB sequence numbers (see the sketch right after this list). However, that mapping would go away after a restart. Is that okay?
  2. Even if we are able to map timestamps to RocksDB sequence numbers, we have no way of "slicing" a single transaction, i.e. assigning different timestamps to different writes within one transaction. This is what the "Open question" section of the design document addresses.
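
A minimal sketch of what that in-memory mapping could look like, assuming a mutex-guarded ordered map keyed by timestamp; the class and method names are made up for illustration and are not actual mongo-rocks code:

```cpp
// Sketch only: remember which RocksDB sequence number corresponds to each
// MongoDB timestamp, so a read "as of timestamp T" can use a snapshot at the
// newest sequence number recorded at or before T. The mapping is lost on restart.
#include <cstdint>
#include <map>
#include <mutex>
#include <optional>

class TimestampToSeqnoMap {
public:
    // Record that all writes up to RocksDB sequence number `seqno` are
    // visible at `timestamp`.
    void record(uint64_t timestamp, uint64_t seqno) {
        std::lock_guard<std::mutex> lk(_mutex);
        _map[timestamp] = seqno;
    }

    // Return the newest sequence number recorded at or before `timestamp`,
    // or std::nullopt if nothing that old has been recorded.
    std::optional<uint64_t> seqnoAtOrBefore(uint64_t timestamp) const {
        std::lock_guard<std::mutex> lk(_mutex);
        auto it = _map.upper_bound(timestamp);  // first entry strictly newer
        if (it == _map.begin()) {
            return std::nullopt;
        }
        return std::prev(it)->second;
    }

    // Trim entries older than the oldest timestamp still needed by readers.
    void truncateBefore(uint64_t timestamp) {
        std::lock_guard<std::mutex> lk(_mutex);
        _map.erase(_map.begin(), _map.lower_bound(timestamp));
    }

private:
    mutable std::mutex _mutex;
    std::map<uint64_t, uint64_t> _map;  // timestamp -> RocksDB sequence number
};
```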

I originally thought the changes would be small, but unfortunately this project appears to have a bigger scope.

@milkie
Contributor

milkie commented Oct 18, 2017

Hi Igor,
Yes, it will be okay to lose timestamp data when restarting; WiredTiger will have the same behavior. We are planning on changing the logging (journaling) system to log (journal) only the oplog, and not any other tables. Checkpoints will be done on the data as of the majority-level timestamp. After a restart, a node would come back up with the data as of the last checkpoint-at-a-timestamp, and the replication subsystem would then replay oplog entries forward from the majority point, thus restoring timestamps for all writes after that point.
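
A rough sketch of that startup sequence, with hypothetical types standing in for the storage engine and the replication recovery code (none of these names are MongoDB's actual API, and payloads and error handling are omitted):

```cpp
#include <cstdint>
#include <vector>

struct OplogEntry {
    uint64_t timestamp;  // timestamp assigned to this replicated write
    // ... operation payload omitted ...
};

struct StorageEngine {
    // Restore data tables to the last checkpoint-at-a-timestamp and return
    // that checkpoint's timestamp (the majority point).
    uint64_t recoverToLastCheckpoint() { return 0; }

    // The oplog is journaled, so entries written after the checkpoint are
    // still present after a restart.
    std::vector<OplogEntry> readOplogAfter(uint64_t /*ts*/) { return {}; }
};

struct ReplicationRecovery {
    void applyEntry(const OplogEntry& /*entry*/) {}
};

// Startup: recover to the checkpoint, then replay the oplog forward, which
// also re-establishes the timestamp of every write after the majority point.
void startupRecovery(StorageEngine& engine, ReplicationRecovery& repl) {
    const uint64_t checkpointTs = engine.recoverToLastCheckpoint();
    for (const OplogEntry& entry : engine.readOplogAfter(checkpointTs)) {
        repl.applyEntry(entry);
    }
}
```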
For #2, we might be able to make replication not slice transactions when the engine doesn't support it (this would incur a performance penalty but might otherwise work).

@igorcanadi
Contributor

> Yes, it will be okay to lose timestamp data when restarting; WiredTiger will have the same behavior.

Awesome! Makes things much easier on our side.

> We are planning on changing the logging (journaling) system to log (journal) only the oplog, and not any other tables.

Hm, if I understand correctly, this would enable the oplog to be fully independent of the storage engine, right? Can we share the oplog implementation across engines? ;)

> For #2, we might be able to make replication not slice transactions when the engine doesn't support it (this would incur a performance penalty but might otherwise work).

It would be great if we could get there easily. RocksDB is planning a feature that would let us slice transactions, but it won't be ready soon.

@milkie
Contributor

milkie commented Oct 23, 2017

> Hm, if I understand correctly, this would enable the oplog to be fully independent of the storage engine, right? Can we share the oplog implementation across engines? ;)

That is something we have considered doing in the future, but today's code still requires oplog writes to be transactional with everything else. That is, one transaction needs to atomically commit writes to both the oplog table and other data tables.
The idea with logging only the oplog is that it gets us closer to merging the oplog and the journal, such that one entity serves both purposes. One milestone on this road is to have rollback work by recovering to a point-in-time prior to the point where the node diverged, and then playing the oplog forward from that point. We are configuring WiredTiger to take checkpoints at a point-in-time (the majority level, in fact), and then at rollback time we can "forget" all non-checkpointed writes by restarting the storage engine -- thus restoring the data to a consistent point-in-time, without having to roll back individual writes as we do today.

Slicing transactions today occurs only on secondaries, as an optimization for batching together multiple insert ops; disabling this optimization is easy, but we'd need to thread a new function call into the storage engine API so replication can know whether slicing is supported.
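
For illustration, that capability check could be as small as one virtual method on the storage engine interface; the method name below is invented, not the actual MongoDB StorageEngine API:

```cpp
// Hypothetical sketch of the capability query mentioned above. Replication
// would consult it before slicing a batch into per-timestamp pieces.
class StorageEngine {
public:
    virtual ~StorageEngine() = default;

    // Returns true if a single transaction's writes may be given different
    // timestamps ("sliced"). An engine that cannot support this (e.g. a
    // RocksDB-backed engine today) returns false, and replication falls back
    // to committing one transaction per timestamp.
    virtual bool supportsTimestampSlicing() const { return false; }
};
```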

@igorcanadi
Contributor

> Slicing transactions today occurs only on secondaries, as an optimization for batching together multiple insert ops;

Can I just commit the previous transaction when a timestamp changes? Is there a possibility of rollback even on the secondaries?

@milkie
Contributor

milkie commented Oct 30, 2017

Secondaries will not call rollback_transaction() for replication. I'm not sure how you could commit a transaction in the middle, though; you'd have to begin a new transaction at that moment as well, so that the subsequent call to commit_transaction() would work.

@igorcanadi
Contributor

> I'm not sure how you could commit a transaction in the middle, though; you'd have to begin a new transaction at that moment as well, so that the subsequent call to commit_transaction() would work.

Yup, exactly -- commit all writes and transparently start a new transaction. That'll work!
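
A rough sketch of that "commit and transparently restart" idea against RocksDB's pessimistic transaction API; the wrapper class and its method names are invented for illustration, and Status return values are ignored for brevity:

```cpp
#include <cstdint>
#include <memory>

#include <rocksdb/utilities/transaction.h>
#include <rocksdb/utilities/transaction_db.h>

// Sketch: when replication sets a new timestamp mid-batch, commit the writes
// accumulated so far and transparently begin a fresh transaction, so each
// timestamp ends up in its own RocksDB transaction.
class TimestampedWriterSketch {
public:
    explicit TimestampedWriterSketch(rocksdb::TransactionDB* db) : _db(db) {}

    // Called before writes that carry a (possibly new) timestamp.
    void setTimestamp(uint64_t ts) {
        if (_txn && ts != _currentTimestamp) {
            _txn->Commit();  // error handling omitted in this sketch
            _txn.reset();
        }
        _currentTimestamp = ts;
    }

    void put(const rocksdb::Slice& key, const rocksdb::Slice& value) {
        if (!_txn) {
            // Transparently begin a new transaction for the next timestamp.
            _txn.reset(_db->BeginTransaction(rocksdb::WriteOptions()));
        }
        _txn->Put(key, value);  // error handling omitted in this sketch
    }

    // Commit whatever is left at the end of the batch.
    void commit() {
        if (_txn) {
            _txn->Commit();  // error handling omitted in this sketch
            _txn.reset();
        }
    }

private:
    rocksdb::TransactionDB* _db;
    std::unique_ptr<rocksdb::Transaction> _txn;
    uint64_t _currentTimestamp = 0;
};
```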
