feat: implement aborted transactions retry mechanism #1
Conversation
Two tests failed:
This seems to me to be correct behavior: the data changed concurrently, so the results of the retried transaction are not equal to the results of the original transaction.
Some questions about the implementation and a note about failing tests.
```python
row = iter_rows.pop(0)
if self._results_checksum is not None:
    self._results_checksum.consume_result(row)
    _compare_checksums(
        self._original_results_checksum, self._results_checksum
    )
yield row
```
The Java version actually consumes the results up to the same point that the user had reached before retrying, and only then does the comparison: https://github.com/googleapis/java-spanner/blob/059ef1ef1f03e80f4ff2705b45a62c553bb4e83d/google-cloud-spanner/src/main/java/com/google/cloud/spanner/connection/ChecksumResultSet.java#L145-L153. Here we actually yield the rows used in the comparison. How come this doesn't result in the user getting the same rows twice after an internal retry?
It's also a bit confusing to see this do the checksum comparison with each row. I know `_compare_checksums` passes if it sees fewer retried rows than original, but that's not what I'd expect it to do without reading it.
@olavloite, @skuruppu mentioned you might be able to review and compare this to the java client behavior.
> How come this doesn't result in the user getting the same rows twice after an internal retry?
It seems to me that a user will not use rows read within a transaction in operations outside the transaction until the transaction is finished. That probably sounds abstract, so consider this case:
A user starts a transaction, reads several rows, then calls their own (non-Spanner) API, which uses those rows to write some data somewhere. Then they call more functions in the same transaction and commit it. This is an incorrect usage pattern: one should not do anything with data that the transaction is using at that moment, because the transaction can fail, in which case all of its data/changes will be rolled back; but that will not roll back the records made through the user's API, so the user ends up with desynchronized data.
With that in mind, data read within a transaction is considered temporary, inexact-until-transaction-commit data, and if we're retrying a transaction, all of that data should be re-read and recalculated. That's how I see it.
> It's also a bit confusing to see this do the checksum comparison with each row. I know `_compare_checksums` passes if it sees fewer retried rows than original, but that's not what I'd expect it to do without reading it.
Yeah, that's something I'm not 100% happy with either, but the retry logic looked too scattered across classes and functions to me, so I decided to push many things deep under the hood. Thus, in the transaction class we don't think about the number of results in a result set of one of the transaction's operations; we just call `==`, and it takes care of such things deeper in the result set itself. The result set object is much closer to these details than the transaction.
I have absolutely no Python experience, so bear with me if I misread or misunderstood (parts of) the code here, but if I understand it correctly the suggested implementation for Python here does the following (during the initial transaction):
- The `__iter__` method will normally (i.e. in the original transaction) iterate over the `PartialResultSet`s that are received from Spanner, and for each of these iterate over the rows in that `PartialResultSet`.
- If the `ResultSet` has a checksum, it will update the running checksum with the contents of the current row.
- It then checks whether the current checksum is equal to the checksum of the original transaction. If this is the original transaction and not a retry, the value of `_original_results_checksum` will be `None` and the comparison will always succeed.
- It then returns the row.
If a transaction is retried, the following happens:
- The `__iter__` method will iterate over the `PartialResultSet`s that are received from Spanner for the retried query, and for each of these iterate over the rows in that `PartialResultSet`.
- If the `ResultSet` has a checksum, it will update the running checksum with the contents of the current row.
- It then checks whether the current checksum is equal to the checksum of the original transaction. As far as I can see, this would always fail if the result set contained more than 1 row, as it is a running checksum.
Assume that a query returns the following rows during the original transaction:
Key | Value |
---|---|
1 | One |
2 | Two |
During the original transaction the entire result set was consumed so the running checksum was calculated based on all the rows:
- After row 1 the checksum is `C1`.
- After row 2 the checksum is `C2`.

So the original checksum is now `C2`.
During a retry the query will be retried and the checksum will be calculated again (I assume that the results during the retry are equal to the original query):
- After row 1 the checksum is `C1`. That is compared to the original checksum, which is `C2`, and the comparison will fail.
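To make that concrete, here is a small sketch using `hashlib.sha256` as a stand-in for the client's checksum (hypothetical, not the actual implementation): comparing the running checksum per row mismatches even on identical data, while comparing after the retry has consumed the same number of rows succeeds.

```python
import hashlib
import pickle

def consume(checksum, row):
    # Fold one row into the running checksum.
    checksum.update(pickle.dumps(row))

# Original transaction consumed both rows before aborting.
original = hashlib.sha256()
consume(original, (1, "One"))
consume(original, (2, "Two"))

# Retry returns the same rows.
retry = hashlib.sha256()
consume(retry, (1, "One"))
# Comparing now (C1 vs C2) fails even though the data is identical:
assert retry.digest() != original.digest()

consume(retry, (2, "Two"))
# Comparing after the same number of rows succeeds:
assert retry.digest() == original.digest()
```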
The checksum should therefore not be compared for each row, but after consuming the same number of rows during the retry as during original query.
(But I get a feeling that there is a more fundamental misunderstanding about this retry logic. I'll elaborate on that in a separate comment.)
Hi, @olavloite
> It then checks whether the current checksum is equal to the checksum of the original transaction. As far as I can see, this would always fail if the result set contained more than 1 row, as it is a running checksum.
Yeah, that's the thing @c24t mentioned as something we don't expect. In fact, I'm redefining the `==` operator for the `ResultsChecksum` object, so that it compares checksums this way:
```python
if original is not None:
    if retried != original:
        if not retried < original:  # retried has fewer results
```
The first line checks whether we're in a retried transaction; if not, there is nothing to compare.
The second checks whether the checksums are equal, taking their length (number of consumed results) into account.
And the third line checks whether the retried transaction has reached the original transaction's failure point.
So, if the given checksums have different lengths, it's considered fine so far, because either a) the retried transaction is not yet at the point where the original failed, so it's too soon to compare checksums, or b) the retried transaction went farther than the original transaction, which means the checksums were already compared and were equal at the original transaction's failure point.
Looking at this, I think there can be a problem when the retried transaction goes farther than the original. I'll double-check.
@c24t, hm-m, I think the logic here can be a little bit simpler if I use the `__len__` magic method.

UPDATE: yes, it looks better, PTAL
@IlyaFaer I get the impression that there is a misunderstanding about when retry logic based on checksums is needed for aborted transactions. The Java client contains two different APIs for interacting with Cloud Spanner:
The way that the Connection API keeps track of the results per statement depends on the type of statement:
From what I see in this PR, the checksum retry logic is being added to the 'normal' client. Retries based on checksums are only needed if client applications can interact with Cloud Spanner in a more traditional way through some kind of 'connection' object (or similar) that does not require you to define a transaction as a function. Consider the following (pseudo-)code example, based on a simple example of how to work with MySQL in Python:

```python
try:
    connection = mysql.connector.connect(host='spanner.googleapis.com',
                                         database='Electronics',
                                         user='pynative',
                                         password='pynative@#29')
    if connection.is_connected():
        connection.autocommit = False
        cursor = connection.cursor()
        cursor.execute("SELECT * FROM Foo;")
        for row in cursor:
            print(row)
        cursor.execute("UPDATE Foo SET Bar = 'baz' WHERE Id = 2")
        connection.commit()
except Error as e:
    connection.rollback()
    print("Error ", e)
finally:
    if connection.is_connected():
        cursor.close()
        connection.close()
        print("Spanner connection is closed")
```

Cloud Spanner could, for example, abort the transaction before or during one of these statements. All of the above should be transparent to the client application (except for a delay while the aborted transaction is retried). Please don't hesitate to ask if you have additional questions on this (or let me know if I completely misunderstood how the current PR works).
Yes, that's correct. The thing is that the connection API we have for now actually delegates to the original Spanner client. In fact, our Python connection API always works in autocommit mode. Thus, our usage is a little bit different (at least in some parts) from Java: our transactions are always functions, and commit/rollback are called under the hood. We could probably write several wrappers in the connection API so that checksum comparison is used only there, and not in the traditional Spanner client. But that looks like overkill, as we would have to inherit from and then override several original Spanner methods from several classes, and the new methods would be very similar to the original ones, maybe 15-20% different. It's probably easier to make the checksum comparison switchable (and always enabled when using the connection API), if it's critical not to compare checksums in the original Spanner client. For now I don't see a simpler way to add checksum comparison into the connection API without adding it into the original Spanner client, since making the comparison work requires changes in several classes that embed each other at different levels.
Thanks for the clarification, that makes sense.
Uhm... If it is not the intention to change that, then you don't need retry logic for transactions, as there are no transactions.
Update statements will also not abort, even though they do use transactions, as any aborted transaction is handled by the normal retry logic:
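That normal retry logic for function-style transactions is essentially a retry loop around the whole unit of work. A minimal sketch (hypothetical names and a simulated abort; not the actual client code):

```python
class Aborted(Exception):
    """Stand-in for the client's Aborted error."""


def run_in_transaction(work, max_attempts=5):
    # Retry the entire transaction function when the transaction aborts,
    # so all reads are re-executed inside the new attempt and no
    # checksum bookkeeping is needed.
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except Aborted:
            if attempt == max_attempts:
                raise


attempts = []

def work():
    attempts.append(1)
    if len(attempts) < 2:
        raise Aborted()  # simulate an abort on the first attempt
    return "committed"

result = run_in_transaction(work)
```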
@olavloite thanks for the detailed explanation. That is also my understanding. This fix relates to googleapis#34 and, based on the discussion and example, it seems that the intention is to change the existing retry mechanism.
Yeah, this is getting a little bit confusing, as there are a number of different topics getting mixed up. Issue googleapis#34 describes a feature request for
So:
As explained above, the normal retry strategy that uses a
Ok, that makes more sense to me now @olavloite, thanks so much for making sense of it. So if the intention is to support
* feat: add PITR-lite support
* fix: remove unneeded conversion for earliest_version_time
* test: fix list_databases list comprehension
* feat: add support for PITR-lite backups (#1)
* Backup changes
* Basic tests
* Add system tests
* Fix system tests
* Add retention period to backup systests
* style: fix lint errors
* test: fix failing backup system tests (googleapis#2)
* Remove unnecessary retention period setting
* Fix systests
* Review changes (googleapis#3)
* Remove unnecessary retention period setting
* Fix systests
* Review changes
* style: fix lint

Co-authored-by: larkee <larkee@users.noreply.github.com>
Co-authored-by: Zoe <zoc@google.com>
This is the first part of the transaction retry mechanism implementation. Here is what it consists of:

- The `ResultsChecksum` class is written according to the design and covered with unit tests.
- Every transaction inits a checksum property on creation.
- The `Transaction.execute_update()` and `Transaction.batch_update()` methods now add their results into the transaction checksum.
- `StreamedResultSet` now has an optional constructor arg, `checksum`. If a checksum is passed, the result set will push every streamed row into the checksum before returning it to the user.

At the next step I'm going to implement propagation of the transaction checksum into the result sets of operations run in this transaction.