Thoughts for a sharding counter in a "stressed environment" #107

Open
trollkotze opened this issue May 7, 2018 · 2 comments
@trollkotze

I had problems implementing a sharding counter in Datastore that would hold up under stress. When there was a lot of traffic (syncing huge amounts of data), with many entity update and insert calls started at once and a sharding counter increment inside a transaction added for each insert, the network was too congested to complete any single increment within 60 seconds (the maximum lifetime of a transaction).

The thing is, Datastore has no problem with long-running update commands, as long as they are not inside a transaction. When dozens of updates are started at once and the network is congested, a single update can take much longer than 60 seconds; Datastore keeps the connection for every command alive for quite a long time and succeeds even after many minutes. For transactions, of course, this is not allowed.

I think a sharding counter would be a great feature for this package, but it might not be so easy to make it usable and robust even in a "handicapped" or "stressed" environment like this.

Just some food for thought, maybe to work on when there is time. If I gather more experience and ideas, I'll try to get back here as well.

Cheers!

@trollkotze
Author

Long rambling, quoted for reference from where the issue was first formulated (and now removed from there):

@sebelga Thanks for getting back!

Can I ask what you mean by "transactions expiring too quickly"?

Maybe it's better to put it the other way around: the operations inside the counter increment transaction (which is really not much more than `get(count); count += increment; save(count)`) run too slowly. So the transaction expires (after 60 seconds, that is, if my understanding is correct).

My sharding counter at the moment does the following, in parallel with saving x new entities of a kind (see the sketch after this list):

  1. pick a random shard
  2. create and start a transaction, and inside that transaction do:
    a) get the current value of that shard.
    b) increment by x and save it again.
    c) commit the transaction
  3. On failure: roll back the transaction. (Even that would fail under heavy load, with the same "transaction expired" error, but I guess that should be okay then? In that case, though, I'd wonder what the rollback method is actually for, if a failed transaction would not have had any effect anyway. It seems I don't really understand transactions.)
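
Roughly, in code it looks something like this (a simplified sketch using the @google-cloud/datastore client directly; the kind name `CounterShard`, `NUM_SHARDS` and the helper name `incrementCounter` are just made up for illustration):

```js
const { Datastore } = require('@google-cloud/datastore');

const datastore = new Datastore();
const NUM_SHARDS = 10;

async function incrementCounter(counterName, x) {
  // 1. pick a random shard
  const shardId = Math.floor(Math.random() * NUM_SHARDS);
  const key = datastore.key(['CounterShard', `${counterName}-${shardId}`]);

  // 2. create and start a transaction
  const transaction = datastore.transaction();
  await transaction.run();

  try {
    // 2a. get the current value of that shard
    const [shard] = await transaction.get(key);
    const count = (shard && shard.count) || 0;

    // 2b. increment by x and save it again
    transaction.save({ key, data: { count: count + x } });

    // 2c. commit the transaction
    await transaction.commit();
  } catch (err) {
    // 3. on failure: roll back (this can itself fail with the same
    // "transaction expired" error under heavy load)
    await transaction.rollback().catch(() => {});
    throw err;
  }
}
```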

The error I am commonly getting under heavy load is:
`Error: 3 INVALID_ARGUMENT: The referenced transaction has expired or is no longer valid`

If I understood correctly, the lifetime of a datastore transaction is 60 seconds. So that shows me that my simple getting and saving of a counter inside that transaction is really slow. But as I said, this happens only under heavy load. When doing a few single updates, the counting works.

(Maybe my problem is just the not-so-high bandwidth of my internet connection in my local dev environment, and it would be a total non-issue when hosted on App Engine, where we want to deploy our app. But if a mere short slowdown like this breaks crucial functionality, then I think that is unacceptable, so I'd better fix it.)

I retry the above with exponential backoff. (Yes, I'm creating a new transaction on every retry, not reusing the old txn object and failing because of that.) When it still fails, I cache the unprocessed increment amount in memory and let it be picked up and added to the next increment.
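
The retry part is roughly this (a sketch only; `incrementCounter` is the helper from the sketch above, and the delay values are just examples):

```js
// Retry the increment with exponential backoff, creating a brand new
// transaction on every attempt (inside incrementCounter).
async function incrementWithBackoff(counterName, x, maxRetries = 5) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await incrementCounter(counterName, x);
    } catch (err) {
      // wait 100ms, 200ms, 400ms, ... before the next attempt
      await new Promise((resolve) => setTimeout(resolve, 100 * 2 ** attempt));
    }
  }
  // still failing: the caller keeps the unprocessed amount in memory
  // and adds it to the next increment
  throw new Error('Counter increment failed after all retries');
}
```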

This is a really bad and unsafe workaround, because I don't know when that next increment will happen. The app might crash or be restarted before it does, and the count could stay wrong for a long time.

Meanwhile, the simultaneously running "normal" (i.e. non-transactional) saves of the entities which I am counting succeed without problem. Only the transactional counter updates get stuck in this exponential backoff retry-fail-again loop.

But now, thinking about it: perhaps I should not even allow multiple increments to be attempted in parallel (they would then all retry with exponential backoff, mostly fail, slow each other down, and lead to more accumulating failures), but instead just turn it into a queue (sketched after the two lists below):

When a count increment by x should be done:

  1. Increment an in-memory "unsynced" counter for that entity kind by x.
  2. a) If a count transaction for that entity kind is already running, return.
    b) Else, start a count transaction, trying to increment by the current "unsynced" amount.

When a count transaction finishes:

  1. a) successfully: decrement the in-memory "unsynced" counter by the amount that has been counted in the transaction.
    b) failed: leave the "unsynced" counter as it is
  2. If the in-memory "unsynced" counter is > 0, start a new count transaction, trying to increment by the current "unsynced" amount.
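
Roughly, that queueing idea would look like this (again only a sketch, reusing the `incrementCounter` helper from above; nothing here exists in the package):

```js
const unsynced = {}; // entity kind -> increments not yet persisted
const inFlight = {}; // entity kind -> is a count transaction currently running?

function queueIncrement(kind, x) {
  // 1. increment the in-memory "unsynced" counter for that kind by x
  unsynced[kind] = (unsynced[kind] || 0) + x;

  // 2a. if a count transaction for that kind is already running, return
  if (inFlight[kind]) return;

  // 2b. else start a count transaction for the current "unsynced" amount
  runCountTransaction(kind);
}

async function runCountTransaction(kind) {
  inFlight[kind] = true;
  const amount = unsynced[kind];
  try {
    await incrementCounter(kind, amount);
    // success: decrement the in-memory "unsynced" counter by what was counted
    unsynced[kind] -= amount;
  } catch (err) {
    // failure: leave the "unsynced" counter as it is
  } finally {
    inFlight[kind] = false;
    // if anything is still unsynced, start the next count transaction
    if (unsynced[kind] > 0) runCountTransaction(kind);
  }
}
```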

I had thought that, with multiple shards, it should not be a problem to do multiple increments on different shards at the same time. But this seems to become problematic as the number of counter increment transactions running in parallel grows, especially, I guess, when bandwidth is somewhat limited.

Thanks so much for asking and letting me think about it aloud here. Now I think I'm a bit clearer about what is going on. 😃

@sebelga
Owner

sebelga commented May 7, 2018

Thanks for moving the thread here. What tool/lib/service do you use for your stress tests? How can we benchmark when it starts to break?
cheers
