[Inernal] Ignore: Some thoughts on hedging #4471

Draft
wants to merge 114 commits into
base: master
from

kirankumarkolli
Member

Pull Request Template

Description

Please include a summary of the change and which issue is fixed. Include samples if adding new API, and include relevant motivation and context. List any dependencies that are required for this change.

Type of change

Please delete options that are not relevant.

  • [] Bug fix (non-breaking change which fixes an issue)
  • [] New feature (non-breaking change which adds functionality)
  • [] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [] This change requires a documentation update

Closing issues

To automatically close an issue: closes #IssueNumber

NaluTripician and others added 30 commits October 26, 2023 15:05
Co-authored-by: Matias Quaranta <ealsur@users.noreply.github.com>

@github-actions github-actions bot left a comment

Please follow the required format: "[Internal] Category: (Adds|Fixes|Refactors|Removes) Description"

Internal should be used for PRs that have no customer impact. This flag is used to help generate the changelog to know which PRs should be included. Examples:
Diagnostics: Adds GetElapsedClientLatency to CosmosDiagnostics
PartitionKey: Fixes null reference when using default(PartitionKey)
[v4] Client Encryption: Refactors code to external project
[Internal] Query: Adds code generator for CosmosNumbers for easy additions in the future.

using (CancellationTokenSource cancellationTokenSource = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken))
{
// Get effective order of regions to route to (static once populated)
IReadOnlyCollection<Uri> availableRegions = client.DocumentClient.GlobalEndpointManager.GetApplicableEndpoints(request.RequestOptions.ExcludeRegions, isReadRequest: true);
Contributor

I think this is better than getting the currently available regions, because it covers the scenario where an offline region becomes available again. The caveat is that a request will still be sent to a region that is unavailable, but since we also hedge to other regions, this is not much of a problem.

Member Author

There are two side effects:

  • A region that becomes available in the future might be excluded
  • Possibly higher latency if a region that later becomes unavailable was included

Both issues were present even with the earlier model (maybe the first full request covers it?)

//Send out hedged requests
for (int requestNumber = 0; requestNumber < availableRegions.Count; requestNumber++)
{
TimeSpan awaitTime = this.Threshold + TimeSpan.FromMilliseconds(requestNumber * this.ThresholdStep.Milliseconds);
Contributor

I do not think this gives the right time spans except for the first await.
Request 0: await threshold before sending the next request -- correct
Request 1: await threshold + step before sending the next request -- waits too long. The threshold has already elapsed by this point, so it should only wait for the threshold step.

Does this make sense? The time should really be: TimeSpan awaitTime = requestNumber == 0 ? this.Threshold : this.ThresholdStep;

This is because the WhenAny call is awaited and completes when the Task.Delay is done (or when a request completes).
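
To make the difference concrete, here is a minimal sketch that computes when each hedged request would be sent under both formulas. It assumes that no request completes early, so every delay runs to completion, and that each delay starts when the previous request is sent. `HedgeSchedule`, `SendOffsets`, and the 500 ms / 100 ms values are hypothetical and only for illustration. The sketch uses `TotalMilliseconds` because `TimeSpan.Milliseconds` returns only the millisecond component (0 to 999), which would truncate a step of one second or more.

```csharp
using System;
using System.Collections.Generic;

public static class HedgeSchedule
{
    // Returns the offset (from the first send) at which each request is sent,
    // assuming each delay starts when the previous request goes out and
    // always runs to completion (no request finishes early).
    public static List<TimeSpan> SendOffsets(int requestCount, Func<int, TimeSpan> awaitTime)
    {
        List<TimeSpan> offsets = new List<TimeSpan>();
        TimeSpan elapsed = TimeSpan.Zero;
        for (int requestNumber = 0; requestNumber < requestCount; requestNumber++)
        {
            offsets.Add(elapsed);
            elapsed += awaitTime(requestNumber);
        }

        return offsets;
    }

    public static void Main()
    {
        // Hypothetical values chosen for illustration only.
        TimeSpan threshold = TimeSpan.FromMilliseconds(500);
        TimeSpan step = TimeSpan.FromMilliseconds(100);

        // Formula in the diff: the wait grows each iteration, so waits accumulate.
        List<TimeSpan> current = SendOffsets(
            4, n => threshold + TimeSpan.FromMilliseconds(n * step.TotalMilliseconds));

        // Suggested formula: threshold once, then only the step.
        List<TimeSpan> proposed = SendOffsets(
            4, n => n == 0 ? threshold : step);

        for (int i = 0; i < current.Count; i++)
        {
            Console.WriteLine(
                $"request {i}: current={current[i].TotalMilliseconds} ms, proposed={proposed[i].TotalMilliseconds} ms");
        }
    }
}
```

Under these assumptions the formula in the diff sends requests at 0, 500, 1100, and 1800 ms, while the suggested one sends them at 0, 500, 600, and 700 ms.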

Member Author

I just used the existing logic from your PR. I have the same question, so feel free to update as needed.

{
clonedRequest.RequestOptions ??= new RequestOptions();

clonedRequest.RequestOptions.ExcludeRegions = null;
Contributor

Do we want to ignore the original exclude regions list here if it is provided? Also, by routing with the location endpoint rather than exclude regions, we would be allowing cross-regional retries on the hedged requests. Is this something we want to do? This behavior is different from Java.

Member Author

Exclude regions are already considered when the initial list is populated.

Here we are targeting a single region using LocationEndpointToRoute.

requestTasks.Remove(completedTask);

(bool isNonTransient, responseMessage) = await (Task<(bool, ResponseMessage)>)completedTask;
if (isNonTransient)
Contributor

Here I think we can do if(isNonTransient || requestTasks.Count == 1)

TimeSpan awaitTime = this.Threshold + TimeSpan.FromMilliseconds(requestNumber * this.ThresholdStep.Milliseconds);
Task thresholdDelayTask = Task.Delay(awaitTime, cancellationToken);

using (RequestMessage clonedRequest = (requestNumber == 0) ? request : request.Clone(request.Trace.Parent))
Contributor

I think creation of the cloned message has to move into a separate helper method; otherwise the message will be disposed when execution moves outside the for loop.
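
A minimal sketch of the lifetime problem, using a hypothetical `FakeRequest` in place of `RequestMessage` and a hypothetical `SendAsync` in place of the real send path. It assumes the send is still in flight when the `using` scope ends, which is what would happen once the loop moves on to the next hedged request. The helper variant awaits the send inside the `using`, so the clone lives as long as its request does.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for RequestMessage; it only tracks disposal.
public sealed class FakeRequest : IDisposable
{
    public bool IsDisposed { get; private set; }

    public void Dispose() => this.IsDisposed = true;
}

public static class CloneLifetime
{
    // Simulates a send that is still in flight until the gate completes.
    // Returns true if the request was disposed while the send was pending.
    private static async Task<bool> SendAsync(FakeRequest request, Task gate)
    {
        await gate;
        return request.IsDisposed;
    }

    // Pattern under review: the using scope ends while the task is pending.
    public static Task<bool> StartInsideUsing(Task gate)
    {
        using (FakeRequest request = new FakeRequest())
        {
            return SendAsync(request, gate);
        }
    }

    // Helper that owns the clone for the full duration of the send.
    public static async Task<bool> StartInHelperAsync(Task gate)
    {
        using (FakeRequest request = new FakeRequest())
        {
            return await SendAsync(request, gate);
        }
    }

    public static async Task Main()
    {
        TaskCompletionSource<bool> gate = new TaskCompletionSource<bool>();

        Task<bool> disposedEarly = StartInsideUsing(gate.Task);
        Task<bool> ownedByHelper = StartInHelperAsync(gate.Task);

        gate.SetResult(true); // let both simulated sends finish

        Console.WriteLine(await disposedEarly);  // True: disposed mid-flight
        Console.WriteLine(await ownedByHelper);  // False: still alive during the send
    }
}
```

Whether the real hedging loop should use an async helper like this or track the clones and dispose them after the loop is a design choice for the PR; the sketch only shows why the `using` inside the loop is unsafe.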

Member Author

Great point. This needs attention

2 participants