InstallAppleCertificateV2 and InstallAppleProvisioningProfileV1 often fails to download secure files #13468

Closed
maxim-lobanov opened this issue Aug 26, 2020 · 8 comments

@maxim-lobanov

Required Information

Question, Bug, or Feature? Bug

Task Name: InstallAppleCertificateV2 and InstallAppleProvisioningProfileV1

Environment

  • Azure pipelines
  • Hosted MacOS images

Issue Description

These tasks download secure files from Azure DevOps to the VM during a build. We have noticed that they are unstable: they often fail randomly (see errors below) and pass on retry.
Context: We run builds that contain many jobs (about 32) to validate hosted image quality. The pass rate of this build is ~60%. In roughly every second build, at least one of the 32 jobs fails (about 2% of all jobs fail with this issue for us). This produces a lot of noise.

Task logs

InstallAppleCertificateV2:

##[error]Error: Invalid Resource

InstallAppleProvisioningProfileV1:

##[error]Error: connect ETIMEDOUT 13.107.42.18:443

Example of failed builds:

Possible solution

I propose implementing retry logic at the task level: if a download fails, wait a few seconds and try once more.
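A minimal sketch of what such task-level retry logic could look like. This is only an illustration of the proposal, not code from either task: `withRetry`, its parameters, and the default values are hypothetical names and numbers chosen for the example.

```typescript
// Hypothetical helper: runs an async operation up to maxAttempts times,
// waiting a fixed delay between failed attempts.
async function withRetry<T>(
    operation: () => Promise<T>,
    maxAttempts: number = 2,   // one initial try plus one retry (assumed default)
    delayMs: number = 3000     // "a few seconds" (assumed default)
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return await operation();
        } catch (err) {
            lastError = err;
            // Only wait if another attempt will follow.
            if (attempt < maxAttempts) {
                await new Promise<void>(resolve => setTimeout(resolve, delayMs));
            }
        }
    }
    // Every attempt failed: surface the last error to the caller.
    throw lastError;
}
```

A task could wrap its secure file download call in `withRetry(() => ...)`, so that a transient `ETIMEDOUT` or `Invalid Resource` error triggers a second attempt instead of failing the job.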

@alexander-prozorov
Contributor

Hello @maxim-lobanov,

It looks like we already have the retry logic at the azure-devops-node-api level. By default we have 5 retries already:

// Fall back to 5 retries when no usable retry count is supplied.
const maxRetries = retryCount && retryCount >= 0 ? retryCount : 5;
...
let options: IRequestOptions = {
    allowRetries: true,
    maxRetries
};
...
this.serverConnection = new WebApi(serverUrl, authHandler, options);

WebApi then passes this options object to the HttpClient.
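To see which inputs end up with the default of 5, the expression from the snippet above can be isolated in a small function. `resolveMaxRetries` is a hypothetical name used only for this sketch; the expression itself is copied from the quoted code.

```typescript
// Standalone copy of the default-selection expression quoted above.
function resolveMaxRetries(retryCount?: number): number {
    return retryCount && retryCount >= 0 ? retryCount : 5;
}

console.log(resolveMaxRetries());    // 5  (no value supplied)
console.log(resolveMaxRetries(10));  // 10 (explicit value is used)
console.log(resolveMaxRetries(0));   // 5  (0 is falsy, so the default applies)
console.log(resolveMaxRetries(-1));  // 5  (negative values are rejected)
```

One consequence of this form is that a caller cannot disable retries by passing 0 through this expression; whether that matters depends on how `retryCount` is supplied in practice.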

As far as I can see, this issue has not occurred since August 20th. Is it still relevant?

@alexander-prozorov
Contributor

Hi @maxim-lobanov,

Could you please confirm that we need the retry logic at the task level since we already have it in the node api?

@maxim-lobanov
Author

maxim-lobanov commented Sep 7, 2020

Unfortunately, this issue still reproduces sometimes: Example (happened yesterday)

Any ideas why retry logic doesn't help in this case? How can we check that retry logic is really used in that failed task?

@alexander-prozorov
Contributor

alexander-prozorov commented Sep 8, 2020

I wasn't able to reproduce this issue locally with 80 secure file downloads per job. However, logging on the local agent shows that HttpClient makes exactly 6 attempts (the 5 retries passed in, plus the initial request).
HttpClient has no logging by default, so I can only guess from the time the step takes. The failed step takes 3s; the succeeded ones take 1-2s. All 6 attempts together take about 1.3 seconds (https://github.com/microsoft/typed-rest-client/blob/master/lib/HttpClient.ts#L594). So it seems that all of them were executed.
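A rough way to reason about such timings is to add up the waits an exponential backoff inserts between attempts. The sketch below assumes the delay before retry n is `timeSliceMs * 2^min(ceiling, n)`; the formula, the function name `totalBackoffMs`, and the example values of 5 ms and a ceiling of 10 are assumptions for illustration and should be checked against the linked HttpClient source. It counts only the waiting time, not the duration of the requests themselves.

```typescript
// Hypothetical estimate of the total wait inserted between attempts.
// Assumption: delay before retry n is timeSliceMs * 2^min(ceiling, n).
function totalBackoffMs(maxRetries: number, timeSliceMs: number, ceiling: number): number {
    let total = 0;
    for (let retry = 1; retry <= maxRetries; retry++) {
        const exponent = Math.min(ceiling, retry);
        total += timeSliceMs * Math.pow(2, exponent);
    }
    return total;
}

// With 5 retries, a 5 ms slice and a ceiling of 10 (assumed values):
// 10 + 20 + 40 + 80 + 160 = 310 ms of waiting in total.
console.log(totalBackoffMs(5, 5, 10)); // 310
```

Under these assumptions the backoff adds well under a second across all retries, so a network outage lasting a few seconds could outlast every attempt.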

@alexander-prozorov
Contributor

@maxim-lobanov

I found two more tickets related to this issue:

It seems that increasing the number of retries sometimes helps with this problem. In any case, it only affects the Microsoft-hosted machines; no one has been able to reproduce the issue on a local agent (see my message above).

What do you think: does it make sense to also implement this logic at the task level, or is it better to increase the maxRetries value in the securefiles-common module?

@maxim-lobanov
Author

@alexander-prozorov, I suggest starting with increasing the number of retries / socket timeout, because it is the simpler solution.
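A sketch of what raising these two settings might look like when the request options are built. Only `allowRetries` and `maxRetries` appear in this thread; the `socketTimeout` field name, the `RequestOptionsSketch` type, and `buildRequestOptions` are assumptions made for this example and are not taken from the actual change.

```typescript
// Local stand-in for the request options type used in the thread.
// socketTimeout is an assumed field name, in milliseconds.
interface RequestOptionsSketch {
    allowRetries: boolean;
    maxRetries: number;
    socketTimeout?: number;
}

// Hypothetical builder: callers may raise the retry count and socket timeout.
function buildRequestOptions(retryCount?: number, socketTimeoutMs?: number): RequestOptionsSketch {
    const maxRetries = retryCount !== undefined && retryCount >= 0 ? retryCount : 5;
    const options: RequestOptionsSketch = { allowRetries: true, maxRetries };
    if (socketTimeoutMs !== undefined && socketTimeoutMs > 0) {
        options.socketTimeout = socketTimeoutMs;
    }
    return options;
}

// Example values are illustrative only.
const options = buildRequestOptions(8, 20000);
console.log(options.maxRetries, options.socketTimeout);
```

Keeping both values as parameters would let individual tasks tune them without touching the shared module again.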

@alexander-prozorov
Contributor

Created PR: #13534

@alexander-prozorov added the "awaiting deployment" label Sep 14, 2020
@anatolybolshakov removed the "awaiting deployment" label Oct 12, 2020
@anatolybolshakov
Contributor

Hi @maxim-lobanov 😃 The related changes have been rolled out, so I'm closing this for now. Please feel free to reopen it if the issue appears again.
