
feat: add additional labels #472

Open · pharindoko opened this issue Nov 22, 2023 · 24 comments

pharindoko (Contributor)

Hey @kichik,

We have an issue with runners at the org level (and I assume at the repo level as well).
When multiple jobs run in parallel, it can happen that the wrong job picks up a runner, while other jobs are left waiting for one.

I tried to add additional labels to the runner labels, but saw that this is blocked by the webhook script:

const provider = matchLabelsToProvider(payload.workflow_job.labels);

Feature request:
I would like the possibility to add additional labels to a workflow, e.g.
[self-hosted, large-runner, additional-label-for-workflow, additional-label-for-repo, additional-label]

This would mean the webhook only checks for the required labels [self-hosted, large-runner]; if those match, the job is forwarded.
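
Roughly what I have in mind (just a sketch with a hypothetical helper, not the current implementation): treat the provider labels as required labels and ignore any extra labels on the job.

// Hypothetical subset matching: a provider handles a job if all of the
// provider's labels appear in the job's labels; extra job labels are ignored.
function matchLabelsToProviderSubset(
  jobLabels: string[],
  providers: { provider: string; labels: string[] }[],
): string | undefined {
  const job = new Set(jobLabels.map((label) => label.toLowerCase()));
  return providers.find((p) =>
    p.labels.every((label) => job.has(label.toLowerCase())),
  )?.provider;
}

// A job with [self-hosted, large-runner, additional-label-for-repo] would still
// match a provider registered with [self-hosted, large-runner].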

kichik (Member) commented Nov 25, 2023

Can you please include the specific labels used? It will be easier to discuss with a concrete example.

pharindoko (Contributor, Author) commented Nov 28, 2023

As mentioned, this might occur regardless of whether the runners are registered at the org or repo level, but it happens more often at the org level because more jobs are executed there.

Background:

We now register the runners at the org level, which means they are visible to all repositories the GitHub app is configured for in the organization.

The runner has the labels [self-hosted, large-runner].
We have workflows in two different repositories A and B in the same org that use these labels.

We trigger workflow A in repository A using workflow dispatch.
Workflow A gets queued and stays queued...

In the meantime, another workflow B, which has already been waiting 15 hours for a runner with the labels [self-hosted, large-runner], takes over this runner.
There can be several reasons why this happens, e.g. the user forgot to enable the organization runner group setting Allow public repositories while having a public repository.

I can see this directly in the CloudWatch logs, because in this case the workflow job failed and returned Job FrontendDeployment completed with result: Failed.

Job FrontendDeployment only exists in repository B for workflow B.
The step function succeeded, the EC2 instance terminated...
Workflow A is still waiting for a runner...

kichik (Member) commented Nov 28, 2023

That would be one of the cons of organization-level registration:

https://github.com/CloudSnorkel/cdk-github-runners/blob/994f474deb38c904d94d2ad824b0493bac09d942/setup/src/App.svelte#L333C1-L334C49

You can bring up more runners to take over the stolen ones by creating a new step function execution. Open an existing one from the UI and hit the New Execution button.

I'm still not sure what you originally meant with the labels. Where would the label be added to resolve this?

pharindoko (Contributor, Author) commented Dec 5, 2023

I assume you can have this issue regardless of whether it's an org-level or a repo-level runner; the probability is just higher at the org level.
Furthermore, as a user you might not see all repos in the org, so it's quite hard to find jobs that got stuck and are still waiting for a runner when you have a lot of repos in one org.

Here's a sample solution proposal:

  1. The runner in the CDK solution is created with the labels [label-for-instance-type].
  2. The user creates a workflow with [self-hosted, label-for-instance-type, label-for-repository].
  3. The workflow gets triggered, and webhook.lambda.ts is invoked with the labels [self-hosted, label-for-instance-type, label-for-repository].
     matchLabelsToProvider only checks that the runner labels [self-hosted, label-for-instance-type] are included.
  4. The step function determines a runner based on the labels that match a runner: [self-hosted, label-for-instance-type].
  5. All labels [self-hosted, label-for-instance-type, label-for-repository] are injected into the runner and attached to the config.sh command (see the sketch below).
  6. The runner is registered in GitHub with the labels [self-hosted, label-for-instance-type, label-for-repository].

This would mean that runner labels and workflow labels can be different, and you can attach additional workflow labels.
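
A rough sketch of step 5, assuming the registration step receives the job's full label list (the helper and parameter names are made up; the flags are the standard GitHub runner config.sh flags):

// Hypothetical: build the config.sh arguments from the job's full label list
// instead of the provider's fixed labels, so extra workflow labels end up on
// the registered runner.
function buildRunnerConfigArgs(params: {
  registrationUrl: string;
  registrationToken: string;
  jobLabels: string[]; // e.g. ['self-hosted', 'label-for-instance-type', 'label-for-repository']
}): string[] {
  return [
    '--unattended',
    '--url', params.registrationUrl,
    '--token', params.registrationToken,
    '--labels', params.jobLabels.join(','),
    '--ephemeral',
  ];
}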

I'll try to create a PR and test this to see if it could work.
@kichik please let me know if you have concerns ...

kichik (Member) commented Dec 5, 2023

That won't prevent other repositories from using label-for-repository and still stealing the runner. Our label situation is already not the simplest to understand; I think this would make it even harder and still wouldn't solve runner stealing.

pharindoko (Contributor, Author) commented Dec 5, 2023

Well, it's not like people steal runners on purpose, and label-for-repository is just an example of a tag.

You could use multiple tags and choose whatever you want. But people could copy/paste those labels from a GitHub Action in another repository by mistake.
It would still heavily reduce the probability that one job steals another job's runner, and you can trace it in the logs as well.

pharindoko (Contributor, Author)

Btw, if you know a better solution for this dilemma, please let me know :)

kichik (Member) commented Dec 6, 2023

> But people could copy/paste those labels from a GitHub Action in another repository by mistake.

Exactly.

Have you considered creating multiple providers all using the same settings and image builder but with different labels? That feels like basically what you're asking for. Each runner will start with exactly the labels you want and for just the repositories you want.

> Btw, if you know a better solution for this dilemma, please let me know :)

Repository level registration is the only thing I found. I do wish GitHub would let us aim specific runners at specific jobs, but they have no such option.
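
Something along these lines (untested sketch; repository names and labels are placeholders):

import * as cdk from 'aws-cdk-lib';
import { CodeBuildRunnerProvider, GitHubRunners } from '@cloudsnorkel/cdk-github-runners';

const app = new cdk.App();
const stack = new cdk.Stack(app, 'github-runners');

// One shared image builder, one provider per "special" repository,
// each provider with its own label set.
const imageBuilder = CodeBuildRunnerProvider.imageBuilder(stack, 'image builder');

const providers = ['repo-a', 'repo-b'].map(
  (repo) =>
    new CodeBuildRunnerProvider(stack, `runner-${repo}`, {
      imageBuilder,
      labels: ['large-runner', `only-${repo}`],
    }),
);

new GitHubRunners(stack, 'runners', { providers });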

pharindoko (Contributor, Author) commented Dec 6, 2023

>> But people could copy/paste those labels from a GitHub Action in another repository by mistake.
>
> Exactly.
>
> Have you considered creating multiple providers all using the same settings and image builder but with different labels? That feels like basically what you're asking for. Each runner will start with exactly the labels you want and for just the repositories you want.

No chance. There are far too many repos; this is no fun even with more than 10 repos.

>> Btw, if you know a better solution for this dilemma, please let me know :)
>
> Repository level registration is the only thing I found. I do wish GitHub would let us aim specific runners at specific jobs, but they have no such option.

What I struggle with is the administration permission at the repo level. A dedicated permission to manage runners on the repository level is still missing.

Another option:

Add a schedule feature to start additional runners on a regular basis by

  • querying CloudWatch Logs to get the organizations that used the runner solution, e.g. in the last 2 weeks, and starting runners
  • querying GitHub to get all jobs that have been queued for longer than e.g. 6 hours, and starting runners (see the sketch below)

The idleTimeout setting will terminate those EC2 instances and deregister the runners by default if no job is picked up within 5 minutes, right?
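
For the second bullet, roughly something like this (untested; assumes an Octokit client already authenticated as the app installation, and the 6-hour threshold is just an example):

import { Octokit } from '@octokit/rest';

// Sketch: find workflow runs that have been sitting in the queue longer than a
// threshold. Job-level labels would still need actions.listJobsForWorkflowRun.
async function findStuckRuns(
  octokit: Octokit,
  owner: string,
  repos: string[],
  maxQueuedHours = 6,
): Promise<{ repo: string; runId: number }[]> {
  const cutoff = Date.now() - maxQueuedHours * 3600 * 1000;
  const stuck: { repo: string; runId: number }[] = [];

  for (const repo of repos) {
    const { data } = await octokit.rest.actions.listWorkflowRunsForRepo({
      owner,
      repo,
      status: 'queued',
      per_page: 100,
    });
    for (const run of data.workflow_runs) {
      if (new Date(run.created_at).getTime() < cutoff) {
        stuck.push({ repo, runId: run.id });
      }
    }
  }
  return stuck;
}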

@kichik please give me feedback
I could try to create a draft PR for this topic.

kichik (Member) commented Dec 6, 2023

> No chance. There are far too many repos; this is no fun even with more than 10 repos.

Wouldn't you have the same problem with your proposed solution? You'd have to specify all those labels somewhere. Maybe your CDK code could automatically generate them using gh?
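
For example, something like this at synth time (untested; assumes gh is installed and authenticated, and my-org is a placeholder):

import { execSync } from 'node:child_process';

// Sketch: list all repositories in the org with the gh CLI and derive one
// label set per repository, which could feed one provider per repository.
const repos = execSync("gh repo list my-org --limit 1000 --json name --jq '.[].name'")
  .toString()
  .trim()
  .split('\n');

const labelSets = repos.map((repo) => ['self-hosted', 'large-runner', `repo-${repo}`]);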

The schedule option is something I have considered often, for many reasons. I am still not sure exactly what I want it to look like or if it should even be a schedule and not some kind of monitoring system for stolen/missing runners.

You can already define a schedule manually, if you need to. I have done something similar for one of my installations.

import * as cdk from 'aws-cdk-lib';
import { aws_events as events, aws_events_targets as targets, aws_stepfunctions as stepfunctions } from 'aws-cdk-lib';

// Assumes `runners` is the GitHubRunners construct and `this` is the stack.
const sm = runners.node.findChild('Runner Orchestrator') as stepfunctions.StateMachine;
const rule = new events.Rule(this, 'Start Runner Rule', {
  schedule: events.Schedule.rate(cdk.Duration.hours(1)),
  targets: [
    new targets.SfnStateMachine(sm, {
      input: events.RuleTargetInput.fromObject({
        "owner": "my-org",
        "repo": "my-repo",
        "jobId": 0,
        "jobUrl": "manual-runner",
        "installationId": xxx, // TODO get from github secret
        "labels": {
          "self-hosted": true,
          "something": true, // TODO update
        },
        "provider": "stack-name/provider-id" // TODO update
      }),
    }),
  ],
});

pharindoko (Contributor, Author) commented Dec 6, 2023

>> No chance. There are far too many repos; this is no fun even with more than 10 repos.
>
> Wouldn't you have the same problem with your proposed solution? You'd have to specify all those labels somewhere. Maybe your CDK code could automatically generate them using gh?

It doesn't happen all the time, right? And for most users it's quite OK how it works, even with org runners.
But for some it's not acceptable when the runners are used for production deployments, and for those it would be fine to attach additional labels to make it more reliable.

> The schedule option is something I have considered often, for many reasons. I am still not sure exactly what I want it to look like or if it should even be a schedule and not some kind of monitoring system for stolen/missing runners.
>
> You can already define a schedule manually, if you need to. I have done something similar for one of my installations. [code example quoted above]

Hmm, that's a nice idea. But I can't do this statically in CDK code for all repos; I would get lost.

My plan:

  1. Create a schedule that triggers a Lambda.
  2. In the Lambda, query CloudWatch Logs Insights to get the owner, repo and labels aggregated over the last 2 weeks.
  3. Go through all rows found and trigger the state machine for each.

I'll try. I would still like to have this as part of the CDK construct :)
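
A rough sketch of the Lambda I have in mind (untested; the Insights query and field names are placeholders, and the state machine input mirrors your example above):

import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-cloudwatch-logs';
import { SFNClient, StartExecutionCommand } from '@aws-sdk/client-sfn';

const logs = new CloudWatchLogsClient({});
const sfn = new SFNClient({});

// Placeholder query: the real field names depend on what the state machine logs.
const QUERY = `
  fields detail.owner as owner, detail.repo as repo
  | filter ispresent(owner)
  | dedup owner, repo
`;

export async function handler(): Promise<void> {
  const now = Math.floor(Date.now() / 1000);
  const { queryId } = await logs.send(new StartQueryCommand({
    logGroupName: process.env.LOG_GROUP_NAME!,
    startTime: now - 14 * 24 * 3600, // last 2 weeks
    endTime: now,
    queryString: QUERY,
  }));

  // Poll until the Insights query finishes (simplified; no timeout handling).
  let results;
  do {
    await new Promise((resolve) => setTimeout(resolve, 2000));
    results = await logs.send(new GetQueryResultsCommand({ queryId }));
  } while (results.status === 'Running' || results.status === 'Scheduled');

  for (const row of results.results ?? []) {
    const field = (name: string) => row.find((column) => column.field === name)?.value;
    await sfn.send(new StartExecutionCommand({
      stateMachineArn: process.env.STATE_MACHINE_ARN!,
      input: JSON.stringify({
        owner: field('owner'),
        repo: field('repo'),
        jobId: 0,
        jobUrl: 'scheduled-runner',
        installationId: Number(process.env.INSTALLATION_ID), // placeholder
        labels: { 'self-hosted': true }, // placeholder, would come from the mapping table
        provider: process.env.PROVIDER!, // placeholder
      }),
    }));
  }
}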

kichik (Member) commented Dec 6, 2023

> It doesn't happen all the time, right? And for most users it's quite OK how it works, even with org runners.
> But for some it's not acceptable when the runners are used for production deployments, and for those it would be fine to attach additional labels to make it more reliable.

You can create a provider just for those special cases then. Isn't that basically what you were asking for? What am I missing?

> Hmm, that's a nice idea. But I can't do this statically in CDK code for all repos; I would get lost.

Since you're registering at the organization level, the repo part doesn't really matter; you can leave it as a placeholder. No need for a Lambda or anything.

pharindoko (Contributor, Author)

OK, I found a solution that I think is good enough to circumvent this problem:

[architecture diagram attached]

I added a construct with a step function and a few Lambdas, with the following workflow:

  1. On the left side, I get all the information about the transmitted jobs (repo, owner, job, labels) from the state machine logs.
  2. On the right side, I query GitHub in a Lambda, authenticated as the GitHub app, to get all the organization installations.
  3. Based on the "owner"/"organisation" property, I merge the information.
  4. I check the jobs and their current state (has the job been queued for longer than 10 minutes?).
  5. I trigger new step function executions to get new runners with the specific labels registered in the right organisation.

I trigger this step function every half hour.

What would be great to get as output from the GithubRunners construct:

  • the state machine ARN (used to trigger new step function executions)
  • the log group name (used for the CloudWatch Insights query)

Currently, this is what I need in order to use the construct:

const stateMachine = githubRunners.node.children.find(
  (child) => child instanceof stepfunctions.StateMachine,
)! as stepfunctions.StateMachine;
const logGroupName = `${this.stackName}-state-machine`;

new GithubScheduledRunnerHandler(this, 'github-scheduled-runner', {
  internalSubnetIds: props.internalSubnetIds,
  vpcId: props.vpcId,
  logGroupName,
  githubSecretId: githubRunners.secrets.github.secretName,
  githubPrivateKeySecretId: githubRunners.secrets.githubPrivateKey.secretName,
  mappingTable: providers.map((provider) => ({ provider: provider.toString(), labels: provider.labels })),
  stateMachineArn: stateMachine.stateMachineArn,
  githubRunnerWaitTimeInMinutes: 10,
});

pharindoko (Contributor, Author) commented Jan 4, 2024

Would you be interested in integrating this construct into your solution? Should I open a PR, or should I create a separate construct?
I could share the code in a simplified version (not a perfect construct) in a private repo if you want more details.
Please let me know, @kichik.

kichik (Member) commented Jan 6, 2024

I do want something like this one day, but I don't think we have all the details figured out yet. For example, when using ECS with a limited ASG, a job might be pending for >10 minutes just because the cluster is busy. If we keep submitting more jobs, it may only make the situation worse.

I do like that it doesn't go off just by the webhook. I have seen cases where the webhook was just not called and so runners were missing.

kichik (Member) commented Jan 6, 2024

Please do share the code, if you can. I think I may have misunderstood the matching you were doing between jobs and the existing state machine. It might already cover the problem I mentioned.

pharindoko (Contributor, Author)

@kichik I hope you have seen my invitation - I created a private repo for this.

kichik (Member) commented Jan 9, 2024

> @kichik I hope you have seen my invitation - I created a private repo for this.

I did. Thanks.

pharindoko (Contributor, Author)

After more than a week of running the scheduler step function, I found a clear correlation between the webhook Lambda function error message ExecutionAlreadyExists and new runners being triggered by my scheduler step function. It only seems to happen in this case.

Detailed error message in CloudWatch:

__type: com.amazonaws.swf.service.v2.model#ExecutionAlreadyExists
$fault: client
$metadata.attempts: 1
$metadata.httpStatusCode: 400
$metadata.requestId: f1cad492-8906-4967-9c99-f920a010bf90
$metadata.totalRetryDelay: 0
@ingestionTime: 1704816064769
@log: ********:/aws/lambda/gh-runner-ghrunnergithubrunnerWebhookHandlerwebhoo....
@logStream: 2024/01/09/[$LATEST]e18f1646a851483d93a0490663a2cde4
@message: 2024-01-09T16:00:56.047Z f8bf04d7-561b-4d7a-bbd0-b66f23121b91 ERROR Invoke Error {"errorType":"ExecutionAlreadyExists","errorMessage":"Execution Already Exists: 'arn:aws:states:********:********:execution:ghrunnergithubrunnerRunnerOrchestrator********:******'-44b33b20-af08-'","name":"ExecutionAlreadyExists","$fault":"client","$metadata":{"httpStatusCode":400,"requestId":"f1cad492-8906-4967-9c99-f920a010bf90","attempts":1,"totalRetryDelay":0},"__type":"com.amazonaws.swf.service.v2.model#ExecutionAlreadyExists","stack":["ExecutionAlreadyExists: Execution Already Exists: 'arn:aws:states:********:*******:execution:ghrunnergithubrunnerRunnerOrchestrator********:*****'"," at de_ExecutionAlreadyExistsRes (/var/runtime/node_modules/@aws-sdk/client-sfn/dist-cjs/protocols/Aws_json1_0.js:1694:23)"," at de_StartExecutionCommandError (/var/runtime/node_modules/@aws-sdk/client-sfn/dist-cjs/protocols/Aws_json1_0.js:1321:25)"," at process.processTicksAndRejections (node:internal/process/task_queues:95:5)"," at async /var/runtime/node_modules/@aws-sdk/middleware-serde/dist-cjs/deserializerMiddleware.js:7:24"," at async /var/runtime/node_modules/@aws-sdk/middleware-signing/dist-cjs/awsAuthMiddleware.js:14:20"," at async /var/runtime/node_modules/@aws-sdk/middleware-retry/dist-cjs/retryMiddleware.js:27:46"," at async /var/runtime/node_modules/@aws-sdk/middleware-logger/dist-cjs/loggerMiddleware.js:7:26"," at async Runtime.handler (/var/task/index.js:13997:21)"]}
@requestId: f8bf04d7-561b-4d7a-bbd0-b66f23121b91
@timestamp: 1704816056047

kichik (Member) commented Jan 13, 2024

The name uses the GitHub webhook delivery id to remain unique:

let executionName = `${payload.repository.full_name.replace('/', '-')}-${getHeader(event, 'x-github-delivery')}`.slice(0, 64);

Do you maybe have the same webhook hooked up more than once somehow?

pharindoko (Contributor, Author)

> The name uses the GitHub webhook delivery id to remain unique:
>
> let executionName = `${payload.repository.full_name.replace('/', '-')}-${getHeader(event, 'x-github-delivery')}`.slice(0, 64);
>
> Do you maybe have the same webhook hooked up more than once somehow?

OK, then I assume it might be the API Gateway/Lambda retry mechanism...
I will check it.

pharindoko (Contributor, Author)

I assume that this really only happens when the API Gateway (my internal custom LambdaAccess implementation) does a retry...

It seems to occur for about 1% of the runner requests, and it's always the same correlation:
the ExecutionAlreadyExists step function message <> a 502 Bad Gateway error <> a queued job waiting for a runner in GitHub

kichik (Member) commented Feb 7, 2024

Can you tell what the initial failure is? It makes sense for the second webhook retry to fail with ExecutionAlreadyExists and return a 502. But why would the first one fail and require a retry? Is it a timeout, maybe?

pharindoko (Contributor, Author)

You're right, I should investigate this :) Will do... a timeout could be an option, yes.
