Fluentbit pods logs get STS assume role request failed - could not sign request with sigv4 #480

Open
bgarcial opened this issue Mar 27, 2024 · 5 comments

Comments

@bgarcial

Hi dear community,

I've installed the Fluent Bit helm chart on K8s (AWS EKS), and I am using IAM roles for service accounts (this way) to send logs to the AWS OpenSearch Service.
When I tell the Fluent Bit deployment to use the service account that maps to my AWS role, it seems to look for a /var/run/secrets/eks.amazonaws.com/serviceaccount/aws-iam-token file to get the token:

image

but the default and mounted path is /var/run/secrets/eks.amazonaws.com/serviceaccount/token:

image

Then it cannot fetch the credentials to assume the role.
Somehow, when the role and the service account are created, the injected env variable is AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token, but the pod looks for /var/run/secrets/eks.amazonaws.com/serviceaccount/aws-iam-token.

As a result, I get this error in the Fluent Bit pod logs:

[2024/03/27 10:46:29] [error] [aws_credentials] STS assume role request failed
[2024/03/27 10:46:29] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2024/03/27 10:46:29] [error] [signv4] Provider returned no credentials, service=es
[2024/03/27 10:46:29] [error] [output:opensearch:opensearch.0] could not sign request with sigv4
[2024/03/27 10:46:29] [ warn] [engine] chunk '1-1711536378.324856511.flb' cannot be retried: task_id=22, input=tail.0 > output=opensearch.0
[2024/03/27 10:46:29] [ info] [input] tail.0 resume (mem buf overlimit)
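
To make the order of failures in the excerpt above easier to see, here is a small Python sketch that groups the error lines by component. It is not part of Fluent Bit: `parse_line` and `errors_by_component` are hypothetical helper names, and the line layout (`[timestamp] [level] [component] message`) is only assumed from the lines quoted above.

```python
import re

# Layout assumed from the excerpt: [timestamp] [level] [component] message
# The level may be left-padded with a space, as in "[ warn]".
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[\s*(?P<level>\w+)\] "
    r"(?:\[(?P<component>[^\]]+)\] )?(?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def errors_by_component(lines):
    """Map each component to its error messages, in order of first appearance."""
    result = {}
    for line in lines:
        record = parse_line(line)
        if record and record["level"] == "error":
            result.setdefault(record["component"], []).append(record["message"])
    return result
```

Run on the excerpt, the errors come from aws_credentials first, then signv4, then output:opensearch:opensearch.0, which suggests the signing error is a consequence of the failed STS request rather than a separate problem.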

I understand this is a known issue, but after checking, it is not clear how it can be solved:

Yeah, basically it's because the config map sets "" as the default for aws_sts_endpoint instead of NULL. This leads the code to incorrectly think that there is an custom STS endpoint, and then Fluent Bit tries to make a request to "".
https://github.com/fluent/fluent-bit/blob/master/plugins/out_es/es.c#L804

But that issue, where the Fluent Bit code receives "", is supposed to be fixed now (I am using v2.2.2).
It also suggests setting the AWS_STS_Endpoint parameter as a workaround, but that did not work for me, nor for some other people.

  • Here a person says there is a misconfiguration at the chart level, but I am not sure that applies to me: I am not deploying from a custom helm chart but with helm upgrade --install fluent-bit fluent/fluent-bit --namespace fluent-bit, so I take the default values and tune some parameters for the serviceaccount and the configmap to be able to send logs to AWS OpenSearch.

  • And here they ask for the necessary write permissions for the IAM role, but that is not my case: I don't get permission errors, because the role is not assumed yet.

Just for the record, this is my opensearch output plugin configuration:

   [OUTPUT]
        Name opensearch
        Match host.*
        Host vpc-xxxxr-eks-logs-test-qxxxxojgfi4d7fuoshm5e.eu-west-1.es.amazonaws.com
        Port 443
        AWS_Role_ARN arn:aws:iam::xxx:role/fluentbit-to-ope-xxxx-test-fluentbit-serviceaccount
        Logstash_Format On
        Logstash_Prefix node-logs
        Retry_Limit False
        AWS_Auth On
        AWS_Region eu-west-1
        tls On
        Trace_Output On
        Trace_Error On
        AWS_STS_Endpoint https://sts.eu-west-1.amazonaws.com

That IAM role (which is mapped from a K8s service account) has this policy attached:

{
    "Statement": [
        {
            "Action": [
                "es:ESHttpPut",
                "es:ESHttpPost",
                "es:ESHttpGet",
                "es:ESHttpDelete"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": "es:*",
            "Effect": "Allow",
            "Resource": "arn:aws:es:eu-west-1:xxxxxx:domain/xxr-eks-logs-test"
        }
    ],
    "Version": "2012-10-17"
}

I think the problem is what I mentioned at the beginning: the pod looks for the /var/run/secrets/eks.amazonaws.com/serviceaccount/aws-iam-token file, but the injected env variable is AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token, and that is why it does not find the token to assume the role.
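
One way to test this hypothesis is to check, from inside the pod (or a debug container sharing its environment), whether the injected variables point at a token file that actually exists. The sketch below is only a diagnostic idea: `check_irsa_env` is a hypothetical helper name, and it assumes a Python interpreter is available where it runs, which may not be true of the Fluent Bit image itself. AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are the variables that EKS injects for IRSA.

```python
import os
from pathlib import Path


def check_irsa_env(env=None):
    """Return a list of problems with the IRSA variables (empty list means OK)."""
    env = os.environ if env is None else env
    role_arn = env.get("AWS_ROLE_ARN")
    token_path = env.get("AWS_WEB_IDENTITY_TOKEN_FILE")

    problems = []
    if not role_arn:
        problems.append("AWS_ROLE_ARN is not set")
    if not token_path:
        problems.append("AWS_WEB_IDENTITY_TOKEN_FILE is not set")
    else:
        path = Path(token_path)
        if not path.is_file():
            problems.append(f"token file not found: {token_path}")
        elif path.stat().st_size == 0:
            problems.append(f"token file is empty: {token_path}")
    return problems


if __name__ == "__main__":
    found = check_irsa_env()
    print("OK" if not found else "\n".join(found))
```

If this reports no problems, the token path mismatch is probably not the cause, and the failing STS request is worth investigating from the output plugin configuration instead.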

How can I change the Fluent Bit configuration from the helm chart via parameters? It seems the injected service account token path is the default managed by AWS EKS itself, but the pod deployed from the helm chart looks for a slightly different path.

I would appreciate it if someone could point me in a good direction 🙂

@iamwep

iamwep commented Mar 27, 2024

I'm running into exactly the same issue. Have you found any workaround/solution yet?

@bgarcial
Author

bgarcial commented Mar 28, 2024

@iamwep not yet. I have been reading several issues here and on the 'aws-for-fluent-bit' side, and there is no clarity about what could be happening. What I described here is what I think is happening with the volume mount of the service account token (when working with IRSA), but here they say this could also be caused by too many requests to the 'sts' endpoint, with Amazon throttling the requests. I am really not sure about that, as I am testing this in a K8s test environment where there is almost no outbound traffic.

@Wyifei

Wyifei commented Apr 18, 2024

I'm facing the same issue: I want to connect to AWS Kinesis in another account, and assume role doesn't work, with the error message below:
[2024/04/18 02:30:32] [error] [aws_credentials] STS assume role request failed
[2024/04/18 02:30:32] [ warn] [aws_credentials] No cached credentials are available and a credential refresh is already in progress. The current co-routine will retry.
[2024/04/18 02:30:32] [error] [signv4] Provider returned no credentials, service=kinesis
[2024/04/18 02:30:32] [error] [aws_client] could not sign request

helm chart:
[OUTPUT]
Name kinesis_streams
Match *
stream test
region eu-central-1
role_arn arn:aws:iam::1234567890:role/kiness

@bgarcial
Author

@Wyifei @iamwep
I solved this issue.
The IAM role for the service account should be provided only in the serviceAccount.annotations field of the helm chart. That means only there, and not in the OpenSearch Fluent Bit output plugin (via AWS_Role_ARN).
I was providing it in both places, and that's why the STS request was failing. The output plugin doesn't require it, as it reaches the node group that backs the EKS cluster through the node group's IAM role for collecting, and the service account already has the desired permissions.
Let me know if that could be your case too. :)
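
To check a config for the same duplication, a rough Python sketch like the one below can list the [OUTPUT] sections that still set a role ARN. It is a simplified reader for the classic-mode format shown in this thread, not Fluent Bit's own parser: `outputs_with_role_arn` and `ROLE_KEYS` are hypothetical names, it assumes keys are matched case-insensitively, and it ignores includes and other config features.

```python
import re

# Role keys as they appear in this thread's configs (compared in lowercase).
ROLE_KEYS = ("aws_role_arn", "role_arn")


def outputs_with_role_arn(config_text):
    """Return (plugin name, role ARN) for each [OUTPUT] section that sets a role ARN."""
    sections = []
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        header = re.fullmatch(r"\[(\w+)\]", line)
        if header:
            sections.append({"_section": header.group(1).upper()})
            continue
        parts = line.split(None, 1)  # "Key Value" pairs separated by whitespace
        if sections and len(parts) == 2:
            sections[-1][parts[0].lower()] = parts[1].strip()

    findings = []
    for section in sections:
        if section["_section"] != "OUTPUT":
            continue
        for key in ROLE_KEYS:
            if key in section:
                findings.append((section.get("name"), section[key]))
    return findings
```

A hit is only a hint, not necessarily an error: when the IRSA role is already set in serviceAccount.annotations, a matching AWS_Role_ARN on the output was what broke my setup, but a cross-account role like the one in the Kinesis config above may be intentional.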

@Wyifei

Wyifei commented Apr 18, 2024

@bgarcial
I solved the issue by upgrading the helm chart from 0.21.1 to 0.46.1.
