Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Cloud SQL IAM service account authentication failed for user ..." intermittent errors when connecting to Postgres #2212

Closed
otto-nordander-yubico opened this issue May 13, 2024 · 17 comments
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@otto-nordander-yubico
Copy link

otto-nordander-yubico commented May 13, 2024

Bug Description

Running cloud-sql-proxy sometimes it gets stuck in a state where we cant connect, returning Cloud SQL IAM service account authentication failed for user errors errors.

When this happens the proxy container must restart before it works again, i.e. retrying doesn't help.
It seems to happen if we connect multiple times in a row. E.g. start proxy, connect successfully, disconnect, connect again and it hangs but I'm not sure.

I've managed to reproduce it locally with the same error.

Example code (or command)

I've managed to reproduce with the following (note it's rather tricky to reproduce, and might require multiple tries):

cloud-sql-proxy --port 5432 --auto-iam-authn --impersonate-service-account=<SA> <postgres instance>

# In a different terminal (password prompt doesn't matter)
psql --host 127.0.0.1 --port 5432 --username=<USER>

<Ctrl+D>
# Connect again
psql --host 127.0.0.1 --port 5432 --username=<USER>

> psql: error: connection to server at "127.0.0.1", port 5432 failed: FATAL:  Cloud SQL IAM service account authentication failed for user ...

Stacktrace

psql: error: connection to server at "127.0.0.1", port 5432 failed: FATAL:  Cloud SQL IAM service account authentication failed for user ...

Steps to reproduce?

See example code.

Environment

  1. MacOS/Linux
  2. Cloud SQL Proxy version: v2.9.0 and v2.11.0
  3. Proxy invocation command: cloud-sql-proxy --port 5432 --auto-iam-authn --impersonate-service-account=<SA> <postgres instance>

Additional Details

I swear I saw another issue on this that was resolved in this repo, but I can't find it anylonger.

@otto-nordander-yubico otto-nordander-yubico added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label May 13, 2024
@jackwotherspoon
Copy link
Collaborator

Hi @otto-nordander-yubico thanks for raising an issue on the Cloud SQL Proxy! 😄

I will happily take a look into this, a couple clarifying questions:

  1. How often are you seeing this issue?
  2. When you disconnect and connect again are you doing so immediately after one another?
  3. Are you only seeing this issue with service account impersonation? (if you haven't tried without that is okay)

I'll look at re-producing this on my end.

@jackwotherspoon jackwotherspoon added the priority: p2 Moderately-important priority. Fix may not be included in next release. label May 13, 2024
@otto-nordander-yubico
Copy link
Author

otto-nordander-yubico commented May 14, 2024

Hi @otto-nordander-yubico thanks for raising an issue on the Cloud SQL Proxy! 😄

I will happily take a look into this, a couple clarifying questions:

  1. How often are you seeing this issue?
  2. When you disconnect and connect again are you doing so immediately after one another?
  3. Are you only seeing this issue with service account impersonation? (if you haven't tried without that is okay)

I'll look at re-producing this on my end.

  1. Rough estimate would be 1/30, but locally I can reproduce a lot more often than that. I think it's timing based which is why we see it "randomly".
  2. Yes, but I'm not sure if the disconnect is important, I think it's that we open 2 connections, but I can try to pin this down.
  3. No, we're seeing it with both.

Let me know if there's anything else I can help with.

@enocom
Copy link
Member

enocom commented May 14, 2024

One thing you could try is enable debug logging which would help us understand if the background refresh is behaving as expected. It's possible the token we're sending to the server is expired somehow. The other possibility is that the backend IAM check is failing for some other reason.

@otto-nordander-yubico
Copy link
Author

otto-nordander-yubico commented May 14, 2024

One thing you could try is enable debug logging which would help us understand if the background refresh is behaving as expected. It's possible the token we're sending to the server is expired somehow. The other possibility is that the backend IAM check is failing for some other reason.

Yes that's a good point, here I ran with --debug, --debug-logs.
This is when running locally:

cloud-sql-proxy version 2.11.0+darwin.arm64
2024/05/14 20:51:13 Impersonating service account with Application Default Credentials
2024/05/14 20:51:13 [xxxxx] Connection info added to cache
2024/05/14 20:51:13 [xxxxx] Listening on 127.0.0.1:5432
2024/05/14 20:51:13 The proxy has started successfully and is ready for new connections!
2024/05/14 20:51:13 [xxxxx] Connection info refresh operation started
2024/05/14 20:51:13 Enabling pprof endpoints at localhost:9091
2024/05/14 20:51:14 [xxxxx] Accepted connection from 127.0.0.1:55371 <--- me connecting is successfully
2024/05/14 20:51:14 [xxxxx] Connection info refresh operation started
2024/05/14 20:51:15 [xxxxx] Connection info refresh operation complete
2024/05/14 20:51:15 [xxxxx] Current certificate expiration = 2024-05-14T19:51:14Z
2024/05/14 20:51:15 [xxxxx] Connection info refresh operation scheduled at 2024-05-14T19:47:14Z (now + 56m0s)
2024/05/14 20:51:15 [xxxxx] Now = 2024-05-14T18:51:15Z, Current cert expiration = 2024-05-14T19:51:14Z
2024/05/14 20:51:15 [xxxxx] Cert is valid = true
2024/05/14 20:51:15 [xxxxx] Dialing x.x.x.x:3307
2024/05/14 20:51:16 [xxxxx] Connection info refresh operation complete
2024/05/14 20:51:16 [xxxxx] Current certificate expiration = 2024-05-14T19:51:15Z
2024/05/14 20:51:16 [xxxxx] Connection info refresh operation scheduled at 2024-05-14T19:47:15Z (now + 56m0s)
2024/05/14 20:51:16 [xxxxx] client closed the connection <---- me exiting the first connection
2024/05/14 20:51:17 [xxxxx] Accepted connection from 127.0.0.1:55387  <---- me opening new connection
2024/05/14 20:51:17 [xxxxx] Now = 2024-05-14T18:51:17Z, Current cert expiration = 2024-05-14T19:51:15Z
2024/05/14 20:51:17 [xxxxx] Cert is valid = true
2024/05/14 20:51:17 [xxxxx] Dialing x.x.x.x:3307
2024/05/14 20:51:18 [xxxxx] client closed the connection <---- this is because psql failed to connect

After the proxy has reached this broken state, every connection try logs the following (must restart proxy for it to work again):

2024/05/14 20:58:07 [xxxxx] Accepted connection from 127.0.0.1:57119
2024/05/14 20:58:07 [xxxxx] Now = 2024-05-14T18:58:07Z, Current cert expiration = 2024-05-14T19:54:49Z
2024/05/14 20:58:07 [xxxxx] Cert is valid = true
2024/05/14 20:58:07 [xxxxx] Dialing x.x.x.x:3307
2024/05/14 20:58:08 [xxxxx] client closed the connection

Edit: perhaps some helping information, when this happens I can see that the server asks the client for the password and its sent. When it's working as expected the password is never emitted.

@enocom
Copy link
Member

enocom commented May 14, 2024

when this happens I can see that the server asks the client for the password and its sent.

That's interesting. The way Auto IAM AuthN works is that the Proxy transmits the OAuth2 token to the server through the client certificate. So Postgres shouldn't be asking for a password. That fact that it is implies the token isn't properly being received by the server.

@otto-nordander-yubico
Copy link
Author

Perhaps another piece of the puzzle:
I ran with MySQL and tried to reproduce locally with the same connect / disconnect / connect flow.
And after a few tries it also gets stuck in a bad state with the following error:

ERROR 2059 (HY000): Authentication plugin 'mysql_clear_password' cannot be loaded: plugin not enabled

@enocom enocom assigned enocom and unassigned jackwotherspoon May 15, 2024
@otto-nordander-yubico
Copy link
Author

otto-nordander-yubico commented May 17, 2024

Okay, I think I've managed to pin it down. It's within cloudsqlconn

What I see is that there's a race condition when starting the proxy and initiating the first connection.
When we start the proxy a refresh operation is scheduled instantly here https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/blob/main/internal/cloudsql/instance.go#L362
When we connect it also schedules an update of the refresh, and sometimes that runs before the initial refresh is complete.
So what happens is that the update refresh succeeds and gets a valid certificate back, then the initial refresh runs and overwrites the cert.

The cancel() here fails, and returns false, I believe since the initial refresh started immediately.

Edit: with that in mind it has nothing to do with having two connections, its simply that the first connection gets a valid cert, and then the initial refresh overwrites the cert so next connection fails.

@enocom
Copy link
Member

enocom commented May 17, 2024

Nice find. I suspect this is due to GoogleCloudPlatform/cloud-sql-go-connector#771.

I'll get this fixed. Thanks for your careful debugging.

@otto-nordander-yubico
Copy link
Author

otto-nordander-yubico commented May 20, 2024

Yes, that would explain why the certs were different in the first place.
But also, from this comment it gives the impression that it should handle multiple requests.

Anyway, let me know if you have a patch that you want help testing.

Edit: I'm currently testing --lazy-refresh as a workaround and that seems to work fine so far.

@enocom
Copy link
Member

enocom commented May 20, 2024

This also manifests as #2224.

This is definitely caused by GoogleCloudPlatform/cloud-sql-go-connector#771. I'll have a fix ready to try today or tomorrow.

@enocom
Copy link
Member

enocom commented May 20, 2024

PR to fix this in cloudsqlconn is here: GoogleCloudPlatform/cloud-sql-go-connector#806. We'll get that out in a release soon.

@enocom
Copy link
Member

enocom commented May 22, 2024

The latest release of the Go Connector has a fix: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/releases/tag/v1.10.1.

@otto-nordander-yubico
Copy link
Author

The latest release of the Go Connector has a fix: https://github.com/GoogleCloudPlatform/cloud-sql-go-connector/releases/tag/v1.10.1.

Nice, I can pull that in and run my tests to verify tomorrow!

@otto-nordander-yubico
Copy link
Author

otto-nordander-yubico commented May 23, 2024

Tested with v1.10.1 and wasn't able to reproduce the issue! Nicely done :)
And thanks for the quick fix!

@jackwotherspoon
Copy link
Collaborator

@otto-nordander-yubico Glad your testing went successfully 😄 Thanks so much for raising this issue, we greatly appreciate it!

I will close this out, if the issue re-surfaces feel free to re-open this.

👏 👏 @enocom

@jackwotherspoon
Copy link
Collaborator

Actually I will wait till we release this officially to close out

@jackwotherspoon
Copy link
Collaborator

Release of version v2.11.3 has the new Go Connector version used and should now resolve this issue 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

3 participants