Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Connecting to cloud-sql using private-ip sometimes fails with a TLS handshake timeout #2208

Open
akshetpandey opened this issue May 1, 2024 · 7 comments
Assignees
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@akshetpandey
Copy link

akshetpandey commented May 1, 2024

Bug Description

I am running v2.9.0/cloud-sql-proxy.linux.amd64 on a GCE n1-highmem-4 instance that is started through dataflow. The binary runs inside a u22 base image container.

As part of the container entry point, I run the following script:

CLOUD_SQL_PROXY_INSTANCES="<db-instance-name>"

/bin/cloud_sql_proxy --private-ip -u /cloudsql ${CLOUD_SQL_PROXY_INSTANCES} &

# Wait for all instances
for instance in ${CLOUD_SQL_PROXY_INSTANCES//,/ } ; do
    while ! pg_isready -h "/cloudsql/${instance}" ; do
        echo "Waiting for instance ${instance} to be online"
        sleep 1
    done
done

Occasionally, I will get the following error:

2024/04/21 01:34:43 [$DB-STRING] could not resolve instance version: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

2024/04/21 01:34:43 The proxy has encountered a terminal error: unable to start: [$DB-STRING] Unable to mount socket: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

And then the script gets stuck in an infinite loop, because cloud-sql-proxy quits instead of trying to connect again on the next attempt. Some of it is my fault, I should be using a process manager, but the timeout is unexpected.

The gce instance should not be throttling, and it runs in the same region as the cloud-sql instance. I do not know how to check if the other side of the auth-proxy is having issues. A lot of other instances are also connecting to is (mostly GAE), and I also see these similar issues on them occasionally.

PS: This is a new issue as a follow up for this comment I posted in a different issue: #2081 (comment)

Steps to reproduce?

No easy reproduction steps, since this happens occasionally (~a few times a week).

Environment

  1. OS type and version: Ubuntu 22
  2. Cloud SQL Proxy version (./cloud-sql-proxy --version): v2.9.0
  3. Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME): /bin/cloud_sql_proxy --private-ip -u /cloudsql $INSTANCE &
@akshetpandey akshetpandey added the type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. label May 1, 2024
@enocom enocom added type: question Request for information or clarification. and removed type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels May 3, 2024
@enocom
Copy link
Member

enocom commented May 3, 2024

Thanks @akshetpandey.

The root problem seems to be this:

Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

The SQL Admin API call isn't responding and the Proxy dies. We recently added retry support for 50x responses here: GoogleCloudPlatform/cloud-sql-go-connector#781. I wonder if we should extend that to include more generic TLS errors.

@enocom enocom added priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. and removed type: question Request for information or clarification. labels May 3, 2024
@akshetpandey
Copy link
Author

akshetpandey commented May 3, 2024

I ended up changing my script to check if the pid is still alive and to restart it if it is not. An internal retry will definitely address the issue too.

I do want to add that something seems fishy here. Not sure if its the container, dataflow, sql admin, network routing, dns, or something else, but the error happens way too frequently.

I don't have concrete data but the failure rate I am seeing implies that sqladmin.googleapis.com has an uptime <99% in this particular situation/setup.

Do note that this isn't the first request made in the flow. My script successfully hits the metadata server first and then this fails.

@enocom
Copy link
Member

enocom commented May 6, 2024

What kind of CPU usage do you have on this instance? Wondering if this is a client error.

@akshetpandey
Copy link
Author

n1-highmem-4, cpu usage at that point is pretty low.

@akshetpandey
Copy link
Author

Screenshot 2024-05-06 at 1 48 12 PM Screenshot 2024-05-06 at 1 49 47 PM

Cpu usage is down to under 10% when we start cloud-sql-proxy. Let me know if you want more logs. I am not sure what is available, but it will be all gone in 20 days.

@enocom
Copy link
Member

enocom commented May 8, 2024

How many instances are you connecting to in your script?

@akshetpandey
Copy link
Author

Just the 1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

3 participants