Connecting to cloud-sql using private-ip sometimes fails with a TLS handshake timeout #2208

akshetpandey · 2024-05-01T21:55:44Z

Bug Description

I am running v2.9.0 of cloud-sql-proxy (cloud-sql-proxy.linux.amd64) on a GCE n1-highmem-4 instance that is started through Dataflow. The binary runs inside a container built on an Ubuntu 22 base image.

As part of the container entry point, I run the following script:

CLOUD_SQL_PROXY_INSTANCES="<db-instance-name>"

/bin/cloud_sql_proxy --private-ip -u /cloudsql ${CLOUD_SQL_PROXY_INSTANCES} &

# Wait for all instances
for instance in ${CLOUD_SQL_PROXY_INSTANCES//,/ } ; do
    while ! pg_isready -h "/cloudsql/${instance}" ; do
        echo "Waiting for instance ${instance} to be online"
        sleep 1
    done
done

Occasionally, I will get the following error:

2024/04/21 01:34:43 [$DB-STRING] could not resolve instance version: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

2024/04/21 01:34:43 The proxy has encountered a terminal error: unable to start: [$DB-STRING] Unable to mount socket: failed to get instance: Refresh error: failed to get instance metadata (connection name = "$DB-STRING"): Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

The script then gets stuck in an infinite loop, because cloud-sql-proxy exits instead of retrying the connection, so pg_isready never succeeds. Some of this is my fault, since I should be using a process manager, but the timeout is unexpected.

The GCE instance should not be throttling, and it runs in the same region as the Cloud SQL instance. I do not know how to check whether the other side of the auth proxy is having issues. Many other instances (mostly GAE) also connect to it, and I occasionally see similar issues on them.

PS: This is a new issue, opened as a follow-up to a comment I posted in a different issue: #2081 (comment)

Steps to reproduce?

No easy reproduction steps, since this happens occasionally (~a few times a week).

Environment

OS type and version: Ubuntu 22
Cloud SQL Proxy version (./cloud-sql-proxy --version): v2.9.0
Proxy invocation command (for example, ./cloud-sql-proxy --port 5432 INSTANCE_CONNECTION_NAME): /bin/cloud_sql_proxy --private-ip -u /cloudsql $INSTANCE &

enocom · 2024-05-03T18:11:58Z

Thanks @akshetpandey.

The root problem seems to be this:

Get "https://sqladmin.googleapis.com/sql/v1beta4/projects/$PROJECT/instances/$DB/connectSettings?alt=json&prettyPrint=false": net/http: TLS handshake timeout

The SQL Admin API call isn't responding and the Proxy dies. We recently added retry support for 50x responses here: GoogleCloudPlatform/cloud-sql-go-connector#781. I wonder if we should extend that to include more generic TLS errors.

akshetpandey · 2024-05-03T22:21:32Z

I ended up changing my script to check if the pid is still alive and to restart it if it is not. An internal retry will definitely address the issue too.
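A minimal sketch of that kind of liveness check follows. It is not the script actually used here. The function name supervise_until_ready and the variables POLL_INTERVAL and PROXY_PID are hypothetical. It assumes the readiness check is wrapped in a shell function and that a bounded number of restarts is acceptable.

```shell
#!/usr/bin/env bash
# Sketch only: supervise_until_ready, POLL_INTERVAL and PROXY_PID are
# hypothetical names, not part of cloud-sql-proxy.
#
# usage: supervise_until_ready MAX_RESTARTS READY_FN PROXY_CMD [ARGS...]
# Starts PROXY_CMD in the background, then polls READY_FN. If the proxy
# process has exited before READY_FN succeeds, it is started again, at
# most MAX_RESTARTS times. On success the live PID is left in PROXY_PID.
supervise_until_ready() {
    local max_restarts=$1 ready_fn=$2 restarts=0
    shift 2

    "$@" &
    PROXY_PID=$!

    until "$ready_fn"; do
        # kill -0 sends no signal; it only tests whether the PID still exists.
        if ! kill -0 "$PROXY_PID" 2>/dev/null; then
            if [ "$restarts" -ge "$max_restarts" ]; then
                echo "proxy not ready after ${restarts} restart(s); giving up" >&2
                return 1
            fi
            restarts=$((restarts + 1))
            echo "proxy exited; restart ${restarts}/${max_restarts}" >&2
            "$@" &
            PROXY_PID=$!
        fi
        sleep "${POLL_INTERVAL:-1}"
    done
}

# Example wiring for the single-instance setup described in this issue:
#
#   proxy_ready() { pg_isready -h "/cloudsql/${CLOUD_SQL_PROXY_INSTANCES}"; }
#   supervise_until_ready 5 proxy_ready \
#       /bin/cloud_sql_proxy --private-ip -u /cloudsql "${CLOUD_SQL_PROXY_INSTANCES}"
```

Capping the restarts means a persistent failure still surfaces as a non-zero exit instead of another endless loop. A real process manager remains the more robust option.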

I do want to add that something seems fishy here. I'm not sure if it's the container, Dataflow, SQL Admin, network routing, DNS, or something else, but the error happens far too frequently.

I don't have concrete data, but the failure rate I am seeing implies that sqladmin.googleapis.com has an uptime below 99% in this particular setup.

Do note that this isn't the first request made in the flow. My script successfully hits the metadata server first and then this fails.
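One way to collect concrete data on where the time goes would be to probe the endpoint periodically from the same container and log curl's per-phase timings. The sketch below rests on assumptions: handshake_seconds and probe_sqladmin are hypothetical helper names, and it assumes that an unauthenticated request to the bare API host is enough, since only the DNS, TCP and TLS phases matter here. The --write-out variables time_namelookup, time_connect and time_appconnect are standard curl timing fields.

```shell
#!/usr/bin/env bash
# Sketch only: handshake_seconds and probe_sqladmin are hypothetical helper
# names. Probing the bare API host without credentials is an assumption:
# the request is expected to get an HTTP error back, but only after DNS,
# TCP and TLS have completed, which is the part being timed.

# usage: handshake_seconds TCP_DONE TLS_DONE
# Prints the seconds spent between TCP connect and TLS handshake completion.
handshake_seconds() {
    awk -v tcp="$1" -v tls="$2" 'BEGIN { printf "%.3f\n", tls - tcp }'
}

# usage: probe_sqladmin [MAX_SECONDS]
# Prints one timestamped line per probe; returns 1 if curl itself fails.
probe_sqladmin() {
    local max=${1:-10} now times rc
    now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    times=$(curl --silent --output /dev/null --max-time "$max" \
        --write-out '%{time_namelookup} %{time_connect} %{time_appconnect}' \
        https://sqladmin.googleapis.com/)
    rc=$?
    if [ "$rc" -ne 0 ]; then
        # curl exit code 28 means the --max-time limit was hit.
        echo "$now probe failed (curl exit code $rc)"
        return 1
    fi
    # Split the three space-separated timings into $1 $2 $3.
    set -- $times
    echo "$now dns=$1 tcp=$2 tls_done=$3 handshake=$(handshake_seconds "$2" "$3")"
}

# Example: probe once a minute and keep a log.
#
#   while true; do probe_sqladmin 10; sleep 60; done >> /tmp/sqladmin-probe.log
```

A log like this would show whether slow or failed probes cluster in DNS, TCP connect, or the TLS handshake itself, which narrows down whether the network path or the API front end is the more likely culprit.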

enocom · 2024-05-06T16:49:08Z

What kind of CPU usage do you have on this instance? Wondering if this is a client error.

akshetpandey · 2024-05-06T16:56:36Z

n1-highmem-4; CPU usage at that point is pretty low.

akshetpandey · 2024-05-06T20:52:26Z

CPU usage is down to under 10% when we start cloud-sql-proxy. Let me know if you want more logs. I am not sure what is available, but it will all be gone in 20 days.

enocom · 2024-05-08T15:52:10Z

How many instances are you connecting to in your script?

akshetpandey · 2024-05-11T19:01:09Z

Just the 1

akshetpandey added the "type: bug" label on May 1, 2024

enocom added the "type: question" label and removed the "type: bug" label on May 3, 2024

enocom added the "priority: p2" and "type: bug" labels and removed the "type: question" label on May 3, 2024

enocom assigned jackwotherspoon May 6, 2024
