[BUG] Terraform/OVH API crashes randomly #588
Hi, I'm having the same problems, and I'd like to add that this also happens when I delete a public cloud network.
I see that @amstuta linked #591 to this issue, but that is unlikely to fix all the problems. I don't believe the problem is strongly provider-related. Sure, some timeout errors might be fixed by it, but the provider can't account for an MKS upgrade. For example, we have MKS clusters with different numbers of nodes, different machine types, etc., where clusters reach the ready state at different times, unless the timeout is pushed out to some ridiculously high value. I still believe that the issue is on the API side, but of course I could be wrong.
@apinter you're right, we're currently planning various patches on the API and on the provider to improve the overall experience with the products mentioned in your issue; that's why I linked the PR.
I've just tried to deploy a mongo cluster with the latest provider (v0.41.0) and ran into a timeout again. Pretty sure that something is not right lately with OVH, considering that yesterday a VM backup took over 2.5 hrs instead of the "few seconds" 🙃

```hcl
resource "ovh_cloud_project_database" "mongodb_medres" {
  provider                = ovh.ovh
  description             = "medres-mongo"
  disk_size               = 40
  engine                  = "mongodb"
  flavor                  = "db2-4"
  opensearch_acls_enabled = false
  plan                    = "production"
  service_name            = var.ovh_service_name
  version                 = "6.0"

  nodes {
    region = "DE"
  }
  nodes {
    region = "DE"
  }
  nodes {
    region = "DE"
  }
}

resource "ovh_cloud_project_database_mongodb_user" "xund_medres_admin" {
  provider     = ovh.ovh
  service_name = var.ovh_service_name
  cluster_id   = ovh_cloud_project_database.mongodb_medres.id
  name         = "admin"
  roles = [
    "clusterMonitor@admin",
    "readWriteAnyDatabase@admin",
    "userAdminAnyDatabase@admin",
  ]
}

resource "ovh_cloud_project_database_mongodb_user" "xund_medres" {
  provider     = ovh.ovh
  service_name = var.ovh_service_name
  cluster_id   = ovh_cloud_project_database.mongodb_medres.id
  name         = "xund_medres"
  roles = [
    "dbAdminAnyDatabase@admin",
    "readWriteAnyDatabase@admin",
  ]
}

## Authorized ips
resource "ovh_cloud_project_database_ip_restriction" "medres_authip_mongo" {
  provider     = ovh.ovh
  for_each     = var.authorized_networks_medres
  service_name = var.ovh_service_name
  cluster_id   = ovh_cloud_project_database.mongodb_medres.id
  engine       = ovh_cloud_project_database.mongodb_medres.engine
  description  = each.value.description
  ip           = each.value.ip
}
```
Made the debug-mode output available here. EDIT: by the way, it went over the new default 40m timeout by about 10 minutes.
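For anyone hitting the same 40m wall while the API is slow: the default can reportedly be raised per resource. A minimal sketch, assuming `ovh_cloud_project_database` honors a standard `timeouts` block (worth verifying against the provider docs for your version; the durations below are arbitrary examples):

```hcl
resource "ovh_cloud_project_database" "mongodb_medres" {
  # ... same attributes as in the config above ...

  timeouts {
    create = "2h" # stretch past the default 40m while cluster creation is slow
    update = "2h"
    delete = "1h"
  }
}
```

This only papers over slow API responses; it doesn't fix the underlying 500s.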
TF refresh did nothing, while TF apply just started to create a new mongo cluster with the same name. Not sure if this is new behavior, or whether it is expected to work like this.
Apparently, the 2nd deployment went better, but I was able to reproduce the very same timeout as before. With that said, it still skipped this resource:

```hcl
resource "ovh_cloud_project_database_mongodb_user" "xund_medres_admin" {
  provider     = ovh.ovh
  service_name = var.ovh_service_name
  cluster_id   = ovh_cloud_project_database.mongodb_medres.id
  name         = "admin"
  roles = [
    "clusterMonitor@admin",
    "readWriteAnyDatabase@admin",
    "userAdminAnyDatabase@admin",
  ]
}
```

The apply also timed out. The debug output is available here. At this point I have 2 mongo prod-grade clusters deployed...
Downgraded to v0.40.0 and hit the same issue... in the output you can see the 409 responses, yet it keeps running for 20m for no reason. This worked just fine on 0.40.0 before, so again, it is unlikely to be a provider issue...
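For anyone comparing provider versions the same way, pinning an exact version makes the bisection reproducible. A minimal sketch (the `ovh/ovh` source address is the Terraform Registry's; the pinned version is just an example):

```hcl
terraform {
  required_providers {
    ovh = {
      source  = "ovh/ovh"
      version = "= 0.40.0" # pin exactly while bisecting provider versions
    }
  }
}
```

Run `terraform init -upgrade` after changing the pin so the lock file follows.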
Little update: had some API issues again. Happened 3-4 times in a row.
Trying to remove my
Debug log as requested: f7a647e331bf4fd3c6264773e5fe0ec3. EDIT: my public IP: 103.121.110.221
After verifying internally, I can confirm that neither your IP nor your region is being blocked. Since you mentioned that using a VPN in Singapore solved the issue, one possibility is that there is poor connectivity between your AS and ours. If you are OK with using a VPN, then let's stick to it. If you are not OK with using the VPN, we need more information to escalate the case to our network team, such as a
Thanks for looking into this. I'm OK with using my VPN, but my employer might not be in the long run 🙃
This is the result if I use WireGuard:
Very different results.
Describe the bug
Hi, not sure if this is indeed a provider bug (it feels a lot more like an API bug), but here it goes: Terraform plan/apply randomly crashes with timeouts, 500 response codes, or other errors. A business support ticket already exists for this very issue on OVH (9391258), but in 2 weeks we have gotten absolutely nowhere. Opening the bug here as per the suggestion on Discord.
Terraform Version
OVH Terraform Provider Version
0.39.0 and 0.40.0
Earlier versions had the same or similar issues.
Affected Resource(s)
I would also list the VM instances resource here, but that belongs to a different provider. Essentially any resource we use can cause issues.
This is not an issue with Terraform core, as other providers elsewhere work just fine, and restarting the plan/apply (sometimes multiple times) eventually gets the job done.
Terraform Configuration Files
Debug Output
The most common outputs we see: https://gist.github.com/apinter/cdda84c7eb975c2f52beff5d701bd488
Panic Output
Expected Behavior
Terraform can be used seamlessly with OVH to deploy resources.
Actual Behavior
Terraform/OVH API randomly crashes, rendering an unreliable
Steps to Reproduce
Please list the steps required to reproduce the issue, for example:
terraform plan
terraform apply
The more resources, the more likely it is to crash, but it can happen with a simple 4-resource deployment as well, or with an MKS cluster upgrade. (This just happened yesterday when we upgraded from 1.26 to 1.27 with TF; it timed out after 10 minutes.)
References
Additional context
This is not a networking error on our end. It doesn't matter where we run the deployment from; it can still end up with a crash.
Support suggested that we are hitting a rate limit, but based on the official documentation that should return a 429 response, not a 500. Even if we do hit a rate limit and the endpoint responds with 500, that is a bug.
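The distinction matters for client-side retry logic: a 429 is transient and safe to retry with backoff, while a 500 signals a server-side fault that retries merely paper over. A minimal sketch of that policy in generic Python (not the provider's actual retry code; `send` is a hypothetical callable returning an HTTP status code):

```python
import time


def call_with_backoff(send, max_retries=5, base_delay=0.0):
    """Retry rate-limited responses with exponential backoff.

    429 (rate limit) is retried with an exponentially growing delay;
    any other status, including 500, is returned immediately so the
    caller can surface it rather than mask a server bug.
    """
    status = None
    for attempt in range(max_retries):
        status = send()
        if status == 429:
            time.sleep(base_delay * (2 ** attempt))
            continue
        return status
    return status  # still rate-limited after max_retries attempts
```

With this policy a persistent 500 fails fast, which is exactly why the API returning 500 for a rate limit would break well-behaved clients.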