Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: timeout when restarting PostgreSQL and while lifting fencing #4504

Merged
merged 2 commits into from
May 20, 2024

Conversation

leonardoce
Copy link
Contributor

@leonardoce leonardoce commented May 9, 2024

The instance manager starts PostgreSQL:

  1. when it starts up
  2. when configuration changes are being applied (after stopping it)
  3. when fencing is lifted.

In the second and third examples, the operator is requested by the embedded cluster reconciler loop, and performed without any timeout.

If PostgreSQL won't start up again because of a wrong configuration or missing disk space, the reconciler loop will be stuck waiting for a dead postmaster to be up.

This patch handles this condition by using a combination of the timeout parameters that are already set in the cluster.

Fixes: #4501

@leonardoce leonardoce requested a review from a team as a code owner May 9, 2024 14:22
@github-actions github-actions bot added backport-requested ◀️ This pull request should be backported to all supported releases release-1.21 release-1.22 release-1.23 labels May 9, 2024
Copy link
Contributor

github-actions bot commented May 9, 2024

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

@leonardoce
Copy link
Contributor Author

Hi! This one seems harmless and needed, but I encourage you to take a deep look.

After much thinking, I still didn't make up my mind about whether it was better to fix it or leave it as it is and employ the energies on refactoring, to make the code flow clearer.

@leonardoce
Copy link
Contributor Author

/test limit=local

Copy link
Contributor

@leonardoce, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9091027252

@mnencia
Copy link
Member

mnencia commented May 17, 2024

/test tl=4 l=local

Copy link
Contributor

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/9131874590

Leonardo Cecchi and others added 2 commits May 20, 2024 12:19
The instance manager starts PostgreSQL:

1. when it starts up
2. when configuration changes are being applied (after stopping it)
3. when fencing is lifted.

In the second and third example, the operator is requested by the
embedded cluster reconciler loop, and performed without any timeout.

If PostgreSQL won't start up again because of a wrong configuration or
missing disk space, the reconciler loop will be stuck waiting for a dead
postmaster to be up.

This patch handles this condition by using a combination of the timeout
parameters that are already set in the cluster.

Fixes: cloudnative-pg#4501

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
@mnencia
Copy link
Member

mnencia commented May 20, 2024

/ok-to-merge E2E green. Only expected unrelated failures.

@cnpg-bot cnpg-bot added the ok to merge 👌 This PR can be merged label May 20, 2024
@mnencia mnencia merged commit 09a4d80 into cloudnative-pg:main May 20, 2024
27 of 28 checks passed
cnpg-bot pushed a commit that referenced this pull request May 20, 2024
)

The instance manager starts PostgreSQL:

1. when it starts up
2. when configuration changes are being applied (after stopping it)
3. when fencing is lifted.

In the second and third examples, the operator is requested by the
embedded cluster reconciler loop, and performed without any timeout.

If PostgreSQL won't start up again because of a wrong configuration or
missing disk space, the reconciler loop will be stuck waiting for a dead
postmaster to be up.

This patch handles this condition by using a combination of the timeout
parameters that are already set in the cluster.

Fixes: #4501

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
(cherry picked from commit 09a4d80)
mnencia pushed a commit that referenced this pull request May 20, 2024
)

The instance manager starts PostgreSQL:

1. when it starts up
2. when configuration changes are being applied (after stopping it)
3. when fencing is lifted.

In the second and third examples, the operator is requested by the
embedded cluster reconciler loop, and performed without any timeout.

If PostgreSQL won't start up again because of a wrong configuration or
missing disk space, the reconciler loop will be stuck waiting for a dead
postmaster to be up.

This patch handles this condition by using a combination of the timeout
parameters that are already set in the cluster.

Fixes: #4501

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
(cherry picked from commit 09a4d80)
mnencia pushed a commit that referenced this pull request May 20, 2024
)

The instance manager starts PostgreSQL:

1. when it starts up
2. when configuration changes are being applied (after stopping it)
3. when fencing is lifted.

In the second and third examples, the operator is requested by the
embedded cluster reconciler loop, and performed without any timeout.

If PostgreSQL won't start up again because of a wrong configuration or
missing disk space, the reconciler loop will be stuck waiting for a dead
postmaster to be up.

This patch handles this condition by using a combination of the timeout
parameters that are already set in the cluster.

Fixes: #4501

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Co-authored-by: Leonardo Cecchi <leonardo.cecchi@enteprisedb.com>
Co-authored-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
(cherry picked from commit 09a4d80)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-requested ◀️ This pull request should be backported to all supported releases ok to merge 👌 This PR can be merged release-1.21 release-1.22 release-1.23
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: Instance manager reconciler loop stuck if PostgreSQL restart fails
4 participants