MinPinnedVersionId not increase when there's no batch queries running #16644

Open
xxchan opened this issue May 8, 2024 · 6 comments

@xxchan
Member

xxchan commented May 8, 2024

background:
https://risingwave-labs.slack.com/archives/C034TRN6A49/p1714528348327289

During on-call, we found that the MinPinnedVersionIdNotIncrease alert kept firing for a cluster. The version ID increased only once every 3 hours.

Grafana indicates that there were no batch queries running.
So it's weird that the version id has been pinned for such an extended period.

Not sure if this is a bug. If it's not, maybe we should remove this alert.

@github-actions github-actions bot added this to the release-1.10 milestone May 8, 2024
@xxchan
Member Author

xxchan commented May 8, 2024

@zwang28 @Little-Wallace I randomly assigned this to you. Feel free to find someone else to check this issue.

@hzxa21
Collaborator

hzxa21 commented May 9, 2024

Took a glance at Grafana:

  • Barrier latency and barrier number increased at 05/01 01:30 UTC+8:
    image

  • At around the same time, the delay in MinPinnedVersionId happened:
    image

I suspect the high barrier latency is the trigger of the issue.

@hzxa21
Collaborator

hzxa21 commented May 9, 2024

Recovery triggered at the same time:
image

I suspect that one CN stopped receiving version updates after recovery.

@zwang28
Contributor

zwang28 commented May 10, 2024

> The version id increase only every 3 hours.

The unpin every 3 hours is forced by max_version_pinning_duration_sec = 10800.
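
To illustrate why a long-held pin would show up as a step every 3 hours, here is a minimal sketch. It is not RisingWave's implementation: the function `min_pinned_version`, the worker names, and the version numbers are hypothetical, and the only value taken from this thread is max_version_pinning_duration_sec = 10800. It assumes a pin is simply ignored once it has been held for the maximum duration.

```python
# Value quoted in this thread; everything else below is a hypothetical illustration.
MAX_VERSION_PINNING_DURATION_SEC = 10800


def min_pinned_version(pins, now, max_pin_sec=MAX_VERSION_PINNING_DURATION_SEC):
    """Return the smallest pinned version id among pins that have not expired.

    pins maps a (hypothetical) worker name to (pinned_version_id, pinned_at_sec).
    A pin held for max_pin_sec or longer is treated as forcefully unpinned.
    Returns None when no live pin remains.
    """
    live = [version for version, pinned_at in pins.values()
            if now - pinned_at < max_pin_sec]
    return min(live) if live else None


# One worker holds an old version from t=0; another pinned a newer one later.
pins = {"cn-1": (100, 0), "cn-2": (250, 9000)}

# 10800 seconds expressed in hours.
hours = MAX_VERSION_PINNING_DURATION_SEC / 3600

just_before = min_pinned_version(pins, now=10799)  # old pin still counts
at_timeout = min_pinned_version(pins, now=10800)   # old pin is dropped
```

Under these assumptions the minimum stays at the old version until the timeout elapses and then jumps, which matches the 3-hour cadence described above.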

@fuyufjh
Contributor

fuyufjh commented May 24, 2024

Noting down some observations from today's case: alert MinPinnedVersionIdNotIncrease, severity [warning], cluster [prod-aws-euno1-eks-a], at 10:04 AM.

Grafana URL

Min-pinned version ID was stuck:

image image image

Meanwhile, pinned epoch IDs were normal:

image

The barrier was abnormal, but only after the first min-epoch stall. Thus, it's likely to be a result rather than the cause.

image

There were no heavy batch queries.

image

Now I am wondering what happened at 7:58 AM, i.e. the first stall.

The stall at 7:58 AM, which I consider the root cause, was caused by a sink error:

image

Grafana Logs URL

s3 error: service error: NoSuchKey: The specified key does not exist
and
prefetch meet error when read 13368429..13451152 from sst-39302557 (63181742)

@zwang28
Contributor

zwang28 commented May 24, 2024

The Hummock version is pinned by the log store. It's consistently held due to the large volume of historical logs to consume, until it is forcefully unpinned by max_version_pinning_duration_sec.

The sink's log input rate exceeds its consumption rate, so the situation will keep deteriorating.
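
As a rough back-of-the-envelope sketch of why this deteriorates: with constant rates, the unconsumed log grows linearly at the difference between the two. The function name and the rates below are made up for illustration and are not measurements from this cluster.

```python
def backlog_after(input_rate, consume_rate, seconds, initial_backlog=0.0):
    """Unconsumed log rows after `seconds`, assuming constant rates (rows/sec).

    The backlog can never be negative, so the result is clamped at zero.
    """
    return max(0.0, initial_backlog + (input_rate - consume_rate) * seconds)


# Hypothetical rates: the sink receives 1000 rows/s but only drains 800 rows/s.
# Over one forced-unpin window (10800 s) the backlog grows by 200 * 10800 rows.
growth_per_window = backlog_after(1000, 800, 10800)

# If consumption outpaces input, an existing backlog drains to zero instead.
drained = backlog_after(800, 1000, 10, initial_backlog=100)
```

As long as the input rate stays above the consumption rate, each forced-unpin window ends with more history left to read than the previous one.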

https://risingwave-labs.slack.com/archives/C034TRN6A49/p1716526705213879?thread_ts=1716516350.612169&cid=C034TRN6A49
