Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: curio: alertManager #11926

Merged
merged 5 commits into from May 8, 2024
Merged

feat: curio: alertManager #11926

merged 5 commits into from May 8, 2024

Conversation

LexLuthr
Copy link
Contributor

@LexLuthr LexLuthr commented Apr 24, 2024

Related Issues

Proposed Changes

This PRs adds a new singleton task for alerting.
The alerting task can create an incident for a previously configured service.
User should get critical alerts via SMS, email and call.

  • Test in a running Curio cluster before merging

Additional Info

A new page about alerting is required in Curio docs. We also need to create a section on how to sign up and configure PagerDuty.

Checklist

Before you mark the PR ready for review, please make sure that:

  • Commits have a clear commit message.
  • PR title is in the form of of <PR type>: <area>: <change being made>
    • example: fix: mempool: Introduce a cache for valid signatures
    • PR type: fix, feat, build, chore, ci, docs, perf, refactor, revert, style, test
    • area, e.g. api, chain, state, market, mempool, multisig, networking, paych, proving, sealing, wallet, deps
  • If the PR affects users (e.g., new feature, bug fix, system requirements change), update the CHANGELOG.md and add details to the UNRELEASED section.
  • New features have usage guidelines and / or documentation updates in
  • Tests exist for new functionality or change in behavior
  • CI is green

curiosrc/alertmanager/task_alert.go Outdated Show resolved Hide resolved
curiosrc/alertmanager/task_alert.go Show resolved Hide resolved
curiosrc/alertmanager/task_alert.go Outdated Show resolved Hide resolved
curiosrc/alertmanager/task_alert.go Outdated Show resolved Hide resolved
curiosrc/alertmanager/task_alert.go Outdated Show resolved Hide resolved
curiosrc/alertmanager/task_alert.go Outdated Show resolved Hide resolved
node/config/types.go Outdated Show resolved Hide resolved
node/config/types.go Outdated Show resolved Hide resolved
@LexLuthr LexLuthr requested a review from snadrus April 29, 2024 10:25
@LexLuthr LexLuthr marked this pull request as ready for review April 29, 2024 14:58
@LexLuthr LexLuthr requested a review from magik6k April 29, 2024 14:58
curiosrc/alertmanager/alerts.go Outdated Show resolved Hide resolved
curiosrc/alertmanager/alerts.go Show resolved Hide resolved
curiosrc/alertmanager/task_alert.go Outdated Show resolved Hide resolved
documentation/en/default-curio-config.toml Outdated Show resolved Hide resolved
node/config/def.go Outdated Show resolved Hide resolved
node/config/types.go Outdated Show resolved Hide resolved
@LexLuthr LexLuthr changed the base branch from release/curio-beta to master May 7, 2024 15:28
Copy link
Contributor

@magik6k magik6k left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks really good. Just one nit.

Not in this PR, but would be neat to reuse the config parsing logic in balanceCheck and gather all miner-ids served by the cluster, then check that winningPoSt is happenning correctly based on entries being added to the mining_tasks table (we expect ~2880 entries in last 24h per minerID based on the base_compute_time column, the check should probably be for a smaller window tho)

curiosrc/alertmanager/alerts.go Outdated Show resolved Hide resolved
@LexLuthr
Copy link
Contributor Author

LexLuthr commented May 7, 2024

It's alive!!!

Screenshot 2024-05-07 at 11 48 31 PM

Pushing the suggested changes and some minor changes in another commit.

@LexLuthr
Copy link
Contributor Author

LexLuthr commented May 7, 2024

Looks really good. Just one nit.

Not in this PR, but would be neat to reuse the config parsing logic in balanceCheck and gather all miner-ids served by the cluster, then check that winningPoSt is happenning correctly based on entries being added to the mining_tasks table (we expect ~2880 entries in last 24h per minerID based on the base_compute_time column, the check should probably be for a smaller window tho)

I think a better check would be to just check that all "win" instances resulted in a block which was accepted by the chain. This would help see if SPs are loosing blocks due to any issues.

I have extracted the address function for future use.

@snadrus snadrus enabled auto-merge (squash) May 7, 2024 20:04
@snadrus snadrus disabled auto-merge May 7, 2024 20:04
@magik6k
Copy link
Contributor

magik6k commented May 8, 2024

I think a better check would be to just check that all "win" instances resulted in a block which was accepted by the chain. This would help see if SPs are loosing blocks due to any issues.

That's another good one, but the former does tell you that you have some machines in the cluster taking care of winningPoSt at all times

@magik6k magik6k merged commit 22ccaf9 into master May 8, 2024
184 of 186 checks passed
@magik6k magik6k deleted the feat/alert-manager branch May 8, 2024 13:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants