
Make SirCAL alert when lag is below expected lower threshold #1918

Open
melange396 opened this issue Dec 15, 2023 · 2 comments

Comments

@melange396
Contributor

Sir Complains-A-Lot (aka SirCAL) alerts when lag exceeds a per-indicator threshold for what is "typical". The thresholds were set by hand based on what had been observed at some point in time, with the expectation that the reporting pattern would remain consistent. The alerts help identify problems, but for a number of legitimate reasons the typical lag from a data source might change: when there is a new, longer "expected" lag, alerts will fire somewhat regularly, and the team can then investigate and increase the threshold as appropriate.

In fact, our indicators seem to have typical/expected lag "ranges", so in addition to a typical lag "max", we also see a typical lag "min". For example, nchs-mortality is a weekly signal whose lag currently varies between 12 and 18 days (12 on the day a new update is released, and up to 18 over the course of the rest of the week). For a number of legitimate reasons, a data source might be able to decrease its typical lag, such as improvements in its reporting or processing pipelines. If we alert when the observed lag is below a minimum bound, we can detect and respond to this. If the source in the example above shaved a day off its cycle, the new range would be 11 to 17 days of lag; we could then tighten the max lag threshold, but only if the change is brought to our attention, which is what necessitates this new alert condition.

TL;DR: we should alert when lag is outside of a range (instead of just when it exceeds an upper bound) so that we can identify changes in reporting patterns and adjust thresholds appropriately.

"max_age" detection and alert generation code:

if current_lag_in_days > source_config["max_age"] + grace:
    if row["signal"] not in age_complaints:
        age_complaints[row["signal"]] = Complaint(
            f"is {current_lag_in_days} days old",
            data_source,
            row["signal"],
            [row["geo_type"]],
            row["max_time"],
            source_config["maintainers"])
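
A minimal sketch of what the complementary lower-bound check could look like, mirroring the logic above. This is only illustrative: the "min_age" config key and the freshness_complaints dict are hypothetical names, not existing code, and the check is gated on the key being present so sources without a configured minimum are unaffected.

# Hypothetical lower-bound check, mirroring the "max_age" logic above.
# "min_age" and freshness_complaints are illustrative names only.
if "min_age" in source_config and current_lag_in_days < source_config["min_age"]:
    if row["signal"] not in freshness_complaints:
        freshness_complaints[row["signal"]] = Complaint(
            f"is only {current_lag_in_days} days old, below its expected minimum lag",
            data_source,
            row["signal"],
            [row["geo_type"]],
            row["max_time"],
            source_config["maintainers"])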

Threshold specification(s):


"nchs-mortality": {
"max_age":16,

@rlunde

rlunde commented Dec 21, 2023

Do we save the observed lag for a signal in a database? It might be interesting to run some analysis on the collected data to see a measure of variability, for example.

@melange396
Contributor Author

Lag for all data points of our signals can be pulled from our database or API without too much trouble... but doing an analysis on that across different dimensions might be worthy of a publication, and thus outside the scope of this issue! We have "max lag" in some of our internal dashboards and the variability is pretty boring; we mostly see sawtooth patterns where the lag increases by 1 every day when there are no updates, and then it jumps back down when a data drop happens. I can point it out for you in elastic/kibana some time, but I'm sure you've actually already seen it.
