Stabilize and monitor data pipeline

Past due by over 1 year, 50% complete

From #1626:

Our data pipeline has gone down for days, and sometimes weeks, with no one noticing when upstream sources make changes or have data issues. Keeping up with these issues has been my first focus. I think I’ve made significant progress here, but some work remains. For instance, pipeline alerts currently go to a single, otherwise busy person. Something that creates GitHub issues on failure and/or sends alerts to our contributor Slack would be very helpful.
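
As a starting point for the alerting idea above, here is a minimal sketch that builds a GitHub "create issue" request and a Slack incoming-webhook request for a failed pipeline step. The function names, the `step`/`error` arguments, and the message wording are all hypothetical; only the GitHub REST endpoint (`POST /repos/{owner}/{repo}/issues`) and the Slack `{"text": ...}` webhook payload are standard. It is not tied to how our pipeline actually reports failures.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_issue_request(owner, repo, token, step, error):
    """Build (but do not send) a request that opens a GitHub issue."""
    payload = {
        "title": f"Pipeline failure: {step}",
        "body": f"Step `{step}` failed with:\n\n{error}",
    }
    return urllib.request.Request(
        f"{GITHUB_API}/repos/{owner}/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_slack_request(webhook_url, step, error):
    """Build (but do not send) a Slack incoming-webhook message."""
    payload = {"text": f"Pipeline step `{step}` failed: {error}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_all(requests, opener=urllib.request.urlopen):
    """Send every request; return the URLs that could not be reached.

    One failing channel must not stop the others from being notified.
    """
    failed = []
    for req in requests:
        try:
            with opener(req, timeout=10) as resp:
                resp.read()
        except OSError:  # urllib.error.URLError is a subclass of OSError
            failed.append(req.full_url)
    return failed
```

Sending to both channels means the alert no longer depends on one person's inbox; the `opener` parameter is there only so the sending step can be exercised without network access.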

Some locations are no longer providing test data at all. Quite a few provide it less reliably and less frequently. Our pipeline needs to be robust against both issues. Our model needs to ensure that missing data is mitigated as far as possible. Our UI needs to communicate the impact of missing data clearly to users, perhaps by gathering some kind of prevalence assessment from the user directly.
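
One way the pipeline could tell "stopped reporting" apart from "reporting late" is a simple staleness check over each location's most recent report date. The sketch below is an assumption about how that might look: `find_stale_locations`, the `last_reported` mapping, and the 7-day default threshold are illustrative and not taken from the existing code.

```python
from datetime import date, timedelta


def find_stale_locations(last_reported, today, max_age_days=7):
    """Return the locations whose latest data is missing or too old.

    last_reported maps a location name to the date of its most recent
    report, or to None if it has never reported (hypothetical structure).
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        loc
        for loc, last in last_reported.items()
        if last is None or last < cutoff
    )
```

The resulting list could feed both the model (to decide where mitigation is needed) and the UI (to flag affected locations to users).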

A validation procedure will also be very helpful, to characterize the impact of changes made by folks like me who didn’t write the original code and model. I don’t know exactly what this looks like yet, but something pragmatic we can build on is better than nothing.
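
One pragmatic starting point, offered only as a sketch, would be to run the model before and after a change and compare the outputs per location. Everything here is an assumption: `compare_outputs`, the idea that outputs reduce to one number per location, and the 5% tolerance are placeholders for whatever the real model produces.

```python
def compare_outputs(baseline, candidate, rel_tol=0.05):
    """Compare two {location: value} mappings from before and after a change.

    Reports locations that disappeared, locations that appeared, and
    locations whose value moved by more than rel_tol (relative change).
    """
    report = {"missing": [], "added": [], "changed": {}}
    for key in sorted(baseline.keys() | candidate.keys()):
        if key not in candidate:
            report["missing"].append(key)
        elif key not in baseline:
            report["added"].append(key)
        else:
            old, new = baseline[key], candidate[key]
            # Guard against division by zero when the old value is 0.
            rel_change = abs(new - old) / max(abs(old), 1e-12)
            if rel_change > rel_tol:
                report["changed"][key] = (old, new)
    return report
```

A report like this does not say whether a change is correct, but it makes the size and location of its impact visible before the change is merged.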