Specify archiver success & failure conditions #70

Open · 1 of 17 tasks · Tracked by #61
zaneselvans opened this issue Feb 23, 2023 · 1 comment

zaneselvans (Member) commented Feb 23, 2023

Our goal is to have the archivers running on an automated schedule in the background, taking snapshots of the original data sources which can be accessed programmatically. This will minimize the overhead associated with keeping our raw inputs up to date, but we still need the system to alert us when something goes wrong so we can fix it.

  • Each dataset's archiver should generate a report describing the outcome of the archiving run. See Summarize archiver run and send notification via GH Action #60.
  • Depending on what happens during the archiving run (which could be encapsulated by the generated report), the run should be declared a success or a failure. Success or failure should be defined at the level of each individual dataset so as to effectively direct our attention to where the problem is.
  • What constitutes a successful archiving run will vary by dataset. We need a way to specify our expectations, and anything outside those expectations should produce a failure. If we start with stringent criteria and find we're getting too many false positives ("failures" that are actually okay), we can loosen the criteria based on the actual outcomes we see. One way these checks might be encoded is sketched after this list.
  • In general we expect the set of data partitions to either remain constant or grow over time, but we have fairly specific expectations about what new data partitions should look like. In most cases, new partitions are just additional timesteps, e.g.:
    • a new year of EIA-923 data
    • a new month of EIA-860m data
    • a new month of EPA CEMS data
    • a new quarter of FERC EQR
    • In the case of FERC's XBRL data, which is scraped from an RSS feed, we expect to see any number of new individual filings, primarily in the most recent timestep, but occasionally revising older filings.
  • If any expected data partition is not found, that should result in failure.
  • In the case of frequently updated datasets like the EIA-860M, it might make sense to raise a 🚩 if much more time than expected passes without an update. E.g., if there haven't been any changes to a monthly dataset for 3 months, maybe the agency has actually started putting new data somewhere else.
  • If a new data partition of an unexpected form is found, that should result in failure, since it means our expectations about what the data should look like are no longer correct, and we should be required to update them explicitly.
  • We should at least report some measure of the scale of changes detected between versions, and beyond a certain threshold we might want to cause a failure. E.g.:
    • If all the expected data partitions are found but some of them have decreased in size by 90%, something is probably wrong and needs to be investigated.
    • If the file type has changed even though the file name has not (as happened when EIA-176 data started being published as Excel spreadsheets labeled as CSVs), that should also probably result in a failure that demands investigation.
    • If an unusually high proportion of the data partitions have changed but they still have the right names, file types, and sizes, maybe it's not a failure, but it should be investigated, since it could indicate big revisions to the data (which often happen without any notice or documentation).
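
Below is a minimal sketch of how these per-dataset conditions might be encoded. All of the names here (`PartitionExpectation`, `ValidationResult`, `validate_run`) and the thresholds are hypothetical illustrations, not part of the existing archiver code.

```python
# Hypothetical sketch -- none of these names exist in the archiver codebase.
import re
from dataclasses import dataclass, field


@dataclass
class PartitionExpectation:
    """What partitions for one dataset are allowed to look like."""

    name_pattern: re.Pattern  # e.g. re.compile(r"eia923-\d{4}\.zip")
    file_type: str  # expected extension, e.g. "zip"
    max_months_stale: int  # flag if no new partition appears for this long


@dataclass
class ValidationResult:
    failures: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    @property
    def success(self) -> bool:
        return not self.failures


def validate_run(
    old: dict[str, dict],  # partition name -> {"size": int, "file_type": str}
    new: dict[str, dict],
    expect: PartitionExpectation,
    months_since_update: int,
) -> ValidationResult:
    result = ValidationResult()

    # Any previously archived partition that disappears is a hard failure.
    for name in old.keys() - new.keys():
        result.failures.append(f"expected partition missing: {name}")

    # New partitions must match the expected form (e.g. a new year of
    # EIA-923); anything else means our expectations need updating.
    for name in new.keys() - old.keys():
        if not expect.name_pattern.fullmatch(name):
            result.failures.append(f"unexpected new partition: {name}")

    for name in old.keys() & new.keys():
        # File type changed under an unchanged name (the EIA-176
        # Excel-labeled-as-CSV case): demands investigation.
        if new[name]["file_type"] != expect.file_type:
            result.failures.append(f"file type changed: {name}")
        # A partition shrinking by 90% is probably truncated or broken.
        if new[name]["size"] < 0.1 * old[name]["size"]:
            result.failures.append(f"partition shrank >90%: {name}")

    # Widespread changes with the right names/types/sizes: warn, since it
    # may just be a big (undocumented) revision of the data.
    changed = [n for n in old.keys() & new.keys() if new[n] != old[n]]
    if old and len(changed) > 0.5 * len(old):
        result.warnings.append(f"{len(changed)}/{len(old)} partitions changed")

    # A frequently updated dataset going quiet may mean the agency has
    # started publishing the data somewhere else.
    if months_since_update > expect.max_months_stale:
        result.warnings.append(f"no updates in {months_since_update} months")

    return result
```

The key design choice is that `validate_run` never raises: it accumulates failures and warnings into a report-like object, so a single run can surface every violated expectation at once, matching the per-dataset reporting described in #60.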

Tasks

  1. censusdp1tract (assignee: inframundo)
  2. eia176 (assignee: inframundo)
  3. 1 of 2 subtasks (assignee: aesharpe)
zschira (Member) commented Feb 24, 2023

I think having success/failure conditions like this is a really good idea for making automated archives useful and, hopefully, catching errors early.

> In general we expect the set of data partitions to either remain constant or grow over time

I think we could probably fail, or at least require human review, any time we would delete a partition outright, as that's almost always unexpected.

Another thing we should probably start considering is procedures for handling failures. For example, we need some sort of human-intervention mechanism for accepting an archive that we deem acceptable even though it generated a failure. A hypothetical shape for that mechanism is sketched below.
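
One possible shape for that override mechanism, purely as a sketch: a reviewer records sign-off on a specific flagged run, and the archiver checks for it before treating the run as a blocking failure. The file name, schema, and function here are invented for illustration.

```python
# Hypothetical sketch of a human-override check; nothing here exists
# in the archiver codebase yet.
import json
from pathlib import Path

# Assumed location for reviewer-approved runs, keyed by dataset:
# {"eia176": ["run-2023-02-20"], ...}
APPROVALS = Path("validation_overrides.json")


def is_failure_approved(dataset: str, run_id: str) -> bool:
    """Return True if a human has explicitly signed off on this run."""
    if not APPROVALS.exists():
        return False
    overrides = json.loads(APPROVALS.read_text())
    return run_id in overrides.get(dataset, [])
```

A run that fails validation but has an approval on file could then be published with a note in the run report, rather than blocking the archive outright.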
