Specify archiver success & failure conditions #70

Open · 1 of 17 tasks · Tracked by #61
zaneselvans opened this issue Feb 23, 2023 · 1 comment

zaneselvans (Member) commented Feb 23, 2023

Our goal is to have the archivers running on an automated schedule in the background, taking snapshots of the original data sources which can be accessed programmatically. This will minimize the overhead associated with keeping our raw inputs up to date, but we still need the system to alert us when something goes wrong so we can fix it.

  • Each dataset's archiver should generate a report describing the outcome of the archiving run. See Summarize archiver run and send notification via GH Action #60.
  • Depending on what happens during the archiving run (which could be encapsulated by the generated report), the run should be declared a success or a failure. Success or failure should be defined at the level of each individual dataset so as to effectively direct our attention to where the problem is.
  • What constitutes a successful archiving run will vary by dataset. We need a way to specify our expectations, and anything outside those expectations should produce a failure. If we start with stringent criteria and find we're getting too many false positives ("failures" that are actually okay), we can loosen the criteria based on the actual outcomes we see. One way these checks might be encoded is sketched after this list.
  • In general we expect the set of data partitions to either remain constant or grow over time, but we have fairly specific expectations about what new data partitions should look like. In most cases, new partitions are just additional timesteps, e.g.:
    • a new year of EIA-923 data
    • a new month of EIA-860m data
    • a new month of EPA CEMS data
    • a new quarter of FERC EQR
    • In the case of FERC's XBRL data, which is scraped from an RSS feed, we expect to see any number of new individual filings, primarily in the most recent timestep, but occasionally revising older filings.
  • If any expected data partition is not found, that should result in failure.
  • In the case of frequently updated datasets like the EIA-860M, it might make sense to raise a 🚩 if much more time than expected passes without an update. E.g., if there haven't been any changes to a monthly dataset for 3 months, maybe the agency has actually started putting new data somewhere else.
  • If a new data partition of an unexpected form is found, that should result in failure, since it means our expectations about what the data should look like are no longer correct, and we should be required to update them explicitly.
  • We should at least report some measure of the scale of changes detected between versions, and beyond a certain threshold we might want to cause a failure. E.g.:
    • If all the expected data partitions are found but some of them have decreased in size by 90%, something is probably wrong and needs to be investigated.
    • If the file type has changed even though the file name has not (as happened when EIA-176 data started being published as Excel spreadsheets labeled as CSVs), that should also probably result in a failure that demands investigation.
    • If an unusually high proportion of the data partitions have changed but they still have the right names, file types, and sizes, maybe it's not a failure, but it should be investigated, since it could indicate big revisions to the data (which often happen without any notice or documentation).
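
Below is a minimal sketch of how these per-dataset conditions might be encoded. All of the names here (`PartitionExpectation`, `ValidationResult`, `validate_run`) and the thresholds are hypothetical illustrations, not part of the existing archiver code.

```python
# Hypothetical sketch -- none of these names exist in the archiver codebase.
import re
from dataclasses import dataclass, field


@dataclass
class PartitionExpectation:
    """What partitions for one dataset are allowed to look like."""

    name_pattern: re.Pattern  # e.g. re.compile(r"eia923-\d{4}\.zip")
    file_type: str  # expected extension, e.g. "zip"
    max_months_stale: int  # flag if no new partition appears for this long


@dataclass
class ValidationResult:
    failures: list[str] = field(default_factory=list)
    warnings: list[str] = field(default_factory=list)

    @property
    def success(self) -> bool:
        return not self.failures


def validate_run(
    old: dict[str, dict],  # partition name -> {"size": int, "file_type": str}
    new: dict[str, dict],
    expect: PartitionExpectation,
    months_since_update: int,
) -> ValidationResult:
    result = ValidationResult()

    # Any previously archived partition that disappears is a hard failure.
    for name in old.keys() - new.keys():
        result.failures.append(f"expected partition missing: {name}")

    # New partitions must match the expected form (e.g. a new year of
    # EIA-923); anything else means our expectations need updating.
    for name in new.keys() - old.keys():
        if not expect.name_pattern.fullmatch(name):
            result.failures.append(f"unexpected new partition: {name}")

    for name in old.keys() & new.keys():
        # File type changed under an unchanged name (the EIA-176
        # Excel-labeled-as-CSV case): demands investigation.
        if new[name]["file_type"] != expect.file_type:
            result.failures.append(f"file type changed: {name}")
        # A partition shrinking by 90% is probably truncated or broken.
        if new[name]["size"] < 0.1 * old[name]["size"]:
            result.failures.append(f"partition shrank >90%: {name}")

    # Widespread changes with the right names/types/sizes: warn, since it
    # may just be a big (undocumented) revision of the data.
    changed = [n for n in old.keys() & new.keys() if new[n] != old[n]]
    if old and len(changed) > 0.5 * len(old):
        result.warnings.append(f"{len(changed)}/{len(old)} partitions changed")

    # A frequently updated dataset going quiet may mean the agency has
    # started publishing the data somewhere else.
    if months_since_update > expect.max_months_stale:
        result.warnings.append(f"no updates in {months_since_update} months")

    return result
```

The key design choice is that `validate_run` never raises: it accumulates failures and warnings into a report-like object, so a single run can surface every violated expectation at once, matching the per-dataset reporting described in #60.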

Tasks

  1. censusdp1tract (assignee: inframundo)
  2. eia176 (assignee: inframundo)
  3. 1 of 2 subtasks (assignee: aesharpe)
zschira (Member) commented Feb 24, 2023

I think having success/failure conditions like this is a really good idea for making automated archives useful and, hopefully, catching errors early.

> In general we expect the set of data partitions to either remain constant or grow over time

I think we could probably fail, or at least require human review, any time we would delete a partition outright, as that's almost always unexpected.

Another thing we should probably start considering is procedures for handling failures. For example, we need some sort of human-intervention mechanism for accepting an archive that we deem acceptable even though it generated a failure. A hypothetical shape for that mechanism is sketched below.
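
One possible shape for that override mechanism, purely as a sketch: a reviewer records sign-off on a specific flagged run, and the archiver checks for it before treating the run as a blocking failure. The file name, schema, and function here are invented for illustration.

```python
# Hypothetical sketch of a human-override check; nothing here exists
# in the archiver codebase yet.
import json
from pathlib import Path

# Assumed location for reviewer-approved runs, keyed by dataset:
# {"eia176": ["run-2023-02-20"], ...}
APPROVALS = Path("validation_overrides.json")


def is_failure_approved(dataset: str, run_id: str) -> bool:
    """Return True if a human has explicitly signed off on this run."""
    if not APPROVALS.exists():
        return False
    overrides = json.loads(APPROVALS.read_text())
    return run_id in overrides.get(dataset, [])
```

A run that fails validation but has an approval on file could then be published with a note in the run report, rather than blocking the archive outright.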
