Our goal is to have the archivers running on an automated schedule in the background, taking snapshots of the original data sources which can be accessed programmatically. This will minimize the overhead associated with keeping our raw inputs up to date, but we still need the system to alert us when something goes wrong so we can fix it.
Depending on what happens during the archiving run (which could be encapsulated by the generated report), the run should be declared a success or a failure. Success or failure should be defined at the level of each individual dataset so as to effectively direct our attention to where the problem is.
What constitutes a successful archiving run will vary by dataset. We need a way to specify our expectations, and if anything outside of those expectations happens we should get a failure. If we start with stringent criteria and find we're getting too many false positives ("failures" that are actually okay), we can loosen the criteria based on the actual outcomes we see.
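One way to make those expectations explicit is to declare them per dataset and evaluate each run's report against them. This is only a sketch of the idea; the class, field, and report-key names here are hypothetical, not part of any existing archiver API:

```python
from dataclasses import dataclass


@dataclass
class DatasetExpectations:
    """Declared success criteria for one dataset's archive (illustrative names)."""
    name: str
    allow_new_partitions: bool = True
    max_size_drop_pct: float = 90.0  # fail if any partition shrinks more than this


def evaluate_run(expectations: DatasetExpectations, report: dict) -> bool:
    """Return True (success) only if nothing in the run report violates expectations."""
    if report.get("missing_partitions"):
        return False
    if report.get("unexpected_partitions") and not expectations.allow_new_partitions:
        return False
    if report.get("max_size_drop_pct", 0.0) > expectations.max_size_drop_pct:
        return False
    return True
```

Starting with strict defaults and loosening them per dataset (as described above) then becomes a matter of editing one declaration rather than hunting through validation logic.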
In general we expect the set of data partitions to either remain constant or grow over time, but we have fairly specific expectations about what new data partitions should look like. In most cases, new partitions are just additional timesteps, e.g.:
a new year of EIA-923 data
a new month of EIA-860m data
a new month of EPA CEMS data
a new quarter of FERC EQR
In the case of FERC's XBRL data, which is scraped from an RSS feed, we expect to see any number of new individual filings, primarily in the most recent timestep, but occasionally revising older filings.
If any expected data partition is not found, that should result in failure.
In the case of frequently updated datasets like the EIA-860M, it might make sense to raise a 🚩 if much more time than expected passes without an update. E.g. if there haven't been any changes to a monthly dataset for 3 months, maybe the agency has actually started putting new data somewhere else.
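A staleness check like that is simple to express. A minimal sketch, assuming we track the date of each dataset's last observed change (the function names and the 3-month default are illustrative, mirroring the example above):

```python
from datetime import date


def months_since(last_update: date, today: date) -> int:
    """Whole calendar months elapsed between two dates."""
    return (today.year - last_update.year) * 12 + (today.month - last_update.month)


def staleness_flag(last_update: date, today: date, max_stale_months: int = 3) -> bool:
    """Flag a frequently updated dataset that has gone quiet for too long.

    The right threshold varies by dataset: 3 months might suit a monthly
    release like EIA-860m, while an annual dataset would need a longer window.
    """
    return months_since(last_update, today) >= max_stale_months
```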
If a new data partition of an unexpected form is found, that should result in failure, since it means our expectations about what the data should look like are no longer correct. We should be required to explicitly update our expectations about the data.
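Both partition checks (missing partitions fail; new partitions of unexpected form fail) can be combined into one comparison against the declared expectations. A sketch, assuming partitions are identified by name strings and the expected form of a new partition can be captured by a regular expression (e.g. `\d{4}` for a new year of EIA-923):

```python
import re


def check_partitions(
    actual: set[str], expected: set[str], new_partition_pattern: str
) -> list[str]:
    """Return a list of failure messages; an empty list means success.

    Illustrative sketch: any expected partition that is missing fails the run,
    and so does any new partition whose name doesn't match the anticipated form.
    """
    failures = []
    missing = expected - actual
    if missing:
        failures.append(f"missing partitions: {sorted(missing)}")
    malformed = {
        p for p in actual - expected if not re.fullmatch(new_partition_pattern, p)
    }
    if malformed:
        failures.append(f"unexpected partition form: {sorted(malformed)}")
    return failures
```

With this shape, adding a new year of data passes automatically (it matches the pattern), while anything else forces us to explicitly update the expected set or the pattern.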
We should at least report some measure of the scale of changes detected between versions, and beyond a certain threshold we might want to cause a failure. E.g.:
If all the expected data partitions are found, but some of them have decreased in size by 90%, something is probably wrong and needs to be investigated.
If the file type has changed even though the file name has not changed (as happened when the EIA-176 started publishing Excel spreadsheets but calling them CSVs) that should also probably result in a failure that demands investigation.
If an unusually high proportion of the data partitions has changed, but they still have the right names, file types, and sizes, maybe it's not a failure, but should be investigated as it could indicate big revisions to the data (as often happens without any kind of notice or documentation).
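The three cases above can be sketched as one change-classification pass that separates hard failures (drastic size drops, changed media types) from softer review flags (widespread content churn). Everything here is illustrative: the metadata field names, the 90% size threshold, and the 50% churn threshold are assumptions, not an existing implementation:

```python
def classify_changes(
    old: dict[str, dict],
    new: dict[str, dict],
    size_drop_threshold: float = 0.9,
    churn_threshold: float = 0.5,
) -> tuple[list[str], list[str]]:
    """Classify version-to-version changes as (failures, review_flags).

    `old` and `new` map partition name -> {"size": bytes, "hash": str,
    "media_type": str}; all names and thresholds are hypothetical.
    """
    failures, flags = [], []
    changed = 0
    for name, meta in new.items():
        prev = old.get(name)
        if prev is None:
            continue  # new partitions are handled by the partition checks
        # A partition shrinking by ~90% almost certainly signals a problem.
        if prev["size"] and meta["size"] < prev["size"] * (1 - size_drop_threshold):
            failures.append(f"{name}: size dropped {prev['size']} -> {meta['size']}")
        # Same file name, different actual media type (the EIA-176 CSV/Excel case).
        if meta["media_type"] != prev["media_type"]:
            failures.append(f"{name}: media type changed to {meta['media_type']}")
        if meta["hash"] != prev["hash"]:
            changed += 1
    # Widespread silent revisions aren't a failure, but deserve human review.
    if old and changed / len(old) > churn_threshold:
        flags.append(f"{changed}/{len(old)} partitions changed content")
    return failures, flags
```

Keeping failures and flags separate lets the scheduler page us only for the former while batching the latter into a periodic review.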
I think having success/failure conditions like this is a really good idea for making automated archives useful and hopefully catching errors early.
In general we expect the set of data partitions to either remain constant or grow over time
I think we could probably fail, or at least require human review, any time we would delete a partition outright, as that's almost always unexpected.
Another thing we should probably start considering is procedures for handling failures. For example, we need some sort of human intervention mechanism for accepting an archive that we deem acceptable even though it generated a failure.
Tasks
censusdp1tract archiver success conditions #71
eia176 archiver success conditions #72
epacems archiver success conditions #214