Are there other high priority formats we haven’t created plans for yet? #4

Open
lljohnston opened this issue Sep 12, 2019 · 6 comments

Comments

@lljohnston
Collaborator

No description provided.

@rexbradford

See the issue I raised entitled "Paper records - which format category do they fit under?" Scanned paper records don't seem to be discussed in this list. PDF is currently the lingua franca for scanned paper, but there are alternatives, and PDF itself is an umbrella format with a wide variety of choices and issues involved.

@fhkjaerskov

We have, and have had, ongoing preservation discussions on statistical and survey data formats (e.g., R, SAS, SPSS, Stata, and SDMX) that generally fit under the research data umbrella. Under this umbrella, we also discuss more cumbersome formats, such as genetic data sets originating from, e.g., whole gene sequencing, that we might ingest in due course. In addition, we are having early discussions on how to preserve data (decisions, calculations, profiles, etc.) in public administration when it is influenced or directly caused by machine learning algorithms (AI). How do we contain, document, and preserve such algorithms?

@dangormanjr

Could you speak to how you're preserving forms of computer data that predate personal computing — for instance, punch cards, magnetic tape data, etc.?

@lljohnston
Collaborator Author

lljohnston commented Aug 12, 2020

@dangormanjr

We copy all files off any media that we receive from agencies, and actively monitor our storage and migrate to new storage media on a regular basis. We are consolidating preservation systems into a cloud environment with geographic replication.

Our goal is always to copy files off any media that we receive as soon as possible. We do sometimes keep the original agency media as well. We do annual sampling on any media that we retain, and if we still have the media after 10 years, the files are migrated to new media.
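As a rough illustration of what sampling retained media could look like (a sketch only, not a description of NARA's actual tooling): assuming a SHA-256 manifest was recorded when the files were first copied off the media, a random sample of files can be re-hashed and compared against it. The function names, the manifest layout, and the default 10% sample fraction are all hypothetical.

```python
import hashlib
import random
from pathlib import Path
from typing import Dict, List, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> Dict[str, str]:
    """Record a checksum for every file under root (hypothetical manifest layout)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_sample(
    root: Path,
    manifest: Dict[str, str],
    fraction: float = 0.1,  # assumed sample size, not a documented policy
    seed: Optional[int] = None,
) -> List[str]:
    """Re-hash a random sample of files and return the names that fail."""
    names = sorted(manifest)
    if not names:
        return []
    count = min(len(names), max(1, round(len(names) * fraction)))
    failures = []
    for name in random.Random(seed).sample(names, count):
        target = root / name
        # A missing file or a changed checksum both count as a failure.
        if not target.is_file() or sha256_of(target) != manifest[name]:
            failures.append(name)
    return sorted(failures)
```

Calling `build_manifest` once at ingest and `verify_sample` on each later check would give a list of files that need attention; an empty list means every sampled file still matches its recorded checksum.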

We do have some punch cards. I have been told that there is at least one case (long before I got to the agency) where there was a project to use software to read a set of cards and copy the data off for research use.

@lljohnston
Collaborator Author

@rexbradford It's an interesting question. This work covers both born-digital and digitized textual records; we treat both equally from a preservation standpoint. The Transfer Guidance provides more details about preferred formats for digitized texts versus textual data. There are also new regulations coming that focus on the digitization of permanent records. As for PDF, it is the lingua franca for access, but not necessarily for preservation. As you say, PDFs vary widely depending on the tools and settings used to create them.

@lljohnston
Collaborator Author

@fhkjaerskov U.S. federal agencies have not commonly identified any software or algorithms as permanent records to transfer to NARA, so this hasn't been a top priority for our work. We have received software and code as part of transfers, and do simple bit-level preservation. We have a lot of datasets, some going back 50 years, and we always hope to get codebooks with them to understand how the data was created/structured. We also have databases, and we try to save not just the data but the tables, and their relationships and rules/stored procedures. We are watching this space around algorithm preservation as the thinking and tools evolve.
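To make the point about saving tables, relationships, and rules more concrete, here is a minimal sketch using SQLite and Python's standard `sqlite3` module. It is only an illustration under stated assumptions, not the process described above: the demo tables and the `export_schema` helper are hypothetical, and since SQLite has no stored procedures, a trigger stands in for a database rule.

```python
import sqlite3
from typing import List


def build_demo() -> sqlite3.Connection:
    """Create a small hypothetical database with a relationship and a rule."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE agency (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE record (
            id INTEGER PRIMARY KEY,
            agency_id INTEGER NOT NULL REFERENCES agency(id),
            title TEXT NOT NULL
        );
        CREATE TRIGGER record_title_not_empty
        BEFORE INSERT ON record
        WHEN NEW.title = ''
        BEGIN
            SELECT RAISE(ABORT, 'title must not be empty');
        END;
        """
    )
    return conn


def export_schema(conn: sqlite3.Connection) -> List[str]:
    """Return the CREATE statements for tables, indexes, views, and triggers.

    Tables come first so that the statements can be replayed in order.
    """
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE sql IS NOT NULL AND name NOT LIKE 'sqlite_%' "
        "ORDER BY CASE type WHEN 'table' THEN 0 WHEN 'index' THEN 1 "
        "WHEN 'view' THEN 2 ELSE 3 END, name"
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    for statement in export_schema(build_demo()):
        print(statement + ";")
```

Keeping the exported statements alongside the row data preserves the foreign-key relationship and the trigger, which a plain CSV export of each table would lose.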


4 participants