The Toolkit

The purpose of this toolkit is to help small groups of citizen-hackers join in the End of Term Web Archive 2016, a collaborative project hosted by the Internet Archive, California Digital Library, and others. As concerned citizens, we can use our expertise to fill gaps in the archive and ensure that government data stay archived, open, and usable by the public. We’ve written this document to keep track of what we’re doing and help other groups with similar aims to ours.

IF YOU JUST WANT TO GET TO WORK, CLICK HERE TO GO TO THE EVENT GOALS PAGE! If you’re looking for more context, keep reading.

Why are we doing this?

The End of Term archive webcrawl has run in previous election years, and is a non-partisan historical archiving project of great independent value. However, this year’s effort has attracted widespread attention because of the extreme changes expected with the inauguration of Donald Trump in January. Our perspective draws particularly on Canadian experiences under the previous government, which greatly restricted access to scientific data that conflicted with its policies, and in some cases shut down data collection programs rather than confront their results. The incoming US administration has already announced its intention to dismantle a number of government programs and agencies. The E.P.A. and climate change research are high on the list.

The archiving project thus takes on even greater urgency than it ordinarily would. We hope that, by identifying and capturing key resources, we can make a real difference in the fight to save science and the environment.

Contributing

This repository should house both code developed for the sprint and hackathon events and explanations that clarify the use of those tools. For now, we’re asking that you contribute your work by forking the repo, adding your code and its documentation (this can be a patch to the current file or a separate tool manual), and submitting a pull request. If you aren’t sure what all this means, please contact Matt directly and he’ll help you out.

How it works, and scope of our efforts

The End of Term archiving project is a broad, slow walk through the entire .gov top-level domain. It uses an open-source webcrawling program to visit and archive as many web pages as possible across the whole domain. However, there are a lot of government websites in the US, and many of them are extremely complex. On its own, the webcrawler will not be able to penetrate deeply into the largest of these sites.

Also, many government resources are not really “web pages”; instead, these documents and data live in complex databases that only serve up the hidden data when someone makes a search query. In these cases, the webcrawler will never capture the data in an archive. In other cases, static resources are housed in FTP servers. New developments now allow the webcrawler to capture FTP pages, so FTP sites can now be seeded directly to the webcrawler (change as of Dec. 22, 2016).

To capture these resources, we need to intervene manually in the webcrawler’s instructions, and to use additional technologies to capture crucial information the webcrawler can’t reach on its own. Our efforts deal mostly with the E.P.A. and other environmental agencies, but we hope that the same techniques can work for NASA, NOAA, NSF, HUD, HHS, and other endangered departments and agencies.

For now, we see our objectives as the following:

  • identifying and flagging deeply buried sites that link to important troves of HTML documents that the webcrawler will not be likely to find on its own
  • developing recipes for reverse-engineering the URLs of database-stored documents, and potentially adding metadata to archived PDF files located in this way
  • locating non-document-centric datasets (that is, databases collecting large sets of quantitative data instead of texts), assessing whether they are in danger of disappearing, and developing methods to retrieve and archive the information

So there are three categories of preservation work here; let’s discuss them in turn.

Deep Web Pages: Nominating Seeds

This process is technically straightforward but requires some web research skills.
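
Since nominated seeds are just URLs, even a short script can help. The sketch below (assuming the requests and BeautifulSoup libraries, with a hypothetical hub page standing in for whichever index you are investigating) harvests the same-host links from one page so that a person can review them and nominate the promising ones as seeds.

    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    START_URL = "https://www.epa.gov/some-program-index"  # hypothetical hub page
    ALLOWED_HOST = "www.epa.gov"

    def harvest_links(url):
        """Return the absolute links found on one page, restricted to one host."""
        html = requests.get(url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        links = set()
        for anchor in soup.find_all("a", href=True):
            absolute = urljoin(url, anchor["href"]).split("#")[0]
            if urlparse(absolute).netloc == ALLOWED_HOST:
                links.add(absolute)
        return links

    if __name__ == "__main__":
        # One level of harvesting is often enough to surface sub-indexes
        # worth reviewing and nominating by hand.
        for link in sorted(harvest_links(START_URL)):
            print(link)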

Text Databases: Reverse-Engineering URLs and Scraping Metadata

We are taking the E.P.A. publications list as an initial test of this process. That link goes to a full listing, by title, of all EPA publications currently available online. Crawling this space will take a long time but is technically trivial. However, we lose all metadata by crawling in this way. The search page here gives one more piece of data, the year of publication; that’s not much, but it is fairly useful for people who want to make use of these documents. What is the best way for us to collect and preserve that metadata?
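
One low-tech answer is to write the metadata to a sidecar file that travels with the crawl. The sketch below shows the general pattern: pair each document URL with its title and year and append the record to a CSV. The results URL and the CSS selectors are hypothetical placeholders, not the real EPA markup.

    import csv

    import requests
    from bs4 import BeautifulSoup

    RESULTS_URL = "https://nepis.epa.gov/some-search-results-page"  # placeholder

    def scrape_results(url):
        """Yield one record per result row: document URL, title, and year."""
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        for row in soup.select("tr.result"):          # placeholder selector
            link = row.select_one("a[href]")
            year = row.select_one("td.year")          # placeholder selector
            if link is None:
                continue
            yield {
                "url": link["href"],
                "title": link.get_text(strip=True),
                "year": year.get_text(strip=True) if year else "",
            }

    if __name__ == "__main__":
        with open("epa_publications_metadata.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=["url", "title", "year"])
            writer.writeheader()
            for record in scrape_results(RESULTS_URL):
                writer.writerow(record)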

Another, slightly less transparent store of documents is the collection of Environmental Impact Statements. We haven’t found a full listing of all of these documents, and the search interface is a little more complex, so it may be harder to scrape, but it also allows more of the metadata to be retained.

A third set of documents is actually embedded in a more data-driven interface (so it’s sort of a transition to the next category), at the Enforcement and Compliance History repository. What would it take to recover this data?

Tools

We’ve been looking at Scrapy and its GUI builder, Portia, as building blocks for our scraping efforts. However, we haven’t settled on either a format for the metadata we can collect or a good way to integrate it directly with the Internet Archive’s broader webcrawling project.
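
As a starting point for discussion, a minimal Scrapy spider might look like the sketch below. The listing URL and selectors are placeholders, and JSON Lines output is just one candidate metadata format, not a settled choice.

    import scrapy


    class EpaPublicationsSpider(scrapy.Spider):
        """Sketch of a spider for a publications listing.

        Run with:  scrapy runspider epa_pubs_spider.py -o publications.jl
        """

        name = "epa_publications"
        # Hypothetical listing page; replace with the real index being scraped.
        start_urls = ["https://nepis.epa.gov/some-listing-page"]

        def parse(self, response):
            # Placeholder selectors; adjust to the actual page markup.
            for row in response.css("tr.result"):
                yield {
                    "title": row.css("a::text").extract_first(default="").strip(),
                    "url": response.urljoin(row.css("a::attr(href)").extract_first(default="")),
                    "year": row.css("td.year::text").extract_first(default="").strip(),
                }
            # Follow pagination links if the listing is split across pages.
            next_page = response.css("a.next::attr(href)").extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)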

Quantitative Datasets: Finding, Filtering, Downloading, and Archiving Them

The EPA hosts many datasets, large and small. A good number of them are online at Envirofacts. Information about Envirofacts and other APIs is also available online. Many of these datasets can also be downloaded from EPA, but the download links are sent via email. It should be possible to automate this process using a combination of a scraper and a mail API like Mailgun or SendGrid. The question of short-term storage (for acquisition), and especially of long-term archiving, remains unsolved.
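
As a rough illustration of that email round trip, the sketch below swaps in a plain IMAP inbox (Python’s standard-library imaplib) for a hosted mail API such as Mailgun or SendGrid. The host, the credentials, and the assumption that download links arrive as .zip URLs in plain-text messages are all placeholders.

    import email
    import imaplib
    import re

    import requests

    IMAP_HOST = "imap.example.org"                          # placeholder
    USER, PASSWORD = "rescue@example.org", "app-password"   # placeholders
    LINK_PATTERN = re.compile(r"https?://\S+\.zip")  # assumed link format

    def fetch_download_links():
        """Scan unread mail for download links sent by the data portal."""
        links = []
        conn = imaplib.IMAP4_SSL(IMAP_HOST)
        conn.login(USER, PASSWORD)
        conn.select("INBOX")
        _, data = conn.search(None, "UNSEEN")
        for num in data[0].split():
            _, msg_data = conn.fetch(num, "(RFC822)")
            msg = email.message_from_bytes(msg_data[0][1])
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    body = part.get_payload(decode=True).decode(errors="replace")
                    links.extend(LINK_PATTERN.findall(body))
        conn.logout()
        return links

    if __name__ == "__main__":
        # Pull each linked file down for short-term storage.
        for url in fetch_download_links():
            filename = url.rstrip("/").split("/")[-1]
            with open(filename, "wb") as fh:
                fh.write(requests.get(url, timeout=120).content)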

Moreover, it’s not yet clear which datasets are vulnerable. Some are stored on data.gov, and others may be unlikely to catch the attention of the incoming administration, which is probably most focused on climate change research and environmental assessments. Colleagues at Penn are working on a survey of scientists, social scientists, and public servants, which should help us to set priorities.

Thanks

To @patcon, @dcwalk, @mackenzien, and others at Civic Tech TO, as well as Michelle Murphy, Patrick Keilty, Alessandro Delfanti, and many others.
