DevOps for Privacy Offices

We envision a future in which the public can easily understand how and why personally identifiable information gets collected by government agencies.

To get there, we're working with federal privacy offices to structure data from privacy-related compliance documents that are currently published as PDFs. Structuring that data equips privacy offices to search these documents more quickly, reduces unnecessary manual work, and lays a foundation for easier collaboration with engineering teams.

This project is funded by 10x.

The Privacy Dashboard is developed in a separate repo.

Our phase three work is happening in partnership with the GSA's Privacy Office.

Install

The scraping code is written in Python and runs locally. We recommend creating a virtual environment using virtualenv to install and manage the required Python libraries. Run these commands in the repository directory on your machine to create a local virtual environment, activate it, and install all requirements.

virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt

Scraping Data

Running python sorn_scraper.py does the following (a rough sketch of this flow appears after the list):

  • Fetches the contents of the page where GSA publishes links and descriptions of System of Records Notices (SORNs)
  • Scrapes the unique SORN identifier contained in each federalregister.gov URL and crafts a URL for the XML version of the full-text document
  • Downloads those XML files and parses them to get the text from specific sections of the document:
    • System Name
    • PII
    • Purpose
    • Retention Policy
    • Routine Uses
    • Document Title
  • Outputs text from these fields into a local .csv file called gsa_sorns.csv with one row per system.
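For orientation, here is a minimal sketch of that flow in Python, using requests and BeautifulSoup. The GSA index URL, the federalregister.gov XML URL pattern, the heading keywords, and the title handling are illustrative assumptions, not the repo's actual implementation; sorn_scraper.py remains the authoritative version.

import csv
import re
import xml.etree.ElementTree as ET

import requests
from bs4 import BeautifulSoup

# Assumed URL for the GSA SORN listing page; substitute the real index page.
GSA_SORN_INDEX = "https://www.gsa.gov/reference/gsa-privacy-program"

# Assumed heading keywords; actual SORN heading wording varies by notice.
HEADINGS = {
    "system_name": r"SYSTEM NAME",
    "pii": r"CATEGORIES OF RECORDS",
    "purpose": r"PURPOSE",
    "retention_policy": r"RETENTION AND DISPOSAL",
    "routine_uses": r"ROUTINE USES",
}

def find_sorn_urls(index_url):
    # Collect federalregister.gov document links from the GSA index page.
    soup = BeautifulSoup(requests.get(index_url).text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if "federalregister.gov/documents/" in a["href"]]

def xml_url_for(document_url):
    # Derive a full-text XML URL from the document URL's date and number.
    # The /documents/full_text/xml/ pattern is an assumption about the site.
    match = re.search(r"/documents/(\d{4})/(\d{2})/(\d{2})/([\w-]+)/", document_url)
    if not match:
        return None
    year, month, day, doc_number = match.groups()
    return ("https://www.federalregister.gov/documents/full_text/xml/"
            f"{year}/{month}/{day}/{doc_number}.xml")

def extract_sections(flat_text):
    # Locate each heading and capture the text up to the next heading.
    hits = sorted(
        (m.start(), m.end(), key)
        for key, rx in HEADINGS.items()
        for m in [re.search(rx, flat_text, re.IGNORECASE)] if m
    )
    sections = {key: "" for key in HEADINGS}
    for i, (_, end, key) in enumerate(hits):
        stop = hits[i + 1][0] if i + 1 < len(hits) else len(flat_text)
        sections[key] = flat_text[end:stop].strip(" :.")
    return sections

def main():
    rows = []
    for doc_url in find_sorn_urls(GSA_SORN_INDEX):
        xml_url = xml_url_for(doc_url)
        if xml_url is None:
            continue
        response = requests.get(xml_url)
        if response.status_code != 200:
            continue
        # Flatten the XML to one string; the real scraper reads specific elements.
        root = ET.fromstring(response.content)
        flat = " ".join(" ".join(root.itertext()).split())
        row = extract_sections(flat)
        row["document_title"] = flat[:120]  # placeholder for the real title element
        rows.append(row)
    with open("gsa_sorns.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["document_title", *HEADINGS])
        writer.writeheader()
        writer.writerows(rows)  # one row per system

if __name__ == "__main__":
    main()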