This repository has been archived by the owner on Dec 25, 2020. It is now read-only.

Research: What is the simplest way to get PII from the list of Categories of Records? #7

Open
ondrae opened this issue Feb 10, 2020 · 5 comments

Comments

@ondrae
Collaborator

ondrae commented Feb 10, 2020

What:
Research: What is the simplest way to get PII from the list of Categories of Records?

Depends on:
#5 and #6

Why:
If our assumptions about OMB A130 are correct, then we need a repeatable way to turn categories of records from PIAs and SORNs into an inventory of PII. Something we can turn into code would be best. Instructions on how to do it by hand work too, just less enticing for a new agency wanting to use our service.

How:
Just an example of one way: Find some official NIST or GSA list of PII. We compare that official list against the categories of records, keeping only the matching PII. If no official list exists, make your own list. Using your expertise, choose what is PII and what isn’t. The GSA privacy office would probably love to help. They could maybe even do it for you?!?

Try to avoid anything complicated, like combinations of records that become PII.
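
The list-comparison idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `REFERENCE_PII` is a made-up stand-in for whatever official NIST or GSA list turns up, the sample text is invented, and `normalize` and `find_pii` are hypothetical helper names. It deliberately does exact phrase matching only, with no handling of combinations of records.

```python
import re

# Hypothetical reference list; a real one would come from NIST, GSA,
# or the privacy office.
REFERENCE_PII = {
    "name",
    "social security number",
    "date of birth",
    "home address",
    "email address",
    "phone number",
}


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())


def find_pii(categories_of_records, reference=REFERENCE_PII):
    """Return the reference PII terms that appear as whole phrases in the text."""
    haystack = f" {normalize(categories_of_records)} "
    return sorted(
        term for term in reference if f" {normalize(term)} " in haystack
    )


# Invented sample text, not taken from a real SORN or PIA.
sample = "Records include name, Social Security Number, home address, and badge photo."
print(find_pii(sample))
```

Exact matching keeps the result easy to audit by hand, but it misses simple variants such as plurals ("names"), so the reference list would probably need to carry those explicitly.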

Acceptance:
We will have an understanding of the suggested approach. The partners have agreed to this approach.

@ondrae ondrae added this to To do in Phase three work Feb 10, 2020
@ondrae ondrae added this to the Inventory PII milestone Feb 10, 2020
@nikzei

nikzei commented Feb 10, 2020

@nikzei and @peterrowland to pair on tightening up AC.

@peterrowland
Member

peterrowland commented Feb 10, 2020

Two examples of previous projects that used natural language processing to categorize text data into consistent categories.

https://github.com/GSA/calc/pull/997
TL;DR: This is a clever method for broader term matching. It filters out words that are uncommon in the dataset, then tries different combinations of the remaining terms to look for a category match. It requires more data, but stripping out uncommon words and trying different combinations may be worth considering as a way to do more generalized matching.
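
A rough sketch of that filter-then-combine idea, loosely inspired by the description above and not taken from the CALC code. The corpus, the category set, the `min_count` threshold, and the function names `common_words` and `match_category` are all assumptions made up for illustration.

```python
from collections import Counter
from itertools import combinations


def common_words(corpus, min_count=2):
    """Return the words that appear at least min_count times across the corpus."""
    counts = Counter(word for entry in corpus for word in entry.lower().split())
    return {word for word, count in counts.items() if count >= min_count}


def match_category(entry, categories, vocabulary):
    """Drop uncommon words, then try combinations of the rest against categories."""
    words = [w for w in entry.lower().split() if w in vocabulary]
    # Try the longest combinations first so the most specific match wins.
    for size in range(len(words), 0, -1):
        for combo in combinations(words, size):
            candidate = " ".join(combo)
            if candidate in categories:
                return candidate
    return None


# Invented data for illustration only.
corpus = [
    "employee home address",
    "contractor home address",
    "former employee home phone",
    "home phone",
]
categories = {"home address", "home phone"}
vocabulary = common_words(corpus)
print(match_category("retired employee home address", categories, vocabulary))
```

The number of combinations grows exponentially with the number of words kept, so this only stays cheap for short entries like category names.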

https://github.com/18F/10x-ssp-parse-prototype
TL;DR: This project scrapes narrative controls text contained in SSP documents and uses popular natural language processing (NLP) libraries to quantify similarity between texts. This method isn't applicable to matching short terms like categories, but would be useful for comparing longer fields in SORNs and PIAs.
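
For a sense of what "quantifying similarity between texts" can look like, here is a minimal bag-of-words cosine similarity using only the standard library. It is a stand-in, not the method the prototype uses (that project relies on NLP libraries), and `tokenize` and `cosine_similarity` are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity of word-count vectors: 0.0 (no overlap) to 1.0 (same counts)."""
    counts_a, counts_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented field text for illustration only.
score = cosine_similarity(
    "Records contain names and home addresses of employees.",
    "The system stores employee names and home addresses.",
)
print(round(score, 2))
```

A score like this ignores word order and meaning, which is one reason it suits longer narrative fields better than short category terms.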

@nikzei nikzei moved this from To do to In progress in Phase three work Feb 11, 2020
@peterrowland
Member

Marcela referred me to the Commodity Futures Trading Commission as an example of plain-language terms for PII.
https://www.cftc.gov/Privacy/cftcpia/index.htm
https://www.cftc.gov/media/2001/piaems051019/download

@peterrowland
Member

peterrowland commented Feb 13, 2020

The question surfaced: Do SORN Categories of Records == PII?

The Privacy Act defines a record as:
"any item, collection, or grouping of information about an individual that is maintained by an agency..."

and a system of records as:
"a group of any records under the control of any agency from which information is retrieved by the name of the individual or by some identifying number, symbol, or other identifying particular assigned to the individual"

https://www.law.cornell.edu/uscode/text/5/552a

If Privacy Act 'records' are personal information, should we consider personal information PII?

GAO report (08-536) uses 'Personal Information' and 'Personally Identifiable Information' interchangeably, and uses this definition:

"includ[es] (1) any information that can be used to distinguish or trace an individual’s identity, such as name, Social Security number, date and place of birth, mother’s maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information."
https://www.gao.gov/assets/gao-08-536.pdf - p.1

NIST's guidance on protecting PII (800-122) references this definition and goes into detail on what information can be used to distinguish or trace an individual, and what linked or linkable means.
https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-122.pdf - p 2-1

We should ask Richard or Marcela to confirm if GSA also uses this definition.

@ondrae
Collaborator Author

ondrae commented Feb 13, 2020

@peterrowland Thank you for this research.

I was wrong in my assumption that Categories of Records meant something different enough from PII that we should treat them differently. Based on what you found above, I'm going to start talking about them both as pretty much the same thing, and will use the terms interchangeably.

Is there anything else you want to do before we close this issue?

@peterrowland peterrowland moved this from In progress to Done in Phase three work Feb 18, 2020
