sfomuseum/go-libraryofcongress


go-libraryofcongress

Go package providing tools for working with Library of Congress data.

Documentation

Go Reference

Tools

$> make cli
go build -mod vendor -o bin/parse-lcnaf cmd/parse-lcnaf/main.go
go build -mod vendor -o bin/parse-lcsh cmd/parse-lcsh/main.go

parse-lcnaf

parse-lcnaf is a command-line tool to parse the Library of Congress lcnaf.both.ndjson (or lcnaf.both.ndjson.zip) Name Authority file and output CSV-encoded name authority ID and (English) label data.

$> ./bin/parse-lcnaf -h
parse-lcnaf is a command-line tool to parse the Library of Congress `lcnaf.both.ndjson` (or `lcnaf.both.ndjson.zip`) file and output CSV-encoded subject heading ID and (English) label data.

Usage:
	 ./bin/parse-lcnaf lcnaf.both.ndjson.zip

For example:

$> ./bin/parse-lcnaf ~/Downloads/lcnaf.both.ndjson.zip > lcnaf.csv

Time passes...
More time passes...
Time keeps on slipping slipping in to the future...

$> wc -l lcnaf.csv
 11024368 lcnaf.csv

$> cat lcnaf.csv
id,label
n90699999,"Birkan, Kaarin"
n85299999,"Devorin, Lonyah"
no2007099999,"Graham, Sean"
n94099999,Tampa Joe
n98099999,"McGoggan, Graham"
n79099999,"Brockmann, Lester C."
no2018099999,"Neefe, Christian Gottlob, 1748-1798. Veränderungen über den Priestermarsch aus Mozarts Zauberflöte"
n2003099999,"Halstenberg, Friedrich"
no2019099999,"Colling, Anton"
n88299999,"Herring, Jackson R."
... and so on

It is also possible to parse LCNAF data directly from the LoC servers. For example:

$> bin/parse-lcnaf https://id.loc.gov/download/lcnaf.both.ndjson.zip

Notes

  • Persons with empty labels are ignored.
  • This tool will work with both the compressed and uncompressed versions of lcnaf.both.ndjson. Keep in mind that the compressed file is already 7GB and expands to 55GB uncompressed.
  • This tool creates a temporary SQLite database (in the operating system's "temp" directory) to track duplicate records. This is necessary because tracking duplicate IDs in memory tends to cause out-of-memory errors. The temporary SQLite database is removed when the tool exits.

parse-lcsh

parse-lcsh is a command-line tool to parse the Library of Congress Subject Headings file (lcsh.both.ndjson) and output CSV-encoded subject heading ID and (English) label data.

$> ./bin/parse-lcsh -h
parse-lcsh is a command-line tool to parse the Library of Congress `lcsh.both.ndjson` file and out CSV-encoded subject heading ID and (English) label data. It can also be configured to include broader concepts for each heading as well as Wikidata and Worldcat concordances.

Usage:
	 ./bin/parse-lcsh [options] lcsh.both.ndjson

Valid options are:
  -include-all
    	If true will enable all the other -include-* flags
  -include-broader skos:broader
    	If present, include a comma-separated list of skos:broader pointers associated with each subject heading
  -include-concordances
    	If true will enable the -include-wikidata and -include-worldcat flags
  -include-wikidata
    	If present, include a Wikidata pointer associated with each subject heading
  -include-worldcat
    	If present, include a Worldcat pointer associated with each subject heading

For example:

$> ./bin/parse-lcsh /usr/local/data/loc/lcsh.both.ndjson | less
id,label
sh98007138,Sports tournaments
sh85133899,Tennis--Tournaments
sh85133890,Tennis
sh91004781,Federation Cup
sh99005024,History
sh2009114899,Anarchism--Italy--History--20th century
sh2002012476,20th century
sh85004812,Anarchism
sh2008122899,Kitchens--Planning
sh85072576,Kitchens
sh2002006228,Planning
sh88001899,"Humorous poetry, Russian"
sh85116005,Russian poetry
sh85116022,Russian wit and humor
sh2008123899,Integrated circuits--Amateurs' manuals
sh99001292,Amateurs' manuals
sh85067117,Integrated circuits
sh85065604,Indians of South America--Ecuador--Antiquities
sh85040894,Ecuador--Antiquities
sh2005006899,Valdivian culture
... and so on

Or, to include additional metadata (broader concepts and concordances):

$> bin/parse-lcsh -include-all /usr/local/data/loc/lcsh.both.ndjson > lcsh.csv
$> grep Q3362749 ./lcsh.csv
sh85097529,Papabuco language,"sh85149668,sh85084601",Q3362749,1052283

It is also possible to parse LCSH data directly from the LoC servers. For example:

$> bin/parse-lcsh https://id.loc.gov/download/lcsh.both.ndjson.zip

Notes

  • Subject headings with empty labels are ignored.
  • This tool will work with both the compressed and uncompressed versions of lcsh.both.ndjson.

See also