
# Data Transformation Pipeline Code

## Architecture

[architecture diagram]

## Pipeline Components

**Future ITS-Owned Components:**

- Harvester: ActivityStreams-based record harvester that stores records to a Cache.
- Cache: Record cache for storing local copies of JSON data. Currently Postgres or the filesystem.
- IdMap: Identifier map that mints, manages, and retrieves external/internal identifier sets. Currently Redis or in-memory. (A minimal sketch of the Cache and IdMap interfaces follows this list.)
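
To make the roles of these two components concrete, here is a minimal sketch of what a filesystem-backed Cache and an in-memory IdMap could look like. The class and method names are illustrative, not the repository's actual API:

```python
# Hypothetical sketch of the Cache and IdMap abstractions described above;
# the real classes in this repository may differ.
import json
from pathlib import Path


class FilesystemCache:
    """Stores local copies of JSON records, one file per identifier.

    Assumes identifiers are filesystem-safe; a real cache would need to
    escape or hash identifiers containing characters like "/".
    """

    def __init__(self, directory: str):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def set(self, identifier: str, record: dict) -> None:
        (self.directory / f"{identifier}.json").write_text(json.dumps(record))

    def get(self, identifier: str) -> dict | None:
        path = self.directory / f"{identifier}.json"
        return json.loads(path.read_text()) if path.exists() else None


class InMemoryIdMap:
    """Mints internal identifiers and maps external URIs onto them."""

    def __init__(self):
        self.external_to_internal: dict[str, str] = {}
        self.next_id = 0

    def get_or_mint(self, external_uri: str) -> str:
        # Return the existing internal identifier, or mint a new one.
        if external_uri not in self.external_to_internal:
            self.external_to_internal[external_uri] = f"internal:{self.next_id}"
            self.next_id += 1
        return self.external_to_internal[external_uri]
```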

**Pipeline Components:**

- Config: Configuration as JSON records in a Cache.
- Collector: Recursively collects the identifiers referenced by a given record.
- Merger: Merges two Linked Art records that represent the same entity.
- Reidentifier: Recursively rewrites external URIs in a record to internal identifiers, given an IdMap (sketched after this list).
- Sources/*/Fetcher: Fetches an identified record from an external source into a Cache.
- Sources/*/Mapper: Maps records from an external source into Linked Art.
- Sources/*/Reconciler: Determines whether the entity in a given record is described in the external source.
- Sources/*/Loader: Loads a dump of the source data into the data cache.
- Sources/*/IndexLoader: Creates an inverted index used to reconcile records against this dataset.
- MarkLogic: Transforms records from Linked Art into the MarkLogic internal format.
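
The Reidentifier's recursive rewrite is the heart of moving records from external to internal identifiers. Below is a hedged sketch of the idea, assuming Linked Art JSON where URIs appear as the values of `id` keys and an IdMap exposing a `get_or_mint()` method as in the sketch above; the real implementation may differ:

```python
# Hypothetical sketch of the Reidentifier's recursive rewrite; illustrative
# only. Assumes external URIs appear as the string values of "id" keys.
def reidentify(node, id_map):
    """Return a copy of node with every "id" URI rewritten via the IdMap."""
    if isinstance(node, dict):
        return {
            key: (id_map.get_or_mint(value)
                  if key == "id" and isinstance(value, str)
                  else reidentify(value, id_map))
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [reidentify(item, id_map) for item in node]
    return node  # strings, numbers, booleans, None pass through unchanged
```

Calling `reidentify(record, id_map)` returns a new record; nested dicts and lists are walked all the way down, so references anywhere in the structure are rewritten.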

## External Sources: Implementation Status

| Source         | Fetch | Map | Reconcile | Load | IdxLoad |
| -------------- | ----- | --- | --------- | ---- | ------- |
| AAT            | ✅    | ✅  | N/A       | N/A  | -       |
| DNB            | ✅    | ✅  | ✅        | -    | -       |
| FAST           | ✅    | -   | -         | -    | -       |
| Geonames       | ✅    | ✅  | -         | N/A  | -       |
| LCNAF          | ✅    | ✅  | -         | -    | -       |
| LCSH           | ✅    | ✅  | ✅        | ✅   | -       |
| TGN            | ✅    | ✅  | -         | N/A  | -       |
| ULAN           | ✅    | ✅  | -         | N/A  | -       |
| VIAF           | ✅    | ✅  | -         | -    | -       |
| Who's on First | ✅    | ✅  | -         | N/A  | -       |
| Wikidata       | ✅    | ✅  | ✅        | ✅   | ✅      |
| Japan NL       | ✅    | ✅  | -         | N/A  | -       |

✅ = Seems to work ; - = Not started ; N/A = Can't be done

- AAT, TGN, ULAN: Dump files are NTriples-based; reconstructing records from them would be more effort than it's worth.
- Geonames: Dump file is a CSV that lacks some of the information we need (e.g. no language of names).
- WOF: Dump file is a 33 GB SQLite database... if it were useful, we could just use it as the cache.
- LCNAF: Doesn't have the real-world object data we want, though it is useful for reconciliation (sketched below).
- VIAF: Too much data for not enough value.
- FAST: Just not implemented yet (needs to process MARC/XML).
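
As the LCNAF note suggests, reconciliation is where several of these sources earn their keep even without a full load. Here is an illustrative sketch of the inverted-index approach implied by the IndexLoader/Reconciler pair: the index maps normalized name strings to record identifiers, and reconciliation becomes a lookup. All names here are hypothetical, not the repository's actual code:

```python
# Hypothetical sketch of inverted-index reconciliation; illustrative only.
from collections import defaultdict


def normalize(name: str) -> str:
    """Collapse case and whitespace so near-identical names compare equal."""
    return " ".join(name.lower().split())


def build_index(records: dict[str, list[str]]) -> dict[str, set[str]]:
    """records maps identifier -> names; returns normalized name -> identifiers."""
    index: dict[str, set[str]] = defaultdict(set)
    for identifier, names in records.items():
        for name in names:
            index[normalize(name)].add(identifier)
    return index


def reconcile(name: str, index: dict[str, set[str]]) -> set[str]:
    """Return candidate identifiers whose names match exactly after normalization."""
    return index.get(normalize(name), set())
```

Exact-match lookup is deliberately simple here; a real reconciler would likely layer fuzzier matching and disambiguation on top.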

## Fetching external source dump files

Process:

1. In the config file, look up `dumpFilePath` and `remoteDumpFile`.
2. Go to the directory containing `dumpFilePath` and rename the existing dump file with a date (e.g. `latest-2022-07`).
3. Execute `wget <url>`, where `<url>` is the URL from `remoteDumpFile` (and probably validate it by hand online first).
4. For Wikidata, as it's SO HUGE, instead do `nohup wget --quiet <url> &` to fetch it in the background so we can get on with our lives in the meantime.
5. Done :) (A script automating these steps is sketched below.)
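
For sources with modest dump sizes, the steps above could be scripted. The sketch below assumes the config is a JSON file containing the `dumpFilePath` and `remoteDumpFile` keys from step 1; everything else (function name, date format) is illustrative:

```python
# Hypothetical sketch automating the dump-fetch process; the key names
# follow step 1 above, everything else is illustrative. For very large
# dumps (e.g. Wikidata) you would still prefer: nohup wget --quiet <url> &
import json
import urllib.request
from datetime import date
from pathlib import Path


def fetch_dump(config_path: str) -> None:
    config = json.loads(Path(config_path).read_text())
    dump_path = Path(config["dumpFilePath"])
    remote_url = config["remoteDumpFile"]

    # Step 2: archive any existing dump with a date suffix, e.g. latest-2022-07
    if dump_path.exists():
        stamp = date.today().strftime("%Y-%m")
        archived = dump_path.with_name(f"{dump_path.stem}-{stamp}{dump_path.suffix}")
        dump_path.rename(archived)

    # Step 3: download the new dump (validate the URL by hand online first!)
    urllib.request.urlretrieve(remote_url, dump_path)
```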

## About

Data pipeline to harvest, transform, reconcile, enrich, and export Linked Art data for LUX (or other systems).
