bib-rdf-pipeline

This repository contains various scripts and configuration for converting MARC bibliographic records into RDF, for use at the National Library of Finland.

The main component is a conversion pipeline driven by a Makefile that defines rules for realizing the conversion steps using command line tools.

The steps of the conversion are:

1. Start with a file of MARC records in Aleph sequential format
2. Split the file into smaller batches
3. Preprocess using unix tools such as grep and sed, to remove some local peculiarities
4. Convert to MARCXML and enrich the MARC records, using Catmandu
5. Run the Library of Congress marc2bibframe2 XSLT conversion from MARC to BIBFRAME RDF
6. Convert the BIBFRAME RDF/XML data into N-Triples format and fix up some bad URIs
7. Calculate work keys (e.g. author+title combination) used later for merging data about the same creative work
8. Convert the BIBFRAME data into Schema.org RDF in N-Triples format
9. Reconcile entities in the Schema.org data against external sources (e.g. YSA/YSO, Corporate names authority, RDA vocabularies)
10. Merge the Schema.org data about the same works
11. Calculate agent keys used for merging data about the same agent (person or organization)
12. Merge the agents based on agent keys
13. Convert the raw Schema.org data to HDT format so the full data set can be queried with SPARQL from the command line
14. Consolidate the data by e.g. rewriting URIs and moving subjects into the original work
15. Convert the consolidated data to HDT
16. ??? (TBD)
17. Profit!
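
The work key step above can be illustrated with a small sketch. This is not the pipeline's actual implementation: the only detail given here is that a key is something like an author+title combination, so the normalization rules (lowercasing, dropping diacritics and punctuation) and the names normalize, work_key and group_by_work_key are assumptions made for illustration. Note that folding "ä" to "a" may be too aggressive for Finnish data in practice.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace.

    Assumed normalization for illustration only.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_marks.lower())
    return " ".join(cleaned.split())


def work_key(author: str, title: str) -> str:
    """Build a hypothetical author+title key for one record."""
    return f"{normalize(author)}/{normalize(title)}"


def group_by_work_key(records):
    """Group (record_id, author, title) tuples that share a work key."""
    groups = {}
    for record_id, author, title in records:
        groups.setdefault(work_key(author, title), []).append(record_id)
    return groups
```

Records whose author and title differ only in case, punctuation or spacing end up under the same key, which is the property a later merge step would rely on.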

Dependencies

Command line tools are assumed to be available in $PATH, but the paths can be overridden on the make command line, e.g. make CATMANDU=/opt/catmandu

For running the main suite

Apache Jena command line utilities sparql and rsparql
Catmandu utility catmandu
uconv utility from Ubuntu package icu-devtools
xsltproc utility from Ubuntu package xsltproc
hdt-cpp command line utilities rdf2hdt and hdtSearch
hdt-java command line utility hdtsparql.sh

For running the unit tests

In addition to the above:

bats in $PATH
xmllint utility from Ubuntu package libxml2-utils in $PATH
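
Before running make, it can help to check that the tools listed above are actually available. The following is a minimal sketch using Python's shutil.which; the tool names come from the lists above, while the function missing_tools and its overrides parameter (a dict mapping a tool name to an explicit path, mirroring an override such as CATMANDU=/opt/catmandu) are hypothetical helpers and not part of this repository.

```python
import shutil

# Tool names taken from the dependency lists above.
MAIN_TOOLS = ["sparql", "rsparql", "catmandu", "uconv", "xsltproc",
              "rdf2hdt", "hdtSearch", "hdtsparql.sh"]
TEST_TOOLS = ["bats", "xmllint"]


def missing_tools(tools, overrides=None, path=None):
    """Return the tools that cannot be found as executables.

    overrides: optional dict of tool name -> explicit path (hypothetical).
    path: optional search path string; defaults to the $PATH environment variable.
    """
    overrides = overrides or {}
    missing = []
    for tool in tools:
        candidate = overrides.get(tool, tool)
        if shutil.which(candidate, path=path) is None:
            missing.append(tool)
    return missing
```

Calling missing_tools(MAIN_TOOLS + TEST_TOOLS) returns an empty list when everything is installed; any names it does return are the ones to install or to override on the make command line.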
