This repository has been archived by the owner on Dec 19, 2018. It is now read-only.

culturegraph/solr-metamorph-entity-processor


solr-metamorph-entity-processor

An add-on for the Apache Solr Data Import Handler.

This project adds an entity processor that handles bibliographic records. It supports the metamorph DSL for data extraction.

Installation

Build Plugin

mvn package

This produces solr-metamorph-entity-processor-VERSION-jar-with-dependencies.jar in the target directory.

Integrate Plugin into Solr

The following steps assume a fresh Solr installation:

  • Download the latest version of Solr

  • Unzip the archive

  • The directory of your Solr installation is solr-VERSION (e.g. solr-7.4.0)

A list of Solr directories:

Name             Location                                 Example
SOLR_ROOT        Path to the unpacked Solr distribution   /srv/solr-7.4.0
SOLR_SERVER_DIR  SOLR_ROOT/server                         /srv/solr-7.4.0/server
SOLR_HOME        SOLR_ROOT/server/solr                    /srv/solr-7.4.0/server/solr

Create the directory SOLR_ROOT/lib:

mkdir -p SOLR_ROOT/lib

Include Metafacture Dependencies

mkdir -p SOLR_ROOT/lib/metafacture

Copy all Metafacture Module JARs into SOLR_ROOT/lib/metafacture and into SOLR_ROOT/server/solr-webapp/webapp/WEB-INF/lib.

# Run from the Metafacture library directory; set METAFACTURE_VERSION to the release you need
cd SOLR_ROOT/lib/metafacture

repo="http://central.maven.org/maven2/org/metafacture"
modules="metafacture-biblio metafacture-commons metafacture-flowcontrol metafacture-framework metafacture-io metafacture-mangling metamorph metamorph-api"
for module in $modules; do
  # Download each module JAR into the current directory
  wget -q "${repo}/${module}/${METAFACTURE_VERSION}/${module}-${METAFACTURE_VERSION}.jar"
done

Include Data Import Handler Add-On

mkdir -p SOLR_ROOT/lib/dih

Copy the latest release JAR into SOLR_ROOT/lib/dih.

Configure solrconfig.xml

Note
This assumes an existing core (you may use a default core). Edit the solrconfig.xml of your core.

Enable the Data Import Handler and the processor by adding the following lib statements to the solrconfig.xml of your config set:

  <!-- Data Import Handler -->
  <lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />

  <!-- Metafacture -->
  <lib dir="${solr.install.dir:../../../..}/lib/metafacture" regex="metafacture-.*\.jar" />

  <!-- Data Import Handler Add-Ons -->
  <lib dir="${solr.install.dir:../../../..}/lib/dih" regex="solr-metamorph-entity-processor-.*\.jar" />

Add the /dataimport request handler to the solrconfig.xml:

  <requestHandler name="/dataimport" class="solr.DataImportHandler">
    <lst name="defaults">
      <str name="config">solr-data-config.xml</str>
    </lst>
  </requestHandler>

Tip
An example solr-data-config.xml is located in example/solr-data-config.xml.

Metamorph Entity Processor

Note
Test data are located in example/testdata.mrc. The solr-data-config.xml expects them in /tmp.

The MetamorphEntityProcessor reads all content from the data source record by record. It can handle compressed input streams if the data source is a BinFileDataSource.

Each record is processed by a metafacture pipeline that uses metamorph to extract fields.

The Metamorph Entity Processor has the following attributes:

url

Required. An attribute that specifies the location of the input file in a way that is compatible with the configured data source.

format

Required. The format supplied by the data source.

Supported Formats

  • marc21

    • Records are pre-processed by replacing newline and carriage return characters with a space.

  • marcxml

    • Records are pre-processed by converting marcxml into marc21 and then applying the marc21 pre-processing (see above).

    • If includeFullRecord=true, the implicit field fullRecord contains the MARC21 representation of the record.
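
The marc21 pre-processing step listed above can be pictured with a minimal Java sketch. The class and method names are hypothetical and not taken from the project. It also assumes that every newline and every carriage return is replaced by its own space, which the description does not spell out.

```java
// Hypothetical illustration; not the project's actual pre-processing code.
class Marc21PreProcess {

    /** Replaces each newline and each carriage return with a single space. */
    static String normalize(String record) {
        return record.replace('\n', ' ').replace('\r', ' ');
    }

    public static void main(String[] args) {
        // A made-up record fragment that contains a line break.
        System.out.println(normalize("first line\nsecond line"));
    }
}
```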

morphDef

Required. The metamorph definition files that are used for field extraction. Each extracted field is added as an implicit field. If the input is a comma-separated list of files, the data are passed from one metamorph file to the next. These files are located inside the config set’s conf directory.

Make sure that your metamorph definition XML has the following properties:

  • The encoding of the file should be UTF-8

    • Validate the file encoding with a text editor

  • Check for control characters, if you use XML 1.0

    • Apart from tab, line feed, and carriage return, ASCII control characters are not legal in XML 1.0, not even as character references
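
Both checks in the list above can also be automated. The following is a minimal Java sketch, not part of the project. The class and method names are hypothetical, and the legal character ranges follow the Char production of the XML 1.0 specification.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not shipped with the entity processor.
class MorphDefCheck {

    /** Decodes the bytes strictly as UTF-8 and throws on malformed input. */
    static String decodeUtf8(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Returns the index of the first character not allowed in XML 1.0, or -1. */
    static int firstIllegalXml10Char(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (!legal) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        // In-memory sample with a trailing control character (U+0001).
        byte[] sample = "<data name=\"idn\" source=\"001\"/>\u0001"
                .getBytes(StandardCharsets.UTF_8);
        int index = firstIllegalXml10Char(decodeUtf8(sample));
        System.out.println(index < 0 ? "ok" : "illegal character at index " + index);
    }
}
```

To check a real definition file, read its bytes (for example with java.nio.file.Files.readAllBytes) and pass them to decodeUtf8 before scanning the result.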

includeFullRecord

Optional. If true, the received record is added to the implicit field fullRecord. The attribute is a boolean value (true or false) and defaults to false.

onError

Optional. By default, the MetamorphEntityProcessor stops processing documents as soon as one of them generates an error. If you set onError to "skip", the MetamorphEntityProcessor instead skips documents that fail processing and creates a debug message that contains the record and the cause of the failure.

For example:

<entity name="morph"
        processor="org.culturegraph.solr.handler.dataimport.MetamorphEntityProcessor"
        url="path/to/file.marc21"
        inputFormat="marc21"
        morphDef="morph.xml,morph2.xml"
        includeFullRecord="true"
        onError="skip">
  <field column="identifier" name="id"/>
  <field column="fullRecord" name="fullRecord_s"/>
</entity>

The metamorph definitions used in this example:

<?xml version="1.0" encoding="UTF-8"?>
<!-- morph.xml -->
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
    <rules>
        <data name="idn" source="001"/>
    </rules>
</metamorph>

<?xml version="1.0" encoding="UTF-8"?>
<!-- morph2.xml -->
<metamorph xmlns="http://www.culturegraph.org/metamorph" version="1">
    <rules>
        <data name="identifier" source="idn"/>
    </rules>
</metamorph>
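
To illustrate how the two definitions above chain, here is a deliberately simplified Java sketch. It is not the Metafacture API: each definition is modelled as a plain source-to-name mapping, records are single-valued maps, the value "12345" for field 001 is made up, and all class and method names are hypothetical. It assumes that a definition only emits fields matched by one of its data rules.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical model of chained metamorph definitions; not the Metafacture API.
class MorphChain {

    /** Stand-in for one definition: every entry mimics <data name=VALUE source=KEY/>. */
    static UnaryOperator<Map<String, String>> rules(Map<String, String> sourceToName) {
        return in -> {
            Map<String, String> out = new LinkedHashMap<>();
            sourceToName.forEach((source, name) -> {
                // Only fields matched by a rule are emitted.
                if (in.containsKey(source)) {
                    out.put(name, in.get(source));
                }
            });
            return out;
        };
    }

    /** Passes the record through each definition in order. */
    static Map<String, String> run(Map<String, String> record,
                                   List<UnaryOperator<Map<String, String>>> chain) {
        Map<String, String> current = record;
        for (UnaryOperator<Map<String, String>> step : chain) {
            current = step.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> record = Collections.singletonMap("001", "12345");
        Map<String, String> row = run(record, Arrays.asList(
                rules(Collections.singletonMap("001", "idn")),          // morph.xml
                rules(Collections.singletonMap("idn", "identifier")))); // morph2.xml
        System.out.println(row); // {identifier=12345}
    }
}
```

The resulting identifier field is the column that the entity configuration maps to the Solr field id.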

Import

Run a full-import:

curl -s "http://localhost:1111/solr/demo/dataimport?command=full-import"

Check status:

curl -s "http://localhost:1111/solr/demo/dataimport?command=status"

Commit:

curl -s "http://localhost:1111/solr/demo/update?commit=true"

Note
The admin UI provides a Dataimport screen.

Appendix

Metamorph IR To Row Conversion

A record processed by metamorph is transformed into an intermediate representation (IR) that consists of the following elements:

  • Record

  • Entity

  • Literal

A row processed by Solr is a map that consists of key-value or key-list pairs.

IR
startRecord("001")
literal("date", "20181001")
startEntity("person")
literal("lastname", "Unknown")
endEntity()
literal("cat", "human")
literal("cat", "person")
endRecord()
Row (represented as JSON)
{
  "cat": ["human", "person"],
  "date": "20181001",
  "personLastname": "Unknown"
}

The following rules are applied to convert an IR to a Row:

  • The record id is ignored

  • Literals with the same name form a list

  • Literal names in entities are prefixed with the entity name, and the combined name is written in camelCase (lastname inside person becomes personLastname)
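
The rules above can be sketched in Java as follows. This is an illustration, not the project's actual implementation: the class name RowBuilder is hypothetical, and for nested entities it assumes that only the innermost entity name is used as prefix, a case the rules do not cover.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the IR-to-Row rules; not the project's own class.
class RowBuilder {
    private final Map<String, Object> row = new LinkedHashMap<>();
    private final Deque<String> entities = new ArrayDeque<>();

    void startRecord(String id) {
        // Rule 1: the record id is ignored.
        row.clear();
        entities.clear();
    }

    void startEntity(String name) {
        entities.push(name);
    }

    void endEntity() {
        entities.pop();
    }

    @SuppressWarnings("unchecked")
    void literal(String name, String value) {
        // Rule 3: prefix with the (innermost) entity name in camelCase.
        String key = entities.isEmpty() ? name : entities.peek() + capitalize(name);
        Object existing = row.get(key);
        if (existing == null) {
            row.put(key, value);
        } else if (existing instanceof List) {
            ((List<String>) existing).add(value);
        } else {
            // Rule 2: literals with the same name form a list.
            List<String> values = new ArrayList<>();
            values.add((String) existing);
            values.add(value);
            row.put(key, values);
        }
    }

    Map<String, Object> endRecord() {
        return row;
    }

    private static String capitalize(String s) {
        return s.isEmpty() ? s : Character.toUpperCase(s.charAt(0)) + s.substring(1);
    }

    public static void main(String[] args) {
        RowBuilder builder = new RowBuilder();
        builder.startRecord("001");
        builder.literal("date", "20181001");
        builder.startEntity("person");
        builder.literal("lastname", "Unknown");
        builder.endEntity();
        builder.literal("cat", "human");
        builder.literal("cat", "person");
        // Prints {date=20181001, personLastname=Unknown, cat=[human, person]}
        System.out.println(builder.endRecord());
    }
}
```

The sketch keeps keys in insertion order, so the printed order differs from the JSON example even though the content is the same.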
