Skip to content
Felix Lohmeier edited this page Jan 29, 2021 · 15 revisions

A demonstration on how to use export templating to produce MARCXML (MARC21 XML Schema).

Introduction

For this tutorial, we are going to be working with a set of data from the Library Carpentry Lesson that looks like this:

Title Authors DOI ...
The Fisher Thermodynamics of Quasi-Probabilities Flavia Pennini|Angelo Plastino 10.3390/e17127853 ...
Aflatoxin Contamination of the Milk Supply: A Pakistan Perspective Naveed Aslam|Peter C. Wynn 10.3390/agriculture5041172 ...
... ... ... ...

Note that multiple Authors are divided by | in this data set.

Getting Started

You can follow along by pointing OpenRefine at the example data:

  • Go to http://localhost:3333/ in your web browser
  • Begin a new project with https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv as the URL.
  • Click on Export > Templating...

Option 1: Rows mode

MARCXML technically requires a leader and a 245 (title) field. To represent multiple records in an XML file, the <collection> element is used (cf. MARCXML example of a collection).

Changes to make

Prefix Field

<?xml version="1.0" encoding="utf-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">

Optionally followed by a blank line to get cleanly formatted xml.

Row Template Field

<record>
<leader>     nab a22     uu 4500</leader>
<controlfield tag="001">{{cells['URL'].value.replace('https://doaj.org/article/','').escape('xml')}}</controlfield>
<datafield tag="100" ind1="0" ind2=" ">
	<subfield code="a">{{cells['Authors'].value.split('|')[0].escape('xml')}}</subfield>
</datafield>
<datafield tag="245" ind1="0" ind2="0">
	<subfield code="a">{{cells['Title'].value.escape('xml')}}</subfield>
</datafield>{{
forEach(cells['Authors'].value.split('|').slice(1), v ,'
<datafield tag="700" ind1="0" ind2=" ">
	<subfield code="a">' + v.escape('xml') + '</subfield>
</datafield>').join('')}}
</record>

Some points here:

  • All values will be transformed with .escape('xml') to handle characters that are not allowed in XML.
  • The leader is a special field where the data elements are positionally defined.
  • Field 001 shall contain "the control number assigned by the organization". There are 23 rows without an DOI in the dataset, so we will use the seemingly unique part of the URL column by replacing the first part with blank .replace('https://doaj.org/article/','').
  • Field 100 shall contain "the Personal name used as a main entry in a bibliographic record" in subfield a. We assume here that this is the first name in the Authors column and select it with .split('|')[0].
  • Remaining authors may be provided in (repeatable) Field 700, subfield a. We need a forEach loop here combined with .slice(1) to select all remaining authors not provided in field 100 already.

If columns contain empty cells, then we must use forNonBlank(), otherwise we would get elements with "Null" in the output. Example for adding the contents of the DOI column (which contains 23 blank cells) to the template:

{{
forNonBlank(
	cells['DOI'].value,
	v,
	'<datafield tag="024" ind1="7" ind2=" ">
		<subfield code="a">' + v.escape('xml') + '</subfield>
		<subfield code="2">doi</subfield>
	</datafield>',
	''
)
}}

Row Separator Field

Change the row separator from the default (comma (,) and line break) to just a line break. Confirm that there's a line break between </record> and <record> elements in the preview window.

Suffix Field

</collection>

Optionally with preceding blank line to get cleanly formatted xml.

Option 2: Records mode

If your project is in records mode, the "Row separator" field will insert a separator between records, rather than individual rows (the field "Row separator" should eventually be renamed to Separator to clarify this behavior).

Note that there is no such thing as a "Record template" field so you still need to edit the "Row template" field and use GREL controls (if, forNonBlank, ...) e.g. to process only the first row of a record.

Data preparation

First we need to split multi-valued cells in the Authors column to produce some kind of record structure:

  • Authors > Edit cells > Split multi-valued cells... > by separator > |

Now the data should look like this:

Title Authors DOI ...
The Fisher Thermodynamics of Quasi-Probabilities Flavia Pennini 10.3390/e17127853 ...
Angelo Plastino ...
Aflatoxin Contamination of the Milk Supply: A Pakistan Perspective Naveed Aslam 10.3390/agriculture5041172 ...
Peter C. Wynn ...
... ... ... ...

Templating Export

OpenRefine provides a handy row.record.cells[columnName].value to access all values of a column within a record. Applying this expression to the Authors column (Authors > Edit cells > Transform...), we get the following preview:

row value row.record.cells[columnName].value
1. Flavia Pennini [ "Flavia Pennini", "Angelo Plastino" ]
2. Angelo Plastino [ "Flavia Pennini", "Angelo Plastino" ]
3. Naveed Aslam [ "Naveed Aslam", "Peter C. Wynn" ]
4. Peter C. Wynn [ "Naveed Aslam", "Peter C. Wynn" ]

Note that blank or null values will be skipped silently. If there are dependencies between columns (e.g. name in column A and identifier for that name in column B) then you need a different approach (see Option 3: Advanced records mode using cross below).

To process the resulting array, we can use a forEach function. And to process each record only once, we can use an if function to check if it is the first line of a record and otherwise output nothing. Example:

if(
   row.index - row.record.fromRowIndex == 0,
   forEach(row.record.cells[columnName].value, v, v).join('\n'),
   ''
)
row value if(row.index - row.record.from ...
1. Flavia Pennini Flavia Pennini
Angelo Plastino
2. Angelo Plastino
3. Naveed Aslam Naveed Aslam
Peter C. Wynn
4. Peter C. Wynn

So that we can use these functions when templating, we put the entire template in curly braces and also move the hard coded strings into the if function. A sample solution follows.

Prefix Field

<?xml version="1.0" encoding="utf-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">

Optionally followed by a blank line to get cleanly formatted xml.

Row Template Field

{{
if(row.index - row.record.fromRowIndex == 0,
'<record>' + '\n' +
'<leader>     nab a22     uu 4500</leader>' + '\n' +
forNonBlank(cells['URL'].value, v,
'<controlfield tag="001">' + v.replace('https://doaj.org/article/','').escape('xml') + '</controlfield>' + '\n', ''
) +
forNonBlank(row.record.cells['Authors'].value[0], v,
'<datafield tag="100" ind1="0" ind2=" ">
	<subfield code="a">' + v.escape('xml') + '</subfield>
</datafield>' + '\n', ''
) +
forNonBlank(cells['Title'].value, v,
'<datafield tag="245" ind1="0" ind2="0">
	<subfield code="a">' + v.escape('xml') + '</subfield>
</datafield>' + '\n', ''
) +
forEach(row.record.cells['Authors'].value.slice(1), v,
'<datafield tag="700" ind1="0" ind2=" ">
	<subfield code="a">' + v.escape('xml') + '</subfield>
</datafield>' + '\n'
).join('') +
'</record>' + '\n'
, '')
}}

Some points here:

  • If a column contains a value only in the first row of a record, then cells[columnName].value or row.record.cells[columnName].value[0] can be used equally.
  • To prevent errors caused by empty cells, the function forNonBlank is always used here as a precaution for string values. This is not needed for the function forEach, because an empty array will not lead to an error.
  • Be careful when concatenating many strings and functions. The templating dialog currently (feature request) has no syntax checking (in case of error it shows Updating... endlessly), so it is easier to compose the expression in the Edit cells > Transform... dialog (without curly brackets).
  • Line breaks can be generated either with multi-line strings or explicitly with '\n'. We add a line break after closing element so that we can omit the "Row Separator"

Row Separator Field

Delete the default (comma (,) and line break) so that the field is empty.

Suffix Field

</collection>

Option 3: Advanced records mode using cross

After importing hierarchically structured data or after reconciliation there are often dependencies between columns (e.g. name in column A and identifier in column B). Function row.record.cells[columnName].value does not help us here if a column contains any empty cells.

record name id row.record.cells['id'].value row.record.cells['id'].value[row.index - row.record.fromRowIndex]
1 Fabrizio 1 [ "1", "3" ] 1
Hao [ "1", "3" ] 3
Marie 3 [ "1", "3" ] Error: ArrayIndexOutOfBoundsException: 2
2 Naveed 4 [ "4" ] 4

There are several solutions to this kind of problem:

  1. One solution would be to add placeholder characters to the empty cells and remove/skip them later in the template.
  2. Another solution would be to join interdependent columns with a separator and split them up again in the template.
  3. For a programmatic approach, we can use the function cross, which can determine all associated rows of a record via a unique identifier.

We will follow the third approach here.

Data preparation

Continuing our example project we should first mockup some identifiers for illustration:

  • Authors > Edit column > Add column based on this column...
    • New column name: Identifier
    • Expression: if(row.index == even(row.index), value.md5()[0,10], '')

Then we need to create a column with unique record identifiers in each row of a record. The title column does not seem sufficiently unique. Normally, we could copy an existing column of identifiers and use fill down, but in our data set not all records have a DOI and there is a duplicate in the URL column. So we create a column based on the record index:

  • Title > Edit column > Add column based on this column...
    • New column name: index
    • Expression: row.record.index.toString()

Now the data should look like this:

Title index Authors Identifier ...
The Fisher Thermodynamics of Quasi-Probabilities 0 Flavia Pennini 65696e3b3b ...
0 Angelo Plastino ...
Aflatoxin Contamination of the Milk Supply: A Pakistan Perspective 1 Naveed Aslam fbb869c439 ...
1 Peter C. Wynn ...
Metagenomic Analysis of Upwelling-Affected Brazilian Coastal Seawater Reveals Sequence Domains of Type I PKS and Modular NRPS 2 Rafael R. C. Cuadrat 9b8e7077df ...
2 Juliano C. Cury ...
2 Alberto M. R. Dávila c9c596470c ...
... ... ... ... ...

Templating Export

We will use the function cross to get an array of rows to iterate over. This way we can address rows of a record specifically. We need to tell cross the column name and the project name. Here is an example for the project name doaj article sample csv:

with(
	cross(cells['index'].value, 'doaj article sample csv' , 'index'),
	rows,
	rows
)

Applying this expression we get the following preview:

row value with(cross(cells['index'].valu ...
1. 0 [ [object Row], [object Row] ]
2. 0 [ [object Row], [object Row] ]
3. 1 [ [object Row], [object Row] ]
4. 1 [ [object Row], [object Row] ]
5. 2 [ [object Row], [object Row], [object Row] ]
6. 2 [ [object Row], [object Row], [object Row] ]
7. 2 [ [object Row], [object Row], [object Row] ]

We can then address values of columns in specific rows by iterating over the rows array with forEach. For example:

with(
	cross(cells['index'].value, 'doaj article sample csv' , 'index'),
	rows,
	forEach(
		rows,
		r,
		r.cells['Authors'].value + forNonBlank(r.cells['Identifier'].value, v, ' | ' + v, '')
		)
)

Applying this expression we get:

row value with( cross(cells['index'] ...
1. 0 [ "Flavia Pennini | 65696e3b3b", "Angelo Plastino" ]
2. 0 [ "Flavia Pennini | 65696e3b3b", "Angelo Plastino" ]
3. 1 [ "Naveed Aslam | fbb869c439", "Peter C. Wynn" ]
4. 1 [ "Naveed Aslam | fbb869c439", "Peter C. Wynn" ]
5. 2 [ "Rafael R. C. Cuadrat | 9b8e7077df", "Juliano C. Cury", "Alberto M. R. Dávila | c9c596470c" ]
6. 2 [ "Rafael R. C. Cuadrat | 9b8e7077df", "Juliano C. Cury", "Alberto M. R. Dávila | c9c596470c" ]
7. 2 [ "Rafael R. C. Cuadrat | 9b8e7077df", "Juliano C. Cury", "Alberto M. R. Dávila | c9c596470c" ]

Names and Identifiers seems to match now. We will now use this technique to generate MARCXML.

Prefix Field

<?xml version="1.0" encoding="utf-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">

Optionally followed by a blank line to get cleanly formatted xml.

Row Template Field

{{
if(row.index - row.record.fromRowIndex == 0,
with(cross(cells['index'].value, 'doaj article sample csv' , 'index'), rows,
'<record>' + '\n' +
'<leader>     nab a22     uu 4500</leader>' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['URL'].value)).slice(0,1), r,
'<controlfield tag="001">' + r.cells['URL'].value.replace('https://doaj.org/article/','').escape('xml') + '</controlfield>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['Authors'].value)).slice(0,1), r,
'<datafield tag="100" ind1="0" ind2=" ">' + '\n' +
'	<subfield code="a">' + r.cells['Authors'].value.escape('xml') + '</subfield>' + '\n' +
	forNonBlank(r.cells['Identifier'].value, v,
'	<subfield code="0">' + v.escape('xml') + '</subfield>' + '\n', ''
	) +
'</datafield>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['Title'].value)).slice(0,1), r,
'<datafield tag="245" ind1="0" ind2="0">' + '\n' +
'	<subfield code="a">' + r.cells['Title'].value.escape('xml') + '</subfield>' + '\n' +
'</datafield>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['Authors'].value)).slice(1), r,
'<datafield tag="700" ind1="0" ind2=" ">' + '\n' + 
'	<subfield code="a">' + r.cells['Authors'].value.escape('xml') + '</subfield>' + '\n' +
	forNonBlank(r.cells['Identifier'].value, v,
'	<subfield code="0">' + v.escape('xml') + '</subfield>' + '\n', ''
	) +
'</datafield>' + '\n'
).join('') +
'</record>' + '\n'
), '')
}}

Some points here:

  • The variable rows is defined in line 3 and contains an array of rows generated by cross. The project name currently needs to be given to cross as a string, e.g. 'doaj article sample csv'. OpenRefine 3.5 will allow an empty string to address the current project.
  • The filter function is used to select only the rows with existing values in a column. To ensure that only one row is used for non-repeatable fields (even if two rows should erroneously contain values in the data), slice(0,1) is used to explicitly select the first row.
  • Identifiers were added to Field 100, subfield 0 and Field 700, subfield 0.

Row Separator Field

Delete the default (comma (,) and line break) so that the field is empty.

Suffix Field

</collection>
Clone this wiki locally