Export as MARCXML
A demonstration on how to use export templating to produce MARCXML (MARC21 XML Schema).
For this tutorial, we are going to be working with a set of data from the Library Carpentry Lesson that looks like this:
Title | Authors | DOI | ... |
---|---|---|---|
The Fisher Thermodynamics of Quasi-Probabilities | Flavia Pennini|Angelo Plastino | 10.3390/e17127853 | ... |
Aflatoxin Contamination of the Milk Supply: A Pakistan Perspective | Naveed Aslam|Peter C. Wynn | 10.3390/agriculture5041172 | ... |
... | ... | ... | ... |
Note that multiple Authors are divided by |
in this data set.
You can follow along by pointing OpenRefine at the example data:
- Go to http://localhost:3333/ in your web browser
- Begin a new project with
https://raw.githubusercontent.com/LibraryCarpentry/lc-open-refine/gh-pages/data/doaj-article-sample.csv
as the URL. - Click on
Export
>Templating...
MARCXML technically requires a leader and a 245 (title) field. To represent multiple records in an XML file, the <collection>
element is used (cf. MARCXML example of a collection).
<?xml version="1.0" encoding="utf-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
Optionally followed by a blank line to get cleanly formatted xml.
<record>
<leader> nab a22 uu 4500</leader>
<controlfield tag="001">{{cells['URL'].value.replace('https://doaj.org/article/','').escape('xml')}}</controlfield>
<datafield tag="100" ind1="0" ind2=" ">
<subfield code="a">{{cells['Authors'].value.split('|')[0].escape('xml')}}</subfield>
</datafield>
<datafield tag="245" ind1="0" ind2="0">
<subfield code="a">{{cells['Title'].value.escape('xml')}}</subfield>
</datafield>{{
forEach(cells['Authors'].value.split('|').slice(1), v ,'
<datafield tag="700" ind1="0" ind2=" ">
<subfield code="a">' + v.escape('xml') + '</subfield>
</datafield>').join('')}}
</record>
Some points here:
- All values will be transformed with
.escape('xml')
to handle characters that are not allowed in XML. - The leader is a special field where the data elements are positionally defined.
-
Field 001 shall contain "the control number assigned by the organization". There are 23 rows without an DOI in the dataset, so we will use the seemingly unique part of the URL column by replacing the first part with blank
.replace('https://doaj.org/article/','')
. -
Field 100 shall contain "the Personal name used as a main entry in a bibliographic record" in subfield a. We assume here that this is the first name in the Authors column and select it with
.split('|')[0]
. - Remaining authors may be provided in (repeatable) Field 700, subfield a. We need a
forEach
loop here combined with.slice(1)
to select all remaining authors not provided in field 100 already.
If columns contain empty cells, then we must use forNonBlank()
, otherwise we would get elements with "Null" in the output. Example for adding the contents of the DOI column (which contains 23 blank cells) to the template:
{{
forNonBlank(
cells['DOI'].value,
v,
'<datafield tag="024" ind1="7" ind2=" ">
<subfield code="a">' + v.escape('xml') + '</subfield>
<subfield code="2">doi</subfield>
</datafield>',
''
)
}}
Change the row separator from the default (comma (,) and line break) to just a line break. Confirm that there's a line break between </record>
and <record>
elements in the preview window.
</collection>
Optionally with preceding blank line to get cleanly formatted xml.
If your project is in records mode, the "Row separator" field will insert a separator between records, rather than individual rows (the field "Row separator" should eventually be renamed to Separator to clarify this behavior).
Note that there is no such thing as a "Record template" field so you still need to edit the "Row template" field and use GREL controls (if
, forNonBlank
, ...) e.g. to process only the first row of a record.
First we need to split multi-valued cells in the Authors column to produce some kind of record structure:
- Authors > Edit cells > Split multi-valued cells... > by separator >
|
Now the data should look like this:
Title | Authors | DOI | ... |
---|---|---|---|
The Fisher Thermodynamics of Quasi-Probabilities | Flavia Pennini | 10.3390/e17127853 | ... |
Angelo Plastino | ... | ||
Aflatoxin Contamination of the Milk Supply: A Pakistan Perspective | Naveed Aslam | 10.3390/agriculture5041172 | ... |
Peter C. Wynn | ... | ||
... | ... | ... | ... |
OpenRefine provides a handy row.record.cells[columnName].value
to access all values of a column within a record. Applying this expression to the Authors column (Authors
> Edit cells
> Transform...
), we get the following preview:
row | value | row.record.cells[columnName].value |
---|---|---|
1. | Flavia Pennini | [ "Flavia Pennini", "Angelo Plastino" ] |
2. | Angelo Plastino | [ "Flavia Pennini", "Angelo Plastino" ] |
3. | Naveed Aslam | [ "Naveed Aslam", "Peter C. Wynn" ] |
4. | Peter C. Wynn | [ "Naveed Aslam", "Peter C. Wynn" ] |
Note that blank or null values will be skipped silently. If there are dependencies between columns (e.g. name in column A and identifier for that name in column B) then you need a different approach (see Option 3: Advanced records mode using cross below).
To process the resulting array, we can use a forEach
function. And to process each record only once, we can use an if function to check if it is the first line of a record and otherwise output nothing. Example:
if(
row.index - row.record.fromRowIndex == 0,
forEach(row.record.cells[columnName].value, v, v).join('\n'),
''
)
row | value | if(row.index - row.record.from ... |
---|---|---|
1. | Flavia Pennini | Flavia Pennini Angelo Plastino |
2. | Angelo Plastino | |
3. | Naveed Aslam | Naveed Aslam Peter C. Wynn |
4. | Peter C. Wynn |
So that we can use these functions when templating, we put the entire template in curly braces and also move the hard coded strings into the if
function. A sample solution follows.
<?xml version="1.0" encoding="utf-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
Optionally followed by a blank line to get cleanly formatted xml.
{{
if(row.index - row.record.fromRowIndex == 0,
'<record>' + '\n' +
'<leader> nab a22 uu 4500</leader>' + '\n' +
forNonBlank(cells['URL'].value, v,
'<controlfield tag="001">' + v.replace('https://doaj.org/article/','').escape('xml') + '</controlfield>' + '\n', ''
) +
forNonBlank(row.record.cells['Authors'].value[0], v,
'<datafield tag="100" ind1="0" ind2=" ">
<subfield code="a">' + v.escape('xml') + '</subfield>
</datafield>' + '\n', ''
) +
forNonBlank(cells['Title'].value, v,
'<datafield tag="245" ind1="0" ind2="0">
<subfield code="a">' + v.escape('xml') + '</subfield>
</datafield>' + '\n', ''
) +
forEach(row.record.cells['Authors'].value.slice(1), v,
'<datafield tag="700" ind1="0" ind2=" ">
<subfield code="a">' + v.escape('xml') + '</subfield>
</datafield>' + '\n'
).join('') +
'</record>' + '\n'
, '')
}}
Some points here:
- If a column contains a value only in the first row of a record, then
cells[columnName].value
orrow.record.cells[columnName].value[0]
can be used equally. - To prevent errors caused by empty cells, the function
forNonBlank
is always used here as a precaution for string values. This is not needed for the functionforEach
, because an empty array will not lead to an error. - Be careful when concatenating many strings and functions. The templating dialog currently (feature request) has no syntax checking (in case of error it shows
Updating...
endlessly), so it is easier to compose the expression in theEdit cells
>Transform...
dialog (without curly brackets). - Line breaks can be generated either with multi-line strings or explicitly with
'\n'
. We add a line break after closing element so that we can omit the "Row Separator"
Delete the default (comma (,) and line break) so that the field is empty.
</collection>
After importing hierarchically structured data or after reconciliation there are often dependencies between columns (e.g. name in column A and identifier in column B). Function row.record.cells[columnName].value
does not help us here if a column contains any empty cells.
record | name | id | row.record.cells['id'].value | row.record.cells['id'].value[row.index - row.record.fromRowIndex] |
---|---|---|---|---|
1 | Fabrizio | 1 | [ "1", "3" ] | 1 |
Hao | [ "1", "3" ] | 3 | ||
Marie | 3 | [ "1", "3" ] | Error: ArrayIndexOutOfBoundsException: 2 | |
2 | Naveed | 4 | [ "4" ] | 4 |
There are several solutions to this kind of problem:
- One solution would be to add placeholder characters to the empty cells and remove/skip them later in the template.
- Another solution would be to join interdependent columns with a separator and split them up again in the template.
- For a programmatic approach, we can use the function
cross
, which can determine all associated rows of a record via a unique identifier.
We will follow the third approach here.
Continuing our example project we should first mockup some identifiers for illustration:
- Authors > Edit column > Add column based on this column...
- New column name:
Identifier
- Expression:
if(row.index == even(row.index), value.md5()[0,10], '')
- New column name:
Then we need to create a column with unique record identifiers in each row of a record. The title column does not seem sufficiently unique. Normally, we could copy an existing column of identifiers and use fill down
, but in our data set not all records have a DOI and there is a duplicate in the URL column. So we create a column based on the record index:
- Title > Edit column > Add column based on this column...
- New column name:
index
- Expression:
row.record.index.toString()
- New column name:
Now the data should look like this:
Title | index | Authors | Identifier | ... |
---|---|---|---|---|
The Fisher Thermodynamics of Quasi-Probabilities | 0 | Flavia Pennini | 65696e3b3b | ... |
0 | Angelo Plastino | ... | ||
Aflatoxin Contamination of the Milk Supply: A Pakistan Perspective | 1 | Naveed Aslam | fbb869c439 | ... |
1 | Peter C. Wynn | ... | ||
Metagenomic Analysis of Upwelling-Affected Brazilian Coastal Seawater Reveals Sequence Domains of Type I PKS and Modular NRPS | 2 | Rafael R. C. Cuadrat | 9b8e7077df | ... |
2 | Juliano C. Cury | ... | ||
2 | Alberto M. R. Dávila | c9c596470c | ... | |
... | ... | ... | ... | ... |
We will use the function cross
to get an array of rows to iterate over. This way we can address rows of a record specifically. We need to tell cross the column name and the project name. Here is an example for the project name doaj article sample csv
:
with(
cross(cells['index'].value, 'doaj article sample csv' , 'index'),
rows,
rows
)
Applying this expression we get the following preview:
row | value | with(cross(cells['index'].valu ... |
---|---|---|
1. | 0 | [ [object Row], [object Row] ] |
2. | 0 | [ [object Row], [object Row] ] |
3. | 1 | [ [object Row], [object Row] ] |
4. | 1 | [ [object Row], [object Row] ] |
5. | 2 | [ [object Row], [object Row], [object Row] ] |
6. | 2 | [ [object Row], [object Row], [object Row] ] |
7. | 2 | [ [object Row], [object Row], [object Row] ] |
We can then address values of columns in specific rows by iterating over the rows array with forEach
. For example:
with(
cross(cells['index'].value, 'doaj article sample csv' , 'index'),
rows,
forEach(
rows,
r,
r.cells['Authors'].value + forNonBlank(r.cells['Identifier'].value, v, ' | ' + v, '')
)
)
Applying this expression we get:
row | value | with( cross(cells['index'] ... |
---|---|---|
1. | 0 | [ "Flavia Pennini | 65696e3b3b", "Angelo Plastino" ] |
2. | 0 | [ "Flavia Pennini | 65696e3b3b", "Angelo Plastino" ] |
3. | 1 | [ "Naveed Aslam | fbb869c439", "Peter C. Wynn" ] |
4. | 1 | [ "Naveed Aslam | fbb869c439", "Peter C. Wynn" ] |
5. | 2 | [ "Rafael R. C. Cuadrat | 9b8e7077df", "Juliano C. Cury", "Alberto M. R. Dávila | c9c596470c" ] |
6. | 2 | [ "Rafael R. C. Cuadrat | 9b8e7077df", "Juliano C. Cury", "Alberto M. R. Dávila | c9c596470c" ] |
7. | 2 | [ "Rafael R. C. Cuadrat | 9b8e7077df", "Juliano C. Cury", "Alberto M. R. Dávila | c9c596470c" ] |
Names and Identifiers seems to match now. We will now use this technique to generate MARCXML.
<?xml version="1.0" encoding="utf-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
Optionally followed by a blank line to get cleanly formatted xml.
{{
if(row.index - row.record.fromRowIndex == 0,
with(cross(cells['index'].value, 'doaj article sample csv' , 'index'), rows,
'<record>' + '\n' +
'<leader> nab a22 uu 4500</leader>' + '\n' +
forEach(filter(rows, r, isNonBlank(r.cells['URL'].value)).slice(0,1), r,
'<controlfield tag="001">' + r.cells['URL'].value.replace('https://doaj.org/article/','').escape('xml') + '</controlfield>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['Authors'].value)).slice(0,1), r,
'<datafield tag="100" ind1="0" ind2=" ">' + '\n' +
' <subfield code="a">' + r.cells['Authors'].value.escape('xml') + '</subfield>' + '\n' +
forNonBlank(r.cells['Identifier'].value, v,
' <subfield code="0">' + v.escape('xml') + '</subfield>' + '\n', ''
) +
'</datafield>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['Title'].value)).slice(0,1), r,
'<datafield tag="245" ind1="0" ind2="0">' + '\n' +
' <subfield code="a">' + r.cells['Title'].value.escape('xml') + '</subfield>' + '\n' +
'</datafield>' + '\n'
).join('') +
forEach(filter(rows, r, isNonBlank(r.cells['Authors'].value)).slice(1), r,
'<datafield tag="700" ind1="0" ind2=" ">' + '\n' +
' <subfield code="a">' + r.cells['Authors'].value.escape('xml') + '</subfield>' + '\n' +
forNonBlank(r.cells['Identifier'].value, v,
' <subfield code="0">' + v.escape('xml') + '</subfield>' + '\n', ''
) +
'</datafield>' + '\n'
).join('') +
'</record>' + '\n'
), '')
}}
Some points here:
- The variable
rows
is defined in line 3 and contains an array of rows generated bycross
. The project name currently needs to be given tocross
as a string, e.g.'doaj article sample csv'
. OpenRefine 3.5 will allow an empty string to address the current project. - The filter function is used to select only the rows with existing values in a column. To ensure that only one row is used for non-repeatable fields (even if two rows should erroneously contain values in the data),
slice(0,1)
is used to explicitly select the first row. - Identifiers were added to Field 100, subfield 0 and Field 700, subfield 0.
Delete the default (comma (,) and line break) so that the field is empty.
</collection>