Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encode-marcxml should combine split leader fields of decode-marc21 to create valid marc xml #527

Open
TobiasNx opened this issue Apr 22, 2024 · 7 comments · May be fixed by #531
Open

encode-marcxml should combine split leader fields of decode-marc21 to create valid marc xml #527

TobiasNx opened this issue Apr 22, 2024 · 7 comments · May be fixed by #531
Assignees
Labels

Comments

@TobiasNx
Copy link
Contributor

TobiasNx commented Apr 22, 2024

encode-marcxml creates invalid marc data when processing decoded-marc21 with separated leader elements.

It creates invalid marc xml data. See the leader spec here in the schema: https://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd

encode-marcxml should behave in the same way as encode-marc21, if the data is seperated the encoder should combine them.

Status quo:
https://test.metafacture.org/playground/?flux=%22https%3A//raw.githubusercontent.com/metafacture/metafacture-core/master/metafacture-runner/src/main/dist/examples/read/marc21/10.marc21%22%0A%7C+open-http%0A%7C+as-lines%0A%7C+decode-marc21%0A%7C+encode-marcxml%0A%7C+print%0A%3B

decode-marc21(emitLeaderAsWhole="false")` -> encode-marcxml => the separated elements are kept:

		<marc:leader>p</marc:leader>
		<marc:leader>a</marc:leader>
		<marc:leader>m</marc:leader>
		<marc:leader> </marc:leader>
		<marc:leader>a</marc:leader>
		<marc:leader> </marc:leader>
		<marc:leader>c</marc:leader>
		<marc:leader> </marc:leader>

leader should be combined.

Originally posted by @TobiasNx in #526 (comment)

@TobiasNx TobiasNx added the Bug label Apr 22, 2024
@dr0i
Copy link
Member

dr0i commented Apr 22, 2024

Asking @blackwinter @fsteeg @hagbeck @maipet @TobiasNx : this would change the default behaviour - how should we procede:

a) just fix it and mention the fix as we normally do
b) note also in https://github.com/metafacture/metafacture-core/blob/master/metafacture-biblio/src/main/java/org/metafacture/biblio/marc21/MarcXmlEncoder.java the change of the behaviour. Somewhere else?
c) should we also provide a possibility for backward compatibility, i.e. allowing the creation of the invalid MARC XML?
d) do we have to increase the major release version (to 7.0.0) ?

@TobiasNx
Copy link
Contributor Author

TobiasNx commented Apr 22, 2024

In my opinion this should similar be handled as #401. It is a bug, It should not be backwards compatible.

@dr0i dr0i self-assigned this Apr 22, 2024
@dr0i
Copy link
Member

dr0i commented Apr 23, 2024

decode-marc21(emitLeaderAsWhole="false") (where this prarameter set to false is the default) outputs not all necessary elements. We could:
a) enhance decode-marc21 to output all the subfields that are needed to compose the leader in encode-marxml.
Then enhance encode-marxml to parse these subfields.
b) make decode-marc21(emitleaderaswhole="true") as default. And bail out in encode-marxml when leader is encountered as separated subfields with message like e.g. "Please provide one marc:leader$marcleader</marc:leader>"

I would really like to go with b) which saves us (and the code) much extra work in comparison to a).
It's true then that still decode-marc21(emitLeaderAsWhole="false") would not work (well - it would not generate a wrong MARC XML but no MARC XML at all) but as this would no longer be the default I think it's very unlikely that someone explicitly sets the parameter (decode-marc21(emitLeaderAsWhole="false")) - and if so, the Exception message would be clear enough.
WDYT?

@dr0i dr0i assigned TobiasNx and unassigned dr0i Apr 23, 2024
@TobiasNx
Copy link
Contributor Author

TobiasNx commented Apr 24, 2024

In my opinion decode-marc21 should not be changed at all.
I thought you could reuse something from encode-marc21 for encode-marcxml, because encode-marc21 can handle the "missing" elements. (Probably with some JAVA stuff here)

All leader info should already be available either since they are a) static, b) can be generated by the encoder (as #524) or c) provided by the decode-marc21.

LeaderPos Name Variant
00-04 Record length b) generated by encoder as in #524
05 record status c) leader.status
06 record type c) leader.type
07 bibliographic level c) leader.bibliographicLevel
08 type of control c) leader.typeOfControl
09 character coding scheme c) leader.characterCodingScheme
10 Indicator count a) always 2
11 Subfield code count a) always 2
12-16 Base address of data b) generated by encoder as in #524
17 encoding level c) leader.encodingLevel
18 descriptive cataloging form c) leader.catalogingForm
19 multipart resource record level c) leader.multipartLevel
20 Length of the length-of-field portion a) always 4
21 Length of the starting-character-position portion a) always 5
22 Length of the implementation-defined portion a) always 0
23 Undefined a) always 0

encode-marc21 can create the leader even with the missing leader positions since the missing elements are either generated (see #524) or static see
https://github.com/metafacture/metafacture-core/blob/bd719d2ee87834461fd8594a7a1b69db49658e02/metafacture-biblio/src/main/java/org/metafacture/biblio/marc21/Marc21Constants.java#L31-37

@dr0i
Copy link
Member

dr0i commented Apr 25, 2024

I know this all. My proposal is, if it's true that we always need decode-marc21 to pipe to encode-marcxml we could

go with b) which saves us (and the code) much extra work in comparison to a).

@TobiasNx
Copy link
Contributor Author

TobiasNx commented Apr 25, 2024

Talked to @dr0i and @maipet the easy solution is instead of generating the Leader Pos 00-04 and Pos 12-16 is just to set them to 0 if the leader is provided incomplete and in separated elements:
like this:
leader = 0000[leader.status][leader.type][leader.bibliographicLevel][leader.typeOfControl][leader.characterCodingScheme]2200000[leader.encodingLevel][leader.catalogingForm][leader.multipartLevel]4500

That also echos voices from the literature:

„The construction of the element in a MARCXML record has some minor nuances. Typically, a MARCXML element takes, as content, the leader string unchanged from the way it would appear in a traditional MARC record serialization for magnetic tape, including any spaces. However, implementers should recognize, especially when creating MARCXML records from scratch rather than transforming existing MARC records, that a few character positions of the element are ambiguous in a MARCXML serialization. In traditional MARC, the first five characters of the leader string (character positions 0 to 4) are meant to hold the length of the record. They can be used for this purpose in MARCXML, but XML-applications processing MARCXML records do not use this information; they have other (more reliable) options to determine the „length of an XML-serialized MARC record or to move on to the next element in a MARCXML instance. If transformed from a traditional MARC record, these first five characters often are left as is (i.e., representing the length of the record when it was serialized according to the tape-based traditional MARC record format). In other instances, these first five characters of the leader may be set to all zeros in the MARCXML record (the authors’ personal preference). Similar logic is applied to leader character positions 12 to 16, which are designed to hold the base address for a traditionally serialized MARC record on tape. Finally, leader character positions 20 to 23 make up the entry map for the MARC record directory. Since directory entries are not relevant to the MARCXML serialization of the record, these character positions are either left as they would appear in the traditional MARC serialization (always “4500”) or left blank (i.e., four spaces).“

from: XML for Catalogers and Metadata Librarians by Timothy W. Cole and Myung-Ja Han, https://dl.acm.org/doi/10.5555/2531760

and examples from DNB, ALMA and others:
https://d-nb.info/1165051362
https://lobid.org/marcxml/990045392320206443
https://wiki.deutsche-digitale-bibliothek.de/pages/viewpage.action?pageId=78513719

@dr0i
Copy link
Member

dr0i commented Apr 29, 2024

We discussed offline: we have to properly parse the leader, e.g. because the order of the leader's subfields cannot be ensured when e.g. using a fix to build the leader. The easiest way ensuring a valid MARC21 as XML is to reuse the eloborate mechanisms encode-marc21 provides. A wrapper could be implemented as abbrevation for doing this .
So, we would make this:

| decode-marc21
| encode-marc21
| decode-marc21(emitleaderaswhole="true")
| encode-marcxml

to this:

| decode-marc21
| encode-marcxml

i.e. encode-marcxml does | encode-marc21 | decode-marc21(emitleaderaswhole="true") | encode-marcxml .
Never mind performance. We can tackle that if someone complains - and maybe that's not even an issue.

dr0i added a commit that referenced this issue May 6, 2024
Marc21XmlEncoder acts as a wrapper. It makes use of Marc21Encoder, Marc21Decoder
and MarcXmlEncoder to ensure a proper MarcXml, especially regarding the leader.
Also - in contrast to MarcXmlEncoder - the record id (field 001) is mandatory.
@dr0i dr0i linked a pull request May 6, 2024 that will close this issue
dr0i added a commit that referenced this issue May 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Working
Development

Successfully merging a pull request may close this issue.

2 participants