Implement WileyML XML records parser producing DocumentText datastore #1440

Open
marekhorst opened this issue Dec 12, 2023 · 0 comments

Originally requested in: https://support.openaire.eu/issues/8896#note-98

This parser should be responsible for:

  • assigning a proper identifier (instead of the currently used file name, which is not unique)
  • extracting text out of the WileyML record

Currently the input is a DocumentText datastore with the file name set as id and the WileyML record as text.

We could start with id extraction first and, in the beginning, propagate the full XML record as text. Instead of relying on the file name, which currently identifies Wiley XMLs in the DocumentText avro datastore, we should identify those XML records with something better, such as a DOI. DOIs are available in the XMLs, so it should be possible to extract them. A single WileyML record defines multiple DOIs (e.g. identifying the journal or issue in addition to the article), so we should carefully pick the right DOI and pick a replacement whenever the article DOI is not available.
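
A minimal sketch of the DOI selection described above, written in Python for brevity rather than in the project's own language. It assumes each DOI sits in a `doi` element directly under a `publicationMeta` element whose `level` attribute distinguishes article (`unit`), issue (`part`) and journal (`product`); those element names, attribute values, the fallback order and the `urn:example:wileyml` namespace are assumptions made for illustration and should be checked against real WileyML records.

```python
import xml.etree.ElementTree as ET

# Assumed fallback order: article DOI first, then issue, then journal.
LEVEL_PREFERENCE = ("unit", "part", "product")


def local_name(tag):
    """Strip a namespace prefix, turning '{uri}doi' into 'doi'."""
    return tag.rsplit("}", 1)[-1]


def extract_doi(wileyml):
    """Return the most specific DOI found in the record, or None."""
    root = ET.fromstring(wileyml)
    doi_by_level = {}
    for element in root.iter():
        if local_name(element.tag) != "publicationMeta":
            continue
        level = element.get("level")
        for child in element:
            text = (child.text or "").strip()
            if local_name(child.tag) == "doi" and text:
                # Keep the first DOI seen for each level.
                doi_by_level.setdefault(level, text)
    for level in LEVEL_PREFERENCE:
        if level in doi_by_level:
            return doi_by_level[level]
    return None


# Hypothetical record: the namespace and DOIs are placeholders.
SAMPLE = """<component xmlns="urn:example:wileyml">
  <header>
    <publicationMeta level="product"><doi>10.1000/journal</doi></publicationMeta>
    <publicationMeta level="part"><doi>10.1000/issue</doi></publicationMeta>
    <publicationMeta level="unit"><doi>10.1000/article</doi></publicationMeta>
  </header>
</component>"""

print(extract_doi(SAMPLE))
```

Matching on local element names keeps the sketch independent of the exact namespace URI. Falling back to the issue or journal DOI is only one possible "replacement"; such DOIs are shared by many articles, so they would not be unique on their own.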

marekhorst self-assigned this Dec 12, 2023
marekhorst added a commit that referenced this issue Dec 13, 2023
…Text datastore

Introducing the first version of the importer module along with the workflow.xml definition.

The current version of the importer expects DocumentText avro records as input (with the text field holding WileyML records) and produces DocumentText records as output, with the identifier field set to a DOI extracted from the WileyML record.
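
The record-level transformation described in this commit message can be sketched as follows, in Python and over plain dictionaries instead of avro records. The names `rekey_records` and `first_doi` are hypothetical, the regex-based extractor is only a stand-in for proper XML parsing, and keeping the original id when no DOI is found is an assumption rather than the importer's confirmed behaviour.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, Optional

# Stand-in for a DocumentText record: {"id": ..., "text": ...}.
Record = Dict[str, str]


def rekey_records(
    records: Iterable[Record],
    extract_doi: Callable[[str], Optional[str]],
) -> Iterator[Record]:
    """Yield copies of the records with id replaced by the extracted DOI."""
    for record in records:
        doi = extract_doi(record["text"])
        # Assumption: keep the old id (file name) when no DOI is found.
        yield {"id": doi if doi else record["id"], "text": record["text"]}


def first_doi(xml_text: str) -> Optional[str]:
    """Naive placeholder extractor: the first un-prefixed <doi> element."""
    match = re.search(r"<doi>\s*([^<]+?)\s*</doi>", xml_text)
    return match.group(1) if match else None


records = [
    {"id": "file1.xml", "text": "<a><doi>10.1000/x</doi></a>"},
    {"id": "file2.xml", "text": "<a/>"},
]
rekeyed = list(rekey_records(records, first_doi))
```

The text field is passed through unchanged, matching the plan to propagate the full XML record until text extraction is implemented.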