[WIP] Docx support #515

Open
kermitt2 wants to merge 5 commits into master from docx-support

Conversation

kermitt2
Owner

@kermitt2 kermitt2 commented Oct 21, 2019

docx document support using Apache POI (via the opensagres converter)

@coveralls

coveralls commented Oct 21, 2019

Coverage Status

Coverage decreased (-0.04%) to 37.257% when pulling 70e48a4 on docx-support into 9ad861e on master.

@kermitt2
Owner Author

I need to add some tests :/

@kermitt2
Owner Author

I've added examples under grobid/grobid-core/src/test/resources/docx.
Out of 14 different docx article templates, the conversion currently fails for 3 (IJSRP-paper-submission-format-double-column.docx, IJSRP-paper-submission-format-single-column.docx and Journal-Manuscript-Format-MS-Office-2007.docx) :(

@kermitt2
Owner Author

Given the poor results with Apache POI, we would need to try docx4j as an alternative.

@kermitt2 kermitt2 self-assigned this Dec 28, 2019
@Downchuck

Is there a quick way, or would it make sense, to add ALTO XML as an input format for the service as well? That way, someone could run that pipeline separately (pdfalto included) and then just submit the XML.

This is of course not as nice as getting inline converters working, but it does sort of shift the problem.

@kermitt2
Owner Author

Adding web services that take ALTO XML as input instead of PDF is easy (we just shortcut the PDF conversion), but the problem is that there are a number of ALTO flavors, and so far the only one that is well tested and supported is the pdfalto output. Testing and comprehensively supporting the ALTO variants would probably take quite a lot of effort.

The best option for docx would certainly be a docx to ALTO XML conversion (same ALTO as pdfalto), and then also accepting ALTO as input, but it looks complicated, which is why I am first adding this docx support via Apache POI or docx4j.
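
As a rough illustration of what "shortcutting the PDF conversion" could mean at the service boundary, here is a minimal sketch that sniffs the first bytes of an upload and decides whether the PDF to ALTO step is needed. All names (`InputRouter`, `Kind`, `detect`, `needsPdfConversion`) are hypothetical and are not part of GROBID. The ALTO check only looks for an `<alto` tag, so it says nothing about which ALTO flavor was submitted, which is exactly the open problem described above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not GROBID code: classify an upload by its leading bytes.
final class InputRouter {

    // Illustrative input kinds only.
    public enum Kind { PDF, ZIP_CONTAINER, ALTO_XML, UNKNOWN }

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP local file header signature. A docx file is a ZIP container, but so are
    // many other formats, so this check alone cannot confirm that a file is docx.
    private static final byte[] ZIP_MAGIC = new byte[] {'P', 'K', 3, 4};

    private InputRouter() {}

    // Guess the kind of input from the first bytes of the uploaded file.
    public static Kind detect(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return Kind.PDF;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Kind.ZIP_CONTAINER;
        }
        // Assumption: an ALTO document has an <alto ...> root element near the start.
        String text = new String(head, StandardCharsets.UTF_8);
        if (text.contains("<alto")) {
            return Kind.ALTO_XML;
        }
        return Kind.UNKNOWN;
    }

    // Only PDF input needs the PDF to ALTO step; ALTO input would skip it.
    public static boolean needsPdfConversion(Kind kind) {
        return kind == Kind.PDF;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With something like this, a request handler could call `InputRouter.detect(...)` on the first few hundred bytes and go straight to the ALTO-based processing when `needsPdfConversion` returns false. Validating the ALTO variant would still have to happen separately.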

@Downchuck

Should I add ALTO XML input as a feature request / pull request?

The embedded pdfalto support makes containerization a bit more of a challenge and affects how ingestion can be managed: as you can imagine, running the pdfalto process on a separate server leaves the GROBID server more capacity for running Wapiti and such.

@Downchuck

Noting that another contributor has posted a pdfalto service PR (WIP): #552

@kermitt2 kermitt2 added this to the 0.6.1 milestone Apr 24, 2020
@kermitt2 kermitt2 modified the milestones: 0.6.1, 0.6.2 Aug 12, 2020
@kermitt2 kermitt2 mentioned this pull request Aug 31, 2020
@lfoppiano lfoppiano removed this from the 0.6.2 milestone Mar 15, 2021
@kermitt2 kermitt2 mentioned this pull request Aug 23, 2023

4 participants