processHeaderDocument returns BibTeX by default instead of TEI #1093

Open
michamos opened this issue Apr 3, 2024 · 3 comments
Assignees
Labels
bug From Hemiptera and especially its suborder Heteroptera need help Issues where the contributors are even more incompetent than usual

Comments

@michamos
Contributor

michamos commented Apr 3, 2024

Hi, I noticed that, at least since v0.7.3, GROBID has been returning BibTeX by default for /api/processHeaderDocument. This contradicts https://grobid.readthedocs.io/en/latest/Grobid-service/#apiprocessheaderdocument, which states that BibTeX requires an explicit Accept: application/x-bibtex header and that the default is TEI XML.

Note that it's possible to get an XML response by using Accept: application/xml.
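
For reference, a minimal sketch of requesting each format explicitly rather than relying on the default. The two Accept values are the ones mentioned above; the helper name grobid_header and the GROBID_URL variable are illustrative and not part of GROBID.

```shell
# Base URL of the GROBID service (assumption: the public demo used in this issue).
GROBID_URL="${GROBID_URL:-https://kermitt2-grobid.hf.space}"

# grobid_header ACCEPT PDF
# Hypothetical helper: POST a PDF to processHeaderDocument with an explicit
# Accept header, so the response format does not depend on the server default.
grobid_header() {
  curl --silent --show-error \
    --header "Accept: $1" \
    --form "input=@$2" \
    "$GROBID_URL/api/processHeaderDocument"
}

# Usage (not run here):
#   grobid_header application/xml      paper.pdf   # TEI XML
#   grobid_header application/x-bibtex paper.pdf   # BibTeX
```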

Steps to reproduce

  1. Get a PDF (I used https://arxiv.org/pdf/2212.12604v1.pdf but anything will do)
  2. Make a request against the GROBID API. I used the HuggingFace demo API:
    curl https://kermitt2-grobid.hf.space/api/processHeaderDocument --form input=@Downloads/2212.12604v1.pdf
  3. See that the output contains BibTeX and not TEI XML:
@misc{-1,
  author = {},
  title = {Search for new physics in the τ lepton plus missing transverse momentum final state in proton-proton collisions at √ s = 13 TeV The CMS Collaboration},
  date = {2022-12-23},
  year = {2022},
  month = {12},
  day = {23},
  eprint = {arXiv:2212.12604v1[hep-ex]},
  abstract = {A search for physics beyond the standard model (SM) in the final state with a hadronically decaying tau lepton and a neutrino is presented. This analysis is based on data recorded by the CMS experiment from proton-proton collisions at a center-ofmass energy of 13 TeV at the LHC, corresponding to a total integrated luminosity of 138 fb-1. The transverse mass spectrum is analyzed for the presence of new physics. No significant deviation from the SM prediction is observed. Limits are set on the production cross section of a W boson decaying into a tau lepton and a neutrino. Lower limits are set on the mass of the sequential SM-like heavy charged vector boson and the mass of a quantum black hole. Upper limits are placed on the couplings of a new boson to the SM fermions. Constraints are put on a nonuniversal gauge interaction model and an effective field theory model. For the first time, upper limits on the cross section of t-channel leptoquark (LQ) exchange are presented. These limits are translated into exclusion limits on the LQ mass and on its coupling in the t-channel. The sensitivity of this analysis extends into the parameter space of LQ models that attempt to explain the anomalies observed in B meson decays. The limits presented for the various interpretations are the most stringent to date. Additionally, a model-independent limit is provided.}
}
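
To check which format a response actually contains without reading it by eye, a rough heuristic is to look at the first non-blank line of the body. This is only a sketch: the function name detect_format is made up for illustration, and it assumes BibTeX output starts with "@" (as in the output above) and XML output starts with "<".

```shell
# detect_format: read a response body on stdin and print "bibtex", "xml",
# or "unknown". Heuristic only: inspects the first non-blank line.
detect_format() {
  # Take the first line containing a non-space character and strip its
  # leading whitespace.
  first_line="$(sed -n '/[^[:space:]]/{s/^[[:space:]]*//;p;q;}')"
  case "$first_line" in
    @*)   echo bibtex ;;   # e.g. "@misc{-1,"
    '<'*) echo xml ;;      # e.g. "<?xml ..." or "<TEI ..."
    *)    echo unknown ;;
  esac
}

# Usage with the request from step 2 (not run here):
#   curl https://kermitt2-grobid.hf.space/api/processHeaderDocument \
#     --form input=@Downloads/2212.12604v1.pdf | detect_format
```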

Requested info

Linux amd64 through lfoppiano/grobid:0.7.3 Docker image & whatever huggingface is using

  • What is your Java version (java --version)?

openjdk 17.0.2 2022-01-18
OpenJDK Runtime Environment (build 17.0.2+8-86)
OpenJDK 64-Bit Server VM (build 17.0.2+8-86, mixed mode, sharing)

  • In case of build or run errors, please submit the error while running gradlew with --stacktrace and --info for better log traces (e.g. ./gradlew run --stacktrace --info) or attach the log file logs/grobid-service.log.
@lfoppiano lfoppiano added the bug From Hemiptera and especially its suborder Heteroptera label Apr 3, 2024
@lfoppiano lfoppiano self-assigned this Apr 4, 2024
@lfoppiano
Collaborator

Hi @michamos, long time no see 😄
It's nice that you're back working with Grobid!
Thanks for opening the issue.

This looks more like a problem with how Jakarta selects the default when Accept is not specified.
Locally, when I use the same request you posted, I get TEI XML; however, I think it depends on how the methods are loaded. There seems to be no clearly defined behaviour, although this looks strange.

One solution I saw is to add an additional filter to default the Accept to application/xml when undefined, but it seems a bit of a hack and might affect other endpoints.

I will look into it in more detail.

@michamos
Contributor Author

michamos commented Apr 4, 2024

Hi @lfoppiano, indeed :) We've been using GROBID in prod for INSPIRE for quite a while now. We use it to extract author and affiliation info from PDFs and to segment references for interactive search (so users can copy/paste references from a paper and it magically works). Unfortunately, our current resources are very limited, so we can't really contribute beyond submitting bug reports.

Thanks for looking into the issue!

@lfoppiano
Collaborator

I dug into this and did not find a clean solution. I'm quite surprised that there is no way to define a default behavior.
It seems that the behavior is random depending on the platform where it's running.

Nevertheless, I updated the documentation to state that the Accept header is required.

@lfoppiano lfoppiano added the need help Issues where the contributors are even more incompetent than usual label May 20, 2024