Skip to content
This repository has been archived by the owner on Mar 30, 2022. It is now read-only.

500 error when converting PDF #46

Closed
tamlyn opened this issue May 16, 2018 · 11 comments
Closed

500 error when converting PDF #46

tamlyn opened this issue May 16, 2018 · 11 comments

Comments

@tamlyn
Copy link

tamlyn commented May 16, 2018

Steps to reproduce

  • Delete all Docker containers docker rm $(docker ps -aq)
  • Delete all Docker images docker rmi $(docker images -q)
  • Clone repo git clone https://github.com/elifesciences/sciencebeam.git
  • Start containers docker-compose up
  • Download sample PDF curl "https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvMzI2NzEvZWxpZmUtMzI2NzEtdjIucGRm/elife-32671-v2.pdf?_hash=nrG1HRdFl4DZPdYxrP0OOJfOcyNJrkWHhR5HiBe0O4M%3D" > elife-32671-v2.pdf
  • Send PDF to ScienceBeam curl -XPOST --data-binary @elife-32671-v2.pdf -v -H "content-type: application/octet-stream" http://localhost:8075/api/convert\?filename=elife-32671-v2.pdf

Expected

Returns JATS XML

Actual

Sometimes it returns the JATS but sometimes it returns a HTTP 500 error. Usually the first request after starting the container succeeds and the second one fails but this is not entirely consistent.

ScienceBeam logs for failed request
sciencebeam_1  | DEBUG:sciencebeam.server.blueprints.api:processing file: elife-32671-v2.pdf (6429323 bytes, type "application/pdf")
sciencebeam_1  | DEBUG:sciencebeam.pipeline_runners.simple_pipeline_runner:skipping step (type "application/pdf" not supported): DOC to PDF
sciencebeam_1  | DEBUG:sciencebeam.pipeline_runners.simple_pipeline_runner:executing step (with type "application/pdf"): Convert to TEI
sciencebeam_1  | INFO:sciencebeam.transformers.grobid_service:processing: elife-32671-v2.pdf (6429323) - http://grobid:8070/api/processFulltextDocument
sciencebeam_1  | DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): grobid
grobid_1       | INFO  [2018-05-16 14:10:15,878] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
grobid_1       | ERROR [2018-05-16 14:10:17,516] org.grobid.core.process.ProcessPdf2Xml: pdftoxml process finished with error code: 137. [/opt/grobid/grobid-home/pdf2xml/lin-64/pdftoxml_server, -blocks, -noImageInline, -fullFontName, -noImage, -annotation, /opt/grobid/grobid-home/tmp/origin2492405599912754936.pdf, /opt/grobid/grobid-home/tmp/JQ3qaXQRd7.lxml]
grobid_1       | ERROR [2018-05-16 14:10:17,516] org.grobid.core.process.ProcessPdf2Xml: pdftoxml return message:
grobid_1       |
grobid_1       | 172.18.0.3 - - [16/May/2018:14:10:17 +0000] "POST /api/processFulltextDocument HTTP/1.1" 400 208 "-" "python-requests/2.18.4" 1796
sciencebeam_1  | DEBUG:urllib3.connectionpool:http://grobid:8070 "POST /api/processFulltextDocument HTTP/1.1" 400 208
sciencebeam_1  | [2018-05-16 14:10:17,572] ERROR in app: Exception on /api/convert [POST]
sciencebeam_1  | Traceback (most recent call last):
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask/app.py", line 1982, in wsgi_app
sciencebeam_1  |     response = self.full_dispatch_request()
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask/app.py", line 1614, in full_dispatch_request
sciencebeam_1  |     rv = self.handle_user_exception(e)
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask_cors/extension.py", line 161, in wrapped_function
sciencebeam_1  |     return cors_after_request(app.make_response(f(*args, **kwargs)))
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask/app.py", line 1517, in handle_user_exception
sciencebeam_1  |     reraise(exc_type, exc_value, tb)
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask/app.py", line 1612, in full_dispatch_request
sciencebeam_1  |     rv = self.dispatch_request()
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/flask/app.py", line 1598, in dispatch_request
sciencebeam_1  |     return self.view_functions[rule.endpoint](**req.view_args)
sciencebeam_1  |   File "sciencebeam/server/blueprints/api.py", line 70, in _convert
sciencebeam_1  |     content=content, filename=filename, data_type=data_type
sciencebeam_1  |   File "sciencebeam/pipeline_runners/simple_pipeline_runner.py", line 42, in convert
sciencebeam_1  |     current_item = step(current_item)
sciencebeam_1  |   File "sciencebeam/pipelines/__init__.py", line 53, in __call__
sciencebeam_1  |     return self.fn(data)
sciencebeam_1  |   File "sciencebeam/pipelines/grobid_pipeline.py", line 62, in 
sciencebeam_1  |     pdf_content=data['content']
sciencebeam_1  |   File "sciencebeam/pipelines/grobid_pipeline.py", line 56, in 
sciencebeam_1  |     convert_to_tei = lambda pdf_filename, pdf_content: call_grobid((pdf_filename, pdf_content))[1]
sciencebeam_1  |   File "sciencebeam/transformers/grobid_service.py", line 48, in do_grobid_service
sciencebeam_1  |     response.raise_for_status()
sciencebeam_1  |   File "/srv/sciencebeam/venv/lib/python2.7/site-packages/requests/models.py", line 935, in raise_for_status
sciencebeam_1  |     raise HTTPError(http_error_msg, response=self)
sciencebeam_1  | HTTPError: 400 Client Error: Bad Request for url: http://grobid:8070/api/processFulltextDocument
sciencebeam_1  | INFO:werkzeug:172.19.0.1 - - [16/May/2018 14:10:17] "POST /api/convert?filename=elife-32671-v2.pdf HTTP/1.1" 500 -
@de-code
Copy link
Collaborator

de-code commented May 16, 2018

The error happens within the GROBID container. Could you please run it separately:

docker run --rm --add-host api.crossref.org:127.0.0.1 -p 8070:8070 lfoppiano/grobid:0.5.1

and then:

curl -v --form input=@elife-32671-v2.pdf --form consolidateHeader=0 \
  --form consolidateCitations=0 \
  localhost:8070/api/processFulltextDocument

Can you reproduce the problem just using GROBID?

What is your platform you are running it on?
Do you have disk space available?

@tamlyn
Copy link
Author

tamlyn commented May 16, 2018

The grobid container exits

curl logs
$ curl -v --form input=@elife-32671-v2.pdf -form consolidateHeader=0 \
    --form consolidateCitations=0 \
    localhost:8070/api/processFulltextDocument
* Rebuilt URL to: consolidateHeader=0/
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Could not resolve host: consolidateHeader=0
* Closing connection 0
curl: (6) Could not resolve host: consolidateHeader=0
*   Trying ::1...
* Connected to localhost (::1) port 8070 (#1)
> POST /api/processFulltextDocument HTTP/1.1
> Host: localhost:8070
> User-Agent: curl/7.43.0
> Accept: */*
> Content-Length: 6429644
> Expect: 100-continue
> Content-Type: multipart/form-data; boundary=------------------------5fd1cdfd8599ce92
>
< HTTP/1.1 100 Continue
* Empty reply from server
* Connection #1 to host localhost left intact
curl: (52) Empty reply from server
grobid container logs
INFO  [2018-05-16 15:43:49,190] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
INFO  [2018-05-16 15:43:49,218] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/fulltext/model.wapiti (size: 20462507)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/fulltext/model.wapiti"
Model path: /opt/grobid/grobid-home/models/fulltext/model.wapiti
INFO  [2018-05-16 15:43:52,208] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/segmentation/model.wapiti (size: 15807193)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/segmentation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/segmentation/model.wapiti
INFO  [2018-05-16 15:43:56,820] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/figure/model.wapiti (size: 680116)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/figure/model.wapiti"
Model path: /opt/grobid/grobid-home/models/figure/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/header/model.wapiti"
INFO  [2018-05-16 15:43:57,167] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/header/model.wapiti (size: 36094028)
Model path: /opt/grobid/grobid-home/models/header/model.wapiti
INFO  [2018-05-16 15:44:02,854] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/name/header/model.wapiti (size: 2225578)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/name/header/model.wapiti"
Model path: /opt/grobid/grobid-home/models/name/header/model.wapiti
INFO  [2018-05-16 15:44:03,094] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/name/citation/model.wapiti (size: 393118)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/name/citation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/name/citation/model.wapiti
INFO  [2018-05-16 15:44:03,521] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/affiliation-address/model.wapiti (size: 2646298)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/affiliation-address/model.wapiti"
Model path: /opt/grobid/grobid-home/models/affiliation-address/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/date/model.wapiti"
INFO  [2018-05-16 15:44:06,343] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/date/model.wapiti (size: 102435)
Model path: /opt/grobid/grobid-home/models/date/model.wapiti
INFO  [2018-05-16 15:44:06,603] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/citation/model.wapiti (size: 16235248)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/citation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/citation/model.wapiti
INFO  [2018-05-16 15:44:14,332] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/reference-segmenter/model.wapiti (size: 4829569)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/reference-segmenter/model.wapiti"
Model path: /opt/grobid/grobid-home/models/reference-segmenter/model.wapiti

I'm using Docker 18.03.1-ce-mac65 on OSX 10.11 with > 50GB free space. 😕

@de-code
Copy link
Collaborator

de-code commented May 16, 2018

Apologies, a hyphen in front of one of the --form parameters got lost. I updated my previous comment.

@tamlyn
Copy link
Author

tamlyn commented May 16, 2018

Same result: empty reply and grobid container exits with code 137.

@de-code
Copy link
Collaborator

de-code commented May 16, 2018

Could you send over the logs for that please? (in the previous logs it was not able to resolve 'consolidateHeader' due to the parameter issue)

@tamlyn
Copy link
Author

tamlyn commented May 16, 2018

curl

$ curl -v --form input=@elife-32671-v2.pdf --form consolidateHeader=0 \
    --form consolidateCitations=0 \
    localhost:8070/api/processFulltextDocument
*   Trying ::1...
* Connected to localhost (::1) port 8070 (#0)
> POST /api/processFulltextDocument HTTP/1.1
> Host: localhost:8070
> User-Agent: curl/7.43.0
> Accept: */*
> Content-Length: 6429751
> Expect: 100-continue
> Content-Type: multipart/form-data; boundary=------------------------34a7797deab18566
>
< HTTP/1.1 100 Continue
* Empty reply from server
* Connection #0 to host localhost left intact
curl: (52) Empty reply from server

grobid

INFO  [2018-05-16 15:56:23,079] org.grobid.core.factory.GrobidPoolingFactory: Number of Engines in pool active/max: 1/10
INFO  [2018-05-16 15:56:23,103] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/fulltext/model.wapiti (size: 20462507)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/fulltext/model.wapiti"
Model path: /opt/grobid/grobid-home/models/fulltext/model.wapiti
INFO  [2018-05-16 15:56:26,182] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/segmentation/model.wapiti (size: 15807193)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/segmentation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/segmentation/model.wapiti
INFO  [2018-05-16 15:56:30,465] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/figure/model.wapiti (size: 680116)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/figure/model.wapiti"
Model path: /opt/grobid/grobid-home/models/figure/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/header/model.wapiti"
INFO  [2018-05-16 15:56:30,899] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/header/model.wapiti (size: 36094028)
Model path: /opt/grobid/grobid-home/models/header/model.wapiti
INFO  [2018-05-16 15:56:38,397] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/name/header/model.wapiti (size: 2225578)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/name/header/model.wapiti"
Model path: /opt/grobid/grobid-home/models/name/header/model.wapiti
INFO  [2018-05-16 15:56:38,688] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/name/citation/model.wapiti (size: 393118)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/name/citation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/name/citation/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/affiliation-address/model.wapiti"
INFO  [2018-05-16 15:56:39,191] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/affiliation-address/model.wapiti (size: 2646298)
Model path: /opt/grobid/grobid-home/models/affiliation-address/model.wapiti
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/date/model.wapiti"
INFO  [2018-05-16 15:56:42,062] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/date/model.wapiti (size: 102435)
Model path: /opt/grobid/grobid-home/models/date/model.wapiti
INFO  [2018-05-16 15:56:42,256] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/citation/model.wapiti (size: 16235248)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/citation/model.wapiti"
Model path: /opt/grobid/grobid-home/models/citation/model.wapiti
INFO  [2018-05-16 15:56:46,752] org.grobid.core.jni.WapitiModel: Loading model: /opt/grobid/grobid-home/models/reference-segmenter/model.wapiti (size: 4829569)
[Wapiti] Loading model: "/opt/grobid/grobid-home/models/reference-segmenter/model.wapiti"
Model path: /opt/grobid/grobid-home/models/reference-segmenter/model.wapiti

@de-code
Copy link
Collaborator

de-code commented May 16, 2018

This issue moby/moby#21083 suggests that you might be able to solve it by increasing the memory allocation for docker.

You could also decrease the Java heap (1g is what the ScienceBeam docker compose file is using):

docker run --rm --add-host api.crossref.org:127.0.0.1 -p 8074:8070 -e JAVA_OPTS=-Xmx1g lfoppiano/grobid:0.5.1

@tamlyn
Copy link
Author

tamlyn commented May 16, 2018

Aha! Increasing the Java memory fixed it for the Grobid container when running solo. And doubling Docker's memory to 4GB appears to have fixed it when running through ScienceBeam. Thanks!

👍 🎉 🍷

@tamlyn tamlyn closed this as completed May 16, 2018
@antwal
Copy link

antwal commented Apr 18, 2019

Hi, i need reopen it; have same problem:

curl -X POST --show-error --form "file=@2016.pdf;filename=2016.pdf" http://localhost:8075/api/convert
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request.  Either the server is overloaded or there is an error in the application.</p>

in my docker yaml have added:

  grobid:
    image: lfoppiano/grobid:0.5.4
    container_name: 'grobid'
    environment:
        - JAVA_OPTS=-Xmx4g
    ports:
        - "8070:8070"
        - "8071:8071"
  sciencebeam:
    image: elifesciences/sciencebeam:latest
    container_name: 'sciencebeam'
    links:
      - grobid
    command: ./server.sh \
        --host=0.0.0.0 --port=8075 \
        --grobid-url http://grobid:8070/api
    ports:
      - "8075:8075"

I have tried also run grobid and sciencebeam from command line but i have same error

Docker full logs of Grobid is here
Docker full logs of SB is here

How can resolve?

Thanks

@de-code
Copy link
Collaborator

de-code commented Apr 18, 2019

Are you on a Mac too? If so, have you increased the memory for Docker?

@antwal
Copy link

antwal commented Apr 18, 2019

Are you on a Mac too? If so, have you increased the memory for Docker?

Ok, up memory from 4Gb to 12Gb now working

NuclearRedeye pushed a commit that referenced this issue Nov 8, 2021
* added name header model

* added name-header api route

* filter name author using header model

* added SegmentedModelNestedBluePrint

* added TestLayoutDocumentLabelResult

* refactored test_should_filter_by_line_without_token

* implemented filtering by layout token

* using id for line filtering

* minor optmisation

* updated readme

* extract authors from header

* added test_should_add_authors

* replace raw authors with author

* parse author forename surnames

* minor optimisation using set

* initial extraction of author names to tei

* added note type for author output

* split authors

* moved to separate extractor

* added simple extraction test

* parse combine raw authors block

* renamed sur_name to surname

* added support for title, middle name and suffix

* renamed label to marker

* handle marker at the beginning

* handle multiple markers at the end

* using public model

* implemented name model features
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants