API Guide

This page is a guide on how to use the API.

0. Introduction
1. Send Your Document: POST /document
2. Get the queue status: GET /queue/{id}
3. Get the results
4. Server Configuration Access

0. Introduction

First of all there is a few things to know:

The API is RESTful: The API is over HTTP and follow REST standards.
The API is asynchronous: There is a simple queue system and every job is managed by the API server.

The API has an endpoint prefix /api and then, optionally, the version number /v1.0. That mean every request must be send to:

/api/v1.0: will use the API version 1.0
/api/v1: will use the latest API version 1.x
/api: will use the latest API version

1. Send Your Document: `POST /document`

First of all, you need to do a POST request to send the document to Parsr. Along that, you can send the configuration file to tell Parsr what kind of processing it must perform on the file. If you don't provide a config file, the default config file will be used.

Regarding the configuration file, please refer to the configuration file documentation. (Tip: You can also obtain the default configuration on the server via the endpoint: /api/v1/default-config. See Section 4.)

`curl` command

curl -X POST \
  http://localhost:3001/api/v1/document \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@/path/to/file.pdf;type=application/pdf' \
  -F 'config=@/path/to/config.json;type=application/json'

Windows tip: Use double quotes " instead of single quote '

curl -X POST \
  http://localhost:3001/api/v1/document \
  -H "Content-Type: multipart/form-data" \
  -F "file=@/path/to/file.pdf;type=application/pdf" \
  -F "config=@/path/to/config.json;type=application/json"

Status: 202 - Accepted

00cafe4463b9c12aac145b3ee8f00d

The document you sent has been accepted and is being processed. The body contains the unique queue ID. You need to keep it somewhere for later, to know what's the queue status and get the results.

Status: 415 - Unsupported Media Type

This error means the file format you sent is not supported by the platform.

2. Get the queue status: `GET /queue/{id}`

This request allows you to get the status of the queued document being processed. You need to give it the queue ID that was return in the previous request.

`curl` command

curl -X GET \
  http://localhost:3001/api/v1/queue/00cafe4463b9c12aac145b3ee8f00d

Status: 200 - OK

{
  "estimated-remaining-time": 30,
  "progress-percentage": 10,
  "start-date": "2018-12-31T12:34:56.789Z",
  "status": "Detecting reading order..."
}

This status means the document is still being processed.

The estimated-remaining-time is expressed in seconds.

NB: estimated-remaining-time and progress-percentage are not working yet and are placeholder for future usage.

Status: 201 - Created

{
  "id": "00cafe4463b9c12aac145b3ee8f00d",
  "json": "/api/v1/json/00cafe4463b9c12aac145b3ee8f00d",
  "csv": "/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d",
  "text": "/api/v1/text/00cafe4463b9c12aac145b3ee8f00d",
  "markdown": "/api/v1/markdown/00cafe4463b9c12aac145b3ee8f00d"
}

This status is sent when the processing is done. It returns links to the generated resources and the ID of the document for convenience.

Status: 404 - Not Found

This error means the queue ID doesn't refer to any known processing queue.

Status: 500 - Internal Server Error

This error means that something went terribly wrong on the backend, probably an error coming from Parsr.

3. Get the results

You can have results in different formats:

JSON: GET /json/{id}
Markdown: GET /markdown/{id}
Raw text: GET /text/{id}
CSV: GET /csv/{id}

These requests allow you to get the results of the processed document. You need to give it the queue ID that was return in a previous request.

3.1. JSON, Markdown and Text results

The queries for JSON, Markdown and raw text are all working in the same way. CSV is a bit different and is described in the next section.

`curl` command

curl -X GET \
  http://localhost:3001/api/v1/json/00cafe4463b9c12aac145b3ee8f00d

Status: 200 - OK

{
	"metadata": [/* ... */],
	"fonts": [/* ... */],
	"pages": [/* ... */],
}

For more information on the JSON format, please refer to the specific guide.

Status: 404 - Not Found

This error means that the result file doesn't exist. Maybe it wasn't asked to be outputted in the config you sent in the first request.

3.2. CSV List of Files: `GET /csv/{id}`

Since you can have multiple tables per page, you need to query them in two steps:

First of all, get the list of every CSV files' paths:

`curl` command

curl -X GET \
  http://localhost:3001/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d

Status: 200 - OK

[
  "/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d/1/1",
  "/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d/2/1",
  "/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d/2/2",
  "/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d/3/1"
]

Status: 404 - Not Found

This error means that the result file doesn't exist. Maybe it wasn't asked to be outputted in the config you sent in the first request.

3.3. CSV File: `GET /csv/{id}/{page}/{table}`

Then, we can get the CSV files one by one with the following parameters:

{id} is the ID of the document
{page} is the page number
{table} is the table number

`curl` command

curl -X GET \
  http://localhost:3001/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d/1/1

Status: 200 - OK

3x4 table;Empty column;Numbers
;;
Item A;;3.14
"Item B
on two lines";;1,234.56

This CSV output example contains multiline cells and an empty column.

Status: 404 - Not Found

This error means that the result file doesn't exist. Maybe {page} and {table} parameters doesn't refer to an or it wasn't asked to be outputted in the config you sent in the first request.

3.4. Download Results

You can download any of the available output formats:

JSON: GET /json/{id}?download=1
Markdown: GET /markdown/{id}?download=1
Raw text: GET /text/{id}?download=1
CSV: GET /csv/{id}?download=1

Being {id} the same queue ID obtained in Section 3 - Get the results.

For JSON and Raw Text, a json or txt file will start downloading. For Markdown, if the document has any embedded assets like images, a zip file will start downloading, including the markdown and a folder with all required assets. If it does not contain any images, a single md file will be downloaded. For CSV option, a zip will be downloaded, containing one csv file per each table in the document.

4. Server Configuration Access

The API can also be queried to gain access to the following server assets:

4.1. Default Configuration

The server's default configuration can be queried (at /api/v1/default-config) using:

curl -X GET \
http://localhost:3001/api/v1/default-config

4.2. List of Modules

The list of all usable modules can be queried from the server (at /api/v1/modules) using:

curl -X GET \
http://localhost:3001/api/v1/modules

4.3. Module Configuration

A module's configuration file, which includes name, description and each module parameter's default value and range can be queried (at /api/v1/module-config/<module_name>) using:

curl -X GET \
http://localhost:3001/api/v1/module-config/table-detection

... which will fetch the configuration file for the table-detection module.

4.4. Checking Installed Dependencies

If you run into troubles running Parsr, a good first way to check if every required dependency is installed is by going to:

http://localhost:3001/api/v1/check-installation

A table will be displayed, showing every required (and optional) dependency, and it's path in your system. If you find that your system is missing a dependency, refer to the installation guide to fix the problem.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

api-guide.md

api-guide.md

API Guide

0. Introduction

1. Send Your Document: `POST /document`

`curl` command

Status: 202 - Accepted

Status: 415 - Unsupported Media Type

2. Get the queue status: `GET /queue/{id}`

`curl` command

Status: 200 - OK

Status: 201 - Created

Status: 404 - Not Found

Status: 500 - Internal Server Error

3. Get the results

3.1. JSON, Markdown and Text results

`curl` command

Status: 200 - OK

Status: 404 - Not Found

3.2. CSV List of Files: `GET /csv/{id}`

`curl` command

Status: 200 - OK

Status: 404 - Not Found

3.3. CSV File: `GET /csv/{id}/{page}/{table}`

`curl` command

Status: 200 - OK

Status: 404 - Not Found

3.4. Download Results

4. Server Configuration Access

4.1. Default Configuration

4.2. List of Modules

4.3. Module Configuration

4.4. Checking Installed Dependencies

Files

api-guide.md

Latest commit

History

api-guide.md

File metadata and controls

API Guide

0. Introduction

1. Send Your Document: POST /document

curl command

Status: 202 - Accepted

Status: 415 - Unsupported Media Type

2. Get the queue status: GET /queue/{id}

curl command

Status: 200 - OK

Status: 201 - Created

Status: 404 - Not Found

Status: 500 - Internal Server Error

3. Get the results

3.1. JSON, Markdown and Text results

curl command

Status: 200 - OK

Status: 404 - Not Found

3.2. CSV List of Files: GET /csv/{id}

curl command

Status: 200 - OK

Status: 404 - Not Found

3.3. CSV File: GET /csv/{id}/{page}/{table}

curl command

Status: 200 - OK

Status: 404 - Not Found

3.4. Download Results

4. Server Configuration Access

4.1. Default Configuration

4.2. List of Modules

4.3. Module Configuration

4.4. Checking Installed Dependencies

1. Send Your Document: `POST /document`

`curl` command

2. Get the queue status: `GET /queue/{id}`

`curl` command

`curl` command

3.2. CSV List of Files: `GET /csv/{id}`

`curl` command

3.3. CSV File: `GET /csv/{id}/{page}/{table}`

`curl` command