CRRF Data Extraction Toolkit

A web application for PDF content and table extraction, featuring image-based visual layout analysis, indexed document search, batch processing and extraction result annotation.

Features

PDF Processing (t-pdf)

  • 📄 PDF page layout analysis using an image-based visual method that segments tables and text boxes
  • 🧪 Unit-tested layout analysis results for quality assurance
  • 🚀 High-speed analysis: image processing written in NumPy + scikit-image, achieving about 3 pages/sec per 1000 points of Geekbench score on a single core.
  • 🧬 Conversion from PDF files to structured JSON
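As a rough illustration of the speed figure above, the sketch below scales the quoted rate linearly with a machine's single-core Geekbench score. The linear scaling and the function names are assumptions made for this example, not part of the toolkit.

```python
def estimated_pages_per_second(geekbench_single_core: float,
                               pages_per_sec_per_1000: float = 3.0) -> float:
    """Scale the quoted rate linearly with the single-core score (an assumption)."""
    return pages_per_sec_per_1000 * geekbench_single_core / 1000.0


def estimated_seconds(num_pages: int, geekbench_single_core: float) -> float:
    """Rough single-core time needed to analyze num_pages pages."""
    return num_pages / estimated_pages_per_second(geekbench_single_core)


# Hypothetical machine with a single-core score of 1500.
print(estimated_pages_per_second(1500))
print(estimated_seconds(900, 1500))
```

Real throughput depends on page complexity, so treat the result as an order-of-magnitude estimate.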

Documents Management

  • 📁 Manage a repository of folders of PDF files
  • 🔎 Search using keywords and phrases (ngram) inside the PDF documents, designed for numerical value extraction, with:
    • "double quoted phrases"
    • -excluded_words
    • -"excluded phrases"
  • 🏷️ Manage a list of persisted search queries, known as "filters", for quick recall and batch execution. Associate a search query with a list of tags.
  • Fully asynchronous task processing, with a configurable number of parallel processes
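The search syntax listed above (bare keywords, "double quoted phrases", -excluded_words, -"excluded phrases") can be sketched with a small parser. This is a minimal illustration only: the names parse_query, ParsedQuery and matches are hypothetical, and the matcher uses plain substring tests rather than the indexed n-gram search the toolkit performs.

```python
import re
from dataclasses import dataclass, field

# One token: optional leading "-", then either a "quoted phrase" or a bare word.
TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')


@dataclass
class ParsedQuery:
    include: list = field(default_factory=list)  # terms that must appear
    exclude: list = field(default_factory=list)  # terms that must not appear


def parse_query(query: str) -> ParsedQuery:
    """Split a query string into included and excluded terms (lower-cased)."""
    parsed = ParsedQuery()
    for minus, phrase, word in TOKEN.findall(query):
        term = phrase if phrase else word
        if not term:
            continue  # skip empty quotes such as ""
        (parsed.exclude if minus else parsed.include).append(term.lower())
    return parsed


def matches(text: str, parsed: ParsedQuery) -> bool:
    """Naive case-insensitive substring match (not the toolkit's indexed search)."""
    haystack = text.lower()
    return (all(term in haystack for term in parsed.include)
            and not any(term in haystack for term in parsed.exclude))


query = parse_query('revenue "net income" -draft -"pro forma"')
print(query.include, query.exclude)
```

Keeping included and excluded terms in separate lists makes a saved "filter" easy to store and to re-run in batch.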

Batch Processing, User and Annotation Projects

  • 💼 Create batch processing projects to run a selection of "filters" against a selection of folders and documents, generating a collection of segments in JSON format for download.
  • 🏷️ Convert the segments into an annotation project
  • 📱 A mobile-browser-friendly infinite-scrolling web app for annotating small segments collected from the documents
  • 🧑‍💼 Invitation-based user registration system, with admin-accessible document management and user-accessible annotation

Developing

Clone the repo:

$ git clone git@github.com:os-climate/crrf-det.git

Frontend

We use Vite as the tooling for our React-based frontend. To start the frontend, first install Node.js in your local environment, and make sure the npm command is available. Then:

$ cd crrf-det/src/fe
$ npm install
$ npm run dev

After installing dependencies, this launches the frontend server at http://localhost:5173/. Note that the default setup in the repository assumes development on localhost. To deploy the program to another host, consult the Deployment section.

Backend

We use Docker as the backend development environment. To launch the backend, first install the Docker edition appropriate for your local environment. Then:

$ cd crrf-det
$ docker-compose build
$ docker-compose up

This will bring up a Sanic-based backend at port 8000, with a Redis database at port 6379. Additionally, it creates the dev-data folder (at the same level as docker-compose.yml) for persisted data. No information is persisted in the Redis database; it is used primarily for running and keeping track of asynchronous tasks.
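To check that both services came up, a small standard-library script can probe the two ports. This sketch assumes the ports are published to localhost by the compose file; the helper name port_is_open is hypothetical.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the text above; reachability from the host is an assumption.
    for name, port in (("backend (Sanic)", 8000), ("Redis", 6379)):
        state = "up" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port only shows that something is listening; it does not confirm that the service behind it is healthy.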

Note our setup uses an x86_64 base image.

You need a first admin user before you can use any functionality. Create one manually inside the container:

$ (sudo) docker exec -ti crrf-det-be-1 bash
# python
>>> import data.user
>>> data.user.add('admin', 'password', 0)
>>> quit()

Visit http://localhost:5173/ and log in as that user. The final argument 0 in the call is the user's level. You need level 0 (the highest) to access PDF documents and project functionality. Levels > 0 can only access the annotation app.
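The level rule above can be summarized in a few lines. This is an illustrative sketch, not the backend's actual check: the area names and the function allowed_areas are hypothetical, and it assumes that level 0 users can also open the annotation app.

```python
ADMIN_LEVEL = 0  # level 0 is the highest, per the text above


def allowed_areas(level: int) -> set:
    """Areas a user of the given level may open (area names are hypothetical)."""
    areas = {"annotation"}  # assumed to be available to every level
    if level == ADMIN_LEVEL:
        areas |= {"documents", "projects"}
    return areas


print(sorted(allowed_areas(0)))
print(sorted(allowed_areas(1)))
```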

Tests

Unit tests currently cover only the PDF page layout analysis portion of the code, which is written in Python. Once you have the development containers set up, you can run the tests from inside the backend container:

$ (sudo) docker exec -ti crrf-det-be-1 bash
# python -m unittest

The tests only guarantee that the layout analysis code, including the portions that split columns and rows and finally guess the location of the table, works as intended.

Deployment

We have written a small script to build size-optimized Docker images for deployment. To build for deployment, first determine the target hostname and port (both must be known in advance, because of CORS in the backend and the API endpoints in the frontend). Then:

$ cd crrf-det/deploy
$ ./build.sh //hostname:port
$ (sudo) docker save det-be-dist -o det-be-dist.tar
$ (sudo) docker save det-fe-dist -o det-fe-dist.tar

Note that //hostname:port is only used when building the frontend: the destination API endpoints are hard-coded into the code before compilation. To set up CORS handling in the backend, set the HOST_FE_URL variable in docker-compose.yml.

Once you have the two images (frontend and backend, as .tar files), copy them to your host and use the reference docker-compose.yml file in the deploy folder to set them up.

!!! Security Consideration !!!

Change the following environment variables from their example values when deploying:

  - JWT_SECRET=crrf-det-jwt-SECRET!!!501015
  - PASSWORD_SALT=crrf-det-salt-50-10-15
  - URL_SIGN_SECRET=86c935bc079ba1fef55809e2f575426c

These variables are the secrets behind authentication tokens (JWT), password salting, and URL signing. Using the example values as-is gives an attacker the opportunity to forge your authentication tokens.

For JWT_SECRET and PASSWORD_SALT, any sufficiently long random string will do. To generate URL_SIGN_SECRET, a safe way is to do it inside the container:

$ (sudo) docker exec -ti crrf-det-be-1 bash
# python
>>> import service.sign
>>> service.sign.generate_key()
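For JWT_SECRET and PASSWORD_SALT, one way to produce suitably long random strings is Python's standard secrets module. This is a sketch under the assumption that any random string is acceptable for those two variables; the helper name random_secret is hypothetical. Keep using service.sign.generate_key() for URL_SIGN_SECRET, since its expected format is defined by the backend.

```python
import secrets


def random_secret(num_bytes: int = 32) -> str:
    """Return a URL-safe random string built from num_bytes of OS randomness."""
    return secrets.token_urlsafe(num_bytes)


# Print lines that can be pasted into the environment section of docker-compose.yml.
print("JWT_SECRET=" + random_secret())
print("PASSWORD_SALT=" + random_secret())
```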
