[pytorch-hackathon] Document Extraction Tool (DET)

DET is an end-to-end tool for extracting Key-Value pairs from a variety of documents, built entirely on PyTorch and served using TorchServe.

Try it live on the web-app! 👋

DET Architecture

The Document Extraction Tool is composed of two main components:
1. OCR (Optical Character Recognition) and
2. Document NER (Named Entity Recognition)
The OCR component comprises a Detection module and a Recognition module that run sequentially; their combined output is then consumed by the NER component for training/prediction.

For the OCR part, we deploy a TorchServe model server on GCP using the Vertex AI service. Using TorchServe, we deploy a complex workflow in the form of a DAG comprising pre_processing, detection, and recognition models.

For the NER part, we created a training module using the transformers library; it takes the text and bounding-box results from the OCR output to train on and run predictions over documents.

The architectural flow of the two modules is shown in the DET Architecture diagram above.
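As a rough sketch of the hand-off between the two components, the snippet below mirrors the data contract implied by the response formats documented later: the OCR stage yields text plus bounding boxes, and the NER stage labels each entry with a key. The helper names are hypothetical and only illustrate the flow; the real pipeline is served by the TorchServe workflows described under Deployment.

# Illustrative only: hypothetical helpers showing the OCR -> NER data flow.

def run_ocr(image_bytes):
    """Detection + recognition; one entry per detected text region."""
    # e.g. [{"bbox": [x0, y0, x1, y1], "ocr": "TOTAL 12.50"}, ...]
    raise NotImplementedError("served by the 'ocr' TorchServe workflow")

def run_ner(ocr_results):
    """Token classification over OCR text + boxes; adds a predicted key per entry."""
    # e.g. [{"bbox": [...], "ocr": "12.50", "key": "total"}, ...]
    raise NotImplementedError("served by the 'ner_sroie'/'ner_funsd' workflows")

def extract(image_bytes):
    return run_ner(run_ocr(image_bytes))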

Components:

  • Optical Character Recognition (OCR)
    • Detection
    • Recognition
  • Named Entity Recognition (NER)
    • Receipt Dataset (SROIE)
    • Forms Dataset (FUNSD)

Contents

File Structure

deploy                           # INSTRUCTIONS/SCRIPTS for deploying model(s) on GPU/CPU
|---GPU                          # For Serving models on GPU   
|------jit-models                # contains jitted models created after tracing the model using torch.jit.trace   
|------------craft_ts.pt         # DOWNLOAD the torchscript files from here: https://drive.google.com/drive/folders/1NBSZIZzSzIVOUqnxu0PHgmy-_Tvvp2hY?usp=sharing
|------------crnn_ts.pt    
|------------sroie_ts.pt    
|------------funsd_ts.pt    
|------model-archive             # stores models along with configuration in .mar format as required by "torchserve"  
|------------model_store         # GENERATE standalone .mar files from torch-model-archiver command given below  
|------------------craft.mar     
|------------------crnn.mar
|------------------sroie.mar
|------------------funsd.mar
|------------wf_store            # GENERATE workflow .war files from torch-workflow-archiver command given below
|------------------ocr.war
|------------------ner_sroie.war
|------------------ner_funsd.war
|------config.properties         # config.properties for storing serving related configurations. 
|                                For details, refer this: https://github.com/pytorch/serve/blob/master/docs/configuration.md
|------detection_handler.py      # handler for text detection pipeline 
|------rec_handler.py            # handler for text recognition pipeline 
|------workflow_handler.py       # handler for configuring the end-to-end OCR serving pipeline
|------workflow_ocr.yaml         # define the pipeline here, also specify batch-size, num workers etc.
|------workflow_ner_sroie.yaml
|------workflow_ner_funsd.yaml
|------Dockerfile                # template for creating the GPU docker image for DET
|------.dockerignore             # files to ignore in docker image
|------cloud_deploy.ipynb        # notebook to deploy the model server on GCP Vertex AI and run predictions
|------sample_b64.json           # sample file to send request on inference api
|------index_sroie.json          # contains mapping of indexes to human-readable labels for SROIE dataset  
|------index_funsd.json          # contains mapping of indexes to human-readable labels for FUNSD dataset

|---CPU                          # Same as above except it's CPU (currently INCOMPLETE)
|------jit-models
|------------craft_ts.pt
|------------crnn_ts.pt
|------model-archive
|------------model_store         
|------------------craft.mar
|------------------crnn.mar
|------------wf_store            
|------------------ocr.war
|------config.properties
|                                
|------detection_handler.py
|------rec_handler.py
|------workflow_handler.py
|------workflow.yaml
|------Dockerfile      

train                            # NOTEBOOKS for training models on GPU/CPU. More training scripts COMING SOON!
|---NER.ipynb                    # Jupyter Notebook to train/test Document NER models and convert them to TorchScript format.

Live Demo

Try it live here! 👋

Deployment

Quick Deploy

Install Docker and NVIDIA Container Toolkit: See this link for help!

Download the image and start the container:
docker run -d --gpus all -p 7080:7080 -p 7081:7081 -p 7082:7082 abhigyanr/det-gpu:latest

To test the API, follow these steps! Note: this method requires an NVIDIA GPU and driver to be present!

Using Docker Container

Install docker and nvidia container toolkit: See this link for help!

Clone this repository and change directory:

git clone https://github.com/RamanHacks/pytorch-hackathon.git
cd pytorch-hackathon && cd deploy
cd GPU      # OR (cd CPU)

Build Docker Image

docker build -f Dockerfile -t det .

Run Docker container

docker run -d --rm --name det-cpu -p 7080:7080 -p 7081:7081 -p 7082:7082 det                # (For CPU)
docker run -d --rm --name det-gpu --gpus all -p 7080:7080 -p 7081:7081 -p 7082:7082 det     # (For GPU, use --gpus '"device=0,1"' to specify device)

Optional: Check Status

docker logs $(docker ps -l -q)      # to check if the docker container is running fine
curl localhost:7080/ping            # to check if the network is accessible from localhost, should return Healthy

Register Models

curl -X POST "localhost:7081/workflows?url=ocr.war"
curl -X POST "localhost:7081/workflows?url=ner_sroie.war"
curl -X POST "localhost:7081/workflows?url=ner_funsd.war"
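The same registration can be scripted with Python against the TorchServe management API (a minimal sketch, assuming the default management port 7081; the list call at the end is only a verification step):

import requests

MANAGEMENT = "http://localhost:7081"

# Register each workflow archive (equivalent to the curl commands above)
for wf in ("ocr.war", "ner_sroie.war", "ner_funsd.war"):
    resp = requests.post(f"{MANAGEMENT}/workflows", params={"url": wf})
    print(wf, resp.status_code, resp.text)

# List the registered workflows to verify
print(requests.get(f"{MANAGEMENT}/workflows").json())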

Optional: Stop and Remove Container

docker stop $(docker ps -l -q)
docker rm $(docker ps -l -q)

To test the API, follow these steps!

From Source

Install torch from official link: PyTorch Official
Install torchserve from official repo: TorchServe Official

Clone this repository and install dependencies:

git clone https://github.com/RamanHacks/pytorch-hackathon.git
cd pytorch-hackathon && cd deploy
pip install -r requirements.txt    
cd GPU      # OR (cd CPU)

Download the pretrained TorchScript models from Google Drive and move them into the jit-models folder.
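If you prefer to script this step, a minimal sketch using the third-party gdown package (an assumption, not a repo dependency) would be:

import gdown  # pip install gdown

# Fetch the shared Google Drive folder with the *_ts.pt files into jit-models/
folder_url = "https://drive.google.com/drive/folders/1NBSZIZzSzIVOUqnxu0PHgmy-_Tvvp2hY?usp=sharing"
gdown.download_folder(folder_url, output="jit-models", quiet=False)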

Generate .mar files:

# create model archives
torch-model-archiver -f --model-name craft --version 1.0 --serialized-file jit-models/craft_ts.pt --handler det_handler.py --export-path model-archive/model_store/

torch-model-archiver -f --model-name crnn --version 1.0 --serialized-file jit-models/crnn_ts.pt --handler rec_handler.py --export-path model-archive/model_store/

cp index_sroie.json index.json
torch-model-archiver -f --model-name sroie --version 1.0 --serialized-file jit-models/sroie_ts.pt --handler ext_handler.py --export-path model-archive/model_store/ --extra-files index.json

cp index_funsd.json index.json
torch-model-archiver -f --model-name funsd --version 1.0 --serialized-file jit-models/funsd_ts.pt --handler ext_handler.py --export-path model-archive/model_store/ --extra-files index.json

rm index.json
# create workflow archives
torch-workflow-archiver -f --workflow-name ocr --spec-file workflow_ocr.yaml --handler workflow_handler.py --export-path model-archive/wf_store/

torch-workflow-archiver -f --workflow-name ner_sroie --spec-file workflow_ner_sroie.yaml --handler workflow_handler.py --export-path model-archive/wf_store/

torch-workflow-archiver -f --workflow-name ner_funsd --spec-file workflow_ner_funsd.yaml --handler workflow_handler.py --export-path model-archive/wf_store/
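As a quick sanity check before starting the server, the following sketch verifies that every expected archive landed in the export paths used above:

from pathlib import Path

expected = [
    "model-archive/model_store/craft.mar",
    "model-archive/model_store/crnn.mar",
    "model-archive/model_store/sroie.mar",
    "model-archive/model_store/funsd.mar",
    "model-archive/wf_store/ocr.war",
    "model-archive/wf_store/ner_sroie.war",
    "model-archive/wf_store/ner_funsd.war",
]
missing = [p for p in expected if not Path(p).is_file()]
print("all archives present" if not missing else f"missing: {missing}")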

Start Model Server

torchserve --start --model-store model-archive/model_store/ --workflow-store model-archive/wf_store/ --ncs --ts-config config.properties
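Once the server is up, a quick readiness check from Python (assuming the default inference port 7080 from config.properties) is:

import requests

# TorchServe health endpoint; should report "Healthy" once workers are up
print(requests.get("http://localhost:7080/ping").json())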

Register Models

curl -X POST "localhost:7081/workflows?url=ocr.war"
curl -X POST "localhost:7081/workflows?url=ner_sroie.war"
curl -X POST "localhost:7081/workflows?url=ner_funsd.war"

Optional: Stop TorchServe

torchserve --stop

To test the API, follow these steps!

Sample Request

Request format: Create a sample JSON file containing the base64 value of an image:

{
    "b64": "<base64 value of an image>"
}
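Such a file can be generated programmatically; the sketch below writes a sample_b64.json-style payload (the image path is hypothetical, the key name follows the format above):

import base64
import json

with open("receipt.jpg", "rb") as f:  # hypothetical input image
    payload = {"b64": base64.b64encode(f.read()).decode("utf-8")}

with open("sample_b64.json", "w") as f:
    json.dump(payload, f)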

Response format of OCR only, i.e. when hitting "/wfpredict/ocr":

[
    {
        "bbox": [[<top-left>], [<top-right>], [<bottom-left>], [<bottom-right>]],
        "ocr": [<value>, <confidence>]
    }
]

Response format of OCR+NER, i.e. when hitting "/wfpredict/ner_sroie" or "/wfpredict/ner_funsd":

[
    {
        "bbox": [<top-left-x>, <top-left-y>, <bottom-right-x>, <bottom-right-y>],
        "ocr": <value>,
        "key": <value>
    }
]
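A small helper (not part of the repo, shown only for illustration) can then group such a response into a key-to-values mapping:

from collections import defaultdict

def group_by_key(ner_response):
    """Collect the recognized text under each predicted key/label."""
    grouped = defaultdict(list)
    for item in ner_response:
        grouped[item["key"]].append(item["ocr"])
    return dict(grouped)

# e.g. {"company": ["ACME LTD"], "total": ["12.50"], ...}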

Using CURL

Sample CURL Request

curl -X POST -H "Content-Type: application/json; charset=utf-8" -d @sample_b64.json localhost:7080/wfpredict/ner_sroie

From Python file

Python function to convert an image to base64, send a request, and return predictions:

import base64
import requests
def sample_request(image_file_path):
    def convert_b64(image_file):
        """Open image and convert it to Base64"""
        with open(image_file, "rb") as input_file:
            jpeg_bytes = base64.b64encode(input_file.read()).decode("utf-8")
        return jpeg_bytes

    req = {"data": convert_b64(image_file_path)}
    res = requests.post("http://localhost:7080/wfpredict/ner_sroie", json=req)

    return res.json()
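A quick way to try it (the image path is hypothetical; assumes the ner_sroie workflow is registered):

predictions = sample_request("receipt.jpg")
for item in predictions:
    print(item["key"], "->", item["ocr"])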

Training

Custom NER

Jump to NER.ipynb for details on training and testing Document NER models!
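The jit-models/*.pt artifacts referenced under File Structure are TorchScript files produced with torch.jit.trace; the toy example below sketches that trace-and-save step (a stand-in module replaces the trained NER model; see NER.ipynb for the actual conversion code):

import torch

class ToyModel(torch.nn.Module):
    """Stand-in for the trained NER model, just to show the tracing step."""
    def forward(self, x):
        return x * 2

model = ToyModel().eval()
example_input = torch.randn(1, 3)
traced = torch.jit.trace(model, example_input)   # record the forward pass
traced.save("toy_ts.pt")                         # real files: jit-models/*_ts.pt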

Custom OCR

-----Coming-Soon-----

Model Optimization

-----Coming Soon-----

Support Our Work

-----If you like our work, do not forget to ⭐ this repository and follow us on Twitter and LinkedIn-----
-----If you have any specific feature requests, contact us at admin@docyard.ai-----

License

Apache License 2.0
