🏆 An applicant tracking system (ATS) is a software application that enables the electronic handling of recruitment and hiring needs. Corporate recruiters or hiring managers can then search and sort through the resumes in a number of ways, depending on the needs

License

MIT (LICENSE, LICENSE.md)

RocktimRajkumar/ATS

Template Based Information Extraction Rule (TIER) - Parser

ATS is released under the MIT license.

Powered by machine learning, TIER-Parser recognizes and processes highly variable invoices and documents, with capabilities beyond those of a plain Optical Character Recognition (OCR) program. It extracts complex data from highly varied, multifaceted business invoices. Our cloud-based technology provides security and scalability, as well as 24/7 access on any device.

Features

  • Build reusable extraction templates based on text-mining patterns and extract the desired data from unstructured documents.
  • Automate the entire business process around the ingestion of unstructured data.

Prerequisites

Before you begin, ensure you have met the following requirements:

  1. Python 3.8 or later.
  2. AWS account.

Description

For a long time, we have relied on paper invoices to process payments and maintain accounts. Reconciling invoices typically involves someone manually spending hours browsing through several invoices and jotting things down in a ledger.

But can this process be done better, more efficiently, with less wastage of paper, human labor and time?

So we present the Template Based Information Extraction Rule (TIER) Parser.

It performs intelligent, template-based data extraction: it picks out the significant fields from incoming documents that share a similar layout and returns them as meaningful information.

The system depends on a knowledge base that contains facts and rules. The facts are derived from a standard model such as AWS Comprehend. The rules are the designed templates, which help detect the meaning of the text.

How does TIER-Parser work?

TIER-Parser is a three-step procedure:

  • AWS Textract analyzes the document, extracts the essential data, and formats it as required.
  • The lines of text are grouped based on the template.
  • The text is classified (invoice date, address, name, etc.).
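
To make the first step more concrete, here is a minimal sketch of pulling the detected lines out of a Textract response. It assumes the documented Textract response shape: a `Blocks` list whose `LINE` entries carry `Text` and a `Geometry.BoundingBox` with `Left`, `Top`, `Width`, and `Height` given as ratios of the page size. The function name `extract_lines` and the sample response are illustrative and are not taken from this repository.

```python
def extract_lines(response):
    """Return the text and bounding box of every LINE block in a Textract-style response."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        lines.append({
            "text": block["Text"],
            "left": box["Left"],
            "top": box["Top"],
            "width": box["Width"],
            "height": box["Height"],
        })
    return lines


# Hand-written sample standing in for a real Textract response (assumption).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice No: 1001",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05,
                                      "Width": 0.2, "Height": 0.02}}},
        {"BlockType": "WORD", "Text": "Invoice",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.05,
                                      "Width": 0.08, "Height": 0.02}}},
    ]
}

lines = extract_lines(sample_response)
print(lines)
```

Only the `LINE` block survives the filter, so `lines` holds a single entry for "Invoice No: 1001" together with its normalized geometry.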

Technologies

  • AWS services (S3, Textract, Comprehend)
  • Python

How To Use

Installation

The AWS Command Line Interface (CLI) is a unified tool for managing AWS services. Install it for your platform:

Windows
Download

Linux

$ curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
$ unzip awscliv2.zip
$ sudo ./aws/install

macOS

$ curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
$ sudo installer -pkg AWSCLIV2.pkg -target /

Requirements Files
A requirements file lists the packages to be installed with pip install.

    $ python -m pip install -r requirements.txt

Configuration

Configuring the AWS CLI lets you interact with AWS services. The settings include your security credentials, the default output format, and the default AWS Region.

     AWS requires that all incoming requests are cryptographically signed. The AWS CLI does this for you.

Quickly Configuring the AWS CLI

$ aws configure
AWS Access Key ID [None]: AKIA************MPLE
AWS Secret Access Key [None]: wJal************************EKEY
Default region name [None]: us-west-2
Default output format [None]: json

When you enter this command, the AWS CLI prompts you for four pieces of information: access key ID, secret access key, AWS Region, and output format.
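
As a quick sanity check after `aws configure`, the sketch below reports which of the four values are missing. It relies on the fact that the AWS CLI stores the keys in `~/.aws/credentials` and the region and output format in `~/.aws/config`, both in INI format. The helper name `missing_settings` is hypothetical, and the file contents are passed in as strings with placeholder values so the example never touches real credentials.

```python
import configparser

# Keys written by `aws configure`, per file.
REQUIRED = {
    "credentials": ("aws_access_key_id", "aws_secret_access_key"),
    "config": ("region", "output"),
}


def missing_settings(credentials_text, config_text, profile="default"):
    """Return the `aws configure` values that are absent from the given file contents."""
    missing = []
    for label, text in (("credentials", credentials_text), ("config", config_text)):
        parser = configparser.ConfigParser(interpolation=None)
        parser.read_string(text)
        for key in REQUIRED[label]:
            if not parser.has_option(profile, key):
                missing.append(key)
    return missing


# Placeholder file contents (assumption); real files live under ~/.aws/.
credentials_text = (
    "[default]\n"
    "aws_access_key_id = AKIAEXAMPLE\n"
    "aws_secret_access_key = secretEXAMPLE\n"
)
config_text = "[default]\nregion = us-west-2\n"

missing = missing_settings(credentials_text, config_text)
print(missing)
```

In this sample the output format was never set, so the check reports `output` as the only missing value.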

How To create access keys

Steps

Usage

This section explains how to use the project once it is installed.

  1. Designed Template
    The templates are designed according to the morphological, syntactic, and vocabulary components of the text.
    They help extract meaningful text from unstructured documents.
    A template records the resolution of the image and a list of groups. Each group has a group id, a name, a description, and two coordinate pairs, (x0, y0) and (x1, y1).
    Example: [image: template]
    In the image, the red boxes are the groups and the green dots are the coordinates (x0, y0) and (x1, y1).
    Syntax of a template:
    {
        "resolution": {
            "width": 1654,
            "height": 2339
        },
        "group": [
            {
                "gid": "0",
                "gname": "Group0",
                "desc": "Group0 Description",
                "x0": 70,
                "y0": 60,
                "x1": 639,
                "y1": 404
            },
            {
                "gid": "1",
                "gname": "Group1",
                "desc": "Group1 Description",
                "x0": 1046,
                "y0": 70,
                "x1": 1625,
                "y1": 375
            }
        ]
    }

Create your template and save it inside the template directory as a .json file.
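
The sketch below shows one way a template like this could be used to group detected lines. It is an illustration under stated assumptions, not the project's actual implementation: it assumes (x0, y0) is the top-left corner and (x1, y1) the bottom-right corner in pixels at the template resolution, that each line carries normalized `left`, `top`, `width`, and `height` values, and that a line belongs to the first group containing its center point. The function name `group_lines` and the sample lines are hypothetical.

```python
import json

TEMPLATE_JSON = """
{
    "resolution": {"width": 1654, "height": 2339},
    "group": [
        {"gid": "0", "gname": "Group0", "desc": "Group0 Description",
         "x0": 70, "y0": 60, "x1": 639, "y1": 404},
        {"gid": "1", "gname": "Group1", "desc": "Group1 Description",
         "x0": 1046, "y0": 70, "x1": 1625, "y1": 375}
    ]
}
"""


def group_lines(template, lines):
    """Assign each line to the first template group that contains its center point."""
    width = template["resolution"]["width"]
    height = template["resolution"]["height"]
    grouped = {g["gid"]: [] for g in template["group"]}
    for line in lines:
        # Scale the normalized center of the line to template pixels.
        cx = (line["left"] + line["width"] / 2) * width
        cy = (line["top"] + line["height"] / 2) * height
        for g in template["group"]:
            if g["x0"] <= cx <= g["x1"] and g["y0"] <= cy <= g["y1"]:
                grouped[g["gid"]].append(line["text"])
                break  # lines outside every group are dropped
    return grouped


template = json.loads(TEMPLATE_JSON)
# Hand-written sample lines (assumption), in normalized page coordinates.
lines = [
    {"text": "ACME Corp", "left": 0.05, "top": 0.04, "width": 0.2, "height": 0.02},
    {"text": "Invoice No: 1001", "left": 0.70, "top": 0.05, "width": 0.2, "height": 0.02},
    {"text": "Thank you", "left": 0.40, "top": 0.90, "width": 0.2, "height": 0.02},
]
grouped = group_lines(template, lines)
print(grouped)
```

Testing the center point rather than the full bounding box keeps a line in its group even when its box slightly overlaps the group border; a stricter containment test would be an equally reasonable choice.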

  2. Run Program

       $ python main.py "bucketName" "fileName" "templateName"
    

    "bucketName" :- Name of the S3 bucket
    "fileName" :- Name of the file in the S3 bucket
    "templateName" :- Name of the template created inside the template directory

    e.g.

    $ python main.py poc-cloud-bucket invoice.pdf template.json
    

You will see output similar to the following: [image: output]

The program creates three JSON files.

  • The first file contains all the lines of text detected in the document, along with the location and geometry of each item found on the page.

  • The second file groups the detected lines of text based on the template.

  • The third file reads each group of lines and extracts information such as invoice date, invoice number, date, name, and address.
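
As a rough illustration of the third file's classification step, the sketch below turns a Comprehend-style entity response into a field dictionary. It assumes the documented shape of an entity-detection response (an `Entities` list whose items have `Type`, `Text`, and `Score`). The function name `entities_by_type`, the 0.8 score threshold, and the sample response are hypothetical and are not taken from this repository.

```python
from collections import defaultdict


def entities_by_type(response, min_score=0.8):
    """Collect entity texts per entity type, keeping only confident detections."""
    result = defaultdict(list)
    for entity in response.get("Entities", []):
        if entity["Score"] >= min_score:
            result[entity["Type"]].append(entity["Text"])
    return dict(result)


# Hand-written sample standing in for a real Comprehend response (assumption).
sample = {
    "Entities": [
        {"Type": "DATE", "Text": "12 March 2020", "Score": 0.99},
        {"Type": "ORGANIZATION", "Text": "ACME Corp", "Score": 0.95},
        {"Type": "QUANTITY", "Text": "3 items", "Score": 0.40},
    ]
}

fields = entities_by_type(sample)
print(fields)
```

The low-scoring `QUANTITY` entity is filtered out, leaving only the date and the organization name.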

Contributing

To contribute to TIER-Parser, follow these steps:

  1. Fork this repository.
  2. Create a branch: git checkout -b <branch_name>.
  3. Make your changes and commit them: git commit -m '<commit_message>'
  4. Push to the original branch: git push origin <project_name>/
  5. Create the pull request.

Alternatively, see the GitHub documentation on creating a pull request.


Credits

Thanks to the following people who have contributed to this project.


License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT license.


Author Info
