Extract Key-Value pairs from OCR output in Bot

This sample bot accepts a form image from the user, extracts the requested information, and replies to the user with the results in a card. It works with both structured and semi-structured forms.

As a test dataset, I converted some e-mails into a form layout and then used those structured e-mails to extract key-value pairs.

This bot was created using the Microsoft Bot Framework.

Defining Reference Text (Key) and Desired Value Margins

Each search field is defined in the following format:

{
  "id": 2,
  "text": "PRIORITY",   // Your reference text (the key to look for)
  "marginX": -30,       // Horizontal margin between the reference text and your value field (negative = to the left)
  "marginY": 30,        // Vertical margin between the reference text and your value field
  "width": 100,         // Width of your text area
  "height": 100         // Height of your text area
}
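To make the role of the margins concrete, here is a minimal Python sketch of how a search region could be derived from a definition like the one above. It is an illustration only, not the bot's C# implementation: the `Box` type, the `search_region` function, and the sample coordinates are hypothetical, and it assumes the margins are measured from the top-left corner of the reference text's bounding box, which the project may handle differently.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in image pixels (hypothetical helper type)."""
    left: int
    top: int
    width: int
    height: int


def search_region(key_box: Box, field: dict) -> Box:
    """Shift the key's top-left corner by the margins and apply the configured size.

    Assumption: margins are relative to the key's top-left corner.
    """
    return Box(
        left=key_box.left + field["marginX"],
        top=key_box.top + field["marginY"],
        width=field["width"],
        height=field["height"],
    )


priority = {"id": 2, "text": "PRIORITY", "marginX": -30, "marginY": 30,
            "width": 100, "height": 100}

# Hypothetical OCR bounding box for the word "PRIORITY".
key_box = Box(left=400, top=120, width=80, height=18)

region = search_region(key_box, priority)
print(region)  # Box(left=370, top=150, width=100, height=100)
```

Under this assumption, a negative `marginX` moves the region to the left of the key and a positive `marginY` moves it below the key.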

The output for the definition above looks like this:

Extract Key-Values from Mixed Structured Content

Let's use one of the JFK Files, shown below. We want to extract the CLASSIFIED MESSAGE, DEFERRED, PRIORITY, DTG, INCOMING NUMBER and DATE values.

The JSON definitions for the regions look like this:

[
  {
    "id": 0,
    "text": "CLASSIFIED MESSAGE",
    "marginX": 5,
    "marginY": 30,
    "width": 200,
    "height": 120
  },
  {
    "id": 1,
    "text": "DEFERRED",
    "marginX": -30,
    "marginY": 30,
    "width": 100,
    "height": 100
  },
  {
    "id": 2,
    "text": "PRIORITY",
    "marginX": -30,
    "marginY": 30,
    "width": 100,
    "height": 100
  },
  {
    "id": 3,
    "text": "DTG",
    "marginX": 5,
    "marginY": 30,
    "width": 200,
    "height": 200
  },
  {
    "id": 4,
    "text": "INCOMING NUMBER",
    "marginX": 0,
    "marginY": 30,
    "width": 200,
    "height": 100
  },
  {
    "id": 5,
    "text": "DATE",
    "marginX": 50,
    "marginY": -20,
    "width": 300,
    "height": 50
  }
]
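Several of these keys, such as CLASSIFIED MESSAGE and INCOMING NUMBER, span more than one word, while OCR engines commonly return one bounding box per word. The Python sketch below shows one way a multi-word key could be located and its word boxes merged. The `find_key` function, the word dictionary layout, the case-insensitive comparison, and the sample words are all assumptions made for illustration, not the bot's actual code.

```python
def find_key(words, key_text):
    """Return the merged (left, top, width, height) box of the first run of
    consecutive OCR words that matches key_text, or None if there is no match.

    Matching is case-insensitive (an assumption of this sketch).
    """
    targets = key_text.upper().split()
    n = len(targets)
    for i in range(len(words) - n + 1):
        run = words[i:i + n]
        if [w["text"].upper() for w in run] == targets:
            left = min(w["left"] for w in run)
            top = min(w["top"] for w in run)
            right = max(w["left"] + w["width"] for w in run)
            bottom = max(w["top"] + w["height"] for w in run)
            return (left, top, right - left, bottom - top)
    return None


# Hypothetical OCR output in reading order; the value "48213" is made up.
words = [
    {"text": "INCOMING", "left": 100, "top": 50, "width": 90, "height": 16},
    {"text": "NUMBER", "left": 195, "top": 50, "width": 70, "height": 16},
    {"text": "48213", "left": 110, "top": 85, "width": 60, "height": 16},
]

key_box = find_key(words, "INCOMING NUMBER")
print(key_box)  # (100, 50, 165, 16)
```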

With the definitions above, the search regions are set as shown below.

After that, the values are extracted successfully, as shown below.

Extract Key-Values from Semi-Structured Content

Let's use another of the JFK Files, shown below. We want to extract the FROM, TITLE, AGENCY ORIGINATOR, RECORD NUMBER, RECORD SERIES and AGENCY FILE NUMBER values.

The JSON definitions for the regions look like this:

[
  {
    "id": 0,
    "text": "RECORD NUMBER",
    "marginX": 220,
    "marginY": -5,
    "width": 300,
    "height": 25
  },
  {
    "id": 1,
    "text": "RECORD SERIES",
    "marginX": 220,
    "marginY": -5,
    "width": 300,
    "height": 20
  },
  {
    "id": 2,
    "text": "AGENCY FILE NUMBER",
    "marginX": 300,
    "marginY": -5,
    "width": 300,
    "height": 25
  },
  {
    "id": 3,
    "text": "AGENCY ORIGINATOR",
    "marginX": 200,
    "marginY": -5,
    "width": 150,
    "height": 25
  },
  {
    "id": 4,
    "text": "FROM",
    "marginX": 50,
    "marginY": -5,
    "width": 150,
    "height": 25
  },
  {
    "id": 5,
    "text": "TITLE",
    "marginX": 50,
    "marginY": -5,
    "width": 600,
    "height": 25
  }
]

With the definitions above, the search regions are set as shown below.

After that, the values are extracted successfully, as shown below.

Using the same approach, let's assume our sample is a semi-structured document such as the e-mail below.

In this image, we first detect the location of the From, To, Sent and Subject fields, then find the values next to them. In this type of form the width of the value generally varies, so we use a separate width, height and margins for each key-value pair.

[
    {
      "id": 0,
      "text": "From",
      "marginX": 5,
      "marginY": -5,
      "width": 800,
      "height": 25
    },
    {
      "id": 1,
      "text": "Sent",
      "marginX": 5,
      "marginY": -5,
      "width": 400,
      "height": 25
    },
    {
      "id": 2,
      "text": "To",
      "marginX": 5,
      "marginY": -5,
      "width": 300,
      "height": 25
    },
    {
      "id": 3,
      "text": "Subject",
      "marginX": 5,
      "marginY": -5,
      "width": 900,
      "height": 25
    }
]
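Once a search region is known, the value is whatever text falls inside it. The Python sketch below illustrates that step for single-line regions like the ones above (height 25). The `words_in_region` function, the center-point containment test, and the sample words are assumptions for illustration; the bot's own C# logic may select text differently.

```python
def words_in_region(words, region):
    """Join the text of every OCR word whose center lies inside region.

    region is (left, top, width, height). Words are sorted left to right,
    which is only adequate for single-line regions (an assumption here).
    """
    r_left, r_top, r_width, r_height = region
    hits = []
    for w in words:
        cx = w["left"] + w["width"] / 2
        cy = w["top"] + w["height"] / 2
        if r_left <= cx <= r_left + r_width and r_top <= cy <= r_top + r_height:
            hits.append(w)
    hits.sort(key=lambda w: w["left"])
    return " ".join(w["text"] for w in hits)


# Hypothetical OCR words, deliberately out of order.
words = [
    {"text": "report", "left": 195, "top": 200, "width": 50, "height": 16},
    {"text": "Subject:", "left": 40, "top": 200, "width": 60, "height": 16},
    {"text": "Quarterly", "left": 120, "top": 200, "width": 70, "height": 16},
    {"text": "Hello", "left": 40, "top": 260, "width": 45, "height": 16},
]

# Hypothetical search region to the right of the "Subject:" key.
region = (110, 195, 900, 25)

value = words_in_region(words, region)
print(value)  # Quarterly report
```

A wide region (such as the 900-pixel width used for Subject) simply gives long values room to fit; words outside the region, including the key itself and text on other lines, are ignored.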

With the definitions above, the search regions are set as shown below.

With these settings, the output looks like this:

Extract Key-Values from Table / Semi-Structured Content

Let's assume we have key-value pairs laid out in a table, as shown below, and make a few small changes to extract these fields.

I changed the SampleData.json file under Resources as follows:

[
  {
    "id": 0,
    "text": "Date",
    "marginX": 50,
    "marginY": -5,
    "width": 400,
    "height": 25
  },
  {
    "id": 1,
    "text": "Company",
    "marginX": 50,
    "marginY": -5,
    "width": 400,
    "height": 25
  },
  {
    "id": 2,
    "text": "Total",
    "marginX": 50,
    "marginY": -5,
    "width": 400,
    "height": 25
  },
  {
    "id": 3,
    "text": "Card",
    "marginX": 50,
    "marginY": -5,
    "width": 400,
    "height": 25
  },
  {
    "id": 4,
    "text": "Method",
    "marginX": 50,
    "marginY": -5,
    "width": 400,
    "height": 25
  }
]
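Because the definitions are edited by hand, it can help to check them before running an extraction. The Python sketch below parses a list like the one above and verifies its shape. The `load_fields` function and its rules (all six keys present, positive width and height, unique ids) are assumptions of this sketch, not checks the bot is known to perform.

```python
import json

REQUIRED_KEYS = {"id", "text", "marginX", "marginY", "width", "height"}


def load_fields(raw_json):
    """Parse a JSON array of search fields and check each entry's shape."""
    fields = json.loads(raw_json)
    seen_ids = set()
    for field in fields:
        missing = REQUIRED_KEYS - field.keys()
        if missing:
            raise ValueError(f"field {field.get('id')} is missing {sorted(missing)}")
        if field["width"] <= 0 or field["height"] <= 0:
            raise ValueError(f"field {field['id']} needs a positive width and height")
        if field["id"] in seen_ids:
            raise ValueError(f"duplicate id {field['id']}")
        seen_ids.add(field["id"])
    return fields


sample = """[
  {"id": 0, "text": "Date", "marginX": 50, "marginY": -5, "width": 400, "height": 25},
  {"id": 1, "text": "Company", "marginX": 50, "marginY": -5, "width": 400, "height": 25}
]"""

fields = load_fields(sample)
print([f["text"] for f in fields])  # ['Date', 'Company']
```

Note that a strict JSON parser such as Python's `json` module rejects `//` comments, so annotated examples need their comments removed before they are parsed this way.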

After these changes, the output looks like this:

I hope this is helpful.

Prerequisites

Visual Studio 2017 15.7 or newer installed.
.NET Core 2.1 or higher installed.
Bot Framework Emulator 4.1 or newer installed.

Running Locally

Visual Studio

Open BotOCRExtract.csproj in Visual Studio.
Run the project (press F5 key).

Testing the bot using Bot Framework Emulator

Microsoft Bot Framework Emulator is a desktop application that allows bot developers to test and debug their bots on localhost or running remotely through a tunnel.

Install the Bot Framework emulator.

Connect to bot using Bot Framework Emulator V4

Launch the Bot Framework Emulator.
Select File -> Open bot and open BotOCRExtract.bot.

Deploy the bot to Azure

See Deploy your C# bot to Azure for instructions.

The deployment process assumes you have an account on Microsoft Azure and are able to log into the Microsoft Azure Portal.

If you are new to Microsoft Azure, please refer to Getting started with Azure for guidance on how to get started on Azure.

ikivanc/Bot-Framework-and-OCR-Extract