ocr-nlp-flyer

Top 6 Submission at 2020 Daisy Hackathon @ UofT

Solution

The program uses OpenCV to process the flyer image and segment it into one bounding box per product. PyTesseract, a Python wrapper for the LSTM-based Tesseract OCR engine, then extracts the text from each box. That text is parsed with empirical, rule-based methods such as RegEx to extract the price, discount, product name, and organic status.
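To make the RegEx parsing stage concrete, here is a minimal sketch that pulls a promo price, a savings amount, and an organic flag out of the text of one bounding box. The patterns, the function name `parse_promo_text`, and the sample string are assumptions for illustration; the expressions actually used in ocr.py may differ.

```python
import re

# Hypothetical patterns; the actual expressions in ocr.py may differ.
PRICE_RE = re.compile(r"\$\s*(\d+(?:\.\d{2})?)")
SAVE_RE = re.compile(r"save\s*\$\s*(\d+(?:\.\d{2})?)", re.IGNORECASE)
ORGANIC_RE = re.compile(r"\borganic\b", re.IGNORECASE)


def parse_promo_text(text):
    """Extract promo price, savings, and organic status from OCR text."""
    save_match = SAVE_RE.search(text)
    # Remove the "save $X" phrase first so its amount is not mistaken for the price.
    remaining = SAVE_RE.sub(" ", text)
    price_match = PRICE_RE.search(remaining)
    return {
        "promo_price": float(price_match.group(1)) if price_match else None,
        "amount_saved": float(save_match.group(1)) if save_match else None,
        "organic": bool(ORGANIC_RE.search(text)),
    }


# Made-up OCR output for a single product box.
sample = "Organic Gala Apples\n$1.99 /lb\nSAVE $0.50"
print(parse_promo_text(sample))
```

Returning `None` for a field that is not found keeps OCR misses visible in the output table instead of silently filling in a default.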

Problem Scope

Goal: Given a high-resolution flyer image, return a table of products and their accompanying promotion details, including:
- Product Name
- Promo. Price ($)
- Unit of measure
- Least unit for promo.
- Amount saved per unit ($)
- Discount (%)
- Organic product (Boolean)
Data: 212 unlabeled flyer images, plus a product dictionary and a unit-of-measure dictionary
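One row of the target table can be sketched as a small record type with a CSV writer. The field names, the `PromoRow` class, and both helper functions are hypothetical. In particular, the relation used for the discount (amount saved divided by promo price plus amount saved) is an assumption, since the README does not define how the percentage is computed.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from typing import Optional


@dataclass
class PromoRow:
    # Hypothetical field names mirroring the columns listed above.
    product_name: str
    promo_price: Optional[float]
    unit_of_measure: Optional[str]
    least_unit_for_promo: int
    amount_saved_per_unit: Optional[float]
    discount_pct: Optional[float]
    organic: bool


def discount_percent(promo_price, amount_saved):
    """Assumed relation: discount = saved / (promo price + saved) * 100."""
    if promo_price is None or amount_saved is None:
        return None
    regular_price = promo_price + amount_saved
    if regular_price <= 0:
        return None
    return round(100 * amount_saved / regular_price, 2)


def rows_to_csv(rows):
    """Serialize rows to CSV text with a single header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(PromoRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


# Made-up example row, not taken from the dataset.
row = PromoRow("Gala Apples", 1.5, "lb", 1, 0.5, discount_percent(1.5, 0.5), True)
print(rows_to_csv([row]))
```

Optional fields make it possible to emit a row even when OCR fails to recover one of the values for a product.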
