PDF Classifier for a Mortgage Company

Mortgage companies often receive hundreds of PDFs a day, and employees spend many hours classifying and filing them. That time costs the business money and takes employees away from other essential work. My goal was to build a classifier that addresses this problem, saving employees time and businesses money. This pipeline contains scripts that decrypt the PDFs so their text can be extracted, extract and analyze that text, and then predict each document's classification using various NLP models.

Prerequisites

QPDF
PDFminer
PyOCR

Tips for installing PDFminer:

https://media.readthedocs.org/pdf/pdfminer-docs/latest/pdfminer-docs.pdf

Tips for installing PyOCR:

https://pythontips.com/2016/02/25/ocr-on-pdf-files-using-python/

Getting Started

Step One: Decrypt files so text extraction is possible.
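
As a rough illustration of this step, the sketch below calls QPDF from Python to write a decrypted copy of a PDF. The helper names `build_qpdf_command` and `decrypt_pdf` are hypothetical and are not taken from this repository's scripts. It assumes the `qpdf` executable is on your PATH and uses QPDF's `--decrypt` and `--password` options.

```python
import subprocess
from pathlib import Path


def build_qpdf_command(src, dst, password=""):
    """Build the qpdf argument list (hypothetical helper, not from src/)."""
    cmd = ["qpdf", "--decrypt"]
    if password:
        # Only needed when the PDF requires a password to open.
        cmd.append(f"--password={password}")
    cmd += [str(src), str(dst)]
    return cmd


def decrypt_pdf(src, dst, password=""):
    """Write a decrypted copy of src to dst; raises if qpdf exits non-zero."""
    Path(dst).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_qpdf_command(src, dst, password), check=True)
```

For example, `decrypt_pdf("incoming/loan.pdf", "decrypted/loan.pdf")` would write the decrypted copy (the paths are placeholders). Note that QPDF can exit with a non-zero status for warnings even when it produced output, so `check=True` may be stricter than you want.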

Step Two: Convert PDFs to text files.

PDFminer should work for forms, but you will need PyOCR for scanned PDFs. For the best results, first use the covert_pdfDir function to extract text with PDFminer, then run the strip_blank function to delete any text files that were not converted, and finally use the fill_blank function to complete the remaining documents with PyOCR.
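
The two-pass idea above (try a text extractor first, fall back to OCR when nothing comes out) can be sketched as follows. This is not the repository's implementation of the functions named above: `extract_with_fallback` is a hypothetical helper, and the two extractors are passed in as plain callables so the sketch does not assume any particular PDFminer or PyOCR API.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# An extractor takes a PDF path and returns its text ("" if nothing was found).
Extractor = Callable[[Path], str]


def extract_with_fallback(
    pdf_paths: Iterable,
    out_dir,
    primary: Extractor,
    fallback: Extractor,
) -> Dict[str, str]:
    """Write one .txt per PDF and report which extractor produced it.

    Hypothetical helper: `primary` stands in for a PDFminer-based pass,
    `fallback` for an OCR-based pass.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    used = {}
    for pdf in map(Path, pdf_paths):
        text = primary(pdf)
        source = "primary"
        if not text.strip():
            # Blank output usually means a scanned PDF, so try OCR instead.
            text = fallback(pdf)
            source = "fallback"
        (out_dir / f"{pdf.stem}.txt").write_text(text, encoding="utf-8")
        used[pdf.name] = source
    return used
```

If you use the pdfminer.six distribution, `primary` could be something like `lambda p: extract_text(str(p))` with `extract_text` imported from `pdfminer.high_level`. The fallback would be whatever OCR routine you build on PyOCR.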

Step Three: Train your model.

Step Four: Predict Classifications.
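
To give a feel for Steps Three and Four, here is a minimal sketch of training a text classifier and predicting a label. It is a stand-in, not the models used in this project: it is a small multinomial Naive Bayes written with only the standard library, and the class name, category labels, and training sentences are all made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data; real input would be the extracted text files.
train_texts = [
    "appraisal report property value estimate",
    "bank statement account balance deposits",
    "appraisal of property market value",
    "monthly statement account withdrawals balance",
]
train_labels = ["appraisal", "bank_statement", "appraisal", "bank_statement"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(model.predict("estimated market value of the property"))
```

In practice you would likely swap this for a library model and evaluate it on held-out documents before trusting its predictions.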

Example Code

See examples.md
