fastapi_pdfextractor

A simple API built with FastAPI for extracting the text content of PDFs using pdfminer. Several PDF parsers were tried, including PyPDF2 and pdfminer, and pdfminer gave better results. For added OCR support, first install Tesseract and Ghostscript, as these are required dependencies for the code to work.
Try out and compare the output of pdfminer and Tika through the API endpoints. Results are available in the API response or in the app/results directory.
Note: if Tesseract is installed somewhere other than the default location, change the path accordingly in the pdfapi.py file.
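
The note above can be illustrated with a small sketch of how a Tesseract path might be resolved before it is configured. The function name `resolve_tesseract` is hypothetical and is not taken from pdfapi.py; it also assumes the executable is named `tesseract` on PATH. If the project uses the pytesseract wrapper (an assumption), the resolved path would typically be assigned to `pytesseract.pytesseract.tesseract_cmd`.

```python
import shutil
from pathlib import Path
from typing import Optional


def resolve_tesseract(custom_path: Optional[str] = None) -> Optional[str]:
    """Return a usable Tesseract executable path, or None if none is found.

    Hypothetical helper for illustration; not part of pdfapi.py.
    """
    if custom_path:
        # An explicit location wins, but only if the file actually exists.
        return custom_path if Path(custom_path).is_file() else None
    # Otherwise fall back to whatever "tesseract" resolves to on PATH.
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = resolve_tesseract()
    if path is None:
        print("Tesseract not found; install it or pass an explicit path.")
    else:
        # Assumption: with pytesseract installed, one would then set
        # pytesseract.pytesseract.tesseract_cmd = path
        print(f"Using Tesseract at {path}")
```

Failing early with a clear message is usually easier to debug than an OCR error raised in the middle of a request.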

Clone project

git clone https://github.com/soham-1/fastapi_pdfextractor.git

Run locally

Install dependencies

pip install -r requirements.txt

Run Server

cd app
uvicorn pdfapi:app --host 0.0.0.0 --port 8000 --reload
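
Once the server is started, a quick check that it responds can be sketched as below. This is a minimal example that assumes the host and port from the uvicorn command above and uses the /get_doc_list endpoint from the documentation; the helper name `is_server_up` is made up for illustration.

```python
import urllib.error
import urllib.request


def is_server_up(base_url: str = "http://localhost:8000", timeout: float = 2.0) -> bool:
    """Return True if anything answers HTTP at base_url/get_doc_list."""
    url = base_url.rstrip("/") + "/get_doc_list"
    # Bypass any configured proxies, since this is meant as a local check.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except OSError:
        # Connection refused, timeout, DNS failure, and similar.
        return False


if __name__ == "__main__":
    print("server up:", is_server_up())
```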

Run on Docker

docker-compose up -d --build

Stop the container using

docker-compose stop fast_api

Restart it using

docker-compose up -d

Documentation

This API has the following endpoints:

/get_doc_list - returns a list of all the available PDFs
/parse/{doc_name} - returns the metadata and text content of a PDF. Available PDFs are sample_doc_1, sample_doc_2, and sample_doc_3
/pdfminer_text/{doc} - returns the text output of a PDF using the pdfminer library
/pdfminer_text/{doc}/{page_no} - returns the text output of the specified page_no of a PDF
/tika_text/{doc} - returns the text output of a PDF using the py-tika library
/pdfminer_xml/{doc} - returns XML output
/pdfminer_xml/{doc}/{page_no} - returns XML output of the specified page_no of a PDF
/pdfminer_html/{doc} - returns HTML output
/pdfminer_html/{doc}/{page_no} - returns HTML output of the specified page_no of a PDF
/pdfminer_html_char/{doc} - returns character-level HTML output
/pdfminer_html_char/{doc}/{page_no} - returns character-level HTML output of the specified page_no of a PDF
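
The endpoint patterns above can be assembled on the client side with a small helper. This is only a sketch: `endpoint_url` and the two sets are hypothetical names, the base URL is assumed from the uvicorn command, and the rule that only the pdfminer outputs accept a page_no simply mirrors the list above rather than the server code.

```python
from typing import Optional
from urllib.parse import quote

# Output variants that the endpoint list shows with an optional page number.
PAGED_OUTPUTS = {"pdfminer_text", "pdfminer_xml", "pdfminer_html", "pdfminer_html_char"}
# Variants listed for a whole document only.
WHOLE_DOC_OUTPUTS = {"parse", "tika_text"}


def endpoint_url(base_url: str, output: str, doc: str, page_no: Optional[int] = None) -> str:
    """Build the URL for one of the documented extraction endpoints."""
    if output not in PAGED_OUTPUTS | WHOLE_DOC_OUTPUTS:
        raise ValueError(f"unknown output type: {output}")
    if page_no is not None and output not in PAGED_OUTPUTS:
        raise ValueError(f"{output} is not listed with a page_no variant")
    # Percent-encode the document name so it stays a single path segment.
    url = f"{base_url.rstrip('/')}/{output}/{quote(doc, safe='')}"
    if page_no is not None:
        url += f"/{page_no}"
    return url


if __name__ == "__main__":
    base = "http://localhost:8000"  # assumed from the uvicorn command
    print(endpoint_url(base, "pdfminer_text", "sample_doc_1"))
    print(endpoint_url(base, "pdfminer_xml", "sample_doc_2", page_no=1))
```

The resulting URLs can be opened in a browser or fetched with any HTTP client to compare the pdfminer and Tika outputs side by side.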

[Screenshots: a text PDF and its output; a PDF with a scanned image and its output]
