PDF sample - any way to improve extraction? #393
Comments
Thanks for providing an example. It will be very hard to extract that table. Don't expect an improvement in the foreseeable future.
Hello @igvk, I managed to tweak some of Camelot's parameters and I am getting good results. Check out the attached CSV and screenshot. Code I used to achieve this:
Let me know if it works for you!
@kdshreyas,
@igvk,
I'm using PyMuPDF to find the column positions before calling camelot.read_pdf. I hope the following helps!
Some helper functions:
# Both functions are methods of a class that provides self.filename and self.flavor.
import camelot
import fitz  # PyMuPDF

def fitz_extract_tables(self, pages=['all']):
    """ Returns an array of TableFinder objects
    Inputs:
        - pages: list of 0-based page indices to process, or ['all']
    Returns:
        - an array of TableFinder objects, one per processed page
    """
    result = []
    # Open the PDF file using PyMuPDF
    with fitz.open(self.filename) as fitz_doc:
        num_pages = fitz_doc.page_count
        for p in range(0, num_pages):
            # Skip page if not in the list
            if pages != ['all'] and p not in pages:
                continue
            page = fitz_doc[p]
            # Detect the tables on the page. Alternative, text-based strategy:
            # page.find_tables(vertical_strategy='text', horizontal_strategy='text', min_words_vertical=3, min_words_horizontal=2)
            result.append(page.find_tables())
    return result

def fitz_get_table_inner_columns(self, pageArray, tableId, to_integer=False):
    """ Returns an array of x coordinates of the inner columns of a table
    Inputs:
        - pageArray: list of 0-based page indices; the table is taken from the first processed page
        - tableId: the 1-based id of the table on that page
        - to_integer: round the coordinates to integers
    Returns:
        - an array of floats or integers
    """
    # fitz_extract_tables returns one TableFinder object per processed page;
    # our table is on the first processed page
    tables = self.fitz_extract_tables(pageArray)
    table = tables[0][tableId - 1]
    # Find the inner columns from the cell coordinates of the first table row
    columns_left = []
    columns_right = []
    for bbox in table.rows[0].cells:
        if bbox is not None:
            # bbox contains (x0, y0, x1, y1); we want the x coordinates
            columns_left.append(bbox[0])
            columns_right.append(bbox[2])
    if to_integer:
        columns_left = [int(round(x, 0)) for x in columns_left]
        columns_right = [int(round(x, 0)) for x in columns_right]
    # The snippet as posted ends without a return statement. Returning the left
    # edges of all cells but the first (the inner boundaries) is an assumption.
    return columns_left[1:]

And then:
columns_array = self.fitz_get_table_inner_columns([0], 2, to_integer=True)
columns = [','.join(str(x) for x in columns_array)] * 10
edge_tol = 70
# columns and edge_tol are options of the stream flavor
self.tables = camelot.read_pdf(self.filename, flavor=self.flavor, pages=pages, columns=columns, edge_tol=edge_tol, split_text=True)
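The column-separator step from the comment above can be tried without PyMuPDF or Camelot installed. The following is only a minimal sketch: the helper names `inner_separators` and `camelot_columns` and the sample bounding boxes are made up for illustration, and it assumes that the separators Camelot needs are the left edges of every cell in a row except the leftmost one.

```python
from typing import List, Optional, Sequence

# A cell bounding box as (x0, y0, x1, y1); None stands for a missing cell.
BBox = Sequence[float]


def inner_separators(cells: Sequence[Optional[BBox]], to_integer: bool = False) -> List[float]:
    """Return the x coordinates of the boundaries between cells of one row.

    Assumption: the separators are the left edges of all cells except the
    leftmost one. Missing (None) cells are skipped, so their boundary is lost.
    """
    lefts = sorted(bbox[0] for bbox in cells if bbox is not None)
    inner = lefts[1:]  # drop the outer left edge of the table
    if to_integer:
        inner = [int(round(x)) for x in inner]
    return inner


def camelot_columns(separators: Sequence[float], n_tables: int = 1) -> List[str]:
    """Format separators as a list of comma-separated strings, one per table."""
    return [",".join(str(x) for x in separators)] * n_tables


if __name__ == "__main__":
    # Hypothetical first row of a three-column table with one missing cell.
    row = [
        (36.0, 100.0, 120.4, 115.0),
        (120.4, 100.0, 250.6, 115.0),
        None,
        (250.6, 100.0, 400.0, 115.0),
    ]
    print(camelot_columns(inner_separators(row, to_integer=True)))  # ['120,251']
```

The resulting list has the shape that `camelot.read_pdf(..., flavor='stream', columns=...)` accepts: one string of comma-separated x coordinates per table. Whether rounded coordinates are precise enough for a given PDF has to be checked against the actual extraction result.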
Here is an example of a PDF that has some incorrectly extracted data (in stream mode):
V_1.pdf
Multi-line text isn't recognized as such, so it ends up sparsely distributed across rows.
The number 50 in the last row, column 4, is moved into the next cell to the right and merged with its content.
Is it possible to improve the extraction of this table?