PDF sample - any way to improve extraction? #393

igvk · 2023-08-07T11:52:35Z

Here is an example of a PDF that has some incorrectly extracted data (in stream mode):
V_1.pdf

Multi-line text isn't interpreted as such, and as a result it is spread very sparsely across rows.
The number 50 in the last row, column 4, is moved to the next cell to the right and merged with it.

Is it possible to improve the extraction of this table?

foarsitter · 2023-09-24T14:23:39Z

Thanks for providing an example. It will be very hard to extract that table. Don't expect an improvement in the foreseeable future.

kdshreyas · 2023-10-11T18:14:15Z

Hello @igvk,

I tweaked some of Camelot's parameters and I am getting good results. Check out the attached CSV and screenshot.

V_1.csv

Code I used to achieve this:

!wget https://github.com/camelot-dev/camelot/files/12279247/V_1.pdf

import camelot

tables = camelot.read_pdf('/kaggle/working/V_1.pdf', flavor='stream', row_tol=30)
print(tables)

df = tables[0].df
# replace newline characters with spaces to make it look clean
df = df.replace('\n', ' ', regex=True)

df

# df.to_csv('V_1.csv', index = False)

Let me know if it works for you!
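
For readers wondering why a larger `row_tol` helps here, the following is a rough sketch of the idea of a row tolerance: text whose vertical position lies within the tolerance of the row's first item is kept in the same row. It is a simplified illustration, not Camelot's actual code; the function name and the sample coordinates are made up.

```python
def group_rows(words, row_tol=2):
    """Group (y, text) pairs into rows.

    A word joins the current row when its y is within row_tol
    of the y of the first word in that row.
    """
    rows = []
    row_y = None
    # Sort top-to-bottom (PDF y grows upwards, so use descending y)
    for y, text in sorted(words, key=lambda w: -w[0]):
        if row_y is None or abs(row_y - y) > row_tol:
            rows.append([])  # start a new row
            row_y = y
        rows[-1].append(text)
    return rows


# Hypothetical data: a header line and one cell wrapped over three lines
words = [
    (760, "Item"), (760, "Qty"),
    (700, "long"), (688, "name,"), (688, "50"), (676, "continued"),
]

print(group_rows(words, row_tol=2))   # each wrapped line becomes its own row
print(group_rows(words, row_tol=30))  # the wrapped lines collapse into one row
```

With the small tolerance the wrapped cell is spread over three sparse rows; with the larger one it ends up in a single row, which is the effect seen with `row_tol=30` above.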

igvk · 2023-10-11T22:58:41Z

@kdshreyas,
Yes, providing a row tolerance makes it work better. There are still some problems that are hard to solve without providing exact column positions.
In this file, one of them is the number 50 in column 4.
Besides, the table header is shifted onto the same line as the column headers of columns 3 & 4.
The only solution that works well enough for me for this table is to provide the list of column positions.
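
For anyone trying this route, here is a minimal sketch of passing column positions in stream mode. The helper names and the sample x-coordinates are made up for illustration and were not measured from V_1.pdf; `columns`, `row_tol` and `split_text` are existing `camelot.read_pdf` options.

```python
def columns_arg(x_positions, n_tables=1):
    """Format column separator x-coordinates for camelot.read_pdf:
    a list holding one comma-separated string per table."""
    joined = ",".join(str(x) for x in sorted(x_positions))
    return [joined] * n_tables


def read_with_columns(pdf_path, x_positions, row_tol=30):
    # Imported lazily so columns_arg can be used without Camelot installed
    import camelot

    return camelot.read_pdf(
        pdf_path,
        flavor="stream",
        row_tol=row_tol,
        columns=columns_arg(x_positions),
        split_text=True,  # split text that spans a column separator
    )


# Hypothetical separator positions, in PDF points
print(columns_arg([310, 72, 150, 420]))
```

The obvious drawback is that the positions are tied to one specific layout.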

kdshreyas · 2023-10-12T10:06:33Z

@igvk,
I see, providing column parameters mostly works, but makes it hard-coded!
Thanks for your feedback!

patlachance · 2024-04-12T11:27:43Z

I'm using PyMuPDF to find out the column positions before calling camelot.read_pdf.

I hope the following will help!

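Some helper functions (methods of a class that keeps the PDF path in self.filename; they need `import fitz` and `import camelot`)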

    def fitz_extract_tables(self, pages=['all']):
        """ Returns an array of TableFinder objects
        Inputs:
            - pages: list of 0-based page indexes to process, or ['all']

        Returns:
            - an array of TableFinder objects, one per processed page
        """
        result = []
        # Open the PDF file using PyMuPDF
        with fitz.open(self.filename) as fitz_doc:
            for p in range(fitz_doc.page_count):
                # Skip page if not in the list
                if pages != ['all'] and p not in pages:
                    continue

                page = fitz_doc[p]
                # Find the tables on the page; a text-based strategy is an alternative:
                #result = page.find_tables(vertical_strategy='text',horizontal_strategy='text', min_words_vertical=3, min_words_horizontal=2)
                result.append(page.find_tables())
            return result


    def fitz_get_table_inner_columns(self, pageArray, tableId, to_integer=False):
        """ Returns an array of x coordinates of the inner columns of a table
        Inputs:
            - pageArray: list of 0-based page indexes to process
            - tableId: the 1-based id of the table on the first processed page
            - to_integer: round the coordinates to integers

        Returns:
            - an array of floats or integers
        """
        # Fetch the tables on the given pages
        tables = self.fitz_extract_tables(pageArray)

        # fitz_extract_tables returns one TableFinder object per processed page;
        # our table is on the first processed page
        table = tables[0][tableId-1]

        # finding the inner columns from the cell coordinates of the first table row
        columns_left  = []
        columns_right = []
        for bbox in table.rows[0].cells:
            if bbox is not None:
                # bbox contains (x0, y0, x1, y1)
                # we want the x coordinates
                columns_left.append(bbox[0])
                columns_right.append(bbox[2])

        if to_integer:
            columns_left = [int(round(x, 0)) for x in columns_left]
            columns_right = [int(round(x, 0)) for x in columns_right]

        # Assumption: the inner separators are the left edges of every
        # column except the first one
        return columns_left[1:]

And then

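        # Column separators of the 2nd table on the first page (page index 0)
        columns_array = self.fitz_get_table_inner_columns([0], 2, to_integer=True)
        # Camelot expects one comma-separated string per table;
        # repeat the same one so that up to 10 tables are covered
        columns = [','.join(str(x) for x in columns_array)] * 10
        edge_tol = 70
        self.tables = camelot.read_pdf(self.filename, flavor=self.flavor, pages=pages, columns=columns, edge_tol=edge_tol, split_text=True)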
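
The separator calculation can also be tried on plain tuples, without opening a PDF. The sketch below assumes each cell is an `(x0, y0, x1, y1)` bounding box and that a missing cell is `None`; the function name and the sample coordinates are made up for illustration.

```python
def inner_column_edges(cells, to_integer=False):
    """Return the x-coordinates that separate adjacent columns.

    cells: (x0, y0, x1, y1) tuples for one table row; None marks a missing cell.
    """
    # Drop missing cells and order the rest from left to right
    boxes = sorted(c for c in cells if c is not None)
    # The left edge of every column except the first is an inner separator
    edges = [box[0] for box in boxes[1:]]
    if to_integer:
        edges = [int(round(x)) for x in edges]
    return edges


# Hypothetical first row of a three-column table (one merged cell reported as None)
row = [
    (50.0, 100.0, 120.4, 115.0),
    None,
    (120.4, 100.0, 260.7, 115.0),
    (260.7, 100.0, 400.0, 115.0),
]

edges = inner_column_edges(row, to_integer=True)
print(",".join(str(x) for x in edges))  # string form for camelot's columns option
```

Using only the first row is a simplification: if that row contains merged cells, another row may describe the columns better.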

igvk added the "bug" label (Something isn't working) · Aug 7, 2023
