PDF sample - any way to improve extraction? #393

igvk opened this issue Aug 7, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@igvk

igvk commented Aug 7, 2023

Here is an example of a PDF that has some incorrectly extracted data (in stream mode):
V_1.pdf

V_1

  1. Multi-line text isn't interpreted as such, and as a result it is very sparsely distributed into rows.

  2. Number 50 in the last row and column 4 is moved to the next cell to the right and merged with it.

Is it possible to improve the extraction of this table?

@igvk igvk added the bug Something isn't working label Aug 7, 2023
@foarsitter
Contributor

Thanks for providing an example. It will be very hard to extract that table. Don't expect an improvement in the foreseeable future.

@kdshreyas

Hello @igvk,

I managed to tweak some of Camelot's parameters and I am getting good results. Check out the attached CSV and screenshot.

V_1.csv

image

Code I used to achieve this:

!wget https://github.com/camelot-dev/camelot/files/12279247/V_1.pdf

import camelot

tables = camelot.read_pdf('/kaggle/working/V_1.pdf', flavor='stream', row_tol=30)
print(tables)

df = tables[0].df
# replace newline characters with spaces to make it look clean
df = df.replace('\n', ' ', regex=True)

df

# df.to_csv('V_1.csv', index = False)

Let me know if it works for you!

@igvk
Author

igvk commented Oct 11, 2023

@kdshreyas,
Yes, providing a row tolerance makes it work better, though some problems remain that are hard to solve without providing exact column positions.
In this file, one of them is the number 50 in column 4.
Besides, the table header is shifted onto the same line as the column headers of columns 3 & 4.
The only solution that works well enough for me for this table is to provide the list of column positions.
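
A minimal sketch of providing column positions, assuming Camelot's stream flavor and its `columns` argument (a list holding one comma-separated string of x-coordinates per table). The helper names `columns_arg` and `read_with_columns` are hypothetical, and the separator values in the last line are placeholders, not measurements taken from V_1.pdf:

```python
def columns_arg(separators, n_tables=1):
    """Format x-coordinates the way Camelot's `columns` argument expects.

    Returns a list with one comma-separated string per table.
    """
    joined = ",".join(str(x) for x in sorted(separators))
    return [joined] * n_tables


def read_with_columns(pdf_path, separators, row_tol=30):
    """Read a PDF in stream mode with explicit column separators."""
    # Imported here so that columns_arg stays usable without Camelot installed.
    import camelot

    return camelot.read_pdf(
        pdf_path,
        flavor="stream",
        row_tol=row_tol,
        columns=columns_arg(separators),
    )


# Placeholder separators, only to show the formatting step.
print(columns_arg([327, 95.5, 209]))
```

The real separators have to be measured for each document, which is why this approach ends up hard-coded.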

@kdshreyas

@igvk,
I see, providing column parameters mostly works, but makes it hard-coded!
Thanks for your feedback!

@patlachance

I'm using PyMuPDF to find out the column positions before calling camelot.read_pdf.

I hope the following will help!

Some helper functions

    # Module-level imports these helpers rely on:
    #   import ast
    #   import camelot
    #   import fitz  # PyMuPDF

    def fitz_extract_tables(self, pages=['all']):
        """ Returns an array of TableFinder objects
        Inputs:
            - pages: list of 0-based page indices to process, or ['all']

        Returns:
            - an array of TableFinder objects, one per processed page
        """
        result = []
        # Open the PDF file using PyMuPDF
        with fitz.open(self.filename) as fitz_doc:
            num_pages = fitz_doc.page_count
            for p in range(0, num_pages):
                # Skip page if not in the list
                if pages != ['all'] and p not in pages:
                    continue

                page = fitz_doc[p]
                # Detect the tables on the page
                #result = page.find_tables(vertical_strategy='text',horizontal_strategy='text', min_words_vertical=3, min_words_horizontal=2)
                result.append(page.find_tables())
            return result


    def fitz_get_table_inner_columns(self, pageArray, tableId, to_integer=False):
        """ Returns an array of x coordinates of the inner columns of a table
        Inputs:
            - pageArray: list of 0-based page indices to process
            - tableId: the 1-based id of the table on the first processed page

        Returns:
            - an array of floats or integers
        """
        # Fetch the tables on the given pages
        tables = self.fitz_extract_tables(pageArray)

        # fitz_extract_tables returns a list of TableFinder objects, one per page;
        # our table is on the first processed page
        table = tables[0][tableId-1]

        # finding the inner columns from the cell coordinates of the first table row
        columns_left  = []
        columns_right = []
        r = 0
        for cell in table.rows[r].cells:
            if cell is not None:
                # Convert the string to a list using ast.literal_eval()
                bbox = ast.literal_eval(str(cell))
                # bbox contains [x0, y0, x1, y1]
                # we want the x coordinates
                columns_left.append(bbox[0])
                columns_right.append(bbox[2])

        if to_integer:
            columns_left = [int(round(x, 0)) for x in columns_left]
            columns_right = [int(round(x, 0)) for x in columns_right]

        # The posted snippet ends without a return statement. Assumption: the
        # inner column positions are the left edges of every column but the first.
        return columns_left[1:]

And then

        columns_array = self.fitz_get_table_inner_columns([0], 2, to_integer=True)
        columns = [','.join(str(x) for x in columns_array)] * 10
        edge_tol = 70
        self.tables = camelot.read_pdf(self.filename, flavor=self.flavor, pages=pages, columns=columns, edge_tol=edge_tol, split_text=True)
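
The separator step can also be sketched on its own, without the surrounding class. This assumes each cell of a row is an `(x0, y0, x1, y1)` tuple or `None`, which is how PyMuPDF reports table row cells. `inner_separators` is a hypothetical name, and placing the separator at the midpoint between neighbouring cells is a choice made for this sketch, not something taken from the helpers above:

```python
def inner_separators(cells, to_integer=False):
    """Return the x positions separating adjacent cells of one table row.

    `cells` is a list of (x0, y0, x1, y1) tuples; None entries are skipped.
    """
    # Order the present cells from left to right by their left edge.
    boxes = sorted((c for c in cells if c is not None), key=lambda b: b[0])
    # One separator per adjacent pair: midpoint between the right edge of
    # the left cell and the left edge of the right cell.
    seps = [(left[2] + right[0]) / 2 for left, right in zip(boxes, boxes[1:])]
    if to_integer:
        seps = [int(round(x)) for x in seps]
    return seps


# Made-up bounding boxes, only to show the shape of the input and output.
row = [
    (50.0, 10.0, 120.0, 30.0),
    None,
    (124.0, 10.0, 300.0, 30.0),
    (300.0, 10.0, 410.5, 30.0),
]
print(",".join(str(x) for x in inner_separators(row)))
```

The joined string has the same shape as one entry of the `columns` list passed to camelot.read_pdf.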


4 participants