Text extraction ignores different kinds of white spaces #107

Open
hhaensel opened this issue Oct 17, 2022 · 0 comments

Currently, all whitespace characters in a textbox are merged into a single space character (' ').
This makes it very difficult to extract tabular data.

In #106, I propose introducing an extraction mode parameter that lets the user choose between three extraction modes.

  • :spaces (default)
    all whitespace is handled as a single space character
  • :tabs
    non-space whitespace is handled as tab characters
  • :boxes
    text between non-space whitespace is split into several textboxes with their respective coordinates
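
To make the difference between :spaces and :tabs concrete, here is a minimal sketch that works on plain strings. The function names are hypothetical and not part of the package, and I am assuming that a run of whitespace counts as "non-space" as soon as it contains any character other than ' '.

```julia
# Hypothetical helpers for illustration only; not taken from the package.

# Replacement for one whitespace run: a run made up only of ' ' collapses
# to a single space, any other run becomes a tab.
tab_or_space(ws) = all(==(' '), ws) ? " " : "\t"

function normalize_whitespace(text::AbstractString, mode::Symbol=:spaces)
    if mode === :spaces
        # Every run of whitespace becomes one space (the current behavior).
        return replace(text, r"\s+" => " ")
    elseif mode === :tabs
        # Runs containing a non-space whitespace character become a tab.
        return replace(text, r"\s+" => tab_or_space)
    else
        throw(ArgumentError("unsupported mode: $mode"))
    end
end
```

With this sketch, `normalize_whitespace("Name\tQty", :tabs)` keeps the column separator, while the default mode flattens it into a space.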

For this purpose get_TextBox() no longer returns a tuple text, w, h but a vector of tuples text, w, h, offset.
During evalContent!() the vector is iterated to return a TextLayout for each set of box parameters.
For the modes :spaces and :tabs, get_TextBox() always returns a single-element vector, whereas in :boxes mode more than one TextLayout might be added to the output.
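
The shape of the proposed return value could look roughly like the sketch below. It is not the code from #106: `textbox_parts`, `SEPARATOR` and `BoxTuple` are hypothetical names, and widths and offsets are simply counted in characters here, whereas the real values would come from font metrics.

```julia
# Hypothetical sketch, not the package's implementation.
# A separator is a whitespace run that contains a non-space character.
const SEPARATOR = r" *[^\S ]\s*"

const BoxTuple = Tuple{String,Float64,Float64,Float64}   # (text, w, h, offset)

function textbox_parts(text::String, mode::Symbol=:spaces; h::Float64=1.0)
    parts = BoxTuple[]
    if mode === :boxes
        offset = 0                 # characters consumed so far
        pos = firstindex(text)     # index where the next chunk starts
        for m in eachmatch(SEPARATOR, text)
            chunk = text[pos:prevind(text, m.offset)]
            isempty(chunk) || push!(parts, (chunk, Float64(length(chunk)), h, Float64(offset)))
            offset += length(chunk) + length(m.match)
            pos = m.offset + ncodeunits(m.match)
        end
        chunk = text[pos:end]      # trailing text after the last separator
        isempty(chunk) || push!(parts, (chunk, Float64(length(chunk)), h, Float64(offset)))
    elseif mode === :tabs
        joined = replace(replace(text, SEPARATOR => "\t"), r" +" => " ")
        push!(parts, (joined, Float64(length(joined)), h, 0.0))
    elseif mode === :spaces
        joined = replace(text, r"\s+" => " ")
        push!(parts, (joined, Float64(length(joined)), h, 0.0))
    else
        throw(ArgumentError("unsupported mode: $mode"))
    end
    return parts
end

# The caller iterates the vector, as evalContent!() would, creating one
# layout entry per tuple.
for (t, w, h, offset) in textbox_parts("Name\tQty\tPrice", :boxes)
    println(repr(t), " w=", w, " h=", h, " offset=", offset)
end
```

Only the :boxes branch can yield more than one element; the other two modes always produce a single-element vector, so the iterating code does not need to distinguish between modes.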

The :spaces mode reproduces the current extraction behavior.
The :tabs mode is suited for extracting "well-behaved" tabular data, i.e. tables without empty cells, or whose empty cells contain at least a space character.
The :boxes mode is essential to extract tables that contain empty cells. In that case further textbox treatment is necessary, which I would provide in a separate PR.

@sambitdash Please comment if this sounds like a desired feature to you.
If so, we can still discuss whether control via a global variable is the best choice or whether we'd rather implement a keyword arg which is passed through the text extraction function chain.
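
To illustrate the two options, here is a rough sketch with hypothetical names (none of these functions exist in the package). The inner/outer pairs stand in for the chain of extraction functions. Both variants simply return the mode that the innermost function ends up seeing.

```julia
# Hypothetical names throughout; a sketch of the two control styles only.
const VALID_MODES = (:spaces, :tabs, :boxes)

check_mode(mode::Symbol) =
    mode in VALID_MODES ? mode : throw(ArgumentError("unknown extraction mode: $mode"))

# Option A: a module-level setting that the innermost function reads.
const EXTRACTION_MODE = Ref{Symbol}(:spaces)
set_extraction_mode!(mode::Symbol) = (EXTRACTION_MODE[] = check_mode(mode))
inner_extract_global() = EXTRACTION_MODE[]
outer_extract_global() = inner_extract_global()

# Option B: a keyword argument that every function in the chain forwards.
inner_extract_kw(; mode::Symbol=:spaces) = check_mode(mode)
outer_extract_kw(; mode::Symbol=:spaces) = inner_extract_kw(; mode=mode)
```

Option A leaves all existing signatures untouched but makes the result depend on hidden state. Option B is explicit per call, at the cost of touching every function in the chain.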
