when cell content exceeds cell boundaries, next cell gets messed up (examples) #538

Open
shula opened this issue Feb 20, 2024 · 1 comment

Comments

shula commented Feb 20, 2024

When the text of 2 cells in the PDF continues beyond the cell's boundary, the content extracted for the next cell goes "crazy" (i.e. it is totally different from what is expected).

in the example sample:

I assume the PDF was generated from Excel, where it's common to see long text cut off at the border of the cell, but I don't know for sure.

Command line used:
java -Dfile.encoding=UTF8 -jar tabula-1.0.5-jar-with-dependencies.jar sample.pdf -f TSV > sample.tsv

The bogus lines are the ones identified by: 1068, 1103
Output lines with the problem:
43 E2U9 A10L YCPCT "ש""א אקליפטוס סיטריאדורה SCITRIADORA/" 1068
60 43 10 CEUCC "ש""א אקליפטוס רדיאטה LYPTUSRADIATA/" 1103

In the output, I see 3 phenomena:

  1. the wrong text "A10L YCPCT" should have been "10 CC".
  2. the wrong text "E2U9" should have been "29", etc.
  3. the word "EUCALYPTUS" is cut off in these lines. This makes sense, since the rest of it is not visible in the PDF, so it is not a real bug.

In the attached sample.pdf, the 3rd field of the converted text file should have been the text "10 CC".
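
To make items 1 and 2 easier to verify, here is a minimal Python sketch (not part of Tabula; the helper name `strip_expected` is made up for illustration). It checks whether the expected text appears in the extracted field as an in-order subsequence and returns the characters that are left over:

```python
from typing import Optional


def strip_expected(garbled: str, expected: str) -> Optional[str]:
    """Remove `expected` from `garbled` as an in-order subsequence.

    Returns the leftover characters, or None if `expected` does not
    occur in `garbled` in that order.
    """
    leftover = []
    remaining = iter(expected)
    want = next(remaining, None)  # next expected character still to match
    for ch in garbled:
        if want is not None and ch == want:
            want = next(remaining, None)
        else:
            leftover.append(ch)
    return "".join(leftover) if want is None else None


# The two fields from line 1068 of the output above.
for garbled, expected in [("A10L YCPCT", "10 CC"), ("E2U9", "29")]:
    print(repr(garbled), "=", repr(expected), "+", repr(strip_expected(garbled, expected)))
```

For the two fields of line 1068 the leftovers are "ALYPT" and "EU", i.e. letters that also occur in the word that is cut off in item 3.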

My setup:

  • Windows 10
  • java version "1.8.0_401"
  • tabula 1.0.5
@jeremybmerrill
Member

Hi @shula. Unfortunately, this is expected behavior for a PDF with this kind of problem. The "extra"/unexpected characters (for example, AL YPT in line 1068) are present in the PDF, but hidden under the text of the next cell to the left. So Tabula is correctly extracting the characters.
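
As a rough illustration of why hidden text ends up mixed into the neighbouring cell, here is a toy Python sketch. It assumes an extractor that orders the glyphs of a text line by their horizontal position; this is a simplification and not Tabula's actual code, and the function name `merge_by_x` and all coordinates are invented for the example.

```python
from typing import List, Tuple

# (text, x of the first glyph, horizontal advance per glyph); all hypothetical
Run = Tuple[str, float, float]


def merge_by_x(runs: List[Run]) -> str:
    """Place every glyph of every run at its x position and read them
    back left to right, as a position-based extractor might do."""
    glyphs = []
    for text, start_x, advance in runs:
        for i, ch in enumerate(text):
            glyphs.append((start_x + i * advance, ch))
    glyphs.sort(key=lambda g: g[0])  # order by x only
    return "".join(ch for _, ch in glyphs)


# Hidden overflow from one cell, occupying the same area as the
# visible text of the neighbouring cell.
overflow = ("ALYPT", 0.0, 2.0)
visible = ("10 CC", 1.0, 2.0)
print(merge_by_x([overflow, visible]))  # A1L0Y PCTC
```

The made-up coordinates do not reproduce the exact string from the report, but the effect is the same: both runs are really in the PDF at overlapping positions, so their characters come out interleaved.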
