Error: Spaces make a new line #1171

Open
wants to merge 5 commits into
base: master

AlanReviews

For some documents whose first row contains text with spaces, the CSV output contains a line break where a space should be. For example, if the first row contains a, "Hello world", and b, and I use Tabula to capture that row, the CSV file comes out as:

a,"Hello
world",b

I expect the CSV file to be:

a, "Hello world", b
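
For reference, a CSV-aware parser reads the two-line output above as a single record, because the line break sits inside a quoted field. A minimal Python sketch, assuming the output is exactly the two lines shown (the `raw` string is reproduced by hand for illustration; nothing here is part of Tabula):

```python
import csv
import io

# The two-line CSV output from the report, reproduced as one string.
raw = 'a,"Hello\nworld",b\n'

# csv.reader keeps a line break that sits inside a quoted field as part of the cell,
# so the two physical lines come back as one row with three cells.
rows = list(csv.reader(io.StringIO(raw, newline="")))
print(rows)
```

Tools that split the file on every line break, rather than parsing the quotes, would see two rows instead, which matches the behaviour described in the report.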

@jonardcaguioa

Hi. I can't raise an issue, so I'm posting here.

I've run into several issues when parsing lattice and stream tables. Is it possible to add a parameter like the row or column tolerance in Camelot? Thanks.

@nharrisanalyst

Hi, I can't raise an issue, so I am posting here.

Tabula opens to Keycloak (I don't even know what that is).

image

@jeremybmerrill
Member

jeremybmerrill commented Aug 14, 2021 via email

@dakial

dakial commented Sep 3, 2021

Hi guys, sorry for posting this here but there is nowhere else to do it. Maybe you should enable discussions on this repo, so that tabula users can help each other.

My post here is about line breaks inside table cells, like this (this is a screenshot of Tabula output):
image

Tabula inserts those line breaks into the CSV, which creates chaos when importing into Excel: every line break starts a new row, breaks the quoted text strings, and so on.

image

Maybe there is a use case to keep those line breaks inside the cell but a "remove line breaks inside cells" flag would be a great feature.

Meanwhile, for my fellow users who are battling this problem: after much searching I've found a solution that works with both Notepad++ and Sublime. It is a regex find and replace (taken from a very helpful post on StackOverflow):

Use Notepad++ regex Find-and-Replace:

Find what:

(,"[^"]*?)[\r\n]+

Replace with:

$1 

(There is a single space after $1.)

Repeatedly click "Replace All" until no more matches are found.

This works.
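
For anyone who would rather script this than click "Replace All" repeatedly, here is a rough Python sketch of the same cleanup. It is not a port of the regex above: it swaps in Python's standard `csv` module, which parses the quotes and so also covers a quoted cell at the start of a row. The function name `flatten_cell_breaks` and the sample text are made up for illustration, and the sketch assumes the export is a comma-separated file with double-quoted cells.

```python
import csv
import io


def flatten_cell_breaks(csv_text: str) -> str:
    """Return csv_text with each line break inside a cell replaced by a space."""
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        # splitlines() splits on \r, \n and \r\n; the pieces are rejoined with spaces.
        writer.writerow([" ".join(cell.splitlines()) for cell in row])
    return out.getvalue()


# Hypothetical sample: the second cell of the first row spans two lines.
sample = 'id,"first line\r\nsecond line",x\n1,"plain",y\n'
print(flatten_cell_breaks(sample))
```

To use it on a real export, read the file with `open(path, newline="", encoding="utf-8")`, pass the text to the function, and write the result to a new file so the original stays untouched.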

@flywire

flywire commented Sep 3, 2021

My post here is about line breaks inside table cells

In Excel they are entered as .

@undine-su-menulio

Hey! I can't raise an issue, so I am writing here. Could you please tell me what to do about a "Java Heap Error" when my PDF is only 50 KB and has only 1 page?

I've already tried multiple times to reinstall both Java and Tabula, restarted the whole computer, cleared the "Local/Temp" folder, cleared ports and changed 8080 to 9999, but nothing helps. OS: Windows 10

The PDF is 100% valid. I opened it this morning without any trouble, but then it suddenly stopped working. Since then Tabula returns "Sorry, your file upload could not be processed. Please double-check that the file you uploaded is a valid PDF file and try again", and I cannot see any of my previous files; it just shows "First time using Tabula? Welcome!"

I will be very grateful for any help 🙏

The error is the following:
2022-01-08 17:33:25.054:INFO:oejsh.ContextHandler:main: Started o.e.j.w.WebAppContext@47089e5f{/,file:/C:/Users/User/AppData/Local/Temp/jetty-0.0.0.0-8080-tabula.jar-_-any-6136871881336378357.dir/webapp/,AVAILABLE}{file:/H:/Course/tabula-win-1.2.1/tabula/tabula.jar}
2022-01-08 17:33:25.056:WARN:oejsh.RequestLogHandler:main: !RequestLog
2022-01-08 17:33:25.068:INFO:oejs.ServerConnector:main: Started ServerConnector@2bb0e277{HTTP/1.1}{0.0.0.0:8080}
2022-01-08 17:33:25.069:INFO:oejs.Server:main: Started @8398ms
2022-01-08 17:33:35.840:WARN:oejs.ServletHandler:qtp1706234378-18: Error for /documents
java.lang.OutOfMemoryError: Java heap space
    at org.jruby.util.ByteList.ensure(ByteList.java:341)
    at org.jruby.util.io.EncodingUtils$1.resize(EncodingUtils.java:1197)
    at org.jruby.util.io.EncodingUtils.moreOutputBuffer(EncodingUtils.java:1513)
    at org.jruby.util.io.EncodingUtils.transcodeLoop(EncodingUtils.java:1415)
    at org.jruby.util.io.EncodingUtils.transcodeLoop(EncodingUtils.java:1312)
    at org.jruby.util.io.EncodingUtils.strTranscode0(EncodingUtils.java:936)
    at org.jruby.util.io.EncodingUtils.strTranscode(EncodingUtils.java:857)
    at org.jruby.util.io.EncodingUtils.strEncode(EncodingUtils.java:829)
    at org.jruby.RubyString.encode(RubyString.java:5398)
    at json.ext.Parser.convertEncoding(Parser.java:196)
    at json.ext.Parser.initialize(Parser.java:175)
    at json.ext.Parser$INVOKER$i$0$1$initialize.call(Parser$INVOKER$i$0$1$initialize.gen)
    at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:725)
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:208)
    at org.jruby.internal.runtime.methods.JavaMethod$JavaMethodN.call(JavaMethod.java:741)
    at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:278)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:79)
    at org.jruby.RubyObject.callInit(RubyObject.java:348)
    at json.ext.Parser.newInstance(Parser.java:151)
    at json.ext.Parser$INVOKER$s$0$1$newInstance.call(Parser$INVOKER$s$0$1$newInstance.gen)
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:212)
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:208)
    at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:338)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:183)
    at org.jruby.ir.interpreter.InterpreterEngine.processCall(InterpreterEngine.java:324)
    at org.jruby.ir.interpreter.StartupInterpreterEngine.interpret(StartupInterpreterEngine.java:74)
    at org.jruby.ir.interpreter.InterpreterEngine.interpret(InterpreterEngine.java:84)
    at org.jruby.internal.runtime.methods.MixedModeIRMethod.INTERPRET_METHOD(MixedModeIRMethod.java:179)
    at org.jruby.internal.runtime.methods.MixedModeIRMethod.call(MixedModeIRMethod.java:165)
    at org.jruby.internal.runtime.methods.DynamicMethod.call(DynamicMethod.java:200)
    at org.jruby.runtime.callsite.CachingCallSite.cacheAndCall(CachingCallSite.java:318)
    at org.jruby.runtime.callsite.CachingCallSite.call(CachingCallSite.java:155)

@john-harrold

Howdy. I don't have any affiliation with this project. I just follow it on github. Can you share your pdf? I can try it out on my computer and see if I can figure out what is going on.

@undine-su-menulio

Thank you! Here

@john-harrold

So I was able to extract the tables. Here is the result:
tabula-un1.csv

I'm running on a Mac. Oddly, I cannot figure out what version of Java I have installed. For example, running

java --version

says it cannot find it, but it's got to be installed somewhere because Tabula is running :). I don't want to muck around with Java and end up breaking it. Hopefully this extraction works for you.

@undine-su-menulio

Thank you so much for your help! Wish you all the best things in the world and all the blessings! ☀️
I hope one day I will figure out what is getting in the way in my case.

@ginozzzz

Great job guys! Thanks so much!

Now... I'm not sure how to raise an issue because some links are broken...

Importing a table with '(' brackets causes the software to omit the entire line.
Screenshot 2022-01-22 at 13 26 18
Screenshot 2022-01-22 at 13 26 37

@ictalbaniaco

Great tool, and thank you for your great work!
I just wanted to post an issue that I encountered:
I'm using tabula-win-1.2.1 on Windows 11 and I'm exporting tables from PDF files that contain only tables (same columns and rows all the way down). Each PDF file contains 90-100 pages.
The problem is that it doesn't export more than 3300 rows.
I tested it with 9 different files and the result is the same: no more than 3300 rows.
Thank you again for your work!

@OldGuy1949

Great tool! We use it to parse PDF files that appear to be the same format to us humans, but drive Tabula nuts in either stream or matrix mode. Culprits include formatting characters used by the PDF creators (different people, I assume). We find spaces, tabs and other unprintable characters in the CSV output. The files themselves always present as a header, a trailer and intermediate details, with page headers and trailers inside the details.

May I suggest that since we humans recognize the format, allowing us to specify vertical columns (groups) and horizontal rows (elements) might remove reliance on recognizing said groups and elements. Were I a seasoned open-source developer who had time on his hands (I'm neither), I'd look at the code and see where / how this might work.
