
Buffering tools: warn if running against extremely large files #737

Open
onyxfish opened this issue Dec 29, 2016 · 10 comments

onyxfish (Collaborator)

Such as #581.

But what's the limit? 100MB? 500MB? 1GB?

@onyxfish onyxfish changed the title Buffering tools: a warning if running against extremely large files Buffering tools: warn if running against extremely large files Dec 29, 2016
@jpmckinney jpmckinney modified the milestone: Easy Dec 29, 2016
@jpmckinney jpmckinney modified the milestone: Easy Jan 30, 2017
jpmckinney (Member) commented Jan 30, 2017

For files larger than 1GB, I don't think we can expect csvkit to ever perform well, so we can at least start there. Some tools will have lower limits. For reference, the buffering tools are listed below:

  • csvjoin
  • csvjson unless --stream and --no-inference are set but --skip-lines is not
  • csvlook
  • csvpy
  • csvsort
  • csvsql
  • csvstat
  • in2csv unless --format is ndjson and --no-inference is set, or --format is csv and --no-inference is set but --skip-lines is not

For the record, issues have been opened about the performance of csvsort (#157, #338, #457, #626), csvsql (#428, #633) and csvstat (#581); #141 discusses performance in general.
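The warning this issue proposes could be as simple as a file-size check before a buffering tool starts reading. A minimal sketch, assuming a 1 GB threshold as suggested above; the `size_warning` helper and its message are hypothetical, not csvkit's API:

```python
import os

# 1 GB threshold, per the suggestion above (an assumption, not a csvkit constant).
LARGE_FILE_BYTES = 1024 ** 3

def size_warning(path, threshold=LARGE_FILE_BYTES):
    """Return a warning string if the file exceeds the threshold, else None."""
    try:
        size = os.path.getsize(path)
    except OSError:
        return None  # stdin or an unreadable path: nothing to check
    if size > threshold:
        return ('%s is %.1f GB; this tool buffers its whole input and may '
                'be slow or run out of memory' % (path, size / 1024 ** 3))
    return None
```

A buffering tool would call this once on its input path and print the result to stderr before proceeding.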

jpmckinney (Member) commented Jan 30, 2017

The warning can also mention this from tricks.rst:

If a tool is too slow to be practical for your data, try setting the :code:`--snifflimit` option or using :code:`--no-inference`.

@jpmckinney
Copy link
Member

jpmckinney commented Jan 30, 2017

Grepping for `if .* in`, there are checks for `in column_names`, `in excludes` and `in patterns` that might be faster with sets instead of tuples/lists, as long as order/repetition isn't relevant. Update: none of these occur in loops, so using sets would only make a difference for very wide CSVs.

I think `if column_id < len(row)` in csvcut always passes - need to test on a CSV with a short row. Update: yes, it's needed; I added a test. `len(row)` doesn't need to be recalculated every time, but I don't expect hoisting it would be much of a speed improvement.
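Both micro-optimizations above can be illustrated in a few lines. This is an illustration only, with made-up column names, not csvkit's actual code:

```python
# Membership tests on a set are O(1) on average; on a list/tuple they scan
# linearly. This only pays off for very wide CSVs, as noted above.
column_names = ['id', 'name', 'email'] * 200  # a very wide header (made up)
excludes_list = list(column_names)
excludes_set = set(column_names)  # valid only if order/repetition don't matter

# Both answer the same question; the set does it without scanning.
assert ('email' in excludes_list) == ('email' in excludes_set)

# Hoisting len(row) out of the per-column check, as mentioned for csvcut.
# The short-row guard is still needed: index 5 is out of range here.
row = ['1', 'Alice']
row_length = len(row)  # computed once instead of per column id
kept = [row[i] for i in (0, 1, 5) if i < row_length]
assert kept == ['1', 'Alice']
```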

Anyway, for posterity, noting that the streaming tools probably can't get much faster within csvkit.

fgregg (Contributor) commented Dec 8, 2017

Does csvsql always need to buffer?

I see why it would need to buffer when it is both creating the table and filling it, since that requires two passes over the data. But if it is only doing one of those tasks, it seems like csvsql should be able to stream?

From csvjson, it seems it's an acceptable design for arguments to change the streaming/buffering behavior of a command?

Similarly, many of the statistics calculated by csvstat can be computed by a streaming algorithm.
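As an illustration of that last point, mean and variance are a classic example of statistics that need no buffering: Welford's online algorithm computes both in one pass with constant memory. A sketch (not csvstat's code):

```python
import math

class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def stdev(self):
        return math.sqrt(self.variance)

# Feed values one at a time, as they would arrive from a CSV reader.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(value)
```

Min, max, null counts and sums can be maintained the same way; only statistics like exact medians genuinely require the full data.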

jpmckinney (Member) commented Dec 10, 2017

csvsql has a faster alternative in #735 which should maybe be pursued.

csvjson does stream if you set --stream and --no-inference but don't set --skip-lines.

csvstat buffers because it uses agate, but we can implement some statistics directly in csvkit to avoid buffering.
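For the insert-only csvsql case discussed above, a streaming path could read rows with the csv module and insert them in fixed-size batches, so memory use stays bounded regardless of file size. A sketch against SQLite with hypothetical names, not csvsql's actual implementation:

```python
import csv
import io
import itertools
import sqlite3

def stream_insert(conn, table, fileobj, batch_size=1000):
    """Insert CSV rows into an existing table in fixed-size batches.

    Only one batch is held in memory at a time. (A sketch, not csvsql code;
    assumes the table already exists and matches the CSV's columns.)
    """
    reader = csv.reader(fileobj)
    header = next(reader)
    placeholders = ', '.join('?' * len(header))
    sql = 'INSERT INTO %s VALUES (%s)' % (table, placeholders)
    while True:
        batch = list(itertools.islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(sql, batch)
    conn.commit()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (id TEXT, name TEXT)')
stream_insert(conn, 'people', io.StringIO('id,name\n1,Alice\n2,Bob\n'))
```

With a real file, `io.StringIO` would be replaced by the open file object, and the reader never pulls more than one batch ahead.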

fgregg (Contributor) commented Dec 11, 2017

> csvsql has a faster alternative in #735 which should maybe be pursued.

I use this technique a lot, but it doesn't help much if the pattern of a column shifts at the 6,000,001st row. (I understand that streaming won't really make things faster, but it should help with memory.)

Would y'all be interested in a csvsql that uses streaming when appropriate?

jpmckinney (Member)

I would be, yes!

pixarbuff

When I run csvstat on a 218MB file it fails silently giving no output. When I run it in verbose mode, it gives a memory error. Would it be possible for it to display this memory error while not in verbose mode?

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/csvstat", line 10, in <module>
    sys.exit(launch_new_instance())
  File "/home/ubuntu/.local/lib/python3.6/site-packages/csvkit/utilities/csvstat.py", line 341, in launch_new_instance
    utility.run()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/csvkit/cli.py", line 118, in run
    self.main()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/csvkit/utilities/csvstat.py", line 140, in main
    **self.reader_kwargs
  File "/home/ubuntu/.local/lib/python3.6/site-packages/agate/table/from_csv.py", line 65, in from_csv
    contents = six.StringIO(f.read())
MemoryError

jpmckinney (Member)

I've made a commit to do that - thanks!
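A fix along those lines could catch the MemoryError at the top level of the CLI and print a short message instead of failing silently. A hypothetical wrapper, not the actual commit; csvkit's real fix lives in its own CLI code:

```python
import sys

def run_with_memory_guard(main, verbose=False):
    """Run a CLI entry point, reporting MemoryError even when not verbose.

    Returns 0 on success, 1 on MemoryError (hypothetical convention).
    """
    try:
        main()
    except MemoryError:
        if verbose:
            raise  # let the full traceback print, as before
        sys.stderr.write(
            'MemoryError: the input could not be read into memory. '
            'Re-run with --verbose for a full traceback.\n'
        )
        return 1
    return 0
```

The key point is that the non-verbose path still reports *something* to stderr, rather than exiting with no output.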

lcorbasson pushed a commit to lcorbasson/csvkit that referenced this issue Sep 7, 2020
import Sequence from collections.abc to suppress warning in python 3.…