
Buffering tools: warn if running against extremely large files #737

Open
onyxfish opened this issue Dec 29, 2016 · 10 comments

onyxfish (Collaborator)

Such as #581.

But what's the limit? 100MB? 500MB? 1GB?

@onyxfish onyxfish changed the title Buffering tools: a warning if running against extremely large files Buffering tools: warn if running against extremely large files Dec 29, 2016
@jpmckinney jpmckinney modified the milestone: Easy Dec 29, 2016
@jpmckinney jpmckinney modified the milestone: Easy Jan 30, 2017
jpmckinney (Member) commented Jan 30, 2017

For files larger than 1GB, I don't think we can expect csvkit to ever perform well, so we can at least start there. Some tools will have lower limits. For reference, the buffering tools are listed below:

  • csvjoin
  • csvjson unless --stream and --no-inference are set but --skip-lines is not
  • csvlook
  • csvpy
  • csvsort
  • csvsql
  • csvstat
  • in2csv unless --format is ndjson and --no-inference is set, or --format is csv and --no-inference is set but --skip-lines is not

For the record, issues have been opened about the performance of csvsort (#157, #338, #457, #626), csvsql (#428, #633) and csvstat (#581); #141 discusses performance in general.
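The warning this issue proposes could be as simple as a file-size check before a buffering tool starts reading. A minimal sketch, assuming a 1 GB threshold as suggested above; the `size_warning` helper and its message are hypothetical, not csvkit's API:

```python
import os

# 1 GB threshold, per the suggestion above (an assumption, not a csvkit constant).
LARGE_FILE_BYTES = 1024 ** 3

def size_warning(path, threshold=LARGE_FILE_BYTES):
    """Return a warning string if the file exceeds the threshold, else None."""
    try:
        size = os.path.getsize(path)
    except OSError:
        return None  # stdin or an unreadable path: nothing to check
    if size > threshold:
        return ('%s is %.1f GB; this tool buffers its whole input and may '
                'be slow or run out of memory' % (path, size / 1024 ** 3))
    return None
```

A buffering tool would call this once on its input path and print the result to stderr before proceeding.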

jpmckinney (Member) commented Jan 30, 2017

The warning can also mention this from tricks.rst:

If a tool is too slow to be practical for your data, try setting the :code:`--snifflimit` option or using :code:`--no-inference`.

@jpmckinney
Copy link
Member

jpmckinney commented Jan 30, 2017

Grepping for `if .* in`, there are checks for `in column_names`, `in excludes` and `in patterns` that might be faster with sets instead of tuples/lists, as long as order/repetition isn't relevant. Update: none of these occur in loops, so using sets would only make a difference for very wide CSVs.

I think `if column_id < len(row)` in csvcut always passes - need to test on a CSV with a short row. Update: yes, it's needed; I added a test. `len(row)` doesn't need to be recalculated every time, but I don't expect hoisting it would be much of a speed improvement.
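Both micro-optimizations above can be illustrated in a few lines. This is an illustration only, with made-up column names, not csvkit's actual code:

```python
# Membership tests on a set are O(1) on average; on a list/tuple they scan
# linearly. This only pays off for very wide CSVs, as noted above.
column_names = ['id', 'name', 'email'] * 200  # a very wide header (made up)
excludes_list = list(column_names)
excludes_set = set(column_names)  # valid only if order/repetition don't matter

# Both answer the same question; the set does it without scanning.
assert ('email' in excludes_list) == ('email' in excludes_set)

# Hoisting len(row) out of the per-column check, as mentioned for csvcut.
# The short-row guard is still needed: index 5 is out of range here.
row = ['1', 'Alice']
row_length = len(row)  # computed once instead of per column id
kept = [row[i] for i in (0, 1, 5) if i < row_length]
assert kept == ['1', 'Alice']
```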

Anyway, for posterity, noting that the streaming tools probably can't get much faster within csvkit.

fgregg (Contributor) commented Dec 8, 2017

Does csvsql always need to buffer?

I see why it would need to buffer when it is both creating the table and filling it, since that requires two passes over the data. But if it is only doing one of those tasks, it seems like csvsql should be able to stream?

From csvjson, it seems it's an acceptable design for arguments to change the streaming/buffering behavior of a command?

Similarly, many of the statistics calculated by csvstat can be computed by a streaming algorithm.
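As an illustration of that last point, mean and variance are a classic example of statistics that need no buffering: Welford's online algorithm computes both in one pass with constant memory. A sketch (not csvstat's code):

```python
import math

class RunningStats:
    """Welford's online algorithm: mean and variance in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator)."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def stdev(self):
        return math.sqrt(self.variance)

# Feed values one at a time, as they would arrive from a CSV reader.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.add(value)
```

Min, max, null counts and sums can be maintained the same way; only statistics like exact medians genuinely require the full data.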

jpmckinney (Member) commented Dec 10, 2017

csvsql has a faster alternative in #735 which should maybe be pursued.

csvjson does stream if you set --stream and --no-inference but don't set --skip-lines.

csvstat buffers because it uses agate, but we can implement some statistics directly in csvkit to avoid buffering.
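For the insert-only csvsql case discussed above, a streaming path could read rows with the csv module and insert them in fixed-size batches, so memory use stays bounded regardless of file size. A sketch against SQLite with hypothetical names, not csvsql's actual implementation:

```python
import csv
import io
import itertools
import sqlite3

def stream_insert(conn, table, fileobj, batch_size=1000):
    """Insert CSV rows into an existing table in fixed-size batches.

    Only one batch is held in memory at a time. (A sketch, not csvsql code;
    assumes the table already exists and matches the CSV's columns.)
    """
    reader = csv.reader(fileobj)
    header = next(reader)
    placeholders = ', '.join('?' * len(header))
    sql = 'INSERT INTO %s VALUES (%s)' % (table, placeholders)
    while True:
        batch = list(itertools.islice(reader, batch_size))
        if not batch:
            break
        conn.executemany(sql, batch)
    conn.commit()

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE people (id TEXT, name TEXT)')
stream_insert(conn, 'people', io.StringIO('id,name\n1,Alice\n2,Bob\n'))
```

With a real file, `io.StringIO` would be replaced by the open file object, and the reader never pulls more than one batch ahead.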

fgregg (Contributor) commented Dec 11, 2017

> csvsql has a faster alternative in #735 which should maybe be pursued.

I use this technique a lot, but it doesn't help much if the pattern of a column shifts at the 6,000,001st row. (I understand that streaming won't really make things faster, but it should help with memory.)

Would y'all be interested in a csvsql that uses streaming when appropriate?

jpmckinney (Member)

I would be, yes!

pixarbuff

When I run csvstat on a 218MB file it fails silently giving no output. When I run it in verbose mode, it gives a memory error. Would it be possible for it to display this memory error while not in verbose mode?

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/csvstat", line 10, in <module>
    sys.exit(launch_new_instance())
  File "/home/ubuntu/.local/lib/python3.6/site-packages/csvkit/utilities/csvstat.py", line 341, in launch_new_instance
    utility.run()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/csvkit/cli.py", line 118, in run
    self.main()
  File "/home/ubuntu/.local/lib/python3.6/site-packages/csvkit/utilities/csvstat.py", line 140, in main
    **self.reader_kwargs
  File "/home/ubuntu/.local/lib/python3.6/site-packages/agate/table/from_csv.py", line 65, in from_csv
    contents = six.StringIO(f.read())
MemoryError

jpmckinney (Member)

I've made a commit to do that - thanks!
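A fix along those lines could catch the MemoryError at the top level of the CLI and print a short message instead of failing silently. A hypothetical wrapper, not the actual commit; csvkit's real fix lives in its own CLI code:

```python
import sys

def run_with_memory_guard(main, verbose=False):
    """Run a CLI entry point, reporting MemoryError even when not verbose.

    Returns 0 on success, 1 on MemoryError (hypothetical convention).
    """
    try:
        main()
    except MemoryError:
        if verbose:
            raise  # let the full traceback print, as before
        sys.stderr.write(
            'MemoryError: the input could not be read into memory. '
            'Re-run with --verbose for a full traceback.\n'
        )
        return 1
    return 0
```

The key point is that the non-verbose path still reports *something* to stderr, rather than exiting with no output.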

lcorbasson pushed a commit to lcorbasson/csvkit that referenced this issue Sep 7, 2020
import Sequence from collections.abc to suppress warning in python 3.…