
Partitioned CSV writes with gzip do not add .csv.gz extension and fail to read back in with read_csv_auto #11889

teaguesterling opened this issue Apr 30, 2024 · 0 comments

teaguesterling commented Apr 30, 2024

What happens?

Using COPY (FROM ...) TO 'table.csv.d' (FORMAT 'csv', COMPRESSION 'gzip', PARTITION_BY (col1, col2)); generates individual CSV files that are compressed but do not have the .gz extension. This causes issues with downstream tools that rely on the file extension to determine the compression type.

To Reproduce

Example:

CREATE TABLE test AS VALUES ('a', 'foo', 1), ('a', 'foo', 2), ('a', 'bar', 1), ('b', 'bar', 1);
COPY (FROM test) TO 'data.csv.d' (FORMAT 'csv', COMPRESSION 'gzip', PARTITION_BY ('col0', 'col1'));
FROM read_csv_auto('data.csv.d/*/*/*.csv.gz');  -- Fails
FROM read_csv_auto('data.csv.d/*/*/*.csv');  -- Fails
FROM read_csv_auto('data.csv.d/*/*/*.csv', compression='gzip'); -- Succeeds
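Until the writer appends the extension itself, the partition directory can be fixed up after the COPY completes. A minimal post-processing sketch in Python, assuming the layout produced above; the helper name is made up for illustration, and gzip files are identified by their two magic bytes rather than by name:

```python
import pathlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def add_gz_extension(root: str) -> list[pathlib.Path]:
    """Rename gzip-compressed *.csv files under root to *.csv.gz."""
    renamed = []
    # Materialize the listing first so renames don't disturb the walk.
    for path in list(pathlib.Path(root).rglob("*.csv")):
        with open(path, "rb") as f:
            if f.read(2) == GZIP_MAGIC:
                target = path.with_name(path.name + ".gz")
                path.rename(target)
                renamed.append(target)
    return renamed
```

After running this over data.csv.d, the first read_csv_auto('data.csv.d/*/*/*.csv.gz') form above should work without an explicit compression argument.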

Output:

v0.10.2 978c20f
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D CREATE TABLE test AS VALUES ('a', 'foo', 1), ('a', 'foo', 2), ('a', 'bar', 1), ('b', 'bar', 1);
D FROM test;
┌─────────┬─────────┬───────┐
│  col0   │  col1   │ col2  │
│ varchar │ varchar │ int32 │
├─────────┼─────────┼───────┤
│ a       │ foo     │     1 │
│ a       │ foo     │     2 │
│ a       │ bar     │     1 │
│ b       │ bar     │     1 │
└─────────┴─────────┴───────┘
D COPY (FROM test) TO 'data.csv.d' (FORMAT 'csv', COMPRESSION 'gzip', PARTITION_BY ('col0', 'col1'));
D 
D -- Did not include FROM read_csv_auto('data.csv.d/*/*/*.csv.gz'); case
[1]+  Stopped                 duckdb
~/test$ find . data.csv.d/
.
./data.csv.d
./data.csv.d/col0=a
./data.csv.d/col0=a/col1=foo
./data.csv.d/col0=a/col1=foo/data_0.csv
./data.csv.d/col0=a/col1=bar
./data.csv.d/col0=a/col1=bar/data_0.csv
./data.csv.d/col0=b
./data.csv.d/col0=b/col1=bar
./data.csv.d/col0=b/col1=bar/data_0.csv
data.csv.d/
data.csv.d/col0=a
data.csv.d/col0=a/col1=foo
data.csv.d/col0=a/col1=foo/data_0.csv
data.csv.d/col0=a/col1=bar
data.csv.d/col0=a/col1=bar/data_0.csv
data.csv.d/col0=b
data.csv.d/col0=b/col1=bar
data.csv.d/col0=b/col1=bar/data_0.csv
~/test$ file ./data.csv.d/col0=a/col1=foo/data_0.csv
./data.csv.d/col0=a/col1=foo/data_0.csv: gzip compressed data
~/test$ fg
D
D FROM read_csv_auto('data.csv.d/*/*/*.csv');
Invalid Input Error: CSV Error on Line: 1
Invalid unicode (byte sequence mismatch) detected.

Possible Solution: Enable ignore errors (ignore_errors=true) to skip this row

  file=data.csv.d/col0=a/col1=bar/data_0.csv
  delimiter = , (Auto-Detected)
  quote = " (Auto-Detected)
  escape = \0 (Auto-Detected)
  new_line = Single-Line File (Auto-Detected)
  header = false (Auto-Detected)
  skip_rows = 0 (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding=0
  sample_size=20480
  ignore_errors=false
  all_varchar=0


D FROM read_csv_auto('data.csv.d/*/*/*.csv', compression='gzip');
┌─────────┬─────────┬───────┐
│  col0   │  col1   │ col2  │
│ varchar │ varchar │ int64 │
├─────────┼─────────┼───────┤
│ a       │ bar     │     1 │
│ a       │ foo     │     1 │
│ a       │ foo     │     2 │
│ b       │ bar     │     1 │
└─────────┴─────────┴───────┘
D 
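As the transcript shows, the compression is only discoverable from file content (the gzip magic bytes that the file utility keys on), not from the filename. A minimal Python sketch of that distinction, using only the standard library; the helper names are illustrative, not part of DuckDB:

```python
import gzip
import io

def sniff_is_gzip(data: bytes) -> bool:
    """Content-based check: does the stream start with the gzip magic number?"""
    return data[:2] == b"\x1f\x8b"

def read_text_auto(data: bytes) -> str:
    """Decompress if the content is gzipped, otherwise decode as-is.

    The filename (and hence its extension) plays no role here, which is
    why the mislabeled data_0.csv files are still readable once the
    compression is known.
    """
    if sniff_is_gzip(data):
        return gzip.GzipFile(fileobj=io.BytesIO(data)).read().decode()
    return data.decode()
```

An extension-only reader, by contrast, would treat data_0.csv as plain text and hit exactly the "Invalid unicode" error shown above.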

OS:

x64

DuckDB Version:

0.10.2

DuckDB Client:

CLI

Full Name:

Teague Sterling

Affiliation:

23andMe

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@teaguesterling teaguesterling changed the title Partitioned CSV Writes with gzip Does Not Add .csv.gz Extension and Fails to Read Back in with read_csv_auto Partitioned CSV writes with gzip do not add .csv.gz extension and fail to read back in with read_csv_auto Apr 30, 2024