
Partitioned CSV writes with gzip do not add .csv.gz extension and fail to read back in with read_csv_auto #11889

teaguesterling opened this issue Apr 30, 2024 · 0 comments

teaguesterling commented Apr 30, 2024

What happens?

Using COPY (FROM ...) TO 'table.csv.d' (FORMAT 'csv', COMPRESSION 'gzip', PARTITION_BY (col1, col2)); generates individual CSV files that are compressed but do not have the .gz extension. This causes issues with downstream tools that rely on the file extension to determine the compression type.

To Reproduce

Example:

CREATE TABLE test AS VALUES ('a', 'foo', 1), ('a', 'foo', 2), ('a', 'bar', 1), ('b', 'bar', 1);
COPY (FROM test) TO 'data.csv.d' (FORMAT 'csv', COMPRESSION 'gzip', PARTITION_BY ('col0', 'col1'));
FROM read_csv_auto('data.csv.d/*/*/*.csv.gz');  -- Fails
FROM read_csv_auto('data.csv.d/*/*/*.csv');  -- Fails
FROM read_csv_auto('data.csv.d/*/*/*.csv', compression='gzip'); -- Succeeds
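Until the writer appends the extension itself, the partition directory can be fixed up after the COPY completes. A minimal post-processing sketch in Python, assuming the layout produced above; the helper name is made up for illustration, and gzip files are identified by their two magic bytes rather than by name:

```python
import pathlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def add_gz_extension(root: str) -> list[pathlib.Path]:
    """Rename gzip-compressed *.csv files under root to *.csv.gz."""
    renamed = []
    # Materialize the listing first so renames don't disturb the walk.
    for path in list(pathlib.Path(root).rglob("*.csv")):
        with open(path, "rb") as f:
            if f.read(2) == GZIP_MAGIC:
                target = path.with_name(path.name + ".gz")
                path.rename(target)
                renamed.append(target)
    return renamed
```

After running this over data.csv.d, the first read_csv_auto('data.csv.d/*/*/*.csv.gz') form above should work without an explicit compression argument.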

Output:

v0.10.2 978c20f
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D CREATE TABLE test AS VALUES ('a', 'foo', 1), ('a', 'foo', 2), ('a', 'bar', 1), ('b', 'bar', 1);
D FROM test;
┌─────────┬─────────┬───────┐
│  col0   │  col1   │ col2  │
│ varchar │ varchar │ int32 │
├─────────┼─────────┼───────┤
│ a       │ foo     │     1 │
│ a       │ foo     │     2 │
│ a       │ bar     │     1 │
│ b       │ bar     │     1 │
└─────────┴─────────┴───────┘
D COPY (FROM test) TO 'data.csv.d' (FORMAT 'csv', COMPRESSION 'gzip', PARTITION_BY ('col0', 'col1'));
D 
D -- Did not include FROM read_csv_auto('data.csv.d/*/*/*.csv.gz'); case
[1]+  Stopped                 duckdb
~/test$ find . data.csv.d/
.
./data.csv.d
./data.csv.d/col0=a
./data.csv.d/col0=a/col1=foo
./data.csv.d/col0=a/col1=foo/data_0.csv
./data.csv.d/col0=a/col1=bar
./data.csv.d/col0=a/col1=bar/data_0.csv
./data.csv.d/col0=b
./data.csv.d/col0=b/col1=bar
./data.csv.d/col0=b/col1=bar/data_0.csv
data.csv.d/
data.csv.d/col0=a
data.csv.d/col0=a/col1=foo
data.csv.d/col0=a/col1=foo/data_0.csv
data.csv.d/col0=a/col1=bar
data.csv.d/col0=a/col1=bar/data_0.csv
data.csv.d/col0=b
data.csv.d/col0=b/col1=bar
data.csv.d/col0=b/col1=bar/data_0.csv
~/test$ file ./data.csv.d/col0=a/col1=foo/data_0.csv
./data.csv.d/col0=a/col1=foo/data_0.csv: gzip compressed data
~/test$ fg
D
D FROM read_csv_auto('data.csv.d/*/*/*.csv');
Invalid Input Error: CSV Error on Line: 1
Invalid unicode (byte sequence mismatch) detected.

Possible Solution: Enable ignore errors (ignore_errors=true) to skip this row

  file=data.csv.d/col0=a/col1=bar/data_0.csv
  delimiter = , (Auto-Detected)
  quote = " (Auto-Detected)
  escape = \0 (Auto-Detected)
  new_line = Single-Line File (Auto-Detected)
  header = false (Auto-Detected)
  skip_rows = 0 (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding=0
  sample_size=20480
  ignore_errors=false
  all_varchar=0


D FROM read_csv_auto('data.csv.d/*/*/*.csv', compression='gzip');
┌─────────┬─────────┬───────┐
│  col0   │  col1   │ col2  │
│ varchar │ varchar │ int64 │
├─────────┼─────────┼───────┤
│ a       │ bar     │     1 │
│ a       │ foo     │     1 │
│ a       │ foo     │     2 │
│ b       │ bar     │     1 │
└─────────┴─────────┴───────┘
D 
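As the transcript shows, the compression is only discoverable from file content (the gzip magic bytes that the file utility keys on), not from the filename. A minimal Python sketch of that distinction, using only the standard library; the helper names are illustrative, not part of DuckDB:

```python
import gzip
import io

def sniff_is_gzip(data: bytes) -> bool:
    """Content-based check: does the stream start with the gzip magic number?"""
    return data[:2] == b"\x1f\x8b"

def read_text_auto(data: bytes) -> str:
    """Decompress if the content is gzipped, otherwise decode as-is.

    The filename (and hence its extension) plays no role here, which is
    why the mislabeled data_0.csv files are still readable once the
    compression is known.
    """
    if sniff_is_gzip(data):
        return gzip.GzipFile(fileobj=io.BytesIO(data)).read().decode()
    return data.decode()
```

An extension-only reader, by contrast, would treat data_0.csv as plain text and hit exactly the "Invalid unicode" error shown above.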

OS:

x64

DuckDB Version:

0.10.2

DuckDB Client:

CLI

Full Name:

Teague Sterling

Affiliation:

23andMe

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@teaguesterling teaguesterling changed the title Partitioned CSV Writes with gzip Does Not Add .csv.gz Extension and Fails to Read Back in with read_csv_auto Partitioned CSV writes with gzip do not add .csv.gz extension and fail to read back in with read_csv_auto Apr 30, 2024