CSV Imports with Regex for the read_csv `columns` Parameter #11909

earlev4 · 2024-05-02T19:41:15Z

earlev4
May 2, 2024

Hi! I know there has been excellent progress with implementing regex with SELECT COLUMNS. I'm exploring the possibility of using regex with the columns parameter in read_csv to handle data types dynamically.

Current approach:

SELECT * FROM read_csv('flights.csv',
    delim = '|',
    header = true,
    columns = {
        'FlightDate': 'DATE',
        'UniqueCarrier': 'VARCHAR',
        'OriginCityName': 'VARCHAR',
        'DestCityName': 'VARCHAR'
    });

Proposed approach:

SELECT * FROM read_csv('flights.csv',
    delim = '|',
    header = true,
    columns = {
        'FlightDate': 'DATE',
        'UniqueCarrier': 'VARCHAR',
        '.*Name.*': 'VARCHAR'
    });

Issue:
To my understanding, the read_csv function requires manually specifying data types for each column when inconsistencies are expected across multiple CSV files. Manually defining each column's data type in the CLI can be cumbersome (repeatedly specifying each column and data type), especially with large datasets like the Backblaze dataset (~90 CSVs, one for each day in a quarter), which features ~148 columns prefixed with smart_ (e.g., smart_1_normalized, smart_1_raw, ... smart_255_normalized, smart_255_raw). Although the desired data type for the smart_ columns is BIGINT, it sometimes varies between BIGINT and VARCHAR depending on whether values or nulls are encountered in the samples.

Challenge:
The read_csv function determines the data type from the first CSV file read in a glob operation. For instance, even with sample_size=-1, the Q1 2022 dataset infers smart_175_normalized as VARCHAR, whereas the Q4 2023 dataset infers it as BIGINT. This inconsistency presents challenges in achieving data type uniformity across different datasets.

Possible Solutions:

Regex columns Parameter: Implement regex to dynamically match patterns in column names when assigning data types (e.g., read_csv('/path/*.csv', columns = {'.*Name.*': 'VARCHAR'}).
Sampling Multiple Files: Enhance the CSV sniffer to sample across multiple files and specify file count (e.g., read_csv('/path/*.csv', sample_files=3, sample_size=-1). A limitation is that this approach does not guarantee the CSV sniffer will encounter the desired values. This might be helpful in different scenarios.
Specifying a Sniff File: Allow users to specify a particular file for the sniffer to evaluate (e.g., read_csv('/path/*.csv', sniff_file='/path/myfile.csv', sample_size=-1). A limitation is that this approach requires knowing which file most represents the data types. Again, this might be helpful in different scenarios.

Enabling regex for the columns parameter during read_csv would provide a more robust solution by ensuring consistent data types without manually specifying each column.

To summarize, some desired options:

Use regex with the columns parameter
Scan multiple CSV files with the CSV sniffer
Specify a file for the CSV sniffer to scan

Thank you very much for your consideration! I am always appreciative and grateful for the excellent work by the DuckDB community.

pdet · 2024-05-02T21:30:51Z

pdet
May 2, 2024
Collaborator

Hi @earlev4, Is there any other system that implements the regex solution you proposed? At first glance, it sounds pretty specific to the problem you are having.

I think the most generic solution would be adding an extra option, sample_files=1, which would sample the bind types on multiple files, as you described.

We already do multiple runs of the sniffer outside the binding phase. A more complete solution would be to do this in parallel on all files while binding, but that's also a bigger change.

0 replies

earlev4 · 2024-05-03T20:46:07Z

earlev4
May 3, 2024
Author

Hi @pdet! Great to hear from you, Pedro! I sincerely appreciate your response and insight!

Currently, I am not aware of any other systems that implement the proposed regex solution. However, I implemented a workaround using PyArrow. By utilizing PyArrow's dataset, I lazily scan the CSV files in a directory to extract the schema. Columns that match the smart pattern are identified using regex and mapped to int64 within a dictionary. This dictionary is then used to set up ConvertOptions, which configures CsvFileFormat. The configured CsvFileFormat is then used to perform another lazy scan of the CSV files, this time applying the defined column types. Finally, the dataset is written to Parquet files. However, I am exploring ways to achieve this in DuckDB (CLI) without Python or dependencies like PyArrow.

I agree; I think the most generic solution is adding an extra option: sample_files=1. I see benefits to this approach for other situations, not just the one I am describing. Given the larger sample size of files, I think that approach should capture the correct schema. The multiple runs in parallel on all files while binding sounds very interesting and exciting!

Thank you very much!

BTW, I thoroughly enjoyed your talk on CSV parsing!

2 replies

pdet May 6, 2024
Collaborator

Hi, @earlev4. Thanks for the kind words, and I'm happy to hear you enjoyed the talk :-) !

I will most likely add this option after the next minor release. After some further consideration, even parallel sniffing wouldn't help much here since it would be focused on the union_by_name options.

earlev4 May 7, 2024
Author

Hi @pdet! The kind words are well deserved! 😄

Thank you very much for considering this option and moving forward with implementation. I am very grateful! I can't wait for it to be released.

Thanks again! I appreciate you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CSV Imports with Regex for the read_csv `columns` Parameter #11909

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

CSV Imports with Regex for the read_csv columns Parameter #11909

earlev4 May 2, 2024

Replies: 2 comments · 2 replies

pdet May 2, 2024 Collaborator

earlev4 May 3, 2024 Author

pdet May 6, 2024 Collaborator

earlev4 May 7, 2024 Author

CSV Imports with Regex for the read_csv `columns` Parameter #11909

earlev4
May 2, 2024

Replies: 2 comments 2 replies

pdet
May 2, 2024
Collaborator

earlev4
May 3, 2024
Author

pdet May 6, 2024
Collaborator

earlev4 May 7, 2024
Author