-
Hi @earlev4, Is there any other system that implements the regex solution you proposed? At first glance, it sounds pretty specific to the problem you are having. I think the most generic solution would be adding an extra option. We already do multiple runs of the sniffer outside the binding phase; a more complete solution would be to run the sniffer on all files in parallel while binding, but that's also a bigger change.
-
Hi @pdet! Great to hear from you, Pedro! I sincerely appreciate your response and insight! Currently, I am not aware of any other systems that implement the proposed regex solution. However, I implemented a workaround using PyArrow: by utilizing PyArrow's CSV reader, I can pin the column types explicitly before handing the data to DuckDB. I agree; I think the most generic solution is adding an extra option. Thank you very much! BTW, I thoroughly enjoyed your talk on CSV parsing!
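Roughly, that workaround looks like the sketch below (the file name and the hardcoded `smart_` ID range are illustrative, not my exact pipeline):

```python
import duckdb
import pyarrow as pa
import pyarrow.csv as pacsv

# Pin every smart_* column to int64 so PyArrow never has to choose
# between integer and string across files. Illustrative: real code
# would derive the names from the actual header.
column_types = {}
for i in range(1, 256):
    column_types[f"smart_{i}_normalized"] = pa.int64()
    column_types[f"smart_{i}_raw"] = pa.int64()

arrow_table = pacsv.read_csv(
    "2022-01-01.csv",  # one file; loop over the glob for all ~90
    convert_options=pacsv.ConvertOptions(column_types=column_types),
)

# DuckDB can query the Arrow table directly via replacement scans.
duckdb.sql("SELECT count(*) FROM arrow_table").show()
```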
-
Hi! I know there has been excellent progress with implementing regex with `SELECT COLUMNS`. I'm exploring the possibility of using regex with the `columns` parameter in `read_csv` to handle data types dynamically.

**Current approach:** manually spell out every column name and its type in the `columns` struct.

**Proposed approach:** allow the keys of the `columns` struct to be regex patterns matched against the header. Both are sketched below.
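Concretely (the paths and the non-`smart_` column names are illustrative, and the second snippet is the proposed syntax, not something DuckDB accepts today):

```python
import duckdb

# Current approach: every column and its type spelled out by hand.
# With `columns`, the struct must cover every column in the file,
# since it replaces auto-detection entirely.
duckdb.sql("""
    SELECT *
    FROM read_csv('/path/*.csv', columns = {
        'date': 'DATE',
        'serial_number': 'VARCHAR',
        'smart_1_normalized': 'BIGINT',
        'smart_1_raw': 'BIGINT'
        -- ...and so on, one entry per column, ~148 smart_* entries
    })
""")

# Proposed approach (NOT valid today): a regex key that covers
# every smart_* column in one entry.
#
#   SELECT *
#   FROM read_csv('/path/*.csv', columns = {'smart_.*': 'BIGINT'})
```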
**Issue:**

To my understanding, the `read_csv` function requires manually specifying data types for each column when inconsistencies are expected across multiple CSV files. Manually defining each column's data type in the CLI can be cumbersome (repeatedly specifying each column and data type), especially with large datasets like the Backblaze dataset (~90 CSVs, one for each day in a quarter), which features ~148 columns prefixed with `smart_` (e.g., `smart_1_normalized`, `smart_1_raw`, ..., `smart_255_normalized`, `smart_255_raw`). Although the desired data type for the `smart_` columns is `BIGINT`, the inferred type sometimes varies between `BIGINT` and `VARCHAR` depending on whether values or nulls are encountered in the samples. At that scale, the column specification realistically has to be generated programmatically (a sketch follows).
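A minimal sketch of building the full `columns` struct in a script; the fixed columns and the `smart_` ID range are illustrative, since the real files contain a specific subset of IDs:

```python
import duckdb

# Known fixed columns (illustrative subset of the Backblaze schema).
columns = {
    "date": "DATE",
    "serial_number": "VARCHAR",
    "model": "VARCHAR",
    "capacity_bytes": "BIGINT",
    "failure": "BIGINT",
}

# Pin every smart_* column to BIGINT. Illustrative range: the real
# files contain ~148 smart_* columns with IDs between 1 and 255.
for i in range(1, 256):
    columns[f"smart_{i}_normalized"] = "BIGINT"
    columns[f"smart_{i}_raw"] = "BIGINT"

# A Python dict's repr happens to be valid DuckDB struct syntax.
duckdb.sql(f"SELECT * FROM read_csv('/path/*.csv', columns = {columns})")
```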
**Challenge:**

The `read_csv` function determines the data types from the first CSV file read in a glob operation. For instance, even with `sample_size=-1`, the Q1 2022 dataset infers `smart_175_normalized` as `VARCHAR`, whereas the Q4 2023 dataset infers it as `BIGINT`. This inconsistency makes it difficult to achieve data type uniformity across different datasets (a per-file check is sketched below).
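For anyone who wants to see the inconsistency directly, sniffing each file individually makes it visible; a quick check along these lines (path is illustrative):

```python
import duckdb
from glob import glob

# Ask the sniffer for its inferred type of one problematic column,
# file by file, instead of letting the glob settle on the first file.
for path in sorted(glob("/path/*.csv")):
    row = duckdb.sql(
        f"DESCRIBE SELECT smart_175_normalized "
        f"FROM read_csv('{path}', sample_size = -1)"
    ).fetchone()
    print(path, row[1])  # column_type flips between VARCHAR and BIGINT
```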
**Possible Solutions:**

1. Regex in the `columns` parameter: implement regex to dynamically match patterns in column names when assigning data types, e.g., `read_csv('/path/*.csv', columns = {'.*Name.*': 'VARCHAR'})`.
2. A `sample_files` option: `read_csv('/path/*.csv', sample_files=3, sample_size=-1)`. A limitation is that this approach does not guarantee the CSV sniffer will encounter the desired values, but it might still be helpful in other scenarios.
3. A `sniff_file` option: `read_csv('/path/*.csv', sniff_file='/path/myfile.csv', sample_size=-1)`. A limitation is that this approach requires knowing which file best represents the data types; again, it might be helpful in other scenarios.

Enabling regex for the `columns` parameter during `read_csv` would provide a more robust solution by ensuring consistent data types without manually specifying each column.

To summarize, some desired options:
- Regex support in the `columns` parameter (a user-side approximation is sketched below)
- A `sample_files` option
- A `sniff_file` option

Thank you very much for your consideration! I am always appreciative and grateful for the excellent work by the DuckDB community.