Add field delimiter detection #44

PoshAlpaca · 2022-03-04T13:38:43Z

Description

Adds the ability to automatically infer the CSV file's field delimiter. To use field delimiter detection, specify nil for the field delimiter when configuring CSVReader:

let configuration = CSVReader.Configuration()
configuration.delimiters = (field: nil, row: "\n")

Checklist

The following list must only be fulfilled by code-changing PRs. If you are making changes on the documentation, ignore these.

Include in-code documentation at the top of the property/function/structure/class (if necessary).
Merge to develop.
Add to existing tests or create new tests (if necessary).

Progress

Change API of Delimiter.Field to more explicitly support inference and allow for a custom list of possible delimiters.
One option could look like this:

// infer field delimiter from a list of common delimiters, e.g. ",", ";", "\t"
configuration.delimiters = (field: .infer, row: "\n")
// infer field delimiter from a list of custom delimiters
configuration.delimiters = (field: .infer(options: [",", ":"]), row: "\n")

Read CSV data incrementally until delimiters can be inferred with a high degree of confidence.

PoshAlpaca · 2022-03-04T14:36:11Z

Hi @dehesa,

I finally got around to opening the PR. This is just a draft at this point but I wanted to hear your thoughts before getting too carried away ;)

The detection method comes from CleverCSV and is also described in this paper. It consists of two parts:

Analysis of row patterns (implemented): the correct delimiter will produce the same pattern for every row. In these row patterns C represents a cell (field) and D stands for a field delimiter.
```
                     ","         ";"
1,foo,bar            CDCDC       C
2,funny ;),baz       CDCDC       CDC
```
Analysis of field types (not implemented yet): the correct delimiter will produce more fields with sensible, recognizable types, e.g. dates, integers, UUIDs etc., resulting in a higher ratio of recognized types to total fields. This is particularly useful when two different delimiters both produce reasonable row patterns, like in the example below.
```
                     ","      ";"
1,funny ;),bar       1 0 0    0 0
2,funny ;),baz       1 0 0    0 0
                     = 2/6    = 0/4
```

I think the analysis of field types could also be useful for header detection (the header row, if present, should contain only string fields while other rows could contain other types of data).

A few questions that came up:

Should we support detecting a limited set of field and row delimiters? This would obviously make the task much easier.
How much of the CSV file should we look at to make the guess (I currently hard-coded 50 Unicode scalars)? We could also dynamically read more and more until a good guess can be made. Also, when the user specifies a row delimiter we could say we read a certain number of rows.

dehesa · 2022-03-04T23:55:47Z

Hey @PoshAlpaca,

Great stuff! I am enjoying reading the paper and skimming through the CleverCSV library. Thank you!

For what I can gather, you will focus on inferring the field delimiters, but not the row delimiters (or both), right? Given those constraints, I quite like your dialect detection and pattern score calculations (although I should check it with more detail). Here go some thoughts:

Row pattern analysis seems great and already worth to have it by itself.
Field type analysis would be awesome to have, but it can be an iterative addition to further refine/validate the results of the row pattern analysis. It seems hard to get it right.
We can start supporting a limited set of fields delimiters and later on add full inferral. Maybe we can rework the Delimiter.Field configuration type to, instead of holding an array of scalars, contain an enum specifying whether the delimiter is given or it is supposed to be inferred. If it is supposed to be inferred, we can give a further option to choose on a user selected possible delimiters, a library given common delimiters, or full-inferral.
50 unicode scalars feels low. It would be better to start reading till we have some level of confidence that the inferral was successfull or we reach a maximum never of characters in which case we throw an error (maybe 1000 characters).
All the characters that are read during the inferral should be stored in a buffer. Some input sources are read-once sequences (such as the stdin).
Also keep in mind that some CSV contain multiple header files or white lines to separate headers from content.

Small code nitpicks:

The library uses 2 spaces indentation.
Please call self explicitly.

I know this are a lot of cases/situations to handle. So you can start small and grow from there. The code already looks great!

PoshAlpaca · 2022-03-06T11:33:55Z

Hi @dehesa,

Thanks for the feedback and the thoughts!
I wanted to start with detecting field delimiters because that's what the CleverCSV library also does. However I'd definitely also want to add row delimiter detection and header detection as it all goes more or less hand-in-hand.

I've started working on field type analysis already because I was curious to try it out. I was thinking it would actually make sense to do row patterns and field types all in one pass where the CSV data is converted into an abstract representation. And then one can do different calculations and inference on that abstraction, e.g. delimiter detection, header detection etc.
The CleverCSV library does two passes, one for row patterns and one for field types, which seems inefficient.

I like the idea of reworking the API as you've suggested, having the user provide possible delimiters and otherwise working with sensible defaults.

The incremental reading definitely makes sense! I'll have a look at how this could work in a bit.
I'm storing all the read data in the CSVReader's _scalarBuffer property which is passed into CSVReader.inferDelimiters. That's how it's meant to be used, right?

For the multiple header lines and empty lines I'll think about how to best handle these. Currently, when there are multiple header lines, the user has to choose one of them using the new lineNumber header strategy, right?

PoshAlpaca · 2022-03-06T14:26:04Z

I've added a suggestion for the API to the PR description. What do you think of it?
I'm not quite sure about the parameter name in .infer(options:). Maybe .infer(choices:) or something else could be better 🤔
Also, what should we do with the nil-literal API? We could interpret it as .infer. But maybe it should be deprecated?

Add field delimiter detection using row patterns

fc82811

PoshAlpaca changed the title ~~Add field delimiter detection using row patterns~~ Add field delimiter detection Mar 4, 2022

Nikolai Guyot added 2 commits March 6, 2022 11:41

Call explicitly

9923aa8

Fix indentation: spaces instead of tabs

053149b

PoshAlpaca mentioned this pull request Mar 7, 2022

Give CSVWriter.Configuration its own delimiter type which has no inference API #45

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add field delimiter detection #44

Add field delimiter detection #44

PoshAlpaca commented Mar 4, 2022 •

edited

PoshAlpaca commented Mar 4, 2022

dehesa commented Mar 4, 2022

PoshAlpaca commented Mar 6, 2022

PoshAlpaca commented Mar 6, 2022

Add field delimiter detection #44

Are you sure you want to change the base?

Add field delimiter detection #44

Conversation

PoshAlpaca commented Mar 4, 2022 • edited

Description

Checklist

Progress

PoshAlpaca commented Mar 4, 2022

dehesa commented Mar 4, 2022

PoshAlpaca commented Mar 6, 2022

PoshAlpaca commented Mar 6, 2022

PoshAlpaca commented Mar 4, 2022 •

edited