Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

send spec_*() to vroom #1436

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from
Draft

send spec_*() to vroom #1436

wants to merge 2 commits into from

Conversation

sbearrows
Copy link
Contributor

Fixes #1387

Previously, spec_*() was being routed to readr ed1 even when using ed2. This meant that duplicate name repair was not consistent between spec_*() and the equivalent read_*() function.

@sbearrows
Copy link
Contributor Author

I've only implemented this for spec_csv() so we can iron things out first.

guess_max = 1000,
name_repair = "unique",
num_threads = readr_threads(),
progress = show_progress(),
show_col_types = should_show_types(),
skip_empty_rows = TRUE,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since n_max will always = guess_max I've removed it from the function arguments. I also removed show_col_types() because otherwise the column specification would print twice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I first looked at spec() because of this issue, I was sort of surprised at how many arguments there are.

So, yes, I would definitely take a hard look at whether all of them are actually being used / make sense. OTOH, especially, since readr 2e is going to basically parse the file to get the spec, it does makes sense that almost all of the usual arguments are here.

@sbearrows sbearrows changed the title send spec_csv() to vroom send spec_*() to vroom Sep 13, 2022
Copy link
Member

@jennybc jennybc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a faithful implementation of the proposal in #1387 (comment):

spec_csv_vroom <- function(..., guess_max = 1000) {
  tmp <- readr::read_csv(..., n_max = guess_max, guess_max = guess_max)
  spec(tmp)
}

It looks like the right first move.

@sbearrows
Copy link
Contributor Author

Carrying a conversation about guessing behaviors in read_*() vs spec_*() over from #1431

One consequence of doing the proposal from #1387 is the following:

Since guess_max doesn't guess column types using rows it will never read, if we complete this PR as intended then n_max = guess_max and guessing for spec_*() will never be done via selecting rows "interspersed throughout the file". For example, if some data has 2000 rows and guess_max = 1000 then n_max = 1000 and it will guess using the first 1000 rows in the data. It'll always use all of the data it is given to guess. Here is an example using the challenge.csv where the column y is filled with NA until row 1001, after which it becomes filled with dates.

# after we complete #1436
spec_csv(readr_example("challenge.csv"), guess_max = 1000) # n_max = 1000
#> cols(
#>   x = col_double(),
#>   y = col_logical()
#> )
# increasing guess_max does change the spec
spec_csv(readr_example("challenge.csv"), guess_max = 1001) # n_max = 1001
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )

# compared to how guessing works in read_csv
# since it's dispersed throughout the file, it guesses correctly
spec(read_csv(readr_example("challenge.csv"), guess_max = 1000, show_col_types = FALSE)) # n_max = all the data
#> cols(
#>   x = col_double(),
#>   y = col_date(format = "")
#> )

Not necessarily a non-starter, but food for thought.

@jennybc
Copy link
Member

jennybc commented Sep 21, 2022

More observations:

  • spec_*() probably just has to be documented as being considerably less useful in readr 2e vs 1e. The 1e design made it possible to visit "cells" for guessing without parsing them into a receptacle, whereas 2e does not. So we're going to be making some compromises. I still suspect we should set n_max to guess_max and that guess_max should default to 1000. And document the problem with that. And the user can always set guess_max to something higher or to Inf.
  • I think (hope?) that the big win here is having the result of spec_*() and the result of read_*() be consistent with respect to column names.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Name repair for duplicated columns inconsistent between read_csv and spec_csv
2 participants