Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compare header and schema #127

Open
6 tasks
peterdesmet opened this issue Mar 21, 2023 · 2 comments · May be fixed by #146
Open
6 tasks

Compare header and schema #127

peterdesmet opened this issue Mar 21, 2023 · 2 comments · May be fixed by #146
Assignees
Labels
enhancement New feature or request function:read_resource Function read_resource()
Milestone

Comments

@peterdesmet
Copy link
Member

peterdesmet commented Mar 21, 2023

It is possible for an (invalid) Data Package to have discrepancies between the schema and the actual data. E.g. defining more/less columns or in a different order. read_resource() will silently let those through when the data types of the switched columns are compatible, which can lead to issues for the user (e.g. lat/lon are silently switched). Only when the data types are incompatible, will readr return a parsing issue.

To avoid passing these issues silently, read_resource() should compare the headers of the file with the schema and raise an error if those are not exactly the same. This implements the following spec:

The field descriptor MUST contain a name property. This property SHOULD correspond to the name of field/column in the data file (if it has a name). As such it SHOULD be unique (though it is possible, but very bad practice, for the data file to have multiple columns with the same name). name SHOULD NOT be considered case sensitive in determining uniqueness. However, since it should correspond to the name of the field in the data file it may be important to preserve case.

Implementation considerations:

  • Only compare when replace_null(dialect$header, TRUE) (i.e. it is not false). It might be useful to define dialect_header and reuse it here:

    skip = ifelse(replace_null(dialect$header, TRUE), 1, 0),

  • The specs say that case should NOT be considered, so both the field names and col_names should be lowercased before comparing

  • To allow comparison, the header line of the file should be read separately from the main read_delim(). read_lines() could be used, but delim and encoding/locale might have to be passed too.

  • A resource can contain multiple files (e.g. observations_1, observations_2). Either all files are read and compared or only the last once, cf. add_resource():

    The last file will be read with readr::read_delim() to create or compare with schema and to set format, mediatype and encoding. The other files are ignored, but are expected to have the same structure and properties.

  • On a mismatch (fieldnames, different order, more or less), an error should be returned, similar to check_schema():

    msg = glue::glue(
    "Field names in `schema` must match column names in data:",
    "\u2139 Field names: `{field_names_collapse}`",
    "\u2139 Column names: `{col_names_collapse}`",
    .sep = "\n"

  • Add a section validation to explain what we validate:

    #' @section Validation:
    #' Full validation is not supported.
    #' Something about validation issues
    #' Something about header compare
    
@peterdesmet peterdesmet added the enhancement New feature or request label Mar 21, 2023
@peterdesmet peterdesmet added this to the 1.1.0 milestone Mar 21, 2023
@PietrH
Copy link
Member

PietrH commented Mar 22, 2023

Some questions:

Multipart resources

  • What about multipart resources, should all parts of the resource be checked? Or just the first/last one?
  • For multipart resources, will they always either all have a header, or none of them? Or is it possible for example only the first resource has a header?

Naming

What would be a good argument name to toggle this comparison/check?

  • check_header = TRUE
  • compare_header = TRUE
  • check_fields = TRUE

Default behavior

I assume that read_resource() should be default not compare the header and the schema?

@peterdesmet
Copy link
Member Author

  • Multipart resources: to increase performance (especially when reading over URL) I'd be fine with the last file being read.
  • A header or not is defined at resource level, meaning all files should comply.
  • I would not add a parameter in read_resource, but always include this check. It is a recommended part of the specs: This property SHOULD correspond ...

PietrH added a commit that referenced this issue Aug 3, 2023
@PietrH PietrH linked a pull request Aug 3, 2023 that will close this issue
@peterdesmet peterdesmet added the function:read_resource Function read_resource() label Aug 12, 2023
@peterdesmet peterdesmet modified the milestones: 1.1.0, 1.2.0 Mar 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request function:read_resource Function read_resource()
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants