It is possible for an (invalid) Data Package to have discrepancies between the schema and the actual data, e.g. the schema defines more or fewer columns than the file, or lists them in a different order. read_resource() will silently let those through when the data types of the switched columns are compatible, which can lead to issues for the user (e.g. lat/lon are silently switched). Only when the data types are incompatible will readr return a parsing issue.
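A minimal base R analogue of the problem, not read_resource() itself: it assumes, as described above, that the header line is skipped and the schema's field names are applied by position. The names csv_text and schema_names are made up for illustration.

```r
# The file header says lon comes first ...
csv_text <- "lon,lat\n4.35,50.85"

# ... but the (hypothetical) schema lists lat first.
schema_names <- c("lat", "lon")

# The header line is consumed and the schema names are applied by position.
# Both columns are numeric, so nothing complains.
df <- read.csv(text = csv_text, header = TRUE, col.names = schema_names)

df$lat  # holds the value the file labelled as lon
```

Because both columns parse as numbers, the swap only shows up if someone compares the header with the schema.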
To avoid passing these issues silently, read_resource() should compare the headers of the file with the schema and raise an error if those are not exactly the same. This implements the following spec:
The field descriptor MUST contain a name property. This property SHOULD correspond to the name of field/column in the data file (if it has a name). As such it SHOULD be unique (though it is possible, but very bad practice, for the data file to have multiple columns with the same name). name SHOULD NOT be considered case sensitive in determining uniqueness. However, since it should correspond to the name of the field in the data file it may be important to preserve case.
Implementation considerations:
Only compare when replace_null(dialect$header, TRUE) is TRUE (i.e. header is not set to false). It might be useful to define dialect_header and reuse it here: frictionless-r/R/read_resource.R, line 356 in 421c22f.
The specs say that case should NOT be considered, so both the field names and col_names should be lowercased before comparing.
To allow comparison, the header line of the file should be read separately from the main read_delim(). read_lines() could be used, but delim and encoding/locale might have to be passed too.
A resource can contain multiple files (e.g. observations_1, observations_2). Either all files are read and compared, or only the last one, cf. add_resource():
The last file will be read with readr::read_delim() to create or compare with schema and to set format, mediatype and encoding. The other files are ignored, but are expected to have the same structure and properties.
On a mismatch (different field names, a different order, or more or fewer columns), an error should be returned, similar to check_schema(): frictionless-r/R/check_schema.R, lines 65 to 69 in 421c22f.
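The considerations above could be sketched roughly as follows. This is not the package's implementation: replace_null() is a stand-in written from how it is used above, read_header() and check_header() are hypothetical names, the error wording is illustrative, and base R (readLines() and scan()) is used in place of readr::read_lines() to keep the sketch dependency-free.

```r
# Stand-in for the package's internal helper (assumed behaviour).
replace_null <- function(value, replacement) {
  if (is.null(value)) replacement else value
}

# Read only the first line of a file and split it into column names.
read_header <- function(path, delim = ",", encoding = "UTF-8") {
  con <- file(path, open = "r", encoding = encoding)
  on.exit(close(con))
  first_line <- readLines(con, n = 1, warn = FALSE)
  scan(text = first_line, what = character(), sep = delim,
       quote = "\"", quiet = TRUE)
}

# Hypothetical check: compare schema field names with the file header,
# ignoring case, and raise an error on any mismatch.
check_header <- function(path, field_names, dialect = list()) {
  dialect_header <- replace_null(dialect$header, TRUE)
  if (!isTRUE(dialect_header)) {
    return(invisible(TRUE))  # no header line, nothing to compare
  }
  col_names <- read_header(path, delim = replace_null(dialect$delimiter, ","))
  if (!identical(tolower(field_names), tolower(col_names))) {
    stop(
      "Field names in the schema must match column names in the data:\n",
      "Field names: ", paste(field_names, collapse = ", "), "\n",
      "Column names: ", paste(col_names, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Usage: same names in a different case pass, a different order fails.
csv <- tempfile(fileext = ".csv")
writeLines(c("id,Lon,Lat", "1,4.35,50.85"), csv)
check_header(csv, c("id", "lon", "lat"))
result <- tryCatch(
  check_header(csv, c("id", "lat", "lon")),
  error = function(e) conditionMessage(e)
)
```

identical() on the lowercased vectors catches renamed, reordered, missing and extra columns in one comparison, at the cost of a less specific error than separate checks would give.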
What about multipart resources, should all parts of the resource be checked? Or just the first/last one?
For multipart resources, will the files always either all have a header, or none of them? Or is it possible that, for example, only the first file has a header?
Naming
What would be a good argument name to toggle this comparison/check?
check_header = TRUE
compare_header = TRUE
check_fields = TRUE
Default behavior
I assume that read_resource() should by default not compare the header and the schema?
Multipart resources: to increase performance (especially when reading from a URL) I'd be fine with only the last file being read.
A header or not is defined at resource level, meaning all files should comply.
I would not add a parameter in read_resource, but always include this check. It is a recommended part of the specs: This property SHOULD correspond ...
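A rough sketch of that preference, assuming a resource is a list with a path element and an optional dialect; header_check_target() is a hypothetical name, not an existing function:

```r
# Hypothetical helper: decide which file of a (possibly multipart) resource
# should have its header compared with the schema.
header_check_target <- function(resource) {
  # Header presence is defined at resource level and defaults to TRUE.
  has_header <- resource$dialect$header
  if (is.null(has_header)) has_header <- TRUE
  if (!isTRUE(has_header)) {
    return(NULL)  # no header line in any file, so nothing to check
  }
  paths <- resource$path
  paths[[length(paths)]]  # only the last file is read
}

resource <- list(path = c("observations_1.csv", "observations_2.csv"))
header_check_target(resource)
```

Returning NULL when there is no header lets the caller skip the check without an extra argument on read_resource().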
Add a section validation to explain what we validate: