new "Direct Data Download and Ingestion" section #501

wibeasley · 2023-02-18T21:33:13Z

consolidate all the scattered advice about reading a csv/json/xml from a url
update all the advice involving read*() functions and TLS/SSL urls. A lot of it still says that base R functions (e.g., read.csv()) cannot handle an https url
there are a lot of common packages that accept https that aren't listed. Off the top of my head: data.table, readr, arrow.
File types for direct ingestion:
- maybe refer html to the web scraping section in the task view.
- maybe yaml
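
To illustrate the point about base R, here is a minimal sketch of reading a csv straight from a URL with read.csv(). So that it runs offline, it writes a small made-up file and reads it back through a file:// URL as a stand-in; with a real dataset the same call would take an https:// URL. The names csv_path and csv_url and the sample rows are illustrative only, and https_url in the comments is a placeholder.

```r
# Write a tiny made-up CSV to a temp file so the sketch runs without a network.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,label", "1,alpha", "2,beta"), csv_path)

# Build a file:// URL as a stand-in for an https:// URL.
csv_url <- paste0("file://", normalizePath(csv_path, winslash = "/"))

# Base R: read.csv() takes the URL string directly, no download step needed.
ds <- read.csv(csv_url, stringsAsFactors = FALSE)
print(ds)

# With an https URL, the packages named above take it in the same position
# (https_url is a placeholder, not a real address):
# ds <- readr::read_csv(https_url)
# ds <- data.table::fread(https_url)
# ds <- arrow::read_csv_arrow(https_url)
```

If a file needs to be kept locally, download.file() into a temp path followed by the usual reader is the two-step alternative.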



pachadotdev · 2023-02-19T18:52:56Z

@wibeasley hi! do you prefer yaml?


wibeasley · 2023-02-19T20:49:51Z

Yes, I typically prefer yaml if (a) the data has a nested or non-rectangular structure and (b) the file is entered or edited by a human. I tend to use json for machine-generated datasets.

But there are some tabular/rectangular files that I have started expressing as yaml because they're easier to read & adjust. A small downside is that it requires a little more work (for the ingesting code) to verify the yaml politely transforms to a data.frame.

Here's an example of a tabular structure that I felt was a better fit for yaml than csv: https://github.com/OuhscBbmc/REDCapR/blob/main/inst/misc/validation-transformation.yml
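
A rough sketch of the "little more work" mentioned above: checking that a parsed yaml list is rectangular before binding it into a data.frame. The helper records_to_df() is hypothetical (not part of the yaml package or REDCapR), the field names are made up, and it assumes the file is a sequence of flat key/value records.

```r
# Hypothetical helper: verify a list of records is rectangular, then bind it.
records_to_df <- function(records) {
  stopifnot(is.list(records), length(records) > 0L)
  fields <- names(records[[1]])
  for (i in seq_along(records)) {
    rec <- records[[i]]
    # Every record must carry the same field names as the first one...
    if (!setequal(names(rec), fields)) {
      stop("record ", i, " has unexpected or missing fields")
    }
    # ...and every field must hold exactly one value (no nulls, no nesting).
    if (!all(lengths(rec) == 1L)) {
      stop("record ", i, " has a null or non-scalar value")
    }
  }
  # One single-row data.frame per record, columns in a consistent order.
  rows <- lapply(records, function(rec) {
    as.data.frame(rec[fields], stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Demo with made-up yaml; runs only if the yaml package is installed.
if (requireNamespace("yaml", quietly = TRUE)) {
  yaml_text <- "
- name: integer
  type: number
  required: true
- name: phone
  type: text
  required: false
"
  ds <- records_to_df(yaml::yaml.load(yaml_text))
  print(ds)
}
```

Failing loudly on a ragged record is a design choice; a more forgiving version could fill missing fields with NA instead.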

I don't do it much, but the yaml package can load a file from an https url:

# Parse the yaml file directly from its https url; the result is a nested list.
yaml::yaml.load_file(
  "https://raw.githubusercontent.com/OuhscBbmc/REDCapR/main/inst/misc/validation-transformation.yml"
)

Since we already have bullets for csv, xml, html, & json, I thought yaml could be included for completeness. But as always, I'm happy following your lead. Tell me if you think tangents like this are more distracting than helpful.

Are there scenarios where you do/don't format a data file as yaml?


wibeasley self-assigned this Feb 18, 2023

wibeasley added a commit that referenced this issue Feb 18, 2023

start direct ingest & download section

a1d163d

ref #501

wibeasley added a commit that referenced this issue Feb 18, 2023

arrow

cabc13b

ref #501

wibeasley added a commit that referenced this issue Feb 18, 2023

consolidate web scraping

34345ea

ref #501

wibeasley mentioned this issue Feb 19, 2023

high-level organization #502

Closed

wibeasley added a commit that referenced this issue Feb 19, 2023

distinguish from web services

8a95f79

ref #501

wibeasley added a commit that referenced this issue Feb 19, 2023

link to utils, instead of treat as a package

13bced5

ref #501

wibeasley added a commit that referenced this issue Feb 19, 2023

include yaml

00ccf87

ref #501

wibeasley added a commit that referenced this issue Feb 19, 2023

small consolidation of nested structures

10c1288

ref #501

wibeasley added a commit that referenced this issue Feb 19, 2023

steer towards databases task view

c13bf53

ref #501

wibeasley mentioned this issue Feb 21, 2023

high level restructuring #504

Merged

pachadotdev pushed a commit that referenced this issue Feb 22, 2023

start direct ingest & download section

77fad6c

ref #501

pachadotdev pushed a commit that referenced this issue Feb 22, 2023

arrow

13b7d6b

ref #501

pachadotdev pushed a commit that referenced this issue Feb 22, 2023

consolidate web scraping

de54b1a

ref #501

pachadotdev pushed a commit that referenced this issue Feb 22, 2023

distinguish from web services

3a3aee7

ref #501

pachadotdev pushed a commit that referenced this issue Feb 22, 2023

link to utils, instead of treat as a package

63d8d98

ref #501

pachadotdev pushed a commit that referenced this issue Feb 22, 2023

include yaml

ffe637c

ref #501

pachadotdev pushed a commit that referenced this issue Feb 22, 2023

small consolidation of nested structures

34d8140

ref #501

pachadotdev pushed a commit that referenced this issue Feb 22, 2023

steer towards databases task view

535e9ed

ref #501
