new "Direct Data Download and Ingestion" section #501

Open
6 tasks done
wibeasley opened this issue Feb 18, 2023 · 2 comments

@wibeasley
Collaborator

wibeasley commented Feb 18, 2023

  • consolidate all the scattered advice about reading a csv/json/xml from a url
  • update all the advice involving read*() functions and TLS/SSL urls. A lot of it still says that base R functions (eg, read.csv()) cannot handle an https url
  • there are a lot of common packages that accept https that aren't listed. Off the top of my head: data.table, readr, arrow.
  • File types for direct ingestion:
    • maybe refer html to the web scraping section in the task view.
    • maybe yaml
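A minimal sketch of the pattern the list above is about, offered as an illustration rather than as the task view's final wording. The helper name read_tabular() is hypothetical. A small csv written to a temporary file stands in for a remote file so the sketch runs offline; on a current R installation the same call should accept an https url in place of the local path. Only base R is used, so none of the packages named above are required.

```r
# Write a tiny csv to a temporary file; it stands in for a remote file
# so that this sketch does not need network access.
path <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,90", "2,85"), path)

# Hypothetical helper: `location` may be a local path or, on a current
# R installation, an http(s) url passed as a plain character string.
read_tabular <- function(location) {
  utils::read.csv(location, stringsAsFactors = FALSE)
}

ds <- read_tabular(path)
str(ds)

# Clean up the temporary file.
unlink(path)
```

Swapping `path` for an https address (for example, a raw file address on GitHub) is the only change needed to read directly from the web.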
@wibeasley wibeasley self-assigned this Feb 18, 2023
wibeasley added a commit that referenced this issue Feb 18, 2023
wibeasley added a commit that referenced this issue Feb 18, 2023
wibeasley added a commit that referenced this issue Feb 18, 2023
wibeasley added a commit that referenced this issue Feb 19, 2023
wibeasley added a commit that referenced this issue Feb 19, 2023
@pachadotdev
Collaborator

@wibeasley hi! do you prefer yaml?

wibeasley added a commit that referenced this issue Feb 19, 2023
wibeasley added a commit that referenced this issue Feb 19, 2023
wibeasley added a commit that referenced this issue Feb 19, 2023
@wibeasley
Collaborator Author

Yes, I typically prefer yaml if (a) the data has a nested or non-rectangular structure and (b) the file is human-entered/edited. I tend to use json for machine-generated datasets.

But there are some tabular/rectangular files that I have started expressing as yaml because they're easier to read & adjust. A small downside is that it requires a little more work (for the ingesting code) to verify the yaml politely transforms to a data.frame.
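To make that extra verification work concrete, here is a rough sketch that assumes the yaml package is installed. The inline yaml text, the field names (name, value), and the object names are all made up for illustration; they are not taken from any real file. The idea is to check that every record carries the same fields before binding the records into a data.frame.

```r
# Illustrative yaml: a sequence of records that is meant to be tabular.
yaml_text <- "
- name: a
  value: 1
- name: b
  value: 2
"

# Parse the text into a list with one named list per record.
records <- yaml::yaml.load(yaml_text)

# Verify that each record has exactly the expected fields, in order.
expected_fields <- c("name", "value")
is_well_formed <- vapply(
  records,
  function(r) identical(names(r), expected_fields),
  logical(1)
)
stopifnot(all(is_well_formed))

# Bind the records row-wise into a data.frame.
ds_yaml <- do.call(
  rbind,
  lapply(records, function(r) as.data.frame(r, stringsAsFactors = FALSE))
)
ds_yaml
```

A stricter check could also assert column types, since a hand-edited file can mix numbers and strings in the same field.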

Here's an example of a tabular structure that I felt was a better fit for yaml than csv: https://github.com/OuhscBbmc/REDCapR/blob/main/inst/misc/validation-transformation.yml

I don't do it much, but the yaml package can load a file from an https url:

yaml::yaml.load_file(
  "https://raw.githubusercontent.com/OuhscBbmc/REDCapR/main/inst/misc/validation-transformation.yml"
)

Since we already have bullets for csv, xml, html, & json, I thought yaml could be included for completeness. But as always, I'm happy following your lead. Tell me if you think tangents like this are more distracting than helpful.

Are there scenarios where you do/don't format a data file as yaml?

pachadotdev pushed a commit that referenced this issue Feb 22, 2023
pachadotdev pushed a commit that referenced this issue Feb 22, 2023
pachadotdev pushed a commit that referenced this issue Feb 22, 2023
pachadotdev pushed a commit that referenced this issue Feb 22, 2023
pachadotdev pushed a commit that referenced this issue Feb 22, 2023
pachadotdev pushed a commit that referenced this issue Feb 22, 2023
pachadotdev pushed a commit that referenced this issue Feb 22, 2023