Extract R code from R Markdown HTML file #1811

Open
stevecondylios opened this issue Feb 11, 2020 · 3 comments · May be fixed by #1812
Labels
feature Feature requests

Comments

@stevecondylios

There appears to be no fast and easy way to extract the R code from HTML files generated via R Markdown.

Example

Max and Davis's applied-ml workshop is a good example.

We can easily get the R code for 'Part_1.html' because the original .Rmd file is available, so we can call:

library(magrittr)  # provides %>%
knitr::purl("Part_1.Rmd")
readLines("Part_1.R") %>% paste0(collapse="\n\n") %>% cat
# Displays R code...

But we cannot so easily get the R code for parts 2 through 5, as the originating .Rmd is not available.

Possible solution

html_to_r() extracts the R code from R Markdown generated HTML files.

I provide an implementation in a PR.
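
For readers who want a feel for what such a function might do, here is a minimal sketch. It is not the implementation from the PR. It assumes the default `html_document` layout, where source chunks are rendered as `<pre class="r"><code>` and printed output as a `<pre>` with no class; `extract_r_chunks()` is a hypothetical name, and its `inc_out` argument only mimics the option discussed in this issue.

```r
library(xml2)

# Hypothetical helper: pull R source (and optionally output) from knitted HTML.
extract_r_chunks <- function(html_file, inc_out = FALSE) {
  doc <- read_html(html_file)
  # All <pre> elements that wrap a <code> element, in document order
  pre <- xml_find_all(doc, "//pre[code]")
  cls <- xml_attr(pre, "class")
  # Source chunks: class list contains "r" (assumed layout)
  is_src <- !is.na(cls) & grepl("(^|\\s)r(\\s|$)", cls)
  # Output blocks: no class attribute at all (assumed layout)
  is_out <- is.na(cls)
  keep <- if (inc_out) is_src | is_out else is_src
  # xml_text() decodes character entities such as &gt; while reading
  paste(xml_text(pre[keep]), collapse = "\n\n")
}

# Tiny stand-in for a knitted document
html <- '<html><body>
<pre class="r"><code>c(1, 2) %&gt;% rev()</code></pre>
<pre><code>## [1] 2 1</code></pre>
</body></html>'
tmp <- tempfile(fileext = ".html")
writeLines(html, tmp)

cat(extract_r_chunks(tmp), "\n")                  # source only
cat(extract_r_chunks(tmp, inc_out = TRUE), "\n")  # source plus output
```

Documents knitted with other templates or highlighters may use different markup, so the two class tests would need adjusting.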

Using in the applied-ml example

We can now easily retrieve the R code from the .html files, like so:

# from inside applied-ml
dir() %>% grep("Part_{1}.*html", ., value = TRUE) %>% sapply(., html_to_r) -> a
dir() %>% grep("Part_{1}.*html", ., value = TRUE) %>% mapply(html_to_r, inc_out = FALSE, .) -> b

# Spot-check the second file with / without output to ensure it worked as expected
a[[2]] %>% cat # with output
b[[2]] %>% cat # without output

This can be merged if relevant or disregarded if not relevant.

@stevecondylios stevecondylios linked a pull request Feb 11, 2020 that will close this issue
@atusy
Collaborator

atusy commented Feb 11, 2020

IMO, using pandoc makes the code simple and applicable to more formats (e.g., gfm).
What do you think?

# purloc = purl + pandoc
purloc = function(x, output = file.path(".", xfun::with_ext(x, "R")), ...) {
  # xfun::file_ext() returns the extension without the leading dot, so add it back
  input = tempfile(fileext = paste0(".", xfun::file_ext(x)))
  file.copy(x, input)

  knitr::pandoc(input, 'commonmark', ext = 'md')
  
  intermediate_md = xfun::with_ext(input, 'md')
  intermediate_md %>%
    readr::read_lines() %>%
    stringr::str_replace_all("^``` r", '```{r}') %>%
    readr::write_lines(intermediate_md)
  knitr::purl(intermediate_md, output = output, ...)
}
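
The fence-rewriting step can be tried in isolation, without calling pandoc. The following is a minimal sketch, assuming pandoc's commonmark writer emits R chunks with an opening fence of the form "``` r" (as the regex above implies); the sample Markdown is made up for illustration, and base `sub()` stands in for the stringr call.

```r
# Markdown as pandoc is assumed to emit it for an R code block
md <- c("Some text", "", "``` r", "x <- 1 + 1", "```", "", "More text")

# Rewrite the opening fence into a knitr chunk header
md <- sub("^``` r$", "```{r}", md)

md_file <- tempfile(fileext = ".md")
r_file  <- tempfile(fileext = ".R")
writeLines(md, md_file)

# documentation = 0 asks purl() for code only, without chunk header comments
knitr::purl(md_file, output = r_file, documentation = 0, quiet = TRUE)
code <- readLines(r_file)
print(code)
```

Anchoring the pattern with `$` keeps it from also rewriting fences for other languages whose names start with "r".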

@stevecondylios
Author

@atusy that is a great simplification and improvement on the DIY solution in the original.

The purloc naming is also intuitive and makes sense.

Some questions

Do you think the inc_out option is useful? (A quick example of the difference is below.)

# from inside applied-ml root directory
dir() %>% grep("Part_{1}.*html", ., value = TRUE) %>% sapply(., html_to_r) -> a
dir() %>% grep("Part_{1}.*html", ., value = TRUE) %>% mapply(html_to_r, inc_out = FALSE, .) -> b

a[[2]] %>% cat # with output
b[[2]] %>% cat # without output

For me, it's useful, but maybe not for everyone?

Also, do you agree that replacing character entities is useful? I think it is essential (otherwise pipes and some comparison operators will render correctly in HTML but will not be valid R code).

replace_character_entities <- function(char_entity){
  xml2::xml_text(xml2::read_html(paste0("<x>", char_entity, "</x>")))
}

# E.g. 
replace_character_entities("&gt;")
# [1] ">"

This makes a pipe appear as %>% rather than %&gt;%.

I applied this conversion to some test examples, but I cannot be certain it will work under all circumstances (one exception that comes to mind is R code containing a literal &gt;, perhaps in a comment). I think this is rare enough not to cause much concern, though.
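
The literal `&gt;` case can be checked directly. This is a minimal sketch, assuming the HTML writer escaped the ampersand of a literal `&gt;` as `&amp;gt;` (which valid HTML requires); the function is repeated from above so the example is self-contained, and the sample strings are made up.

```r
replace_character_entities <- function(char_entity) {
  xml2::xml_text(xml2::read_html(paste0("<x>", char_entity, "</x>")))
}

# A pipe as it appears in the HTML source
pipe_html <- "%&gt;%"
# An R comment containing a literal "&gt;", as a correct HTML writer would escape it
literal_html <- "# see &amp;gt; in the docs"

pipe_r    <- replace_character_entities(pipe_html)
literal_r <- replace_character_entities(literal_html)

print(pipe_r)
print(literal_r)
```

Under that assumption a single decoding pass restores both the pipe and the literal entity. The literal would only be damaged if the text were decoded a second time, for example by passing text already obtained via `xml2::xml_text()` through the function again.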

@atusy
Collaborator

atusy commented Feb 14, 2020

About inc_out, I think it is relatively less important, because results are expected to be reproducible.
Also, a problem arises when the source Rmd contains code blocks.
They are not output, but html_to_r treats them as output.
If inc_out is really needed, I think they should be commented out in the R script.
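
One way to read that last suggestion, as a minimal sketch: keep the captured text but prefix every line with a comment marker, so the resulting script still runs. `comment_out()` is a hypothetical helper and the `#> ` prefix is an arbitrary choice; neither comes from the PR.

```r
# Hypothetical helper: turn captured output into R comments
comment_out <- function(output, prefix = "#> ") {
  lines <- unlist(strsplit(output, "\n", fixed = TRUE))
  paste0(prefix, lines, collapse = "\n")
}

code   <- "1 + 1"
output <- "[1] 2"

# Interleave code with commented output; the script remains valid R
script <- paste(code, comment_out(output), sep = "\n")
cat(script, "\n")
result <- eval(parse(text = script))
```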

About special characters, we do not have to care, as pandoc takes care of them:

echo "<pre>%&gt;</pre>" | pandoc --from html --to gfm
# ```
# %>
# ```

@cderv cderv linked a pull request Jan 29, 2021 that will close this issue
@cderv cderv added the feature Feature requests label Jan 29, 2021