html_text2 does not replace the line break element within inline elements #351

Open
sigve-berge-hofland opened this issue Mar 23, 2022 · 2 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@sigve-berge-hofland

The html_text2 documentation says: “Roughly speaking, it converts <br /> to "\n"”. However, it seems to do this only when the line break element sits directly inside a block-level element. Line break elements that are children of inline elements (text-level semantics) such as span or em are not replaced by line breaks in the output.

The document in the following example is valid HTML according to the W3C validation service, but the html_text2 output does not match how the text is rendered in a browser.

html <- '
<!DOCTYPE html>
<html lang = "en">
<head>
<meta charset="utf-8">
<title>test</title>
</head>
<body>
<span>line 1<br>line 2</span>
</body>
</html>
'

testthat::test_that("br to newline within inline elements", {
  testthat::expect_equal(
    rvest::html_text2(rvest::read_html(html)),
    "line 1\nline 2"
  )
})
@hadley
Member

hadley commented Nov 15, 2022

Minimal reprex:

library(rvest)

doc <- minimal_html("<p><span>line 1<br>line 2</span></p>")
html_text2(doc)
#> [1] "line 1line 2"

Created on 2022-11-15 with reprex v2.0.2

hadley added the bug (an unexpected problem or unintended behavior) label Nov 15, 2022
@hadley
Member

hadley commented Nov 22, 2022

Looks like I forgot to consider that a <br> might occur in an inline tag.
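
One possible workaround until this is addressed, offered as a sketch rather than anything confirmed in this thread: rewrite each <br> with xml2 before extracting text, then use html_text(), which leaves whitespace untouched. This assumes xml2 is installed alongside rvest and that the uncollapsed output of html_text() is acceptable. The placeholder span is purely illustrative, and the snippet modifies doc in place.

```r
library(rvest)
library(xml2)

doc <- minimal_html("<p><span>line 1<br>line 2</span></p>")

# Find every <br>, however deeply it is nested.
br <- xml_find_all(doc, ".//br")

# Insert a placeholder element holding a literal newline after each <br>,
# then drop the <br> nodes themselves. Both calls mutate `doc` in place.
xml_add_sibling(br, "span", "\n")
xml_remove(br)

# html_text() does not collapse whitespace, so the inserted newline survives.
txt <- html_text(html_element(doc, "span"))
```

Because html_text() skips the whitespace normalisation that html_text2() performs, other runs of whitespace in the source markup will also appear unchanged in the result.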
