html_text2 deletes some spaces between words #372

Open
mayeulk opened this issue May 22, 2023 · 5 comments
Labels: bug (an unexpected problem or unintended behavior)

Comments

@mayeulk

mayeulk commented May 22, 2023

In some cases, html_text2() drops ordinary spaces between words.

The reproducible example follows:

library(rvest)
some_html <- '<p dir="ltr" style="text-align:left;"></p><span style="font-size:0.9375rem;">The sentence starts this way,</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">then</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">spaces</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">disappear</span>'
html_text(read_html(some_html))   # correct: spaces between words are kept
html_text2(read_html(some_html))  # incorrect: spaces between words are dropped

The incorrect result is:
"The sentence starts this way,thenspacesdisappear"

html_text() works correctly, but in most cases I need the extra features of html_text2() (new lines...).

I'm using rvest_1.0.3 and xml2_1.3.3 in R 4.2.2 (Kubuntu 23.04).

(Note: The original html string comes from a rich text area of a Moodle Database activity, see https://docs.moodle.org/402/en/Database_activity; exported from Moodle as a LibreOffice .ods file)

@mayeulk
Author

mayeulk commented May 22, 2023

Interestingly, removing the first empty paragraph allows a correct conversion:

some_html <- '<p style="text-align:left;"></p><span style="font-size:0.9375rem;">The sentence starts this way,</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">then</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">spaces</span><span> </span><span style="font-size:0.9375rem;">disappear</span>'
html_text2(read_html(some_html)) # is not correct

some_html2 <- '<span style="font-size:0.9375rem;">The sentence starts this way,</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">then</span><span style="font-size:0.9375rem;"> </span><span style="font-size:0.9375rem;">spaces</span><span> </span><span style="font-size:0.9375rem;">disappear</span>'
html_text2(read_html(some_html2)) # is correct

@hadley

This comment was marked as outdated.

@hadley added the "reprex" (needs a minimal reproducible example) label on Aug 8, 2023
@mayeulk

This comment was marked as outdated.

@hadley
Member

hadley commented Aug 9, 2023

The attributes don't seem to be necessary to illustrate the problem, leading to this similar reprex:

library(rvest)
some_html <- "<p></p><span>The sentence starts this way,</span><span> </span><span>then</span><span> </span><span>spaces</span><span> </span><span>disappear</span>"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way,thenspacesdisappear"

Created on 2023-08-09 with reprex v2.0.2

And we can make it much easier to see what's going on by adding some newlines:

library(rvest)
some_html <- "
  <p></p>
  <span>The sentence starts this way,</span>
  <span> </span>
  <span>then</span>
  <span> </span>
  <span>spaces</span>
  <span> </span>
  <span>disappear</span>
"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way,thenspacesdisappear"

Created on 2023-08-09 with reprex v2.0.2

The key problem appears to be the early closing of the <p> tag. When I fix that the problem goes away:

library(rvest)
some_html <- "
<p>
  <span>The sentence starts this way,</span>
  <span> </span>
  <span>then</span>
  <span> </span>
  <span>spaces</span>
  <span> </span>
  <span>disappear</span>
</p>
"
html_text2(read_html(some_html))
#> [1] "The sentence starts this way, then spaces disappear"

Created on 2023-08-09 with reprex v2.0.2

Looking at the code, it seems like the problem probably arises if you have inline elements following a block element. Fixing that will require some careful thought.

@hadley added the "bug" (an unexpected problem or unintended behavior) label and removed the "reprex" (needs a minimal reproducible example) label on Aug 9, 2023
@tinygreen

> Looking at the code, it seems like the problem probably arises if you have inline elements following a block element. Fixing that will require some careful thought.

The problem arises if there are inline and block elements mixed on the same level, regardless of which comes first. Then is_inline() returns false and the elements are parsed as if they were all block elements. This means that all elements are passed to collapse_whitespace() individually. Without the block element, is_inline() returns true and the contents are passed to html_text_inline(), which correctly collapses whitespace after pasting the inline elements together. However, as html_text_inline() ignores <br> tags inside inline elements (issue #351), that function should be changed too.
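The difference between the two code paths can be modelled in base R without rvest. This is a simplified sketch, not rvest's actual code: `collapse_ws()` is a hypothetical helper that squeezes runs of whitespace and trims both ends, which is only assumed to resemble what collapse_whitespace() does.

```r
# Hypothetical stand-in for rvest's collapse_whitespace(): squeeze runs of
# whitespace to a single space, then trim both ends (an assumption).
collapse_ws <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x))
}

# Text content of the sibling <span> elements from the reprex.
pieces <- c("The sentence starts this way,", " ", "then", " ",
            "spaces", " ", "disappear")

# Block-style path: every piece is collapsed on its own, so the
# whitespace-only spans become "" before the pieces are joined.
per_piece <- paste0(collapse_ws(pieces), collapse = "")

# Inline-style path: join the pieces first, then collapse once.
joined_first <- collapse_ws(paste0(pieces, collapse = ""))

print(per_piece)
#> [1] "The sentence starts this way,thenspacesdisappear"
print(joined_first)
#> [1] "The sentence starts this way, then spaces disappear"
```

In this model the spaces are lost only because the whitespace-only spans are trimmed before they are joined to their neighbours.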

An element could also contain several block elements with text nodes and inline elements in between. In that case, all non-block nodes between two block nodes should be passed together through collapse_whitespace() before being added to the text buffer.
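The grouping described above could look roughly like this in base R. It is a toy model under stated assumptions, not a patch for rvest: nodes are rows of a data frame with a `text` and a `block` flag, and `collapse_ws()` and `text_by_runs()` are hypothetical names. How the resulting pieces are joined with newlines is left out.

```r
# Hypothetical stand-in for collapse_whitespace() (an assumption).
collapse_ws <- function(x) trimws(gsub("[[:space:]]+", " ", x))

# Collapse each maximal run of non-block nodes as one unit;
# block nodes are collapsed individually.
text_by_runs <- function(nodes) {
  runs   <- rle(nodes$block)            # runs of block / non-block nodes
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  out <- character()
  for (i in seq_along(runs$values)) {
    idx <- starts[i]:ends[i]
    if (runs$values[i]) {
      # Block nodes: one text-buffer entry per node.
      out <- c(out, collapse_ws(nodes$text[idx]))
    } else {
      # Non-block run: paste the whole run together, then collapse once.
      out <- c(out, collapse_ws(paste0(nodes$text[idx], collapse = "")))
    }
  }
  out[nzchar(out)]                      # drop entries that ended up empty
}

# The reprex as a toy node list: an empty <p>, then seven <span>s.
nodes <- data.frame(
  text  = c("", "The sentence starts this way,", " ", "then", " ",
            "spaces", " ", "disappear"),
  block = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

result <- text_by_runs(nodes)
print(result)
#> [1] "The sentence starts this way, then spaces disappear"
```

Using rle() keeps the grouping independent of whether the block element comes first, last, or in the middle of the siblings.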

3 participants