Table markdown syntax incorrect in some cases #599

Closed
naktinis opened this issue May 17, 2024 · 2 comments · Fixed by #601
Labels
bug Something isn't working

Comments

@naktinis
Contributor

This report actually covers several issues with Markdown table rendering. Let me know if you'd like me to split it into separate issues.

The table structures below are taken from Wikipedia pages.

Version tested: 1.9.0

I'll summarize the issues here, but see below for full examples and explanations:

  • <br> gets deleted but should probably be replaced with a space instead
  • A newline is sometimes added before |
  • | is completely missing after some rows
  • Rendering breaks when colspan is used; this could be handled by adding extra empty cells (append | to every line as many times as the row with the largest colspan requires)
  • <p> in a cell inserts a newline
  • List items in a cell disappear completely, but could be turned into plain text
  • ---| is inserted in the middle of the table (it should appear only once, after the first row)
  • <wbr> is not handled (it should probably be replaced with an empty string)
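Several of the symptoms above (a missing trailing |, a | stranded on its own line, a repeated ---| separator) can be checked mechanically. Below is a minimal sketch of such a check, assuming the simplified pipe-table layout used in this issue (every row ends with |, the separator appears only on line 2). The function name table_problems is hypothetical and not part of the library.

```python
def table_problems(text):
    """Return a list of problems found in a pipe-table rendering.

    Assumes the layout used in this issue: every line ends with '|'
    and the '---|' separator appears only on the second line.
    """
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # A separator line consists only of '-' and '|' characters.
        is_separator = set(line) <= {"-", "|"} and "-" in line
        if line.strip() == "|":
            problems.append(f"line {number}: stray '|' on its own line")
        elif not line.rstrip().endswith("|"):
            problems.append(f"line {number}: missing trailing '|'")
        if is_separator and number != 2:
            problems.append(f"line {number}: separator outside line 2")
    return problems


# First lines of the actual output from Example 2.
actual = (
    "Khruangbin\n"
    "---|\n"
    "Khruangbin performing at the 2019 Haldern Pop Festival\n"
    "Background information |\n"
    "---|"
)
for problem in table_problems(actual):
    print(problem)
```

A check like this could serve as a regression test on the extractor output, independent of how the rendering itself is fixed.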

Extract call used:

from trafilatura import extract
extract(html_plants, no_fallback=True, output_format='txt', include_tables=True)

Example 1

HTML:

<html>
  <table>
    <tbody>
    <tr><th colspan="2">Plants<br>Temporal range: Mesoproterozoic–present Pha. Proterozoic Archean Had.</th></tr>
    <tr><td colspan="2">Angiosperm Desmid Moss Glaucophyta Charophyta Rhodophyta Fern Spirotaenia</td></tr>
    <tr><th colspan="2">Scientific classification</th></tr>
    <tr><td>Domain:</td><td>Eukaryota</td></tr>
    <tr><td>Clade:</td><td>Diaphoretickes</td></tr>
    <tr><td>(unranked):</td> <td>Archaeplastida</td></tr>
    <tr><td>Kingdom:</td><td>Plantae<br>H.F.Copel., 1956</td></tr>
    <tr><th colspan="2">Superdivisions</th></tr>
    <tr><td colspan="2"> <p><i>see text</i> </p> </td></tr>
    <tr><th colspan="2">Synonyms </th></tr>
    <tr><td colspan="2"> <ul><li>Viridiplantae <small>Cavalier-Smith 1981</small><sup>[1]</sup></li> <li>Chlorobionta <small>Jeffrey 1982, emend. Bremer 1985, emend. Lewis and McCourt 2004</small><sup>[2]</sup></li></ul> </td></tr>
    </tbody>
 </table>
</html>

Output:

PlantsTemporal range: Mesoproterozoic–present Pha. Proterozoic Archean Had.
|
---|
Angiosperm Desmid Moss Glaucophyta Charophyta Rhodophyta Fern Spirotaenia |
Scientific classification |
---|
Domain: | Eukaryota |
Clade: | Diaphoretickes |
(unranked): | Archaeplastida |
Kingdom: | PlantaeH.F.Copel., 1956
|
Superdivisions |
---|
see text
|
Synonyms |
---|
|

Expected output:

Plants Temporal range: Mesoproterozoic–present Pha. Proterozoic Archean Had. ||
---|---|
Angiosperm Desmid Moss Glaucophyta Charophyta Rhodophyta Fern Spirotaenia ||
Scientific classification ||
Domain: | Eukaryota |
Clade: | Diaphoretickes |
(unranked): | Archaeplastida |
Kingdom: | Plantae H.F.Copel., 1956 |
Superdivisions ||
see text ||
Synonyms ||
Viridiplantae Cavalier-Smith 1981[1] Chlorobionta Jeffrey 1982, emend. Bremer 1985, emend. Lewis and McCourt 2004[2] ||

Issues:

  • <br> gets deleted but should probably be replaced with a space instead
  • A newline is added before | in the first line
  • Rendering breaks when colspan is used; this could be handled by adding extra empty cells (append | to every line as many times as the row with the largest colspan requires)
  • <p> in a cell inserts a newline
  • List items in a cell disappear completely, but could be turned into plain text
  • ---| is inserted in the middle of the table (it should appear only once, after the first row)
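The colspan suggestion above can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: rows are assumed to be already parsed into (text, colspan) pairs, and render_table is a hypothetical name. Each spanning cell is followed by empty cells, every row is padded to the widest row, and the separator is written once after the first row.

```python
def render_table(rows):
    """Render rows of (text, colspan) cells as a pipe table.

    Hypothetical helper: pads every row to the widest row and
    emits the '---|' separator exactly once, after the first row.
    """
    # Total columns = the largest sum of colspans found in any row.
    width = max(sum(span for _, span in row) for row in rows)
    lines = []
    for index, row in enumerate(rows):
        cells = []
        for text, span in row:
            # A spanning cell becomes its text followed by empty cells.
            cells.append(text)
            cells.extend([""] * (span - 1))
        # Pad short rows so every line has the same number of cells.
        cells.extend([""] * (width - len(cells)))
        # Non-empty cells render as ' text |', empty cells as '|'.
        lines.append("".join(f" {c} |" if c else "|" for c in cells).strip())
        if index == 0:
            lines.append("---|" * width)
    return "\n".join(lines)


rows = [
    [("Plants", 2)],
    [("Domain:", 1), ("Eukaryota", 1)],
    [("Superdivisions", 2)],
]
print(render_table(rows))
# Plants ||
# ---|---|
# Domain: | Eukaryota |
# Superdivisions ||
```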

Example 2

HTML:

<html>
  <table>
  <tbody>
  <tr><th colspan="2"><div>Khruangbin</div></th></tr>
  <tr><td colspan="2"><span></span><div>Khruangbin performing at the 2019 <a>Haldern Pop Festival</a></div></td></tr>
  <tr><th colspan="2">Background information</th></tr>
  <tr><th>Origin</th><td><a>Houston</a>, <a>Texas</a>, United States</td></tr>
  <tr><th>Genres</th><td><div>
  <ul><li><a>Psychedelic rock</a><sup><a>[1]</a></sup></li>
  <li><a>surf rock</a></li>
  <li><a>funk</a></li>
  <li><a>instrumental rock</a></li>
  <li><a>dub</a></li>
  <li><a>rock</a></li></ul>
  </div></td></tr>
  <tr><th scope="row"><span>Years active</span></th><td>2010<span>&nbsp;(<span>2010</span>)</span>–present</td></tr>
  <tr><th scope="row">Labels</th><td><a>Dead Oceans</a><br><a>Night Time Stories</a></td></tr>
  <tr><th colspan="2"></th></tr>
  <tr><th>Members</th><td>
  <ul><li><a>Laura Lee</a></li>
  <li><a>Mark Speer</a></li>
  <li><a>DJ Johnson</a></li></ul>
  </td></tr>
  <tr><th colspan="2"></th></tr>
  <tr><th>Website</th><td><span><a>khruangbin<wbr>.com</a></span></td></tr>
  </tbody>
  </table>
</html>

Output:

Khruangbin
---|
Khruangbin performing at the 2019 Haldern Pop Festival
Background information |
---|
Origin | Houston, Texas, United States |
---|---|
Genres |
---|
Years active | 2010 (2010)–present |
---|---|
Labels | Dead OceansNight Time Stories
|
---|---|
Members |
---|
Website | khruangbin |
---|---|

Expected output:

Khruangbin ||
---|---|
Khruangbin performing at the 2019 Haldern Pop Festival ||
Background information ||
Origin | Houston, Texas, United States |
Genres | Psychedelic rock[1] surf rock funk instrumental rock dub rock |
Years active | 2010 (2010)–present |
Labels | Dead Oceans Night Time Stories |
Members | Laura Lee Mark Speer DJ Johnson |
Website | khruangbin.com |

Issues:

  • | is completely missing after the first and second rows
  • Lists are not parsed within cells (so the genres and members are missing)
  • <wbr> is not handled in the website field (it should be replaced with an empty string)
  • ---| is inserted in the middle of the table (it should appear only once, after the first row)
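The cell-level fixes proposed above (<br> becomes a space, <wbr> becomes an empty string, list items become plain text) can be sketched with the standard library's html.parser. This is only an illustration of the intended text flattening, not how the library processes cells; CellText, cell_to_text and the SPACE_TAGS set are hypothetical choices.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of one table cell on a single line."""

    # Tags whose boundaries should separate words (an assumed list).
    SPACE_TAGS = {"br", "p", "li", "div"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SPACE_TAGS:
            self.parts.append(" ")
        # <wbr> is only a line-break opportunity, so nothing is emitted.

    def handle_endtag(self, tag):
        if tag in self.SPACE_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def cell_to_text(fragment):
    """Flatten the inner HTML of a cell into single-line plain text."""
    parser = CellText()
    parser.feed(fragment)
    parser.close()
    # Collapse runs of whitespace (including newlines) to one space.
    return " ".join("".join(parser.parts).split())


print(cell_to_text("<a>Dead Oceans</a><br><a>Night Time Stories</a>"))
# Dead Oceans Night Time Stories
print(cell_to_text("<span><a>khruangbin<wbr>.com</a></span>"))
# khruangbin.com
```

Keeping the result on a single line also avoids the stray newline before | seen in the Labels row of the actual output.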
@adbar adbar added the bug Something isn't working label May 17, 2024
@adbar
Owner

adbar commented May 17, 2024

Thanks for the detailed example, this seems to be related to <br> in tables. Nested table structures are difficult to process. I'll leave the issue open for now and see if someone can address it.

@naktinis
Contributor Author

Submitted a PR for this: #601

Some issues still remain:

  • list rendering within cells
  • <wbr> handling
  • <p><i>see text</i> </p> disappears when used with include_formatting=True
