I am trying to convert this page to XML: https://pve.proxmox.com/pve-docs/
The problem shows up especially in this part:
<table class="tableblock frame-all grid-all" style=" width:100%; ">
  <col style="width:50%;">
  <col style="width:50%;">
  <thead>
    <tr>
      <th class="tableblock halign-left valign-top">Format </th>
      <th class="tableblock halign-left valign-top">Link</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td class="tableblock halign-left valign-top"> <p class="tableblock">Printable version</p> </td>
      <td class="tableblock halign-left valign-top"> <p class="tableblock"> <a href="pve-admin-guide.pdf">pve-admin-guide.pdf</a> </p> </td>
    </tr>
    <tr>
      <td class="tableblock halign-left valign-top"> <p class="tableblock">Online HTML version</p> </td>
      <td class="tableblock halign-left valign-top"> <p class="tableblock"> <a href="pve-admin-guide.html">pve-admin-guide.html</a> </p> </td>
    </tr>
    <tr>
      <td class="tableblock halign-left valign-top"> <p class="tableblock">E-Book version</p> </td>
      <td class="tableblock halign-left valign-top"> <p class="tableblock"> <a href="pve-admin-guide.epub">pve-admin-guide.epub</a> </p> </td>
    </tr>
  </tbody>
</table>
The table is returned without the links (the second cell of each body row is dropped entirely):
<table>
  <row>
    <cell role="head">Format</cell>
    <cell role="head">Link</cell>
  </row>
  <row>
    <cell> <p>Printable version</p> </cell>
  </row>
  <row>
    <cell> <p>Online HTML version</p> </cell>
  </row>
  <row>
    <cell> <p>E-Book version</p> </cell>
  </row>
</table>
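For anyone wanting to check their own output for the same symptom, here is a minimal sketch that parses the XML shown above with the standard library and lists the body rows that have fewer cells than the header row. The helper name `rows_missing_cells` and the `OUTPUT` constant are made up for this illustration and are not part of trafilatura.

```python
import xml.etree.ElementTree as ET

# XML copied from the report above (whitespace condensed).
OUTPUT = (
    '<table> <row> <cell role="head">Format</cell> '
    '<cell role="head">Link</cell> </row> '
    "<row> <cell> <p>Printable version</p> </cell> </row> "
    "<row> <cell> <p>Online HTML version</p> </cell> </row> "
    "<row> <cell> <p>E-Book version</p> </cell> </row> </table>"
)


def rows_missing_cells(xml_text):
    """Return the text of each body row with fewer cells than the header row.

    Hypothetical helper: assumes the root element is <table> and that the
    first <row> is the header.
    """
    table = ET.fromstring(xml_text)
    rows = table.findall("row")
    expected = len(rows[0].findall("cell"))
    short = []
    for row in rows[1:]:
        if len(row.findall("cell")) < expected:
            # Collapse all whitespace in the row's text content.
            short.append(" ".join("".join(row.itertext()).split()))
    return short


if __name__ == "__main__":
    for text in rows_missing_cells(OUTPUT):
        print("row lost a cell:", text)
```

On the output above, all three body rows are reported, which matches the missing "Link" column.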
Here are my extract parameters (using trafilatura 1.7.0):
trafilatura.extract(downloaded, output_format='xml', include_formatting=True, include_links=True, include_tables=True)
Hi @obeone, indeed. The links were not my original focus and there are a few problems with link extraction.
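Until link extraction inside tables is addressed, one possible stopgap is to pull the table links out of the raw HTML separately. The following is only a sketch using Python's standard `html.parser`, not trafilatura's own method. The names `TableLinkParser`, `extract_table_links`, and `SAMPLE` are hypothetical, and the parser assumes simple, non-nested tables like the one in this report.

```python
from html.parser import HTMLParser

# Shortened version of the table from the report (one body row).
SAMPLE = (
    "<table><thead><tr><th>Format </th><th>Link</th></tr></thead>"
    "<tbody><tr><td><p>Printable version</p></td>"
    '<td><p><a href="pve-admin-guide.pdf">pve-admin-guide.pdf</a></p></td>'
    "</tr></tbody></table>"
)


class TableLinkParser(HTMLParser):
    """Collect table rows as lists of (cell_text, href_or_None) tuples."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None        # cells of the row being read, or None
        self._cell_text = None  # text chunks of the cell being read, or None
        self._cell_href = None  # last href seen inside the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell_text = []
            self._cell_href = None
        elif tag == "a" and self._cell_text is not None:
            # If a cell holds several links, only the last one is kept.
            self._cell_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._cell_text is not None:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell_text is not None:
            text = " ".join("".join(self._cell_text).split())
            self._row.append((text, self._cell_href))
            self._cell_text = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_table_links(html):
    """Return every table row found in `html` as a list of (text, href) cells."""
    parser = TableLinkParser()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for row in extract_table_links(SAMPLE):
        print(row)
```

The hrefs are returned as written in the page (relative here), so they would still need to be resolved against the page URL, for example with `urllib.parse.urljoin`.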