Match links to bookmarks #87

nslpls · 2024-04-08T22:51:37Z

First, a great package and thank you for making it available!

I am looking for a way to match a link to its bookmark. Is that possible in principle? The example below shows what I want to achieve.

I defined an annotation rule for the id attribute. It returns the location and the text of the bookmark, but not the actual id value, so it is not possible to tell which bookmark has been identified.
Also, when the div element has no text or sub-elements, no annotation is returned for it at all.

Would I have to redefine the handler for the div tag, even though I don't know in advance which tags may contain bookmarks?
It also seems that I cannot define custom attribute handlers beyond the three that are already defined. Is that correct?

Many thanks!

from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis
from inscriptis import ParserConfig
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis import get_annotated_text

doc = r"""
<html><body>

<div><a href="#idd1">Part 1</a></div>
<div><a href="#idd2">Part 2</a></div>

<div id="idd1"></div>
<div id="idd2">target with text</div>

</body></html>
"""

annotation_rules = {"a": ["link"], "#id": ["target"]}
css = CSS_PROFILES['relaxed'].copy()
inscriptis_parser_config = ParserConfig(display_links=True, annotation_rules=annotation_rules, css=css)

# Parse the HTML, then convert it to text and collect the annotations.
html_tree = fromstring(doc)
parser = Inscriptis(html_tree, config=inscriptis_parser_config)
txt = parser.get_text()
annotations = parser.get_annotations()

# Print the index, label, start offset, and annotated text of each annotation.
for ii, a in enumerate(annotations):
    print(f"{ii} {a.metadata} {a.start} {txt[a.start:a.end]}")

The output is:

   0 link              3     Part 1](#idd1)
   1 link             21     Part 2](#idd2)
   2 target           36   target with text

In this example, I am looking for the id of the last div element (idd2), as well as the id, location and text of the third div element (idd1), which is empty and currently produces no annotation.
(Note also that the annotated span of each link doesn't include the opening [.)
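
For illustration, here is a minimal sketch of the link-to-bookmark matching itself, done outside inscriptis with Python's standard-library html.parser. It is a possible workaround under stated assumptions, not inscriptis functionality: the names BookmarkCollector and match_links_to_bookmarks are hypothetical, and the sketch does not give the offsets of the bookmarks in the inscriptis text output. It only pairs each in-page link with the element whose id it points to, including elements that have no text.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class BookmarkCollector(HTMLParser):
    """Hypothetical helper: collects in-page links and elements with an id."""

    def __init__(self):
        super().__init__()
        self.links = []    # [target id, link text] for each href="#..." link
        self.targets = {}  # id -> {"tag": ..., "text": ...}
        self._open = []    # stack of (tag, id or None, link record or None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        link = None
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith("#"):
            link = [href[1:], ""]
            self.links.append(link)
        elem_id = attrs.get("id")
        if elem_id is not None:
            # Registered even if the element turns out to be empty.
            self.targets[elem_id] = {"tag": tag, "text": ""}
        if tag not in VOID_TAGS:
            self._open.append((tag, elem_id, link))

    def handle_endtag(self, tag):
        # Pop up to and including the nearest open element with this tag.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Credit the text to every enclosing bookmark and in-page link.
        for _tag, elem_id, link in self._open:
            if elem_id is not None:
                self.targets[elem_id]["text"] += data
            if link is not None:
                link[1] += data


def match_links_to_bookmarks(html):
    """Return (link text, target id, target info or None) per in-page link."""
    collector = BookmarkCollector()
    collector.feed(html)
    collector.close()
    return [(text.strip(), target_id, collector.targets.get(target_id))
            for target_id, text in collector.links]


doc = """
<html><body>
<div><a href="#idd1">Part 1</a></div>
<div><a href="#idd2">Part 2</a></div>
<div id="idd1"></div>
<div id="idd2">target with text</div>
</body></html>
"""

matches = match_links_to_bookmarks(doc)
for link_text, target_id, target in matches:
    print(link_text, "->", target_id, target)
```

A link whose target id does not occur in the document is reported with None as its target, so dangling links can be detected as well.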
