Match links to bookmarks #87

Open
nslpls opened this issue Apr 8, 2024 · 0 comments

First, a great package and thank you for making it available!

I am looking for a way to match a link to its bookmark, i.e. the element whose id the link's href points to. Is that possible in principle? Below is an example of what I want to achieve.

I defined an annotation rule for the id attribute. It returns the location and the text of the bookmark, but not the actual id value, so it's not possible to know which bookmark has been identified.
Also, when the div element has no text or sub-elements, no annotation is returned at all.

Would I have to redefine the div tag handler (even though I don't know in advance which tags may contain bookmarks)?
Also, it seems that I cannot define any custom attribute handlers beyond the three already defined. Is that correct?

Many thanks!

from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis
from inscriptis import ParserConfig
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis import get_annotated_text

doc = r"""
<html><body>

<div><a href="#idd1">Part 1</a></div>
<div><a href="#idd2">Part 2</a></div>

<div id="idd1"></div>
<div id="idd2">target with text</div>

</body></html>
"""

annotation_rules = {"a": ["link"], "#id": ["target"]}
css = CSS_PROFILES['relaxed'].copy()
inscriptis_parser_config = ParserConfig(display_links=True, annotation_rules=annotation_rules, css=css)

html_tree = fromstring(doc)
parser = Inscriptis(html_tree, config=inscriptis_parser_config)
txt = parser.get_text()
ant = parser.get_annotations()
labels = [(a.start, a.end, a.metadata) for a in ant]

for ii, ant in enumerate(labels):
    print(f"{ii} {ant[2]} {ant[0]} {txt[ant[0]:ant[1]]}")

The output is:

   0 link              3     Part 1](#idd1)
   1 link             21     Part 2](#idd2)
   2 target           36   target with text

In this example, I would like to get the id of the last div element (idd2), as well as the id, location and text of the third div element (idd1, which is empty and currently produces no annotation).
(Note also that the annotated text of each link doesn't include the opening [.)
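
To make the desired pairing concrete, here is a minimal sketch that builds the link-to-bookmark mapping with lxml alone. It does not use inscriptis and does not give positions in the converted text. It only shows which id each link should resolve to, including the empty div. The function name `match_links_to_bookmarks` is a hypothetical helper made up for this illustration.

```python
from lxml.html import fromstring

doc = """
<html><body>
<div><a href="#idd1">Part 1</a></div>
<div><a href="#idd2">Part 2</a></div>
<div id="idd1"></div>
<div id="idd2">target with text</div>
</body></html>
"""


def match_links_to_bookmarks(html):
    """Hypothetical helper: pair each in-page link with the element it targets."""
    tree = fromstring(html)
    # Index every element that carries an id attribute, including empty ones.
    targets = {el.get("id"): el for el in tree.xpath("//*[@id]")}
    matches = []
    # Only consider in-page links, i.e. hrefs that start with '#'.
    for link in tree.xpath("//a[starts-with(@href, '#')]"):
        target_id = link.get("href")[1:]  # strip the leading '#'
        target = targets.get(target_id)
        matches.append((
            link.text_content(),
            target_id,
            target.tag if target is not None else None,
            target.text_content() if target is not None else None,
        ))
    return matches


matches = match_links_to_bookmarks(doc)
for link_text, target_id, target_tag, target_text in matches:
    print(f"{link_text!r} -> #{target_id} <{target_tag}> {target_text!r}")
```

What I am missing is a way to get the same id information attached to the inscriptis annotations, so that the pairs can be located in the converted text.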
