Wrong links position in text from telegram post #585

Open
RedHotUnicorn opened this issue May 4, 2024 · 2 comments
Labels
question Further information is requested

Comments

@RedHotUnicorn

Hello!

I found a peculiarity when extracting text from Telegram posts.
An example post is linked here.

The text of the post is:

Interestingly, most people voted for Ukrainian to be the subject of my new post. So here you go — below is everything you wanted to know about the relationship of
...

In the HTML file, the link sits inline inside a div:

<div>
Interestingly, most people <a href="https://t.me/durov/271" target="_blank" rel="noopener" onclick="return confirm('Open this link?\n\n'+this.href);">voted</a> for Ukrainian to be the subject of my new post. So here you go — below is everything you wanted to know about the relationship of
...
</div>

But after extraction, the Markdown text contains two \n before each link:

Interestingly, most people

voted for Ukrainian to be the subject of my new post. So here you go — below is everything you wanted to know about the relationship of
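For comparison, here is a minimal sketch of the output I would expect, where the link stays inline in the sentence. It uses only the standard library html.parser, not Trafilatura, and the InlineLinkRenderer class and render function are hypothetical names made up for this illustration. The snippet is a shortened copy of the div above (sentence truncated, onclick attribute dropped).

```python
from html.parser import HTMLParser


class InlineLinkRenderer(HTMLParser):
    """Hypothetical helper: renders <a> elements as inline Markdown links."""

    def __init__(self):
        super().__init__()
        self.parts = []      # output fragments, in document order
        self.href = None     # href of the <a> currently open, if any
        self.link_text = []  # text collected inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href", "")
            self.link_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            # Emit the link in place, without starting a new paragraph.
            self.parts.append(f"[{''.join(self.link_text)}]({self.href})")
            self.href = None

    def handle_data(self, data):
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.parts.append(data)


def render(html):
    parser = InlineLinkRenderer()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so source line breaks do not leak into the output.
    return " ".join("".join(parser.parts).split())


snippet = (
    '<div>\nInterestingly, most people '
    '<a href="https://t.me/durov/271" target="_blank" rel="noopener">voted</a> '
    'for Ukrainian to be the subject of my new post.\n</div>'
)
result = render(snippet)
print(result)
```

The point is only that an a element is inline content, so the text before and after it should end up in the same paragraph as the link.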

The extraction code:

import trafilatura

# Embedded view of the Telegram post
url = "https://t.me/durov/272?embed=1&mode=tme"
html = trafilatura.fetch_url(url)

sent = trafilatura.extract(
    html,
    output_format='markdown',
    include_images=True,
    # favor_precision=True,
    include_formatting=True,
    include_links=True,
)

print(sent)

I tried the favor_precision option, but it removes formatting and links.

@RedHotUnicorn
Author

Probably a duplicate of #21.

@adbar adbar added the question Further information is requested label May 6, 2024
@adbar
Owner

adbar commented May 6, 2024

Hi, I've never tried using Trafilatura on Telegram posts. I need to check what's going on.
