Extracting content from a URL returns None #586

Open
Fabiha15 opened this issue May 5, 2024 · 1 comment
Labels
question Further information is requested

Comments

Fabiha15 commented May 5, 2024

import requests
from main_content_extractor import MainContentExtractor

# Fetch the page and force the body to be decoded as UTF-8
url = "https://testing.nbnhchurch.org/"
response = requests.get(url)
response.encoding = 'utf-8'
content = response.text

# Run the extractor twice: once with the default output, once as Markdown
extracted_html = MainContentExtractor.extract(content)
extracted_markdown = MainContentExtractor.extract(content, output_format="markdown")
print("Extracted content:", extracted_markdown)

This is my code to extract content from a web page through its URL, but for some URLs I get the following warnings and no content:
WARNING:trafilatura.core:discarding data: None
WARNING:trafilatura.core:discarding data: None
Extracted content: None

adbar added the question (Further information is requested) label May 6, 2024
adbar (Owner) commented May 6, 2024

I assume this is related to a relatively rare combination: a homepage with no main text and no paragraphs (the text sits in div elements). This could be an occasion to improve the baseline extraction, but without precise text markers and boundaries, finding the right text elements is difficult.
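
For pages like this, one possible workaround is to fall back to collecting all visible text whenever the extractor returns None. The sketch below uses only the Python standard library and is not how trafilatura or main_content_extractor work internally. The names `VisibleTextParser`, `all_visible_text` and `extract_with_fallback` are hypothetical helpers introduced for illustration. It assumes that losing boilerplate removal (menus and footers will be included) is acceptable when nothing else is found.

```python
from html.parser import HTMLParser
from typing import Callable, List, Optional

# Tags whose contents are never visible page text.
SKIPPED_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    """Collect text from every element, including bare <div>s without <p>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks: List[str] = []
        self._skip_depth = 0  # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalize whitespace and drop empty fragments.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def all_visible_text(html: str) -> str:
    """Return every visible text fragment, one per line."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_with_fallback(
    html: str, extractor: Callable[[str], Optional[str]]
) -> str:
    """Try the real extractor first; fall back to raw visible text."""
    result = extractor(html)
    if result:
        return result
    return all_visible_text(html)
```

With the code from the original post, the extractor argument could be `lambda html: MainContentExtractor.extract(html, output_format="markdown")`. The fallback output is plain text rather than Markdown, so callers may want to handle the two cases separately.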
