[Feature] Some sites block scraping content without javascript. #6447

Closed
sherlcok314159 opened this issue May 10, 2024 · 5 comments
@sherlcok314159

Some sites cannot be scraped without JavaScript. I tried several user agents, such as curl/8.21, and all of them failed.

Site: https://rsshub.app/zhubai/posts/havefun

@Alkarex
Member

Alkarex commented May 11, 2024

You can try https://github.com/lwthiker/curl-impersonate/ , which sometimes helps.
Otherwise you will need a more sophisticated system.
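
For anyone who wants to test this from a script, here is a minimal sketch of calling curl-impersonate from Python. It assumes one of the project's wrapper scripts is installed and on the PATH; the name `curl_chrome116` is an assumption that depends on the installed release, and `build_command` and `fetch` are hypothetical helper names, not part of curl-impersonate or FreshRSS.

```python
import shutil
import subprocess

# Assumption: the wrapper script name varies between curl-impersonate releases.
IMPERSONATE_BINARY = "curl_chrome116"


def build_command(url, binary=IMPERSONATE_BINARY, timeout_s=30):
    """Return the argument list for one fetch (standard curl flags only)."""
    return [
        binary,
        "--silent",       # no progress meter
        "--show-error",   # still print errors despite --silent
        "--location",     # follow redirects
        "--max-time", str(timeout_s),
        url,
    ]


def fetch(url, binary=IMPERSONATE_BINARY):
    """Run the wrapper and return the response body as text."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    result = subprocess.run(
        build_command(url, binary),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Calling `fetch("https://rsshub.app/zhubai/posts/havefun")` and inspecting the result is a quick manual check. If the page really requires JavaScript to render, impersonating a browser's TLS and header fingerprint will probably not be enough.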

@sherlcok314159
Author

Thanks. But how can I combine this with FreshRSS?

@Alkarex
Member

Alkarex commented May 12, 2024

A typical way is to use a system such as RSS Bridge, which outputs an RSS feed that FreshRSS can consume.
But the first step is to find an approach that works manually.
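
As a rough illustration of the feed-producing side (this is not RSS Bridge itself, which is a separate PHP application), the sketch below builds a minimal RSS 2.0 document with Python's standard library. The function name `build_rss` and the item dictionary keys are hypothetical; the idea is only that whatever fetches the page ends by emitting a feed at a URL FreshRSS can subscribe to.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_rss(title, link, description, items):
    """Build a minimal RSS 2.0 feed and return it as UTF-8 bytes.

    `items` is a list of dicts with "title" and "link" keys and an
    optional timezone-aware "published" datetime (hypothetical schema).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        # Reuse the link as a stable identifier so readers can deduplicate.
        ET.SubElement(node, "guid").text = item["link"]
        published = item.get("published")
        if published is not None:
            # RSS 2.0 expects RFC 822 style dates.
            ET.SubElement(node, "pubDate").text = format_datetime(published)

    return ET.tostring(rss, encoding="utf-8", xml_declaration=True)
```

The returned bytes can be written to a file served by any web server, and that URL can then be added to FreshRSS like any other feed.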

@squromiv

squromiv commented May 22, 2024

> Some sites can not be scraped without javascript

Try the feedless tool. It can help in some cases.

@sherlcok314159
Author

Thanks for the replies above. My solution is to use a local headless browser, driven from Python, to handle this. It is quite lightweight.
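
The comment does not say which library was used. A minimal sketch of the general approach, assuming Playwright for Python (`pip install playwright`, then `playwright install chromium`), could look like the following. `render_page` and `extract_links` are hypothetical helper names, and collecting `<a>` links is only a placeholder for whatever extraction a given site needs.

```python
from html.parser import HTMLParser


def render_page(url, timeout_ms=30000):
    """Load `url` in headless Chromium and return the HTML after scripts ran."""
    # Imported lazily so the parsing helper below works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits until network activity settles, which is a
            # simple heuristic for "the JavaScript has finished rendering".
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


class _LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, link text) tuples found in `html`."""
    collector = _LinkCollector()
    collector.feed(html)
    return collector.links
```

A typical use would be `extract_links(render_page(url))`, with the resulting pairs turned into feed items and published as RSS for FreshRSS to poll.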


3 participants