-
-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Absolutify base href #240
Comments
We can consider this def base_url
current_base_href = ['/',nil,''].any?('base_href.to_s.strip) ? nil : base_href
current_base_href || url
end Please share your thoughts |
No, I don't think a base href of "/" should be treated as an empty one. It means different things: if empty, it need to be ignored, but if it says "/", the document author is trying to say that relative links should be built from the root directory. For example: Let's say there's a page http://example.com/some/dir/first.html and it has a link: <a href="second.html">Second page</a> When there is no base href (or it is empty and we ignore it), this relative link will be absolutified as http://example.com/some/dir/second.html Instead, if the base href is If the base href was |
Yeah. It makes sense. def base_url
current_base_href = base_href.to_s.strip.empty? ? nil : URL.absolutify(base_href, URL.new(url).root_url)
current_base_href || url
end |
@jaimeiniesta Please check this PR. Time being I have overridden this method in my project to fix the failure. |
#240 Absolutified relative base href Looks good to me!
Some pages like https://www.delta.com/us/en have a relative base href tag:
This makes the scraping fail because we expect it to be an absolute URL.
To fix this, we should also absolutify this base href with the url of the scraped page. If the base href was already an absolute one, it won't get changed.
The text was updated successfully, but these errors were encountered: