Absolutify base href #240

Open
jaimeiniesta opened this issue Oct 23, 2018 · 4 comments

@jaimeiniesta
Owner

Some pages like https://www.delta.com/us/en have a relative base href tag:

<base href="/">

This makes the scraping fail because we expect it to be an absolute URL.

To fix this, we should also absolutify this base href with the url of the scraped page. If the base href was already an absolute one, it won't get changed.
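A minimal sketch of that idea, using only Ruby's standard `URI` library rather than MetaInspector's own URL class. The helper name `absolutify_base_href` and the `cdn.example.com` address are assumptions made up for illustration; they are not part of the gem.

```ruby
require "uri"

# Hypothetical helper (not MetaInspector's API): resolve a <base href> value
# against the URL of the page it was found on.
def absolutify_base_href(page_url, base_href)
  href = base_href.to_s.strip
  return nil if href.empty? # no usable base href, nothing to absolutify

  # URI.join resolves a relative reference against the page URL and
  # returns an already-absolute reference unchanged.
  URI.join(page_url, href).to_s
end

# Relative base href, as on the page from this issue
puts absolutify_base_href("https://www.delta.com/us/en", "/")
# prints "https://www.delta.com/"

# Absolute base href (made-up host) is left as it was
puts absolutify_base_href("https://www.delta.com/us/en", "https://cdn.example.com/assets/")
# prints "https://cdn.example.com/assets/"
```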

@navarasu
Contributor

navarasu commented Nov 6, 2020

We could treat this "/" like an empty base href and handle it the same way as we did here.
Something like this:

def base_url
  # Treat a missing, blank, or "/" base href as absent and fall back to the page URL
  current_base_href = ['/', ''].include?(base_href.to_s.strip) ? nil : base_href
  current_base_href || url
end

Please share your thoughts

@jaimeiniesta
Owner Author

No, I don't think a base href of "/" should be treated as an empty one. They mean different things: if it is empty, it needs to be ignored, but if it says "/", the document author is saying that relative links should be built from the root directory. For example:

Let's say there's a page http://example.com/some/dir/first.html and it has a link:

<a href="second.html">Second page</a>

When there is no base href (or it is empty and we ignore it), this relative link will be absolutified as http://example.com/some/dir/second.html

Instead, if the base href is / it should be treated as if it was http://example.com/ so the absolutified link would be http://example.com/second.html

If the base href was /other/ (note the trailing slash) then the absolutified link would be http://example.com/other/second.html
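The cases above can be checked with a short sketch built on Ruby's standard `URI.join`, which follows RFC 3986 reference resolution. The helper name `absolutify_link` is hypothetical and only illustrates the expected behaviour; it is not how MetaInspector implements it. Under RFC 3986 the last path segment of the base is replaced when a relative link is resolved, so a directory-style base href needs its trailing slash to be kept in the result.

```ruby
require "uri"

# Hypothetical helper: resolve a relative link the way the comment describes.
# An empty or missing base href is ignored; any other value is first made
# absolute against the page URL and then used as the base for the link.
def absolutify_link(page_url, base_href, link)
  href = base_href.to_s.strip
  base = href.empty? ? page_url : URI.join(page_url, href).to_s
  URI.join(base, link).to_s
end

page = "http://example.com/some/dir/first.html"

puts absolutify_link(page, nil, "second.html")
# prints "http://example.com/some/dir/second.html"

puts absolutify_link(page, "/", "second.html")
# prints "http://example.com/second.html"

puts absolutify_link(page, "/other/", "second.html")
# prints "http://example.com/other/second.html"
```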

@navarasu
Contributor

navarasu commented Nov 6, 2020

Yeah, that makes sense.
Then I think the changes below will solve this.

def base_url
  current_base_href = base_href.to_s.strip.empty? ? nil : URL.absolutify(base_href, URL.new(url).root_url)
  current_base_href || url
end

navarasu added a commit to navarasu/metainspector that referenced this issue Nov 7, 2020
@navarasu
Contributor

navarasu commented Nov 9, 2020

@jaimeiniesta Please check this PR. For the time being, I have overridden this method in my project to fix the failure.

jschwindt added a commit that referenced this issue Jan 16, 2021
#240 Absolutified relative base href
Looks good to me!