Absolutify base href #240

jaimeiniesta · 2018-10-23T12:44:07Z

Some pages like https://www.delta.com/us/en have a relative base href tag:

<base href="/">

This makes the scraping fail because we expect it to be an absolute URL.

To fix this, we should also absolutify the base href against the URL of the scraped page. If the base href is already absolute, it won't be changed.

navarasu · 2020-11-06T10:32:56Z

We could treat this "/" like an empty base href and handle it the same way as we did here.
Something like this:

def base_url
  # Treat a blank base href or a bare "/" as if no base href were present
  current_base_href = ['/', ''].include?(base_href.to_s.strip) ? nil : base_href
  current_base_href || url
end

Please share your thoughts.

jaimeiniesta · 2020-11-06T11:49:03Z

No, I don't think a base href of "/" should be treated as an empty one. They mean different things: if it is empty, it needs to be ignored, but if it says "/", the document author is saying that relative links should be built from the root directory. For example:

Let's say there's a page http://example.com/some/dir/first.html and it has a link:

<a href="second.html">Second page</a>

When there is no base href (or it is empty and we ignore it), this relative link will be absolutified as http://example.com/some/dir/second.html

Instead, if the base href is / it should be treated as if it was http://example.com/ so the absolutified link would be http://example.com/second.html

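If the base href were /other/ (with a trailing slash), then the absolutified link would be http://example.com/other/second.html. Without the trailing slash, standard URL resolution drops the last path segment of the base, so the link would resolve to http://example.com/second.html.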
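The cases above can be checked with Ruby's standard `URI.join`, which follows RFC 3986 resolution. This is only a sketch: `absolutify_link` is a hypothetical helper written for illustration, not a MetaInspector method, and it uses `/other/` with a trailing slash because the last path segment of a base is dropped during resolution.

```ruby
require "uri"

# Hypothetical helper (not part of MetaInspector): resolve a link against the
# page URL, honoring an optional <base href> value.
def absolutify_link(page_url, href, base_href = nil)
  cleaned = base_href.to_s.strip
  base = page_url
  # A blank base href is ignored; anything else is resolved against the page URL.
  base = URI.join(page_url, cleaned).to_s unless cleaned.empty?
  URI.join(base, href).to_s
end

page = "http://example.com/some/dir/first.html"

puts absolutify_link(page, "second.html")             # no base href
puts absolutify_link(page, "second.html", "")         # empty base href, ignored
puts absolutify_link(page, "second.html", "/")        # base href is the root
puts absolutify_link(page, "second.html", "/other/")  # base href is a directory
```

The first two calls print http://example.com/some/dir/second.html, the third prints http://example.com/second.html, and the last prints http://example.com/other/second.html.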

navarasu · 2020-11-06T13:16:25Z

Yeah, that makes sense.
Then I think the changes below will solve this.

def base_url
  # Ignore a blank base href; otherwise absolutify it against the page's root URL
  current_base_href = base_href.to_s.strip.empty? ? nil : URL.absolutify(base_href, URL.new(url).root_url)
  current_base_href || url
end
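For readers without the MetaInspector internals at hand, the same idea can be sketched with only the standard library. `effective_base_url` is a hypothetical stand-in for `base_url`, with the page URL and base href passed in explicitly. One assumption to note: this sketch resolves against the full page URL rather than its root URL. The two give the same result for a root-relative base href such as "/".

```ruby
require "uri"

# Hypothetical stand-in for the base_url method discussed in this thread.
# page_url and base_href are explicit arguments so the sketch runs on its own.
def effective_base_url(page_url, base_href)
  cleaned = base_href.to_s.strip
  # Missing or blank base href: fall back to the page URL.
  return page_url if cleaned.empty?

  # URI.join returns an absolute base href unchanged and resolves a relative one.
  URI.join(page_url, cleaned).to_s
end

page = "https://www.delta.com/us/en"

puts effective_base_url(page, nil)                               # falls back to the page URL
puts effective_base_url(page, "/")                               # resolved to the site root
puts effective_base_url(page, "https://cdn.example.com/assets/") # already absolute
```

The example host `cdn.example.com` is made up for illustration. With the Delta page from the original report, a base href of "/" yields https://www.delta.com/, which can then be used to absolutify the page's relative links.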

navarasu · 2020-11-09T19:22:28Z

@jaimeiniesta Please check this PR. For the time being, I have overridden this method in my project to fix the failure.

#240 Absolutified relative base href Looks good to me!

navarasu added a commit to navarasu/metainspector that referenced this issue Nov 7, 2020

jaimeiniesta#240 Absolutified relative base href

dab3cd4

navarasu mentioned this issue Nov 7, 2020

#240 Absolutified relative base href #275

Merged

jschwindt added a commit that referenced this issue Jan 16, 2021

Merge pull request #275 from navarasu/relative_base_href

966aa6d
