MetaInspector is unable to scrape a particular blog post #298

Open
inmar-mohan opened this issue May 17, 2022 · 1 comment

@inmar-mohan

Hi Team,

Thanks for this awesome gem. We have been using it for years to scrape blog posts. MetaInspector scrapes almost all of them, but I am facing an issue with one particular post.

When scraping this post on my local machine, I get the expected results for title, best_description, and images.best. However, the same code does not work in the production environment (deployed on an AWS EC2 instance).

ISSUE DETAILS:
MetaInspector gem version: 5.4.0
Rails version: 5.2.6

MetaInspector returns the expected results on my local machine:

def scrape(url)
  @page = MetaInspector.new(url,
    :connection_timeout => 5, :read_timeout => 5,
    :headers => { 'User-Agent' => user_agent, 'Accept-Encoding' => 'identity' },
    :faraday_options => { :ssl => { :verify => false } },
    :html_content_only => true)
end

url = "https://www.simplyleb.com/recipe/easy-french-fry-nachos/"
page = scrape(url)
page.title
=> "Easy French Fry Nachos - Simply Lebanese"
page.images.best
=> "https://www.simplyleb.com/wp-content/uploads/Mccain-Fries-9.jpg"
page.images.count
=> 25
page.best_description
=> "Melted cheese, sour cream, chopped tomatoes and all your favorite toppings on frozen French fries for an easy and quick kid-friendly lunch or snack after school."

MetaInspector does not return the expected results from the AWS EC2 instance:

irb(main):012:0> url = "https://www.simplyleb.com/recipe/easy-french-fry-nachos/"
irb(main):013:0> page = scrape(url)
irb(main):014:0> page.images.best
=> nil
irb(main):015:0> page.images.count
=> 0
irb(main):016:0> page.title
=> "StackPath"
irb(main):017:0> page.best_description
=> "www.simplyleb.com is using a security service for protection against online attacks. The service requires full cookie support in order to view this website."

Please notice best_description returned in the above response:
www.simplyleb.com is using a security service for protection against online attacks. The service requires full cookie support in order to view this website.

This looks like an issue with cookies. Do I have to send any extra parameters related to cookies? Could someone suggest what might be wrong? I cannot figure out why it works on my local machine but not from AWS. Any help would be much appreciated. Thanks.

@jaimeiniesta
Owner

Hi, cookies are supported:

https://github.com/metainspector/metainspector/blob/master/lib/meta_inspector/request.rb#L68

It looks like the remote server is blocking requests from your AWS IP address, which is why it works fine from your local machine.
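
If it helps, here is a rough sketch of how the application could notice that it received the security service's interstitial instead of the real post, so the result is not stored as a valid scrape. `challenge_page?` and `CHALLENGE_MARKERS` are hypothetical names, not part of MetaInspector, and the marker phrases are simply taken from the description quoted in this issue; other protection services will likely use different wording.

```ruby
# Hypothetical helper, not part of MetaInspector.
# The phrases below are assumptions based on the blocked response in this issue.
CHALLENGE_MARKERS = [
  /requires full cookie support/i,
  /security service for protection against online attacks/i
].freeze

# Returns true when the scraped title/description look like a
# security interstitial rather than the real page content.
def challenge_page?(title, description)
  text = [title, description].compact.join(" ")
  CHALLENGE_MARKERS.any? { |marker| marker.match?(text) }
end

# Values copied from the two scrapes reported above.
blocked = challenge_page?(
  "StackPath",
  "www.simplyleb.com is using a security service for protection against " \
  "online attacks. The service requires full cookie support in order to view this website."
)
normal = challenge_page?(
  "Easy French Fry Nachos - Simply Lebanese",
  "Melted cheese, sour cream, chopped tomatoes and all your favorite toppings " \
  "on frozen French fries for an easy and quick kid-friendly lunch or snack after school."
)
```

In the application this could be called as `challenge_page?(page.title, page.best_description)` right after scraping. It only detects the block; it does not get around it.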
