[metascraper-amazon] Image selector matches incorrect image #50

Open
agchou opened this issue Jan 18, 2018 · 20 comments

Comments

@agchou

agchou commented Jan 18, 2018

I'm running into an issue where the image value returned by metascraper-amazon is not the main product image. There are actually multiple .a-dynamic-image elements on the page, as seen in the attached screenshot. Can we create some rules that take priority over this, like wrapUrl($ => $('#landingImage').attr('src')) or wrapUrl($ => $('.a-dynamic-image').first().attr('src'))?
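
A minimal sketch of the priority idea, not the package's actual code: try selectors from most to least specific and return the first non-empty `src`. The name `firstImage` and the selector order are hypothetical, and `$` is assumed to be a cheerio-style function loaded with the page HTML.

```javascript
// Hypothetical helper: selectors listed from most to least specific.
const IMAGE_SELECTORS = ['#landingImage', '.a-dynamic-image'];

// `$` is assumed to behave like a cheerio instance loaded with the page HTML.
function firstImage($, selectors = IMAGE_SELECTORS) {
  for (const selector of selectors) {
    const src = $(selector).first().attr('src');
    // Skip selectors that match nothing or match an element with an empty src.
    if (typeof src === 'string' && src.trim() !== '') return src.trim();
  }
  return undefined;
}
```

Something like `wrapUrl($ => firstImage($))` could then replace the single-selector rule, if the package's rule wrapper accepts it.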

[Attached screenshot: screen shot 2018-01-17 at 8 53 27 pm]

@agchou agchou changed the title Image selector matches incorrect image [metascraper-amazon] Image selector matches incorrect image Jan 18, 2018
@Kikobeats
Member

Yeah, of course, just add the right rule here:
https://github.com/microlinkhq/metascraper/blob/master/packages/metascraper-amazon/index.js#L51

Can you specify the URL so we can create a unit test?

@Kikobeats
Member

Hey @agchou, I think you created your own package to support this new custom rule.

Can you share it with us? I want to improve this in the metascraper-amazon package 😄

@Kikobeats
Member

Happy to accept improvements to metascraper-amazon. I'm going to close the issue since it's old. If the package looks outdated to you, please ping me!

@swolidity

Yea @Kikobeats I get just a tiny 1 pixel image every time. How do we go about fixing this? Can the rule be overridden?

@Kikobeats
Member

@andyk2177 we need to add a specific rule to cover that case.

Please, share the URL that is causing this behavior.

We can add a guard in the code to ignore images smaller than N pixels.
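
One way such a guard could look, as a sketch only. It assumes the candidate images come with known dimensions, for example a JSON map of image URL to `[width, height]` read from an attribute such as `data-a-dynamic-image`; that format is an assumption here and should be checked against a real page. `pickLargestImage` and the `minSide` default are hypothetical.

```javascript
// Hypothetical guard: ignore candidates below a minimum size and keep the largest.
// Assumes `json` is a string such as '{"https://example.com/a.jpg":[500,500]}',
// i.e. a map of image URL to [width, height].
function pickLargestImage(json, minSide = 100) {
  let entries;
  try {
    entries = Object.entries(JSON.parse(json));
  } catch (err) {
    return undefined; // missing or invalid JSON, so nothing to choose from
  }
  let best;
  for (const [url, size] of entries) {
    if (!Array.isArray(size)) continue;
    const [width, height] = size;
    // Drops 1x1 tracking pixels and anything else under the threshold.
    if (!(width >= minSide && height >= minSide)) continue;
    if (!best || width * height > best.area) best = { url, area: width * height };
  }
  return best && best.url;
}
```

Returning `undefined` when nothing passes the threshold lets a later, less specific rule take over instead of emitting a tiny image.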

@swolidity

Well, seems to be any Amazon link for me that is doing it but here is an example https://www.amazon.com/JNH-Lifestyles-Canadian-Hemlock-Infrared/dp/B00F2Y5B6W?tag=profiledotim-20

@swolidity

I think we might just need a more specific class name to grab maybe?

@Kikobeats Kikobeats reopened this Aug 10, 2019
@Kikobeats
Member

@andyk2177 yes, you're right. The problem is that Amazon has a lot of different product views; we need to set up the rules in a way that maximizes the chance of getting the proper image.

Can you make a PR? All you need to do is add the specific image selector here.

@swolidity

Ok sure, why are there two selectors though? Which one is prioritized? So for example with my url above I get this image back but the page does have a data-old-hires attribute so not sure why that one wasn't prioritized?

@Kikobeats
Member

The best way to determine that is to add a test for every link and make sure the output is what you expect.
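
A table-driven sketch of that idea, under the assumption that each product page is saved as an HTML fixture next to its expected image URL. `findMismatches` and `extractImage` are hypothetical names; `extractImage(html, url)` stands in for whatever rule set is being tested.

```javascript
// Hypothetical table-driven check: one entry per product URL.
// Each case is { url, html, expected }, where `html` is a saved copy of the page.
function findMismatches(cases, extractImage) {
  const mismatches = [];
  for (const { url, html, expected } of cases) {
    const actual = extractImage(html, url);
    // Record every link whose extracted image differs from the expected one.
    if (actual !== expected) mismatches.push({ url, expected, actual });
  }
  return mismatches;
}
```

An empty result means every link produced the expected image; anything else points at the product view whose selector still needs a rule.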

@bobber205

bobber205 commented Oct 3, 2019

Getting "robot check" for every Amazon product link I've tried -- anyone else seeing this?

Example URL: https://www.amazon.com/dp/B07SY4C5QF/ref=cm_sw_r_tw_apa_i_2qJLDbGGS3H0Q

@swolidity

@bobber205 it's probably because your User-Agent header looks like it is automated and coming from a script ( it is ) but you should be able to set it to anything you want. I'm setting it to a browser like this:

try {
  // Forward the caller's User-Agent so the request looks like it comes from a browser.
  const { body: html } = await got(url, {
    headers: {
      "User-Agent": req.headers["user-agent"]
    }
  });
  data = await metascraper({ url, html });
  statusCode = 200;
} catch (err) {
  statusCode = 401;
  data = {
    message: `Scraping the open graph data from "${url}" failed.`,
    suggestion:
      "Make sure your URL is correct and the webpage has open graph data, meta tags or twitter card data."
  };
}
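
The snippet above forwards whatever User-Agent the incoming request carries, which may be missing when the caller is itself a script. A small sketch of a fallback, assuming an Express-style `req` object; `buildHeaders` and `FALLBACK_USER_AGENT` are hypothetical names, and the string is only an example browser value.

```javascript
// Example browser-like value; any current browser User-Agent string could be used instead.
const FALLBACK_USER_AGENT =
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36';

// Hypothetical helper: prefer the caller's User-Agent, fall back to a fixed one.
function buildHeaders(req) {
  const incoming = req && req.headers && req.headers['user-agent'];
  const hasIncoming = typeof incoming === 'string' && incoming.trim() !== '';
  return { 'User-Agent': hasIncoming ? incoming : FALLBACK_USER_AGENT };
}
```

It would be passed as `got(url, { headers: buildHeaders(req) })`.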

@Kikobeats
Member

@bobber205

What kind of data are you interested in?

It looks like almost all the data is there when using the Microlink API:

https://api.microlink.io/?url=https%3A%2F%2Fwww.amazon.com%2Fdp%2FB07SY4C5QF

@bobber205

Good advice on setting the user agent!

I've set it to

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36

That's what google says is the latest User Agent for Chrome. I don't see "Robot Check" anymore but I do get https://fls-na.amazon.com/1/batch/1/OP/ATVPDKIKX0DER:144-1080801-5689911:ADD1XNCC3BW7K9PR531T$uedata=s:%2Fdp%2FB07SY4C5QF%2Fref%3Dcm_sw_r_tw_apa_i_2qJLDbGGS3H0Q%2Fuedata%2Fnvp%2Funsticky%2F144-1080801-5689911%2FNoPageType%2Fntpoffrw%3Fstaticb%26id%3DADD1XNCC3BW7K9PR531T%26pty%3DDetail%26spty%3DGlance%26pti%3DB0798MSV1F:1000 (a large black image) for the image. :(

@bobber205

@Kikobeats I'm looking for the image mostly. The rest is coming through great once I've set the user agent

@pdesmarais
Copy link

pdesmarais commented Oct 28, 2019

Hey! So, I'm getting everything I need except the product's image from this Amazon URL.

I looked at the selectors that metascraper uses; those elements exist in the HTML but seem to be empty. The actual image that should be extracted doesn't have a class or an id. It can be found within a div that has the "digitalMusicProductImage_feature_div" id.

Example URL: https://www.amazon.de/Vienna-Bolling-Project-»Classic-Jazz«/dp/B003604LHE

Is there anything that can be done about this, @Kikobeats?

Thanks!

@swolidity
Copy link

@pdesmarais perhaps https://microlink.io/docs/mql/getting-started/overview can help us out here?

@bobber205

@pdesmarais Have you tried setting the useProxy init variable to true?

@swolidity

@bobber205 where do you set that? I don't see it in the docs?

@bobber205

ah I was confusing this with the opengraph paid product. Sorry :(
