
not able to correctly load page for homedepot and walmart #705

Open
sentinel3 opened this issue Apr 16, 2022 · 3 comments

Comments

@sentinel3

Hi, great tool! I just set urlwatch up and it works on one test page, but when I explored further and added one product each from Home Depot and Walmart, both failed.

```
$ cat urls.yaml
name: "HomeDepot"
url: "https://www.homedepot.ca/product/hampton-bay-1-person-braided-woven-egg-patio-swing/1001582001"
filter:
  - xpath: //span[@class="hdca-product__description-pricing-price-value"]
  - html2text
---
name: "Walmart"
url: "https://www.walmart.ca/en/ip/hometrends-egg-swing-with-stand-black/6000203713927"
filter:
  - xpath: //span/span[@class="css-2vqe5n esdkp3p0" and @data-automation="buybox-price"]
  - html2text
```

Then I tried to test the filters: `urlwatch --test-filter 1` (Home Depot) and `urlwatch --test-filter 2` (Walmart) both give me an empty result. I then commented out the xpath filter in the above config and ran with verbose output. For the Home Depot job:

```
$ urlwatch --test-filter 1 --verbose   # xpath filter commented out
...
...connectionpool DEBUG: Starting new HTTPS connection (1): www.homedepot.ca:443
```
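As a sanity check independent of urlwatch, an XPath expression can be tried against a saved copy of the page with lxml (a sketch; the HTML snippet below is a stand-in for the real downloaded page):

```python
# Sketch: verify an XPath expression against saved HTML with lxml.
# The snippet stands in for a real page saved to disk (e.g. via the
# browser's "Save page as"); swap it for the actual file contents.
from lxml import html

saved_page = """
<html><body>
  <span class="hdca-product__description-pricing-price-value">$499.00</span>
</body></html>
"""

tree = html.fromstring(saved_page)
matches = tree.xpath('//span[@class="hdca-product__description-pricing-price-value"]')
for el in matches:
    print(el.text_content().strip())
# An empty `matches` list means the XPath does not fit the page markup.
```

If the expression matches the saved page but urlwatch still returns nothing, the server is likely sending different (or JavaScript-rendered) content to urlwatch than to the browser.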

For this Home Depot job, it hangs here! I then followed #575 and added a `headers:` section with a `User-Agent: <redacted>`; after that it runs, but still returns no result.
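For reference, a per-job `headers:` block as suggested in #575 would look roughly like this in `urls.yaml` (the User-Agent string below is only an illustrative placeholder, not the one actually used):

```yaml
name: "HomeDepot"
url: "https://www.homedepot.ca/product/hampton-bay-1-person-braided-woven-egg-patio-swing/1001582001"
headers:
  # Placeholder value; substitute a current browser User-Agent string.
  User-Agent: "Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101 Firefox/99.0"
filter:
  - html2text
```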

Then I tried to test the filter for the Walmart job:

```
$ urlwatch --test-filter 2 --verbose   # xpath filter commented out
Skip to main ...
JavaScript is Disabled
      Sorry, this webpage requires JavaScript to function correctly.
      Please enable JavaScript in your browser and reload the page.
```

I tried the suggestion from #465 and downloaded the Walmart page directly: curl replies that the request is blocked, while wget downloads a page containing:

Are you human?
Seems like a silly question, we know. But, we want to keep robots off of Walmart.ca! 

It seems Walmart blocks this kind of automated watching?

Any help is appreciated, thanks!

@thp
Owner

thp commented Apr 18, 2022

Yes, if the page applies a CAPTCHA to keep automated tools from grabbing the page contents, there's not much we can do. Have you checked whether Walmart or Home Depot provides an API for retrieving pricing information?

Maybe it's possible to use such an API for that purpose:

https://developer.walmart.com

@sentinel3
Author

Thank you for the quick response!
I believe the human detection only applies to the walmart.ca case; the homedepot.ca job stalls on the HTTPS (port 443) connection, and I'm not sure whether that is caused by the same thing.
I previously built some similar small projects using requests-html and BeautifulSoup 4 plus a headless browser, and I did not encounter human/robot detection on most commercial sites, so maybe I will go back and give walmart.ca a try that way.
Anyway, thank you for the suggestion about the Walmart API.

@thp
Owner

thp commented Apr 21, 2022

Have you tried using https://urlwatch.readthedocs.io/en/latest/jobs.html#browser (just change `url:` to `navigate:`), which uses a headless variant of the Chrome browser to load the page? Maybe that is "good enough" to avoid triggering the CAPTCHA.
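A browser job for the Walmart page would look something like this (a sketch: the xpath reuses the `data-automation` attribute from the earlier config, which may have changed since):

```yaml
name: "Walmart"
navigate: "https://www.walmart.ca/en/ip/hometrends-egg-swing-with-stand-black/6000203713927"
filter:
  # Assumes the buybox-price attribute from the earlier config still exists.
  - xpath: //span[@data-automation="buybox-price"]
  - html2text
```

Filters apply to browser jobs the same way as to plain `url:` jobs; only the fetching mechanism changes.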
