
Immoscout: Bot detection/No captcha necessary #302

Open
phi1eas opened this issue Jan 25, 2023 · 40 comments

Comments

@phi1eas

phi1eas commented Jan 25, 2023

Hi,

I am trying to run flathunter on immoscout24 using imagetyperz. I run into the following issue:

$ pipenv run python3 flathunt.py
[2023/01/25 21:04:20|config.py               |INFO    ]: Using config path /home/max/flathunter/config.yaml
[2023/01/25 21:04:20|chrome_wrapper.py       |INFO    ]: Initializing Chrome WebDriver for crawler...
[2023/01/25 21:04:21|patcher.py              |INFO    ]: patching driver executable /home/max/.local/share/undetected_chromedriver/9418e1b60bf980e1_chromedriver
[2023/01/25 21:04:33|abstract_crawler.py     |INFO    ]: Timeout waiting for iframe element - no captcha verification necessary?
[2023/01/25 21:04:33|crawl_immobilienscout.py|WARNING ]: Unable to find IS24 variable in window
[2023/01/25 21:04:33|crawl_immobilienscout.py|ERROR   ]: IS24 bot detection has identified our script as a bot - we've been blocked

What I think is weird is this: If I do not pass "--headless" as a driver_argument, a Chromium window opens. This window has the immoscout bot detection page loaded. If I copy the URL from that window, and open this URL in a new tab in Chromium, I get the same page, but this time with the Captcha.

Is this because immoscout24 classified me as a bot, or is there something else going on?

This is my config.yaml:

loop:
    active: yes
    sleeping_time: 600

urls:
  - https://www.immobilienscout24.de/Suche/de/berlin/berlin/wohnung-mieten?enteredFrom=one_step_search

filters:

blacklist:
  - Innenstadt

durations:
    - name: John
      destination: Hauptbahnhof, München
      modes: 
          - gm_id: transit
            title: "Öff."
          - gm_id: bicycling
            title: "Rad"
    - name: Jane
      destination: Karlsplatz, München
      modes: 
          - gm_id: transit
            title: "Öff."
          - gm_id: driving
            title: "Auto"

message: |
    {title}
    Zimmer: {rooms}
    Größe: {size}
    Preis: {price}
    Ort: {address}

    {url}

google_maps_api:
    key: YOUR_API_KEY
    url: https://maps.googleapis.com/maps/api/distancematrix/json?origins={origin}&destinations={dest}&mode={mode}&sensor=true&key={key}&arrival_time={arrival}
    enable: False

captcha:
     imagetyperz:
           token: 4B59D2B4CC6B4DE0AFC09D310F77D8CE
#       2captcha:
#             api_key: alskdjaskldjfklj
     driver_arguments:
       - "--no-sandbox"
       - "--disable-gpu"
       - "--remote-debugging-port=9222"
       - "--disable-dev-shm-usage"
       - "window-size=1024,768"

notifiers:
    - telegram
#     - mattermost
#     - apprise

telegram:
  bot_token: (censored)
  notify_with_images: true
  receiver_ids:
      - (censored)

Thank you so much!

@codders

codders commented Jan 25, 2023

Hi @phi1eas ,

I've definitely had the same experience as you before: the URL in the chrome-driver frame gets detected, but the same URL in a normal browser window works fine. We used to have that regularly before we switched to the undetected-chromedriver library, but it's a cat-and-mouse game, and of course IS24 is always trying to improve their detection. You can see in #296 and #272 that you're not the only one hitting this. Unfortunately, it seems a bit random which users / setups get detected and which don't.

@ozeidan made a comment in #272 that they have been working on a solution based on an undetected-chromedriver-provided docker image. That might be something to look at if you want to look deeper into how to develop a long-term fix for this. But your setup looks fine, and pretty similar to mine - I don't think there's a problem there.

What I can recommend, if you are just doing a simple search in Berlin, is to use the hosted version at https://flathunter.codders.io . You can log in there with your Telegram account and set up a basic filter, and you will get messages about new flats in Berlin - no setup required from your side, and the Immoscout crawling is working at time of writing.

The blacklist, google_maps_api and durations sections from your config can safely be deleted if you're not using those features - I don't know how that made it to the sample config.

Hope that helps!

@phi1eas
Author

phi1eas commented Jan 25, 2023

Thank you so much for your quick and helpful response! I will look into your references and try to contribute where I can.

All the best!

@phi1eas
Author

phi1eas commented Jan 25, 2023

Just to make sure I'm not missing something: Running flathunter without --headless driver argument, I get this site:

[Screenshot from 2023-01-25 23-28-17]

Now if I copy the link and open a new tab in the same window, I get this site with a captcha:

[Screenshot from 2023-01-25 23-28-34]

Doesn't this mean that there must be some different information passed by the browser if I manually open the link, as opposed to opening it within flathunter? Maybe we could use that?

Thanks again!

@codders

codders commented Jan 26, 2023

Yeah, I mean, obviously there must be a difference somewhere. The tricky part is working out where. You could try to spy on the traffic between the browsers and immoscout to see how the requests differ, but it might also be that some JavaScript is running in the page after it loads to decide whether or not to show the captcha. It could be about the position of your mouse, or the size of the window, or pretty much any property of the application (browsers running JavaScript give away a lot of clues).

But the fact that you can reproduce it, and that you have a good case and a bad case on the same machine, is already a solid start for investigating.
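A minimal sketch of how such a good-case/bad-case comparison could be organised, assuming you have already collected request headers or JavaScript-visible properties from both browsers yourself (for example from the DevTools network tab). The function name `diff_properties` and all sample values are made up for illustration; nothing here is part of flathunter.

```python
def diff_properties(good: dict, bad: dict) -> dict:
    """Return {name: (good_value, bad_value)} for every property that differs.

    A property missing on one side is reported with None for that side.
    """
    differences = {}
    for name in sorted(set(good) | set(bad)):
        good_value = good.get(name)
        bad_value = bad.get(name)
        if good_value != bad_value:
            differences[name] = (good_value, bad_value)
    return differences


# Made-up sample values, for illustration only.
manual_browser = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/109.0.0.0",
    "accept-language": "de-DE,de;q=0.9",
    "navigator.webdriver": False,
}
automated_browser = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) Chrome/109.0.0.0",
    "navigator.webdriver": True,
}

for name, (good_value, bad_value) in diff_properties(
    manual_browser, automated_browser
).items():
    print(f"{name}: manual={good_value!r} automated={bad_value!r}")
```

Only the properties that differ are printed, which keeps the list of candidates short. Whether any of them is what the detection actually looks at is still something to test one property at a time.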

@codders

codders commented Jan 26, 2023

Ah. I should also say. The code we use to launch the window also blocks the GeeTest API call (I think the captcha is powered by GeeTest). We do this so that we can request the Captcha from Python without re-using the same captcha token twice. So that is obviously one difference between the automated browser and the manual browser. You can try disabling that (https://github.com/flathunters/flathunter/blob/main/flathunter/chrome_wrapper.py#L46) and see if that makes a difference. Flathunter won't be able to solve the captcha, but you'll be able to see if that's what's tripping the bot detection.
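For that experiment, here is a rough sketch of how URL blocking can be toggled through the Chrome DevTools Protocol. It is not the code at the linked line: `set_blocked_urls`, `RecordingDriver`, and the GeeTest URL pattern are illustrative assumptions. `execute_cdp_cmd` is the method Selenium's Chrome driver exposes for CDP commands, and `Network.setBlockedURLs` is the CDP command for blocking requests by URL pattern. The stand-in driver only records the calls, so the snippet runs without a browser.

```python
def set_blocked_urls(driver, patterns):
    """Ask Chrome (via CDP) to block requests matching the given URL patterns."""
    driver.execute_cdp_cmd("Network.enable", {})
    driver.execute_cdp_cmd("Network.setBlockedURLs", {"urls": list(patterns)})


class RecordingDriver:
    """Hypothetical stand-in for a Selenium Chrome driver; records CDP calls."""

    def __init__(self):
        self.calls = []

    def execute_cdp_cmd(self, cmd, cmd_args):
        self.calls.append((cmd, cmd_args))
        return {}


# Assumed pattern for the GeeTest API call; check the real value in chrome_wrapper.py.
GEETEST_PATTERN = "https://api.geetest.com/get.php*"

driver = RecordingDriver()
set_blocked_urls(driver, [GEETEST_PATTERN])  # the automated-browser case
set_blocked_urls(driver, [])                 # experiment: no URLs blocked
for cmd, args in driver.calls:
    print(cmd, args)
```

With a real driver, the second call with an empty list should leave nothing blocked, which is the situation the manual browser is in.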

@codders

codders commented Feb 7, 2023

@phi1eas @ozeidan I just bumped the version of undetected-chromedriver to the latest (3.4.x). Maybe you can check if the issue is resolved for you in the latest.

@codders

codders commented Feb 14, 2023

Just merged in #313, which bumps undetected-chromedriver up a version again. Maybe try again and see if that's better?

@23722

23722 commented Feb 22, 2023

I tried the updated version but had no luck. The output remains the same that @phi1eas described.

@conorheins

conorheins commented Feb 28, 2023

I set everything up today (Feb 28, running on Mac OSX 10.14 & sending notifications via Telegram, captchas solved w/ Imagetyperz), and I had the same problem (first without any driver_arguments and then even after adding "--headless").

However, I got it working (no longer detecting me as a bot) after I added the additional driver_arguments suggested by @codders in #296 (see here):

driver_arguments:
            - "--no-sandbox"
            - "--headless"
            - "--disable-gpu"
            - "--remote-debugging-port=9222"
            - "--disable-dev-shm-usage"
            - "window-size=1024,768"

UPDATE: Never mind, I guess it really is somehow stochastic / traffic-dependent? Because now I'm running it and being detected as a bot again (without any change to the config.yaml file) and getting the same output as in @phi1eas's original post.

@codders

codders commented Feb 28, 2023

@conorheins Damn - nice try! Thanks for the updates, and sorry to hear that you're struggling with the bot detection. I don't know if it would help you to turn down the looping frequency. It's really hard to see from here what makes a difference. As far as I can tell, it works okay most of the time for most users, but it's for sure not working for everyone all the time.

@conorheins

Thanks for the quick reply @codders -- good to know, I'll try messing with the looping frequency. To be clear, by that you mean decreasing the count in sleeping_time in the loop field of the config file?

@codders

codders commented Feb 28, 2023

Increasing the sleeping_time, yeah. If it sleeps for longer you're less likely to trigger spam protections.
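In the config from the original post, sleeping_time is a fixed 600 seconds. As a sketch of the same idea taken one step further, the wait could grow after each consecutive block and include some random jitter so requests are less regular. This is an assumption about what might help, not something flathunter does; `next_sleep_seconds` and its parameters are hypothetical.

```python
import random


def next_sleep_seconds(base, consecutive_blocks, cap=6 * 3600, rng=random):
    """Double the wait per consecutive block, cap it, then add up to 10% jitter."""
    delay = min(base * (2 ** consecutive_blocks), cap)
    return delay + rng.uniform(0, delay * 0.1)


# Seeded generator so repeated runs of this example behave the same way.
rng = random.Random(0)
for blocks in range(4):
    print(blocks, round(next_sleep_seconds(600, blocks, rng=rng)))
```

With a base of 600 seconds the waits fall roughly in the ranges 600-660, 1200-1320, 2400-2640 and 4800-5280 seconds for zero to three consecutive blocks.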

@infctr

infctr commented Feb 28, 2023

Is there anything else I could try changing / playing with to make IS24 crawler work in Google Cloud Deployment? It doesn't work for me at all (gets blocked all the time)

@codders

codders commented Feb 28, 2023

@infctr If you've tried everything here, I'm not sure what else. What deployment region are you using in Google Cloud? For me, it's working reliably out of europe-west1 as a scheduled job.

@evgeniipetrov

> Ah. I should also say. The code we use to launch the window also blocks the GeeTest API call (I think the captcha is powered by GeeTest). We do this so that we can request the Captcha from Python without re-using the same captcha token twice. So that is obviously one difference between the automated browser and the manual browser. You can try disabling that (https://github.com/flathunters/flathunter/blob/main/flathunter/chrome_wrapper.py#L46) and see if that makes a difference. Flathunter won't be able to solve the captcha, but you'll be able to see if that's what's tripping the bot detection.

I also face the same issue (local run on a Windows 10 laptop), so I tried commenting out this line. Flathunter still reports "Unable to find IS24 variable in window" and "IS24 bot detection has identified our script as a bot - we've been blocked". In the browser, the "Gleich geht's weiter" ("We'll continue in a moment") page quickly redirects to the "Ich bin kein Roboter" ("I am not a robot") page without a captcha, and then the captcha appears after a second or so. With this line left in (not commented out), the captcha does not appear. So there is indeed some relation, but unfortunately the script can't get past the check either way.

@trendschau

trendschau commented Mar 6, 2023

Hi, I just read here about this problem: I wrote my own script with headless chrome and a php-wrapper for immoscout. I do not crawl the html-version, but the json-url they use for the map. It looks like this: https://www.immobilienscout24.de/Suche/controller/mapResults.go?searchUrl=/Suche/radius/wohnung-mieten?

My crawler gets blocked initially and then periodically after about 20 minutes. The blocker page from above without the captcha shows up then, the captcha is only displayed on the web-version.

I think they do some kind of browser-fingerprinting with the script they load from https://www.immobilienscout24.de/assets/immo-1-17 (I think an antibot-script from distil network?)

However, you can simply open the JSON page in a new incognito window and reload it without solving any captcha, and you will get through. So my workaround right now is very silly: I copy the value of the "reese84" cookie from the incognito window into my script, and then it runs again for about 20 minutes. I think immoscout just does some kind of whitelisting for your browser with the distil script and sets a fresh reese84 cookie when the script does not detect you as a bot. Also, solving a captcha on the web version does not help in this case: you still get blocked on the JSON version, and vice versa. (Test case: if you open the web version with headless chrome (in non-headless mode) and pass the captcha, the data for the map from the JSON URL does not load.)
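A minimal sketch of that workaround in Python (the script described above is PHP, so this is a translation of the idea, not that code). It builds a request for the JSON URL with a manually copied reese84 value in the Cookie header. `build_request` and the placeholder values are hypothetical, the URL is the shortened one quoted earlier, and sending a User-Agent that matches the browser the cookie came from is a guess about what the check might look at.

```python
import urllib.request

# Shortened URL as quoted above; a real search URL carries more parameters.
MAP_RESULTS_URL = (
    "https://www.immobilienscout24.de/Suche/controller/mapResults.go"
    "?searchUrl=/Suche/radius/wohnung-mieten?"
)


def build_request(url, reese84, user_agent):
    """Build a GET request that carries a manually copied reese84 cookie."""
    return urllib.request.Request(
        url,
        headers={
            "Cookie": f"reese84={reese84}",
            "User-Agent": user_agent,
            "Accept": "application/json",
        },
    )


request = build_request(
    MAP_RESULTS_URL,
    reese84="PASTE_COOKIE_VALUE_HERE",
    user_agent="PASTE_USER_AGENT_HERE",
)
print(request.get_header("Cookie"))
# To actually send it: urllib.request.urlopen(request, timeout=30)
```

The request is only built here, not sent. Going by the description above, a copied value would stop working after roughly 20 minutes and would need replacing.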

Anyway, your script works differently I suppose but maybe this info is helpful (or old for you, then sorry for the interruption) ...

@trendschau

So maybe it is a very big misconception on my side, but the idea is that you check on your side whether a fresh cookie in your script solves your problem, and if so (probably not, because there might be some additional IP-range blocking), we could search for a service to automate this (=> send-url-and-return-fresh-cookie API)? I did not find such a service on 2captcha or imagetyperz...

[Screenshot-cookietoken]

@codders

codders commented Mar 7, 2023

Hi @trendschau ,

Thanks for the detailed investigation and information. Is your code up on Github anywhere?

I'm not sure if what you describe relates to the problem that our users encounter or not. Right now, for many users, the captcha solving works "just fine" - I have an instance running on Google Cloud that has been scraping ImmoScout for years without problems using the Flathunter code. I have also noticed the reese84 cookie and I do think it is significant - here is a match on another project that seems to have dug a bit deeper into the problem: Jackiebibili/ticket_tracker_api@272c539 . Maybe @Jackiebibili has some clues for us. I also mentioned in #210 that I think this is related to Imperva bot protection, but I don't have good evidence for that.

It seems like ticket_tracker_api solved this with JS injection - that might be something we could try or investigate.

@trendschau

@codders totally agree, I don't know if it is related to the problem described in this ticket, but you can easily test it by adding a valid reese84 cookie to headless chrome. Since flathunter works fine for all other users, the reason for the blocking page might be totally different, but maybe the solution is similar.

My code is probably not of interest (very basic), but I cleaned it from all captcha-solving parts (not needed anymore) and pushed it to github. I never planned to publish it, so I am sorry for the spaghetti ... I think a super simplistic workaround might be a browser extension in another window, that stores cookies periodically on the file system in combination with a page refresh extension (something like https://github.com/ktty1220/export-cookie-for-puppeteer but without manual action). But I have to stop coding and start searching for a flat now ...

@codders

codders commented Mar 7, 2023

@trendschau Thanks for the tip and for the code! Yes - any hints are welcome to resolve this, and I'll be happy to try this (or even happier if someone else on the thread wants to make a PR). If it fixes the issue for the users that are struggling, it would be an amazing find.

Best of luck with your search!

@trendschau

@codders just to finish this: I found a way to automate the process with two browser extensions. Very dirty, but it seems to work for now, so immobilienscout has some open data there :D Btw, the archive part of their website is completely unprotected as well, although not very helpful for flat searchers. Pushed the code in case it is of interest. Good luck to you all, too!

@BarisYazici

BarisYazici commented Mar 13, 2023

I solved this problem by injecting my cookie into the header of the GET request in abstract_crawler.py. It seems that if you have a valid cookie from one of your logged-in IS24 sessions, you can bypass the robot check. Btw, I have a premium account, so this might be specific to paid users.

I see that @trendschau already pointed out a similar solution

@yanone

yanone commented Apr 15, 2023

So I’ve been playing with this as well, and I noticed that when I got detected as a bot (with no captcha showing, as above), I can log into IS24 in the running Chrome session with my user account (plus), and then the subsequent reloads work fine. Don't know yet for how long. Will report back.

@BarisYazici

@yanone did you try to use the set cookie feature?

@yanone

yanone commented Apr 15, 2023

Yes, I did, and it wouldn’t work, still blocked.
I figure what I did is probably technically identical to setting the cookie in the header, but apparently also not. Let's see for how long this runs, but I don't see a problem with logging into a Selenium session as long as they let me. I remember from another project that Safari won't let you touch it or else it breaks instantly, but with Chrome you can freely interact with the browser, which is nice.

@yanone

yanone commented Apr 15, 2023

> Yes, I did, and it wouldn’t work, still blocked.

And that's probably because the Selenium app is a separate process. The Chrome that I opened manually and got the cookie from and the Chrome that the bot opens are two different instances.
So one probably needs to make sure that the cookie is truly coming from the same instance. And arguably I find logging in easier than extracting the cookie and then – now we’re getting closer – restarting the bot, which restarts the browser instance, too. This all needs to happen in the same process, is my guess.

@BarisYazici

Are you sure you are copying the correct cookie in the correct format? If it seems harder than logging in manually, there is probably something wrong :D
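One way to rule out copy-and-paste mistakes is to extract the value programmatically from the raw Cookie header that DevTools shows, as in this small sketch. `extract_cookie` is a hypothetical helper and the sample header is made up; the exact format flathunter expects for the cookie setting is not stated in this thread.

```python
def extract_cookie(raw_cookie_header, name="reese84"):
    """Return the value of `name` from a raw Cookie header string, or None."""
    header = raw_cookie_header.strip()
    # Tolerate a pasted "Cookie:" prefix.
    if header.lower().startswith("cookie:"):
        header = header[len("cookie:"):]
    for part in header.split(";"):
        # Split on the first "=" only, so "=" padding inside the value survives.
        key, sep, value = part.strip().partition("=")
        if sep and key == name:
            return value
    return None


# Made-up header, for illustration only.
raw = "Cookie: foo=bar; reese84=3:abc+def/ghi==:xyz; other=1"
print(extract_cookie(raw))
```

The value is returned exactly as it appears between `reese84=` and the next `;`, without surrounding whitespace or the cookie name itself.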

@yanone

yanone commented Apr 15, 2023

Update: It ran for about an hour, and now they've logged me out and are showing me the captcha page without a captcha again.

@BarisYazici

an hour is not so bad :D

@matteomartinelli88

Hello wonderful people! I get almost the same error. My coding skills are intermediate/low, so I tried to play a little with the settings.
Is there anything new to fix this? My workaround is to restart the bot manually, but at that point it's the same as refreshing the web page manually.

[2023/07/14 09:49:21|config.py |INFO ]: Using config path C:\Users\asus\flathunter/config.yaml
[2023/07/14 09:49:21|chrome_wrapper.py |INFO ]: Initializing Chrome WebDriver for crawler...
[2023/07/14 09:49:21|init.py |WARNING ]: could not detect version_main.therefore, we are assuming it is chrome 108 or higher
[2023/07/14 09:49:21|init.py |INFO ]: setting properties for headless
[2023/07/14 09:49:33|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary?

@silasburger

I'm seeing that the headed chromium browser isn't setting the same reese84 cookie that I have in my config file. Anyone else able to see that it is being set correctly?

@billDrett

I am also observing the same issue as reported in the ticket when I start the script. I have also tried the reese84 cookie approach, but it still gets detected from the beginning.

@lisardo-iniesta

Does anyone know how to resolve this issue? I've tried everything discussed here, different reese84 cookie values, etc.

@HerzogVonWiesel

Same for me. Would love an update! Getting blocked right out of the gate, even with my normal browser's reese84 cookie...

@ewamal

ewamal commented Oct 25, 2023

Same here, at first it lasted at least a day, now I am getting blocked basically right away

@codders

codders commented Oct 25, 2023

To the commenters who are struggling, it would be great if you could leave some info about your setup - what OS, docker or direct, chromedriver arguments etc.

@jsektkuehler

jsektkuehler commented Oct 31, 2023

+1 my config running on windows docker desktop with reese84 cookie variable set:
captcha:
    2captcha:
        api_key: xxxxxxxxxxxxxxxx
    driver_arguments:
        - "--no-sandbox"
        - "--headless"
        - "--disable-gpu"
        - "--remote-debugging-port=9222"
        - "--disable-dev-shm-usage"
        - "window-size=1024,768"

2023-10-31 09:00:21 [2023/10/31 08:00:21|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary?
2023-10-31 09:00:21 [2023/10/31 08:00:21|immobilienscout.py |WARNING ]: Unable to find IS24 variable in window
2023-10-31 09:00:21 [2023/10/31 08:00:21|immobilienscout.py |ERROR ]: IS24 bot detection has identified our script as a bot - we've been blocked
2023-10-31 09:10:33 [2023/10/31 08:10:33|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary?
2023-10-31 09:10:33 [2023/10/31 08:10:33|immobilienscout.py |WARNING ]: Unable to find IS24 variable in window
2023-10-31 09:10:33 [2023/10/31 08:10:33|immobilienscout.py |ERROR ]: IS24 bot detection has identified our script as a bot - we've been blocked

same in WSL (Windows Subsystem for Linux) environment:
nuc12_ubuntu_sub@JSNUC12WSHi3:/opt/flathunter$ sudo -u flathunter /home/flathunter/.local/bin/pipenv run python flathunt.py
[2023/10/31 11:25:15|config.py |INFO ]: Using config path /opt/flathunter/config.yaml
[2023/10/31 11:25:15|chrome_wrapper.py |INFO ]: Initializing Chrome WebDriver for crawler...
[2023/10/31 11:25:16|patcher.py |INFO ]: patching driver executable /home/flathunter/.local/share/undetected_chromedriver/undetected_chromedriver
[2023/10/31 11:25:17|init.py |INFO ]: setting properties for headless
[2023/10/31 11:25:27|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary?
[2023/10/31 11:25:27|immobilienscout.py |WARNING ]: Unable to find IS24 variable in window
[2023/10/31 11:25:27|immobilienscout.py |ERROR ]: IS24 bot detection has identified our script as a bot - we've been blocked
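When comparing setups, it can help to count how often a run was blocked instead of reading the log by eye. The sketch below parses lines in the `[timestamp|file|LEVEL]: message` shape seen in the logs in this thread and counts the "we've been blocked" errors. The regular expression and helper names are assumptions based only on those pasted lines, not on flathunter's logging configuration.

```python
import re

# Matches "[<timestamp>|<source file>|<level>]: <message>" anywhere in a line,
# so a leading Docker timestamp is tolerated.
LOG_LINE = re.compile(
    r"\[(?P<timestamp>[^|\]]+)\|(?P<source>[^|\]]+)\|(?P<level>[^\]]+)\]: (?P<message>.*)"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


def count_blocks(lines):
    """Count ERROR lines that report the bot-detection block."""
    count = 0
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed["level"] == "ERROR" and "we've been blocked" in parsed["message"]:
            count += 1
    return count


sample = [
    "2023-10-31 09:00:21 [2023/10/31 08:00:21|abstract_crawler.py |INFO ]: Timeout waiting for iframe element - no captcha verification necessary?",
    "2023-10-31 09:00:21 [2023/10/31 08:00:21|immobilienscout.py |WARNING ]: Unable to find IS24 variable in window",
    "2023-10-31 09:00:21 [2023/10/31 08:00:21|immobilienscout.py |ERROR ]: IS24 bot detection has identified our script as a bot - we've been blocked",
]
print(count_blocks(sample))
```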

@saschagehlich

Not a flathunter issue, but I'm working on a project that has the same issues. I noticed that with headless chrome via puppeteer, the browser gets locked out without showing a captcha.

With headed chrome, I was able to bypass bot detection using the paid capsolver.com API (https://www.capsolver.com/blog/The-other-captcha/bypass-imperva-nodejs) and 2captcha for the geetest captcha. Guess I'll just keep running in headed mode for now, although it's probably a bit more resource hungry.

I deciphered the /assets/immo-1-17 script but couldn't figure out what exactly is going on yet. Since it's the only script that's being loaded in the headless lock-out case, the answer has to be in there.

@fmmix

fmmix commented Dec 16, 2023

For me it never worked once with any kind of driver arguments or reese values

@codders

codders commented Jan 17, 2024

Maybe #514 will help some of you...
