Issue with crawling ImmoScout24: window.IS24 property resultList is missing #458

Open
acidassassin opened this issue Aug 30, 2023 · 5 comments

Comments

@acidassassin

Hey there,

I debugged the code for the immoscout24 crawler, and it seems the following script returns "null":
result_json = self.get_driver_force().execute_script('return window.IS24.resultList;')

I've checked in the browser and it looks like the window.IS24.resultList is not there anymore.

Does anyone have a working solution?
Thanks

@acidassassin
Author

Okay it seems like for "gewerbe-flaechen" there is no resultList...

@codders

codders commented Aug 30, 2023

Yeah - there's no resultList, but there is a window.__INITIAL_STATE__ containing all the data you need. You should be able to parse it in much the same way this StackOverflow answer handles it:

https://stackoverflow.com/questions/67203717/beautifulsoup-how-to-get-data-from-window-initial-state

It would probably be possible to extend the Immoscout crawler to check if __INITIAL_STATE__ is present in the result fetched from the server.

Are you a Python developer? Do you want to give that a go?
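A minimal sketch of the idea above, assuming the page HTML is already available as a string. The function name `extract_initial_state` and the sample page are made up for illustration and are not part of the crawler; the real shape of the state object is not shown in this thread.

```python
import json

MARKER = "window.__INITIAL_STATE__"


def extract_initial_state(html: str):
    """Return the object assigned to window.__INITIAL_STATE__, or None if absent."""
    start = html.find(MARKER)
    if start == -1:
        return None
    # Move past the marker to the "=" sign that precedes the JSON value.
    eq = html.find("=", start + len(MARKER))
    if eq == -1:
        return None
    payload = html[eq + 1:].lstrip()
    # raw_decode parses one JSON value and ignores whatever follows it,
    # so the trailing ";" and the rest of the page do not get in the way.
    value, _end = json.JSONDecoder().raw_decode(payload)
    return value


# Made-up sample page, for illustration only.
sample = '<script>window.__INITIAL_STATE__ = {"title": "2 rooms; balcony", "count": 1};</script>'
print(extract_initial_state(sample))
```

A `None` result means the marker is missing, so a caller could use it to decide whether to fall back to another source. If the assigned value is not valid JSON, `raw_decode` raises `json.JSONDecodeError`.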

@acidassassin
Author

Hey @codders

thank you for your reply!
I would call myself more a scripting language developer, but Python is fun. :-)

I've managed to get the __INITIAL_STATE__ as a string, but somehow I'm not able to convert it to a functional dict/JSON. Do you have any advice?

@codders

codders commented Sep 2, 2023

What kind of error do you get? How are you parsing it?
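The error itself is not shown in this thread, but one way to avoid string parsing altogether might be to ask the browser for the object directly. This is only a sketch: it assumes the crawler holds a Selenium-style driver (as the `execute_script` call in the original report suggests), and `read_listing_state` and `FakeDriver` are hypothetical names used here so the example runs without a browser.

```python
def read_listing_state(driver):
    """Return (source_name, data) from whichever page global is populated, else (None, None)."""
    candidates = [
        ("resultList", "return (window.IS24 && window.IS24.resultList) || null;"),
        ("__INITIAL_STATE__", "return window.__INITIAL_STATE__ || null;"),
    ]
    for name, script in candidates:
        data = driver.execute_script(script)
        if data is not None:
            return name, data
    return None, None


class FakeDriver:
    """Stand-in for a real driver; pretends only window.__INITIAL_STATE__ exists."""

    def execute_script(self, script):
        if "__INITIAL_STATE__" in script:
            return {"made_up_key": []}  # placeholder, not real IS24 data
        return None


print(read_listing_state(FakeDriver()))
```

With a real Selenium driver, plain JavaScript objects returned from `execute_script` arrive as Python dicts and lists, so no `json.loads` step should be needed. Whether the actual state object serializes cleanly is an assumption that would have to be checked against the live page.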

@adobryn

adobryn commented Oct 27, 2023

I've tried that:

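import json
import re

logger.info("Trying to get __INITIAL_STATE__")
# Note: if search_url holds the URL rather than the fetched page HTML,
# this pattern can never match and data stays None.
data = re.search(r"window\.__INITIAL_STATE__=(.*?);", search_url)
if data is not None:
    data = data.group(1)
    data = json.loads(data)
    print(json.dumps(data, indent=4))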

but I'm still getting "IS24 bot detection has identified our script as a bot - we've been blocked", so I cannot check whether it really works :D
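The regex itself can be checked offline, without hitting the site. The sketch below runs the same pattern against three made-up snippets (not real IS24 responses) to show where it holds up and where it might not; `try_parse` is a hypothetical helper.

```python
import json
import re

PATTERN = re.compile(r"window\.__INITIAL_STATE__=(.*?);")

# All three snippets are invented for illustration.
simple = 'window.__INITIAL_STATE__={"count": 1};'
tricky = 'window.__INITIAL_STATE__={"title": "2 rooms; balcony"};'
spaced = 'window.__INITIAL_STATE__ = {"count": 1};'


def try_parse(html):
    """Apply the regex and report whether the captured text is valid JSON."""
    match = PATTERN.search(html)
    if match is None:
        return "no match"
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return "invalid JSON"


for snippet in (simple, tricky, spaced):
    print(try_parse(snippet))
```

The first snippet parses, the second is cut off at the semicolon inside the string value, and the third does not match because of the spaces around `=`. Whether the real page triggers either failure is unknown until a response can actually be fetched.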
