Google Maps filter #473

Open · wants to merge 4 commits into main

Conversation

colinemonds

feature #472: filter based on GMaps distance

This PR involves some refactoring. The problem is that previously,
filters ran before making any calls to external APIs. Therefore, just
adding another filter for the distance doesn't actually work: the
distance information is not yet available when we apply the filters. We
can't just run the filters later, because then we would run the Google
Maps API calls before we filtered out any properties, meaning that we
would incur Google Maps API calls for all properties that we find in our
search, including those that we later filter out anyway based on price,
size, etc. - and we actually have to pay for those requests! My solution
is to group the filters in two chains, and then run one chain before and
one after external API calls have been made. This way, we can run the
distance filter after the API calls are made, but keep everything else
the same.
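
To make the split concrete, here is a minimal sketch of the idea. The names (is_interesting, process_expose) are illustrative only and need not match flathunter's actual interfaces.

def hunt(exposes, pre_filters, gmaps_processor, post_filters):
    # Cheap filters first: price, size, rooms, etc. need no external data.
    survivors = [e for e in exposes
                 if all(f.is_interesting(e) for f in pre_filters)]
    # Only the survivors incur (paid) Google Maps API calls.
    enriched = [gmaps_processor.process_expose(e) for e in survivors]
    # Filters that depend on the API data (the distance filter) run afterwards.
    return [e for e in enriched
            if all(f.is_interesting(e) for f in post_filters)]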

Make GMapsDurationProcessor export unformatted data as
well, to enable later steps in the pipeline (such as filters) to
access and process this data.
You can't re-open a NamedTemporaryFile on Windows while it's still open.
@colinemonds
Author

I swear it was green just now. :( brb

@codders

codders commented Sep 13, 2023

Amazing! I love the refactor and the introduction of pre- and post- filters. And it's great that someone is using the Google Maps Distance feature - I was tempted to remove that a while back because I don't know if it was active.

The linter and type-checker seem pretty unhappy, but assuming you can resolve that I'm happy to merge this.

@codders codders left a comment


I'm happy to merge this, pending passing tests, but if you want to look at the stylistic things, that would be most welcome.

@@ -354,6 +349,23 @@ def max_price_per_square(self):
"""Return the configured maximum price per square meter"""
return self._get_filter_config("max_price_per_square")

def max_distance(self) -> List[DistanceConfig] | None:

I wonder if it would be more natural here just to return the empty list if there's nothing configured. Makes the typing judgement simpler.
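
For illustration, the suggestion would look roughly like this (the config key and body are guesses, not the actual code):

def max_distance(self) -> List[DistanceConfig]:
    """Return the configured maximum distance filters (empty list if unconfigured)"""
    return self._get_filter_config("max_distance") or []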

self._append_filter_if_not_empty(MaxRoomsFilter, config.max_rooms())
self._append_filter_if_not_empty(
PPSFilter, config.max_price_per_square())
if filter_chain == FilterChainName.preprocess:

I'm a bit "meh" on if-else chains. Do you think we could do this with Enum and Match?

https://realpython.com/python-enum/#using-enumerations-in-if-and-match-statements
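
Something along these lines (a sketch only; match needs Python 3.10+, and the second member name is a guess):

from enum import Enum, auto

class FilterChainName(Enum):
    preprocess = auto()
    postprocess = auto()

def build_filters(chain_name: FilterChainName):
    match chain_name:
        case FilterChainName.preprocess:
            ...  # append the cheap filters (price, size, rooms, ...)
        case FilterChainName.postprocess:
            ...  # append the filters that need external API data, e.g. GMaps distance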

@colinemonds
Author

colinemonds commented Sep 13, 2023

I wonder if it would be more natural here just to return the empty list if there's nothing configured. Makes the typing judgement simpler.

I agree and am happy to do it that way, but be advised that this is inconsistent with some things already in the code: in particular, max_rooms() returns None if the filter is not configured, and so do most of the other "trivial" filters -- whereas excluded_titles(), for example, returns an empty list if it's unconfigured. I chose to go with your recommendation, but somebody should probably clean this up.

I'm a bit "meh" on if-else chains. Do you think we could do this with Enum and Match?

We could, but the tests still run for Python 3.8 and Python 3.9, neither of which supports the match statement. (This is actually the same reason why I just finished changing | in type signatures to Union in the code, where necessary -- I had added them without realizing that Python 3.8 was still supported.)
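
For example, the annotation from the diff above has to be spelled with typing.Optional on 3.8/3.9, since the X | Y syntax (PEP 604) only arrived in 3.10 (sketch only; the string annotation just avoids needing DistanceConfig imported here):

from typing import List, Optional

# 3.8/3.9-compatible spelling of `List[DistanceConfig] | None`
def max_distance(self) -> Optional[List["DistanceConfig"]]:
    ...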

I have the tests almost green now. Most of them were broken on Windows anyway, so I had quite some cleaning up to do around them.

By the way, the test case for ebay Kleinanzeigen doesn't use a mock for requests - it just has a hardcoded URL and expects it to be forever online, it seems... maybe this should be cleaned up at some point. Oh, also, several of the hunter tests suffer from the results being rolled using randint, meaning that some of the tests are actually flaky - it just rolls the dice 30 times and expects at least one expose to pass the filters, but that's not guaranteed. I'm not fixing either of those just now, but somebody should look at that at some point.

What I did fix is a lot of code in the tests that looked like this:

config = Config(foo)
self.assertIsNotNone(config)

where the assertion obviously does nothing.
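
A check that actually exercises the config would look more like this (the expected value is made up for illustration):

config = Config(foo)
self.assertEqual(config.max_rooms(), 3)  # assert on a configured value (3 is made up here)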

@colinemonds
Author

colinemonds commented Sep 13, 2023

I also just plain broke some of the tests, because the Hunter implementations (including various subclasses) chose to save the exposes before running the processors on them. Therefore, they got saved without the additional information from the Google Maps API. Since I now actually filter based on that information, when the exposes got loaded again in the web hunter, where they get filtered again for display, my filter crashed.

Edit: I also just realized that this has certain implications for multi-user instances. It does mean that if an expose passes no filters at the time it is crawled, it will be marked as processed, but will not be saved. This is probably fine for a single-user instance, but in a multi-user instance, if an expose gets crawled, and then later a user registers with filter settings that would include that expose, they won't see it (because it hasn't been saved, and it will not be crawled again). I am uncertain what the correct solution is.

Edit 2: The correct solution is somewhat more complicated - we would have to differentiate between exposes that have been saved with extra data (like GMaps data), and exposes that have been saved without that extra data. Then later, if a filter happens to include an expose that has been saved without extra data, we need to run the processor (like the GMaps distance calculation), and then filter again in the postprocessing chain. This might be possible to fit in the current "pipe" model somewhere, but it requires some thought. (Just braindumping here.)

@colinemonds
Author

colinemonds commented Sep 13, 2023

Also, I introduced another real bug: because I used @dataclass, the exposes can't be JSON serialized, as json.dumps doesn't know how to serialize arbitrary objects such as dataclass instances out of the box. (This again only comes up in the web hunter, which I didn't use, so I didn't realize that.)

I chose to go for NamedTuple instead, which is still not a dictionary, but json.dumps and json.loads can be taught to read/write those with minimal effort, which I did implement.
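
Roughly the approach (a sketch with made-up fields, not the actual flathunter classes):

import json
from typing import NamedTuple, Optional

class Expose(NamedTuple):
    # Hypothetical fields - the real expose carries more data.
    id: int
    title: str
    price: str
    durations: Optional[str] = None

def expose_to_json(expose: Expose) -> str:
    # A NamedTuple would otherwise be dumped as a plain JSON array,
    # so convert it to a dict first to keep the field names.
    return json.dumps(expose._asdict())

def expose_from_json(raw: str) -> Expose:
    # json.loads gives back a dict; unpack it into the NamedTuple again.
    return Expose(**json.loads(raw))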

The correct solution for when one wants to have type-safe data being passed around and serialized is obviously to use a proper serialization library, like jsonpickle or what have you. However, the risk is that the JSON string might not look exactly the same, which means that if one were to switch, flathunter instances could break on data they already have in their database. I will definitely not make that choice now, but it might be considered in the future.

I am not sure if you are alright with me introducing type-safe objects at all, given those constraints. If you would rather have me give up on type safety and just use stringly-typed dicts like the other code already does, just tell me so.

@codders

codders commented Sep 13, 2023

Thanks for the feedback and the comments - looks like you're pretty deep into the weeds on this one, and I appreciate you looking into the feedback so deeply.

I think I will need to take some more time to also look at this in more detail, and I won't get that done today (or probably this week).

For Python version, I would be happy to move to supporting 3.11 and 3.10 - 3.8 and 3.9 have been out of active support for a while now. I don't think it's a big deal for new users of the code to use the current version of python: Debian stable has 3.11 and Ubuntu 22.04 LTS has 3.10.

From my side, it's up to you how much effort you put into type safety on dicts. I'm happy if most of the functions have a meaningful type restriction, and the tighter the constraint the better, but even a generic dict is already a lot better than the situation before we had any annotations or checking. I don't want you to give up on your PR because of the type-checker :)

@codders

codders commented Sep 13, 2023

I've updated main now to test only against python 3.10 and 3.11. Feel free to add code incompatible with older versions. Thanks!

@codders

codders commented Sep 20, 2023

Hi again @colinemonds ,

I've prepared another PR based on your code - #477 - that passes the linter and type-checker. I also made most of the tests pass, but the web interface hunting tests are still failing, probably because of the issue you described. If you have time to check it out, or you want to copy those changes over to your branch, please do. Otherwise I will look again some time in the coming weeks at how to make the remaining tests pass.

@colinemonds
Author

colinemonds commented Sep 20, 2023

Oh, thanks @codders ! Sorry for not getting back to you earlier; it seems we have done some duplicate work now. I already created a commit that likewise passes the type checks, the linter, and almost all the tests, but I didn't push it yet. The reason is that, like you, I can't make the web hunter tests pass.

I think the web hunter needs to be thoroughly refactored for this to work. It's quite obvious from the way that a lot of the storage mechanism is set up that the database was never really designed to support a multi-user instance. (The class name "IDMaintainer" is a tell on its own - it was clearly designed to remember which expose IDs had already been scraped, and to do nothing else. It does a lot more now.)

A possible hack might be to give up on scraping the exposes only once, and instead try and scrape them only once per user. The advantage would be that later steps in the pipeline (like the Google Maps distance calculation) would run per user, and hence also run with the configuration for that user (which is obviously important in the Google Maps case, as each user might have different locations set in their configuration). I have a half-baked implementation of this.

The downside is that on large instances, there is a lot more scraping to be done. In the case of immobilienscout24 and other sites with strong anti-botting protection, that's a real pain.

A proper solution would entail splitting not only the filters but also the processors into two steps: a dedicated scraping chain that runs only once per expose (e.g. the fetching, anti-anti-botting, and the actual magic on the HTML), and a dedicated processor chain that runs once per user (currently for the Google Maps stuff, but potentially for other things in the future). We could then store the interim results of the scraping chain in the database, same as we currently do. For each user whose pre-processor filters mark a stored expose as interesting purely on that basis, we would run the processor chain on that expose, store the result in a new table indexed by a composite key (expose_id, user_id), and then run the post-processor filters on that. This would ensure that minimal work is done.

Unfortunately, the hunter implementations aren't very DRY, so this is going to take a little work just making changes all over the place. It's still probably the best thing to do.
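
Sketched in code, the proposed split might look roughly like this (every name below is hypothetical, not the current flathunter API):

def scrape_once(crawlers, storage):
    # Runs once per expose, independent of users: fetching, anti-bot, HTML parsing.
    for crawler in crawlers:
        for expose in crawler.scrape():
            storage.save_raw_expose(expose)

def hunt_for_user(user, storage, processors, pre_filters, post_filters):
    # Runs once per user, with that user's configuration.
    for expose in storage.raw_exposes():
        if not all(f.is_interesting(expose, user.config) for f in pre_filters):
            continue
        for processor in processors:  # e.g. GMaps durations for this user's locations
            expose = processor.process_expose(expose, user.config)
        # Store the per-user result keyed by (expose_id, user_id).
        storage.save_processed_expose(user.id, expose["id"], expose)
        if all(f.is_interesting(expose, user.config) for f in post_filters):
            yield expose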

@codders

codders commented Sep 20, 2023

Hi @colinemonds ,

Thanks for the update. I'm supportive of larger refactors, but I don't have the capacity myself to do that right now. So for me, the question of whether to hack it or not comes down to whether you have the enthusiasm to do a larger refactor of the functionality. I would prefer to have the functionality integrated with a bit of a hack than to lose it on a feature branch that only you have. What would be your preference?

@colinemonds
Author

The problem is that I can't say what will break if I actually finish this hack. I can happily push the commit later (tomorrow, probably) for you to look at, then at least it won't get lost on a feature branch that only I have.

But still. I think the refactor is needed before my work can get properly pulled. The code base has enough technical debt as it is, and this might also introduce bugs on top of that. And I think that might matter a lot for the (your?) public web-hosted instance of flathunter, where you'll just end up with more support tickets. Might as well put in the work now and not have to worry later.

I can't promise when I'll actually have the time, but I might be able to fit this in on the weekend.

@codders

codders commented Sep 21, 2023

Yeah. Scraping once per user isn't an option - Kleinanzeigen has IP blocking for bots that scrape too fast and Immoscout as you say will probably get annoyed. I don't know the exact user numbers for my instance because I don't keep any analytics, but I think it's in the hundreds of active users. So at least my use case for WebHunter requires that we only scrape once.

I might try and take a look at this in the coming weeks. Will have to see how my workload is.
