Modernizing the code base #477

uyar · 2023-08-14T17:37:28Z

uyar
Aug 14, 2023
Maintainer

Hi,

It would be great if we could find some time to review and modernize the code base. Things to consider:

Dropping support for old Python versions (Python 2 and everything below Python 3.8).
Switching to a pyproject.toml based setup.
Adding type hints.

I've made several attempts at this (except for type hints). It's not very difficult, but once started, it has to be completed and tested quickly because otherwise it becomes difficult to adapt changes to the original branch.

Today I also noticed that some of the data in the HTML files is in fact also provided in JSON format, the bottom-100 chart for instance. This makes it much easier to handle: parsing becomes just a mapping from JSON to our custom dicts. But we don't have JSON-handling parsers right now.

Another idea that might be worth exploring is to define dataclasses for movies and people, in addition to the dict-based API.
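
A rough sketch of what the dataclass idea could look like. The class and field names here (Movie, Person, imdb_id, cast) are made up for illustration and are not the existing Cinemagoer API. The `__getitem__` method is one possible way to keep dict-style access working next to attribute access.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import List, Optional


@dataclass
class Person:
    imdb_id: str
    name: str


@dataclass
class Movie:
    imdb_id: str
    title: str
    year: Optional[int] = None
    cast: List[Person] = field(default_factory=list)

    def __getitem__(self, key: str):
        # Dict-style access (movie["title"]) limited to declared fields.
        if key not in {f.name for f in fields(self)}:
            raise KeyError(key)
        return getattr(self, key)


movie = Movie(
    imdb_id="tt0133093",
    title="The Matrix",
    year=1999,
    cast=[Person(imdb_id="nm0000206", name="Keanu Reeves")],
)
print(movie.title)      # attribute access
print(movie["year"])    # dict-style access
print(asdict(movie))    # plain nested dicts, e.g. for serialization
```

With something like this, type checkers can see the fields, while `asdict()` still gives plain dicts where they are needed.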

Any ideas about how (or whether) to proceed?

--
Turgut

uyar · 2023-08-15T19:22:44Z

uyar
Aug 15, 2023
Maintainer Author

By the way, I noticed that some pages contain their data as a JSON string within a script tag. You can search for __NEXT_DATA__ in the HTML documents. For these, we don't even need to scrape anything; the data is just there and only needs some transformation. We probably don't even need an HTML parser. I did a quick check on the pages and quite a few of them look like they can be handled this way.

1 reply

uyar Aug 15, 2023
Maintainer Author

Oh, that was stupid of me. I didn't realize that I had written this in the original post. I thought I had written it to an issue discussion.

uyar · 2023-08-15T19:56:33Z

uyar
Aug 15, 2023
Maintainer Author

Here's a small example (without using Cinemagoer at all) for fetching person bio info:

import json
from urllib.request import Request, urlopen

# Opening tag of the script element that carries the page data as JSON.
SCRIPT_TAG = '<script id="__NEXT_DATA__" type="application/json">'

keanu_bio = "https://www.imdb.com/name/nm0000206/bio/"
request = Request(keanu_bio)
# Send a browser-like User-Agent with the request.
request.add_header("User-Agent", "Mozilla/5.0 (X11)")
with urlopen(request) as page:
    content = page.read().decode("utf-8")

# Slice out the text between the opening tag and the next closing tag.
script_start = content.index(SCRIPT_TAG)
script_end = content.index("</script>", script_start + len(SCRIPT_TAG))
json_content = content[script_start + len(SCRIPT_TAG):script_end]
data = json.loads(json_content)
# Drill down to the person metadata and pretty-print it.
page_data = data["props"]["pageProps"]["contentData"]["entityMetadata"]
print(json.dumps(page_data, indent=2))
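
The same extraction step could be wrapped in a small function so that it can be tested without hitting the network. This is only a sketch: the function name extract_next_data and the sample page are made up, and the "id" key in the sample payload is an assumption, not something taken from a real response.

```python
import json

SCRIPT_TAG = '<script id="__NEXT_DATA__" type="application/json">'


def extract_next_data(html: str) -> dict:
    """Return the decoded JSON payload embedded in the __NEXT_DATA__ script tag."""
    start = html.find(SCRIPT_TAG)
    if start == -1:
        raise ValueError("no __NEXT_DATA__ script tag found")
    start += len(SCRIPT_TAG)
    end = html.find("</script>", start)
    if end == -1:
        raise ValueError("unterminated __NEXT_DATA__ script tag")
    return json.loads(html[start:end])


# Made-up stand-in for a downloaded page; a real payload is much larger.
sample_page = (
    "<html><body>"
    + SCRIPT_TAG
    + '{"props": {"pageProps": {"contentData": {"entityMetadata": {"id": "nm0000206"}}}}}'
    + "</script></body></html>"
)

data = extract_next_data(sample_page)
print(data["props"]["pageProps"]["contentData"]["entityMetadata"])
```

Raising ValueError when the tag is missing makes it easy for a caller to notice that a page does not carry the payload.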

alberanid · 2023-08-20T13:25:16Z

alberanid
Aug 20, 2023
Maintainer

I'm fine with dropping support for old Python versions; this can be done intentionally (removing / refactoring code) or simply by developing new features without maintaining compatibility. The first option is probably better, resulting in cleaner code and making clear that the changes were introduced intentionally.

I'm not familiar with pyproject.toml, so I do not have any opinion about it: if it provides some benefit, it's okay to switch.

Regarding the JSON... that's interesting; it's hard to tell how stable it is and how much information will be moved there, but it's surely easier to parse.

As always: right now it's difficult for me to lead the development, and I do not want to slow it down, so any fix and improvement is welcome. I can also help if there is a need to change a lot of parsers and we want to split the workload.

uyar · 2023-08-20T14:00:08Z

uyar
Aug 20, 2023
Maintainer Author

> I'm fine with dropping support for old Python versions; this can be done intentionally (removing / refactoring code) or simply by developing new features without maintaining compatibility. The first option is probably better, resulting in cleaner code and making clear that the changes were introduced intentionally.

The Cinemagoer code base is very mature, so unless we add some new features, I don't foresee any technical need for a Python feature that would force us to drop Py2 or early Py3. That being said, I'm in favor of option 1 (intentional drop) to get cleaner code. It would make it easier for developers to find their way around the code and not be confused by Py2-related statements (especially with respect to str-unicode issues).

> I'm not familiar with pyproject.toml, so I do not have any opinion about it: if it provides some benefit, it's okay to switch.

Its main advantage is that the configuration lives in one file instead of many (setup.py, setup.cfg, tox.ini, MANIFEST.in, etc.).

> Regarding the JSON... that's interesting; it's hard to tell how stable it is and how much information will be moved there, but it's surely easier to parse.

I agree with your concern here. It's strange that IMDb is giving its data away like this. I'm reluctant to build parsers that depend on it, since it might vanish. On the other hand, those parsers wouldn't be that difficult to write, so even if it does vanish, not much work will have been wasted.
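
One way to limit that risk would be to try the JSON payload first and fall back to scraping the HTML when it's missing. A minimal sketch of that idea follows; the function names, the ["props"]["pageProps"]["title"] path and the h1-based fallback are all made up for illustration and don't reflect the real page layout or the existing parsers.

```python
import json
import re
from typing import Optional

NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)


def parse_title_from_json(html: str) -> Optional[str]:
    """Read the title from the JSON payload, or return None if unavailable."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        data = json.loads(match.group(1))
        return data["props"]["pageProps"]["title"]  # hypothetical path
    except (ValueError, KeyError, TypeError):
        # Malformed JSON or a changed payload layout: let the caller fall back.
        return None


def parse_title_from_html(html: str) -> Optional[str]:
    """Fallback: scrape the first h1 element (hypothetical markup)."""
    match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_title(html: str) -> Optional[str]:
    # Try the cheap JSON route first, then the HTML scraper.
    for parser in (parse_title_from_json, parse_title_from_html):
        result = parser(html)
        if result is not None:
            return result
    return None


with_json = (
    '<h1>Old markup</h1><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"title": "The Matrix"}}}</script>'
)
without_json = "<html><h1> The Matrix </h1></html>"
print(parse_title(with_json))
print(parse_title(without_json))
```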

> As always: right now it's difficult for me to lead the development, and I do not want to slow it down, so any fix and improvement is welcome. I can also help if there is a need to change a lot of parsers and we want to split the workload.

I can do the transitioning work. The only thing that concerns me is that a lot of our tests are failing at the moment. It would be safer to fix those before we move on to refactoring, so that we don't cause regressions later. And that leaves us with some problems: When fixing the parsers, do we use the JSON data or not? Do we support multiple rules or not?

uyar · 2023-10-30T19:11:58Z

uyar
Oct 30, 2023
Maintainer Author

I have started working on a complete rewrite in the cinemagoerng repository. It will use the __NEXT_DATA__ content (the JSON payload) whenever available. It's also using a lot of relatively recent Python features like type hints, dataclasses and pattern matching, so it only supports Python 3.10 and up. At the moment, it's in very early stages; it's only meant to give an idea.

In the future, I hope to be able to update the parsing spec files separately from the package itself.
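
One possible shape for that, sketched under assumptions: the spec is plain data (JSON here) that maps output field names to key paths inside the payload, and a generic function applies it. The spec format, the extract function and the "nameText"/"id" keys are made up for illustration and are not the actual cinemagoerng spec format.

```python
import json

# Hypothetical spec: output field name -> key path inside the payload.
SPEC_TEXT = """
{
  "imdb_id": ["props", "pageProps", "contentData", "entityMetadata", "id"],
  "name": ["props", "pageProps", "contentData", "entityMetadata", "nameText", "text"]
}
"""


def extract(payload: dict, spec: dict) -> dict:
    """Follow each key path in the spec and collect the values that exist."""
    result = {}
    for field_name, path in spec.items():
        value = payload
        for key in path:
            if not isinstance(value, dict) or key not in value:
                value = None  # path is missing from this payload
                break
            value = value[key]
        if value is not None:
            result[field_name] = value
    return result


spec = json.loads(SPEC_TEXT)
payload = {
    "props": {
        "pageProps": {
            "contentData": {
                "entityMetadata": {
                    "id": "nm0000206",
                    "nameText": {"text": "Keanu Reeves"},
                }
            }
        }
    }
}
print(extract(payload, spec))
```

Because the spec is data rather than code, a changed key path could be fixed by shipping a new spec file without releasing a new version of the package.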
