-
By the way, I noticed that some pages contain their data as a JSON string within a script tag. You can search for
-
Here's a small example (without using Cinemagoer at all) for fetching person bio info:

import json
from urllib.request import Request, urlopen

# Opening tag of the script element that holds the page data as JSON.
SCRIPT_TAG = '<script id="__NEXT_DATA__" type="application/json">'

keanu_bio = "https://www.imdb.com/name/nm0000206/bio/"
request = Request(keanu_bio)
# Set a browser-like User-Agent header on the request.
request.add_header("User-Agent", "Mozilla/5.0 (X11)")
with urlopen(request) as page:
    content = page.read().decode("utf-8")

# Slice out the text between the opening tag and the next closing tag.
script_start = content.index(SCRIPT_TAG)
script_end = content.index("</script>", script_start + len(SCRIPT_TAG))
json_content = content[script_start + len(SCRIPT_TAG):script_end]
data = json.loads(json_content)
page_data = data["props"]["pageProps"]["contentData"]["entityMetadata"]
print(json.dumps(page_data, indent=2))
-
I'm fine with dropping support for old Python versions. This can be done intentionally (removing or refactoring code) or simply by developing new features without maintaining compatibility. The first option is probably better: it results in cleaner code and makes it clear that the changes were introduced on purpose.
I'm not familiar with
Regarding the JSON... that's interesting. It's hard to tell how stable it is and how much information will be moved there, but it's surely easier to parse.
As always: right now it's difficult for me to lead the development, and I do not want to slow it down, so any fix or improvement is welcome. I can also help if there is a need to change a lot of parsers and we want to split the workload.
-
The Cinemagoer code base is very mature, so unless we add some new features, I don't foresee any technical reason to use some Python feature forcing us to drop Py2 or early Py3. That being said, I'm in favor of option 1 (intentional drop) to get cleaner code. It would make it easier for developers to find their way around the code and not be confused by Py2-related statements (especially with respect to str-unicode issues).
Its main advantage is that it's only one file compared to many (setup.py, setup.cfg, tox.ini, MANIFEST.in, etc.).
I agree with your concern here. It's strange that IMDb is giving its data away like this. I'm reluctant to build parsers depending on this since it might vanish. On the other hand, those parsers wouldn't be that difficult to write; so even if this vanishes, the amount of futile work will not be that much.
I can do the transitioning work. The only thing that concerns me is that a lot of our tests are failing at the moment. Fixing those before we move on to refactoring would be safer and would help avoid regressions later. That leaves us with some open questions: When fixing the parsers, do we use the JSON data or not? Do we support multiple rules or not?
-
I have started working on a complete rewrite in the cinemagoerng repository. It will use the __NEXT_DATA__ content (the JSON payload) whenever available. It also uses a lot of relatively recent Python features like type hints, dataclasses and pattern matching, so it only supports 3.10 and up. At the moment, it's in a very early stage; it's only meant to give an idea. In the future, I hope to be able to update the parsing spec files separately from the package itself.
-
Hi,
It would be great if we could find some time to review and modernize the code base. Things to consider:
I've made several attempts at this (except for type hints). It's not very difficult, but once started, it has to be completed and tested quickly because otherwise it becomes difficult to adapt changes to the original branch.
Today I also noticed that some of the data in the HTML files is in fact also provided in JSON format, for instance the bottom-100 chart. This makes it much easier to handle; it becomes just a mapping from JSON to our custom dicts. But we don't have JSON-handling parsers right now.
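
As a minimal sketch of what "just a mapping from JSON to our custom dicts" could look like: the payload layout, the key names and the `parse_chart` function below are all hypothetical, since the real structure of the chart JSON isn't shown here.

```python
import json

# Hypothetical payload, made up for illustration; the real chart JSON
# may be laid out differently.
SAMPLE = json.dumps({
    "chart": [
        {"id": "tt0000001", "title": "First Title", "year": 1999, "rating": 1.9},
        {"id": "tt0000002", "title": "Second Title", "year": 2004, "rating": 2.1},
    ]
})


def parse_chart(raw: str) -> list[dict]:
    """Map each JSON chart entry onto a plain dict with our own key names."""
    entries = json.loads(raw).get("chart", [])
    return [
        {
            "movieID": entry["id"].removeprefix("tt"),  # strip the "tt" prefix
            "title": entry["title"],
            "year": entry.get("year"),      # optional fields default to None
            "rating": entry.get("rating"),
            "rank": rank,                   # position in the chart, 1-based
        }
        for rank, entry in enumerate(entries, start=1)
    ]


movies = parse_chart(SAMPLE)
print(movies[0]["title"], movies[0]["rank"])
```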
Another idea that might be worth exploring is to define dataclasses for movies and people, in addition to the dict-based API.
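
A minimal sketch of how dataclasses could sit next to the dict-based API. The class and field names are assumptions for illustration, not an existing Cinemagoer interface, and the movie values are placeholders.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Person:
    person_id: str
    name: str


@dataclass
class Movie:
    movie_id: str
    title: str
    year: Optional[int] = None
    cast: list[Person] = field(default_factory=list)

    def as_dict(self) -> dict:
        """Return a plain dict; nested dataclasses are converted recursively."""
        return asdict(self)


# Placeholder movie data; only the person ID appears earlier in this thread.
movie = Movie(
    movie_id="0000001",
    title="Example Movie",
    year=2000,
    cast=[Person(person_id="0000206", name="Keanu Reeves")],
)
print(movie.title)             # attribute access via the dataclass
print(movie.as_dict()["cast"])  # dict-style access for existing callers
```

With something like this, code that expects dicts can keep working through as_dict(), while new code gets typed attributes.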
Any ideas about how (or whether) to proceed?
--
Turgut