Resuming dump on a page that has unicode characters in the title results in the script finishing the page dump. #399

Open
zerote000 opened this issue Dec 30, 2020 · 1 comment

Comments

@zerote000
Contributor

The script fails to resume the XML dump when the title of the page to resume from contains Unicode (non-ASCII) characters. Instead of continuing from that page, it ends the page dump and moves on to downloading images.

Here is an example (Mineland Wiki) where the dump is interrupted because requests were sent too quickly (HTTP 429). It stops after completing "Christiey14".

    Carraighenské železárny, 1 edit
    Celestira, 2 edits
    Christiey14, 2 edits
Downloaded 10 pages
HTTP Error 429.
Server error, max retries exceeded.
Please resume the dump later.

Then, when resuming the dump, the following happens. It tries to resume starting with page "Chrám Nerys I.", but fails, exits page dumping, and starts image dumping.

Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "Chrám Nerys I."
https://wiki.mineland.eu/w/api.php
Retrieving the XML for every page from "Chrám Nerys I."
Removing the last chunk of past XML dump: it is probably incomplete.
dumpgenerator.py:1117: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif seeking and title != start:
XML dump saved at... wikiminelandeu_w-20201230-history.xml
Image list is incomplete. Reloading...
Retrieving image filenames
...........
@zerote000
Contributor Author

zerote000 commented Jan 6, 2021

Couldn't this be fixed by changing line 1114 from

title = line.decode("utf-8").strip()

to

title = line.strip()

?

If the starting title is not decoded, it makes no sense to decode the titles read from the title list before comparing them. If a decoded title is needed, the decoding can be done after the comparison.
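
To illustrate the point, here is a minimal sketch of a seek loop that compares like with like. It is not the code from dumpgenerator.py: `to_text` and `titles_from` are hypothetical names, and it is written for Python 3, where a bytes value and a str value never compare equal (the script itself runs on Python 2, where the same mismatch produces the UnicodeWarning shown above). The sketch normalizes both sides to text; stripping without decoding on both sides, as proposed, would work on the same principle.

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def titles_from(lines, start=None):
    """Yield titles from lines, skipping everything before start.

    Hypothetical helper: the title and the start value are both normalized
    to str before the comparison, so a bytes/str mismatch cannot make the
    seek skip every title in the list.
    """
    start = to_text(start) if start is not None else None
    seeking = start is not None
    for line in lines:
        title = to_text(line).strip()
        if seeking and title != start:
            continue  # still before the page to resume from
        seeking = False
        yield title


# Title list as raw UTF-8 lines, start title given as undecoded bytes.
raw_lines = [t.encode("utf-8") for t in ("Celestira\n", "Christiey14\n", "Chrám Nerys I.\n")]
resumed = list(titles_from(raw_lines, start="Chrám Nerys I.".encode("utf-8")))
```

With the mismatched comparison, the loop would never find the start title and would yield nothing, which matches the behaviour in the log: the page dump ends immediately and the script proceeds to the images.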
