Resuming dump on a page that has unicode characters in the title results in the script finishing the page dump. #399

zerote000 · 2020-12-30T01:50:43Z

The script fails to resume on pages with Unicode characters in the title. Instead, it finishes the page dump and moves on to grabbing images.

Here is an example (Mineland Wiki) where the dump fails because requests were sent too quickly. It fails after completing "Christiey14".

    Carraighenské železárny, 1 edit
    Celestira, 2 edits
    Christiey14, 2 edits
Downloaded 10 pages
HTTP Error 429.
Server error, max retries exceeded.
Please resume the dump later.

Then, when resuming the dump, the following happens. It tries to resume starting with page "Chrám Nerys I.", but fails, exits page dumping, and starts image dumping.

Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "Chrám Nerys I."
https://wiki.mineland.eu/w/api.php
Retrieving the XML for every page from "Chrám Nerys I."
Removing the last chunk of past XML dump: it is probably incomplete.
dumpgenerator.py:1117: UnicodeWarning: Unicode unequal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  elif seeking and title != start:
XML dump saved at... wikiminelandeu_w-20201230-history.xml
Image list is incomplete. Reloading...
Retrieving image filenames
...........


zerote000 · 2021-01-06T22:24:26Z

Couldn't this be fixed by changing line 1114 from

title = line.decode("utf-8").strip()

to

title = line.strip()

?

If the starting title is not decoded, then it doesn't make sense to decode the titles from the title list. If the title needs to be decoded, it can be done after the comparison.
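
To illustrate the mismatch described above, here is a minimal sketch, not the actual dumpgenerator.py code. It is written for Python 3 (the script itself appears to run on Python 2, judging by the UnicodeWarning), and the names `to_text` and `titles_from` are hypothetical. The idea is simply to bring the start title and each line of the title list to the same type before comparing them, so the seek can actually find the resume point.

```python
def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def titles_from(lines, start):
    """Yield titles from `lines`, beginning at `start` (inclusive).

    Both sides of the comparison are normalized to text, so a byte-string
    start title and a decoded title (or the reverse) can still match.
    """
    start = to_text(start).strip()
    seeking = True
    for line in lines:
        title = to_text(line).strip()
        if seeking and title != start:
            continue  # still looking for the resume point
        seeking = False
        yield title


# Hypothetical title list, as raw UTF-8 bytes read from a file.
lines = [
    "Celestira\n".encode("utf-8"),
    "Chrám Nerys I.\n".encode("utf-8"),
    "Christiey14\n".encode("utf-8"),
]

# The start title arrives undecoded, the list entries get decoded:
# without normalization these would never compare equal.
resumed = list(titles_from(lines, "Chrám Nerys I.".encode("utf-8")))
print(resumed)
```

If the two sides are left as different types, the comparison never succeeds, every title is skipped, and the page dump ends immediately, which matches the behavior in the log. In Python 3 that mismatch is silent unless the interpreter is started with the `-b` flag.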
