Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use urllib.quote on str, not unicode, for image titles #469

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Pokechu22
Copy link
Contributor

Before, files with non-ascii names would crash when using --xmlrevisions (a regression from #451 by @yzqzss). For example, French Wikiversity has Fichier:Ensemble_access_croisé.jpg (I'm using that as an example here because it's a small site that I know will continue to exist).

$ python2 -b -u dumpgenerator.py --xmlrevisions --images --force --api https://fr.wikiversity.org/w/api.php
Checking API... https://fr.wikiversity.org/w/api.php
API is OK: https://fr.wikiversity.org/w/api.php
Checking index.php... https://fr.wikiversity.org/w/index.php
index.php is OK
PLEASE, DO NOT USE THIS SCRIPT TO DOWNLOAD WIKIMEDIA PROJECTS!
Download the dumps from http://dumps.wikimedia.org
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2023 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://fr.wikiversity.org/w/api.php
Trying generating a new dump into a new directory...
Retrieving image filenames
..    Found 84 images
84 image names loaded
Image filenames and URLs saved at... frwikiversityorg_w-20230624-images.txt
Retrieving images from "start"
Creating "./frwikiversityorg_w-20230624-wikidump/images" directory
    Downloaded 10 images
    Downloaded 20 images
    Downloaded 30 images
    Downloaded 40 images
/usr/lib/python2.7/urllib.py:1306: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  return ''.join(map(quoter, s))
Traceback (most recent call last):
  File "dumpgenerator.py", line 2572, in <module>
    main()
  File "dumpgenerator.py", line 2564, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 2147, in createNewDump
    session=other['session'])
  File "dumpgenerator.py", line 1524, in generateImageDump
    r = session.get(config['api'] + u"?action=query&export&exportnowrap&titles=%s" % urllib.quote(title))
  File "/usr/lib/python2.7/urllib.py", line 1306, in quote
    return ''.join(map(quoter, s))
KeyError: u'\xe9'

Python 2's str vs unicode and the implicit conversions are a mess.

@nemobis
Copy link
Member

nemobis commented Jun 25, 2023

Ah yes, that error has been annoying. Thanks!

Do we really need to pass the URL as unicode? We have the titles as unicode because we're trying to convert at the I/O boundary (so that e.g. we compare titles in unicode only), but I'm not sure requests needs it.

@Pokechu22
Copy link
Contributor Author

config[api] is str, so probably the URL should also be str, good catch.

@Pokechu22
Copy link
Contributor Author

Anything holding up this being merged?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants