Assorted dumpgenerator.py failures with some Miraheze (MediaWiki 1.39.3) wikis #467

Open
nemobis opened this issue Jun 17, 2023 · 7 comments

@nemobis
Member

nemobis commented Jun 17, 2023

Titles saved at... bigforestmirahezeorg_w-20230617-titles.txt
18795 page titles loaded
https://bigforest.miraheze.org/w/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
42 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Traceback (most recent call last):
  File "dumpgenerator.py", line 2572, in <module>
    main()
  File "dumpgenerator.py", line 2564, in main
    createNewDump(config=config, other=other)
  File "dumpgenerator.py", line 2135, in createNewDump
    generateXMLDump(config=config, titles=titles, session=other['session'])
  File "dumpgenerator.py", line 742, in generateXMLDump
    for xml in getXMLRevisions(config=config, session=session, start=start):
  File "dumpgenerator.py", line 843, in getXMLRevisions
    for page in arvrequest['query']['allrevisions']:
UnboundLocalError: local variable 'arvrequest' referenced before assignment
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.

Not sure what's special about this wiki https://bigforest.miraheze.org/wiki/%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84

Maybe it was just an occasional error.

@nemobis nemobis added the bug label Jun 17, 2023
@nemobis nemobis added this to the 0.4 milestone Jun 17, 2023
@GT-610
Contributor

GT-610 commented Jun 17, 2023

I tried another wiki (distrowiki.miraheze.org) and nothing went wrong. Maybe it's an occasional error, an issue related to the Python version, or something else.

@nemobis
Member Author

nemobis commented Jun 18, 2023

I think I got an HTTP 429 error, but we catch it and just proceed as if nothing happened:

                while True:
                    try:
                        arvrequest = site.api(http_method=config['http_method'], **arvparams)
                    except requests.exceptions.HTTPError as e:
                        if e.response.status_code == 405 and config['http_method'] == "POST":
                            print("POST request to the API failed, retrying with GET")
                            config['http_method'] = "GET"
                            continue

We should ideally implement a retry mechanism as we have in getXMLPage(), to avoid endless loops.
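
A bounded retry of that kind could look roughly like the sketch below (Python 3). It is not the code from getXMLPage(): TransientHTTPError and call_with_retries are hypothetical names introduced here for illustration, and the linear 20/40/60-second waits are an assumption modelled on the "Waiting 20 seconds and reloading..." messages the script already prints.

```python
import time


class TransientHTTPError(Exception):
    """Hypothetical stand-in for requests.exceptions.HTTPError with a status code."""

    def __init__(self, status_code):
        super().__init__("HTTP %d" % status_code)
        self.status_code = status_code


def call_with_retries(request, max_retries=5, base_delay=20, sleep=time.sleep):
    """Call request() until it succeeds or max_retries attempts have failed.

    The wait grows linearly (base_delay * attempt). After the last failed
    attempt the error is re-raised, so the loop can never run forever and
    the caller never continues without a response.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return request()
        except TransientHTTPError as e:
            if attempt == max_retries:
                raise
            delay = base_delay * attempt
            print("HTTP Error %d in attempt %d. Waiting %d seconds and reloading..."
                  % (e.status_code, attempt, delay))
            sleep(delay)
```

The sleep parameter is only there so the waiting can be replaced in tests; in real use the default time.sleep applies.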

@nemobis
Member Author

nemobis commented Jun 18, 2023

Image filenames and URLs saved at... denbagovmirahezeorg_w-20230618-images.txt                                        
Retrieving images from "start"                                                                                        
Creating "./denbagovmirahezeorg_w-20230618-wikidump-2/images" directory                                               
Traceback (most recent call last):                                                                                    
  File "dumpgenerator.py", line 2572, in <module>                                                                                                                                                                                           
    main()                                                                                                            
  File "dumpgenerator.py", line 2564, in main                                                                         
    createNewDump(config=config, other=other)                                                                         
  File "dumpgenerator.py", line 2147, in createNewDump                                                                
    session=other['session'])                                                                                         
  File "dumpgenerator.py", line 1524, in generateImageDump
    r = session.get(config['api'] + u"?action=query&export&exportnowrap&titles=%s" % urllib.quote(title))             
  File "/usr/lib/python2.7/urllib.py", line 1306, in quote                                                            
    return ''.join(map(quoter, s))                                                                                    
KeyError: u'\u0420'                                                                                                   
tail: cannot open 'denbagovmirahezeorg_w-20230618-wikidump/denbagovmirahezeorg_w-20230618-history.xml' for reading: No such file or directory
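
The KeyError above is consistent with Python 2's urllib.quote() being handed a unicode title that contains non-ASCII characters (here U+0420, a Cyrillic letter), since it only knows how to map byte values. A minimal sketch of a workaround, assuming the title is encoded to UTF-8 before quoting; quote_title is a hypothetical helper, not a function in dumpgenerator.py:

```python
try:
    from urllib import quote  # Python 2
except ImportError:
    from urllib.parse import quote  # Python 3


def quote_title(title):
    """Percent-encode a page title, encoding text to UTF-8 bytes first."""
    if not isinstance(title, bytes):
        # unicode on Python 2, str on Python 3: turn it into UTF-8 bytes
        title = title.encode("utf-8")
    return quote(title)
```

With this helper the call in generateImageDump would pass quote_title(title) instead of urllib.quote(title); whether the rest of that function copes with non-ASCII titles is not checked here.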

@nemobis
Member Author

nemobis commented Jun 18, 2023

I don't understand the HTTP 502 errors

Analysing https://ubrwiki.miraheze.org/w/api.php                                                                      
Trying generating a new dump into a new directory...                                                                  
Retrieving image filenames                                                                                            
......................................    Found 1851 images                                                           
1851 image names loaded                                                                                               
Image filenames and URLs saved at... ubrwikimirahezeorg_w-20230618-images.txt                                         
Retrieving images from "start"                           
Creating "./ubrwikimirahezeorg_w-20230618-wikidump/images" directory     
    Downloaded 10 images                                                                                              
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)       
    In attempt 1, XML for "Image:1,00_M$.png" is wrong. Waiting 20 seconds and reloading...
    Downloaded 20 images                                                                                              
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)       
    In attempt 1, XML for "Image:1900.png" is wrong. Waiting 20 seconds and reloading...
    Downloaded 30 images                                                                                              
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)       
    In attempt 1, XML for "Image:2_turno.png" is wrong. Waiting 20 seconds and reloading...                           
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)       
    In attempt 2, XML for "Image:2_turno.png" is wrong. Waiting 40 seconds and reloading...                           
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
    In attempt 3, XML for "Image:2_turno.png" is wrong. Waiting 60 seconds and reloading...
    Read timeout: HTTPSConnectionPool(host='ubrwiki.miraheze.org', port=443): Read timed out. (read timeout=10)
    In attempt 4, XML for "Image:2_turno.png" is wrong. Waiting 80 seconds and reloading...
HTTP Error 502.                                          
Server error, max retries exceeded.                                                                                   
Please resume the dump later.
https://ubrwiki.miraheze.org/w/index.php?action=submit&curonly=1&limit=1&pages=Image%3A20M%24.png&title=Special%3AExport

@nemobis nemobis changed the title from "dumpgenerator --xmlrevisions failures with some Miraheze (MediaWiki 1.39.3) wikis" to "Assorted dumpgenerator.py failures with some Miraheze (MediaWiki 1.39.3) wikis" Jun 18, 2023
@nemobis
Member Author

nemobis commented Jun 20, 2023

ouch

Trying to export all revisions from namespace 2303
Trying to get wikitext from the allrevisions API and to build the XML
XML dump saved at... avidwiki_w-20230620-history.xml
Retrieving image filenames
........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................HTTP Error 429.
Server error, max retries exceeded.
Please resume the dump later.
https://www.avid.wiki/w/api.php?aiprop=url%7Cuser&format=json&aifrom=WBRZ_2013.png&list=allimages&ailimit=50&action=query
Changed directory to /mnt/at/wikiteam/avidwiki_w-20230620-wikidump
606332
606332
606332

@yzqzss
Contributor

yzqzss commented Jul 2, 2023

https://bigforest.miraheze.org/wiki/%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84

Not reproduced in the latest MW-Scraper.

Trying to export all revisions from namespace -1 (magic number refers to "all")
Trying to get wikitext from the allrevisions API and to build the XML
틀:동음이의, 30 edits (--xmlrevisions)
틀:반대, 1 edits (--xmlrevisions)
틀:찬성, 1 edits (--xmlrevisions)
틀:의견, 4 edits (--xmlrevisions)
틀:삭제, 4 edits (--xmlrevisions)
틀:유지, 2 edits (--xmlrevisions)
틀:이동, 1 edits (--xmlrevisions)
틀:넘겨주기, 1 edits (--xmlrevisions)
틀:중립, 1 edits (--xmlrevisions)
틀:병합, 1 edits (--xmlrevisions)
틀:질문, 1 edits (--xmlrevisions)
틀:분할, 1 edits (--xmlrevisions)
......

@yzqzss
Contributor

yzqzss commented Jul 2, 2023

> Not sure what's special about this wiki https://bigforest.miraheze.org/wiki/%ED%8A%B9%EC%88%98:%EB%B2%84%EC%A0%84
>
> Maybe it was just an occasional error.

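If the condition e.response.status_code == 405 and config['http_method'] == "POST" is False, the except block ends without reaching continue. Execution then falls through to the code that reads arvrequest, which was never assigned because site.api() raised, hence the UnboundLocalError.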

https://github.com/wikiteam/wikiteam/blob/b9f861d8c206bd39f1293f1c16c008a5c141b47b/dumpgenerator.py#L829
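
The fall-through can be avoided by re-raising any HTTP error that is not the 405-on-POST case. The following is a simplified Python 3 sketch, not the actual dumpgenerator.py code: FakeHTTPError, fetch_allrevisions and the api_call parameter are hypothetical stand-ins for requests.exceptions.HTTPError and site.api().

```python
class FakeHTTPError(Exception):
    """Hypothetical stand-in for requests.exceptions.HTTPError."""

    def __init__(self, status_code):
        super().__init__("HTTP %d" % status_code)
        self.status_code = status_code


def fetch_allrevisions(api_call, config):
    """Return the API response, switching from POST to GET after a 405."""
    while True:
        try:
            arvrequest = api_call(http_method=config["http_method"])
        except FakeHTTPError as e:
            if e.status_code == 405 and config["http_method"] == "POST":
                print("POST request to the API failed, retrying with GET")
                config["http_method"] = "GET"
                continue
            # Any other HTTP error (429, 502, ...) is re-raised here, so the
            # code below never runs with arvrequest unassigned.
            raise
        return arvrequest
```

Re-raising keeps the failure visible to the caller, which can then decide whether to retry or abort, instead of surfacing later as an UnboundLocalError.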

3 participants