dumpgenerator.py: false-positive missing pages #423

Open
thiagocferr opened this issue Mar 1, 2022 · 1 comment

While trying to generate an XML dump of the Touhou Wiki with dumpgenerator.py (master#d7b6924), I noticed that no XML was being generated for any page other than the Main Page. Every other entry was logged in the errors.log file as missing in the wiki (probably deleted).

For example, executing:

$ python2 dumpgenerator.py --api=https://en.touhouwiki.net/api.php --path ./test --xml

would successfully find and load the page titles from all namespaces, and then start downloading pages:

[...]
Analysing https://en.touhouwiki.net/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
24 namespaces found
    Retrieving titles in the namespace 0
    28061 titles retrieved in the namespace 0
[...]

71698 page titles loaded
https://en.touhouwiki.net/api.php
Retrieving the XML for every page from "start"
Downloaded 10 pages
[...]

But pausing the script and checking the errors.log file shows:

2022-02-28 21:26:22: The page "!?" was missing in the wiki (probably deleted)
2022-02-28 21:26:22: The page ""Activity"Case:04 -Cosmic Horoscope-" was missing in the wiki (probably deleted)
2022-02-28 21:26:22: The page ""Activity" Case:01 -Graveyard Memory-" was missing in the wiki (probably deleted)
2022-02-28 21:26:22: The page ""Activity" Case:02 -Nightmare Counselor-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:03 -Historical Vacation-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:05 -Forgotten Paradise-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:06 -Shining Future-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:07 -Dominated Realism-" was missing in the wiki (probably deleted)
2022-02-28 21:26:24: The page ""Activity" Case:08 -Midnight Syndrome-" was missing in the wiki (probably deleted)
2022-02-28 21:26:24: The page ""Everflowering" Masterpieces of Hatsunetsumiko's 2011 - 2013" was missing in the wiki (probably deleted)
2022-02-28 21:26:24: The page ""Everything but the Girl" Hatsunetsumiko's Dance Vocal Collection Vol.2" was missing in the wiki (probably deleted)

even though these pages actually exist.


Looking into it further, I was able to generate an XML dump (albeit with just one revision, as the wiki API seems not to support more) by changing the script's code to make a GET request instead of a POST request during the XML extraction process. More precisely:

--- a/dumpgenerator.py
+++ b/dumpgenerator.py
@@ -579,7 +579,7 @@ def getXMLPageCore(headers={}, params={}, config={}, session=None):
                 return ''  # empty xml
         # FIXME HANDLE HTTP Errors HERE
         try:
-            r = session.post(url=config['index'], params=params, headers=headers, timeout=10)
+            r = session.get(url=config['index'], params=params, headers=headers, timeout=10)
             handleStatusCode(r)
             xml = fixBOM(r)
         except requests.exceptions.ConnectionError as e:

This seems to work because the POST request returned XML without a </page> tag for the page, which would result in a PageMissingError in this code section:

def getXMLPage(config={}, title='', verbose=True, session=None):
    [...]
    xml = getXMLPageCore(params=params, config=config, session=session)
    if xml == "":
        raise ExportAbortedError(config['index'])
    if not "</page>" in xml:
        raise PageMissingError(params['title'], xml)

while a GET request returns page XML that includes the closing tag, so the page is saved to the main XML file.
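To reproduce the difference outside the script, something like the following could be used. It is only a sketch: `compare_methods` and `has_complete_page` are hypothetical helpers, not part of dumpgenerator.py, and the Special:Export parameters are an assumption about what the script sends. It deliberately passes `params=` for both methods, as the patched line does, so in both cases requests puts the values in the URL query string rather than in a form body.

```python
def has_complete_page(xml):
    """Same test getXMLPage applies: does the XML contain a closing </page> tag?"""
    return "</page>" in xml


def compare_methods(session, index_url, title, timeout=10):
    """Request one page from Special:Export via POST and via GET.

    `session` is any object with requests.Session-style .post() and .get().
    The parameters below are assumed, not copied from the script.
    """
    params = {
        "title": "Special:Export",
        "pages": title,
        "action": "submit",
        "curonly": 1,
    }
    results = {}
    for name, send in (("POST", session.post), ("GET", session.get)):
        # params= ends up in the query string for both methods.
        r = send(url=index_url, params=params, timeout=timeout)
        results[name] = has_complete_page(r.text)
    return results
```

Called with a `requests.Session()`, the wiki's index.php URL, and one of the titles from errors.log, this returns a dict showing, for each method, whether the response contained `</page>`.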


I'm not really sure why this is the case, or whether this change would break dump generation for other wikis (I also tested with the InstallGentoo Wiki, and the XML dump seemed to work just fine).
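If switching to GET unconditionally turns out to be risky for other wikis, one possible alternative is to keep POST as the first attempt and retry with GET only when the response lacks `</page>`. The sketch below is untested against real wikis. `fetch_export_xml` is a hypothetical helper, and it leaves out the script's `handleStatusCode` and `fixBOM` steps and its exception handling.

```python
def fetch_export_xml(session, index_url, params, headers=None, timeout=10):
    """Try POST first (the current behaviour); retry with GET if no </page>.

    `session` is any object with requests.Session-style .post() and .get().
    """
    r = session.post(url=index_url, params=params, headers=headers, timeout=timeout)
    if "</page>" in r.text:
        return r.text
    # POST gave no complete page: ask again with GET before giving up.
    r = session.get(url=index_url, params=params, headers=headers, timeout=timeout)
    return r.text
```

The caller would still run its own `</page>` check on the result, so a page that really is missing would still raise PageMissingError. The cost is one extra request for each such page.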

nemobis (Member) commented Mar 1, 2022 via email
