Troubleshooting minecraftwiki_zh_all recipe #1995

Open · TripleCamera opened this issue Feb 14, 2024 · 12 comments

TripleCamera commented Feb 14, 2024

Note: This has only been tested on MWoffliner v1.13.0 (since all openZIM scrapers use this version). Both the code and the config differ a lot between v1.13.0 and git main, so this still needs to be tested on git main.

The following description is mostly taken from my comment when troubleshooting the scrape for Minecraft Wiki (zh) (openzim/zim-requests#755).


The scraper reports Unable to find appropriate API end-point to retrieve article HTML when scraping Minecraft Wiki (zh). Here is a code analysis of MWoffliner v1.13.0.

Before the scrape starts, MWoffliner checks mobile REST API, desktop REST API, and VisualEditor API capabilities for a specific page (parameter testArticleId) in Downloader.checkCapabilities:

public async checkCapabilities(testArticleId = 'MediaWiki:Sidebar'): Promise<void> {
  // By default check all API's responses and set the capabilities
  // accordingly. We need to set a default page (always there because
  // installed per default) to request the REST API, otherwise it would
  // fail the check.
  this.mwCapabilities.mobileRestApiAvailable = await this.checkApiAvailabilty(this.mw.getMobileRestApiArticleUrl(testArticleId))
  this.mwCapabilities.desktopRestApiAvailable = await this.checkApiAvailabilty(this.mw.getDesktopRestApiArticleUrl(testArticleId))
  this.mwCapabilities.veApiAvailable = await this.checkApiAvailabilty(this.mw.getVeApiArticleUrl(testArticleId))
  this.mwCapabilities.apiAvailable = await this.checkApiAvailabilty(this.mw.apiUrl.href)
  // Coordinate fetching
  const reqOpts = objToQueryString({
    ...this.getArticleQueryOpts(),
  })
  const resp = await this.getJSON<MwApiResponse>(`${this.mw.apiUrl.href}${reqOpts}`)
  const isCoordinateWarning = resp.warnings && resp.warnings.query && (resp.warnings.query['*'] || '').includes('coordinates')
  if (isCoordinateWarning) {
    logger.info('Coordinates not available on this wiki')
    this.mwCapabilities.coordinatesAvailable = false
  }
}

The default value MediaWiki:Sidebar is never used because the value of mwMetaData.mainPage is passed:

await downloader.checkCapabilities(mwMetaData.mainPage)

The value of mwMetaData.mainPage comes from the API: the base URL is stripped down to its last path segment. (This is a bad idea because different wikis use different URL rewrites.)

mwoffliner/src/MediaWiki.ts

Lines 290 to 325 in e9d4113

public async getMwMetaData(downloader: Downloader): Promise<MWMetaData> {
  if (this.metaData) {
    return this.metaData
  }
  const creator = this.getCreatorName() || 'Kiwix'
  const [textDir, { langIso2, langIso3, mainPage, siteName }, subTitle] = await Promise.all([
    this.getTextDirection(downloader),
    this.getSiteInfo(downloader),
    this.getSubTitle(downloader),
  ])
  const mwMetaData: MWMetaData = {
    webUrl: this.webUrl.href,
    apiUrl: this.apiUrl.href,
    modulePath: this.modulePath,
    webUrlPath: this.webUrl.pathname,
    wikiPath: this.wikiPath,
    baseUrl: this.baseUrl.href,
    apiPath: this.apiPath,
    domain: this.domain,
    textDir: textDir as TextDirection,
    langIso2,
    langIso3,
    title: siteName,
    subTitle,
    creator,
    mainPage,
  }
  this.metaData = mwMetaData
  return mwMetaData
}

mwoffliner/src/MediaWiki.ts

Lines 235 to 279 in e9d4113

public async getSiteInfo(downloader: Downloader) {
  logger.log('Getting site info...')
  const query = 'action=query&meta=siteinfo&format=json&siprop=general|namespaces|statistics|variables|category|wikidesc'
  const body = await downloader.query(query)
  const entries = body.query.general
  // Checking mediawiki version
  const mwVersion = semver.coerce(entries.generator).raw
  const mwMinimalVersion = 1.27
  if (!entries.generator || !semver.satisfies(mwVersion, `>=${mwMinimalVersion}`)) {
    throw new Error(`Mediawiki version ${mwVersion} not supported should be >=${mwMinimalVersion}`)
  }
  // Base will contain the default encoded article id for the wiki.
  const mainPage = decodeURIComponent(entries.base.split('/').pop())
  const siteName = entries.sitename
  const langs: string[] = [entries.lang].concat(entries.fallback.map((e: any) => e.code))
  const [langIso2, langIso3] = await Promise.all(
    langs.map(async (lang: string) => {
      let langIso3
      try {
        langIso3 = await util.getIso3(lang)
      } catch (err) {
        langIso3 = lang
      }
      try {
        return [lang, langIso3]
      } catch (err) {
        return false
      }
    }),
  ).then((possibleLangPairs) => {
    possibleLangPairs = possibleLangPairs.filter((a) => a)
    return possibleLangPairs[0] || ['en', 'eng']
  })
  return {
    mainPage,
    siteName,
    langIso2,
    langIso3,
  }
}

This works for many wikis like English Wikipedia, but not for Chinese Minecraft Wiki. The reason is that MCW-zh uses a URL rewrite:

// Wikipedia-en
"base": "https://en.wikipedia.org/wiki/Main_Page",
// MCW-zh
"base": "https://zh.minecraft.wiki/",

There are two ways to fix this:

  1. ⭐ Set mwMetaData.mainPage to entries.mainpage, which is already included in the API result. (MediaWiki documentation)
    -const mainPage = decodeURIComponent(entries.base.split('/').pop())
    +const mainPage = entries.mainpage
  2. Use the default parameter for Downloader.checkCapabilities:
    -await downloader.checkCapabilities(mwMetaData.mainPage)
    +await downloader.checkCapabilities()

I have tested both, and both worked.

TripleCamera changed the title from "[Needs Testing] Fail to locate main page for wikis with URL rewrites" to "[Needs Testing] Unable to find appropriate API end-point for wikis with main page URL rewritten" on Feb 14, 2024
TripleCamera (Author) commented

The following description is mostly taken from my comment.


In v1.13.0 (I will test git main later), MWoffliner accepts three different APIs:

  • Mobile REST API: Only available in Wikimedia REST API.

  • Desktop REST API: Available in both Wikimedia REST API and MediaWiki REST API. However, MediaWiki REST API cannot be used without modifying the code.

    In MWoffliner, the URL pattern is hardcoded so that the page title can only come last. I tried modifying the code, and it seemed to succeed at first (it failed later ☹️, but it looks promising).
    [Screenshot 2024-02-14 215918]

    Besides, @xtexChooser inspired me to try the Parsoid API, whose URL is /rest.php/{domain}/v3/page/html/{title}. However, this would be redirected to /rest.php/{domain}/v3/page/html/{title}/{latest_revision}. Since the response code is 302, not 200, it is regarded as inaccessible (a sketch of such a status check follows this list).

  • VisualEditor API: Available in both Wikimedia REST API and MediaWiki REST API. Minecraft Wiki (zh) is supposed to be scraped in this way. However, it cannot work now because of the bug mentioned above.
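
For context, the availability check amounts to an HTTP request that only treats a literal 200 as success; because the redirect is not followed through, the Parsoid endpoint's 302 fails the check. A minimal sketch using the Fetch API (mwoffliner's actual checkApiAvailabilty may differ in detail):

// Sketch of an availability probe: only a 200 response counts as available.
// A 302 (e.g. Parsoid redirecting /page/html/{title} to .../{latest_revision})
// is therefore reported as unavailable.
async function isApiAvailable(url: string): Promise<boolean> {
  try {
    const resp = await fetch(url, { redirect: 'manual' }) // do not follow redirects
    return resp.status === 200
  } catch {
    return false
  }
}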

TripleCamera (Author) commented

I am currently testing git main.

@kelson42 switched to another scraper running git main. However, it failed because the arguments differ between v1.13.0 and git main. To fix this:

  1. Unset --mwApiPath
  2. Set --mwActionApiPath="api.php" (NO LEADING SLASH)

The next issue I encountered after fixing this was:

[error] [2024-02-20T03:24:45.973Z] Failed to run mwoffliner after [65s]: {
	"stack": "TypeError: articleListLines is not iterable\n    at createMainPage (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:429:37)\n    at getMainPage (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:466:54)\n    at doDump (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:308:15)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Module.execute (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:261:17)",
	"message": "articleListLines is not iterable"
}

I modified mwoffliner.lib.js to print out articleListLines:

[log] [2024-02-20T03:42:06.004Z] articleListLines = undefined

TripleCamera changed the title from "[Needs Testing] Unable to find appropriate API end-point for wikis with main page URL rewritten" to "Troubleshooting minecraftwiki_zh_all recipe" on Feb 20, 2024

TripleCamera commented Feb 26, 2024

Finally, I found the cause of the issue: it is the same as before.

mwoffliner/src/MediaWiki.ts

Lines 413 to 428 in ad5dc1d

public async getSiteInfo(downloader: Downloader) {
  logger.log('Getting site info...')
  const body = await downloader.query()
  const entries = body.query.general
  // Checking mediawiki version
  const mwVersion = semver.coerce(entries.generator).raw
  const mwMinimalVersion = 1.27
  if (!entries.generator || !semver.satisfies(mwVersion, `>=${mwMinimalVersion}`)) {
    throw new Error(`Mediawiki version ${mwVersion} not supported should be >=${mwMinimalVersion}`)
  }
  // Base will contain the default encoded article id for the wiki.
  const mainPage = decodeURIComponent(entries.base.split('/').pop())
  const siteName = entries.sitename

Since the logic for retrieving the main page remains unchanged, we still have to modify the code to make it work.

// Sanitizing main page
let mainPage = articleList ? '' : mwMetaData.mainPage

return mainPage ? createMainPageRedirect() : createMainPage()

In regular cases:

  • When --articleList is set, mainPage is set to empty, then createMainPage() is called, which reads the value of articleList.
  • When --articleList is not set, mainPage is set to mwMetaData.mainPage, then createMainPageRedirect() is called.

However, in this situation mwMetaData.mainPage is empty, so createMainPage() is called, which leads to the error mentioned above.
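
Put together, a hedged reduction of the dispatch (names follow the snippets above; the declare stubs stand in for the real mwoffliner values and functions):

// Hypothetical reduction of the main-page dispatch, not the full source.
declare const articleList: string | undefined   // --articleList (not set here)
declare const mwMetaData: { mainPage: string }  // mainPage === '' due to the bug
declare function createMainPageRedirect(): void
declare function createMainPage(): void         // iterates articleListLines

const mainPage = articleList ? '' : mwMetaData.mainPage
// With mwMetaData.mainPage === '', mainPage is falsy even though --articleList
// is not set, so createMainPage() runs, iterates articleListLines (undefined),
// and throws "TypeError: articleListLines is not iterable".
mainPage ? createMainPageRedirect() : createMainPage()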

@kelson42 Could you please create a pull request? (The solution is at the end of my first comment.)


Update: Checking API capabilities is no longer a problem in git main, since MediaWiki:Sidebar is always used:

this.apiCheckArticleId = 'MediaWiki:Sidebar'

kelson42 (Collaborator) commented

@TripleCamera Thank you! I will have a look at your analysis in the next few days.


TripleCamera commented Mar 3, 2024

> @TripleCamera Thank you! I will have a look at your analysis in the next few days.

@kelson42 How is everything going?

TripleCamera (Author) commented

I fixed the main page issue and started a scrape on my machine. Two problems arose:

  1. Failed to retrieve "资源包/Folders", the longest page on this wiki. However, later tests showed that the second-longest page ("Minecraft Dungeons:API") could be retrieved. See Special:LongPages.

    So we need to exclude "资源包/Folders" (a repro sketch follows this list).

    Error log:

    [info] [2024-03-07T03:17:02.060Z] Getting article [资源包/Folders] from https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=%E8%B5%84%E6%BA%90%E5%8C%85%2FFolders
    [info] [2024-03-07T03:17:02.061Z] Getting JSON from [https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=%E8%B5%84%E6%BA%90%E5%8C%85%2FFolders]
    [error] [2024-03-07T03:17:04.205Z] Error downloading article 资源包/Folders
    

    API result:

    {
        "error": {
            "code": "visualeditor-docserver-http",
            "info": "Error contacting the Parsoid/RESTBase server (HTTP 500): (no message)",
            "docref": "See https://zh.minecraft.wiki/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes."
        },
        "servedby": "mediawiki-6ff94dc64-5tmqz"
    }
    
  2. After my scrape failed, someone told me that both the API and the site had become slow for a while. I suspected that the scraper was too fast, so I checked the history of the minecraftwiki_zh_all recipe and found that the --speed argument was initially set to "0.1" but was later removed. I will add the --speed argument and try again.
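
For reference, the failure in item 1 can be reproduced directly against the VisualEditor API (a hedged sketch; the URL is the same one as in the log above, and the error shape matches the API result shown there):

// Repro sketch: request 资源包/Folders through the VisualEditor API and
// inspect the error body the wiki returns (HTTP 500 from Parsoid/RESTBase).
const url =
  'https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html' +
  '&format=json&paction=parse&formatversion=2&page=' +
  encodeURIComponent('资源包/Folders')

const body = await (await fetch(url)).json()
if (body.error?.code === 'visualeditor-docserver-http') {
  console.error('Parsoid/RESTBase failed for this page:', body.error.info)
}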


kelson42 commented Mar 7, 2024

@TripleCamera Sorry for not coming back to you earlier; it's not lack of interest, but lack of time. I plan to look at your ticket in detail this weekend.

TripleCamera (Author) commented

> @TripleCamera Sorry for not coming back to you earlier; it's not lack of interest, but lack of time. I plan to look at your ticket in detail this weekend.

Thank you! After fixing the issues mentioned above, the scraper ran smoothly. However, I had to stop it because I don't have a lot of time either. It was estimated to finish in about 5 hours (using the config below).

Here is a list of things I have done so far (an example invocation follows the list):

  • Fix the main page issue in the code (See my first comment)
  • Unset --mwApiPath
  • Set --mwActionApiPath="api.php" (NO LEADING SLASH)
  • Set --articleListToIgnore="资源包/Folders"
  • Set --speed to an appropriate value (I used 0.5 and did not notice a significant change in page load time)
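
For reference, here is how these settings look through mwoffliner's programmatic entry point (a hedged sketch: option names mirror the CLI flags, and the mwUrl/adminEmail values are placeholders; check the mwoffliner README for the exact option types):

// Hedged sketch of the recipe settings via mwoffliner's programmatic API.
// mwApiPath is deliberately left unset, as described above.
import * as mwoffliner from 'mwoffliner'

mwoffliner
  .execute({
    mwUrl: 'https://zh.minecraft.wiki/',    // target wiki
    adminEmail: 'user@example.com',         // placeholder contact address
    mwActionApiPath: 'api.php',             // no leading slash
    articleListToIgnore: '资源包/Folders',  // skip the page the VE API 500s on
    speed: 0.5,                             // throttle so the wiki stays responsive
  })
  .catch(console.error)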

Could you please apply these changes and relaunch the scraper? From here on, I have to rely on openZIM's scraper.


winstonsung commented Mar 28, 2024

Any progress so far?

@kelson42


TripleCamera commented Apr 9, 2024

Great, Kelson is back. It seems that this task can move forward a little bit more. 😊


Update: @kelson42 Hello?


TripleCamera commented Apr 22, 2024

@kelson42 Hi. Have you been busy recently? Maybe you can assign this task to your colleagues (if they are free).


TripleCamera commented May 11, 2024

Hi. I just created a pull request that contains the patch. Can someone review and merge it? @kelson42
