Troubleshooting minecraftwiki_zh_all recipe #1995

Open · TripleCamera opened this issue Feb 14, 2024 · 12 comments

TripleCamera commented Feb 14, 2024

Note: This has only been tested on MWoffliner v1.13.0 (since all openZIM scrapers use this version). Both the code and the config differ a lot between v1.13.0 and git main, so this still needs to be tested on git main.

The following description is mostly taken from my comment when troubleshooting the scrape for Minecraft Wiki (zh) (openzim/zim-requests#755).


The scraper reports Unable to find appropriate API end-point to retrieve article HTML when scraping Minecraft Wiki (zh). Here is a code analysis of MWoffliner v1.13.0.

Before the scrape starts, MWoffliner checks mobile REST API, desktop REST API, and VisualEditor API capabilities for a specific page (parameter testArticleId) in Downloader.checkCapabilities:

public async checkCapabilities(testArticleId = 'MediaWiki:Sidebar'): Promise<void> {
  // By default check all API's responses and set the capabilities
  // accordingly. We need to set a default page (always there because
  // installed per default) to request the REST API, otherwise it would
  // fail the check.
  this.mwCapabilities.mobileRestApiAvailable = await this.checkApiAvailabilty(this.mw.getMobileRestApiArticleUrl(testArticleId))
  this.mwCapabilities.desktopRestApiAvailable = await this.checkApiAvailabilty(this.mw.getDesktopRestApiArticleUrl(testArticleId))
  this.mwCapabilities.veApiAvailable = await this.checkApiAvailabilty(this.mw.getVeApiArticleUrl(testArticleId))
  this.mwCapabilities.apiAvailable = await this.checkApiAvailabilty(this.mw.apiUrl.href)
  // Coordinate fetching
  const reqOpts = objToQueryString({
    ...this.getArticleQueryOpts(),
  })
  const resp = await this.getJSON<MwApiResponse>(`${this.mw.apiUrl.href}${reqOpts}`)
  const isCoordinateWarning = resp.warnings && resp.warnings.query && (resp.warnings.query['*'] || '').includes('coordinates')
  if (isCoordinateWarning) {
    logger.info('Coordinates not available on this wiki')
    this.mwCapabilities.coordinatesAvailable = false
  }
}

The default value MediaWiki:Sidebar is never used because the value of mwMetaData.mainPage is passed:

await downloader.checkCapabilities(mwMetaData.mainPage)

The value of mwMetaData.mainPage comes from the API: the base URL is stripped down to its last path segment. (This is a bad idea because different wikis use different URL rewrites.)

mwoffliner/src/MediaWiki.ts

Lines 290 to 325 in e9d4113

public async getMwMetaData(downloader: Downloader): Promise<MWMetaData> {
  if (this.metaData) {
    return this.metaData
  }
  const creator = this.getCreatorName() || 'Kiwix'
  const [textDir, { langIso2, langIso3, mainPage, siteName }, subTitle] = await Promise.all([
    this.getTextDirection(downloader),
    this.getSiteInfo(downloader),
    this.getSubTitle(downloader),
  ])
  const mwMetaData: MWMetaData = {
    webUrl: this.webUrl.href,
    apiUrl: this.apiUrl.href,
    modulePath: this.modulePath,
    webUrlPath: this.webUrl.pathname,
    wikiPath: this.wikiPath,
    baseUrl: this.baseUrl.href,
    apiPath: this.apiPath,
    domain: this.domain,
    textDir: textDir as TextDirection,
    langIso2,
    langIso3,
    title: siteName,
    subTitle,
    creator,
    mainPage,
  }
  this.metaData = mwMetaData
  return mwMetaData
}

mwoffliner/src/MediaWiki.ts

Lines 235 to 279 in e9d4113

public async getSiteInfo(downloader: Downloader) {
  logger.log('Getting site info...')
  const query = 'action=query&meta=siteinfo&format=json&siprop=general|namespaces|statistics|variables|category|wikidesc'
  const body = await downloader.query(query)
  const entries = body.query.general
  // Checking mediawiki version
  const mwVersion = semver.coerce(entries.generator).raw
  const mwMinimalVersion = 1.27
  if (!entries.generator || !semver.satisfies(mwVersion, `>=${mwMinimalVersion}`)) {
    throw new Error(`Mediawiki version ${mwVersion} not supported should be >=${mwMinimalVersion}`)
  }
  // Base will contain the default encoded article id for the wiki.
  const mainPage = decodeURIComponent(entries.base.split('/').pop())
  const siteName = entries.sitename
  const langs: string[] = [entries.lang].concat(entries.fallback.map((e: any) => e.code))
  const [langIso2, langIso3] = await Promise.all(
    langs.map(async (lang: string) => {
      let langIso3
      try {
        langIso3 = await util.getIso3(lang)
      } catch (err) {
        langIso3 = lang
      }
      try {
        return [lang, langIso3]
      } catch (err) {
        return false
      }
    }),
  ).then((possibleLangPairs) => {
    possibleLangPairs = possibleLangPairs.filter((a) => a)
    return possibleLangPairs[0] || ['en', 'eng']
  })
  return {
    mainPage,
    siteName,
    langIso2,
    langIso3,
  }
}

This works for many wikis like English Wikipedia, but not for Chinese Minecraft Wiki. The reason is that MCW-zh uses a URL rewrite:

// Wikipedia-en
"base": "https://en.wikipedia.org/wiki/Main_Page",
// MCW-zh
"base": "https://zh.minecraft.wiki/",

There are two ways to fix this:

  1. ⭐ Set mwMetaData.mainPage to entries.mainpage, which is already included in the API result. (MediaWiki documentation)
    -const mainPage = decodeURIComponent(entries.base.split('/').pop())
    +const mainPage = entries.mainpage
  2. Use the default parameter for Downloader.checkCapabilities:
    -await downloader.checkCapabilities(mwMetaData.mainPage)
    +await downloader.checkCapabilities()

I have tested both, and both worked.

TripleCamera changed the title from "[Needs Testing] Fail to locate main page for wikis with URL rewrites" to "[Needs Testing] Unable to find appropriate API end-point for wikis with main page URL rewritten" on Feb 14, 2024
TripleCamera (Author) commented

The following description is mostly taken from my comment.


In v1.13.0 (I will test git main later), MWoffliner accepts three different APIs:

  • Mobile REST API: Only available in Wikimedia REST API.

  • Desktop REST API: Available in both Wikimedia REST API and MediaWiki REST API. However, MediaWiki REST API cannot be used without modifying the code.

    In MWoffliner, the URL pattern is hardcoded so that the page title can only come last. I tried modifying the code, and it seemed to succeed at first (it failed later ☹️, but it looks promising).
    [Screenshot 2024-02-14 215918]

    Besides, @xtexChooser inspired me to try the Parsoid API, whose URL is /rest.php/{domain}/v3/page/html/{title}. However, this would be redirected to /rest.php/{domain}/v3/page/html/{title}/{latest_revision}. Since the response code is 302, not 200, it is regarded as inaccessible (a sketch of such a status check follows this list).

  • VisualEditor API: Available in both Wikimedia REST API and MediaWiki REST API. Minecraft Wiki (zh) is supposed to be scraped in this way. However, it cannot work now because of the bug mentioned above.
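
For context, the availability check amounts to an HTTP request that only treats a literal 200 as success; because the redirect is not followed through, the Parsoid endpoint's 302 fails the check. A minimal sketch using the Fetch API (mwoffliner's actual checkApiAvailabilty may differ in detail):

// Sketch of an availability probe: only a 200 response counts as available.
// A 302 (e.g. Parsoid redirecting /page/html/{title} to .../{latest_revision})
// is therefore reported as unavailable.
async function isApiAvailable(url: string): Promise<boolean> {
  try {
    const resp = await fetch(url, { redirect: 'manual' }) // do not follow redirects
    return resp.status === 200
  } catch {
    return false
  }
}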

TripleCamera (Author) commented

I am currently testing git main.

@kelson42 switched to another scraper running git main. However, it failed because the arguments differ between v1.13.0 and git main. To fix this:

  1. Unset --mwApiPath
  2. Set --mwActionApiPath="api.php" (NO LEADING SLASH)

The next issue I encountered after fixing this was:

[error] [2024-02-20T03:24:45.973Z] Failed to run mwoffliner after [65s]: {
	"stack": "TypeError: articleListLines is not iterable\n    at createMainPage (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:429:37)\n    at getMainPage (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:466:54)\n    at doDump (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:308:15)\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async Module.execute (file:///home/co-eda/mwoffliner-git/mwoffliner_main/lib/mwoffliner.lib.js:261:17)",
	"message": "articleListLines is not iterable"
}

I modified mwoffliner.lib.js to print out articleListLines:

[log] [2024-02-20T03:42:06.004Z] articleListLines = undefined

TripleCamera changed the title from "[Needs Testing] Unable to find appropriate API end-point for wikis with main page URL rewritten" to "Troubleshooting minecraftwiki_zh_all recipe" on Feb 20, 2024

TripleCamera commented Feb 26, 2024

Finally, I found the cause of the issue: it is the same as before.

mwoffliner/src/MediaWiki.ts

Lines 413 to 428 in ad5dc1d

public async getSiteInfo(downloader: Downloader) {
  logger.log('Getting site info...')
  const body = await downloader.query()
  const entries = body.query.general
  // Checking mediawiki version
  const mwVersion = semver.coerce(entries.generator).raw
  const mwMinimalVersion = 1.27
  if (!entries.generator || !semver.satisfies(mwVersion, `>=${mwMinimalVersion}`)) {
    throw new Error(`Mediawiki version ${mwVersion} not supported should be >=${mwMinimalVersion}`)
  }
  // Base will contain the default encoded article id for the wiki.
  const mainPage = decodeURIComponent(entries.base.split('/').pop())
  const siteName = entries.sitename

Since the logic for retrieving the main page remains unchanged, we still have to modify the code to make it work.

// Sanitizing main page
let mainPage = articleList ? '' : mwMetaData.mainPage

return mainPage ? createMainPageRedirect() : createMainPage()

In regular cases:

  • When --articleList is set, mainPage is set to empty, then createMainPage() is called, which reads the value of articleList.
  • When --articleList is not set, mainPage is set to mwMetaData.mainPage, then createMainPageRedirect() is called.

However, in this situation mwMetaData.mainPage is empty, so createMainPage() is called, which leads to the error mentioned above.
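
Put together, a hedged reduction of the dispatch (names follow the snippets above; the declare stubs stand in for the real mwoffliner values and functions):

// Hypothetical reduction of the main-page dispatch, not the full source.
declare const articleList: string | undefined   // --articleList (not set here)
declare const mwMetaData: { mainPage: string }  // mainPage === '' due to the bug
declare function createMainPageRedirect(): void
declare function createMainPage(): void         // iterates articleListLines

const mainPage = articleList ? '' : mwMetaData.mainPage
// With mwMetaData.mainPage === '', mainPage is falsy even though --articleList
// is not set, so createMainPage() runs, iterates articleListLines (undefined),
// and throws "TypeError: articleListLines is not iterable".
mainPage ? createMainPageRedirect() : createMainPage()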

@kelson42 Could you please create a pull request? (The solution is at the end of my first comment.)


Update: Checking API capabilities is no longer a problem in git main, since MediaWiki:Sidebar is always used:

this.apiCheckArticleId = 'MediaWiki:Sidebar'

kelson42 (Collaborator) commented

@TripleCamera Thank you! I will have a look at your analysis in the next few days.


TripleCamera commented Mar 3, 2024

> @TripleCamera Thank you! I will have a look at your analysis in the next few days.

@kelson42 How is everything going?

TripleCamera (Author) commented

I fixed the main page issue and started a scrape on my machine. Two problems arose:

  1. Failed to retrieve "资源包/Folders", the longest page on this wiki. However, later tests showed that the second-longest page ("Minecraft Dungeons:API") could be retrieved. See Special:LongPages.

    So we need to exclude "资源包/Folders" (a repro sketch follows this list).

    Error log:

    [info] [2024-03-07T03:17:02.060Z] Getting article [资源包/Folders] from https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=%E8%B5%84%E6%BA%90%E5%8C%85%2FFolders
    [info] [2024-03-07T03:17:02.061Z] Getting JSON from [https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html&format=json&paction=parse&formatversion=2&page=%E8%B5%84%E6%BA%90%E5%8C%85%2FFolders]
    [error] [2024-03-07T03:17:04.205Z] Error downloading article 资源包/Folders
    

    API result:

    {
        "error": {
            "code": "visualeditor-docserver-http",
            "info": "Error contacting the Parsoid/RESTBase server (HTTP 500): (no message)",
            "docref": "See https://zh.minecraft.wiki/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at &lt;https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/&gt; for notice of API deprecations and breaking changes."
        },
        "servedby": "mediawiki-6ff94dc64-5tmqz"
    }
    
  2. After my scrape failed, someone told me that both the API and the site had become slow for a while. I suspected that the scraper was too fast, so I checked the history of the minecraftwiki_zh_all recipe and found that the --speed argument was initially set to "0.1" but was later removed. I will add the --speed argument and try again.
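
For reference, the failure in item 1 can be reproduced directly against the VisualEditor API (a hedged sketch; the URL is the same one as in the log above, and the error shape matches the API result shown there):

// Repro sketch: request 资源包/Folders through the VisualEditor API and
// inspect the error body the wiki returns (HTTP 500 from Parsoid/RESTBase).
const url =
  'https://zh.minecraft.wiki/api.php?action=visualeditor&mobileformat=html' +
  '&format=json&paction=parse&formatversion=2&page=' +
  encodeURIComponent('资源包/Folders')

const body = await (await fetch(url)).json()
if (body.error?.code === 'visualeditor-docserver-http') {
  console.error('Parsoid/RESTBase failed for this page:', body.error.info)
}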


kelson42 commented Mar 7, 2024

@TripleCamera Sorry for not coming back to you earlier; it's not lack of interest, but lack of time. I plan to look at your ticket in detail this weekend.

TripleCamera (Author) commented

> @TripleCamera Sorry for not coming back to you earlier; it's not lack of interest, but lack of time. I plan to look at your ticket in detail this weekend.

Thank you! After fixing the issues mentioned above, the scraper ran smoothly. However, I had to stop it because I don't have a lot of time either. It was estimated to finish in about 5 hours (using the config below).

Here is a list of things I have done so far (an example invocation follows the list):

  • Fix the main page issue in the code (See my first comment)
  • Unset --mwApiPath
  • Set --mwActionApiPath="api.php" (NO LEADING SLASH)
  • Set --articleListToIgnore="资源包/Folders"
  • Set --speed to an appropriate value (I used 0.5 and did not notice a significant change in page load time)
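
For reference, here is how these settings look through mwoffliner's programmatic entry point (a hedged sketch: option names mirror the CLI flags, and the mwUrl/adminEmail values are placeholders; check the mwoffliner README for the exact option types):

// Hedged sketch of the recipe settings via mwoffliner's programmatic API.
// mwApiPath is deliberately left unset, as described above.
import * as mwoffliner from 'mwoffliner'

mwoffliner
  .execute({
    mwUrl: 'https://zh.minecraft.wiki/',    // target wiki
    adminEmail: 'user@example.com',         // placeholder contact address
    mwActionApiPath: 'api.php',             // no leading slash
    articleListToIgnore: '资源包/Folders',  // skip the page the VE API 500s on
    speed: 0.5,                             // throttle so the wiki stays responsive
  })
  .catch(console.error)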

Could you please apply these changes and relaunch the scraper? From here on, I have to rely on openZIM's scraper.


winstonsung commented Mar 28, 2024

Any progress so far?

@kelson42


TripleCamera commented Apr 9, 2024

Great, Kelson is back. It seems that this task can move forward a little bit more. 😊


Update: @kelson42 Hello?


TripleCamera commented Apr 22, 2024

@kelson42 Hi. Have you been busy recently? Maybe you can assign this task to your colleagues (if they are free).


TripleCamera commented May 11, 2024

Hi. I just created a pull request that contains the patch. Can someone review and merge it? @kelson42
