getIncoming() crashes for some pages #517

Open
chris-gassner opened this issue Feb 15, 2023 · 1 comment

@chris-gassner

I'm trying to fetch incoming links for pages and some docs cause a crash when calling getIncoming().

Trying to fetch incoming links for the article 'Europe' fails with:

=-=- http response error =-=-=-
https://en.wikipedia.org/w/api.php?action=query&lhnamespace=0&prop=linkshere&lhshow=!redirect&lhlimit=500&format=json&origin=*&redirects=true&titles=Europe&lhcontinue=566556
FetchError: invalid json response body at https://en.wikipedia.org/w/api.php?action=query&lhnamespace=0&prop=linkshere&lhshow=!redirect&lhlimit=500&format=json&origin=*&redirects=true&titles=Europe&lhcontinue=566556 reason: Unexpected token < in JSON at position 0
    at X:\node-projects\wiki\node_modules\node-fetch\lib\index.js:273:32
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async getIncoming (X:\node-projects\wiki\node_modules\wtf-plugin-api\builds\wtf-plugin-api.cjs:110:31)
    at async X:\node-projects\wiki\index.js:385:22 {
  type: 'invalid-json'
}

whereas getIncoming() works for 'Javascript' or 'Briefcase', for example.
I'm guessing this is probably related to the number of incoming links. The Europe article has 86,136 direct links according to https://linkcount.toolforge.org/?project=en.wikipedia.org&page=Europe&namespaces=
The article Python (programming language) has 9,467 links according to https://linkcount.toolforge.org/?project=en.wikipedia.org&page=Python%20(programming%20language)&namespaces= but I only get back 3,718 pageids when calling getIncoming().

Not a big deal, just thought I'd let you know though.
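
For anyone hitting the same trace: `Unexpected token < in JSON at position 0` means the body handed to the JSON parser began with `<`, which usually points to an HTML or XML error page rather than JSON. A minimal sketch of a more descriptive failure follows. It assumes a fetch-style response object with `status` and `text()`. `parseApiJson` is a hypothetical helper name and not part of wtf-plugin-api or node-fetch.

```javascript
// Hypothetical helper (not part of wtf-plugin-api or node-fetch).
// `res` is assumed to be a fetch-style Response exposing `status` and `text()`.
async function parseApiJson(res) {
  // Read the raw body first so it is still available if parsing fails.
  const body = await res.text()
  try {
    return JSON.parse(body)
  } catch (err) {
    // Report what actually came back instead of a bare JSON syntax error.
    const kind = body.trimStart().startsWith('<') ? 'HTML/XML' : 'non-JSON'
    const snippet = body.slice(0, 80).replace(/\s+/g, ' ')
    throw new Error(`Expected JSON but got ${kind} (HTTP ${res.status}): ${snippet}`)
  }
}
```

With this in place, the error message would carry the HTTP status and the first characters of the body, which should make it easier to tell a timeout page from a rate-limit page.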

@spencermountain
Owner

hey Christoph, thanks for the good issue.
Yeah - I think you're right about a timeout for some pages. The api plugin loops around and fetches things 500 at a time.

I looked into the Python example - the getIncoming method is only returning pages that are wikipedia articles (namespace 0) and not other wikipedia internal stuff. I think the Python discrepancy is from User talk pages - haha, people are using this template on their profile pages.

Please let me know if you can track down other cases with missing articles. The Europe case needs some thinking. Maybe we could try lowering the limit from 500. The code is here if anyone is interested.
cheers
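
The idea of lowering the limit could be sketched roughly as below. This is not the plugin's actual code: `getIncomingPaged`, its `opts`, and the injected `fetchJson(url)` function are hypothetical names for illustration. The query parameters are copied from the failing URL in the report. The sketch pages through `linkshere` results using the `lhcontinue` token and halves `lhlimit` whenever a request fails, giving up once a floor is reached.

```javascript
// Hypothetical sketch, not the wtf-plugin-api implementation.
// `fetchJson` is injected: (url) => Promise resolving to the parsed JSON body.
async function getIncomingPaged(title, fetchJson, opts = {}) {
  const minLimit = opts.minLimit || 50 // assumed floor, chosen for illustration
  let limit = opts.limit || 500
  let cont = null
  const links = []
  while (true) {
    // Same parameters as the request in the report, with a variable lhlimit.
    const params = new URLSearchParams({
      action: 'query',
      prop: 'linkshere',
      lhnamespace: '0',
      lhshow: '!redirect',
      lhlimit: String(limit),
      format: 'json',
      origin: '*',
      redirects: 'true',
      titles: title,
    })
    if (cont !== null) params.set('lhcontinue', cont)
    const url = 'https://en.wikipedia.org/w/api.php?' + params.toString()
    let data
    try {
      data = await fetchJson(url)
    } catch (err) {
      // Retry the same page of results with a smaller batch size.
      if (limit <= minLimit) throw err
      limit = Math.max(minLimit, Math.floor(limit / 2))
      continue
    }
    const pages = (data.query && data.query.pages) || {}
    for (const page of Object.values(pages)) {
      links.push(...(page.linkshere || []))
    }
    // The API signals more results through continue.lhcontinue.
    if (!data.continue || data.continue.lhcontinue === undefined) break
    cont = String(data.continue.lhcontinue)
  }
  return links
}
```

Whether a smaller batch actually avoids the Europe failure is untested here. If the cause is rate limiting rather than response size, a delay between requests might matter more than the batch size.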
