Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Empty response (JSONDecodeError) when sending many requests in a row #43

Open
marccarre opened this issue Jul 17, 2022 · 3 comments · May be fixed by #44
Open

Empty response (JSONDecodeError) when sending many requests in a row #43

marccarre opened this issue Jul 17, 2022 · 3 comments · May be fixed by #44

Comments

@marccarre
Copy link

Version

0.7.0

Problem

When sending several (10~100) requests in a row, some requests fail, without determinism, with the following error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Upon closer investigation, the actual response is a 429,

  • with "reason": Too many requests. Please comply with the User-Agent policy to get a higher rate limit: https://meta.wikimedia.org/wiki/User-Agent_policy, and
  • with the following "body":
<!DOCTYPE html>
<html lang="en">
<meta charset="utf-8">
<title>Wikimedia Error</title>
<style>
	* { margin: 0; padding: 0; }
body { background: #fff; font: 15px/1.6 sans-serif; color: #333; }
.content { margin: 7% auto 0; padding: 2em 1em 1em; max-width: 640px; }
.footer { clear: both; margin-top: 14%; border-top: 1px solid #e5e5e5; background: #f9f9f9; padding: 2em 0; font-size: 0.8em; text-align: center; }
img { float: left; margin: 0 2em 2em 0; }
a img { border: 0; }
h1 { margin-top: 1em; font-size: 1.2em; }
.content-text { overflow: hidden; overflow-wrap: break-word; word-wrap: break-word; -webkit-hyphens: auto; -moz-hyphens: auto; -ms-hyphens: auto; hyphens: auto; }
p { margin: 0.7em 0 1em 0; }
a { color: #0645ad; text-decoration: none; }
a:hover { text-decoration: underline; }
code { font-family: sans-serif; }
.text-muted { color: #777; }
</style>
<div class="content" role="main">
	<a href="https://www.wikimedia.org"><img src="https://www.wikimedia.org/static/images/wmf-logo.png" srcset="https://www.wikimedia.org/static/images/wmf-logo-2x.png 2x" alt="Wikimedia" width="135" height="101">
</a>
<h1>Error</h1>
<div class="content-text">
	<p>Our servers are currently under maintenance or experiencing a technical problem.

	Please <a href="" title="Reload this page" onclick="window.location.reload(false); return false">try again</a> in a few&nbsp;minutes.</p>

<p>See the error message at the bottom of this page for more&nbsp;information.</p>
</div>
</div>
<div class="footer"><p>If you report this error to the Wikimedia System Administrators, please include the details below.</p><p class="text-muted"><code>Request from 122.216.10.145 via cp5012 cp5012, Varnish XID 477962109<br>Upstream caches: cp5012 int<br>Error: 429, Too many requests. Please comply with the User-Agent policy to get a higher rate limit: https://meta.wikimedia.org/wiki/User-Agent_policy at Sun, 17 Jul 2022 22:28:20 GMT</code></p>
</div>
</html>

Root cause

This library doesn't follow Wikimedia's user-agent policy, specifically:

<client name>/<version> (<contact information>) <library/framework name>/<version> [<library name>/<version> ...]. Parts that are not applicable can be omitted.

which leads in a temporary rate limiting/blacklisting of the agent:

Requests from disallowed user agents may instead encounter a less helpful error message like this:
Our servers are currently experiencing a technical problem. Please try again in a few minutes.

See also: https://meta.wikimedia.org/wiki/User-Agent_policy

Solution

Set an User-Agent header compliant with the above policy, e.g.:

>>> import urllib
>>> od = urllib.request.OpenerDirector()
>>> od.addheaders 
[('User-agent', 'Python-urllib/3.9')]
>>> 
>>> import wikidata
>>> wikidata.__version__
'0.7.0'
>>> 
>>> import sys
>>> od.addheaders = { 
...     "Accept": "application/sparql-results+json",
...     "User-Agent": "wikidata-based-bot/%s (https://github.com/dahlia/wikidata ; hong.minhee@gmail.com) python/%s.%s.%s Wikidata/%s" % (wikidata.__version__, sys.version_info.major, sys.version_info.minor, sys.version_info.micro, wikidata.__version__),
... }
>>> 
>>> od.addheaders 
{'Accept': 'application/sparql-results+json', 'User-Agent': 'wikidata-based-bot/0.7.0 (https://github.com/dahlia/wikidata ; hong.minhee@gmail.com) python/3.9.13 Wikidata/0.7.0'}
@dahlia
Copy link
Owner

dahlia commented Jul 18, 2022

By the way, doesn't WikiData provide rate limit header fields? If it has them we could intelligently control the request rate from client side.

@marccarre
Copy link
Author

I looked for this too but, unless I missed something, couldn't find such header in the response:

{'accept-ch': 'Sec-CH-UA-Arch,Sec-CH-UA-Bitness,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Model,Sec-CH-UA-Platform-Version',
 'accept-ranges': 'bytes',
 'access-control-allow-origin': '*',
 'age': '1',
 'cache-control': 'public, max-age=300',
 'content-encoding': 'gzip',
 'content-type': 'application/sparql-results+json;charset=utf-8',
 'date': 'Fri, 22 Jul 2022 09:33:06 GMT',
 'nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, '
        '"success_fraction": 0.0}',
 'permissions-policy': 'interest-cohort=(),ch-ua-arch=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-bitness=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-full-version-list=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-model=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-platform-version=(self '
                       '"intake-analytics.wikimedia.org")',
 'report-to': '{ "group": "wm_nel", "max_age": 86400, "endpoints": [{ "url": '
              '"https://intake-logging.wikimedia.org/v1/events?stream=w3c.reportingapi.network_error&schema_uri=/w3c/reportingapi/network_error/1.0.0" '
              '}] }',
 'server': 'nginx/1.14.2',
 'server-timing': 'cache;desc="pass", host;desc="cp5008"',
 'set-cookie': 'WMF-Last-Access=22-Jul-2022;Path=/;HttpOnly;secure;Expires=Tue, '
               '23 Aug 2022 00:00:00 GMT, '
               'WMF-Last-Access-Global=22-Jul-2022;Path=/;Domain=.wikidata.org;HttpOnly;secure;Expires=Tue, '
               '23 Aug 2022 00:00:00 GMT',
 'strict-transport-security': 'max-age=106384710; includeSubDomains; preload',
 'transfer-encoding': 'chunked',
 'vary': 'Accept, Accept-Encoding',
 'x-cache': 'cp5009 miss, cp5008 pass',
 'x-cache-status': 'pass',
 'x-client-ip': '***.***.***.***',
 'x-first-solution-millis': '48',
 'x-served-by': 'wdqs2001'}

There seems to be only this recommendation on the request rate:
https://wikitech.wikimedia.org/wiki/Robot_policy#Request_rate

@dahlia
Copy link
Owner

dahlia commented Jul 22, 2022

Thank you for letting me know that! 🙏🏼

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants