bandersnatch mirror completeness #1622

Open
ktry opened this issue Dec 5, 2023 · 12 comments
Labels
bug Something isn't working help wanted Extra attention is needed question Further information is requested

Comments

@ktry

ktry commented Dec 5, 2023

Thanks so much for providing a means to mirror the PyPI repository!

After our latest run of bandersnatch mirror followed by bandersnatch verify --delete --json-update, our mirror is 13.3 TB in size. It was 17.7 TB before we ran the verify --delete operation. We found that some packages were not being updated even after many runs of bandersnatch mirror. One such package was poetry. We got it to update with bandersnatch sync poetry before we ran the verify --delete operation.

We are running bandersnatch 6.3.0 on Python 3.8.5. The latest verify operation took 17 days to complete and exited with code zero. Our mirror appears incomplete compared to the stats reported on pypi.org. How can we assess the completeness of our mirror?

On our local mirror, web/simple/index.html has 371694 entries, web/simple has 372444 directories, and web/json has 357233 directories. The bandersnatch log reports that 1,049,164 files were fetched. https://pypi.org reports 498,484 projects and https://pypi.org/stats reports a total size of 18.2 TB.
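One rough way to see where the web/simple and web/json counts diverge is to compare the entry names under the two directories. The sketch below is only an illustration: the function names are hypothetical, it assumes every entry directly under web/json and web/simple is named after a project, and it assumes the top-level index files in web/simple all start with "index.".

```python
from pathlib import Path
from typing import Dict, Set


def entry_names(root: Path) -> Set[str]:
    """Names of all entries directly under root (empty set if root is missing)."""
    if not root.is_dir():
        return set()
    return {entry.name for entry in root.iterdir()}


def compare_simple_and_json(mirror_web: Path) -> Dict[str, Set[str]]:
    """Report project names present in only one of web/simple and web/json."""
    simple = entry_names(mirror_web / "simple")
    # Assumption: top-level index files (index.html and friends) are not projects.
    simple = {name for name in simple if not name.startswith("index.")}
    json_names = entry_names(mirror_web / "json")
    return {
        "simple_only": simple - json_names,
        "json_only": json_names - simple,
    }


if __name__ == "__main__":
    # Path taken from the [mirror] directory setting in this report.
    result = compare_simple_and_json(Path("/mirror/sites/PyPI/web"))
    for label, names in result.items():
        print(f"{label}: {len(names)} projects")
```

This only shows which project names lack metadata or a simple page locally; it says nothing about projects that exist on pypi.org but never reached the mirror.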

/etc/bandersnatch.conf:

[mirror]
directory = /mirror/sites/PyPI
json = true
release-files = true
cleanup = true
master = https://pypi.org
timeout = 20
global-timeout = 1800
workers = 3
hash-index = false
simple-format = ALL
stop-on-error = false
storage-backend = filesystem
verifiers = 3
compare-method = hash
@cooperlees cooperlees changed the title bandersnatch bandersnatch mirror completeness Dec 5, 2023
@cooperlees cooperlees added bug Something isn't working help wanted Extra attention is needed question Further information is requested labels Dec 5, 2023
@cooperlees
Contributor

Howdy.

Sorry to hear about your troubles. You've taken the brute-force approach to fixing your errors! But that is dedication (a 17-day verify ...). I haven't run a verify since PyPI was around 1TB and have wondered if it's even sane to do anymore.

I think step one is to see what error(s) you're hitting and work through them. Let's enable the stop-on-error config option and do a few runs to see the actual errors you're hitting.

stop-on-error = true

Did your verify get any errors too? I can't remember, but I think it respects stop-on-error too.

To get a report on completeness, we could add a report subcommand that goes through all the JSON metadata and looks for what is missing. It could also sync newer metadata from pypi.org as it walks the filesystem ... I would accept that PR.
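As a starting point for such a report, here is a minimal sketch of the "walk the JSON metadata and look for what is missing" idea. It is not existing bandersnatch code: the function name is hypothetical, and it assumes each entry under web/json holds PyPI JSON API metadata with a "releases" mapping, and that each release file is stored locally at the same path as in its download URL (web/packages/...).

```python
import json
from pathlib import Path
from typing import Iterator, Tuple
from urllib.parse import urlparse


def missing_release_files(web_root: Path) -> Iterator[Tuple[str, Path]]:
    """Yield (project, expected_path) for files listed in metadata but absent on disk."""
    json_dir = web_root / "json"
    if not json_dir.is_dir():
        return
    for meta_path in sorted(json_dir.iterdir()):
        try:
            metadata = json.loads(meta_path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            # Unreadable or malformed metadata is skipped in this sketch.
            continue
        if not isinstance(metadata, dict):
            continue
        releases = metadata.get("releases") or {}
        for files in releases.values():
            for file_info in files:
                url = file_info.get("url")
                if not url:
                    continue
                # Assumption: the local layout mirrors the URL path,
                # e.g. <web_root>/packages/<hash dirs>/<filename>.
                local = web_root / urlparse(url).path.lstrip("/")
                if not local.exists():
                    yield meta_path.name, local


if __name__ == "__main__":
    for project, path in missing_release_files(Path("/mirror/sites/PyPI/web")):
        print(project, path)
```

The sketch only checks that files exist, not their hashes, and it does not fetch newer metadata from pypi.org.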

@ktry
Author

ktry commented Dec 5, 2023

Here are the bandersnatch operations that we have run lately:
_# Bandersnatch Fri Nov 10 12:27:16 MST 2023
_# 2023-11-10_12:27:16 bandersnatch mirror
_# Bandersnatch Sun Nov 12 09:13:05 MST 2023
_# 2023-11-12_09:13:05 bandersnatch mirror --force-check
_# Bandersnatch Sat Nov 18 08:49:20 MST 2023
_# 2023-11-18_08:49:20 bandersnatch verify --delete --json-update

The verify --delete --json-update log has 2296109 lines, of which 7313 are ERROR: lines. 7283 are of the form:

2023-12-01 07:54:37,648 ERROR: /mirror/sites/PyPI/web/json/normcl.new does not exist - Did not get new JSON metadata (verify.py:68)

The remaining 30 are of the form:

2023-11-26 18:13:02,713 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/21/c8/2b875df3750668fbd334c7d6904955d8f0bbfce23603ab6bc6ee88d9e084/fsleyes-1.4.3-py2.py3-none-any.whl (verify.py:175)

or of the form

2023-11-26 20:34:46,553 ERROR: Error syncing package: pytango (verify.py:38)
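A tally like the one above can be reproduced with a short script. This is a sketch that assumes every error line follows the "date time ERROR: message (file.py:line)" layout of the three samples; the function and pattern names are made up for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Set, Tuple

# Matches e.g. "2023-11-26 20:34:46,553 ERROR: <message> (verify.py:38)".
ERROR_LINE = re.compile(r"^\S+ \S+ ERROR: (?P<msg>.*) \((?P<src>[\w.]+:\d+)\)$")
SYNC_ERROR = re.compile(r"^Error syncing package: (?P<name>\S+)$")


def summarize_errors(lines: Iterable[str]) -> Tuple[Counter, Set[str]]:
    """Count ERROR lines per source location and collect packages that failed to sync."""
    by_source: Counter = Counter()
    failed_packages: Set[str] = set()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if not match:
            continue
        by_source[match["src"]] += 1
        sync = SYNC_ERROR.match(match["msg"])
        if sync:
            failed_packages.add(sync["name"])
    return by_source, failed_packages


if __name__ == "__main__":
    sample = [
        "2023-12-01 07:54:37,648 ERROR: /mirror/sites/PyPI/web/json/normcl.new does not exist - Did not get new JSON metadata (verify.py:68)",
        "2023-11-26 20:34:46,553 ERROR: Error syncing package: pytango (verify.py:38)",
        "2023-11-18 08:52:07,361 INFO: Parsing shuanpdf (verify.py:125)",
    ]
    counts, failed = summarize_errors(sample)
    print(counts.most_common())
    print(sorted(failed))
```

The collected package names could then be retried one at a time with bandersnatch sync, as was done for poetry.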

Here are snippets of all of the reported errors that resulted in tracebacks during the verify op. Is this helpful?

_# 2023-11-18_08:49:20 bandersnatch verify --delete --json-update

2023-11-18 08:49:21,344 INFO: Starting verify for /mirror/sites/PyPI with 3 workers (verify.py:252)
2023-11-18 08:52:07,361 INFO: Parsing shuanpdf (verify.py:125)
2023-11-18 08:52:07,363 INFO: Fetching https://pypi.org/pypi/shuanpdf/json (master.py:149)

/SNIP/

2023-11-26 18:09:56,588 INFO: Fetching https://files.pythonhosted.org/packages/09/06/896687cc1c5098dc5bc6beaaf679a5f7564cb2afc2523f8c06d61e9b874f/fsleyes-1.4.1-py2.py3-none-any.whl (master.py:149)
2023-11-26 18:10:27,905 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/09/06/896687cc1c5098dc5bc6beaaf679a5f7564cb2afc2523f8c06d61e9b874f/fsleyes-1.4.1-py2.py3-none-any.whl (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 18:10:27,996 INFO: Fetching https://files.pythonhosted.org/packages/2e/7e/7cd5ab387eb7f532eff87dd71f6abd71b87ecfeac582b809496eb495bcf3/fsleyes-1.4.1.tar.gz (master.py:149)
2023-11-26 18:11:11,702 INFO: Fetching https://files.pythonhosted.org/packages/58/fc/828b23c7361f4c935391f58d5f77635c70559637023e573e680fd8599b23/fsleyes-1.4.2-py2.py3-none-any.whl (master.py:149)
2023-11-26 18:11:48,651 INFO: Fetching https://files.pythonhosted.org/packages/71/4a/fe3856ee78f61924044bdc9058bb5b6652ea82af90c46aa32c482227e0ae/fsleyes-1.4.2.tar.gz (master.py:149)
2023-11-26 18:12:23,426 INFO: Fetching https://files.pythonhosted.org/packages/21/c8/2b875df3750668fbd334c7d6904955d8f0bbfce23603ab6bc6ee88d9e084/fsleyes-1.4.3-py2.py3-none-any.whl (master.py:149)
2023-11-26 18:13:02,713 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/21/c8/2b875df3750668fbd334c7d6904955d8f0bbfce23603ab6bc6ee88d9e084/fsleyes-1.4.3-py2.py3-none-any.whl (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 18:13:02,767 INFO: Fetching https://files.pythonhosted.org/packages/a9/c0/d3a78eb0dd781d64d9af706f466c42f507a6bec4069bca3e3e32f6bb2ae6/fsleyes-1.4.3.tar.gz (master.py:149)
2023-11-26 18:13:19,152 INFO: Fetching https://files.pythonhosted.org/packages/41/8b/f419746e60721f37d263247c06e8417a72c1650bb35a41d7c1d1beb5c819/fsleyes-1.4.4-py2.py3-none-any.whl (master.py:149)

/SNIP/

2023-11-26 18:43:32,980 INFO: Fetching https://files.pythonhosted.org/packages/e5/e1/254288af765910269ec6f9ea39e222c3d67de84617f79b1e63c4ba6a75c1/MeUtils-2023.11.20.13.42.41-py3-none-any.whl (master.py:149)
2023-11-26 18:44:30,557 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/e5/e1/254288af765910269ec6f9ea39e222c3d67de84617f79b1e63c4ba6a75c1/MeUtils-2023.11.20.13.42.41-py3-none-any.whl (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 18:44:30,566 INFO: Fetching https://files.pythonhosted.org/packages/6a/cc/9895b13fe2203934567a3c010a12cbb96181be4421f77a2162f2ea2529ba/MeUtils-2023.11.20.13.42.41.tar.gz (master.py:149)
2023-11-26 18:45:04,176 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/6a/cc/9895b13fe2203934567a3c010a12cbb96181be4421f77a2162f2ea2529ba/MeUtils-2023.11.20.13.42.41.tar.gz (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 18:45:04,230 INFO: Fetching https://files.pythonhosted.org/packages/f8/d6/6b68ca80f9c9b51b474063acbe86f2fa9146e606d620a3c76e392eb6f7eb/MeUtils-2023.11.20.13.43.23-py3-none-any.whl (master.py:149)
2023-11-26 18:45:21,930 INFO: Fetching https://files.pythonhosted.org/packages/c7/f0/433c3bb165d2e0a39bfa2b5c446de67fd696e32299f3a96b1b5352b5fcba/MeUtils-2023.11.20.13.43.23.tar.gz (master.py:149)
2023-11-26 18:45:48,489 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/c7/f0/433c3bb165d2e0a39bfa2b5c446de67fd696e32299f3a96b1b5352b5fcba/MeUtils-2023.11.20.13.43.23.tar.gz (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 18:45:48,550 INFO: Fetching https://files.pythonhosted.org/packages/d2/40/0d3b2636e4057a599b548aa0ec510e0c78650389a348036b7833490a8611/MeUtils-2023.11.20.13.50.9-py3-none-any.whl (master.py:149)
2023-11-26 18:45:49,792 INFO: Fetching https://files.pythonhosted.org/packages/32/af/579db493ffa5c4df0a9333f76d4a71f153bebead7bdef47ec28e935f2e13/MeUtils-2023.11.20.13.50.9.tar.gz (master.py:149)

/SNIP/

2023-11-26 19:29:47,436 INFO: Fetching https://files.pythonhosted.org/packages/e8/e0/6b7668c4a41e2d129514321ad1343e99347771a6278085fd2e4ee4b5ff81/deepforest-1.2.2-py3-none-any.whl (master.py:149)
2023-11-26 19:30:07,580 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/e8/e0/6b7668c4a41e2d129514321ad1343e99347771a6278085fd2e4ee4b5ff81/deepforest-1.2.2-py3-none-any.whl (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
async with self.session.get(url) as response:
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
await resp.start(conn)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start
message, payload = await self._protocol.read() # type: ignore
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read
await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 19:30:07,640 INFO: Fetching https://files.pythonhosted.org/packages/17/18/c8969eab432faa19508877fdfbf2ab2852d02bcc5b5d7c4203b81586ab26/deepforest-1.2.2.tar.gz (master.py:149)
2023-11-26 19:30:28,679 INFO: Fetching https://files.pythonhosted.org/packages/ed/9e/e007b234e72a83f3f15233c77d5c9311d3181c567ecf5e3ef7dba95d85e4/deepforest-1.2.3-py3-none-any.whl (master.py:149)
2023-11-26 19:30:44,727 INFO: Fetching https://files.pythonhosted.org/packages/c9/b7/15138ed10b1480e20e85e1947ce6d7b217e250c67a64449419bd4039e8b7/deepforest-1.2.3.tar.gz (master.py:149)
2023-11-26 19:31:20,503 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/c9/b7/15138ed10b1480e20e85e1947ce6d7b217e250c67a64449419bd4039e8b7/deepforest-1.2.3.tar.gz (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 19:31:20,569 INFO: Fetching https://files.pythonhosted.org/packages/7f/3f/12427d5153e4f9b7321f175713fcc6268f7493d3cae92f2febb26f45a4c3/deepforest-1.2.4-py3-none-any.whl (master.py:149)
2023-11-26 19:31:50,354 INFO: Fetching https://files.pythonhosted.org/packages/ed/3d/0092384e54dd868c48f56d3eed1bbab1675df5598ca1a66f183156dca7c5/deepforest-1.2.4.tar.gz (master.py:149)

/SNIP/

2023-11-26 19:57:14,817 INFO: Fetching https://files.pythonhosted.org/packages/2f/f4/97bd5e9d29f404b1ebbf33877b90a20f42a33554e2aa277922432395b397/unitem-1.2.6-py2.py3-none-any.whl (master.py:149)
2023-11-26 19:57:35,810 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/2f/f4/97bd5e9d29f404b1ebbf33877b90a20f42a33554e2aa277922432395b397/unitem-1.2.6-py2.py3-none-any.whl (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 19:57:35,884 INFO: Fetching https://files.pythonhosted.org/packages/3b/47/047f6ce12947e57cded1cd3579ea3fa8b2b15d06e753fc02a6522598db88/unitem-1.2.6-py3.8.egg (master.py:149)
2023-11-26 19:57:50,047 INFO: Fetching https://files.pythonhosted.org/packages/c7/a2/f4881a76703671bace3524f84d64d65fa0766fc16d207fb778ad99e5b3ed/unitem-1.2.6.tar.gz (master.py:149)
2023-11-26 19:57:53,309 ERROR: Error syncing package: unitem (verify.py:38)
NoneType: None
2023-11-26 19:57:53,309 INFO: Finished validating unitem (verify.py:198)

/SNIP/

File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
async with self.session.get(url) as response:
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
await resp.start(conn)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start
message, payload = await self._protocol.read() # type: ignore
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read
await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 20:24:16,515 INFO: Fetching https://files.pythonhosted.org/packages/72/8a/2c078705d8da1c91724345912d77a6615318cb44eb387e0ff59dfe13f7f0/pytango-9.4.1rc1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (master.py:149)
2023-11-26 20:24:39,432 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/72/8a/2c078705d8da1c91724345912d77a6615318cb44eb387e0ff59dfe13f7f0/pytango-9.4.1rc1-cp36-cp36m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 20:24:39,492 INFO: Fetching https://files.pythonhosted.org/packages/ef/ff/ddfd7213c79601f41a8635ae3af75336c7299ca94ba4553b187149b312f6/pytango-9.4.1rc1-cp36-cp36m-manylinux_2_17_i686.manylinux2014_i686.whl (master.py:149)
2023-11-26 20:25:37,691 INFO: Fetching https://files.pythonhosted.org/packages/56/58/79abb1870d26bd78ae017fe81e46a659bcd63aeb3e190603553a0d25f77e/pytango-9.4.1rc1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (master.py:149)
2023-11-26 20:26:03,133 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/56/58/79abb1870d26bd78ae017fe81e46a659bcd63aeb3e190603553a0d25f77e/pytango-9.4.1rc1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 20:26:03,211 INFO: Fetching https://files.pythonhosted.org/packages/a5/3e/d98fb4b02f0c05d777b0ed4c664934757bb06fb5a7f25034c12843e4ce6b/pytango-9.4.1rc1-cp36-cp36m-win32.whl (master.py:149)
2023-11-26 20:26:05,801 INFO: Fetching https://files.pythonhosted.org/packages/30/fc/b830a9d2e4b6a03889180a81df133c52e34dd289e78805b1be9b7f5fe483/pytango-9.4.1rc1-cp36-cp36m-win_amd64.whl (master.py:149)
2023-11-26 20:26:07,940 INFO: Fetching https://files.pythonhosted.org/packages/51/f5/8b56ac422444dd2a27ade4799fd3aeb9c2fef2307c8f7dafadc87b54fc2f/pytango-9.4.1rc1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (master.py:149)
2023-11-26 20:26:30,764 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/51/f5/8b56ac422444dd2a27ade4799fd3aeb9c2fef2307c8f7dafadc87b54fc2f/pytango-9.4.1rc1-cp37-cp37m-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 20:26:30,855 INFO: Fetching https://files.pythonhosted.org/packages/6f/68/a7166d9406c90d1a707e3bf15671faba0683e807d3910194f9d57a9e688c/pytango-9.4.1rc1-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl (master.py:149)
2023-11-26 20:26:58,134 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/6f/68/a7166d9406c90d1a707e3bf15671faba0683e807d3910194f9d57a9e688c/pytango-9.4.1rc1-cp37-cp37m-manylinux_2_17_i686.manylinux2014_i686.whl (verify.py:175)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 173, in verify
await master.url_fetch(jpkg["url"], pkg_file, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 158, in url_fetch
chunk = await response.content.read(chunk_size)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 380, in read
await self._wait("read")
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 306, in _wait
await waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-26 20:26:58,207 INFO: Fetching https://files.pythonhosted.org/packages/e0/f5/fdc1a5fa1c9ea204316d39dd6e7051a7553ea6be4d4d9d2d1029d0c0880f/pytango-9.4.1rc1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (master.py:149)
2023-11-26 20:27:37,781 INFO: Fetching https://files.pythonhosted.org/packages/71/9a/26b822f72747aedb03216181626e8eb66ff358b91d6235c0a6159496cf65/pytango-9.4.1rc1-cp37-cp37m-win32.whl (master.py:149)

/SNIP/

self._resp = await self._coro

File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
await resp.start(conn)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start
message, payload = await self._protocol.read() # type: ignore
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read
await self._waiter
aiohttp.client_exceptions.ServerTimeoutError: Timeout on reading data from socket
2023-11-30 13:37:22,869 INFO: Finished validating micro-py (verify.py:198)
2023-11-30 13:37:22,870 INFO: Parsing aiohttp-dynamic (verify.py:125)
2023-11-30 13:37:22,870 INFO: Fetching https://pypi.org/pypi/aiohttp-dynamic/json (master.py:149)
2023-11-30 13:37:26,384 ERROR: Error syncing package: aiohttp-dynamic (verify.py:38)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 969, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore # noqa
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
transport, protocol = await self._create_connection_transport(
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1080, in _create_connection_transport
await waiter
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/selector_events.py", line 846, in _read_ready__data_received
data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
await get_latest_json(master, json_full_path, executor, args.delete)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
await master.url_fetch(url, new_json_path, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
async with self.session.get(url) as response:
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 520, in _request
conn = await self._connector.connect(
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 535, in connect
proto = await self._create_connection(req, traces, timeout)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 890, in _create_connection
_, proto = await self._create_proxy_connection(req, traces, timeout)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 1139, in _create_proxy_connection
transport, proto = await self._wrap_create_connection(
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pypi.org:443 ssl:default [Connection reset by peer]
2023-11-30 13:37:27,073 INFO: Finished validating aiohttp-dynamic (verify.py:198)
2023-11-30 13:37:27,073 INFO: Parsing threadactive (verify.py:125)
2023-11-30 13:37:27,073 INFO: Fetching https://pypi.org/pypi/threadactive/json (master.py:149)
2023-11-30 13:37:30,587 ERROR: Error syncing package: threadactive (verify.py:38)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 969, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore # noqa
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
transport, protocol = await self._create_connection_transport(
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1080, in _create_connection_transport
await waiter
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/selector_events.py", line 846, in _read_ready__data_received
data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
await get_latest_json(master, json_full_path, executor, args.delete)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
await master.url_fetch(url, new_json_path, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
async with self.session.get(url) as response:
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 520, in _request
conn = await self._connector.connect(
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 535, in connect
proto = await self._create_connection(req, traces, timeout)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 890, in _create_connection
_, proto = await self._create_proxy_connection(req, traces, timeout)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 1139, in _create_proxy_connection
transport, proto = await self._wrap_create_connection(
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pypi.org:443 ssl:default [Connection reset by peer]
2023-11-30 13:37:31,075 INFO: Finished validating threadactive (verify.py:198)
await waiter
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/selector_events.py", line 846, in _read_ready__data_received
data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
await get_latest_json(master, json_full_path, executor, args.delete)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
await master.url_fetch(url, new_json_path, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
async with self.session.get(url) as response:
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 520, in _request
conn = await self._connector.connect(
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 535, in connect
proto = await self._create_connection(req, traces, timeout)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 890, in _create_connection
_, proto = await self._create_proxy_connection(req, traces, timeout)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 1139, in _create_proxy_connection
transport, proto = await self._wrap_create_connection(
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pypi.org:443 ssl:default [Connection reset by peer]
2023-11-30 13:37:50,617 INFO: Finished validating templateapp (verify.py:198)
2023-11-30 13:37:50,617 INFO: Parsing pyemailtracker (verify.py:125)
2023-11-30 13:37:50,617 INFO: Fetching https://pypi.org/pypi/pyemailtracker/json (master.py:149)
2023-11-30 13:37:52,875 INFO: Finished validating pyemailtracker (verify.py:198)
2023-11-30 13:37:52,876 INFO: Parsing hitomi (verify.py:125)
2023-11-30 13:37:52,876 INFO: Fetching https://pypi.org/pypi/hitomi/json (master.py:149)
2023-11-30 13:37:54,113 INFO: Finished validating hitomi (verify.py:198)
2023-11-30 13:37:54,114 INFO: Parsing power-profiler (verify.py:125)
2023-11-30 13:37:54,114 INFO: Fetching https://pypi.org/pypi/power-profiler/json (master.py:149)
2023-11-30 13:37:54,737 INFO: Finished validating power-profiler (verify.py:198)
2023-11-30 13:37:54,738 INFO: Parsing requests-lb (verify.py:125)
2023-11-30 13:37:54,738 INFO: Fetching https://pypi.org/pypi/requests-lb/json (master.py:149)
2023-11-30 13:37:55,317 INFO: Finished validating requests-lb (verify.py:198)
2023-11-30 13:37:55,318 INFO: Parsing overlap (verify.py:125)
2023-11-30 13:37:55,318 INFO: Fetching https://pypi.org/pypi/overlap/json (master.py:149)
2023-11-30 13:37:55,403 ERROR: Error syncing package: overlap (verify.py:38)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
await get_latest_json(master, json_full_path, executor, args.delete)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
await master.url_fetch(url, new_json_path, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
async with self.session.get(url) as response:
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
await resp.start(conn)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start
message, payload = await self._protocol.read() # type: ignore
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read
await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
2023-11-30 13:37:59,042 INFO: Finished validating overlap (verify.py:198)

/SNIP/

2023-11-30 13:38:14,897 INFO: Fetching https://pypi.org/pypi/datafilter/json (master.py:149)
2023-11-30 13:38:15,031 ERROR: Error syncing package: datafilter (verify.py:38)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
await get_latest_json(master, json_full_path, executor, args.delete)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
await master.url_fetch(url, new_json_path, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
async with self.session.get(url) as response:
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 544, in _request
await resp.start(conn)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 890, in start
message, payload = await self._protocol.read() # type: ignore
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/streams.py", line 604, in read
await self._waiter
aiohttp.client_exceptions.ClientOSError: [Errno 104] Connection reset by peer
2023-11-30 13:38:15,583 INFO: Finished validating datafilter (verify.py:198)
2023-11-30 13:38:15,583 INFO: Parsing monthly-returns-heatmap (verify.py:125)

/SNIP/

2023-11-30 13:38:28,729 ERROR: Error syncing package: setuptools-cython (verify.py:38)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 969, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore # noqa
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
transport, protocol = await self._create_connection_transport(
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1080, in _create_connection_transport
await waiter
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/selector_events.py", line 846, in _read_ready__data_received
data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
await get_latest_json(master, json_full_path, executor, args.delete)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
await master.url_fetch(url, new_json_path, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
async with self.session.get(url) as response:
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 520, in _request
conn = await self._connector.connect(
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 535, in connect
proto = await self._create_connection(req, traces, timeout)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 890, in _create_connection
_, proto = await self._create_proxy_connection(req, traces, timeout)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 1139, in _create_proxy_connection
transport, proto = await self._wrap_create_connection(
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pypi.org:443 ssl:default [Connection reset by peer]
2023-11-30 13:38:28,921 INFO: Finished validating setuptools-cython (verify.py:198)
2023-11-30 13:38:28,921 INFO: Parsing oog (verify.py:125)
2023-11-30 13:38:28,922 INFO: Fetching https://pypi.org/pypi/oog/json (master.py:149)
2023-11-30 13:38:32,436 ERROR: Error syncing package: oog (verify.py:38)
Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 969, in _wrap_create_connection
return await self._loop.create_connection(*args, **kwargs) # type: ignore # noqa
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1050, in create_connection
transport, protocol = await self._create_connection_transport(
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 1080, in _create_connection_transport
await waiter
File "/root/.pyenv/versions/3.8.5/lib/python3.8/asyncio/selector_events.py", line 846, in _read_ready__data_received
data = self._sock.recv(self.max_size)
ConnectionResetError: [Errno 104] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 131, in verify
await get_latest_json(master, json_full_path, executor, args.delete)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/verify.py", line 55, in get_latest_json
await master.url_fetch(url, new_json_path, executor)
File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/master.py", line 155, in url_fetch
async with self.session.get(url) as response:
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 1117, in __aenter__
self._resp = await self._coro
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/client.py", line 520, in _request
conn = await self._connector.connect(
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 535, in connect
proto = await self._create_connection(req, traces, timeout)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 890, in _create_connection
_, proto = await self._create_proxy_connection(req, traces, timeout)
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 1139, in _create_proxy_connection
transport, proto = await self._wrap_create_connection(
File "/opt/bandersnatch/lib/python3.8/site-packages/aiohttp/connector.py", line 975, in _wrap_create_connection
raise client_error(req.connection_key, exc) from exc
aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host pypi.org:443 ssl:default [Connection reset by peer]
2023-11-30 13:38:32,674 INFO: Finished validating oog (verify.py:198)

@cooperlees
Contributor

So the verify seems to be getting a lot of connection errors and timeouts - What kind of internet connection are you running bandersnatch on?

It would be nice to maybe go slower and reduce these timeouts and errors I think before we can worry about your consistency ...

Have you tried 2 or 1 workers to see if you get fewer timeouts?

workers = 2

Maybe the default timeout of 10 seconds isn't enough either? This all depends on the connection you're on, but it should be

timeout = 10
  • Could maybe try 15 or 20 here ...

If you could try and sync with that + enable stop on error (as suggested above) and do a run I'd be interested to see what you hit.
Please also run with --debug if you can. That might help show us something to work off. That would look something like:

bandersnatch --debug mirror

@ktry
Author

ktry commented Dec 5, 2023

I started a bandersnatch mirror job last night at 8 pm. It finished at 1 pm today with a zero exit status. The mirror grew by 59.9G. There were 2314 Fetching metadata lines in the 15788 line logfile and no ERROR lines. There are 11142 Downloading lines in the log.

Our package listing increased by 974 to a total of 372688. The last-modified date is 20231205T03:43:56.

Our internet connection throttles down to 48 Mbps after initial bursts of 200+Mbps.

Since there were no errors or timeouts, why did it complete with only 372688 total packages present on our mirror?

Will --debug mirror be helpful when there are no timeouts or errors?

First Fetching:
2023-12-04 20:43:57,002 INFO: Fetching metadata for package: apimakesens-python (serial 13609424) (package.py:58)

First Downloading:
2023-12-04 20:44:08,869 INFO: Downloading: https://files.pythonhosted.org/packages/c8/85/959e0ff82501b637e6e1541d5c7600d0eb2b79986184955582a149fcfb5c/prettyPlot-0.0.10-py3-none-any.whl (mirror.py:875)

Last Fetching:
2023-12-05 12:55:28,507 INFO: Fetching metadata for package: zytlib (serial 13609589) (package.py:58)

Last Downloading (and last lines in logfile):
https://files.pythonhosted.org/packages/8a/e0/f3ef24673dc17b52112bb9bc7384839b2ddc35e82ff18bd765ea53c54eff/zxkane.cdk-construct-simple-nat-0.2.628.tar.gz
2023-12-05 12:56:35,836 INFO: Storing index page(s): zxkane-cdk-construct-simple-nat - in /mirror/sites/PyPI/web/simple/zxkane-cdk-construct-simple-nat (mirror.py:698)
2023-12-05 12:57:18,028 INFO: Storing index page(s): zuul - in /mirror/sites/PyPI/web/simple/zuul (mirror.py:698)
2023-12-05 12:57:18,156 INFO: Generating global index page. (simple.py:260)
2023-12-05 13:01:03,482 INFO: New mirror serial: 13646864 (mirror.py:472)
2023-12-05 13:01:03,640 INFO: 1919 packages had changes (mirror.py:990)
2023-12-05 13:01:03,859 INFO: Writing diff file to mirrored-files (mirror.py:1000)

@ktry
Author

ktry commented Dec 13, 2023

I've run bandersnatch mirror several times without error, but it only seems to fetch a few hundred projects on each run. I now have 374680 out of the 500,508 projects listed on pypi.org. I just started a new run and the todo file only had 7276 entries. Since I'm not getting errors or timeouts at this point, what can I do to address the consistency? Thanks!

@cooperlees
Contributor

cooperlees commented Dec 13, 2023

Sadly, the only options now are very expensive. They are:

  • bandersnatch mirror --force-check
    • This will do a full sync and check every package from PyPI
    • If you already have the versions downloaded it will checksum or stat on the filesystem depending on your config
  • bandersnatch verify [--delete] --json-update
    • This will only go through what you have downloaded, pull the latest JSON, and ensure you have all the release files downloaded
    • As an added bonus it will delete any versions that were deleted upstream
    • As a negative, it won't download any of your missing packages like the above command
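
Before paying for either expensive option, a cheaper check is to diff the project names in the upstream simple index against the local one. The following is a minimal sketch, not something bandersnatch ships: the function names are hypothetical, and it assumes both indexes are plain HTML pages whose anchors end in the project name (for example `/simple/poetry/` upstream and `poetry/` locally). Names are normalized per PEP 503 before comparison.

```python
import re
from html.parser import HTMLParser


def normalize(name: str) -> str:
    """PEP 503 normalization: lowercase, runs of '-', '_' and '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


class _HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def project_names(index_html: str) -> set:
    """Return the normalized project names linked from a simple index page."""
    parser = _HrefCollector()
    parser.feed(index_html)
    names = set()
    for href in parser.hrefs:
        # Assumption: the last non-empty path segment is the project name.
        segment = href.rstrip("/").rsplit("/", 1)[-1]
        if segment:
            names.add(normalize(segment))
    return names


def missing_projects(upstream_html: str, local_html: str) -> list:
    """Projects listed upstream that the local index does not list."""
    return sorted(project_names(upstream_html) - project_names(local_html))
```

Feeding it the body of https://pypi.org/simple/ and the contents of web/simple/index.html would give a list of candidate names to pass to bandersnatch sync, as was done for poetry earlier in this thread. Projects excluded by filter plugins will also show up as missing, so the list needs a sanity check.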

@ktry
Author

ktry commented Dec 14, 2023

Thanks for that clarification! I'll do the force-check and if I start getting errors or timeouts, I'll start the debug process you outlined above.

@ktry
Author

ktry commented Dec 21, 2023

One thing that I noticed is that web/simple/index.html is not updated as packages are synced with bandersnatch mirror --force-check. If bandersnatch doesn't finish gracefully, then web/simple/index.html could be out of sync.

Here are some statistics with bandersnatch mirror --force-check running for six days:

# grep -c href web/simple/index.html
375642
# find web/simple -maxdepth 1 -type d -newer web/simple/index.html | wc -l
271275
# awk '/ERROR/ {e++} /Fetching/ {f++} /Downloading/ {d++} /Storing/ {s++} END { printf("ERROR=%d, Fetching=%d, Storing=%d, Download=%d\n", e, f, s, d) }' bandersnatch.out
ERROR=2, Fetching=296984, Storing=287573, Download=1250239

I have high hopes that if bandersnatch finishes gracefully, that web/simple/index.html will have a lot more hrefs. And if not, I can write a tool to regenerate it.
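
A regeneration tool along those lines could be sketched as follows. This is a hypothetical helper, not part of bandersnatch: it assumes hash-index = false (so every project is a directory directly under web/simple), and it emits a bare PEP 503-style page rather than reproducing the exact markup bandersnatch writes.

```python
import html
from pathlib import Path


def build_simple_index(simple_dir: Path) -> str:
    """Build a minimal PEP 503-style index from the directories in simple_dir.

    Assumption: each subdirectory name is a project name; plain files such as
    an existing index.html are ignored.
    """
    names = sorted(p.name for p in simple_dir.iterdir() if p.is_dir())
    lines = [
        "<!DOCTYPE html>",
        "<html>",
        "  <head>",
        "    <title>Simple Index</title>",
        "  </head>",
        "  <body>",
    ]
    for name in names:
        escaped = html.escape(name)
        lines.append(f'    <a href="{escaped}/">{escaped}</a><br/>')
    lines += ["  </body>", "</html>"]
    return "\n".join(lines) + "\n"


def write_regenerated_index(simple_dir: Path, output: Path) -> int:
    """Write the regenerated page to output and return the number of entries."""
    page = build_simple_index(simple_dir)
    output.write_text(page, encoding="utf-8")
    return page.count("<a href=")
```

Writing to a separate output file first, and comparing its href count with the grep above, seems safer than overwriting web/simple/index.html directly while a mirror run may still be in progress.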

@cooperlees
Contributor

cooperlees commented Dec 22, 2023

Yeah, sadly, index.html is generated at the end of the run. Since the mirror is getting so big these days, I'd happily take a PR to periodically write out the global index.html during a run ... But it would have to be enabled by a config var with the default off I feel.

@ktry
Author

ktry commented Dec 26, 2023

The bandersnatch mirror --force-check just finished and things are looking pretty good. The todo file has 14812 entries after bandersnatch finished and web/simple/index.html has 501461 hrefs.

Here are the stats from the todo and logfile. I'll try doing a normal bandersnatch mirror to see if it picks up any more packages.

TODO=14812 ERROR=4, Fetching=501089, Storing=486277, Download=2059037

The final log entries are:

2023-12-25 18:18:05,162 INFO: Downloading: https://files.pythonhosted.org/packages/94/22/c2ad4e731c3795db8acca6ea4c03d969477a97f05d2dd12ef50de59571aa/zzq_string_sum-0.4.0.tar.gz (mirror.py:875)
2023-12-25 18:18:05,227 INFO: Storing index page(s): zzq-string-sum - in /mirror/sites/PyPI/web/simple/zzq-string-sum (mirror.py:698)
2023-12-25 18:18:05,317 INFO: Generating global index page. (simple.py:260)
2023-12-25 18:28:15,083 INFO: 486277 packages had changes (mirror.py:990)
2023-12-25 18:29:00,593 INFO: Writing diff file to mirrored-files (mirror.py:1000)

The two additional errors are filename too long errors:

2023-12-25 11:04:54,546 INFO: Downloading: https://files.pythonhosted.org/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl (mirror.py:875)
2023-12-25 11:04:54,615 ERROR: Continuing to next file after error downloading: https://files.pythonhosted.org/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl (mirror.py:686)
Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 662, in sync_release_files
    downloaded_file = await self.download_file(
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 892, in download_file
    with self.storage_backend.rewrite(path, "wb") as f:
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch_storage_plugins/filesystem.py", line 82, in rewrite
    with tempfile.NamedTemporaryFile(
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/tempfile.py", line 541, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 36] File name too long: '/mirror/sites/PyPI/web/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/.uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl.203asete'
2023-12-25 11:04:54,691 INFO: Downloading: https://files.pythonhosted.org/packages/57/79/21b676698665e561d5320dad7e6d94685b429ee0179671284a9cf3cd42c4/usearch-0.22.0-cp39-cp39-manylinux_2_28_x86_64.whl (mirror.py:875)
2023-12-25 11:04:54,701 INFO: Downloading: https://files.pythonhosted.org/packages/cc/50/82753aa766ef30414fce227894e0495ac93ee4f1f3f44a2c7e9c88c79c55/uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593.tar.gz (mirror.py:875)
2023-12-25 11:04:54,778 ERROR: Error syncing package: uselesscapitalquiz@14521754 (mirror.py:377)
Traceback (most recent call last):
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 130, in package_syncer
    await self.process_package(package)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 337, in process_package
    await self.sync_release_files(package)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 693, in sync_release_files
    raise deferred_exception  # raise the exception after trying all files
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 662, in sync_release_files
    downloaded_file = await self.download_file(
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch/mirror.py", line 892, in download_file
    with self.storage_backend.rewrite(path, "wb") as f:
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/opt/bandersnatch/lib/python3.8/site-packages/bandersnatch_storage_plugins/filesystem.py", line 82, in rewrite
    with tempfile.NamedTemporaryFile(
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/tempfile.py", line 541, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/root/.pyenv/versions/3.8.5/lib/python3.8/tempfile.py", line 250, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
OSError: [Errno 36] File name too long: '/mirror/sites/PyPI/web/packages/74/b6/d3fe5583d610652a0ce8613b05922b62a1fab89a4804eb8977f8ff2b2814/.uselesscapitalquiz-3.14159265358979323846264338327950288419716939937510582097494459230781640628620899862803482534211706798214808651328230664709384460955058223172535940812848111745028410270193852110555964462294895493038196442881097566593-py3-none-any.whl.203asete'

@cooperlees
Contributor

Ahh, the long name problem. We've discussed this in #1228, and I feel we should maybe soft error (report that we're skipping this package due to file system limitations, then move on). But I also get that this is not explicit and evil. Maybe it should be a config option that the owner(s) of this bandersnatch instance can choose. As stated elsewhere, I'd accept this PR.

Ideally we need PyPI to not allow package names this long.
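
As a rough illustration of what a "report and skip" pre-check might look like, here is a small sketch. It is not bandersnatch code: the function names are hypothetical, the 255-byte limit is an assumption that holds for common Linux filesystems such as ext4 and XFS, and the 10-byte overhead is read off the temporary name in the traceback above (a leading "." plus a "." and eight random characters).

```python
import os

# Assumption: typical per-component limit on Linux filesystems (ext4, XFS).
NAME_MAX_BYTES = 255
# Assumption: temp name is "." + filename + "." + 8 random characters,
# matching the ".…whl.203asete" name shown in the traceback.
TEMP_OVERHEAD_BYTES = 10


def filesystem_name_max(directory: str) -> int:
    """Ask the OS for the limit where possible, else use the assumed default."""
    try:
        return os.pathconf(directory, "PC_NAME_MAX")
    except (AttributeError, OSError, ValueError):
        return NAME_MAX_BYTES


def temp_name_too_long(
    filename: str,
    name_max: int = NAME_MAX_BYTES,
    overhead: int = TEMP_OVERHEAD_BYTES,
) -> bool:
    """True if the temporary download name for filename would exceed name_max."""
    return len(filename.encode("utf-8")) + overhead > name_max
```

The overhead matters because the temporary name is longer than the final one, so a release file whose own name fits within the limit can still fail at download time.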

@ktry
Author

ktry commented Dec 26, 2023

Another run of bandersnatch mirror has some filename too long errors. So I added the blocklist_project plugin to filter out uselesscapitalquiz as described in comment-9 issue1100 and now bandersnatch mirror completed and there is no todo file.
Here are the stats:

TODO=0 ERROR=0, Fetching=14800, Storing=0, Download=0
grep -c href web/simple/index.html
501469
Repo Size = 17.3T

That's pretty close to the 503,186 projects reported on pypi.org. I'm happy.
