Data download is interrupted after a few minutes #195

Open
sert23 opened this issue Jun 19, 2023 · 7 comments
Labels
bug Something isn't working

Comments


sert23 commented Jun 19, 2023

Describe the bug
I'm not sure what is happening, but for the last few days I have been struggling to download data using pysradb. This worked without problems a couple of weeks ago. Here is the error I get:

File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 444, in _error_catcher
yield
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 567, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 533, in _fp_read
return self._fp.read(amt) if amt is not None else self._fp.read()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 460, in read
return self._read_chunked(amt)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 583, in _read_chunked
chunk_left = self._get_chunk_left()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 566, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 526, in _read_next_chunk_size
line = self.fp.readline(_MAXLINE + 1)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/socket.py", line 705, in readinto
return self._sock.recv_into(b)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/ssl.py", line 1274, in recv_into
return self.read(nbytes, buffer)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/ssl.py", line 1130, in read
return self._sslobj.read(len, buffer)
TimeoutError: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/eap/miRexpress/updates/code/run_update.py", line 200, in <module>
generate_raw_tsv("miRNA-seq", os.path.join(raw_folder, "miRNA-seq.tsv"))
File "/home/eap/miRexpress/updates/code/run_update.py", line 36, in generate_raw_tsv
instance.search()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/pysradb/search.py", line 793, in search
self._format_response(r.raw)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/pysradb/search.py", line 861, in _format_response
for event, elem in Et.iterparse(content):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/xml/etree/ElementTree.py", line 1255, in iterator
data = source.read(16 * 1024)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 566, in read
with self._error_catcher():
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 449, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")

It seems like it's getting disconnected after some minutes.
Is there a parameter I can change to make it retry or something similar? Are they blocking my IP? Is this a widespread recent issue?
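On the retry question, here is a minimal sketch of a generic retry wrapper with exponential backoff. It does not rely on any pysradb option; `call_with_retries` and its parameters are hypothetical names introduced for illustration, and the pysradb usage in the comments is an untested assumption. Note that each retry restarts the whole search from the beginning.

```python
import time


def call_with_retries(func, attempts=5, base_delay=2.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func(); if it raises one of `exceptions`, wait and try again.

    The wait doubles after every failed attempt (base_delay, 2x, 4x, ...).
    The last failure is re-raised unchanged.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({err!r}); retrying in {delay:.0f}s")
            sleep(delay)


# Hypothetical usage with pysradb (assumption, not run here):
#
#   from urllib3.exceptions import HTTPError  # base class of ReadTimeoutError and ProtocolError
#   from pysradb.search import SraSearch
#
#   def run_search():
#       instance = SraSearch(2, 1000000, strategy="miRNA-seq")
#       instance.search()
#       return instance.get_df()
#
#   df = call_with_retries(run_search, exceptions=(HTTPError,))
```

Passing the exception types explicitly keeps the wrapper from hiding unrelated bugs, such as a typo in the query arguments.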

To Reproduce
This now happens on every attempt, at a random point after a few minutes. In this example I'm trying to download info about all miRNA-seq samples in SRA:

from pysradb.search import SraSearch

# verbosity=2, return_max=1000000
instance = SraSearch(2, 1000000, strategy="miRNA-seq")
print("Downloading samples for " + library_type)
instance.search()

Thanks a lot for writing this software and the support!!

@sert23 sert23 added the bug Something isn't working label Jun 19, 2023
Author

sert23 commented Dec 19, 2023

I am running the same script again (it previously worked), and this time a different error occurred.

Traceback (most recent call last):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 566, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 533, in _read_next_chunk_size
return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 583, in _read_chunked
chunk_left = self._get_chunk_left()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 568, in _get_chunk_left
raise IncompleteRead(b'')
http.client.IncompleteRead: IncompleteRead(0 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 444, in _error_catcher
yield
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 567, in read
data = self._fp_read(amt) if not fp_closed else b""
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 533, in _fp_read
return self._fp.read(amt) if amt is not None else self._fp.read()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 460, in read
return self._read_chunked(amt)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 598, in _read_chunked
raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(4336 bytes read)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/eap/miRexpress/updates/code/run_update.py", line 211, in <module>
generate_raw_tsv("miRNA-seq", os.path.join(raw_folder, "miRNA-seq.tsv"))
File "/home/eap/miRexpress/updates/code/run_update.py", line 38, in generate_raw_tsv
instance.search()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/pysradb/search.py", line 793, in search
self._format_response(r.raw)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/pysradb/search.py", line 861, in _format_response
for event, elem in Et.iterparse(content):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/xml/etree/ElementTree.py", line 1255, in iterator
data = source.read(16 * 1024)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 566, in read
with self._error_catcher():
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/site-packages/urllib3/response.py", line 461, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(4336 bytes read)', IncompleteRead(4336 bytes read))

Owner

saketkc commented Dec 19, 2023

My recommendation is to use an external tool for downloading for now: #201 (comment)

Author

sert23 commented Dec 19, 2023

Sorry, I think my explanation was not clear. I'm only trying to download metadata.

Owner

saketkc commented Dec 19, 2023

Is this what you are running? It seems okay at my end:

>>> instance = SraSearch(2, 1000000, strategy="miRNA-seq")
>>> df = instance.search()
  4%|█▍                                 | 5400/130053 [03:13<1:19:26, 26.15it/s]

Author

sert23 commented Dec 19, 2023

Yep, it starts running but it spits out this error after some minutes...

Traceback (most recent call last):
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 566, in _get_chunk_left
chunk_left = self._read_next_chunk_size()
File "/home/eap/anaconda/envs/pysradb/lib/python3.10/http/client.py", line 533, in _read_next_chunk_size
return int(line, 16)
ValueError: invalid literal for int() with base 16: b''

My guess is that something is not formatted properly on the SRA side (this has happened to me before when parsing other SRA data in Python). They include a '\b' somewhere in the description fields, and Python tries to parse it as some kind of binary string.
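A quick check points to another possible reading of this message, offered as an assumption about the cause rather than a confirmed diagnosis. In Python, b'' is the notation for an empty bytes object, and the traceback shows http.client calling int(line, 16) on the chunk-size line of a chunked HTTP response. The sketch below uses a hypothetical helper, `parse_chunk_size`, to reproduce the same ValueError from an empty line, which is what a read returns once the connection has been closed.

```python
def parse_chunk_size(line: bytes) -> int:
    """Parse the hexadecimal size from an HTTP chunk-size line.

    Simplified illustration of the int(line, 16) call seen in the
    traceback; chunk extensions after ';' are dropped first.
    """
    return int(line.split(b";", 1)[0], 16)


# A well-formed chunk-size line: hexadecimal 1a is 26 bytes.
print(parse_chunk_size(b"1a\r\n"))

# An empty read (no data arrived before the connection ended).
try:
    parse_chunk_size(b"")
except ValueError as err:
    print(err)
```

If that reading is right, the failure comes from the connection being cut mid-response rather than from the content of the description fields.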

As a workaround, I'm trying to run the same query on GEO to see if this is parsed differently by them.
Alternatively, is there a way to run an SraSearch query that requests only the summary fields (SRX and SRP)? That would work for me.

Thanks for your help!

Owner

saketkc commented Dec 19, 2023

You could try with verbosity=1

Author

sert23 commented Dec 19, 2023

Thank you, I will try that as a last resort. The problem is that I'm interested in all SRPs, so I would then have to query sample by sample to retrieve them, since verbosity=1 only gives you experiment accessions.
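One way to avoid querying sample by sample would be to look the accessions up in batches. This is a minimal sketch: `batched` is a hypothetical helper, the accessions are made-up placeholders, and the `SRAweb().srx_to_srp` call in the comments is an untested assumption about how the lookup could be done.

```python
def batched(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder accessions, for illustration only.
accessions = ["SRX0000001", "SRX0000002", "SRX0000003", "SRX0000004", "SRX0000005"]

for batch in batched(accessions, 2):
    print(batch)

# Hypothetical usage with pysradb (assumption, not run here):
#
#   from pysradb.sraweb import SRAweb
#
#   db = SRAweb()
#   frames = [db.srx_to_srp(batch) for batch in batched(accessions, 200)]
```

Smaller batches mean more requests but a shorter response per request, so a single dropped connection costs less work.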
