Downloads are not resumable. Get thrift deserialization error #131

Open
Raynos opened this issue Jan 29, 2023 · 0 comments


Raynos commented Jan 29, 2023

I ran the example script and it started downloading v4/validation.parquet.

My wifi was slow and my computer went to sleep. When I woke it up, the program had hung because of the wifi disconnect. I killed the program and ran it again to "resume the download".

Instead I got:

OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

I had to manually delete v4/validation.parquet, since the Numerai SDK was not able to resume the download correctly.
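As a stopgap before reading the file, a cheap sanity check can at least detect a truncated Parquet file: an unencrypted Parquet file begins and ends with the 4-byte magic `PAR1`. The sketch below is only an illustration; `looks_like_complete_parquet` is a hypothetical helper, not part of numerapi, and it would not catch corruption elsewhere in the file.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_complete_parquet(path):
    """Return True if the file starts and ends with the Parquet magic bytes.

    Hypothetical helper for illustration. It only detects missing or
    truncated files, not corrupted bytes in the middle of the file.
    """
    try:
        size = os.path.getsize(path)
    except OSError:
        # Missing or unreadable file.
        return False
    # Header magic (4) + footer length (4) + footer magic (4) is the minimum.
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

If the check returns False, deleting the file and starting the download from scratch avoids the manual cleanup described above.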

Below is the output of the run that resumed the download.

2023-01-29 11:22:59,695 INFO numerapi.utils: resuming download
/home/raynos/.local/lib/python3.8/site-packages/urllib3/connectionpool.py:1043: InsecureRequestWarning: Unverified HTTPS request is being made to host 'numerai-datasets-us-west-2.s3.amazonaws.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings
  warnings.warn(
v4/validation.parquet:  40%|█████▋        | 463M/1.15G [00:00<00:00, 3.82GB/s]



v4/validation.parquet: 1.15GB [01:05, 17.4MB/s]                               
2023-01-29 11:24:07,248 INFO numerapi.utils: starting download
v4/live_409.parquet: 3.42MB [00:01, 1.90MB/s]                  

Below is the output of the program that tries to use the data file from the resumed download.

2023-01-29 11:24:20,449 INFO numerapi.utils: starting download
v4/features.json: 562kB [00:00, 727kB/s]                                               
Reading minimal training data
Traceback (most recent call last):
  File "./example_model.py", line 52, in <module>
    validation_data = pd.read_parquet('v4/validation.parquet',
  File "/home/raynos/.local/lib/python3.8/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/home/raynos/.local/lib/python3.8/site-packages/pandas/io/parquet.py", line 240, in read
    result = self.api.parquet.read_table(
  File "/home/raynos/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1996, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/home/raynos/.local/lib/python3.8/site-packages/pyarrow/parquet.py", line 1831, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 323, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2311, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Invalid data
Deserializing page header failed.

I don't know whether it's possible to do an integrity check with a checksum in the resuming-download branch, but doing so would let you verify whether the resumed download succeeded or was corrupted, and delete the file if it was corrupted.
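A minimal sketch of what such a check could look like, assuming an expected SHA-256 digest is available from somewhere (I don't know whether the API or the S3 bucket actually exposes one). The names `sha256_of_file` and `verify_or_remove` are hypothetical and not part of numerapi.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so a 1 GB download need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_or_remove(path, expected_sha256):
    """Return True if the file matches the expected digest.

    On a mismatch the file is deleted and False is returned, so the next
    run starts a fresh download instead of resuming a corrupted file.
    ASSUMPTION: expected_sha256 comes from a trusted source.
    """
    if sha256_of_file(path) == expected_sha256.lower():
        return True
    os.remove(path)
    return False
```

Running something like `verify_or_remove("v4/validation.parquet", expected)` at the end of the resume branch would mean a corrupted file is never left behind for a later `pd.read_parquet` call to trip over.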

Leaving the corrupted file behind gives me a thrift protocol error, since the parquet file is no longer valid.
