Read_json from S3 hitting HTTP GET error #10996
Replies: 5 comments 7 replies
-
Quick question, ChashLaker: could the credentials you are using have expired? I have a few processes that stumble when the credentials expire (mitigated by processing files in chunks). Hope that helps.
-
ChashLaker, that is essentially what happens: the token expires hours into the sequence. One way to address this is to divide the workload so the overall runtime of each batch stays around an hour. That has been an effective technique, and it has helped with other minor glitches as well.
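As a sketch of that batching strategy (pure Python; the names are illustrative, not from the thread), splitting the file list into fixed-size chunks keeps each run short enough that the credentials stay valid:

```python
def chunk_keys(keys: list[str], batch_size: int) -> list[list[str]]:
    """Split a list of S3 keys into fixed-size batches so that each batch
    finishes well within the credential/token lifetime."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]
```

Each batch can then be handed to its own read_json call, or even to a separate EC2 instance.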
-
Sorry to say I haven't found a way to maintain a connection for longer than a few hours straight; I have to refresh it. I believe the token changes and the connection is lost in the process. One way to mitigate this is to break the workload down into chunks, which has the worthy side benefit of letting the work run across multiple EC2 instances. Regarding credentials: I'm currently on DuckDB 0.9.2, as we have a few hurdles with 0.10.0 to overcome first. Between each chunk I'm calling:
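The snippet itself did not survive in the thread. Given the description that follows (it "reloads the credentials" and requires aws.duckdb_extension), it is presumably the aws extension's load_aws_credentials call; a sketch, with a helper name of my own choosing:

```python
def refresh_s3_credentials(con) -> None:
    """Re-resolve AWS credentials from the ~/.aws credential chain.

    `con` is a duckdb connection with the aws extension loaded
    (INSTALL aws; LOAD aws; -- shipped as aws.duckdb_extension).
    """
    con.execute("CALL load_aws_credentials();")
```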
This effectively reloads the credentials. Note this requires aws.duckdb_extension, which has some hurdles of its own to overcome. I also like this approach because the configuration file in .aws lets me establish a number of different configurations beyond what the role we are using for the instance already provides. The secret-based method is worthy too; I've been looking forward to it, as it opens up additional resources like other cloud object storage providers. Hope this helped, and happy coding.
-
For future reference, maybe it's related to this.
-
I am also battling a similar issue. Using the duckdb Python API, I issue "select max(LastmodifiedDate::DATE) from read_parquet('s3://my-bucket/main.parquet')" queries for CDC. This runs approximately 1000 times against various files in S3 (1000 total, not 1000 each). It consistently fails towards the end of the day with: duckdb.duckdb.IOException: IO Error: Connection error for HTTP HEAD to 'https://my-bucket.s3.amazonaws.com/prod/bronze/salesforce/main.parquet'. My code runs in a Docker container on EC2, and the only solution is to restart the container. I suspected the HTTPFS connections are not closing, but could it be credentials related? I access S3 via a Gateway Endpoint using a duckdb secret with provider credential_chain.
-
Hi all,
I've been trying to run a query that fetches many .json files from an S3 bucket, around 8 million files.
This outputs the following error:
duckdb.duckdb.HTTPException: HTTP Error: HTTP GET error on 'https://bucket/prefix/filename.json' (HTTP 500)
I'm using 16 threads.
As far as I could see, it broke after 30 min.
Is there any internal timeout inside duckdb that may be causing this?
Can anyone help me troubleshoot this?
Regards, c.
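On the internal-timeout question: the httpfs extension does expose timeout and retry settings that can be raised for long scans. A sketch of tuning them (setting names are per the httpfs extension; the values and the helper are illustrative, and `con` is any duckdb connection with httpfs loaded):

```python
def tune_httpfs_retries(con) -> None:
    """Raise DuckDB's HTTPFS timeout/retry settings for long S3 scans."""
    for setting in (
        "SET http_timeout = 120000",      # request timeout in milliseconds
        "SET http_retries = 6",           # retry failed requests more times
        "SET http_retry_wait_ms = 1000",  # initial wait between retries
        "SET http_retry_backoff = 2",     # exponential backoff factor
    ):
        con.execute(setting)
```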