Read_json from S3 hitting HTTP GET error #10996
Replies: 5 comments 7 replies
-
Quick question, ChashLaker: could the credentials you are using have expired? I have a few processes that stumble when the credentials expire (mitigated by processing files in chunks). Hope that helps.
-
ChashLaker, that is essentially what happens: the token expires hours into the sequence. One way to address this is to divide the workload so the overall runtime of each batch stays around an hour. That has been an effective technique, and it has helped with other minor glitches as well.
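As a sketch of that batching strategy (pure Python; the names are illustrative, not from the thread), splitting the file list into fixed-size chunks keeps each run short enough that the credentials stay valid:

```python
def chunk_keys(keys: list[str], batch_size: int) -> list[list[str]]:
    """Split a list of S3 keys into fixed-size batches so that each batch
    finishes well within the credential/token lifetime."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]
```

Each batch can then be handed to its own read_json call, or even to a separate EC2 instance.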
-
Sorry to say I haven't found a way to maintain a connection for longer than a few hours straight; I have to refresh it. I believe the token changes and the connection is lost in the process. One way to mitigate this is to break the workload down into chunks, which has the worthy side benefit of letting the work run across multiple EC2 instances. Regarding credentials: I'm currently on DuckDB 0.9.2, as we have a few hurdles with 0.10.0 to overcome first. Between each chunk I'm calling:
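The snippet itself did not survive in the thread. Given the description that follows (it "reloads the credentials" and requires aws.duckdb_extension), it is presumably the aws extension's load_aws_credentials call; a sketch, with a helper name of my own choosing:

```python
def refresh_s3_credentials(con) -> None:
    """Re-resolve AWS credentials from the ~/.aws credential chain.

    `con` is a duckdb connection with the aws extension loaded
    (INSTALL aws; LOAD aws; -- shipped as aws.duckdb_extension).
    """
    con.execute("CALL load_aws_credentials();")
```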
This effectively reloads the credentials. Note this requires aws.duckdb_extension, which has some hurdles of its own to overcome. I also like this approach because the configuration file in .aws lets me establish a number of different configurations beyond what the role we are using for the instance already provides. The secret-based method is worthy too; I've been looking forward to it, as it opens up additional resources like other cloud object storage providers. Hope this helped, and happy coding.
-
For future reference, maybe it's related to this.
-
I am also battling a similar issue. Using the duckdb Python API, I issue "select max(LastmodifiedDate::DATE) from read_parquet('s3://my-bucket/main.parquet')" queries for CDC. This runs approximately 1000 times against various files in S3 (1000 total, not 1000 each). It consistently fails towards the end of the day with: duckdb.duckdb.IOException: IO Error: Connection error for HTTP HEAD to 'https://my-bucket.s3.amazonaws.com/prod/bronze/salesforce/main.parquet'. My code runs in a Docker container on EC2, and the only solution is to restart the container. I suspected the HTTPFS connections are not closing, but could it be credentials related? I access S3 via a Gateway Endpoint using a duckdb secret with provider credential_chain.
-
Hi all,
I've been trying to run a query that fetches many .json files from an S3 bucket, around 8 million files.
This outputs the following error:
duckdb.duckdb.HTTPException: HTTP Error: HTTP GET error on 'https://bucket/prefix/filename.json' (HTTP 500)
I'm using 16 threads.
As far as I could see, it broke after 30 min.
Is there any internal timeout inside duckdb that may be causing this?
Can anyone help me troubleshoot this?
Regards, c.
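On the internal-timeout question: the httpfs extension does expose timeout and retry settings that can be raised for long scans. A sketch of tuning them (setting names are per the httpfs extension; the values and the helper are illustrative, and `con` is any duckdb connection with httpfs loaded):

```python
def tune_httpfs_retries(con) -> None:
    """Raise DuckDB's HTTPFS timeout/retry settings for long S3 scans."""
    for setting in (
        "SET http_timeout = 120000",      # request timeout in milliseconds
        "SET http_retries = 6",           # retry failed requests more times
        "SET http_retry_wait_ms = 1000",  # initial wait between retries
        "SET http_retry_backoff = 2",     # exponential backoff factor
    ):
        con.execute(setting)
```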