
Is it possible to stream an upload directly to AWS S3? #61

Open
rkbsoftsolutions opened this issue Mar 10, 2020 · 4 comments
@rkbsoftsolutions

I have a very large amount of NoSQL data. I want to read the data as a stream, pass just the schema and the stream, and upload it to S3 as a parquet file. Because of the data volume I can't store it locally, so I don't want to buffer the file in memory or on disk. Please advise.

@mazki555 commented Feb 26, 2021

Just for info, if someone reads this issue:

You can't do this with parquet, and it's not specific to this library. It comes from the structure of the format, which requires the writer to buffer data as it builds the parquet file (at least a full row group in memory, with the footer metadata written only at the end). It also depends on the parquet writer; I don't know of any parquet writer that can build the file in small streaming chunks.

Parquet compresses data as you write to the library, but you need extra memory for reading your records and writing them to parquet.

Parquet's compression ratio is very high, so if you write a 512 MB parquet file, it means you are going to process a lot of data!

In theory, you only need enough memory for the parquet file size plus your buffered records, but any parquet writer has some memory overhead of its own, so you need to take that into account.

In general, writing parquet files needs a lot of memory, but the benefits of parquet are substantial (read performance, reduced network traffic, small file sizes).

If you are really tight on memory or budget, you should consider moving to CSV+gzip, Avro, or JSON Lines, which you can stream in chunks with a very low memory footprint (see the sketch below).
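
For illustration, a minimal sketch of that low-memory path, assuming AWS SDK v3 (`@aws-sdk/client-s3` + `@aws-sdk/lib-storage`) and a hypothetical `rows` async iterable over your records; the gzipped JSON Lines stream is piped straight into a multipart upload, so memory is bounded by the parts currently in flight:

```ts
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { createGzip } from "node:zlib";
import { Readable } from "node:stream";

// Serialize each record as one JSON line (rows is a hypothetical
// async iterable over your NoSQL records).
async function* toJsonLines(rows: AsyncIterable<unknown>) {
  for await (const row of rows) yield JSON.stringify(row) + "\n";
}

async function uploadJsonLines(rows: AsyncIterable<unknown>) {
  // Gzip the line stream and hand it to a multipart upload; nothing is
  // buffered beyond the upload's part queue.
  const body = Readable.from(toJsonLines(rows)).pipe(createGzip());
  const upload = new Upload({
    client: new S3Client({}),
    params: { Bucket: "my-bucket", Key: "export.jsonl.gz", Body: body },
    partSize: 5 * 1024 * 1024, // 5 MiB parts (the S3 minimum)
    queueSize: 4,              // at most ~20 MiB held in memory at once
  });
  await upload.done();
}
```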

@SimonJang

I don't think it's possible to stream to S3 unless you know the exact file size when calling the S3 API, which is something you don't really know when mutating the source data in your stream.

@mazki555

@SimonJang are you sure?
It says here that there is no problem uploading streams to S3; you don't need to know the file size ahead of time.
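
That matches how S3 multipart uploads work: no Content-Length is needed up front. A minimal sketch, assuming AWS SDK v3's `Upload` helper and an ironSource/parquetjs-style writer with `ParquetWriter.openStream` (adjust to this library's actual API). Note this streams the *upload*; the writer still buffers each row group before flushing:

```ts
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { PassThrough } from "node:stream";
import parquet from "parquetjs"; // assumed ironSource/parquetjs-style API

async function uploadParquet(rows: AsyncIterable<Record<string, unknown>>) {
  const schema = new parquet.ParquetSchema({
    name: { type: "UTF8" },
    quantity: { type: "INT64" },
  });

  // The PassThrough connects the parquet writer to the multipart upload,
  // so parts are sent to S3 while rows are still being appended.
  const body = new PassThrough();
  const upload = new Upload({
    client: new S3Client({}),
    params: { Bucket: "my-bucket", Key: "export.parquet", Body: body },
  });
  const done = upload.done(); // start consuming first, to avoid a backpressure deadlock

  const writer = await parquet.ParquetWriter.openStream(schema, body);
  for await (const row of rows) {
    await writer.appendRow(row); // buffered until a row group is full, then flushed
  }
  await writer.close(); // writes the footer metadata and ends the output stream

  await done;
}
```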

@julien-c commented Aug 6, 2021

What about (streaming) reading over HTTP? Is this supported?
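
The thread doesn't answer this, but the usual mechanism for streaming reads is HTTP Range requests, which S3 supports. A library-agnostic sketch of locating a remote parquet file's footer that way (the `readFooterBytes` helper and URL handling are hypothetical; Node 18+ for global `fetch`):

```ts
// A parquet file ends with the footer metadata, then a 4-byte
// little-endian footer length, then the magic bytes "PAR1".
async function readFooterBytes(url: string): Promise<ArrayBuffer> {
  // Fetch only the last 8 bytes: footer length + magic.
  const tail = await fetch(url, { headers: { Range: "bytes=-8" } });
  const footerLen = new DataView(await tail.arrayBuffer()).getUint32(0, true);

  // Fetch just the footer metadata, not the whole object.
  const footer = await fetch(url, {
    headers: { Range: `bytes=-${footerLen + 8}` },
  });
  return footer.arrayBuffer();
}
```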
