
Is it possible to stream an upload directly to AWS S3? #61

Open
rkbsoftsolutions opened this issue Mar 10, 2020 · 4 comments
@rkbsoftsolutions

I have a very large amount of NoSQL data. I want to read the data as a stream, pass just the schema and the stream, and upload it to S3 as a parquet file. Because of the data volume I can't store it locally, so I don't want to buffer the file in memory or on disk. Please advise.

@mazki555 commented Feb 26, 2021

Just for info, if someone reads this issue:

You can't do this with parquet, and it's not specific to this library. It comes from the structure of the format, which requires the writer to buffer data as it builds the parquet file (at least a full row group in memory, with the footer metadata written only at the end). It also depends on the parquet writer; I don't know of any parquet writer that can build the file in small streaming chunks.

Parquet compresses data as you write to the library, but you need extra memory for reading your records and writing them to parquet.

Parquet's compression ratio is very high, so if you write a 512 MB parquet file, it means you are going to process a lot of data!

In theory, you only need enough memory for the parquet file size plus your buffered records, but any parquet writer has some memory overhead of its own, so you need to take that into account.

In general, writing parquet files needs a lot of memory, but the benefits of parquet are substantial (read performance, reduced network traffic, small file sizes).

If you are really tight on memory or budget, you should consider moving to CSV+gzip, Avro, or JSON Lines, which you can stream in chunks with a very low memory footprint (see the sketch below).
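
For illustration, a minimal sketch of that low-memory path, assuming AWS SDK v3 (`@aws-sdk/client-s3` + `@aws-sdk/lib-storage`) and a hypothetical `rows` async iterable over your records; the gzipped JSON Lines stream is piped straight into a multipart upload, so memory is bounded by the parts currently in flight:

```ts
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { createGzip } from "node:zlib";
import { Readable } from "node:stream";

// Serialize each record as one JSON line (rows is a hypothetical
// async iterable over your NoSQL records).
async function* toJsonLines(rows: AsyncIterable<unknown>) {
  for await (const row of rows) yield JSON.stringify(row) + "\n";
}

async function uploadJsonLines(rows: AsyncIterable<unknown>) {
  // Gzip the line stream and hand it to a multipart upload; nothing is
  // buffered beyond the upload's part queue.
  const body = Readable.from(toJsonLines(rows)).pipe(createGzip());
  const upload = new Upload({
    client: new S3Client({}),
    params: { Bucket: "my-bucket", Key: "export.jsonl.gz", Body: body },
    partSize: 5 * 1024 * 1024, // 5 MiB parts (the S3 minimum)
    queueSize: 4,              // at most ~20 MiB held in memory at once
  });
  await upload.done();
}
```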

@SimonJang

I don't think it's possible to stream to S3 unless you know the exact file size when calling the S3 API, which is something you don't really know when mutating the source data in your stream.

@mazki555

@SimonJang are you sure?
It says here that there is no problem uploading streams to S3; you don't need to know the file size ahead of time.
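
That matches how S3 multipart uploads work: no Content-Length is needed up front. A minimal sketch, assuming AWS SDK v3's `Upload` helper and an ironSource/parquetjs-style writer with `ParquetWriter.openStream` (adjust to this library's actual API). Note this streams the *upload*; the writer still buffers each row group before flushing:

```ts
import { S3Client } from "@aws-sdk/client-s3";
import { Upload } from "@aws-sdk/lib-storage";
import { PassThrough } from "node:stream";
import parquet from "parquetjs"; // assumed ironSource/parquetjs-style API

async function uploadParquet(rows: AsyncIterable<Record<string, unknown>>) {
  const schema = new parquet.ParquetSchema({
    name: { type: "UTF8" },
    quantity: { type: "INT64" },
  });

  // The PassThrough connects the parquet writer to the multipart upload,
  // so parts are sent to S3 while rows are still being appended.
  const body = new PassThrough();
  const upload = new Upload({
    client: new S3Client({}),
    params: { Bucket: "my-bucket", Key: "export.parquet", Body: body },
  });
  const done = upload.done(); // start consuming first, to avoid a backpressure deadlock

  const writer = await parquet.ParquetWriter.openStream(schema, body);
  for await (const row of rows) {
    await writer.appendRow(row); // buffered until a row group is full, then flushed
  }
  await writer.close(); // writes the footer metadata and ends the output stream

  await done;
}
```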

@julien-c commented Aug 6, 2021

What about (streaming) reading over HTTP? Is this supported?
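
The thread doesn't answer this, but the usual mechanism for streaming reads is HTTP Range requests, which S3 supports. A library-agnostic sketch of locating a remote parquet file's footer that way (the `readFooterBytes` helper and URL handling are hypothetical; Node 18+ for global `fetch`):

```ts
// A parquet file ends with the footer metadata, then a 4-byte
// little-endian footer length, then the magic bytes "PAR1".
async function readFooterBytes(url: string): Promise<ArrayBuffer> {
  // Fetch only the last 8 bytes: footer length + magic.
  const tail = await fetch(url, { headers: { Range: "bytes=-8" } });
  const footerLen = new DataView(await tail.arrayBuffer()).getUint32(0, true);

  // Fetch just the footer metadata, not the whole object.
  const footer = await fetch(url, {
    headers: { Range: `bytes=-${footerLen + 8}` },
  });
  return footer.arrayBuffer();
}
```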
