
limit parallelism of ParquetSource to reduce memory footprint #382

Open
FloWi opened this issue May 29, 2018 · 0 comments
FloWi commented May 29, 2018

Hi!

I set up a ParquetSource with a JDBCSink and ran into memory issues.
The Parquet files are stored in an S3 bucket and were written by Spark (snappy-compressed, ~500 MByte).

Spark writes one file per partition (default = 200). This causes eel to use a lot of memory, because many subscriptions are submitted to the executor at once, even though I set up the stream like this: `source.toDataStream().to(sink, parallelism = 1)`.
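
For context, the pipeline is wired up roughly like this. This is only a sketch: the S3 path, JDBC URL and table name are placeholders, and the exact constructor signatures may differ slightly from what eel exposes.

```scala
import io.eels.component.jdbc.JdbcSink
import io.eels.component.parquet.ParquetSource
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch of the setup described above; path, URL and table name are made up.
implicit val conf: Configuration = new Configuration()
implicit val fs: FileSystem = FileSystem.get(conf)

val source = ParquetSource(new Path("s3a://my-bucket/my-table/"))
val sink   = JdbcSink("jdbc:postgresql://dbhost:5432/mydb", "my_table")

// parallelism = 1 on the sink side, but each of the ~200 part files still
// ends up with its own subscription on the executor.
source.toDataStream().to(sink, parallelism = 1)
```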

In the code I found that the executor is initialized like this: `val executor = Executors.newCachedThreadPool()`, which creates an unbounded thread pool.
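
To illustrate the difference, here is a small sketch using plain `java.util.concurrent`. The bounded variant and its sizing are just an illustration of what I'd hope for, not the actual eel API.

```scala
import java.util.concurrent.Executors

// Current behaviour (as far as I can tell): an unbounded cached pool,
// so every part file can get its own thread and in-flight buffers at once.
val executor = Executors.newCachedThreadPool()

// What I'd hope for instead: a bounded pool sized from the requested
// parallelism, so only that many part files are read concurrently.
// (Name and wiring here are hypothetical, not the actual eel code.)
val requestedParallelism = 4
val boundedExecutor = Executors.newFixedThreadPool(requestedParallelism)
```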

I did some experiments: I repartitioned the Spark DataFrame to 1 and stored it again. Here's the comparison (see screenshots below):

memory usage with the 200-file Parquet source: > 1.2 GByte (*)
memory usage with the 1-file Parquet source: a constant 83 MByte

(*) I ran this on my local machine over a normal DSL internet connection. On the server it went OOM pretty quickly, meaning it used more than 2 GByte (my -Xmx setting for the app).

(Screenshots: memory usage for the 200-file run vs. the 1-file run.)

Can you think of a way to limit the amount of parallelism? Happy to provide a merge request if you point me in the right direction.
