Reasons to use EEL for Data Ingestion at REST over Spark #265

Open
hannesmiller opened this issue Mar 7, 2017 · 1 comment

@hannesmiller
Contributor

hannesmiller commented Mar 7, 2017

For documentation???

Spark is great at processing data in parallel when it already sits in a distributed store such as HDFS, but it is not really designed for ingesting data at rest from a non-distributed store such as a local file system, although it does support this through local mode.

The disadvantages of using Spark to ingest data at rest from a local file system:

  • There's no advantage in using YARN with a local file system because it's not a distributed store. You would need to distribute the data beforehand, which defeats the purpose.
  • Though it's tempting to use local mode for all the file formats that Spark supports out of the box, you still face the issue above. Moreover, in local mode all partitioned data is distributed across N threads in the same process as the client, which can become a memory bottleneck.
  • Memory is also a bottleneck if you decide to collect the data in Spark: collect pulls all the data into memory before you process it, and is not stream-based like a Java InputStream or a JDBC ResultSet iterator.
  • Spark does support JDBC datasets, but you still need to provide a partitioning strategy so that Spark can split your query into multiple select statements, one per partition. That can give more throughput, but if you are saving to HDFS you can end up with lots of small files, which is not good for Hadoop. With EEL you have more control because you can specify N ioThreads for your sources and sinks, i.e. use more threads to read in parallel on your Source and fewer on your Sink, resulting in sensible file sizes if you are writing to HDFS.
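
To make the last point concrete, here is a minimal sketch in plain Python of the general pattern: many reader threads, each handling one range of a partition column, feeding a smaller number of writer threads. It is not EEL's or Spark's API. The names partition_ranges, ingest, source_threads and sink_threads are hypothetical, the in-memory rows list stands in for a database table, and each output list stands in for one file on HDFS.

```python
import queue
import threading


def partition_ranges(lower, upper, num_partitions):
    """Split [lower, upper) into contiguous ranges, one per reader."""
    stride = (upper - lower) // num_partitions
    ranges = []
    for i in range(num_partitions):
        start = lower + i * stride
        # The last range absorbs any remainder so nothing is dropped.
        end = upper if i == num_partitions - 1 else start + stride
        ranges.append((start, end))
    return ranges


def ingest(rows, lower, upper, source_threads, sink_threads):
    """Read with source_threads readers, write with sink_threads writers."""
    # Bounded buffer so readers cannot run far ahead of the writers.
    buffer = queue.Queue(maxsize=100)
    # Each list stands in for one output file (assumption for illustration).
    outputs = [[] for _ in range(sink_threads)]

    def reader(start, end):
        # Stands in for "SELECT ... WHERE id >= start AND id < end".
        for row in rows:
            if start <= row["id"] < end:
                buffer.put(row)

    def writer(out):
        while True:
            row = buffer.get()
            if row is None:  # sentinel: no more rows for this writer
                break
            out.append(row)

    readers = [
        threading.Thread(target=reader, args=r)
        for r in partition_ranges(lower, upper, source_threads)
    ]
    writers = [threading.Thread(target=writer, args=(o,)) for o in outputs]
    for t in readers + writers:
        t.start()
    for t in readers:
        t.join()
    for _ in writers:
        buffer.put(None)  # one sentinel per writer
    for t in writers:
        t.join()
    return outputs


rows = [{"id": i} for i in range(1000)]
outputs = ingest(rows, 0, 1000, source_threads=8, sink_threads=2)
print(len(outputs), sum(len(o) for o in outputs))
```

With 8 readers and 2 writers the data is still read in parallel, but only 2 outputs are produced instead of one per partition. How rows are shared between the two outputs depends on thread scheduling, so only the totals are predictable.
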
@hannesmiller changed the title from "Reasons to use EEL for batch Ingestion over Spark" to "Reasons to use EEL for Data Ingestion at REST over Spark" on Mar 7, 2017
@sksamuel
Contributor

I think the collect issue is the same in EEL anyway: if you collect into memory, it doesn't matter whether the input is stream-based or not.
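
A rough illustration of that distinction in plain Python, using neither Spark nor EEL. The read_rows generator is a hypothetical stand-in for any stream-based source; the point is only that a streaming source saves memory when it is consumed row by row, and loses that benefit as soon as the rows are collected.

```python
def read_rows(n):
    """Hypothetical streaming source: yields one row at a time."""
    for i in range(n):
        yield {"id": i, "value": i * 2}


def streamed_total(n):
    """Consume the stream incrementally; only one row is held at a time."""
    total = 0
    for row in read_rows(n):
        total += row["value"]
    return total


def collected_total(n):
    """Collect first: every row is materialised in a list before processing."""
    rows = list(read_rows(n))
    return sum(r["value"] for r in rows), len(rows)


print(streamed_total(1000))
print(collected_total(1000))
```

Both functions compute the same total, but collected_total keeps all n rows in memory at once even though the source itself is stream-based.
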

Local mode with Spark does work, so that isn't really an issue, although it never feels quite proper. Are there any docs stating that they don't really want you to use it?

@garyfrost garyfrost added this to the 1.3 milestone Feb 1, 2018