RDD APIs available? #101

Closed
E4M9i opened this issue May 9, 2019 · 11 comments

@E4M9i

E4M9i commented May 9, 2019

Is there any way to work with RDDs and Parallelize using the NuGet package?
It seems they're all "internal" and not accessible, or I'm missing something.
Also, I can't see anything more advanced in the samples, like what Mobius was providing.

@rapoth
Contributor

rapoth commented May 10, 2019

@codefever555: Thank you for starting the discussion. We have very preliminary support for RDDs, but all the APIs are currently internal, as you correctly pointed out (we have no plans to open them up unless there is strong demand for production use cases).

Before we started this project, we had several rounds of discussion with PMC members of the Apache Spark community. Their advice was that DataFrames are the future of Apache Spark, which is the main reason we started with DataFrame support (and also because several query optimizations are possible when users write their logic using DataFrames). The reason we started with RDDs in Mobius was that DataFrame support was not that great at the time (context: I was one of the tech leads for Mobius). Several teams internal to Microsoft deployed RDD-backed code into production and found the performance lacking (not surprising, considering that Spark has very little logic to optimize RDDs themselves).

We'd be very interested in hearing your thoughts. Could you describe some use cases for which you want to use an RDD as opposed to a DataFrame?
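For context, here is a minimal sketch of what the DataFrame-first approach looks like from C# with .NET for Apache Spark; the input file, column names, and aggregation are placeholders, not anything from this thread:

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class DataFrameSketch
{
    static void Main(string[] args)
    {
        // Start (or reuse) a SparkSession.
        SparkSession spark = SparkSession.Builder()
            .AppName("dataframe-sketch")
            .GetOrCreate();

        // Read a CSV file into a DataFrame; "transactions.csv" and the
        // "status"/"amount" columns are hypothetical.
        DataFrame df = spark.Read()
            .Option("header", "true")
            .Option("inferSchema", "true")
            .Csv("transactions.csv");

        // Declarative transformations let Spark's query optimizer do the
        // work that equivalent RDD code would have to hand-tune.
        df.Filter(Col("status").EqualTo("pending"))
          .GroupBy("status")
          .Agg(Sum("amount").Alias("total"))
          .Show();

        spark.Stop();
    }
}
```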

@E4M9i
Author

E4M9i commented May 10, 2019

@rapoth: Thank you for the response.
The main reason I'd like to use Spark is to create a pipeline for processing big batches of data in parallel.
The calculation is not heavy at all; splitting batches automatically, with fault tolerance, is my main goal.
Alternatively, maybe it's better to have a streamer at the front to stream the batch files and put Spark on top of that, but I'm still not sure the parallel/micro-batching will happen.
Since the longest portion of my processing is updating transactions by calling third-party APIs, it would be very handy to process micro-batches in parallel without implementing in-house microservices with similar logic to process them.

@rapoth
Contributor

rapoth commented May 10, 2019

Have you considered putting your data into a system such as Kafka and pulling it through Spark Structured Streaming? We support this scenario, and you can parallelize your operations this way.
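For illustration, a minimal sketch of that Kafka + Structured Streaming path with .NET for Apache Spark; the broker address, topic name, and console sink are assumptions, not part of the original suggestion. Note that the Kafka source also needs the spark-sql-kafka connector package supplied at spark-submit time, which is likely the extra deployment step mentioned further down the thread.

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;

class KafkaStreamSketch
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession.Builder()
            .AppName("kafka-stream-sketch")
            .GetOrCreate();

        // Subscribe to a Kafka topic; "localhost:9092" and "transactions"
        // are placeholder values.
        DataFrame stream = spark.ReadStream()
            .Format("kafka")
            .Option("kafka.bootstrap.servers", "localhost:9092")
            .Option("subscribe", "transactions")
            .Load();

        // Kafka records arrive as binary key/value; cast the value to a
        // string before doing any per-record processing.
        DataFrame values = stream.SelectExpr("CAST(value AS STRING) AS value");

        // Write each micro-batch to the console sink just to show the flow;
        // in practice this is where per-record work (e.g. calling a
        // third-party API via a UDF) and the real sink would go.
        StreamingQuery query = values.WriteStream()
            .OutputMode("append")
            .Format("console")
            .Start();

        query.AwaitTermination();
    }
}
```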

@E4M9i
Author

E4M9i commented May 11, 2019

Thanks @rapoth, my initial plan was a similar concept, and it's great that you support this type of parallel processing. There will be some microservices to initiate/load data into the streams, and on top of the stream we can use Spark's advantages to process the records rapidly. The goal here is to avoid building generic functionality of our own to process micro-batches in parallel.
In your experience, how much difference does Spark make for processing batches of fewer than 10K transactions?

@petmoy

petmoy commented Jun 13, 2019

Hi @rapoth,

I see that Kafka is supported by .NET for Apache Spark, but I get an error that it requires a different deployment... Do you know where I can find clear instructions on how to run the Kafka sample code? Thanks

@imback82
Contributor

@petmoy can you please file a separate issue and describe what is not clear about the instructions, etc.?

@brianok-cc

I would second this request to support Parallelize and other native functions to create RDDs from data. I would also like to use this to pull data from external bulk APIs and store it as objects which I can then batch, transform, and write as Parquet.

I have experience with Scala on Spark, but I am working with a .NET team and was hoping not to force them to change languages while also learning the Spark platform.

Alternatively, I could stream the data into Parquet and ingest it as Parquet files, but all of the libraries I have tried for writing Parquet from .NET are sub-standard and/or overly complex, and since this project is a pass-through to Spark on Scala, I am fairly certain it would be my best bet for actually creating those Parquet files. To be honest, without support for such functionality I would probably suggest using Scala or Python natively in Spark rather than adopting this platform, despite the learning curve.

@rapoth
Contributor

rapoth commented Jun 25, 2019

@brianok-cc: Thank you for your detailed feedback! We appreciate it!

We are currently investigating introducing spark.createDataFrame. This would still allow you to pull data from any source, batch it up, and create a DataFrame (after which you can continue on to write it as Parquet, etc.). Will this address your scenario?
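A rough sketch of how a rows-plus-schema overload could look from C#, followed by letting Spark write the Parquet instead of a .NET Parquet library. The field names, row values, and output path are made up, and the final API shape is whatever the investigation (see #161, referenced below) settles on:

```csharp
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

class CreateDataFrameSketch
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession.Builder()
            .AppName("create-dataframe-sketch")
            .GetOrCreate();

        // Rows collected from an external bulk API; the values and field
        // names here are placeholders.
        var rows = new List<GenericRow>
        {
            new GenericRow(new object[] { "txn-001", 42 }),
            new GenericRow(new object[] { "txn-002", 7 })
        };

        // Explicit schema describing the rows above.
        var schema = new StructType(new[]
        {
            new StructField("id", new StringType()),
            new StructField("quantity", new IntegerType())
        });

        // Build a DataFrame from local rows plus a schema, then have Spark
        // write the Parquet output.
        DataFrame df = spark.CreateDataFrame(rows, schema);
        df.Write().Mode("overwrite").Parquet("output/transactions.parquet");

        spark.Stop();
    }
}
```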

@brianok-cc

@rapoth I assume you mean the flavor where I can specify a collection of Row objects and a schema...in which case, yes, I believe that would be sufficient. Thank you for the reply and additional information.

@rapoth
Contributor

rapoth commented Jun 27, 2019

Great! I'll update you on our investigation soon. Thank you for your patience!

@imback82
Contributor

Closing this. We will follow up with #161.

@imback82 imback82 pinned this issue Aug 13, 2019
@imback82 imback82 changed the title from "RDD and Parallelize availability in Nuget Package" to "RDD APIs availalbe?" Aug 13, 2019
@rapoth rapoth changed the title from "RDD APIs availalbe?" to "RDD APIs available?" Apr 13, 2020