RDD APIs available? #101

Closed
E4M9i opened this issue May 9, 2019 · 11 comments

@E4M9i

E4M9i commented May 9, 2019

Is there any way to work with RDDs and Parallelize using the NuGet package?
It seems they're all "internal" and not accessible, or I'm missing something.
Also, I can't see anything more advanced in the samples, like what Mobius was providing.

@rapoth
Contributor

rapoth commented May 10, 2019

@codefever555: Thank you for starting the discussion. We have very preliminary support for RDDs, but all the APIs are currently internal, as you correctly pointed out (we have no plans to open them up unless there is strong demand for production use cases).

Before we started this project, we had several rounds of discussion with PMC members of the Apache Spark community. Their advice was that DataFrames are the future of Apache Spark, which is the main reason we started with DataFrame support (and also because several query optimizations are possible when users write their logic using DataFrames). The reason we started with RDDs in Mobius was that DataFrame support was not that great at the time (context: I was one of the tech leads for Mobius). Several teams internal to Microsoft deployed RDD-backed code into production and found the performance lacking (not surprising, considering that Spark has very little logic to optimize RDDs themselves).

We'd be very interested in hearing your thoughts. Could you describe some use cases for which you want to use an RDD as opposed to a DataFrame?
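For context, here is a minimal sketch of what the DataFrame-first approach looks like from C# with .NET for Apache Spark; the input file, column names, and aggregation are placeholders, not anything from this thread:

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class DataFrameSketch
{
    static void Main(string[] args)
    {
        // Start (or reuse) a SparkSession.
        SparkSession spark = SparkSession.Builder()
            .AppName("dataframe-sketch")
            .GetOrCreate();

        // Read a CSV file into a DataFrame; "transactions.csv" and the
        // "status"/"amount" columns are hypothetical.
        DataFrame df = spark.Read()
            .Option("header", "true")
            .Option("inferSchema", "true")
            .Csv("transactions.csv");

        // Declarative transformations let Spark's query optimizer do the
        // work that equivalent RDD code would have to hand-tune.
        df.Filter(Col("status").EqualTo("pending"))
          .GroupBy("status")
          .Agg(Sum("amount").Alias("total"))
          .Show();

        spark.Stop();
    }
}
```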

@E4M9i
Author

E4M9i commented May 10, 2019

@rapoth: Thank you for the response.
The main reason I'd like to use Spark is to create a pipeline for processing big batches of data in parallel.
The calculation is not heavy at all; splitting batches automatically, with fault tolerance, is my main goal.
Alternatively, maybe it's better to have a streamer at the front to stream the batch files and put Spark on top of that, but I'm still not sure the parallel/micro-batching will happen.
Since the longest portion of my processing is updating transactions by calling third-party APIs, it would be very handy to process micro-batches in parallel without implementing in-house microservices with similar logic to process them.

@rapoth
Contributor

rapoth commented May 10, 2019

Have you considered putting your data into a system such as Kafka and pulling it through Spark Structured Streaming? We support this scenario, and you can parallelize your operations this way.
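For illustration, a minimal sketch of that Kafka + Structured Streaming path with .NET for Apache Spark; the broker address, topic name, and console sink are assumptions, not part of the original suggestion. Note that the Kafka source also needs the spark-sql-kafka connector package supplied at spark-submit time, which is likely the extra deployment step mentioned further down the thread.

```csharp
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Streaming;

class KafkaStreamSketch
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession.Builder()
            .AppName("kafka-stream-sketch")
            .GetOrCreate();

        // Subscribe to a Kafka topic; "localhost:9092" and "transactions"
        // are placeholder values.
        DataFrame stream = spark.ReadStream()
            .Format("kafka")
            .Option("kafka.bootstrap.servers", "localhost:9092")
            .Option("subscribe", "transactions")
            .Load();

        // Kafka records arrive as binary key/value; cast the value to a
        // string before doing any per-record processing.
        DataFrame values = stream.SelectExpr("CAST(value AS STRING) AS value");

        // Write each micro-batch to the console sink just to show the flow;
        // in practice this is where per-record work (e.g. calling a
        // third-party API via a UDF) and the real sink would go.
        StreamingQuery query = values.WriteStream()
            .OutputMode("append")
            .Format("console")
            .Start();

        query.AwaitTermination();
    }
}
```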

@E4M9i
Author

E4M9i commented May 11, 2019

Thanks @rapoth, my initial plan was a similar concept, and it's great that you support this type of parallel processing. There will be some microservices to initiate/load data into the streams, and on top of the stream we can use Spark's advantages to process the records rapidly. The goal here is to avoid building generic functionality of our own to process micro-batches in parallel.
In your experience, how much difference does Spark make for processing batches of fewer than 10K transactions?

@petmoy

petmoy commented Jun 13, 2019

Hi @rapoth,

I see that Kafka is supported by .NET for Apache Spark, but I get an error that it requires a different deployment... Do you know where I can find clear instructions on how to run the Kafka sample code? Thanks

@imback82
Contributor

@petmoy can you please file a separate issue and describe what is not clear about the instructions, etc.?

@brianok-cc

I would second this request to support Parallelize and other native functions to create RDDs from data. I would also like to use this to pull data from external bulk APIs and store it as objects which I can then batch, transform, and write as Parquet.

I have experience with Scala on Spark, but I am working with a .NET team and was hoping not to force them to change languages while also learning the Spark platform.

Alternatively, I could stream the data into Parquet and ingest it as Parquet files, but all of the libraries I have tried for writing Parquet from .NET are sub-standard and/or overly complex, and since this project is a pass-through to Spark on Scala, I am fairly certain it would be my best bet for actually creating those Parquet files. To be honest, without support for such functionality I would probably suggest using Scala or Python natively in Spark rather than adopting this platform, despite the learning curve.

@rapoth
Contributor

rapoth commented Jun 25, 2019

@brianok-cc: Thank you for your detailed feedback! We appreciate it!

We are currently investigating introducing spark.createDataFrame. This would still allow you to pull data from any source, batch it up, and create a DataFrame (after which you can continue on to write it as Parquet, etc.). Will this address your scenario?
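A rough sketch of how a rows-plus-schema overload could look from C#, followed by letting Spark write the Parquet instead of a .NET Parquet library. The field names, row values, and output path are made up, and the final API shape is whatever the investigation (see #161, referenced below) settles on:

```csharp
using System.Collections.Generic;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;

class CreateDataFrameSketch
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession.Builder()
            .AppName("create-dataframe-sketch")
            .GetOrCreate();

        // Rows collected from an external bulk API; the values and field
        // names here are placeholders.
        var rows = new List<GenericRow>
        {
            new GenericRow(new object[] { "txn-001", 42 }),
            new GenericRow(new object[] { "txn-002", 7 })
        };

        // Explicit schema describing the rows above.
        var schema = new StructType(new[]
        {
            new StructField("id", new StringType()),
            new StructField("quantity", new IntegerType())
        });

        // Build a DataFrame from local rows plus a schema, then have Spark
        // write the Parquet output.
        DataFrame df = spark.CreateDataFrame(rows, schema);
        df.Write().Mode("overwrite").Parquet("output/transactions.parquet");

        spark.Stop();
    }
}
```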

@brianok-cc

@rapoth I assume you mean the flavor where I can specify a collection of Row objects and a schema...in which case, yes, I believe that would be sufficient. Thank you for the reply and additional information.

@rapoth
Contributor

rapoth commented Jun 27, 2019

Great! I'll update you on our investigation soon. Thank you for your patience!

@imback82
Contributor

Closing this. We will follow up with #161.

@imback82 imback82 pinned this issue Aug 13, 2019
@imback82 imback82 changed the title from "RDD and Parallelize availability in Nuget Package" to "RDD APIs availalbe?" Aug 13, 2019
@rapoth rapoth changed the title from "RDD APIs availalbe?" to "RDD APIs available?" Apr 13, 2020