RDD APIs available? #101
Comments
@codefever555: Thank you for starting the discussion. We have very preliminary support for RDDs, but all the APIs are currently internal, as you correctly pointed out (we have no plans to open them up unless there is strong demand from production use cases). Before we started this project, we had several rounds of discussion with PMC members of the Apache Spark community. Their advice was that DataFrames are the future of Apache Spark, which is the main reason we started with DataFrame support (also because several query optimizations become possible when users write their logic using DataFrames). We started with RDDs in Mobius because the support for DataFrames was not that great (context: I was one of the tech leads for Mobius). Several teams internal to Microsoft deployed RDD-backed code into production and found that the performance was not that great (this is not surprising, considering that Spark has very little logic to optimize RDDs themselves). We'd be very much interested in hearing your thoughts. Could you describe some use cases for which you want to use an RDD as opposed to a DataFrame? |
@rapoth: Thank you for the response, |
Have you considered putting your data into a system such as Kafka and pulling the data through Spark Structured Streaming? We support this scenario, and you can parallelize your operations this way. |
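For readers unfamiliar with the Kafka route suggested above, here is a minimal sketch of reading a topic through Structured Streaming. It is written against the PySpark API rather than the .NET bindings discussed in this thread, and the function names, broker address, and topic are hypothetical. It also assumes the Spark Kafka connector is available to the cluster, since that connector is deployed separately from Spark itself.

```python
def kafka_source_options(bootstrap_servers, topic, starting_offsets="latest"):
    """Build the option map for Spark's Kafka source (helper name is hypothetical)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,  # comma-separated host:port list
        "subscribe": topic,                            # topic to read from
        "startingOffsets": starting_offsets,           # "latest" or "earliest"
    }


def read_kafka_stream(spark, bootstrap_servers, topic):
    """Return a streaming DataFrame with Kafka keys and values decoded as strings.

    `spark` is assumed to be an existing SparkSession with the Kafka
    connector on its classpath; this function does not create one.
    """
    options = kafka_source_options(bootstrap_servers, topic)
    stream = spark.readStream.format("kafka").options(**options).load()
    # Kafka delivers key and value as binary columns; cast them for processing.
    return stream.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )
```

Spark distributes the topic's partitions across executors, which is where the parallelism mentioned above comes from.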
Thanks @rapoth, my initial plan was a similar concept, and it's great that you support this type of parallel processing. There will be some microservices to initiate/load data into streams, and on top of the streams we can use Spark's advantages to process them rapidly. The goal here is to avoid building our own generic functionality for processing micro-batches in parallel. |
Hi @rapoth, I see that Kafka is supported by Spark .NET, but I get an error saying that it requires a different deployment... Do you know where I can find clear instructions on how to run the Kafka sample code? Thanks |
@petmoy can you please file a separate issue and describe what is not clear about instructions, etc.? |
I would second this request to support Parallelize and other native functions to create RDDs from data. I also would like to use this to pull data from external bulk APIs, store as objects which I can then batch, transform, and write as Parquet. I have experience with Scala on Spark, but I am working with a .NET team and was hoping to not force them to change languages while also learning the Spark platform. Alternatively I could stream the data into Parquet and ingest it as Parquet files, but all of the libraries to write Parquet from .NET I have tried I find sub-standard and/or overly complex, and since this is a pass-through to Spark on Scala I am fairly certain it would be my best bet to actually create those Parquet files. To be honest, without supporting such functionality I would probably suggest using Scala or Python natively in Spark rather than trying to implement this platform despite the learning curve. |
@brianok-cc: Thank you for your detailed feedback! We appreciate it! We are currently investigating introducing spark.createDataFrame. This will allow you to still pull data from any source, batch it up, and create a DataFrame (after which you can continue to write it as Parquet, etc.). Will this address your scenario? |
@rapoth I assume you mean the flavor where I can specify a collection of Row objects and a schema...in which case, yes, I believe that would be sufficient. Thank you for the reply and additional information. |
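The flow described in the exchange above (pull records, batch them, build a DataFrame from rows plus a schema, write Parquet) can be sketched as follows. This uses PySpark's `createDataFrame`, not the .NET API that was still under investigation in this thread. The column names, schema string, and helper names are hypothetical and chosen only for illustration.

```python
from itertools import islice

# Hypothetical record layout, for illustration only.
COLUMNS = ("id", "name")
SCHEMA_DDL = "id INT, name STRING"


def batched(records, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def to_rows(batch):
    """Convert dict records into tuples ordered to match COLUMNS."""
    return [tuple(record[column] for column in COLUMNS) for record in batch]


def write_batches(spark, records, path, size=1000):
    """Create one DataFrame per batch and append it to a Parquet dataset.

    `spark` is assumed to be an existing SparkSession.
    """
    for batch in batched(records, size):
        frame = spark.createDataFrame(to_rows(batch), schema=SCHEMA_DDL)
        frame.write.mode("append").parquet(path)
```

Batching keeps only one chunk of the source data in driver memory at a time, which matters when the records come from a bulk API.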
Great! I'll update you on our investigation soon. Thank you for your patience! |
Closing this. We will follow up with #161. |
Is there any way to work with RDDs and the Parallelize context using the NuGet package?
It seems they're all "Internal" and not accessible, or I'm missing something.
I also can't find anything more advanced in the samples, like what Mobius provided.