
clean up SRDD.scala and SciSparkContext.scala #67

Open
kwhitehall opened this issue Jun 28, 2016 · 4 comments
@kwhitehall
Member

Currently, some methods in SciSparkContext return an RDD of SciTensors, while others return an SRDD of SciTensors. We should discuss the best approach here. @rahulpalamuttam @chrismattmann @BrianWilson1

@rahulpalamuttam
Member

Brian and I discussed this briefly today.

SciSparkContext is a wrapper around SparkContext: it uses composition and delegates some of the loading operations to SparkContext.

An example of this is NetcdfDFSFile, which uses SparkContext's binaryFiles function to read NetCDF files off HDFS.
SparkContext returns RDDs, not SRDDs, and since SRDD extends RDD, there is no way to convert an RDD into an SRDD unless we rely on composition again and wrap the resulting RDD inside an SRDD object.
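
For illustration, here is a minimal sketch of that flow, assuming a hypothetical SciDataset wrapper whose constructor takes the parsed NetcdfFile (the real SciSpark types may differ). Note that the result is a plain RDD; it only becomes an SRDD if wrapped afterwards:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import ucar.nc2.NetcdfFile

object NetcdfLoad {
  // Stub standing in for SciSpark's dataset type; the real constructor may differ.
  class SciDataset(val netcdf: NetcdfFile) extends Serializable

  // Sketch of the NetcdfDFSFile flow: binaryFiles reads the raw bytes off HDFS,
  // NetcdfFile.openInMemory parses them in memory, and the result stays a plain
  // RDD unless it is wrapped in an SRDD afterwards.
  def netcdfDFSFile(sc: SparkContext, path: String): RDD[SciDataset] =
    sc.binaryFiles(path).map { case (name, stream) =>
      new SciDataset(NetcdfFile.openInMemory(name, stream.toArray()))
    }
}
```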

Other functions either pull data from OpenDAP, read from the local file system, or generate a random dataset. For these functions we use the SRDD constructor and generate SRDDs.

  1. The issue of inheritance vs. composition for SciSparkContext.
    Inheriting SparkContext's methods is challenging because SparkContext has some private methods and variables, specifically the ContextCleaner, which is used in every map/reduce/filter/flatMap-type function. I remember that when I tried inheriting from SparkContext and overriding functions like map, we could not access the ContextCleaner, and this threw some interesting errors at runtime.

A potential solution to both issues, with the least invasive changes to other parts of the code, is to rewrite binaryFiles to return an SRDD rather than an RDD.
Following this approach, it may not be necessary to delegate to SparkContext, or even use it, in SciSparkContext. I think SciSparkContext should follow the patterns found in SparkContext but shouldn't rely on SparkContext at all. This raises the question of compatibility with Zeppelin. Also, rewriting binaryFiles to return an SRDD is going to be an involved task.

Another solution is to make SRDD a wrapper around an RDD and drop the inheritance. This is a quick fix and will affect some other parts of the code.
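
A rough sketch of what that wrapper could look like; the method set and names are illustrative, not SciSpark's actual code:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Composition-based quick fix: SRDD holds an RDD and delegates to it
// instead of extending RDD, sidestepping the ContextCleaner problem.
class SRDD[T](val rdd: RDD[T]) extends Serializable {
  def map[U: ClassTag](f: T => U): SRDD[U] = new SRDD(rdd.map(f))
  def filter(p: T => Boolean): SRDD[T] = new SRDD(rdd.filter(p))
  def collect(): Array[T] = rdd.collect()
}
```

The trade-off is that every RDD method we care about has to be re-exposed by hand, which is why this approach touches other parts of the code.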

@rahulpalamuttam
Member

rahulpalamuttam commented Aug 29, 2016

@kwhitehall @BrianWilson1
I think by now, SRDD is really an RDD of SciDatasets.
Special functions on RDDs of SciDatasets can be called implicitly through SRDDFunctions.scala, e.g. splitBySpace, writeToHDFS, etc. (see the sketch below).
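
For reference, a minimal sketch of that implicit-enrichment pattern, analogous to Spark's own PairRDDFunctions; SciDataset is stubbed here and the method shown is hypothetical, not the actual splitBySpace/writeToHDFS signatures:

```scala
import org.apache.spark.rdd.RDD

// Stub standing in for SciSpark's real SciDataset type.
case class SciDataset(name: String)

object SRDDImplicits {
  // Implicit class that enriches any RDD[SciDataset] with extra methods,
  // without changing the RDD type itself.
  implicit class SRDDFunctions(self: RDD[SciDataset]) {
    // Hypothetical example method, for illustration only.
    def datasetNames(): Array[String] = self.map(_.name).collect()
  }
}
```

With `import SRDDImplicits._` in scope, datasetNames() becomes callable on any RDD[SciDataset], which is how SRDDFunctions.scala makes its operations available.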

How SRDD should really be used is just to abstract away loading datasets from different sources.
We should have OpenDapRDD, NetcdfRDD, MergRDD, HDF5RDD, etc.
These RDDs are internal to SciSpark; when they're returned from SciSparkContext methods,
they will be seen as an RDD of SciDatasets.

SciSparkContext should just return an RDD of SciDatasets. Even functions that currently return an SRDD should return it as an RDD (this is easy to do, since SRDD is a subclass of RDD).

@kwhitehall
Member Author

@rahulpalamuttam @BrianWilson1 Can I close this issue?

@rahulpalamuttam
Member

rahulpalamuttam commented Sep 9, 2016

@kwhitehall Yes.
I think from our ongoing discussions, this issue has been addressed.

SciSparkContext still needs some refactoring.

  1. Instead of having one SRDD class, we should have separate RDD classes, i.e. OpenDapRDD, NetcdfRDD, MergRDD, HDF5RDD.
  2. We need to look at BinaryFilesRDD and HadoopRDD.
    I want to file a new issue for the separate RDDs.
  3. Each function in SciSparkContext will output either a) RDD[SciDataset] or b) RDD[SciTensor].
    We don't need an explicit SRDD class. For branding purposes we can just call an RDD[SciDataset] or RDD[SciTensor] an SRDD (see the sketch below).
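
A sketch of the branding idea in point 3, with SRDD as a type alias rather than a concrete class; the package name and the SciDataset stub are illustrative:

```scala
package object scispark {
  import org.apache.spark.rdd.RDD

  // Stub standing in for SciSpark's real SciDataset type.
  case class SciDataset(name: String)

  // An SRDD is now just another name for an RDD of SciDatasets, so every
  // SciSparkContext method can return a plain RDD and still be "an SRDD".
  type SRDD = RDD[SciDataset]
}
```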

I'll file a new issue for the code changes, once this one is closed.
