
clean up SRDD.scala and SciSparkContext.scala #67

Open
kwhitehall opened this issue Jun 28, 2016 · 4 comments
@kwhitehall
Member

Currently, some methods in SciSparkContext return an RDD of SciTensors, while others return an SRDD of SciTensors. We should discuss the best approach here. @rahulpalamuttam @chrismattmann @BrianWilson1

@rahulpalamuttam
Member

Brian and I discussed this briefly today.

SciSparkContext is a wrapper around SparkContext: it uses composition and delegates some of the loading operations to SparkContext.

An example of this is NetcdfDFSFile, which uses SparkContext's binaryFiles function to read NetCDF files off HDFS.
SparkContext returns RDDs, not SRDDs, and since SRDD extends RDD, there is no way to convert an RDD into an SRDD unless we rely on composition again and wrap the resulting RDD inside an SRDD object.
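
For illustration, here is a minimal sketch of that flow, assuming a hypothetical SciDataset wrapper whose constructor takes the parsed NetcdfFile (the real SciSpark types may differ). Note that the result is a plain RDD; it only becomes an SRDD if wrapped afterwards:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import ucar.nc2.NetcdfFile

object NetcdfLoad {
  // Stub standing in for SciSpark's dataset type; the real constructor may differ.
  class SciDataset(val netcdf: NetcdfFile) extends Serializable

  // Sketch of the NetcdfDFSFile flow: binaryFiles reads the raw bytes off HDFS,
  // NetcdfFile.openInMemory parses them in memory, and the result stays a plain
  // RDD unless it is wrapped in an SRDD afterwards.
  def netcdfDFSFile(sc: SparkContext, path: String): RDD[SciDataset] =
    sc.binaryFiles(path).map { case (name, stream) =>
      new SciDataset(NetcdfFile.openInMemory(name, stream.toArray()))
    }
}
```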

Other functions either pull data from OpenDAP, read from the local file system, or generate a random dataset. For these functions we use the SRDD constructor and generate SRDDs.

  1. The issue of inheritance vs. composition for SciSparkContext.
    Inheriting SparkContext's methods is challenging because SparkContext has some private methods and variables, specifically the ContextCleaner, which is used in every map/reduce/filter/flatMap-type function. I remember that when I tried inheriting from SparkContext and overriding functions like map, we could not access the ContextCleaner, and this threw some interesting errors at runtime.

A potential solution to both issues, with the least invasive changes to other parts of the code, is to rewrite binaryFiles to return an SRDD rather than an RDD.
Following this approach, it may not be necessary to delegate to SparkContext, or even use it, in SciSparkContext. I think SciSparkContext should follow the patterns found in SparkContext but shouldn't rely on SparkContext at all. This raises the question of compatibility with Zeppelin. Also, rewriting binaryFiles to return an SRDD is going to be an involved task.

Another solution is to make SRDD a wrapper around an RDD and drop the inheritance. This is a quick fix and will affect some other parts of the code.
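
A rough sketch of what that wrapper could look like; the method set and names are illustrative, not SciSpark's actual code:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Composition-based quick fix: SRDD holds an RDD and delegates to it
// instead of extending RDD, sidestepping the ContextCleaner problem.
class SRDD[T](val rdd: RDD[T]) extends Serializable {
  def map[U: ClassTag](f: T => U): SRDD[U] = new SRDD(rdd.map(f))
  def filter(p: T => Boolean): SRDD[T] = new SRDD(rdd.filter(p))
  def collect(): Array[T] = rdd.collect()
}
```

The trade-off is that every RDD method we care about has to be re-exposed by hand, which is why this approach touches other parts of the code.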

@rahulpalamuttam
Member

rahulpalamuttam commented Aug 29, 2016

@kwhitehall @BrianWilson1
I think by now, SRDD is really an RDD of SciDatasets.
Special functions on RDDs of SciDatasets can be called implicitly through SRDDFunctions.scala, e.g. splitBySpace, writeToHDFS, etc. (see the sketch below).
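
For reference, a minimal sketch of that implicit-enrichment pattern, analogous to Spark's own PairRDDFunctions; SciDataset is stubbed here and the method shown is hypothetical, not the actual splitBySpace/writeToHDFS signatures:

```scala
import org.apache.spark.rdd.RDD

// Stub standing in for SciSpark's real SciDataset type.
case class SciDataset(name: String)

object SRDDImplicits {
  // Implicit class that enriches any RDD[SciDataset] with extra methods,
  // without changing the RDD type itself.
  implicit class SRDDFunctions(self: RDD[SciDataset]) {
    // Hypothetical example method, for illustration only.
    def datasetNames(): Array[String] = self.map(_.name).collect()
  }
}
```

With `import SRDDImplicits._` in scope, datasetNames() becomes callable on any RDD[SciDataset], which is how SRDDFunctions.scala makes its operations available.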

How SRDD should really be used is just to abstract away loading datasets from different sources.
We should have OpenDapRDD, NetcdfRDD, MergRDD, HDF5RDD, etc.
These RDDs are internal to SciSpark; when they're returned from SciSparkContext methods,
they will be seen as an RDD of SciDatasets.

SciSparkContext should just return an RDD of SciDatasets. Even functions that currently return an SRDD should return it as an RDD (this is easy to do, since SRDD is a subclass of RDD).

@kwhitehall
Member Author

@rahulpalamuttam @BrianWilson1 Can I close this issue?

@rahulpalamuttam
Member

rahulpalamuttam commented Sep 9, 2016

@kwhitehall Yes.
I think from our ongoing discussions, this issue has been addressed.

SciSparkContext still needs some refactoring.

  1. Instead of having one SRDD class, we should have separate RDD classes, i.e. OpenDapRDD, NetcdfRDD, MergRDD, HDF5RDD.
  2. We need to look at BinaryFilesRDD and HadoopRDD.
    I want to file a new issue for the separate RDDs.
  3. Each function in SciSparkContext will output either a) RDD[SciDataset] or b) RDD[SciTensor].
    We don't need an explicit SRDD class. For branding purposes we can just call an RDD[SciDataset] or RDD[SciTensor] an SRDD (see the sketch below).
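
A sketch of the branding idea in point 3, with SRDD as a type alias rather than a concrete class; the package name and the SciDataset stub are illustrative:

```scala
package object scispark {
  import org.apache.spark.rdd.RDD

  // Stub standing in for SciSpark's real SciDataset type.
  case class SciDataset(name: String)

  // An SRDD is now just another name for an RDD of SciDatasets, so every
  // SciSparkContext method can return a plain RDD and still be "an SRDD".
  type SRDD = RDD[SciDataset]
}
```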

I'll file a new issue for the code changes, once this one is closed.
