
[SCISPARK 106] Write SciDataset to local filesystem as a NetCDF file #112

Merged
merged 25 commits into SciSpark:master from rahulpalamuttam:writeSciDatasetToNetcdf on Aug 16, 2016

Conversation

rahulpalamuttam (Member) commented:

This addresses issue #106

The key function here is SciDataset.write(name, path), which writes the contents
of the SciDataset to a NetCDF file.

The name parameter is optional. If no name is specified,
the current datasetName is used. Note that the function does not
append ".nc" by default, so the extension must be included in the name.

The path parameter is also optional.
It specifies the directory the file will be written to.
By default the file is written to the current directory.
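
A minimal usage sketch (the dataset value and the file and directory names below are made up for illustration):

// Assuming `dataset` is a SciDataset already loaded from a NetCDF file.
// Write to /some/output/dir/output.nc; note ".nc" must be given explicitly.
dataset.write("output.nc", "/some/output/dir/")
// With no arguments, the current datasetName is used as the file name
// and the file is written to the current directory.
dataset.write()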

Other changes:

SciDataset.scala
SciDataset has a datasetName member variable that indicates the name of the file it was loaded from.
SciDataset also has a globalDimensions() function, which returns a list of strings indicating each dimension and its length,
e.g. List("row(400)", "col(1440)").
The toString function in SciDataset also includes the global dimensions. A quick sketch of these additions in use follows below.
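
// Hypothetical session, for a dataset with 400 rows and 1440 columns:
println(dataset.globalDimensions())  // List("row(400)", "col(1440)")
println(dataset)                     // toString output now includes these dimensions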

Variable.scala
Variable now has a LinkedHashMap member that records each dimension name and its length as a key-value pair.
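
Roughly the shape of that map (the value name below is an assumption, not the actual field name in the diff):

import scala.collection.mutable.LinkedHashMap

// Maps each dimension name to its length, preserving insertion order.
val dims: LinkedHashMap[String, Int] = LinkedHashMap("row" -> 400, "col" -> 1440)
dims("row")  // 400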

NetcdfUtils.scala
The conversion function convertMa2Arrayto1dJavaArray() checks the data type stored
in the ma2.Array with a pattern match and converts the array to a Double array.
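
A hedged sketch of that conversion (the exact cases handled in the PR may differ; the ucar.ma2 calls are from the standard NetCDF-Java API):

import ucar.ma2

// Convert a ucar.ma2.Array of any supported primitive element type to Array[Double].
def convertMa2Arrayto1dJavaArray(ma2Array: ma2.Array): Array[Double] =
  ma2Array.getElementType.toString match {
    case "double" => ma2Array.copyTo1DJavaArray().asInstanceOf[Array[Double]]
    case "float"  => ma2Array.copyTo1DJavaArray().asInstanceOf[Array[Float]].map(_.toDouble)
    case "int"    => ma2Array.copyTo1DJavaArray().asInstanceOf[Array[Int]].map(_.toDouble)
    case other    => throw new IllegalArgumentException(s"Unsupported element type: $other")
  }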

Future work :
Being able to write the SciDataset to HDFS rather than just the local file system.

rahulpalamuttam (Member, Author) commented on Aug 14, 2016:

Added extra functionality to write an entire RDD of SciDatasets to HDFS.
This is a quick fix: each SciDataset is written locally to the tmp directory,
and the files are then copied from the tmp directory to the specified HDFS directory
(they can also be copied to local file system directories).

We need to find a better way to write to HDFS without first writing locally and then transferring the files.

To do this I created a class called SRDDFunctions, which provides functions you can call on an RDD itself rather than within a map task.

How To Use:

import org.dia.core.SRDDFunctions._
....

val sRDD: RDD[SciDataset] = sc.NetcdfDFSFile("hdfs://hostname:9000/path/to/files/")
sRDD.writeSRDD("hdfs://hostname:9000/new/path/to/files/different/")

The import implicitly adds these functions on top of the RDD.
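
For context, a rough sketch of the implicit-enrichment pattern such a class can use; the actual SRDDFunctions body in this PR (local tmp writes plus the copy to HDFS) is more involved, and the signatures below are assumptions:

import org.apache.spark.rdd.RDD
import org.dia.core.SciDataset

class SRDDFunctions(self: RDD[SciDataset]) extends Serializable {
  // Writes every SciDataset in the RDD under `path` (sketch only; the real
  // implementation writes to the tmp directory and then copies the files out).
  def writeSRDD(path: String): Unit =
    self.foreach(sciD => sciD.write(sciD.datasetName, path))
}

object SRDDFunctions {
  // Brought into scope by `import org.dia.core.SRDDFunctions._`,
  // making writeSRDD callable directly on an RDD[SciDataset].
  implicit def fromRDD(rdd: RDD[SciDataset]): SRDDFunctions = new SRDDFunctions(rdd)
}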

This also addresses issue #50

rahulpalamuttam (Member, Author) commented:

rebased

coveralls commented:

Coverage Status: Changes unknown when pulling 5f9db19 on rahulpalamuttam:writeSciDatasetToNetcdf into SciSpark:master.

kwhitehall (Member) commented:

Thanks @rahulpalamuttam. Tested. This PR is still a little dirty in that the rebased commits could have been squashed into one. Nonetheless, in the interest of moving on, I'm merging as is.

kwhitehall merged commit 1131c63 into SciSpark:master on Aug 16, 2016