Jean-Francois Rajotte edited this page Dec 17, 2016 · 2 revisions

# Getting Started

We provide the following functionality for users working with scientific datasets such as NetCDF on Apache Spark. First, create a file containing a list of paths to NetCDF files. We have provided a file containing 6000+ NetCDF URLs to use.

The following example connects to a Spark instance running on localhost:7077. It reads the NetCDF files and loads the "data" variable array into tensor objects. The array is masked for values under 241.0, and then reshaped by averaging blocks of dimension 20 x 20.

Finally, the arrays are summed together in a reduce operation.

val sc = new SciSparkContext("spark://localhost:7077", "SciSpark Program") 
val scientificRDD = sc.NetcdfFile("TRMM_L3_Links.txt", List("data")) 
val filteredRDD = scientificRDD.map(p => p("data") <= 241.0) 
val reshapedRDD = filteredRDD.map(p => p.reduceResolution(20))
val sumAllRDD = reshapedRDD.reduce(_ + _) 
println(sumAllRDD)
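
To make the mask, block-average, and sum steps concrete, here is a minimal plain-Scala sketch on small 2D arrays. It does not use SciSpark at all: `BlockOpsSketch`, `maskAtMost`, `blockAverage`, and `add` are hypothetical names introduced only for illustration. It assumes that the mask keeps values at or below the threshold and replaces the rest with 0.0, and that the reshaping averages non-overlapping square blocks; the actual SciSpark tensor operations may differ in detail.

```scala
object BlockOpsSketch {
  type Grid = Array[Array[Double]]

  // Assumed mask semantics: keep values <= threshold, replace the rest with 0.0.
  def maskAtMost(grid: Grid, threshold: Double): Grid =
    grid.map(_.map(v => if (v <= threshold) v else 0.0))

  // Average non-overlapping block x block tiles.
  // Assumes a rectangular grid whose dimensions are multiples of `block`.
  def blockAverage(grid: Grid, block: Int): Grid = {
    require(block > 0, "block must be positive")
    require(grid.nonEmpty && grid.forall(_.length == grid.head.length), "grid must be rectangular")
    require(grid.length % block == 0 && grid.head.length % block == 0,
      "grid dimensions must be multiples of block")
    val rows = grid.length / block
    val cols = grid.head.length / block
    Array.tabulate(rows, cols) { (r, c) =>
      val cells = for {
        i <- r * block until (r + 1) * block
        j <- c * block until (c + 1) * block
      } yield grid(i)(j)
      cells.sum / cells.size
    }
  }

  // Element-wise sum of two grids of the same shape,
  // the role played by `_ + _` in the reduce step.
  def add(a: Grid, b: Grid): Grid =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x + y } }
}
```

With these helpers, the pipeline for a list of grids would read roughly `grids.map(g => BlockOpsSketch.blockAverage(BlockOpsSketch.maskAtMost(g, 241.0), 20)).reduce(BlockOpsSketch.add)`, which mirrors the map, map, reduce shape of the SciSpark example.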

Test within a Zeppelin notebook with local files

This example shows how to test on local files found in the directory src/test/resources/Netcdf/.

Step 1: Create a file list

Create a text file listing the paths to the NetCDF files, one per line:

/path/to/SciSpark/src/test/resources/Netcdf/nc_3B42_daily.2008.01.02.7.bin.nc
/path/to/SciSpark/src/test/resources/Netcdf/nc_3B42_daily.2008.01.04.7.bin.nc

(Replace /path/to with the path to your SciSpark installation.)
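
If the directory holds many files, the list can also be generated rather than typed by hand. The following is a minimal sketch using only the Java standard library; `FileListSketch`, `netcdfPaths`, and `writeList` are hypothetical names, not part of SciSpark. It assumes the files of interest sit directly in one directory and end in `.nc`.

```scala
import java.io.File
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object FileListSketch {
  // Absolute paths of all *.nc files directly inside `dir`, sorted lexicographically.
  // Returns an empty sequence if `dir` does not exist or is not a directory.
  def netcdfPaths(dir: String): Seq[String] = {
    val entries = Option(new File(dir).listFiles).getOrElse(Array.empty[File])
    entries
      .filter(f => f.isFile && f.getName.endsWith(".nc"))
      .map(_.getAbsolutePath)
      .sorted
      .toSeq
  }

  // Write one path per line to `listFile`, overwriting any existing file.
  def writeList(dir: String, listFile: String): Unit = {
    val body = netcdfPaths(dir).map(_ + "\n").mkString
    Files.write(Paths.get(listFile), body.getBytes(StandardCharsets.UTF_8))
  }
}
```

For example, `FileListSketch.writeList("/path/to/SciSpark/src/test/resources/Netcdf", "/path/to/mylist.txt")` would write a list like the one above, with /path/to replaced as described.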

Step 2: Process the NetCDF files

Assume the list created above is saved as /path/to/mylist.txt.

In a Zeppelin notebook (that has successfully been connected to SciSpark as explained here), the following lines should work:

import org.dia.core.SciSparkContext
val ssc = new SciSparkContext(sc)
val scientificRDD = ssc.netcdfFileList("/path/to/mylist.txt", List("data"))
val filteredRDD = scientificRDD.map(p => p("data") <= 241.0)
val reshapedRDD = filteredRDD.map(p => p.reduceResolution(20))
val sumAllRDD = reshapedRDD.reduce(_ + _)
println(sumAllRDD)

To see the actual data:

sumAllRDD.data