
Becoming a SciSpark contributor

Kim Whitehall edited this page Dec 7, 2016 · 2 revisions

# Overview

SciSpark is a scalable system that leverages Apache Spark for interactive climate model evaluation and for the rapid development of weather and climate metrics and analyses, addressing pain points in the current processes. So you have installed SciSpark by following the installation pages on this wiki, and you have it up and running without problems. Now you would like to contribute to the code: fixing bugs, adding new features, and updating documentation. Then this is the wiki page for you. In this document, we address how to contribute to the SciSpark open source project.

## 1. Mailing list and GitHub issues

If you wish to join this project, please contact us at scispark-team@jpl.nasa.gov. The mailing list can be used for multiple purposes, including to:

  • make initial contact about joining the project
  • ask questions about the configuration and operation of SciSpark
  • report errors and/or bugs encountered
  • request features

Please avoid general or abstract emails. Where possible, include the following: the Spark, SBT, Maven, and Scala versions being used; the operating system; the component in which the error is occurring; and/or a detailed description of the bug or suggested feature, with a use case.

Please be sure to check the existing issues and Pull Requests on the repo to determine whether your error, bug, or suggestion has come up before. If it has, please add the contents of your email to that GitHub issue conversation instead.
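As a sketch only (the layout and field names below are illustrative, not a required template), a report following the checklist above might look like:

```
Subject: [BUG] loaders: NullPointerException when reading a malformed netCDF file

Spark version:  x.y.z
SBT version:    x.y.z
Scala version:  x.y.z
OS:             ...
Component:      loaders

Description: steps to reproduce, expected vs. actual behavior,
and any relevant stack trace or log output.
```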

## 2. SciSpark Community

The SciSpark community has three core user roles. The first role is the user. This is usually an atmospheric scientist or data scientist who is using the software, but is not necessarily actively involved in its code development. Such users aid SciSpark development by pushing the limits of the software, reporting bugs encountered, requesting features, and working with developers to eliminate logical bugs.

The second role is the developer. This is usually a computer scientist, software engineer, or infrastructure stack engineer who, after installing SciSpark and running the examples, is keen on helping program the underlying software. Such users aid by addressing bugs, reporting bugs, maintaining software quality, and developing new features from scratch. If you think you have a good idea for a feature, please contact the mailing list; see the Mailing list and GitHub issues section for more details. The SciSpark project uses the GitHub repository exclusively for development, so it is important to understand our GitHub workflow. Of course, in order to be an effective developer you will have to take the time to learn the code base; please see the Primer to SciSpark codebase section for guidance.

The third role is the committer. This is a SciSpark contributor - whether a developer or a user - who has been working on the project for some time and has a core understanding of the code and the project vision. SciSpark team members are committers on the project. In time, contributors who demonstrate the qualities of a committer will be invited to join the team.

## 3. Primer to SciSpark codebase

The core SciSpark code can be found in the src folder. The codebase is mostly Scala, located at src/main/scala/org/dia/. There are eight (8) components here:

  1. algorithms: where the code for SciSpark's original use cases is developed.

  2. apps: where examples of using SciSpark for scientific analysis can be found. The workflows for the two use cases are given here.

  3. core: where the core code of SciSpark - the sciSparkContext, sciDataset, sciTensor, Variable, etc. - is developed.

  4. loaders: where the code for loading data from various sources, e.g. netCDF files, is developed.

  5. partitioners: where the code for partitioning input data is developed.

  6. tensors: where the backend abstractTensor structure used in SciSpark is developed.

  7. urlgenerators: where the helper methods for partitioning data are developed.

  8. utils: where the suite of utility methods for various aspects of SciSpark, e.g. loading netCDF files, using OPeNDAP, and handling JSON files, is developed.

Unit tests supporting the methods in these components are mirrored in src/test/.

## 4. Contributing code

### 4.1. Contributing: The GitHub workflow

First time

  • Fork the repo at https://github.com/SciSpark/SciSpark
  • Go to your fork (e.g. https://github.com/<your username>/SciSpark) and copy its URL for cloning
  • Clone the repo locally: `git clone https://github.com/<your username>/SciSpark.git`. Now you have the code to work with.
  • Set up the upstream remote so you can keep your branch in sync: `git remote add upstream https://github.com/SciSpark/SciSpark.git`
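Under the assumption that local paths can stand in for the GitHub URLs, the first-time steps above can be sketched end to end as follows (in practice, substitute https://github.com/<your username>/SciSpark.git for the clone and https://github.com/SciSpark/SciSpark.git for upstream):

```shell
# Offline walk-through of the first-time setup: two local bare
# repositories stand in for the real GitHub remotes.
set -e
workdir=$(mktemp -d)
cd "$workdir"
git init -q --bare fork.git       # stand-in for your fork on GitHub
git init -q --bare upstream.git   # stand-in for SciSpark/SciSpark

# Clone your fork; the "origin" remote now points at it.
git clone -q "$workdir/fork.git" SciSpark
cd SciSpark

# Register the main repo as "upstream" so you can sync later.
git remote add upstream "$workdir/upstream.git"
git remote -v
```

After this, `git remote -v` lists both origin (your fork) and upstream (the main repo).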

Not your first time

  • Open an Issue on the main repo: On SciSpark/SciSpark, click the New Issue button and add the information corresponding to your fix/upgrade. Feel free to add an appropriate label, e.g. 'bug', 'enhancement', 'help wanted'.
  • Update your remote fork and local master with any upstream changes: `git fetch upstream` fetches the branches and their respective commits; `git checkout master` checks out your fork's master branch; `git merge upstream/master` merges the changes from the upstream master into your local master branch; `git push` reflects these changes on your forked repo.
  • Make a branch on your local machine: Make sure your local master is up to date (previous step), then create a topic branch named for the issue number, e.g. scispark-1: `git checkout -b <branch_name>`
  • Work on the changes in that branch. NB: you should keep a log of your work using a series of `git add` and `git commit` messages; you can always clean up your log before pushing the branch using `git rebase`. Then push your changes in the branch to your fork: `git push origin <branch_name>`.
  • Submit a Pull Request (PR) with SciSpark/SciSpark master as the base branch and your topic branch as the branch to be merged into it. Itemize the work completed in the PR, and ensure the first line indicates the issue number, e.g.:

```
scispark-1: Made contributors' wiki page
- added content on using mailing list
- added content on using github
- added content about the code structure
- added content on making a PR
```

All organization members will get an email from GitHub indicating your PR has been opened. Feel free to mention people on GitHub whom you think should especially take a look at the PR, using @git_username.

  • Acknowledge the review comments (automated or from other users) and, if need be, discuss further in the PR comments. Discussion points may include justifying an approach or asking for assistance, amongst others. Make the changes on your local branch. Once your changes have been made, update the existing PR by pushing to that branch with a command such as `git push origin LOCAL-BRANCHNAME:PR-BRANCHNAME`. Once the push has succeeded, you will see the PR update on the GitHub interface.
  • Wait for a SciSpark project committer (team member) to merge the code into the code base. Please allow a waiting time of at least 72 hours before pinging.
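The returning-contributor cycle above can be condensed into one runnable sketch. A local repository again stands in for the real GitHub remotes so the commands work offline, and the issue number scispark-1 and commit message are just the example from above:

```shell
# Offline sketch of the sync -> branch -> commit cycle.
set -e
workdir=$(mktemp -d)
cd "$workdir"

# Stand-in for SciSpark/SciSpark, with one commit on master.
git init -q upstream
( cd upstream \
  && git symbolic-ref HEAD refs/heads/master \
  && git -c user.name=dev -c user.email=dev@example.com \
         commit -q --allow-empty -m "initial" )

# Stand-in for your local clone of your fork.
git clone -q "$workdir/upstream" SciSpark
cd SciSpark
git remote add upstream "$workdir/upstream"

git fetch -q upstream              # bring down upstream's branches
git checkout -q master
git merge -q upstream/master       # sync your local master
git checkout -q -b scispark-1      # topic branch named after the issue

# ...edit files, then stage and commit with an issue-prefixed message:
git -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "scispark-1: made contributors' wiki page"
# Publish the branch on your fork before opening the PR:
#   git push origin scispark-1
git log --oneline -1
```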

### 4.2. Contributing: Code development

In order to maintain code quality, the SciSpark code base leverages Travis CI and Coveralls. The quality of the Scala code is checked using the SBT scalastyle plugin; when contributing code, please be certain to run it with the `sbt scalastyle` command.

These three tools support the code quality in the SciSpark code base, and are run automatically when a PR is made.
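Concretely, you can run the same checks locally before opening a PR (`sbt test` is the standard sbt test task; the scalastyle command comes from the plugin mentioned above):

```
sbt scalastyle   # Scala style check (SBT scalastyle plugin)
sbt test         # unit tests under src/test/
```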

### 4.3. Contributing: Testing

Given SciSpark's dependencies, the codebase ships with a Docker image to assist developers in testing their contributions against the same dependency stack used for automated testing on GitHub. Please see the SciSpark Docker Readme for more details.