
PhytoOracle | Modular, Scalable Phenomic Data Processing Pipeline

PhytoOracle Automation (POA) is a general-use, distributed computing pipeline for phenomic data. POA can be run on local or HPC resources and is capable of processing large phenomic datasets, such as those collected by the Field Scanner at the University of Arizona's Maricopa Agricultural Center (pictured below; photo: Jesse Rieser for The Wall Street Journal).

POA's distributed framework, built on CCTools' Makeflow and Work Queue, lets users harness hundreds to thousands of computing cores to process large datasets in parallel. The pipeline is driven by a YAML file that specifies the processing steps run by the pipeline wrapper script (distributed_pipeline_wrapper.py).
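For background, Work Queue applications scale by attaching workers to a running manager. POA's wrapper script launches the workflows themselves, so the following is only an illustrative sketch of the standard CCTools work_queue_factory command; the project name poa_scan is a hypothetical placeholder:

# Illustrative only: start between 1 and 10 SLURM-backed Work Queue
# workers for a manager advertising the hypothetical name "poa_scan".
work_queue_factory -T slurm -M poa_scan -w 1 -W 10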

Comprehensive instructions for gantry field operations, from field preparation to phenotype information extraction, can be found here.

Required Dependencies

YAML File

For more information on YAML file key/value pairs, click here.
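As a hedged illustration only (the linked reference is authoritative, and every key below is a hypothetical placeholder), a POA YAML file names the HPC account and partition used for job submission and the processing steps the wrapper runs:

# Hypothetical layout -- consult the linked key/value reference for
# the real schema.
account: your_hpc_account      # SLURM account used for submission
partition: standard            # SLURM partition used for submission
modules:                       # processing steps run by the wrapper
  - name: example_step
    command: ./example_step.sh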

Arguments/Flags

For more information on arguments/flags, click here.
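For quick reference, the flags that appear throughout this README are -d (the scan date to process), -y (the workflow YAML file), and -hpc (added when running on HPC clusters, as in the examples below):

./distributed_pipeline_wrapper.py -hpc -d 2020-02-14 -y yaml_files/example_machinelearning_workflow.yaml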

Required setup

iRODS

The POA workflow requires iRODS. Follow the documentation here to install iRODS.

If you are running POA on the UA HPC, iRODS is already installed, so there is no need to reinstall it. Skip to the section "Linux & Windows Subsystem for Linux 2 (WSL2) users", bullet #3.
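After installation, you will typically need to authenticate the iRODS client before POA can transfer data. A minimal sketch using the standard icommands (your zone and credentials will differ):

# Authenticate against your iRODS zone (prompts for connection details
# on first run), then verify access by listing your home collection.
iinit
ils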

Data transfer node

If you are running POA on the UA HPC, you will need to set up SSH keys to gain access to data transfer nodes (DTNs). To set up SSH keys, follow the steps here.
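In outline, key-based access is usually established as below; the hostname is a placeholder, and the linked steps are authoritative:

# Generate a key pair if you do not already have one, then install the
# public key on the DTN. "dtn.example.edu" stands in for the hostname
# given in the linked steps.
ssh-keygen -t ed25519
ssh-copy-id your_netid@dtn.example.edu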

Running POA

The script distributed_pipeline_wrapper.py is used to run the pipeline. This script downloads and extracts bundled test data, runs containers, and bundles output data.

Local computer

On your computer/server, run the following command:

./distributed_pipeline_wrapper.py -d 2020-02-14 -y yaml_files/example_machinelearning_workflow.yaml

HPC cluster

There are three options when running POA on HPC clusters: interactive, non-interactive, and Cron.

Interactive

The pipeline can use a data transfer node to download data, which speeds up processing.

Interactive jobs should be run inside tmux to maintain a persistent connection. To install tmux on the UA HPC head node, follow the directions here.
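For example, start a persistent session named poa (the name is arbitrary), detach with Ctrl-b d, and reattach after reconnecting:

tmux new -s poa
tmux attach -t poa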

You must first launch an interactive node using the following command on UA HPC Puma:

./shell_scripts/interactive_node.sh

Once the resources are allocated, run the following command to process data:

./distributed_pipeline_wrapper.py -hpc -d 2020-02-14 -y yaml_files/example_machinelearning_workflow.yaml

Data will be downloaded and workflows will be launched. You can view progress information for a specific workflow using the mf_monitor.sh script. For example, to view progress information for the first workflow, run:

./shell_scripts/mf_monitor.sh 1

Non-interactive

To submit a date for processing as a non-interactive job, run:

sbatch shell_scripts/slurm_submission.sh <yaml_file>

For example:

sbatch shell_scripts/slurm_submission.sh yaml_files/example_machinelearning_workflow.yaml

Make sure to change the account and partition values as needed in the YAML file. For modules requiring a larger number of cores (e.g., MegaStitch in the stereoTop and flirIrCamera workflows, and ps2Top), use slurm_submission_large.sh instead, as shown below.
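Assuming slurm_submission_large.sh accepts the same YAML argument as slurm_submission.sh (an assumption based on the pattern above), a large-core submission would look like:

sbatch shell_scripts/slurm_submission_large.sh yaml_files/example_machinelearning_workflow.yaml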

Cron

To schedule Cron jobs, follow the directions here.
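For orientation, a crontab entry that submits a workflow nightly might look like the sketch below; the schedule, repository path, and YAML file are all placeholders, and the linked directions are authoritative:

# Hypothetical: at 02:00 daily, submit the workflow from the
# repository checkout (replace /path/to/automation with your clone).
0 2 * * * cd /path/to/automation && sbatch shell_scripts/slurm_submission.sh yaml_files/example_machinelearning_workflow.yaml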