
cptools2

Setting up CellProfiler jobs to run on eddie3.

cptools2 automatically creates the commands for eddie3 array jobs, along with csv files suitable for CellProfiler's LoadData module.


Installation

Make sure you're on a worker node, then load python >= 3.5 with module load python.

Go to the cptools2 location (/exports/igmm/eddie/Drug-Discovery/tools/cptools2)

When you're within the cptools2 directory you will see a file called setup.py.
Install with python setup.py install --user.

This creates an entry point, so you should be able to use the cptools2 command from the command line without worrying about python or finding where the cptools2 code is located.
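Putting those steps together, a minimal install session looks roughly like this (how you get onto a worker node in the first place is up to you):

# load python >= 3.5
module load python

# go to the shared cptools2 checkout and install it for your user
cd /exports/igmm/eddie/Drug-Discovery/tools/cptools2
python setup.py install --user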


CellProfiler pipeline

The CellProfiler pipelines need to be set up in a certain way to be used with cptools2.

LoadData

The pipeline should start with a LoadData module, which takes the image information in the form of a csv file which will be generated by cptools2.

It's easier to create the pipeline using the normal drag-and-drop interface in CellProfiler to load the images and extract metadata from the file paths, then switch it to the LoadData module at the end.

Channel names

Channel names in the CellProfiler pipeline need to be W followed by a number, i.e. W1, W2, etc. NB: capital 'W'.
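These names have to match the image columns in the LoadData csv files that cptools2 generates, which follow CellProfiler's usual Image_FileName_/Image_PathName_ column convention. A made-up two-channel example (the real csvs may carry extra metadata columns, and the file names and paths here are illustrative):

Image_FileName_W1,Image_PathName_W1,Image_FileName_W2,Image_PathName_W2
plate1_A01_s1_w1.tif,/path/to/staged/images,plate1_A01_s1_w2.tif,/path/to/staged/images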

Exporting data

The pipeline should end with an ExportToSpreadsheet module, with the output location set to the default. It's recommended to combine object-level data into a single spreadsheet containing all objects, i.e. a single csv file for both nuclei and cell bodies. This means each job produces two spreadsheets: one for object-level data (normally called DATA.csv) and Image.csv for image-level data.
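So for each job you should end up with something like the following in the output location (the per-job folder name here is only illustrative; the actual layout is decided by cptools2):

/path/to/output/location/
    job_0001/        # illustrative per-job folder
        DATA.csv     # combined object-level measurements
        Image.csv    # image-level measurements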


Creating a config file

cptools2 uses a config file which details:

  • The ImageXpress experiment to analyse
  • Whether certain plates should be included or excluded
  • How many image sets each job should analyse
  • The CellProfiler pipeline to use
  • Where to save the results
  • Where to save the submission commands

An example of a config file:

experiment: /path/to/ImageExpress/experiment
chunk: 96
pipeline: /path/to/cellprofiler/pipeline.cppipe
location: /path/to/output/location
commands location: /where/to/store/commands

More details on config file options


Generating commands from the config file

To create the commands and LoadData csv files, first make sure you're on a staging node with access to datastore.

cptools2 config.yaml

Where config.yaml is your configuration file.

This creates the staging, analysis, and destaging commands in the commands location, a LoadData csv file for each job in the location directory, the SGE submission scripts, and a final bash script that submits the submission scripts in the correct order.
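After it has run, the commands location (/where/to/store/commands in the example config above) should therefore contain something along these lines (only the files described on this page are listed; the name of the final submission script isn't shown here):

staging.txt            # one staging command per line
cp_commands.txt        # one CellProfiler command per line
destaging.txt          # one destaging command per line
staging_script.sh      # SGE array-job submission scripts (templates, edit as needed)
analysis_script.sh
destaging_script.sh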


Creating submission scripts

cptools2 automatically creates everything you need to submit the job to the cluster. However, you might need to alter these scripts for more memory, or to batch submit jobs if you run over the 10,000 task limit.

After running cptools2 on the config file, three files are saved in the commands location (staging.txt, cp_commands.txt and destaging.txt). These files contain one command per line, and will be run as three concurrent array jobs on the cluster.

cptools2 creates three default submission scripts (staging_script.sh, analysis_script.sh, destaging_script.sh), which are saved in the commands location directory along with the three files of commands. These are templates and may need to be altered (e.g. to increase the run-time limit for long-running jobs).

Making your own submission scripts

The jobs are dependent on one another, so the analysis task will only start running once the corresponding staging task has finished. This uses the -hold_jid_ad flag on SGE. It's therefore important to give your jobs names so they can run in the correct order.

The -t flag specifies which tasks to run. To run all the jobs in your command list, set this from 1 to the number of lines in the command lists (they should all have the same number of lines). In this example staging.txt, cp_commands.txt and destaging.txt each have 288 lines, with one line per command to run.
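A quick way to check the range (using the example command paths from the scripts below):

# all three files should report the same line count
wc -l ~/study/commands/staging.txt ~/study/commands/cp_commands.txt ~/study/commands/destaging.txt

# then set the task range in each submission script accordingly, e.g.
#$ -t 1-288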

Staging script

The staging jobs simply copy the images over from datastore to a cluster storage location. This needs to be run on a staging node.

Limiting staging jobs

⚠️ If you have lots of images then this can run into several terabytes of data, which may not fit into scratch space, or may cause problems for other users. If this is the case then split the jobs into smaller sub-jobs using the -t flag. You can always run these sub-jobs sequentially by using the -hold_jid flag with the previous sub-job's destaging job name.

Another option is to decrease the priority of the staging jobs. This can be altered with the -p flag; the lowest priority you can set is -1023.

You can also use the -tc flag on staging jobs to limit the number of concurrently running staging tasks, e.g. #$ -tc 5 means at most 5 staging tasks will run at a time.

#!/bin/bash

#$ -N stage_study
#$ -q staging
#$ -j y
#$ -l h_vmem=0.5G
#$ -l h_rt=02:00:00
#$ -o /exports/eddie/scratch/$USER/study/logs/staging
#$ -t 1-288

SEEDFILE=~/study/commands/staging.txt
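# pull out the command on line $SGE_TASK_ID of the command list, then run it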
SEED=$(awk "NR==$SGE_TASK_ID" $SEEDFILE)

$SEED
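If you do need to split the run into sub-jobs as described under "Limiting staging jobs", one approach (batch names and task ranges here are purely illustrative) is to give each batch its own task range and hold the second batch's staging on the first batch's destaging job:

# batch 1: the staging, analysis and destaging scripts all use this task
# range and a matching _batch1 suffix on their -N / -hold_jid names
#$ -N stage_study_batch1
#$ -t 1-144

# batch 2: only starts staging once batch 1 has finished destaging
#$ -N stage_study_batch2
#$ -hold_jid destage_study_batch1
#$ -t 145-288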

Analysis script

The analysis script calls each line of the cp_commands.txt file as a separate job. This runs cellprofiler on a batch of images and saves the csv output.

CellProfiler is not actually installed on the cluster; instead it runs inside a virtualenv. This means a virtualenv has to be set up for each user, and the source ... command in the analysis script has to point to the correct virtualenv for that user.

Depending on the size of the images and the analysis, you may have to adjust the memory requirements. In this example it's set quite high (two slots with 12GB each, i.e. 24GB of RAM in total). You can set this smaller and use a single slot, which means more of your jobs will run. Though if you set it too low, some jobs will fail with MemoryErrors, which will appear in the log, and those jobs won't produce a csv file in the output location.

The -l h_rt flag is the run-time limit of the job. This can be lowered once you know how long the jobs will take.

#!/bin/bash

#$ -N analyse_study
#$ -hold_jid_ad stage_study
#$ -pe sharedmem 2
#$ -l h_vmem=12G
#$ -l h_rt=48:00:00
#$ -j y
#$ -o /exports/eddie/scratch/$USER/study/logs/analysis
#$ -t 1-288

# allow modules to be loaded
. /etc/profile.d/modules.sh

module load igmm/apps/hdf5/1.8.16
module load igmm/apps/python/2.7.10
module load igmm/apps/jdk/1.8.0_66
module load igmm/libs/libpng/1.6.18

# activate the cellprofiler virtualenvironment
source /exports/igmm/eddie/Drug-Discovery/virtualenv-1.10/myVE/bin/activate

SEEDFILE=~/study/commands/cp_commands.txt
SEED=$(awk "NR==$SGE_TASK_ID" $SEEDFILE)

$SEED
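If you want to request less memory as suggested above, one option is to drop the parallel environment and ask for the total the job needs directly (the 16G figure is only an example):

# serial job: remove the '#$ -pe sharedmem 2' line and set h_vmem to the
# total memory the job needs
#$ -l h_vmem=16G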

Destaging script

Destaging removes the image data that was copied in from datastore.

#!/bin/bash

#$ -N destage_study
#$ -l h_vmem=0.5G
#$ -l h_rt=01:00:00
#$ -hold_jid analyse_study
#$ -j y
#$ -o /exports/eddie/scratch/$USER/study/logs/destaging
#$ -t 1-288

SEEDFILE=~/study/commands/destaging.txt
SEED=$(awk "NR==$SGE_TASK_ID" $SEEDFILE)

$SEED

Submitting jobs

As the jobs are dependent on one another they have to be submitted in the correct order. Using qsub, submit in the following order:

  1. staging
  2. analysis
  3. destaging
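Because of the -hold_jid/-hold_jid_ad dependencies, all three can be submitted one after the other and SGE will hold the later stages until the earlier ones have finished. Assuming the default script names in the commands location:

cd /where/to/store/commands
qsub staging_script.sh
qsub analysis_script.sh
qsub destaging_script.sh

Alternatively, run the final bash script that cptools2 writes to the commands location, which submits the three scripts in this order for you.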