This repository has been archived by the owner on Jan 31, 2020. It is now read-only.

Writing a New Command Class in the Genome Modeling System

mkiwala-g edited this page Jul 8, 2014 · 3 revisions

The Genome Modeling System (GMS) uses Command Classes as a common abstraction for performing basic manipulations on biological information stored on the file system. Any class in the GMS that extends the Command::V2 class is a Command Class. Command Classes may be designed to perform any task, but they usually do not read from or write to the GMS database.

Command Classes may either be invoked by an end-user from the command line, or they may be invoked programmatically from a pipeline running an automated workflow. When a developer creates a new Command Class to implement a new bioinformatics tool, that new tool is automatically accessible from the command line and usable from within GMS workflows. While this guide covers how to create new Command Classes, it does not cover how to invoke a Command Class from a pipeline. For more information about how pipelines call commands, please see the "Adding a Result to the ClinSeq Pipeline" guide.

To write a new Command Class, you must create a new Perl module file in the GMS source tree. In the new file, you must define a new class whose name corresponds to the name of the file and which inherits from the Command::V2 class. Finally, you must define a subroutine named "execute". Optionally, you may write an automated test when creating a new Command Class. Writing tests is a best practice when developing any software, so it is recommended that every new Command Class have a test as well.

These instructions assume you have already cloned the gms-core Git repo and are comfortable working with Git and Perl.

Creating a New Perl Module File

Command Classes are written as Perl module files under the lib/perl directory of the gms-core repository. Because these are Perl module files, their names are subject to the restrictions on Perl module files, and Perl conventions should be followed when creating new Command Classes. According to Perl convention, file and directory names should begin with a capital letter (never a lower-case letter, number, etc.). File and directory names may contain a mix of upper- and lower-case letters and numbers. Spaces and symbols other than letters and numbers should not be used in the names of directories and files. Whatever the file is named, its name must end in ".pm" to indicate that it is a Perl module.

Within the Genome Modeling System there are standard locations for Command Classes. Command Classes which are part of a specific pipeline are typically organized under lib/perl/Genome/*/Command/*.pm. Command Classes which do not belong to a specific pipeline are organized under lib/perl/Genome/Model/Tools/*/*.pm and are known as Genome Model Tools (GMTs).

The name of the file should be descriptive of what the Command Class does, and it should be convenient to type since the name of the file will also be used when an end-user uses the command line to invoke the Command Class. For example, a new GMT Command Class for reverse complementing a nucleotide sequence could be in a file named lib/perl/Genome/Model/Tools/ReverseComplement.pm.

Inside the Command Class module file, the first line is the package statement. The package statement declares which Perl package the code following it belongs to. Because of Perl conventions for locating the file containing the code for a given package name, the package name must correspond to the path and name of the Perl module file: a double colon ("::") is used in the package name wherever a slash ("/") appears in the corresponding file name. Perl best practices should also be followed, so the "strict" and "warnings" pragmas should be enabled. The last statement in the file must evaluate to true in Perl, and is idiomatically written as "1;".

package Genome::Model::Tools::ReverseComplement;
use strict;
use warnings;

# TODO: Define a new class for our Genome Model Tool
# TODO: Define the execute method for our GMT class

1;
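
The correspondence between a package name and its file path can be sketched in plain Perl. The helper name package_to_path is made up for this illustration and is not part of GMS; the path it returns is relative to a library directory such as lib/perl.

```perl
use strict;
use warnings;

# Convert a package name into the relative file path Perl looks for,
# e.g. "Foo::Bar" becomes "Foo/Bar.pm".
# (package_to_path is a hypothetical helper, not a GMS function.)
sub package_to_path {
    my ($package) = @_;
    (my $path = $package) =~ s{::}{/}g;   # each "::" becomes "/"
    return "$path.pm";                    # module files end in ".pm"
}

print package_to_path('Genome::Model::Tools::ReverseComplement'), "\n";
# prints "Genome/Model/Tools/ReverseComplement.pm"
```

Prefixing the result with lib/perl/ gives the location of the example module used in this guide.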

Defining a New Class in the Genome Modeling System

After a new Perl module file has been created for the new Command Class, the class itself should be defined within the Genome Modeling System. In order to do this, the Genome package must be loaded by adding a "use Genome;" statement before the class definition. In the Genome Modeling System, a class definition begins with the "class" keyword. Immediately following this keyword is the class name – which by convention should match the Perl package name. The body of the class definition has two parts – the "is" key/value which sets up inheritance and the "has" key/value which sets up properties. Command Classes should always inherit (either directly or indirectly) from the Command::V2 base class. The example below shows a class definition for the Genome::Model::Tools::ReverseComplement class.

use Genome;
class Genome::Model::Tools::ReverseComplement {
    is => 'Command::V2',
    has => [
        sequence => {
            is => "Text",
            doc => "a nucleotide sequence to reverse complement",
        },
    ],
};

The "has" key/value pair defines the list of properties for the Command Class. These properties also become the command line argument or flag names when an end user invokes the Command Class on the command line. The flags and arguments specified during invocation become the property values on the Command Class object while it is executing. The properties defined for a Command Class are completely up to the developer. Properties allow end users to control the behavior of the command at the time of invocation without modifying the code of the Command Class itself. In our reverse complement GMT example, we'll accept the nucleotide sequence that must be reverse complemented.

Each property of a Command Class has a type. The most commonly used types are File, Text, Integer, and Boolean. The properties and types you choose for your Command Class will depend upon what your command is designed to do.

Type      Description
Boolean   The property takes no argument. The presence of the flag on the command line causes this property to have a true value.
File      Like the Text type, but provides additional methods which are specific to file paths.
Integer   The property takes an integer as an argument. If a non-integer argument is given, an error message is thrown and the GMT exits.
Text      The property takes text as an argument.
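
As a rough analogy only, the behavior of the Text, Integer, and Boolean types resembles what Perl's core Getopt::Long module does with "=s", "=i", and bare options. The sketch below is not how Command::V2 parses its arguments; the function name parse_properties and the "min-length" and "verbose" options are invented for illustration.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse a hypothetical argument list into a hash of property values.
# Returns a hash reference on success, or undef if parsing failed.
sub parse_properties {
    my @args = @_;
    my %props;
    local $SIG{__WARN__} = sub { };   # silence Getopt::Long diagnostics in this demo
    my $ok = GetOptionsFromArray(
        \@args,
        'sequence=s'   => \$props{sequence},     # like Text: takes a string argument
        'min-length=i' => \$props{min_length},   # like Integer: argument must be an integer
        'verbose'      => \$props{verbose},      # like Boolean: presence alone means true
    );
    return $ok ? \%props : undef;
}

my $props = parse_properties('--sequence', 'acgt', '--verbose');
print "sequence=$props->{sequence} verbose=$props->{verbose}\n";
# prints "sequence=acgt verbose=1"

# A non-integer value for an integer option is rejected.
print "rejected\n" unless parse_properties('--min-length', 'abc');
# prints "rejected"
```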

Defining the "execute" Subroutine

The "execute" subroutine is the entry point subroutine for every Command Class, so every Command Class must have an "execute" subroutine defined. Anything that a Command Class is supposed to do should happen in the "execute" subroutine or in a subroutine called by execute.

When the Command Class executes, the class object is passed as the first argument to execute. The object's class is the Command Class itself, and the object has methods corresponding to each of the properties defined in the class definition and each of the subroutines defined in the Perl package. The property values are initialized based on the command line arguments that the Command Class has been invoked with.

sub execute {
    my $self = shift;
    print $self->reverse_complement( $self->sequence )."\n";
    return 1;
}


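sub reverse_complement {
    my $self = shift;
    my ($seq) = @_;
    # Complement each base, preserving case; other characters pass through unchanged.
    $seq =~ tr/acgtACGT/tgcaTGCA/;
    # Reverse the complemented string.
    return scalar reverse $seq;
}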
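
To see how the object and its property accessors reach "execute" without loading the Genome package, the following plain-Perl stand-in may help. It is only a sketch: the package name My::ReverseComplement, the hand-written constructor, and the hand-written "sequence" accessor are assumptions that imitate what the class definition provides in GMS.

```perl
use strict;
use warnings;

{
    package My::ReverseComplement;   # hypothetical stand-in, not a GMS class

    # Minimal constructor: store the property values on the object.
    sub new {
        my ($class, %props) = @_;
        return bless { %props }, $class;
    }

    # Hand-written accessor standing in for the "sequence" property.
    sub sequence { return $_[0]{sequence} }

    sub reverse_complement {
        my ($self, $seq) = @_;
        $seq =~ tr/acgtACGT/tgcaTGCA/;   # complement each base
        return scalar reverse $seq;      # then reverse the string
    }

    # Entry point: the object arrives as the first argument.
    sub execute {
        my $self = shift;
        print $self->reverse_complement($self->sequence), "\n";
        return 1;
    }
}

my $command = My::ReverseComplement->new(sequence => 'aacg');
$command->execute;   # prints "cgtt"
```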

Using Genome::Sys->shellcmd()

Often when writing a Command Class, it is necessary to execute another third party tool such as BLAST. Invoking a third party executable is not necessary to make a Command Class, but it is often desired, so using Genome::Sys->shellcmd() is described here.

Perl itself provides some basic functions, such as the "system" function, for running other programs. These functions leave the responsibility for error handling and logging to the developer. To standardize how third party executables are invoked within the Genome Modeling System, GMS provides Genome::Sys->shellcmd(). The simplest way to use shellcmd is to invoke it with only the "cmd" parameter. This parameter is the command you would like to execute. The command is executed by shellcmd using bash, so the command may be any valid bash command. The argument to cmd must follow bash rules, including quoting rules.

Genome::Sys->shellcmd(cmd => 'xmllint --nonet --format in.xml');
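
Because the command string is interpreted by bash, file names containing spaces or quotes need to be quoted before they are placed in it. One common approach is sketched below; shell_quote is a hypothetical helper written for this example, not something provided by Genome::Sys, and the file name is made up.

```perl
use strict;
use warnings;

# Wrap a string in single quotes for bash. An embedded single quote is
# handled by closing the quote, adding an escaped quote, and reopening it.
# (shell_quote is a hypothetical helper, not part of GMS.)
sub shell_quote {
    my ($arg) = @_;
    $arg =~ s/'/'\\''/g;
    return "'$arg'";
}

my $input = "my sample's data.xml";   # hypothetical file name
my $cmd   = 'xmllint --nonet --format ' . shell_quote($input);
print "$cmd\n";
# prints: xmllint --nonet --format 'my sample'\''s data.xml'
```

The resulting string could then be passed as the cmd parameter.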

Another reason why shellcmd is used to call third party tools in GMS is that shellcmd provides features to help validate whether the third party tool ran successfully. The most commonly used validation features are input and output file checking. Input file validation ensures that any declared input files exist before shellcmd invokes the command given by the cmd parameter.

Output file validation checks that any declared output files do not exist before the command is invoked, and that they do exist after the invoked command has exited. Depending on the setting of the skip_if_output_is_present shellcmd parameter, shellcmd may skip running the command entirely when one or more declared output files already exist; the default is to not run the command in that case. If a command fails to produce a declared output file, shellcmd throws an exception. Output file validation is especially useful for commands that do not properly report their exit status.

To validate input and output files, use the input_files and output_files parameters to shellcmd. The parameter arguments are lists, so even if there is only one input or output file, the file must be specified as a list using square brackets.

Genome::Sys->shellcmd(cmd => 'xmllint --nonet --format --output out.xml in.xml',
    input_files => ['in.xml'], output_files => ['out.xml']);

Writing a Test for a Command Class

Below is an example test for the ReverseComplement Genome Model Tool. In the Genome Modeling System the source file for the test belongs next to the Perl package file which it exercises.

lib/perl/Genome/Model/Tools/ReverseComplement.t

#!/usr/bin/env genome-perl

BEGIN {
    $ENV{UR_DBI_NO_COMMIT} = 1;
    $ENV{UR_USE_DUMMY_AUTOGENERATED_IDS} = 1;
}

use strict;
use warnings;

use above "Genome";
use Test::More;

my $package = "Genome::Model::Tools::ReverseComplement";
use_ok($package);

is($package->reverse_complement('atatataaatttttttt'), 'aaaaaaaatttatatat',
    'Reverse complement of atatataaatttttttt should be aaaaaaaatttatatat');

done_testing;

Complete Example of a New Command Class

lib/perl/Genome/Model/Tools/ReverseComplement.pm

package Genome::Model::Tools::ReverseComplement;
use strict;
use warnings;

use Genome;
class Genome::Model::Tools::ReverseComplement {
    is => 'Command::V2',
    has => [
        sequence => {
            is => "Text",
            doc => "a nucleotide sequence to reverse complement",
        },
    ],
};

sub execute {
    my $self = shift;
    print $self->reverse_complement( $self->sequence )."\n";
    return 1;
}

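sub reverse_complement {
    my $self = shift;
    my ($seq) = @_;
    # Complement each base, preserving case; other characters pass through unchanged.
    $seq =~ tr/acgtACGT/tgcaTGCA/;
    # Reverse the complemented string.
    return scalar reverse $seq;
}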

1;